Embedded Systems - Part I (2011-2012)

Embedded Systems Design: A Unified
Hardware/Software Introduction
Chapter 1: Introduction
Outline
Embedded systems overview
What are they?
Design challenge: optimizing design metrics

Technologies
Processor technologies
IC technologies
Design technologies

Hardware/Software Introduction, (c) 2000 Vahid/Givargis

Computing systems are everywhere
Most of us think of desktop computers
PCs
Laptops
Mainframes
Servers
But there's another type of computing system

Far more common...


Embedded computing systems
Computing systems embedded within
electronic devices
Hard to define. Nearly any computing
system other than a desktop computer
Billions of units produced yearly, versus
millions of desktop units
Perhaps 50 per household and per
automobile

Computers are in here... and here... and even here...
Lots more of these, though they cost a lot less each.
A short list of embedded systems

Anti-lock brakes
Auto-focus cameras
Automatic teller machines
Automatic toll systems
Automatic transmission
Avionic systems
Battery chargers
Camcorders
Cell phones
Cell-phone base stations
Cordless phones
Cruise control
Curbside check-in systems
Digital cameras
Disk drives
Electronic card readers
Electronic instruments
Electronic toys/games
Factory control
Fax machines
Fingerprint identifiers
Home security systems
Life-support systems
Medical testing systems
Modems
MPEG decoders
Network cards
Network switches/routers
On-board navigation
Pagers
Photocopiers
Point-of-sale systems
Portable video games
Printers
Satellite phones
Scanners
Smart ovens/dishwashers
Speech recognizers
Stereo systems
Teleconferencing systems
Televisions
Temperature controllers
Theft tracking systems
TV set-top boxes
VCRs, DVD players
Video game consoles
Video phones
Washers and dryers
And the list goes on and on

Some common characteristics of embedded systems
Single-functioned
Executes a single program, repeatedly
Tightly-constrained
Low cost, low power, small, fast, etc.
Reactive and real-time

Continually reacts to changes in the system's environment
Must compute certain results in real-time without delay

An embedded system example -- a digital camera
[Figure: digital camera chip block diagram: lens, CCD, A2D, CCD preprocessor, pixel coprocessor, D2A, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, memory controller, display ctrl, ISA bus interface, UART, LCD ctrl]
Single-functioned -- always a digital camera

Tightly-constrained -- Low cost, low power, small, fast
Reactive and real-time -- only to a small extent


Obvious design goal:
Construct an implementation with desired functionality
Key design challenge:

Simultaneously optimize numerous design metrics
Design metric
A measurable feature of a system's implementation
Optimizing design metrics is a key challenge


Common metrics
Unit cost: the monetary cost of manufacturing each copy of the system,
excluding NRE cost
NRE cost (Non-Recurring Engineering cost): the one-time monetary cost of designing the system
Size: the physical space required by the system

Performance: the execution time or throughput of the system
Power: the amount of power consumed by the system
Flexibility: the ability to change the functionality of the system without
incurring heavy NRE cost


Common metrics (continued)
Time-to-prototype: the time needed to build a working version of the
system
Time-to-market: the time required to develop a system to the point that it can be released and sold to customers
Maintainability: the ability to modify the system after its initial release
Correctness, safety, many more

Design metric competition -- improving one may worsen others
Expertise with both software and hardware is needed to optimize design metrics
Not just a hardware or software expert, as is common
A designer must be comfortable with various technologies in order to choose the best for a given application and constraints
[Figure: competing design metrics (power, performance, size, NRE cost) shown next to the digital camera chip block diagram, which is split into hardware and software]
Time-to-market: a demanding design metric
Time required to develop a product to the point it can be sold to customers
Market window
Period during which the product would have highest sales
Average time-to-market constraint is about 8 months
Delays can be costly
[Figure: revenues ($) versus time (months)]
Losses due to delayed market entry

Simplified revenue model
Product life = 2W, peak at W
Time of market entry defines a triangle, representing market penetration
Triangle area equals revenue
Loss
The difference between the on-time and delayed triangle areas
[Figure: revenues ($) versus time, with market rise up to W and market fall from W to 2W; the on-time entry triangle reaches peak revenue, the delayed entry triangle starts at D and reaches a lower peak revenue]

Losses due to delayed market entry (cont.)

Area = 1/2 * base * height
On-time = 1/2 * 2W * W
Delayed = 1/2 * (W-D+W) * (W-D)
Percentage revenue loss = (D(3W-D) / 2W^2) * 100%
Try some examples
Lifetime 2W = 52 wks, delay D = 4 wks
(4 * (3*26 - 4) / (2 * 26^2)) * 100% = 22%
Lifetime 2W = 52 wks, delay D = 10 wks
(10 * (3*26 - 10) / (2 * 26^2)) * 100% = 50%
Delays are costly!
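
The revenue-loss arithmetic above can be checked with a few lines of C. This is a minimal sketch, not part of the original slides: the function name revenue_loss_percent is made up for illustration, and it assumes, as the slide formulas do, that both triangles rise with unit slope so that the peak revenue cancels out of the ratio.

```c
#include <assert.h>

/* Percentage revenue loss under the simplified triangular revenue model.
 * W is half the product lifetime (the market peaks at W) and D is the
 * delay of market entry, both in the same time unit (e.g., weeks).
 * Assumption: unit slope for market rise, as in the slide formulas. */
double revenue_loss_percent(double W, double D)
{
    assert(W > 0.0 && D >= 0.0 && D <= W);

    double on_time = 0.5 * (2.0 * W) * W;          /* 1/2 * base * height */
    double delayed = 0.5 * (W - D + W) * (W - D);  /* smaller triangle    */

    /* Algebraically equal to D(3W - D) / (2W^2) * 100. */
    return (on_time - delayed) / on_time * 100.0;
}
```

Calling revenue_loss_percent(26.0, 4.0) gives roughly 21.9 and revenue_loss_percent(26.0, 10.0) gives roughly 50.3, which the slide rounds to 22% and 50%.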
NRE and unit cost metrics

Costs:
Unit cost: the monetary cost of manufacturing each copy of the system,
excluding NRE cost
NRE cost (Non-Recurring Engineering cost): The one-time monetary cost of
designing the system
total cost = NRE cost + unit cost * # of units
per-product cost
= total cost / # of units
= (NRE cost / # of units) + unit cost
Example
NRE=$2000, unit=$100
For 10 units
total cost = $2000 + 10*$100 = $3000
per-product cost = $2000/10 + $100 = $300
Amortizing NRE cost over the units results in an
additional $200 per unit
NRE and unit cost metrics

Compare technologies by costs -- best depends on quantity
Technology A: NRE=$2,000, unit=$100
Technology B: NRE=$30,000, unit=$30
Technology C: NRE=$100,000, unit=$2
[Figure: two plots for technologies A, B, and C against number of units (volume, 0 to 2400): total cost and per-product cost]
But, must also consider time-to-market
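
The comparison of technologies A, B, and C can be reproduced numerically. The sketch below is illustrative only: the Technology struct and the function names are assumptions, not something from the slides; the formulas are the total cost and per-product cost definitions given above.

```c
#include <assert.h>

/* One implementation technology, described by its two cost metrics. */
typedef struct {
    const char *name;
    double nre;        /* one-time design cost, in dollars         */
    double unit_cost;  /* manufacturing cost per copy, in dollars  */
} Technology;

/* total cost = NRE cost + unit cost * # of units */
double total_cost(const Technology *t, unsigned long units)
{
    return t->nre + t->unit_cost * (double)units;
}

/* per-product cost = total cost / # of units */
double per_product_cost(const Technology *t, unsigned long units)
{
    assert(units > 0);
    return total_cost(t, units) / (double)units;
}

/* Index of the technology with the lowest total cost at a given volume.
 * On a tie, the earlier entry in the array is kept. */
int cheapest(const Technology *techs, int count, unsigned long units)
{
    assert(count > 0);
    int best = 0;
    for (int i = 1; i < count; i++) {
        if (total_cost(&techs[i], units) < total_cost(&techs[best], units))
            best = i;
    }
    return best;
}
```

With the slide's numbers, A is cheapest below 400 units, B between 400 and 2500 units, and C above 2500 units; the crossover points follow from setting the total cost expressions equal.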

The performance design metric

Widely-used measure of system, widely-abused
Clock frequency, instructions per second not good measures
Digital camera example -- a user cares about how fast it processes images, not clock speed or instructions per second
Latency (response time)

Time between task start and end
e.g., Cameras A and B process images in 0.25 seconds
Throughput
Tasks per second, e.g. Camera A processes 4 images per second
Throughput can be more than latency seems to imply due to concurrency, e.g.
Camera B may process 8 images per second (by capturing a new image while
previous image is being stored).
Speedup of B over A = B's performance / A's performance
Throughput speedup = 8/4 = 2
Three key embedded system technologies

Technology
A manner of accomplishing a task, especially using technical
processes, methods, or knowledge
Three key technologies for embedded systems

Processor technology
IC technology
Design technology

Processor technology
The architecture of the computation engine used to implement a system's desired functionality
Processor does not have to be programmable
"Processor" not equal to general-purpose processor
[Figure: controller and datapath of three processor types, each computing total = 0; for i = 1 to N: total += M[i]]
General-purpose: control logic and state register, IR, PC, register file, general ALU, program memory holding the assembly code, data memory
Application-specific: control logic and state register, IR, PC, registers, custom ALU, program memory holding the assembly code, data memory
Single-purpose (hardware): control logic, state register, index and total registers, adder, data memory

Processors vary in their customization for the problem at hand
Desired functionality:
total = 0
for i = 1 to N loop
total += M[i]
end loop
[Figure: the desired functionality mapped onto a general-purpose processor, an application-specific processor, and a single-purpose processor]
General-purpose processors
Programmable device used in a variety of
applications
Also known as microprocessor
Features
Program memory
General datapath with large register file and
general ALU
User benefits
Low time-to-market and NRE costs
High flexibility
"Pentium" the most well-known, but there are hundreds of others
[Figure: general-purpose processor: controller (control logic and state register, IR, PC), datapath (register file, general ALU), program memory holding the assembly code for the total loop, data memory]

Single-purpose processors
Digital circuit designed to execute exactly
one program
a.k.a. coprocessor, accelerator or peripheral
Features
Contains only the components needed to
execute a single program
No program memory
Benefits
Fast
Low power
Small size
[Figure: single-purpose processor: controller (control logic, state register), datapath (index, total), data memory]
Application-specific processors
Programmable processor optimized for a
particular class of applications having
common characteristics
Compromise between general-purpose and
single-purpose processors
Features
Program memory
Optimized datapath
Special functional units
Benefits
Some flexibility, good performance, size and power
[Figure: application-specific processor: controller (control logic and state register, IR, PC), datapath (registers, custom ALU), program memory holding the assembly code for the total loop, data memory]
IC technology
The manner in which a digital (gate-level)
implementation is mapped onto an IC
IC: Integrated circuit, or chip
IC technologies differ in their customization to a design
ICs consist of numerous layers (perhaps 10 or more)
IC technologies differ with respect to who builds each layer and
when
[Figure: IC package containing an IC; transistor cross-section showing source, gate, oxide, channel, and drain on a silicon substrate]

IC technology
Three types of IC technologies
Full-custom/VLSI
Semi-custom ASIC (gate array and standard cell)
PLD (Programmable Logic Device)

Full-custom/VLSI
All layers are optimized for an embedded system's particular digital implementation
Placing transistors
Sizing transistors
Routing wires
Benefits
Excellent performance, small size, low power
Drawbacks
High NRE cost (e.g., $300k), long time-to-market
Har
Semi-custom
Lower layers are fully or partially built
Designers are left with routing of wires and maybe placing
some blocks
Benefits
Good performance, good size, less NRE cost than a full-custom implementation (perhaps $10k to $100k)
Drawbacks
Still require weeks to months to develop

PLD (Programmable Logic Device)

All layers already exist
Designers can purchase an IC
Connections on the IC are either created or destroyed to
implement desired functionality
Field-Programmable Gate Array (FPGA) very popular
Benefits
Low NRE costs, almost instant IC availability
Drawbacks
Bigger, expensive (perhaps $30 per unit), power hungry,
slower
Moore's law
The most important trend in embedded systems
Predicted in 1965 by Intel co-founder Gordon Moore
IC transistor capacity has doubled roughly every 18 months for the past several decades
[Figure: logic transistors per chip (in millions), from 0.001 to 10,000. Note: logarithmic scale]

Moore's law
Wow
This growth rate is hard to imagine; most people underestimate it
How many ancestors do you have from 20 generations ago?
i.e., roughly how many people alive in the 1500s did it take to make you?
2^20 = more than 1 million people
(This underestimation is the key to pyramid schemes!)

Graphical illustration of Moore's law

[Figure: timeline from 1981 to 2002 in 3-year steps; leading edge chip in 1981: 10,000 transistors; leading edge chip in 2002: 150,000,000 transistors]
Something that doubles frequently grows more quickly than most people realize!
A 2002 chip can hold about 15,000 1981 chips inside itself
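
The doubling arithmetic behind these numbers is easy to verify. The sketch below is an illustration under stated assumptions: the function name capacity_after is hypothetical, and the 18-month doubling period is the figure quoted on the earlier Moore's law slide. Only whole doubling periods are counted.

```c
#include <assert.h>

/* Capacity after `years`, starting from `start` and doubling once every
 * `period` years. Fractional periods are ignored (whole doublings only). */
double capacity_after(double start, double years, double period)
{
    assert(period > 0.0 && years >= 0.0);

    int doublings = (int)(years / period);
    double capacity = start;
    for (int i = 0; i < doublings; i++)
        capacity *= 2.0;   /* one doubling per elapsed period */
    return capacity;
}
```

From 1981 to 2002 there are 21 years, i.e., 14 doublings at 1.5 years each, so capacity_after(10000.0, 21.0, 1.5) returns 163,840,000, in line with the slide's rounded 150,000,000 transistors and its factor of about 15,000. The ancestor example is the same computation: capacity_after(1.0, 20.0, 1.0) returns 1,048,576.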
Design Technology
The manner in which we convert our concept of desired system functionality into an implementation
Compilation/Synthesis: Automates exploration and insertion of implementation details for lower level.
Libraries/IP: Incorporates pre-designed implementation from lower abstraction level into higher level.
Test/Verification: Ensures correct functionality at each level, thus reducing costly iterations between levels.

Abstraction level          Compilation/Synthesis   Libraries/IP    Test/Verification
System specification       System synthesis        Hw/Sw/OS        Model simulat./checkers
Behavioral specification   Behavior synthesis      Cores           Hw-Sw cosimulators
RT specification           RT synthesis            RT components   HDL simulators
Logic specification        Logic synthesis         Gates/Cells     Gate simulators
To final implementation

Design productivity exponential increase

[Figure: productivity in (K) transistors per staff-month, from 0.01 to 100,000 on a logarithmic scale, for the years 1983 to 2009]
Exponential increase over the past few decades

The co-design ladder

In the past: Hardware and software design technologies were very different
Recent maturation of synthesis enables a unified view of hardware and software
Hardware/software "codesign"
[Figure: the co-design ladder, starting from sequential program code (e.g., C, VHDL)]
Software side: compilers (1960's, 1970's) produce assembly instructions; assemblers, linkers (1950's, 1960's) produce machine instructions; implementation is a microprocessor plus program bits: "software"
Hardware side: behavioral synthesis (1990's) produces register transfers; RT synthesis (1980's, 1990's) produces logic equations / FSM's; logic synthesis (1970's, 1980's) produces logic gates; implementation is a VLSI, ASIC, or PLD implementation: "hardware"
The choice of hardware versus software for a particular function is simply a tradeoff among various
design metrics, like performance, power, size, NRE cost, and especially flexibility; there is no
fundamental difference between what hardware or software can implement.
Independence of processor and IC technologies
Basic tradeoff
General vs. custom
With respect to processor technology or IC technology
The two technologies are independent
General, providing improved: flexibility, maintainability, NRE cost, time-to-prototype, time-to-market, cost (low volume)
Customized, providing improved: power efficiency, performance, size, cost (high volume)
[Figure: processor technologies (general-purpose processor, ASIP, single-purpose processor) and IC technologies (PLD, semi-custom, full-custom), each ordered from general to customized]

Design productivity gap

While designer productivity has grown at an impressive rate over the past decades, the rate of improvement has not kept pace with chip capacity
[Figure: IC capacity (logic transistors per chip, in millions) and productivity ((K) transistors per staff-month) on logarithmic scales, with a widening gap between the two curves]
Design productivity gap

1981 leading edge chip required 100 designer months
10,000 transistors / 100 transistors/month
2002 leading edge chip requires 30,000 designer months

150,000,000 / 5000 transistors/month
Designer cost increase from $1M to $300M
The mythical man-month

The situation is even worse than the productivity gap indicates
In theory, adding designers to team reduces project completion time

In reality, productivity per designer decreases due to complexities of team management
and communication
In the software community, known as the mythical man-month (Brooks 1975)
At some point, can actually lengthen project completion time! (Too many cooks)
1M transistors, 1 designer = 5000 trans/month
Each additional designer reduces productivity by 100 trans/month
So 2 designers produce 4900 trans/month each
[Figure: team and individual productivity (transistors per month, 0 to 60,000) and months until completion, plotted against number of designers (0 to 40); labelled completion times lie between 16 and 43 months]
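
The team-size example can be worked through in code. This is a rough sketch of the model stated on the slide (5000 transistors/month for a lone designer, minus 100 transistors/month per designer for each additional team member); the constant and function names are assumptions made for illustration.

```c
#include <assert.h>

#define BASE_RATE   5000.0  /* trans/month for a single designer (from the slide) */
#define RATE_PENALTY 100.0  /* loss per designer for each additional team member  */

/* Transistors per month produced by each designer in a team of n. */
double individual_rate(int n)
{
    assert(n >= 1);
    return BASE_RATE - RATE_PENALTY * (n - 1);
}

/* Transistors per month produced by the whole team of n designers. */
double team_rate(int n)
{
    return n * individual_rate(n);
}

/* Months needed by a team of n designers to finish `transistors`. */
double months_to_complete(double transistors, int n)
{
    double rate = team_rate(n);
    assert(rate > 0.0);   /* the model breaks down once the rate reaches zero */
    return transistors / rate;
}
```

For the 1M-transistor project, one designer needs 200 months, 10 designers about 24 months, and 25 designers about 15 months; at 40 designers the time rises again to about 23 months, which is the "too many cooks" effect.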
Summary
Embedded systems are everywhere
Key challenge: optimization of design metrics
Design metrics compete with one another
A unified view of hardware and software is necessary to improve productivity
Three key technologies
Processor: general-purpose, application-specific, single-purpose
IC: Full-custom, semi-custom, PLD
Design: Compilation/synthesis, libraries/IP, test/verification

Embedded Systems Design: A Unified Hardware/Software Introduction
Chapter 10: IC Technology
Outline
Anatomy of integrated circuits
Full-Custom (VLSI) IC Technology
Semi-Custom (ASIC) IC Technology
Programmable Logic Device (PLD) IC Technology

CMOS transistor
Source, Drain
Diffusion area where electrons can flow
Can be connected to metal contacts (vias)
Gate
Polysilicon area where control voltage is applied
Oxide
SiO2 insulator so the gate voltage can't leak

End of Moore's Law?

Every dimension of the MOSFET has to scale
(PMOS) Gate oxide has to scale down to
Increase gate capacitance
Reduce leakage current from S to D
Pinch off current from source to drain
Current gate oxide thickness is about 2.5-3nm
That's about 25 atoms!!!
[Figure: IC package and transistor cross-section (source, gate, oxide, channel, drain, silicon substrate)]


20 GHz +
FinFET has been manufactured to 18nm
Still acts as a very good transistor
Simulation has shown that it can be scaled to 10nm
Quantum effects start to kick in
Reduce mobility by ~10%
Ballistic transport becomes significant
Increase current by ~20%

NAND
Metal layers for routing (~10)

PMOS don't like 0
NMOS don't like 1
A stick diagram forms the basis for mask sets

Silicon manufacturing steps

Tape out
Send design to manufacturing
Spin
One time through the manufacturing process
Photolithography
Drawing patterns by using photoresist to form barriers for deposition

Full Custom
Very Large Scale Integration (VLSI)
Placement
Place and orient transistors
Routing
Connect transistors
Sizing
Make fat, fast wires or thin, slow wires
May also need to size buffer
Design Rules
simple rules for correct circuit function
Metal/metal spacing, min poly width
Full Custom
Best size, power, performance
Hand design
Horrible time-to-market/flexibility/NRE cost
Reserve for the most important units in a processor
ALU, Instruction fetch
Physical design tools

Less optimal, but faster

Semi-Custom
Gate Array
Array of prefabricated gates
Place and route
Higher density, faster time-to-market
Does not integrate as well with full-custom
Standard Cell
A library of pre-designed cells
Place and route
Lower density, higher complexity
Integrates well with full-custom

Semi-Custom
Most popular design style
Jack of all trades
Good
Power, time-to-market, performance,
NRE cost, per-unit cost, area
Master of none
Integrate with full custom for
critical regions of design

Programmable Logic Device

Programmable Logic Array, Programmable Array Logic, Field Programmable Gate Array
All layers already exist
Designers can purchase an IC
Connections on the IC are either created or destroyed to implement desired functionality
Benefits
Very low NRE costs
Great time to market
Drawbacks
High unit cost, bad for large volume
Power
Except special PLA
Slower
800-6400 usable gates
5-15 ns delay, up to 125 MHz (2004)
Price of a few $s
Xilinx FPGA
Configurable Logic Block (CLB)
I/O Block

Chapter 8: State Machine and Concurrent Process Model
Outline
Models vs. Languages
State Machine Model
FSM/FSMD
HCFSM and Statecharts Language
Program-State Machine (PSM) Model
Concurrent Process Model

Communication
Synchronization
Implementation
Dataflow Model
Real-Time Systems
Introduction
Describing embedded system's processing behavior
Can be extremely difficult
Complexity increasing with increasing IC capacity
Past: washing machines, small games, etc.
Hundreds of lines of code
Today: TV set-top boxes, Cell phone, etc.
Hundreds of thousands of lines of code
Desired behavior often not fully understood in beginning

Many implementation bugs due to description mistakes/omissions
English (or other natural language) common starting point

Precise description difficult to impossible
Example: Motor Vehicle Code -- thousands of pages long...

An example of trying to be precise in English

California Vehicle Code
Right-of-way of crosswalks
21950. (a) The driver of a vehicle shall yield the right-of-way to a pedestrian crossing
the roadway within any marked crosswalk or within any unmarked crosswalk at an
intersection, except as otherwise provided in this chapter.
(b) The provisions of this section shall not relieve a pedestrian from the duty of using
due care for his or her safety. No pedestrian shall suddenly leave a curb or other place
of safety and walk or run into the path of a vehicle which is so close as to constitute
an immediate hazard. No pedestrian shall unnecessarily stop or delay traffic while in a
marked or unmarked crosswalk.
(c) The provisions of subdivision (b) shall not relieve a driver of a vehicle from the
duty of exercising due care for the safety of any pedestrian within any marked
crosswalk or within any unmarked crosswalk at an intersection.
All that just for crossing the street (and there's much more)!

Models and languages

How can we (precisely) capture behavior?
We may think of languages (C, C++), but computation model is the key
Common computation models:

Sequential program model
Statements, rules for composing statements, semantics for executing them
Communicating process model

Multiple sequential programs running concurrently
State machine model

For control dominated systems, monitors control inputs, sets control outputs
Dataflow model
For data dominated systems, transforms input data streams into output streams
Object-oriented model
For breaking complex software into simpler, well-defined pieces
Models vs. languages

[Figure: models (poetry, recipe, story) captured in languages (English, Spanish, Japanese), i.e., recipes vs. English; models (state machine, sequential program, dataflow) captured in languages (C, C++, Java), i.e., sequential programs vs. C]
Computation models describe system behavior

Conceptual notion, e.g., recipe, sequential program
Languages capture models

Concrete form, e.g., English, C
Variety of languages can capture one model

E.g., sequential program model → C, C++, Java
One language can capture variety of models

E.g., C++ → sequential program model, object-oriented model, state machine model
Certain languages better at capturing certain computation models

Text versus Graphics

Models versus languages not to be confused with text
versus graphics
Text and graphics are just two types of languages
Text: letters, numbers
Graphics: circles, arrows (plus some letters, numbers)
[Figure: the statements X = 1; Y = X + 1; shown once as text and once as graphics]

Introductory example: An elevator controller

Simple elevator controller
Request Resolver resolves various floor requests into single requested floor
Unit Control moves elevator to this requested floor
Try capturing in C...

Partial English description
"Move the elevator either up or down to reach the requested floor. Once at the requested floor, open the door for at least 10 seconds, and keep it open until the requested floor changes. Ensure the door is never open while moving. Don't change directions unless there are no higher requests when moving up or no lower requests when moving down..."
System interface
[Figure: Request Resolver reads the buttons inside the elevator (b1, b2, ..., bN) and the up/down buttons on each floor (up1, up2, dn2, up3, dn3, ..., dnN) and produces req; Unit Control reads req and floor and drives up, down, open]
Elevator controller using a sequential program model

Sequential program model
Inputs: int floor; bit b1..bN; up1..upN-1; dn2..dnN;
Outputs: bit up, down, open;
Global variables: int req;

void UnitControl()
{
    up = down = 0; open = 1;
    while (1) {
        while (req == floor);
        open = 0;
        if (req > floor) { up = 1;}
        else {down = 1;}
        while (req != floor);
        up = down = 0;
        open = 1;
        delay(10);
    }
}
void RequestResolver()
{
    while (1)
        ...
        req = ...
        ...
}
void main()
{
    Call concurrently:
        UnitControl() and
        RequestResolver()
}

You might have come up with something having even more if statements.


Finite-state machine (FSM) model

Trying to capture this behavior as sequential program is a bit
awkward
Instead, we might consider an FSM model, describing the system
as:
Possible states
E.g., Idle, GoingUp, GoingDn, DoorOpen
Possible transitions from one state to another based on input

E.g., req > floor
Actions that occur in each state

E.g., In the GoingUp state, u,d,o,t = 1,0,0,0 (up = 1, down, open, and
timer_start = 0)
Try it...

Finite-state machine (FSM) model

UnitControl process using a state machine
[State diagram, listed as state (outputs): condition → next state]
Idle (u,d,o,t = 0,0,1,0): req == floor → Idle; req > floor → GoingUp; req < floor → GoingDn
GoingUp (u,d,o,t = 1,0,0,0): req > floor → GoingUp; !(req > floor) → DoorOpen
GoingDn (u,d,o,t = 0,1,0,0): req < floor → GoingDn; !(req<floor) → DoorOpen
DoorOpen (u,d,o,t = 0,0,1,1): timer < 10 → DoorOpen; !(timer < 10) → Idle
u is up, d is down, o is open
t is timer_start

Formal definition
An FSM is a 6-tuple F<S, I, O, F, H, s0>
S is a set of all states {s0, s1, ..., sl}
I is a set of inputs {i0, i1, ..., im}
O is a set of outputs {o0, o1, ..., on}
F is a next-state function (S x I → S)
H is an output function (S → O)
s0 is an initial state
Moore-type
Associates outputs with states (as given above, H maps S → O)
Mealy-type
Associates outputs with transitions (H maps S x I → O)
Shorthand notations to simplify descriptions

Implicitly assign 0 to all unassigned outputs in a state
Implicitly AND every transition condition with clock edge (FSM is synchronous)
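
The formal definition maps directly onto a table-driven data structure. The following is a minimal sketch of a Moore-type FSM in C; the MooreFsm struct, the two-state toggle machine, and the moore_run function are hypothetical examples chosen for illustration and do not come from the slides.

```c
#include <assert.h>

enum { NUM_STATES = 2, NUM_INPUTS = 2 };
enum State { OFF = 0, ON = 1 };          /* S = {OFF, ON}             */
enum Input { NO_PRESS = 0, PRESS = 1 };  /* I = {NO_PRESS, PRESS}     */

typedef struct {
    int next_state[NUM_STATES][NUM_INPUTS];  /* F: S x I -> S          */
    int output[NUM_STATES];                  /* H: S -> O (Moore-type) */
    int initial;                             /* s0                     */
} MooreFsm;

/* Hypothetical example machine: a light toggled by a push button. */
static const MooreFsm toggle = {
    .next_state = { [OFF] = { OFF, ON }, [ON] = { ON, OFF } },
    .output     = { [OFF] = 0, [ON] = 1 },
    .initial    = OFF
};

/* Feeds `count` inputs to the machine, one per clock edge, and returns
 * the output associated with the state reached at the end. */
int moore_run(const MooreFsm *m, const int *inputs, int count)
{
    int s = m->initial;
    for (int i = 0; i < count; i++) {
        assert(inputs[i] >= 0 && inputs[i] < NUM_INPUTS);
        s = m->next_state[s][inputs[i]];
    }
    return m->output[s];
}
```

A Mealy-type variant would index the output table by both state and input, mirroring H: S x I → O.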
Finite-state machine with datapath model (FSMD)
FSMD extends FSM: complex data types and variables for storing data
FSMs use only Boolean data types and operations, no variables
We described UnitControl as an FSMD
FSMD: 7-tuple <S, I, O, V, F, H, s0>
S is a set of states {s0, s1, ..., sl}
I is a set of inputs {i0, i1, ..., im}
O is a set of outputs {o0, o1, ..., on}
V is a set of variables {v0, v1, ..., vn}
F is a next-state function (S x I x V → S)
H is an action function (S → O + V)
s0 is an initial state
I, O, V may represent complex data types (i.e., integers, floating point, etc.)
F, H may include arithmetic operations
H is an action function, not just an output function
Describes variable updates as well as outputs
Complete system state now consists of current state, si, and values of all variables
[Figure: UnitControl state diagram (states Idle, GoingUp, GoingDn, DoorOpen)]

Describing a system as a state machine

1. List all possible states
2. Declare all variables (none in this example)
3. For each state, list possible transitions, with conditions, to other states
4. For each state and/or transition, list associated actions
5. For each state, ensure exclusive and complete exiting transition conditions
No two exiting conditions can be true at same time
Otherwise nondeterministic state machine
One condition must be true at any given time
Reducing explicit transitions should be avoided when first learning
[Figure: UnitControl state diagram (states Idle, GoingUp, GoingDn, DoorOpen)]
State machine vs. sequential program model

Different thought process used with each model
State machine:
Encourages designer to think of all possible states and transitions among states
based on all possible input conditions
Sequential program model:

Designed to transform data through series of instructions that may be iterated and
conditionally executed
State machine description excels in many cases

More natural means of computing in those cases
Not due to graphical representation (state diagram)
Would still have same benefits if textual language used (i.e., state table)
Besides, sequential program model could use graphical representation (i.e., flowchart)

Try Capturing Other Behaviors with an FSM

E.g., Answering machine blinking light when there are
messages
E.g., A simple telephone answering machine that
answers after 4 rings when activated
E.g., A simple crosswalk traffic control light
Others
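
One possible way to capture the second exercise (answering after 4 rings when activated) is sketched below. It is only an illustration under simplifying assumptions: the state names, the AnsweringMachine struct, and the am_step function are invented here, the ring counter makes it an FSMD rather than a pure FSM, and answering is assumed to take a single step.

```c
#include <assert.h>

enum AmState { AM_WAIT, AM_COUNTING, AM_ANSWER };

typedef struct {
    enum AmState state;
    int rings;   /* rings counted so far (a variable, so this is an FSMD) */
} AnsweringMachine;

/* Evaluates the machine once. `activated` is 1 while the machine is
 * switched on; `ring` is 1 when a new ring was detected in this step.
 * Returns 1 when the machine has just decided to answer the call. */
int am_step(AnsweringMachine *m, int activated, int ring)
{
    assert(m != 0);

    switch (m->state) {
    case AM_WAIT:
        m->rings = 0;
        if (activated && ring) { m->rings = 1; m->state = AM_COUNTING; }
        break;
    case AM_COUNTING:
        if (!activated) {
            m->state = AM_WAIT;
        } else if (ring) {
            m->rings++;
            if (m->rings >= 4) { m->state = AM_ANSWER; }
        }
        break;
    case AM_ANSWER:
        /* Assumption: answering is modelled as a single step. */
        m->state = AM_WAIT;
        break;
    }
    return m->state == AM_ANSWER;
}
```

A fuller model would also handle the caller hanging up between rings, which this sketch leaves out.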

Capturing state machines in sequential programming language
Despite benefits of state machine model, most popular development tools use
sequential programming language
C, C++, Java, Ada, VHDL, Verilog, etc.
Development tools are complex and expensive, therefore not easy to adapt or replace
Must protect investment
Two approaches to capturing state machine model with sequential programming language
Front-end tool approach
Additional tool installed to support state machine language
Graphical and/or textual state machine languages
May support graphical simulation
Automatically generate code in sequential programming language that is input to main development tool
Drawback: must support additional tool (licensing costs, upgrades, training, etc.)
Language subset approach
Most common approach...

Language subset approach
Follow rules (template) for capturing state machine constructs in equivalent sequential language constructs
Used with software (e.g., C) and hardware languages (e.g., VHDL)
Capturing UnitControl state machine in C
Enumerate all states (#define)
Declare state variable initialized to initial state (IDLE)
Single switch statement branches to current state's case
Each case has actions
up, down, open, timer_start
Each case checks transition conditions to determine next state
if(...) {state = ...;}

#define IDLE     0
#define GOINGUP  1
#define GOINGDN  2
#define DOOROPEN 3
void UnitControl() {
    int state = IDLE;
    while (1) {
        switch (state) {
            case IDLE: up=0; down=0; open=1; timer_start=0;
                if (req==floor) {state = IDLE;}
                if (req > floor) {state = GOINGUP;}
                if (req < floor) {state = GOINGDN;}
                break;
            case GOINGUP: up=1; down=0; open=0; timer_start=0;
                if (req > floor) {state = GOINGUP;}
                if (!(req>floor)) {state = DOOROPEN;}
                break;
            case GOINGDN: up=0; down=1; open=0; timer_start=0;
                if (req < floor) {state = GOINGDN;}
                if (!(req<floor)) {state = DOOROPEN;}
                break;
            case DOOROPEN: up=0; down=0; open=1; timer_start=1;
                if (timer < 10) {state = DOOROPEN;}
                if (!(timer<10)) {state = IDLE;}
                break;
        }
    }
}
UnitControl state machine in sequential programming language
General template
#define S0 0
#define S1 1
...
#define SN N
void StateMachine() {
    int state = S0; // or whatever is the initial state.
    while (1) {
        switch (state) {
            case S0:
                // Insert S0's actions here & Insert transitions Ti leaving S0:
                if( T0's condition is true ) {state = T0's next state; /*actions*/ }
                if( T1's condition is true ) {state = T1's next state; /*actions*/ }
                ...
                if( Tm's condition is true ) {state = Tm's next state; /*actions*/ }
                break;
            case S1:
                // Insert S1's actions here
                // Insert transitions Ti leaving S1
                break;
            ...
            case SN:
                // Insert SN's actions here
                // Insert transitions Ti leaving SN
                break;
        }
    }
}

HCFSM and the Statecharts language
Hierarchical/concurrent state machine model (HCFSM)
Extension to state machine model to support hierarchy and concurrency
States can be decomposed into another state machine
With hierarchy has identical functionality as Without hierarchy, but has one less transition (z)
Known as OR-decomposition
States can execute concurrently
Known as AND-decomposition
Statecharts
Graphical language to capture HCFSM
timeout: transition with time limit as condition
history: remember last substate OR-decomposed state A was in before transitioning to another state B
Return to saved substate of A when returning from B instead of initial state
[Figure: states A1, A2, and B drawn without hierarchy and with hierarchy (A1 and A2 inside A); concurrency example with B containing the concurrent states C (C1, C2) and D (D1, D2)]

UnitControl with FireMode

FireMode
When fire is true, move elevator to 1st floor and open door
w/o hierarchy: Getting messy!
w/ hierarchy: Simple!
[Figure: UnitControl without hierarchy: each of Idle, GoingUp, GoingDn, and DoorOpen has its own fire transition to FireGoingDn, and FireDrOpen returns to the normal states on !fire]
[Figure: UnitControl with hierarchy: NormalMode (Idle, GoingUp, GoingDn, DoorOpen, with DoorOpen left on timeout(10)) and FireMode, connected by one fire and one !fire transition]
FireMode substates, listed as state (outputs): condition → next state
FireGoingDn (u,d,o = 0,1,0): floor>1 → FireGoingDn; floor==1 → FireDrOpen
FireDrOpen (u,d,o = 0,0,1): fire → FireDrOpen
With concurrent RequestResolver
[Figure: ElevatorController containing UnitControl (NormalMode, FireMode) running concurrently with RequestResolver]

Program-state machine model (PSM): HCFSM plus sequential program model
Program-state's actions can be FSM or sequential program
Designer can choose most appropriate
Stricter hierarchy than HCFSM used in Statecharts
transition between sibling states only, single entry
Program-state may "complete"
Reaches end of sequential program code, OR
FSM transition to special complete substate
PSM has 2 types of transitions
Transition-immediately (TI): taken regardless of source program-state
Transition-on-completion (TOC): taken only if condition is true AND source program-state is complete
SpecCharts: extension of VHDL to capture PSM model
SpecC: extension of C to capture PSM model
ElevatorController
int req;
UnitControl
NormalMode
   up = down = 0; open = 1;
   while (1) {
      while (req == floor);
      open = 0;
      if (req > floor) { up = 1;}
      else {down = 1;}
      while (req != floor);
      up = down = 0;
      open = 1;
      delay(10);
   }
FireMode
   up = 0; down = 1; open = 0;
   while (floor > 1);
   up = 0; down = 0; open = 1;
Transitions: fire from NormalMode to FireMode, !fire from FireMode to NormalMode
RequestResolver
   ...
   req = ...
   ...
NormalMode and FireMode described as sequential programs
Black square originating within FireMode indicates !fire is a TOC transition
Transition from FireMode to NormalMode only after FireMode completed
Role of appropriate model and language
Finding appropriate model to capture embedded system is an important step

Model shapes the way we think of the system
Originally thought of sequence of actions, wrote sequential program
First wait for requested floor to differ from current floor

Then, we close the door
Then, we move up or down to the desired floor
Then, we open the door
Then, we repeat this sequence
To create state machine, we thought in terms of states and transitions among states
When system must react to changing inputs, state machine might be best model
HCFSM described FireMode easily, clearly
Language should capture model easily

Ideally should have features that directly capture constructs of model
FireMode would be very complex in sequential program
Checks inserted throughout code
Other factors may force choice of different model

Structured techniques can be used instead
E.g., Template for state machine capture in sequential program language

Concurrent process model
ConcurrentProcessExample() {
   x = ReadX()
   y = ReadY()
   Call concurrently:
      PrintHelloWorld(x) and
      PrintHowAreYou(y)
}
PrintHelloWorld(x) {
   while( 1 ) {
      print "Hello world."
      delay(x);
   }
}
PrintHowAreYou(y) {
   while( 1 ) {
      print "How are you?"
      delay(y);
   }
}
Describes functionality of system in terms of two or more concurrently executing subtasks
Many systems easier to describe with concurrent process model because inherently multitasking
E.g., simple example:
Read two numbers X and Y
Display "Hello world." every X seconds
Display "How are you?" every Y seconds
More effort would be required with sequential program or state machine model
Simple concurrent process example
[Figure: Subroutine execution over time: ReadX, ReadY, PrintHelloWorld, PrintHowAreYou]
Sample input and output
Enter X: 1
Enter Y: 2
Hello world.  (Time = 1 s)
Hello world.  (Time = 2 s)
How are you?  (Time = 2 s)
Hello world.  (Time = 3 s)
How are you?  (Time = 4 s)
Hello world.  (Time = 4 s)
...
Dataflow model
Derivative of concurrent process model
Nodes represent transformations
May execute concurrently
Edges represent flow of tokens (data) from one node to another
May or may not have token at any given time
When all of node's input edges have at least one token, node may fire
When node fires, it consumes input tokens, processes transformation and generates output token
Nodes may fire simultaneously
Several commercial tools support graphical languages for capture of dataflow model
Can automatically translate to concurrent process model for implementation
Each node becomes a process
[Figure: Nodes with arithmetic transformations: Z = (A + B) * (C - D), with intermediate tokens t1 and t2]
[Figure: Nodes with more complex transformations: modulate and convolve produce t1 and t2, which feed transform]
Synchronous dataflow
With digital signal-processors (DSPs), data flows at fixed rate
Multiple tokens consumed and produced per firing
Synchronous dataflow model takes advantage of this
Each edge labeled with number of tokens consumed/produced each firing
Can statically schedule nodes, so can easily use sequential program model
Don't need real-time operating system and its overhead
How would you map this model to a sequential programming language? Try it...
Algorithms developed for scheduling nodes into "single-appearance" schedules
Only one statement needed to call each node's associated procedure
Allows procedure inlining without code explosion, thus reducing overhead even more
[Figure: Synchronous dataflow: nodes modulate, convolve and transform with edges labeled by token counts (mA, mB, mC, mD, mt1, ct2, tt1, tt2, tZ)]
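One possible answer to the "Try it" question, for the simpler arithmetic graph Z = (A + B) * (C - D): the sketch below assumes every edge carries exactly one token per firing, so a static schedule is just a fixed call order in which each node's procedure appears once. The function names are hypothetical.

```c
/* Node procedures: each consumes one token per input edge and produces
 * one token on its output edge (all rates assumed to be 1). */
static int node_add(int a, int b)   { return a + b; }
static int node_sub(int c, int d)   { return c - d; }
static int node_mul(int t1, int t2) { return t1 * t2; }

/* One iteration of the static schedule: add, sub, then mul.
 * Local variables t1 and t2 play the role of the edges. */
int dataflow_iteration(int a, int b, int c, int d)
{
    int t1 = node_add(a, b);
    int t2 = node_sub(c, d);
    return node_mul(t1, t2);
}
```

Because the firing order is fixed at compile time, no scheduler or operating system is involved; with multi-rate edges the same idea applies, but nodes would be called inside loops whose counts come from the edge labels.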

Concurrent processes and real-time systems
Concurrent processes
Consider two examples
having separate tasks running
independently but sharing
data
Difficult to write system
using sequential program
model
Concurrent process model
easier
Separate sequential
programs (processes) for
each task
Programs communicate with
each other
Heartbeat Monitoring System

B[1..4]
Heart-beat
pulse
Task 1:
Read pulse
If pulse < Lo then
Activate Siren
If pulse > Hi then
Activate Siren
Sleep 1 second
Repeat
Task 2:
If B1/B2 pressed then
Lo = Lo +/- 1
If B3/B4 pressed then
Hi = Hi +/- 1
Sleep 500 ms
Repeat
Set-top Box
Input
Signal
Task 1:
Read Signal
Separate Audio/Video
Send Audio to Task 2
Send Video to Task 3
Repeat
Task 2:
Wait on Task 1
Decode/output Audio
Repeat
Task 3:
Wait on Task 1
Decode/output Video
Repeat
Video
Audio
Process
A sequential program, typically an infinite loop
Executes concurrently with other processes
We are about to enter the world of concurrent programming
Basic operations on processes

Create and terminate
Create is like a procedure call but caller doesn't wait
Created process can itself create new processes
Terminate kills a process, destroying all data
In HelloWorld/HowAreYou example, we only created processes
Suspend and resume

Suspend puts a process on hold, saving state for later execution
Resume starts the process again where it left off
Join
A process suspends until a particular child process finishes execution
Communication among processes

Processes need to communicate data and
signals to solve their computation problem
Processes that don't communicate are just independent programs solving separate problems
Basic example: producer/consumer
Process A produces data items, Process B consumes them
E.g., A decodes video packets, B displays decoded packets on a screen
[Figure: encoded video packets -> processA -> decoded video packets -> processB -> to display]
void processA() {
   // Decode packet
   // Communicate packet to B
}
void processB() {
   // Get packet from A
   // Display packet
}
How do we achieve this communication?
Two basic methods
Shared memory
Message passing
Shared Memory
Processes read and write shared variables
No time overhead, easy to implement
But hard to use; mistakes are common
Example: Producer/consumer with a mistake
Share buffer[N], count
count = # of valid data items in buffer
processA produces data items and stores in buffer
If buffer is full, must wait
processB consumes data items from buffer
If buffer is empty, must wait
Error when both processes try to update count concurrently (lines 10 and 19) and the following execution sequence occurs. Say "count" is 3.
A loads count (count = 3) from memory into register R1 (R1 = 3)
A increments R1 (R1 = 4)
B loads count (count = 3) from memory into register R2 (R2 = 3)
B decrements R2 (R2 = 2)
A stores R1 back to count in memory (count = 4)
B stores R2 back to count in memory (count = 2)
01: data_type buffer[N];
02: int count = 0;
03: void processA() {
04:    int i;
05:    while( 1 ) {
06:       produce(&data);
07:       while( count == N );/*loop*/
08:       buffer[i] = data;
09:       i = (i + 1) % N;
10:       count = count + 1;
11:    }
12: }
13: void processB() {
14:    int i;
15:    while( 1 ) {
16:       while( count == 0 );/*loop*/
17:       data = buffer[i];
18:       i = (i + 1) % N;
19:       count = count - 1;
20:       consume(&data);
21:    }
22: }
23: void main() {
24:    create_process(processA);
25:    create_process(processB);
26: }
count now has incorrect value of 2
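The faulty interleaving can be replayed deterministically in a single thread by making the two "registers" explicit. This is only a sketch of the execution sequence listed on the slide, not a real concurrent program; the function name lost_update_demo is an assumption.

```c
/* Replays the slide's interleaving of count = count + 1 (process A)
 * and count = count - 1 (process B) using explicit register variables. */
int lost_update_demo(int count)
{
    int r1, r2;
    r1 = count;     /* A loads count into R1 */
    r1 = r1 + 1;    /* A increments R1 */
    r2 = count;     /* B loads count into R2 (still the old value) */
    r2 = r2 - 1;    /* B decrements R2 */
    count = r1;     /* A stores R1 back to count */
    count = r2;     /* B stores R2 back, overwriting A's update */
    return count;
}
```

Starting from 3, one increment and one decrement should leave 3, but this sequence returns 2 because A's store is lost.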
Message Passing
Message passing
Data explicitly sent from one process to
another
Sending process performs special operation,
send
Receiving process must perform special
operation, receive, to receive the data
Both operations must explicitly specify which
process it is sending to or receiving from
Receive is blocking, send may or may not be
blocking
void processA() {
   while( 1 ) {
      produce(&data);
      send(B, &data);
      /* region 1 */
      receive(B, &data);
      consume(&data);
   }
}
void processB() {
   while( 1 ) {
      receive(A, &data);
      transform(&data);
      send(A, &data);
      /* region 2 */
   }
}
Safer model, but less flexible

Back to Shared Memory: Mutual Exclusion

Certain sections of code should not be performed concurrently
Critical section
Possibly noncontiguous section of code where simultaneous updates, by multiple
processes to a shared memory location, can occur
When a process enters the critical section, all other processes must be locked
out until it leaves the critical section
Mutex
A shared object used for locking and unlocking segment of shared data
Disallows read/write access to memory it guards
Multiple processes can perform lock operation simultaneously, but only one process
will acquire lock
All other processes trying to obtain lock will be put in blocked state until unlock
operation performed by acquiring process when it exits critical section
These processes will then be placed in runnable state and will compete for lock again

Correct Shared Memory Solution to the Consumer-Producer Problem
The primitive mutex is used to ensure critical sections are executed in mutual exclusion of each other
Following the same execution sequence as before:
A/B execute lock operation on count_mutex
Either A or B will acquire lock
Say B acquires it
A will be put in blocked state
B loads count (count = 3) from memory into register R2 (R2 = 3)
B decrements R2 (R2 = 2)
B stores R2 back to count in memory (count = 2)
B executes unlock operation
A is placed in runnable state again
A loads count (count = 2) from memory into register R1 (R1 = 2)
A increments R1 (R1 = 3)
A stores R1 back to count in memory (count = 3)
Count now has correct value of 3

data_type buffer[N];
int count = 0;
mutex count_mutex;
void processA() {
   int i = 0;
   while( 1 ) {
      produce(&data);
      while( count == N );/*loop*/
      buffer[i] = data;
      i = (i + 1) % N;
      count_mutex.lock();
      count = count + 1;
      count_mutex.unlock();
   }
}
void processB() {
   int i = 0;
   while( 1 ) {
      while( count == 0 );/*loop*/
      data = buffer[i];
      i = (i + 1) % N;
      count_mutex.lock();
      count = count - 1;
      count_mutex.unlock();
      consume(&data);
   }
}
void main() {
   create_process(processA);
   create_process(processB);
}
Process Communication
Try modeling "req" value of our elevator controller
Using shared memory
Using shared memory and mutexes
Using message passing
[Figure: System interface: Unit Control (signals up, down, open, floor, req) and Request Resolver (signals req, b1 b2 ... bN from buttons inside elevator, and up1, up2, dn2, up3, dn3, ..., dnN from up/down buttons on each floor)]
A Common Problem in Concurrent Programming: Deadlock
Deadlock: A condition where 2 or more processes are blocked waiting for the other to unlock critical sections of code
Both processes are then in blocked state
Cannot execute unlock operation so will wait forever
Example code has 2 different critical sections of code that can be accessed simultaneously
2 locks needed (mutex1, mutex2)
Following execution sequence produces deadlock
A executes lock operation on mutex1 (and acquires it)
B executes lock operation on mutex2 (and acquires it)
A/B both execute in critical sections 1 and 2, respectively
A executes lock operation on mutex2
A blocked until B unlocks mutex2
B executes lock operation on mutex1
B blocked until A unlocks mutex1
DEADLOCK!
One deadlock elimination protocol requires locking of numbered mutexes in increasing order and two-phase locking (2PL)
mutex mutex1, mutex2;

void processA() {
   while( 1 ) {
      mutex1.lock();
      /* critical section 1 */
      mutex2.lock();
      /* critical section 2 */
      mutex2.unlock();
      /* critical section 1 */
      mutex1.unlock();
   }
}
void processB() {
   while( 1 ) {
      mutex2.lock();
      /* critical section 2 */
      mutex1.lock();
      /* critical section 1 */
      mutex1.unlock();
      /* critical section 2 */
      mutex2.unlock();
   }
}
Acquire locks in 1st phase only, release locks in 2nd phase
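A minimal sketch of that protocol using POSIX threads, assuming pthreads is available on the target: both workers acquire mutex1 before mutex2 (increasing order) and release only after all acquisitions are done (two phases), so the circular wait from the deadlock example cannot form. The worker and run_ordered_locking names and the shared counters are hypothetical.

```c
#include <pthread.h>

static pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mutex2 = PTHREAD_MUTEX_INITIALIZER;
static long shared1 = 0, shared2 = 0;  /* data guarded by mutex1 / mutex2 */

/* Every worker locks in the same increasing order, then releases. */
static void *worker(void *arg)
{
    int rounds = *(int *)arg;
    for (int k = 0; k < rounds; k++) {
        pthread_mutex_lock(&mutex1);    /* phase 1: acquire, lowest first */
        pthread_mutex_lock(&mutex2);
        shared1++;                      /* critical section 1 */
        shared2++;                      /* critical section 2 */
        pthread_mutex_unlock(&mutex2);  /* phase 2: release only */
        pthread_mutex_unlock(&mutex1);
    }
    return NULL;
}

/* Runs two workers to completion and returns shared1 + shared2. */
long run_ordered_locking(int rounds)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, &rounds);
    pthread_create(&b, NULL, worker, &rounds);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared1 + shared2;
}
```

Swapping the two lock calls in only one of the workers would reintroduce the A/B situation from the slide.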

Synchronization among processes

Sometimes concurrently running processes must synchronize their execution
When a process must wait for:
another process to compute some value
reach a known point in their execution
signal some condition
Recall producer-consumer problem

processA must wait if buffer is full
processB must wait if buffer is empty
This is called busy-waiting
Process executing loops instead of being blocked
CPU time wasted
More efficient methods

Join operation, and blocking send and receive discussed earlier
Both block the process so it doesn't waste CPU time
Condition variables and monitors

Condition variables
Condition variable is an object that has 2 operations, signal and wait
When process performs a wait on a condition variable, the process is blocked
until another process performs a signal on the same condition variable
How is this done?
Process A acquires lock on a mutex
Process A performs wait, passing this mutex
Causes mutex to be unlocked
Process B can now acquire lock on same mutex

Process B enters critical section
Computes some value and/or makes condition true
Process B performs signal when condition true

Causes process A to implicitly reacquire mutex lock
Process A becomes runnable

Condition variable example:

consumer-producer
Consumer-producer using condition variables
2 condition variables
buffer_empty
Signals at least 1 free location available in buffer
buffer_full
Signals at least 1 valid data item in buffer
processA:
produces data item

acquires lock (cs_mutex) for critical section
checks value of count
if count = N, buffer is full
performs wait operation on buffer_empty
this releases the lock on cs_mutex allowing
processB to enter critical section, consume data
item and free location in buffer
processB then performs signal
if count < N, buffer is not full
processA inserts data into buffer
increments count
signals processB making it runnable if it has
performed a wait operation on buffer_full
data_type buffer[N];
int count = 0;
mutex cs_mutex;
condition buffer_empty, buffer_full;
void processA() {
   int i = 0;
   while( 1 ) {
      produce(&data);
      cs_mutex.lock();
      if( count == N ) buffer_empty.wait(cs_mutex);
      buffer[i] = data;
      i = (i + 1) % N;
      count = count + 1;
      cs_mutex.unlock();
      buffer_full.signal();
   }
}
void processB() {
   int i = 0;
   while( 1 ) {
      cs_mutex.lock();
      if( count == 0 ) buffer_full.wait(cs_mutex);
      data = buffer[i];
      i = (i + 1) % N;
      count = count - 1;
      cs_mutex.unlock();
      buffer_empty.signal();
      consume(&data);
   }
}
void main() {
   create_process(processA); create_process(processB);
}
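For comparison, here is a hedged sketch of the same consumer-producer scheme with POSIX threads, assuming pthreads is available. It keeps the slide's names cs_mutex, buffer_empty and buffer_full, but differs in one respect: POSIX condition variables may wake spuriously, so each wait sits in a while loop rather than an if. The buffer size, the put_item/get_item helpers and run_producer_consumer are illustrative assumptions.

```c
#include <pthread.h>

#define N 4  /* buffer capacity (illustrative value) */

static int buffer[N];
static int count = 0, head = 0, tail = 0;
static pthread_mutex_t cs_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t buffer_empty = PTHREAD_COND_INITIALIZER; /* a slot is free */
static pthread_cond_t buffer_full = PTHREAD_COND_INITIALIZER;  /* an item is ready */

void put_item(int data)
{
    pthread_mutex_lock(&cs_mutex);
    while (count == N)                        /* buffer full: block, no busy-wait */
        pthread_cond_wait(&buffer_empty, &cs_mutex);
    buffer[tail] = data;
    tail = (tail + 1) % N;
    count++;
    pthread_mutex_unlock(&cs_mutex);
    pthread_cond_signal(&buffer_full);
}

int get_item(void)
{
    pthread_mutex_lock(&cs_mutex);
    while (count == 0)                        /* buffer empty: block */
        pthread_cond_wait(&buffer_full, &cs_mutex);
    int data = buffer[head];
    head = (head + 1) % N;
    count--;
    pthread_mutex_unlock(&cs_mutex);
    pthread_cond_signal(&buffer_empty);
    return data;
}

static void *producer(void *arg)
{
    int n = *(int *)arg;
    for (int k = 1; k <= n; k++)
        put_item(k);
    return NULL;
}

/* Starts one producer thread, consumes n items on the calling thread,
 * and returns the sum of the consumed items. */
long run_producer_consumer(int n)
{
    pthread_t tid;
    long sum = 0;
    pthread_create(&tid, NULL, producer, &n);
    for (int k = 0; k < n; k++)
        sum += get_item();
    pthread_join(tid, NULL);
    return sum;
}
```

Neither side spins: a blocked thread consumes no CPU time until the other side signals.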

Monitors
Collection of data and methods or subroutines that operate on data, similar to an object-oriented paradigm
Monitor guarantees only 1 process can execute inside monitor at a time
(a) Process X executes while Process Y has to wait
(b) Process X performs wait on a condition
Process Y allowed to enter and execute
(c) Process Y signals condition Process X waiting on
Process Y blocked
Process X allowed to continue executing
(d) Process X finishes executing in monitor or waits on a condition again
Process Y made runnable again
[Figure: panels (a) to (d), each showing a monitor with DATA and CODE, a waiting area, and the positions of Process X and Process Y]
Monitor example: consumer-producer
Single monitor encapsulates both

processes along with buffer and count
One process will be allowed to begin
executing first
If processB allowed to execute first
Will execute until it finds count = 0

Will perform wait on buffer_full condition
variable
processA now allowed to enter monitor and
execute
processA produces data item
finds count < N so writes to buffer and
increments count
processA performs signal on buffer_full
condition variable
processA blocked
processB reenters monitor and continues
execution, consumes data, etc.
Monitor {
   data_type buffer[N];
   int count = 0;
   condition buffer_full, buffer_empty;
   void processA() {
      int i = 0;
      while( 1 ) {
         produce(&data);
         if( count == N ) buffer_empty.wait();
         buffer[i] = data;
         i = (i + 1) % N;
         count = count + 1;
         buffer_full.signal();
      }
   }
   void processB() {
      int i = 0;
      while( 1 ) {
         if( count == 0 ) buffer_full.wait();
         data = buffer[i];
         i = (i + 1) % N;
         count = count - 1;
         buffer_empty.signal();
         consume(&data);
      }
   }
} /* end monitor */
void main() {
   create_process(processA); create_process(processB);
}

Implementation
Mapping of system's functionality onto hardware processors:
captured using computational model(s)
written in some language(s)
Implementation choice independent from language(s) choice
Implementation choice based on power, size, performance, timing and cost requirements
Final implementation tested for feasibility
Also serves as blueprint/prototype for mass manufacturing of final product
[Figure: three layers. Computational models: state machine, sequential program, dataflow, concurrent processes. Languages: Pascal, C/C++, Java, VHDL. Implementations: A, B, C]
The choice of computational model(s) is based on whether it allows the designer to describe the system.
The choice of language(s) is based on whether it captures the computational model(s) used by the designer.
The choice of implementation is based on whether it meets power, size, performance and cost requirements.
Concurrent process model: implementation
Can use single-purpose and/or general-purpose processors
(a) Multiple processors, each executing one process
True multitasking (parallel processing)
General-purpose processors
Use programming language like C and compile to instructions of processor
Expensive and in most cases not necessary
Custom single-purpose processors
More common
(b) One general-purpose processor running all processes
Most processes don't use 100% of processor time
Can share processor time and still achieve necessary execution rates
(c) Combination of (a) and (b)
Multiple processes run on one general-purpose processor while one or more processes run on own single-purpose processor
[Figure: (a) Process1 to Process4 on Processors A to D, connected by a communication bus; (b) Process1 to Process4 on one general-purpose processor; (c) some processes on their own processor, the others sharing a general-purpose processor, connected by a communication bus]
Implementation:
multiple processes sharing single processor
Can manually rewrite processes as a single sequential program
Ok for simple examples, but extremely difficult for complex examples

Automated techniques have evolved but not common
E.g., simple Hello World concurrent program from before would look like:
I = 1; T = 0;
while (1) {
   Delay(I); T = T + 1;
   if T modulo X is 0 then call PrintHelloWorld
   if T modulo Y is 0 then call PrintHowAreYou
}
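The manual rewrite can be sketched in C as below. To keep it testable, the sketch records events in a log instead of printing and delaying, and it checks whether the elapsed time T is a multiple of each period (T modulo X, T modulo Y). The function name interleave and the 'H'/'A' encoding are assumptions made for illustration.

```c
#include <stddef.h>

/* Simulates `ticks` one-second steps of the merged loop. Appends 'H' for
 * each PrintHelloWorld call and 'A' for each PrintHowAreYou call, in the
 * order they would occur. Writes at most cap - 1 events plus a
 * terminating NUL and returns the number of events written. */
size_t interleave(unsigned x, unsigned y, unsigned ticks,
                  char *log, size_t cap)
{
    size_t n = 0;
    if (cap == 0)
        return 0;
    for (unsigned t = 1; t <= ticks; t++) {
        if (x != 0 && t % x == 0 && n + 1 < cap)
            log[n++] = 'H';   /* "Hello world." is due */
        if (y != 0 && t % y == 0 && n + 1 < cap)
            log[n++] = 'A';   /* "How are you?" is due */
    }
    log[n] = '\0';
    return n;
}
```

Even for two trivial processes the scheduling logic is now mixed into the program, which is why this approach becomes hard to maintain for larger systems.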
Can use multitasking operating system
Much more common

Operating system schedules processes, allocates storage, and interfaces to peripherals, etc.
Real-time operating system (RTOS) can guarantee execution rate constraints are met
Describe concurrent processes with languages having built-in processes (Java, Ada, etc.) or a sequential
programming language with library support for concurrent processes (C, C++, etc. using POSIX threads
for example)
Can convert processes to sequential program with process scheduling right in code
Less overhead (no operating system)

More complex/harder to maintain

Processes vs. threads
Different meanings in operating system terminology
Regular processes
Heavyweight process
Own virtual address space (stack, data, code)
System resources (e.g., open files)
Threads
Lightweight process
Subprocess within process
Only program counter, stack, and registers
Shares address space, system resources with other threads
Allows quicker communication between threads
Small compared to heavyweight processes

Can be created quickly
Low cost switching between threads

Implementation:
suspending, resuming, and joining
Multiple processes mapped to single-purpose processors
Built into processor's implementation
Could be extra input signal that is asserted when process suspended
Additional logic needed for determining process completion
Extra output signals indicating process done
Multiple processes mapped to single general-purpose processor

Built into programming language or special multitasking library like POSIX
Language or library may rely on operating system to handle

Implementation: process scheduling

Must meet timing requirements when multiple concurrent processes
implemented on single general-purpose processor
Not true multitasking
Scheduler
Special process that decides when and for how long each process is executed
Implemented as preemptive or nonpreemptive scheduler
Preemptive
Determines how long a process executes before preempting to allow another process
to execute
Time quantum: predetermined amount of execution time preemptive scheduler allows each
process (may be 10 to 100s of milliseconds long)
Determines which process will be next to run
Nonpreemptive
Only determines which process is next after current process finishes execution
Scheduling: priority
Process with highest priority always selected first by scheduler
Typically determined statically during creation and dynamically during
execution
FIFO
Runnable processes added to end of FIFO as created or become runnable
Front process removed from FIFO when time quantum of current process is up
or process is blocked
Priority queue
Runnable processes again added as created or become runnable
Process with highest priority chosen when new process needed
If multiple processes with same highest priority value then selects from them
using first-come first-served
Called priority scheduling when nonpreemptive
Called round-robin when preemptive

Priority assignment
Period of process
Repeating time interval the process must complete one execution within
E.g., period = 100 ms
Process must execute once every 100 ms
Usually determined by the description of the system
E.g., refresh rate of display is 27 times/sec
Period = 37 ms
Execution deadline
Amount of time process must be completed by after it has started
E.g., execution time = 5 ms, deadline = 20 ms, period = 100 ms
Process must complete execution within 20 ms after it has begun regardless of its period
Process begins at start of period, runs for 4 ms then is preempted
Process suspended for 14 ms, then runs for the remaining 1 ms
Completed within 4 + 14 + 1 = 19 ms which meets deadline of 20 ms
Without deadline process could be suspended for much longer
Rate monotonic scheduling
Processes with shorter periods have higher priority
Typically used when execution deadline = period
Deadline monotonic scheduling
Processes with shorter deadlines have higher priority
Typically used when execution deadline < period

Rate monotonic
Process   Period    Priority
A         25 ms     5
B         50 ms     3
C         12 ms     6
D         100 ms    1
E         40 ms     4
F         75 ms     2

Deadline monotonic
Process   Deadline  Priority
G         17 ms     5
H         50 ms     2
I         32 ms     3
J         10 ms     6
K         140 ms    1
L         32 ms     4
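Both assignment rules reduce to ranking processes by an interval, the period for rate monotonic and the deadline for deadline monotonic. The sketch below is one way to compute such priorities, where a larger number means higher priority as in the slide's tables. The function name is hypothetical, and the tie-break (the later entry wins) is an arbitrary assumption that happens to reproduce the priorities shown for processes I and L.

```c
#include <stddef.h>

/* Assigns priorities n..1 so that a shorter interval (period or deadline)
 * gets a larger priority number. Equal intervals are ordered by position:
 * the later array entry receives the higher priority. */
void assign_monotonic_priorities(const unsigned interval_ms[],
                                 int priority[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int rank = (int)n;
        for (size_t j = 0; j < n; j++) {
            if (interval_ms[j] < interval_ms[i] ||
                (interval_ms[j] == interval_ms[i] && j > i))
                rank--;   /* process j outranks process i */
        }
        priority[i] = rank;
    }
}
```

Called with the periods 25, 50, 12, 100, 40, 75 it yields 5, 3, 6, 1, 4, 2, and with the deadlines 17, 50, 32, 10, 140, 32 it yields 5, 2, 3, 6, 1, 4.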

Real-time systems
Systems composed of 2 or more cooperating, concurrent processes with
stringent execution time constraints
E.g., set-top boxes have separate processes that read or decode video and/or
sound concurrently and must decode 20 frames/sec for output to appear
continuous
Other examples with stringent time constraints are:
digital cell phones

navigation and process control systems
assembly line monitoring systems
multimedia and networking systems
etc.
Communication and synchronization between processes for these systems is

critical
Therefore, concurrent process model best suited for describing these systems
Real-time operating systems (RTOS)
Provide mechanisms, primitives, and guidelines for building real-time embedded systems
Windows CE
Built specifically for embedded systems and appliance market

Scalable real-time 32-bit platform
Supports Windows API
Perfect for systems designed to interface with Internet
Preemptive priority scheduling with 256 priority levels per process
Kernel is 400 Kbytes
QNX
Real-time microkernel surrounded by optional processes (resource managers) that provide POSIX and
UNIX compatibility
Microkernels typically support only the most basic services

Optional resource managers allow scalability from small ROM-based systems to huge multiprocessor systems
connected by various networking and communication technologies
Preemptive process scheduling using FIFO, round-robin, adaptive, or priority-driven scheduling

32 priority levels per process
Microkernel < 10 Kbytes and complies with POSIX real-time standard

Summary
Computation models are distinct from languages
Sequential program model is popular
Most common languages like C support it directly
State machine models good for control

Extensions like HCFSM provide additional power
PSM combines state machines and sequential programs
Concurrent process model for multi-task systems

Communication and synchronization methods exist
Scheduling is critical
Dataflow model good for signal processing

Chapter 2: Custom single-purpose processors
Outline
Introduction
Combinational logic
Sequential logic
Custom single-purpose processor design
RT-level custom single-purpose processor design

Introduction
Processor
Digital circuit that performs a computation task
Controller and datapath
General-purpose: variety of computation tasks
Single-purpose: one particular computation task
Custom single-purpose: non-standard task
A custom single-purpose processor may be
Fast, small, low power
But, high NRE, longer time-to-market, less flexible
[Figure: Digital camera chip: lens, CCD, CCD preprocessor, A2D, D2A, JPEG codec, Pixel coprocessor, Microcontroller, Multiplier/Accum, DMA controller, Display ctrl, Memory controller, ISA bus interface, UART, LCD ctrl]

CMOS transistor on silicon
Transistor
The basic electrical component in digital systems
Acts as an on/off switch
Voltage at "gate" controls whether current flows from source to drain
Don't confuse this "gate" with a logic gate
[Figure: IC package and IC; transistor cross-section with source, gate, oxide, channel and drain on a silicon substrate; conducts if gate at 1]
CMOS transistor implementations
Complementary Metal Oxide Semiconductor
We refer to logic levels
Typically 0 is 0V, 1 is 5V or less
Two basic CMOS types
nMOS conducts if gate at 1
pMOS conducts if gate at 0
Hence "complementary"
Basic gates
Inverter, NAND, NOR
[Figure: nMOS and pMOS transistor symbols; transistor-level circuits for the inverter (F = x'), NAND gate (F = (xy)') and NOR gate (F = (x+y)')]
Basic logic gates

Gate       Function
Driver     F = x
Inverter   F = x'
AND        F = xy
NAND       F = (xy)'
OR         F = x+y
NOR        F = (x+y)'
XOR        F = x XOR y
XNOR       F = (x XOR y)'

x   Driver  Inverter
0   0       1
1   1       0

x y   AND  NAND  OR  NOR  XOR  XNOR
0 0   0    1     0   1    0    1
0 1   0    1     1   0    1    0
1 0   0    1     1   0    1    0
1 1   1    0     1   0    0    1
Combinational logic design

A) Problem description
y is 1 if a is 1, or b and c are 1. z is 1 if b or c is 1, but not both, or if all are 1.

B) Truth table
Inputs    Outputs
a b c     y z
0 0 0     0 0
0 0 1     0 1
0 1 0     0 1
0 1 1     1 0
1 0 0     1 0
1 0 1     1 1
1 1 0     1 1
1 1 1     1 1

C) Output equations
y = a'bc + ab'c' + ab'c + abc' + abc
z = a'b'c + a'bc' + ab'c + abc' + abc

D) Minimized output equations
y      bc = 00  01  11  10
a = 0       0   0   1   0
a = 1       1   1   1   1
y = a + bc
z      bc = 00  01  11  10
a = 0       0   1   0   1
a = 1       0   1   1   1
z = ab + b'c + bc'

E) Logic Gates (random logic)
[Figure: gate network with inputs a, b, c and outputs y, z]
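A quick way to gain confidence in a minimization is to compare it against the unminimized sum of minterms over all input rows. The sketch below does this for y = a + bc and z = ab + b'c + bc'; the function names are hypothetical.

```c
#include <stdbool.h>

/* Minimized equations from step D. */
bool out_y(bool a, bool b, bool c) { return a || (b && c); }
bool out_z(bool a, bool b, bool c) { return (a && b) || (!b && c) || (b && !c); }

/* Sum-of-minterms equations from step C. */
static bool y_minterms(bool a, bool b, bool c)
{
    return (!a && b && c) || (a && !b && !c) || (a && !b && c) ||
           (a && b && !c) || (a && b && c);
}

static bool z_minterms(bool a, bool b, bool c)
{
    return (!a && !b && c) || (!a && b && !c) || (a && !b && c) ||
           (a && b && !c) || (a && b && c);
}

/* Returns true if both forms agree on all 8 rows of the truth table. */
bool equations_agree(void)
{
    for (int row = 0; row < 8; row++) {
        bool a = row & 4, b = row & 2, c = row & 1;
        if (out_y(a, b, c) != y_minterms(a, b, c)) return false;
        if (out_z(a, b, c) != z_minterms(a, b, c)) return false;
    }
    return true;
}
```

With only three inputs an exhaustive check costs eight evaluations, so there is little reason to skip it.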

Combinational components

Multiplexor (n-bit, m x 1)
Inputs I(m-1) ... I1 I0, select S(log m) ... S0, output O
O = I0 if S=0..00
    I1 if S=0..01
    ...
    I(m-1) if S=1..11

Decoder (log n x n)
Inputs I(log n -1) ... I0, outputs O(n-1) ... O1 O0
O0 = 1 if I=0..00
O1 = 1 if I=0..01
...
O(n-1) = 1 if I=1..11
With enable input e: all O's are 0 if e=0

Adder (n-bit)
Inputs A, B, outputs carry, sum
sum = A+B (first n bits)
carry = (n+1)'th bit of A+B
With carry-in input Ci: sum = A + B + Ci

Comparator (n-bit)
Inputs A, B, outputs less, equal, greater
less = 1 if A<B
equal = 1 if A=B
greater = 1 if A>B

ALU (n bit, m function)
Inputs A, B, select S(log m) ... S0, output O
O = A op B
op determined by S.
May have status outputs carry, zero, etc.

Sequential components

(storage) Register (n-bit)
Inputs I, load, clear, output Q
Q = 0 if clear=1,
    I if load=1 and clock=1,
    Q(previous) otherwise.

Shift register (n-bit)
Inputs I, shift, output Q
Q = lsb
- Content shifted
- I stored in msb

Counter (n-bit)
Q = 0 if clear=1,
    Q(prev)+1 if count=1 and clock=1.

Sequential logic design

A) Problem Description
You want to construct a clock divider. Slow down your pre-existing clock so that you output a 1 for every four clock cycles

B) State Diagram
[Figure: four states 0 to 3; a=1 advances to the next state, a=0 stays in the current state; x=0 in states 0, 1 and 2, x=1 in state 3]

C) Implementation Model
[Figure: combinational logic with inputs a, Q1, Q0 and outputs x, I1, I0; state register loading I1, I0 and driving Q1, Q0]

D) State Table (Moore-type)
Inputs      Outputs
Q1 Q0 a     I1 I0 x
0  0  0     0  0  0
0  0  1     0  1  0
0  1  0     0  1  0
0  1  1     1  0  0
1  0  0     1  0  0
1  0  1     1  1  0
1  1  0     1  1  1
1  1  1     0  0  1

Given this implementation model
Sequential logic design quickly reduces to combinational logic design
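Sequential logic design (cont.)

E) Minimized Output Equations
I1     Q1Q0 = 00  01  11  10
a = 0         0   0   1   1
a = 1         0   1   0   1
I1 = Q1'Q0a + Q1a' + Q1Q0'
I0     Q1Q0 = 00  01  11  10
a = 0         0   1   1   0
a = 1         1   0   0   1
I0 = Q0a' + Q0'a
x      Q1Q0 = 00  01  11  10
a = 0         0   0   1   0
a = 1         0   0   1   0
x = Q1Q0

F) Combinational Logic (random logic)
[Figure: gate network with inputs a, Q1, Q0 and outputs x, I1, I0]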
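The clock-divider equations can be checked by simulating the state register in software. The sketch below uses I1 = Q1'Q0a + Q1a' + Q1Q0', I0 = Q0a' + Q0'a and x = Q1Q0, as implied by the state table; the type and function names are assumptions for illustration.

```c
#include <stdbool.h>

/* Contents of the 2-bit state register. */
typedef struct { bool q1, q0; } divider_state;

/* Next-state logic: computes I1 and I0 from Q1, Q0 and input a. */
divider_state divider_next(divider_state s, bool a)
{
    divider_state n;
    n.q1 = (!s.q1 && s.q0 && a) || (s.q1 && !a) || (s.q1 && !s.q0);
    n.q0 = (s.q0 && !a) || (!s.q0 && a);
    return n;
}

/* Moore output: x depends on the current state only. */
bool divider_output(divider_state s) { return s.q1 && s.q0; }

/* Clocks the divider `cycles` times with a held at 1, starting from
 * state 0, and counts the cycles in which x is 1. */
int count_pulses(int cycles)
{
    divider_state s = { false, false };
    int pulses = 0;
    for (int i = 0; i < cycles; i++) {
        if (divider_output(s))
            pulses++;
        s = divider_next(s, true);
    }
    return pulses;
}
```

With a held at 1 the state cycles 0, 1, 2, 3, 0, ..., so x is 1 in exactly one of every four cycles; with a at 0 the state holds.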

Custom single-purpose processor basic model
[Figure: controller and datapath. Controller: external control inputs, external control outputs. Datapath: external data inputs, external data outputs. Linked by datapath control inputs and datapath control outputs]
[Figure: a view inside the controller and datapath. Controller: next-state and control logic, state register. Datapath: registers, functional units]
Example: greatest common divisor
First create algorithm
Convert algorithm to "complex" state machine
Known as FSMD: finite-state machine with datapath
Can use templates to perform such conversion
(a) black-box view
[Figure: GCD block with inputs go_i, x_i, y_i and output d_o]
(b) alg. specification
0: int x, y;
1: while (1) {
2:    while (!go_i);
3:    x = x_i;
4:    y = y_i;
5:    while (x != y) {
6:       if (x < y)
7:          y = y - x;
         else
8:          x = x - y;
      }
9:    d_o = x;
   }
(c) state diagram
[Figure: FSMD with states 1, 2, 2-J, 3 (x = x_i), 4 (y = y_i), 5, 6, 7 (y = y - x), 8 (x = x - y), 6-J, 5-J, 9 (d_o = x), 1-J; transitions labeled 1 / !1, !go_i / !(!go_i), x!=y / !(x!=y), x<y / !(x<y)]
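To see how the algorithm maps onto states, the FSMD can be simulated with an explicit state variable. In this sketch the state names follow the algorithm's line numbers, the join states (2-J, 5-J, 6-J, 1-J) are folded into their successors, and go_i is held at 1 for a single computation. The names gcd_state and gcd_fsmd are assumptions for illustration.

```c
/* One state per numbered algorithm line that does work or branches. */
typedef enum { S2_WAIT_GO, S3_LOAD_X, S4_LOAD_Y, S5_LOOP_TEST,
               S6_COMPARE, S7_SUB_Y, S8_SUB_X, S9_OUTPUT } gcd_state;

/* Runs the FSMD for one computation and returns d_o.
 * Both inputs must be positive: as in the slide's algorithm, a zero
 * paired with a nonzero value would loop forever. */
unsigned gcd_fsmd(unsigned x_i, unsigned y_i)
{
    const int go_i = 1;
    unsigned x = 0, y = 0, d_o = 0;
    gcd_state state = S2_WAIT_GO;
    int done = 0;

    while (!done) {
        switch (state) {
        case S2_WAIT_GO:   state = go_i ? S3_LOAD_X : S2_WAIT_GO;     break;
        case S3_LOAD_X:    x = x_i; state = S4_LOAD_Y;                break;
        case S4_LOAD_Y:    y = y_i; state = S5_LOOP_TEST;             break;
        case S5_LOOP_TEST: state = (x != y) ? S6_COMPARE : S9_OUTPUT; break;
        case S6_COMPARE:   state = (x < y) ? S7_SUB_Y : S8_SUB_X;     break;
        case S7_SUB_Y:     y = y - x; state = S5_LOOP_TEST;           break;
        case S8_SUB_X:     x = x - y; state = S5_LOOP_TEST;           break;
        case S9_OUTPUT:    d_o = x; done = 1;                         break;
        }
    }
    return d_o;
}
```

Each case corresponds to one clock cycle of the eventual hardware: register loads become datapath actions and the conditions become controller inputs.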

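State diagram templates

Assignment statement
a = b
next statement
[Template: state "a = b" followed by next statement]

Loop statement
while (cond) {
   loop-body-statements
}
next statement
[Template: condition state C; on cond go to loop-body-statements, then join state J and back to C; on !cond go to next statement]

Branch statement
if (c1)
   c1 stmts
else if c2
   c2 stmts
else
   other stmts
next statement
[Template: condition state C; on c1 go to c1 stmts, on !c1*c2 go to c2 stmts, on !c1*!c2 go to others; all paths meet at join state J, then next statement]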
Creating the datapath
Create a register for any declared variable
Create a functional unit for each arithmetic operation
Connect the ports, registers and functional units
Based on reads and writes
Use multiplexors for multiple sources
Create unique identifier for each datapath component control input and output
[Figure: FSMD state diagram of the GCD example]
[Figure: Datapath. Inputs x_i and y_i feed n-bit 2x1 multiplexors (selects x_sel, y_sel) in front of registers 0: x and 0: y (loads x_ld, y_ld). Functional units: != comparator (5: x!=y, output x_neq_y), < comparator (6: x<y, output x_lt_y), subtractor (8: x-y), subtractor (7: y-x). Register 9: d (load d_ld) drives output d_o]

Creating the controller's FSM

Same structure as FSMD
Replace complex actions/conditions with datapath configurations

Controller states (encoding, FSMD state, datapath configuration or transition condition):
0000  1:
0001  2:    !go_i -> 2-J:, !(!go_i) -> 3:
0010  2-J:
0011  3:    x_sel = 0, x_ld = 1
0100  4:    y_sel = 0, y_ld = 1
0101  5:    x_neq_y -> 6:, !x_neq_y -> 9:
0110  6:    x_lt_y -> 7:, !x_lt_y -> 8:
0111  7:    y_sel = 1, y_ld = 1
1000  8:    x_sel = 1, x_ld = 1
1001  6-J:
1010  5-J:
1011  9:    d_ld = 1
1100  1-J:

[Figure: the FSMD, the controller FSM with the encodings above, and the datapath (two n-bit 2x1 muxes, registers x, y and d, the != and < comparators, and two subtractors)]
Splitting into a controller and datapath

go_i
Controller
Controller implementation model
0000
go_i
!1
x_i
1:
1
x_sel
Combinational
logic
0001
y_sel
(b) Datapath
2:
x_sel
!go_i
x_ld
0010 2-J:
y_ld
x_neq_y
0011
x_lt_y
d_ld
0100
x_ld
x_sel = 0
3: x_ld = 1
5:
0110
6:
I1
5: x!=y
x_neq_y
x_neq_y=1
x_lt_y=1
7: y_sel = 1
y_ld = 1
I0
0: x
0: y
!=
x_neq_y=0
subtractor
8: x-y
subtractor
7: y-x
9: d
d_ld
x_lt_y=0
8: x_sel = 1
x_ld = 1
0111
<
6: x<y
x_lt_y
State register
I2
n-bit 2x1
y_ld
y_sel = 0
4: y_ld = 1
0101
n-bit 2x1
y_sel
Q3 Q2 Q1 Q0
I3
y_i
!(!go_i)
d_o
1000
1001 6-J:
1010 5-J:
1011
9:
d_ld = 1
1100 1-J:

Controller state table for the GCD example

Inputs
Q3
Q2
Q1
Q0
Outputs
x_lt_
y
*
go_i
I3
I2
I1
I0
x_sel
y_sel
x_ld
y_ld
d_ld
x_neq
_y
*

Completing the GCD custom single-purpose

processor design
We finished the datapath
We have a state table for
the next state and control
logic
controller
datapath
next-state
and
control
logic
registers
state
register
functional
units
All that's left is combinational logic design
This is not an optimized design, but we see the basic steps
a view inside the controller and datapath

RT-level custom single-purpose processor design

We often start with a state machine
   Rather than algorithm
   Cycle timing often too central to functionality
Example
   Bus bridge that converts 4-bit bus to 8-bit bus
   Start with FSMD
   Known as register-transfer (RT) level
   Exercise: complete the design

Problem Specification
   [Figure: Sender -> Bridge -> Receiver; clock, rdy_in and data_in(4) enter the bridge, rdy_out and data_out(8) leave it]
   Bridge: A single-purpose processor that converts two 4-bit inputs, arriving one at a time over data_in along with a rdy_in pulse, into one 8-bit output on data_out along with a rdy_out pulse.

FSMD
   Inputs:    rdy_in: bit; data_in: bit[4];
   Outputs:   rdy_out: bit; data_out: bit[8]
   Variables: data_lo, data_hi: bit[4];

   State            Action                        Transitions
   WaitFirst4                                     rdy_in=0: stay; rdy_in=1: RecFirst4Start
   RecFirst4Start   data_lo=data_in               RecFirst4End
   RecFirst4End                                   rdy_in=1: stay; rdy_in=0: WaitSecond4
   WaitSecond4                                    rdy_in=0: stay; rdy_in=1: RecSecond4Start
   RecSecond4Start  data_hi=data_in               RecSecond4End
   RecSecond4End                                  rdy_in=1: stay; rdy_in=0: Send8Start
   Send8Start       data_out=data_hi & data_lo    Send8End
                    rdy_out=1
   Send8End         rdy_out=0                     WaitFirst4
RT-level custom single-purpose processor

design (cont)
Bridge
(a) Controller
rdy_in=0
WaitFirst4
rdy_in=0
WaitSecond4
Send8Start
data_out_ld=1
rdy_out=1
rdy_in=1
rdy_in=1
RecFirst4Start
data_lo_ld=1
rdy_in=0
rdy_in=1
RecSecond4Start
data_hi_ld=1
RecFirst4End
rdy_in=1
RecSecond4End
Send8End
rdy_out=0
rdy_in
rdy_out
clk
data_out
data_hi
data_lo
data_lo_ld
data_out_ld
data_hi_ld
to all
registers
data_in(4)
data_out
(b) Datapath

Optimizing custom single-purpose processors

Optimization is the task of making design metric
values the best possible
Optimization opportunities
original program
FSMD
datapath
FSM

Optimizing the original program

Analyze program attributes and look for areas of
possible improvement
number of computations
size of variable
time and space complexity
operations used
multiplication and division very expensive

Optimizing the original program (cont)

original program
0: int x, y;
1: while (1) {
2:    while (!go_i);
3:    x = x_i;
4:    y = y_i;
5:    while (x != y) {
6:       if (x < y)
7:          y = y - x;
         else
8:          x = x - y;
      }
9:    d_o = x;
   }

replace the subtraction operation(s) with modulo operation in order to speed up program

optimized program
0: int x, y, r;
1: while (1) {
2:    while (!go_i);
      // x must be the larger number
3:    if (x_i >= y_i) {
4:       x = x_i;
5:       y = y_i;
      }
6:    else {
7:       x = y_i;
8:       y = x_i;
      }
9:    while (y != 0) {
10:      r = x % y;
11:      x = y;
12:      y = r;
      }
13:   d_o = x;
   }
Original program: GCD(42, 8) takes 9 iterations to complete the loop.
x and y values evaluated as follows: (42, 8), (34, 8), (26, 8), (18, 8), (10, 8), (2, 8), (2, 6), (2, 4), (2, 2).
Optimized program: GCD(42, 8) takes 3 iterations to complete the loop.
x and y values evaluated as follows: (42, 8), (8, 2), (2, 0).

Optimizing the FSMD

Areas of possible improvements
merge states
states with constants on transitions can be eliminated, transition
taken is already known
states with independent operations can be merged
separate states
states which require complex operations (a*b*c*d) can be broken
into smaller states to reduce hardware size
scheduling

Optimizing the FSMD (cont.)

From the original FSMD to the optimized FSMD:
eliminate state 1: transitions have constant values
merge state 2 and state 2J: no loop operation in between them
merge state 3 and state 4: assignment operations are independent of one another
merge state 5 and state 6: transitions from state 6 can be done in state 5
eliminate state 5J and 6J: transitions from each state can be done from state 7 and state 8, respectively
eliminate state 1-J: transition from state 1-J can be done directly from state 9

optimized FSMD (int x, y;)
2:  !go_i: stay in 2:; go_i: go to 3:
3:  x = x_i, y = y_i; go to 5:
5:  x<y: go to 7:; x>y: go to 8:; !(x!=y): go to 9:
7:  y = y - x; go to 5:
8:  x = x - y; go to 5:
9:  d_o = x; go to 2:

[Figure: original FSMD (states 1: through 1-J:) beside the optimized FSMD]

Optimizing the datapath

Sharing of functional units
one-to-one mapping, as done previously, is not necessary
if same operation occurs in different states, they can share a
single functional unit
Multi-functional units
An ALU supports a variety of operations, so it can be shared
among operations occurring in different states

Optimizing the FSM

State encoding
task of assigning a unique bit pattern to each state in an FSM
size of state register and combinational logic vary
can be treated as an ordering problem
State minimization
task of merging equivalent states into a single state
two states are equivalent if, for all possible input combinations, they
generate the same outputs and transition to the same next state

Summary
Straightforward design techniques

Can be built to execute algorithms
Typically start with FSMD
CAD tools can be of great assistance


Chapter 3 Instruction-Set Processors:

Software
Introduction
Instruction-Set Processor
Processor designed for a variety of computation tasks
General-Purpose Processor (GPP)
Application-Specific Processor (ASIP): optimized for a specific subset of tasks
Low unit cost because NRE is spread over large numbers of units
Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
Carefully designed since higher NRE is acceptable

Can yield good performance, size and power
System implementations designed with low NRE cost, short time-to-market/prototype, high flexibility
User just writes software; no processor design
Terms "microprocessor", "microcontroller" or "micro" adopted when they were finally implemented on one or few chips

Basic Architecture
Control unit and
datapath
Processor
Control unit
Note similarity to
single-purpose
processor
Datapath
ALU
Controller
Control
/Status
Registers
Key differences
Datapath is general
Control unit doesn't store the algorithm; the algorithm is programmed into the memory
E
PC
IR
I/O
Memory
Datapath Operations
Load
Processor
Read memory location

into register
Control unit
Datapath
ALU
ALU operation
Controller
+1
Control
/Status
Input certain registers

through ALU, store
back in register
Registers
Store
10
Write register to
memory location
PC
11
IR
I/O
...
Memory
10
11
...
Control Unit
Control unit: configures the datapath operations
   Sequence of desired operations (instructions) stored in memory: the program
Instruction cycle broken into several sub-operations, each one clock cycle, e.g.:
   Fetch: Get next instruction into IR
   Decode: Determine what the instruction means
   Fetch operands: Move data from memory to datapath register
   Execute: Move data through the ALU
   Store results: Write data from register to memory

[Figure: processor (control unit with controller, PC and IR; datapath with ALU, registers R0 and R1, control/status) connected to I/O and memory. Memory holds the program
   100  load R0, M[500]
   101  inc R1, R0
   102  store M[501], R1
and the data value 10 at address 500.]
Control Unit Sub-Operations

Fetch
   Get next instruction into IR
   PC: program counter, always points to next instruction
   IR: holds the fetched instruction
Decode
   Determine what the instruction means
Fetch operands
   Move data from memory to datapath register
Execute
   Move data through the ALU
   This particular instruction does nothing during this sub-operation
Store results
   Write data from register to memory
   This particular instruction does nothing during this sub-operation

[Figure, repeated for each sub-operation: the processor with PC = 100 and IR = load R0, M[500]; memory holds the three-instruction program at addresses 100 to 102 and the value 10 at address 500; from the fetch-operands step onward the register file holds 10 in R0]
Instruction Cycles

Each instruction cycle consists of Fetch, Decode, Fetch ops, Exec. and Store results, one clock (clk) cycle each.
   PC=100: load R0, M[500]     R0 = 10
   PC=101: inc R1, R0          R1 = 11 (the ALU adds 1)
   PC=102: store M[501], R1    M[501] = 11

[Figure: timing diagram of the three consecutive instruction cycles next to the processor and memory contents after each instruction]
Architectural Considerations
N-bit processor
N-bit ALU, registers, buses, memory data interface
Embedded: 8-bit, 16-bit, 32-bit common
Desktop/servers: 32-bit, even 64-bit
Processor
Control unit
Datapath
ALU
Controller
Control
/Status
Registers
PC
IR
PC size determines
address space
I/O
Memory

14
Architectural Considerations
Clock frequency
Inverse of clock period
Clock period must be longer than the longest register-to-register delay in the entire processor
Memory access is often the longest
Processor
Control unit
Datapath
ALU
Controller
Control
/Status
Registers
PC
IR
I/O
Memory

15
Pipelining: Increasing Instruction Throughput

[Figure: non-pipelined versus pipelined dish cleaning (Wash and Dry stages over time), and pipelined instruction execution (Fetch-instr., Decode, Fetch ops., Execute, Store res. stages over time for successive instructions)]
16
Superscalar and VLIW Architectures

Performance can be improved by:
Faster clock (but there's a limit)
Pipelining: slice up instruction into stages, overlap stages
Multiple ALUs to support more than one instruction stream
Superscalar
Scalar: non-vector operations
Fetches instructions in batches, executes as many as possible
May require extensive hardware to detect independent instructions
VLIW: each word in memory has multiple independent instructions
Relies on the compiler to detect and schedule instructions
Currently growing in popularity

Two Memory Architectures
Princeton (Von Neumann)
   Fewer memory wires
Harvard
   Simultaneous program and data memory access
   (microcontrollers)

[Figure: Harvard: processor with separate program memory and data memory; Princeton: processor with a single memory (program and data)]

18
Cache Memory
Memory access may be slow
Cache is small but fast
memory close to processor
Holds copy of part of memory
Hits and misses
Fast/expensive technology, usually on

the same chip
Processor
Program Cache
Data Cache
Memory
Slower/cheaper technology, usually on

a different chip

19
Programmer's View
Programmer doesn't need detailed understanding of architecture
Instead, needs to know what instructions can be executed
Two levels of instructions:

Assembly level
Structured languages (C, C++, Java, etc.)
Most development today done using structured languages

But, some assembly level programming may still be necessary
Drivers: portion of program that communicates with and/or controls
(drives) another device
Often have detailed timing considerations, extensive bit manipulation
Assembly level may be best for these

20
Assembly-Level Instructions
Instruction 1
opcode
operand1
operand2
Instruction 2
opcode
operand1
operand2
Instruction 3
opcode
operand1
operand2
Instruction 4
opcode
operand1
operand2
...
Instruction Set
Defines the legal set of instructions for that processor
Data transfer: memory/register, register/register, I/O, etc.
Arithmetic/logical: move register through ALU and back
Branches: determine next PC value when not just PC+1
21
A Simple (Trivial) Instruction Set

Assembly instruct.   First byte    Second byte   Operation
MOV Rn, direct       0000  Rn      direct        Rn = M(direct)
MOV direct, Rn       0001  Rn      direct        M(direct) = Rn
MOV @Rn, Rm          0010  Rn      Rm            M(Rn) = Rm
MOV Rn, #immed.      0011  Rn      immediate     Rn = immediate
ADD Rn, Rm           0100  Rn      Rm            Rn = Rn + Rm
SUB Rn, Rm           0101  Rn      Rm            Rn = Rn - Rm
JZ Rn, relative      0110  Rn      relative      PC = PC + relative (only if Rn is 0)

The first four bits of the first byte are the opcode; the remaining fields are the operands.

22
Addressing Modes
Addressing mode     Operand field      Register-file contents   Memory contents
Immediate           Data
Register-direct     Register address   Data
Register indirect   Register address   Memory address           Data
Direct              Memory address                              Data
Indirect            Memory address                              Memory address, then Data

23
Sample Programs
C program
   int total = 0;
   for (int i=10; i!=0; i--)
      total += i;
   // next instructions...

Equivalent assembly program
0        MOV R0, #0;     // total = 0
1        MOV R1, #10;    // i = 10
2        MOV R2, #1;     // constant 1
3        MOV R3, #0;     // constant 0
   Loop:
4        JZ R1, Next;    // Done if i=0
5        ADD R0, R1;     // total += i
6        SUB R1, R2;     // i--
7        JZ R3, Loop;    // Jump always
   Next:                 // next instructions...
Try some others

Handshake: Wait until the value of M[254] is not 0, set M[255] to 1, wait
until M[254] is 0, set M[255] to 0 (assume those locations are ports).
(Harder) Count the occurrences of zero in an array stored in memory
locations 100 through 199.
24
Programmer Considerations
Program and data memory space
Embedded processors often very limited
e.g., 64 Kbytes program, 256 bytes of RAM (expandable)
Registers: How many are there?

Only a direct concern for assembly-level programmers
I/O
How to communicate with external signals?
Interrupts

25
Microprocessor Architecture Overview

If you are using a particular microprocessor, now is a
good time to review its architecture

26
Example: parallel port driver

LPT Connection Pin   I/O Direction   Register Address
1                    Output          0th bit of register #2
2-9                  Output          0th-7th bit of register #0
10,11,12,13,15       Input           6,7,5,4,3th bit of register #1
14,16,17             Output          1,2,3th bit of register #2

[Figure: PC parallel port with a switch on pin 13 and an LED on pin 2]

Using assembly language programming we can configure a PC parallel port to perform digital I/O (8255A peripheral I/F controller chip)
We write and read to three special registers to accomplish this; the table provides the list of parallel port connector pins and corresponding register locations
Example: parallel port which monitors the input switch and turns the LED on/off accordingly

27
Parallel Port Example

; This program consists of a sub-routine that reads
; the state of the input pin, determining the on/off state
; of our switch and asserts the output pin, turning the LED
; on/off accordingly
; (x86 assembly language)

CheckPort proc
        push ax              ; save the content
        push dx              ; save the content
        mov  dx, 3BCh + 1    ; base + 1 for register #1
        in   al, dx          ; read register #1
        and  al, 10h         ; mask out all but bit # 4
        cmp  al, 0           ; is it 0?
        jne  SwitchOn        ; if not, we need to turn the LED on

SwitchOff:
        mov  dx, 3BCh + 0    ; base + 0 for register #0
        in   al, dx          ; read the current state of the port
        and  al, 0FEh        ; clear first bit (masking)
        out  dx, al          ; write it out to the port
        jmp  Done            ; we are done

SwitchOn:
        mov  dx, 3BCh + 0    ; base + 0 for register #0
        in   al, dx          ; read the current state of the port
        or   al, 01h         ; set first bit (masking)
        out  dx, al          ; write it out to the port

Done:
        pop  dx              ; restore the content
        pop  ax              ; restore the content
        ret                  ; return to the caller
CheckPort endp

extern "C" void CheckPort(void);   // defined in assembly

int main(void) {
   while( 1 ) {
      CheckPort();
   }
}
28
Operating System
Optional software layer
providing low-level services to
a program (application).
File management, disk access
Keyboard/display interfacing
Scheduling multiple programs for
execution
Or even just multiple threads from
one program
Program makes system calls to

the OS
   DB file_name "out.txt"   -- store file name

   MOV R0, 1324             -- system call "open" id
   MOV R1, file_name        -- address of file-name
   INT 34                   -- cause a system call
   JZ R0, L1                -- if zero -> error
   . . . read the file
   JMP L2                   -- bypass error cond.
L1:
   . . . handle the error
L2:
29
Development Environment
Development processor
The processor on which we write and debug our programs
Usually a PC
Target processor
The processor that the program will run on in our embedded
system
Often different from the development processor
Target processor

30
Software Development Process

Compilers
   Cross compiler: runs on one processor, but generates code for another
Assemblers
Linkers
Debuggers
Profilers

[Figure: in the implementation phase, C files go through the compiler and an asm. file goes through the assembler to produce binary files, which the linker combines with a library into an exec. file; in the verification phase, the debugger and profiler operate on the exec. file]
31
Running a Program
If development processor is different than target, how
can we run our compiled code? Two options:
Download to target processor
Simulate
Simulation
One method: Hardware description language
But slow, not always available
Another method: Instruction set simulator (ISS)

Runs on development processor, but executes instructions of target
processor
32
Instruction Set Simulator For A Simple

Processor
#include <stdio.h>

typedef struct {
   unsigned char first_byte, second_byte;
} instruction;

instruction program[1024];   // instruction memory
unsigned char memory[256];   // data memory

// Dump the data memory, 16 bytes per row
void print_memory_contents(void) {
   for (int i = 0; i < 256; i++)
      printf("%02x%c", memory[i], (i % 16 == 15) ? '\n' : ' ');
}

// Execute the loaded program; returns 0 on success, 1 on an illegal opcode
int run_program(int num_bytes) {
   int pc = -1;
   unsigned char reg[16] = {0}, fb, sb;
   while( ++pc < (num_bytes / 2) ) {
      fb = program[pc].first_byte;
      sb = program[pc].second_byte;
      switch( fb >> 4 ) {
         case 0: reg[fb & 0x0f] = memory[sb]; break;
         case 1: memory[sb] = reg[fb & 0x0f]; break;
         case 2: memory[reg[fb & 0x0f]] = reg[sb >> 4]; break;
         case 3: reg[fb & 0x0f] = sb; break;
         case 4: reg[fb & 0x0f] += reg[sb >> 4]; break;
         case 5: reg[fb & 0x0f] -= reg[sb >> 4]; break;
         // JZ: jump only if Rn is 0; the offset is treated as signed
         case 6: if (reg[fb & 0x0f] == 0) pc += (signed char)sb; break;
         default: return 1;
      }
   }
   return 0;
}

int main(int argc, char *argv[]) {
   FILE* ifs;
   if( argc != 2 || (ifs = fopen(argv[1], "rb")) == NULL ) {
      return 1;
   }
   // with element size 1, fread returns the number of bytes read
   if (run_program((int)fread(program, 1, sizeof(program), ifs)) == 0) {
      print_memory_contents();
      return 0;
   }
   else return -1;
}

33
Testing and Debugging

ISS
   Gives us control over time: set breakpoints, look at register values, set values, step-by-step execution, ...
   But, doesn't interact with real environment
Download to board
   Use device programmer
   Runs in real environment, but not controllable
Compromise: emulator
   Runs in real environment, at speed or near
   Supports some controllability from the PC

[Figure: (a) implementation phase followed by a verification phase using a debugger / ISS; (b) implementation phase followed by a verification phase using external tools, the emulator and the programmer]

34
Application-Specific Instruction-Set
Processors (ASIPs)
GPPs
Sometimes too general to be effective in demanding
application
e.g., video processing requires huge video buffers and operations
on large arrays of data, inefficient on a GPP
But single-purpose processor has high NRE, not

programmable
ASIPs targeted to a particular domain

Contain architectural features specific to that domain
e.g., embedded control, digital signal processing, video processing,
network processing, telecommunications, etc.
Still programmable
35
A Common ASIP: Microcontroller

For embedded control applications
Reading sensors, setting actuators
Mostly dealing with events (bits): data is present, but not in huge
amounts
e.g., VCR, disk drive, digital camera (assuming SPP for image
compression), washing machine, microwave oven
Microcontroller features
On-chip peripherals
Timers, analog-digital converters, serial communication, etc.
Tightly integrated for programmer, typically part of register space
On-chip program and data memory

Direct programmer access to many of the chip's pins
Specialized instructions for bit-manipulation and other low-level
operations
36
Another Common ASIP: Digital Signal

Processors (DSP)
For signal processing applications
Large amounts of digitized data, often streaming
Data transformations must be applied fast
e.g., cell-phone voice filter, digital TV, music synthesizer
DSP features
Several instruction execution units
Multiply-accumulate single-cycle instruction, other instrs.
Efficient vector operations e.g., add two arrays
Vector ALUs, loop buffers, etc.

37
Trend: Even More Customized ASIPs

In the past, microprocessors were acquired as chips
Today, we increasingly acquire a processor as Intellectual
Property (IP)
e.g., synthesizable VHDL model
Opportunity to add a custom datapath hardware and a few

custom instructions, or delete a few instructions
Can have significant performance, power and size impacts
Problem: need compiler/debugger for customized ASIP
Remember, most development uses structured languages
One solution: automatic compiler/debugger generation
e.g., www.tensilica.com
Another solution: retargetable compilers

e.g., www.improvsys.com (customized VLIW architectures)
38
Selecting a Microprocessor
Issues
Technical: speed, power, size, cost
Other: development environment, prior expertise, licensing, etc.
Speed: how to evaluate a processor's speed?
   Clock speed: but instructions per cycle may differ
   Instructions per second: but work per instr. may differ
   Dhrystone: Synthetic benchmark (1984). Standard code (mostly string handling; no floating point operations). Dhrystones/sec.
      MIPS: 1 MIPS = 1757 Dhrystones per second (based on Digital's VAX 11/780). A.k.a. Dhrystone MIPS. Commonly used today.
      So, 750 MIPS = 750*1757 = 1,317,750 Dhrystones per second
   SPEC: set of more realistic benchmarks, but oriented to desktops
   EEMBC: EDN Embedded Benchmark Consortium, www.eembc.org
      Suites of benchmarks: automotive, consumer electronics, networking, office automation, telecommunications

39
Instruction-Set Processors
Processor           Clock speed   Periph.                                      Bus Width   MIPS     Power    Trans.   Price

General Purpose Processors
Intel PIII          1GHz          2x16 K L1, 256K L2, MMX                      32          ~900     97W      ~7M      $900
IBM PowerPC 750X    550 MHz       2x32 K L1, 256K L2                           32/64       ~1300    5W       ~7M      $900
MIPS R5000          250 MHz       2x32 K, 2 way set assoc.                     32/64       NA       NA       3.6M     NA
StrongARM SA-110    233 MHz       None                                         32          268      1W       2.1M     NA

Microcontroller
Intel 8051          12 MHz        4K ROM, 128 RAM, 32 I/O, Timer, UART         8           ~1       ~0.2W    ~10K     $7
Motorola 68HC811    3 MHz         4K ROM, 192 RAM, 32 I/O, Timer, WDT, SPI     8           ~.5      ~0.1W    ~10K     $5

Digital Signal Processors
TI C5416            160 MHz       128K, SRAM, 3 T1 Ports, DMA, 13 ADC, 9 DAC   16/32       ~600     NA       NA       $34
Lucent DSP32C       80 MHz        16K Inst., 2K Data, Serial Ports, DMA        32          40       NA       NA       $75

Sources: Intel, Motorola, MIPS, ARM, TI, and IBM Website/Datasheet; Embedded Systems Programming, Nov. 1998
Designing an Instruction-Set Processor

Not something an embedded system designer normally would do
   But instructive to see how simply we can build one top down
   Remember that real processors aren't usually built this way
      Much more optimized, much more bottom-up design

FSMD
Declarations:
   bit PC[16], IR[16];           // Program Counter, Instruction Reg.
   bit M[64k][16], RF[16][16];   // Memory, Register File
Aliases:
   Op  IR[15..12]      dir IR[7..0]
   Rn  IR[11..8]       imm IR[7..0]
   Rm  IR[7..4]        rel IR[7..0]

State    Op     Operation                   Next state
Reset           PC=0;                       Fetch
Fetch           IR=M[PC]; PC=PC+1           Decode
Decode                                      state selected by Op
Mov1     0000   RF[Rn] = M[dir]             Fetch
Mov2     0001   M[dir] = RF[Rn]             Fetch
Mov3     0010   M[@Rn] = RF[Rm]             Fetch
Mov4     0011   RF[Rn] = imm                Fetch
Add      0100   RF[Rn] = RF[Rn]+RF[Rm]      Fetch
Sub      0101   RF[Rn] = RF[Rn]-RF[Rm]      Fetch
Jz       0110   PC = (RF[Rn]=0) ? rel : PC  Fetch
Architecture of a Simple Microprocessor

Storage devices for each declared variable
   register file holds each of the variables
Functional units to carry out the FSMD operations
   One ALU carries out every required operation
Connections added among the components' ports corresponding to the operations required by the FSM
Unique identifiers created for every control signal

[Figure: control unit (controller with next-state and control logic and state register; 16-bit PC with PCld, PCinc, PCclr; IR with Irld) driving all input control signals and receiving all output control signals; datapath with a 2x1 mux (RFs) in front of the register file RF (16) with ports RFwa, RFw, RFwe, RFr1a, RFr1e, RFr1, RFr2a, RFr2e, RFr2, and an ALU (ALUs, ALUz); a 4x1 mux (Ms) selects the memory address, and the memory is controlled by Mre and Mwe]
42
A Simple Microprocessor
FSM operations that replace the FSMD operations after a datapath is created

State    Op     FSMD operation              FSM operations
Reset           PC=0;                       PCclr=1;
Fetch           IR=M[PC]; PC=PC+1           Ms=10; Irld=1; Mre=1; PCinc=1;
Decode                                      (next state selected by Op)
Mov1     0000   RF[Rn] = M[dir]             RFwa=Rn; RFwe=1; RFs=01; Ms=01; Mre=1;
Mov2     0001   M[dir] = RF[Rn]             RFr1a=Rn; RFr1e=1; Ms=01; Mwe=1;
Mov3     0010   M[@Rn] = RF[Rm]             RFr1a=Rn; RFr1e=1; ALUs=11; RFr2a=Rm; RFr2e=1; Ms=11; Mwe=1;
Mov4     0011   RF[Rn] = imm                RFwa=Rn; RFwe=1; RFs=10;
Add      0100   RF[Rn] = RF[Rn]+RF[Rm]      RFwa=Rn; RFwe=1; RFs=00; RFr1a=Rn; RFr1e=1; RFr2a=Rm; RFr2e=1; ALUs=00
Sub      0101   RF[Rn] = RF[Rn]-RF[Rm]      RFwa=Rn; RFwe=1; RFs=00; RFr1a=Rn; RFr1e=1; RFr2a=Rm; RFr2e=1; ALUs=01
Jz       0110   PC = (RF[Rn]=0) ? rel : PC  PCld=ALUz; RFr1a=Rn; RFr1e=1;

Every state from Mov1 to Jz returns to Fetch.

[Figure: the control unit and datapath of the simple microprocessor, as on the previous slide]
You just built a simple microprocessor!

43
Chapter Summary
Instruction-Set processors
Good performance, low NRE, flexible
Controller, datapath, and memory

Structured languages prevail
But some assembly level programming still necessary
Many tools available

Including instruction-set simulators, and in-circuit emulators
ASIPs
Microcontrollers, DSPs, network processors, more customized ASIPs
Choosing among processors is an important step

Designing an instruction-set processor is conceptually the same
as designing a single-purpose processor
44

Chapter 4 Standard Single Purpose

Processors: Peripherals
Introduction
Single-purpose processors
Performs specific computation task
Designed by us for a unique task
Standard single-purpose processors
Off-the-shelf -- pre-designed for a common task

a.k.a. peripherals
serial transmission
analog/digital conversions

Timers, counters, watchdog timers

Timer: measures time intervals
To generate timed output events
e.g., hold traffic light green for 10 s
To measure input events

e.g., measure a car's speed
Basic timer
Clk
16-bit up
counter
Based on counting clock pulses
E.g., let Clk period be 10 ns (f = 100 MHz)

And we count 20,000 Clk pulses
Then 200 microseconds have passed
16-bit counter would count up to 65,535*10 ns =
655.35 microsec., resolution = 10 ns
Top: indicates top count reached, wrap-around

16 Cnt
Top
Reset
Counters
Counter: like a timer, but counts

pulses on a general input signal
rather than clock
Timer/counter
Clk
e.g., count cars passing over a sensor

Can often configure device as either a
timer or counter
2x1
mux
16-bit up
counter
16 Cnt
Cnt_in
Top
Reset
Mode

Other timer structures

Interval timer
Indicates when desired time
interval has passed
We set terminal count to
desired interval
Number of clock cycles
= Desired time interval /
Clock period
Cascaded counters
Prescaler
Divides clock
Increases range, decreases
resolution
[Figures: timer with a terminal count (16-bit up counter whose Cnt is compared with the terminal count to produce Top); 16/32-bit timer (two cascaded 16-bit up counters with outputs Cnt1/Top1 and Cnt2/Top2); timer with prescaler (Clk passes through a prescaler, selected by Mode, before the 16-bit up counter)]
Example: Reaction Timer

[Figure: indicator light, reaction button, and LCD showing "time: 100 ms"]

Measure time between turning light on and user pushing button
   16-bit timer, clk period is 83.33 ns (12 MHz), counter increments every 6 cycles
   Resolution = 6*83.33 ns = 0.5 microsec.
   Range = 65535*0.5 microseconds = 32.77 milliseconds
   Want program to count millisec., so initialize counter to 65535 - 1000/0.5 = 63535

/* main.c */
#define MS_INIT 63535
void main(void){
   int count_milliseconds = 0;

   configure timer mode
   set Cnt to MS_INIT

   wait a random amount of time
   turn on indicator light
   start timer

   while (user has not pushed reaction button){
      if(Top) {
         stop timer
         set Cnt to MS_INIT
         start timer
         reset Top
         count_milliseconds++;
      }
   }
   turn light off
   printf("time: %i ms", count_milliseconds);
}

Watchdog timer

Must reset timer every X time unit, else timer generates a signal
Common use: detect failure, self-reset
Another use: timeouts
   e.g., ATM machine
   16-bit timer, 2 millisec. resolution
   timereg value = 2*(2^16 - 1) - X = 131070 - X
   For 2 min. timeout, X = 120,000 millisec.; so timereg = 11070

[Figure: 12 MHz osc -> prescaler (/12) -> 1 MHz clk -> scalereg (12 bits) -> overflow -> timereg (16 bits) -> overflow to system reset or interrupt; checkreg controls loading of the registers]

/* main.c */
main(){
   wait until card inserted
   call watchdog_reset_routine

   while(transaction in progress){
      if(button pressed){
         perform corresponding action
         call watchdog_reset_routine
      }
      /* if watchdog_reset_routine not called every
         < 2 minutes, interrupt_service_routine is
         called */
   }
}

watchdog_reset_routine(){
   /* checkreg is set so we can load value into
      timereg. Zero is loaded into scalereg and
      11070 is loaded into timereg */
   checkreg = 1
   scalereg = 0
   timereg = 11070
}

void interrupt_service_routine(){
   eject card
   reset screen
}
Serial Transmission Using UARTs

UART: Universal Asynchronous Receiver Transmitter
   Takes parallel data and transmits serially
   Receives serial data and converts to parallel
Parity: extra bit for simple error checking
Start bit, stop bit
Baud rate
   signal changes per second
   bit rate usually higher

[Figure: an embedded device with a sending UART transmits the data 10011011 serially, framed by a start bit and an end bit, to a receiving UART, which delivers 10011011 in parallel]
Pulse width modulator

Generates pulses with specific high/low times
Duty cycle: % time high
   Square wave: 50% duty cycle
Common use: control average voltage to electric device
   Simpler than DC-DC converter or digital-analog converter
   DC motor speed, dimmer lights
Another use: encode commands, receiver uses timer to decode

[Figure: pwm_o against clk for three duty cycles]
   25% duty cycle: average pwm_o is 1.25V
   50% duty cycle: average pwm_o is 2.5V
   75% duty cycle: average pwm_o is 3.75V
Controlling a DC motor with a PWM
Internal Structure of PWM
clk_div controls how fast the counter (0 to 254) increments
8-bit comparator compares counter with cycle_high
counter < cycle_high, pwm_o = 1
counter >= cycle_high, pwm_o = 0
Relationship between applied voltage and speed of the DC Motor
Input Voltage    % of Maximum Voltage Applied    RPM of DC Motor
2.5              50                              1840
3.75             75                              6900
5.0              100                             9200
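The comparator rule of the PWM (output high while counter < cycle_high) can be modeled in software. This is a sketch of the behavior only, not of any particular device; the function names are hypothetical and the 255-count period is taken from the counter range 0 to 254.

```c
#include <stdint.h>

#define PWM_PERIOD 255u  /* counter runs from 0 to 254 */

/* Comparator: output is 1 while counter < cycle_high, otherwise 0. */
uint8_t pwm_output(uint8_t counter, uint8_t cycle_high) {
    return counter < cycle_high ? 1u : 0u;
}

/* Counts how many of the 255 counter values in one period drive pwm_o high. */
unsigned pwm_high_counts(uint8_t cycle_high) {
    unsigned high = 0;
    for (unsigned counter = 0; counter < PWM_PERIOD; counter++)
        high += pwm_output((uint8_t)counter, cycle_high);
    return high;
}
```

A cycle_high of 0x7f keeps the output high for 127 of 255 counts, which is just under a 50% duty cycle.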
void main(void){
    /* controls period */
    PWMP = 0xff;
    /* controls duty cycle */
    PWM1 = 0x7f;
    while(1){};
}
The PWM alone cannot drive the DC motor; a possible way to implement a driver uses an MJE3055T NPN transistor.
[Figure: driver circuit: the signal from the processor drives an MJE3055T NPN transistor that switches the 5V DC motor]
LCD controller
void WriteChar(char c){
    RS = 1;           /* indicate data being sent */
    DATA_BUS = c;     /* send data to LCD */
    EnableLCD(45);    /* toggle the LCD with appropriate delay */
}
[Figure: microcontroller connected to the LCD controller over a communications bus: E, R/W, RS and the 8 data lines DB7-DB0]
CODES
I/D = 1 cursor moves left        I/D = 0 cursor moves right
S = 1 with display shift
S/C = 1 display shift            S/C = 0 cursor movement
R/L = 1 shift to right           R/L = 0 shift to left
DL = 1 8-bit                     DL = 0 4-bit
N = 1 2 rows                     N = 0 1 row
F = 1 5x10 dots                  F = 0 5x7 dots
Command table (columns: RS, R/W, DB7 ... DB0, Description)
Clears all display, return cursor home
Returns cursor home
Sets cursor move direction and/or specifies not to shift display (I/D, S)
ON/OFF of all display (D), cursor ON/OFF (C), and blink position (B)
Move cursor and shifts display (S/C, R/L)
Sets interface data length, number of display lines, and character font (DL, N, F)
Writes Data (WRITE DATA)
Keypad controller
[Figure: keypad controller for a keypad with N=4, M=4: lines N1-N4 and M1-M4 connect the keypad to the controller, which outputs k_pressed and a 4-bit key_code]
Stepper motor controller
Stepper motor: rotates fixed number of degrees when given a step signal
In contrast, DC motor just rotates when power applied, coasts to stop otherwise
Specification: degrees/step or #steps/revol. (e.g., 1.8 degrees per step, or 200 steps per revolution)
Rotation achieved by applying specific voltage sequence to 4 coils (1 or 2 coils driven during each step)
Controller greatly simplifies this
Sequence   A   B   A'  B'
1          +   +   -   -
2          -   +   +   -
3          -   -   +   +
4          +   -   -   +
5          +   +   -   -
[Figure: MC3479P stepper motor driver pinout (16 pins), including Vd, Vm, GND, Bias/Set, Phase A, Clk, CW/CCW, O|C, Full/Half Step and the coil outputs A, A', B, B'; motor wires Red, White, Yellow, Black]
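The stepper motor specification and coil sequence can be made concrete with a short sketch. It assumes the full-step pattern order A, B, A', B' with 1 meaning an energized coil; angles are passed in tenths of a degree to avoid floating point. All names are hypothetical.

```c
#include <stdint.h>

/* Four-phase full-step coil patterns in the order A, B, A', B'
   (1 = coil energized). After the fourth pattern the sequence repeats. */
static const uint8_t full_step[4][4] = {
    {1, 1, 0, 0},
    {0, 1, 1, 0},
    {0, 0, 1, 1},
    {1, 0, 0, 1}
};

/* Steps per revolution for a step angle given in tenths of a degree. */
unsigned steps_per_rev(unsigned tenths_deg_per_step) {
    return 3600u / tenths_deg_per_step;
}

/* Whole steps needed to rotate by a given angle (tenths of a degree). */
unsigned steps_for_angle(unsigned tenths_deg, unsigned tenths_deg_per_step) {
    return tenths_deg / tenths_deg_per_step;
}

/* Next position in the sequence: forward if clockwise is nonzero. */
unsigned next_phase(unsigned phase, int clockwise) {
    return clockwise ? (phase + 1u) % 4u : (phase + 3u) % 4u;
}

/* Coil pattern to apply for a given position in the sequence. */
const uint8_t *coil_pattern(unsigned phase) {
    return full_step[phase % 4u];
}
```

A 1.8 degree motor needs 200 steps per revolution; with 7.5 degree steps, a 15 degree move takes 2 steps. Reversing direction just walks the same table backwards.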
Stepper motor with controller (driver)
[Figure: 8051 pins P1.0 and P1.1 drive the CW/CCW (pin 10) and CLK (pin 7) inputs of the MC3479P stepper motor driver; driver outputs on pins 2, 3, 14 and 15 go to the stepper motor coils through buffers]

/* main.c */
sbit clk=P1^1;
sbit cw=P1^0;

void delay(void){
    int i, j;
    for (i=0; i<1000; i++)
        for (j=0; j<50; j++)
            i = i + 0;
}

void main(void){
    /* turn the motor forward */
    cw=0;        /* set direction */
    clk=0;       /* pulse clock */
    delay();
    clk=1;

    /* turn the motor backwards */
    cw=1;        /* set direction */
    clk=0;       /* pulse clock */
    delay();
    clk=1;
}

The output pins on the stepper motor driver do not provide enough current to drive the stepper motor. To amplify the current, a buffer is needed. One possible implementation of the buffers is pictured in the figure. Q1 is an MJE3055T NPN transistor and Q2 is an MJE2955T PNP transistor.
[Figure: buffer circuit between driver output A and the stepper motor: +V, two 1K resistors, Q1 and Q2]
Stepper motor without controller (driver)
[Figure: 8051 pins P2.0-P2.3 drive the stepper motor coils through buffers; P2.4 is tied to GND/+V]
A possible way to implement the buffers is shown in the figure. The 8051 alone cannot drive the stepper motor, so several transistors were added to increase the current going to the stepper motor. Q1, Q3 are MJE3055T NPN transistors and Q2 is an MJE2955T PNP transistor. A is connected to the 8051 microcontroller and B is connected to the stepper motor.
[Figure: buffer circuit between A and B: +V, two 1K resistors, one 330 ohm resistor, Q1, Q2 and Q3]

/* main.c */
sbit notA=P2^0;
sbit isA=P2^1;
sbit notB=P2^2;
sbit isB=P2^3;
sbit dir=P2^4;

/* coil patterns: four entries (A, B, A', B') per row;
   global so that move() can read it */
int lookup[20] = {
    1, 1, 0, 0,
    0, 1, 1, 0,
    0, 0, 1, 1,
    1, 0, 0, 1,
    1, 1, 0, 0 };

void delay(){
    int a, b;
    for(a=0; a<5000; a++)
        for(b=0; b<10000; b++)
            a=a+0;
}

void move(int dir, int steps) {
    int y, z;
    /* clockwise movement */
    if(dir == 1){
        for(y=0; y<=steps; y++){
            for(z=0; z<=19; z+=4){
                isA=lookup[z];
                isB=lookup[z+1];
                notA=lookup[z+2];
                notB=lookup[z+3];
                delay();
            }
        }
    }
    /* counter clockwise movement */
    if(dir==0){
        for(y=0; y<=steps; y++){
            for(z=19; z>=0; z-=4){
                isA=lookup[z];
                isB=lookup[z-1];
                notA=lookup[z-2];
                notB=lookup[z-3];
                delay();
            }
        }
    }
}

void main(){
    while(1){
        /* move forward, 15 degrees (2 steps) */
        move(1, 2);
        /* move backwards, 7.5 degrees (1 step) */
        move(0, 1);
    }
}
Analog-to-digital converters
[Figure: proportionality: analog input from 0V to Vmax = 7.5V in 0.5V steps maps to the 4-bit codes 0000 to 1111. Analog to digital: the analog input sampled at times t1 to t4 gives the digital output 0100 1000 0110 0101. Digital to analog: the digital input 0100 1000 0110 0101 at times t1 to t4 gives the analog output (V)]
ADC using successive approximation
Given an analog input signal whose voltage should range from 0 to 15 volts, and an 8-bit digital encoding, calculate the correct encoding for 5 volts. Then trace the successive-approximation approach to find the correct encoding.
Va / Vmax = d / (2^n - 1)
5/15 = d / (2^8 - 1)
d = 85
encoding: 01010101
Successive-approximation method (each step halves the range; the bit is 1 when 5 volts lies at or above the midpoint)
1/2(Vmax - Vmin) = 7.5 volts; Vmax = 7.5 volts. Bit 7 = 0
1/2(7.5 + 0) = 3.75 volts; Vmin = 3.75 volts. Bit 6 = 1
1/2(7.5 + 3.75) = 5.63 volts; Vmax = 5.63 volts. Bit 5 = 0
1/2(5.63 + 3.75) = 4.69 volts; Vmin = 4.69 volts. Bit 4 = 1
1/2(5.63 + 4.69) = 5.16 volts; Vmax = 5.16 volts. Bit 3 = 0
1/2(5.16 + 4.69) = 4.93 volts; Vmin = 4.93 volts. Bit 2 = 1
1/2(5.16 + 4.93) = 5.05 volts; Vmax = 5.05 volts. Bit 1 = 0
1/2(5.05 + 4.93) = 4.99 volts. Bit 0 = 1
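Both calculations from the successive-approximation example can be sketched in C. The first function applies the proportionality Va / Vmax = d / (2^n - 1) in integer millivolts; the second performs the range-halving search for an 8-bit result. The function names and the choice of millivolts are assumptions made for illustration.

```c
#include <stdint.h>

/* Direct encoding from the proportionality Va / Vmax = d / (2^n - 1),
   using integer millivolts and truncating division. */
unsigned long adc_encode_mv(unsigned long va_mv, unsigned long vmax_mv,
                            unsigned bits) {
    return va_mv * ((1UL << bits) - 1UL) / vmax_mv;
}

/* 8-bit successive approximation: halve the [lo, hi] range once per bit,
   most significant bit first. The bit is 1 when va is at or above the
   midpoint, in which case the midpoint becomes the new lower bound. */
uint8_t sar_convert(double va, double vmax) {
    double lo = 0.0, hi = vmax;
    uint8_t code = 0;
    for (int bit = 7; bit >= 0; bit--) {
        double mid = 0.5 * (hi + lo);
        if (va >= mid) {
            code |= (uint8_t)(1u << bit);
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return code;
}
```

For 5 volts on a 0 to 15 volt range both functions yield 85, that is 01010101.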

Chapter 5 Memory
Outline
Memory Write Ability and Storage Permanence

Common Memory Types
Composing Memory
Memory Hierarchy and Cache
Advanced RAM

Introduction
Embedded systems functionality aspects
Processing
processors
transformation of data
Storage
memory
retention of data
Communication
buses
transfer of data

Memory: basic concepts
Stores large number of bits
m x n: m words of n bits each
k = Log2(m) address input signals
or m = 2^k words
e.g., 4,096 x 8 memory:
32,768 bits
12 address input signals
8 input/output data signals
Memory access
r/w: selects read or write
enable: read or write only when asserted
multiport: multiple accesses to different locations simultaneously
[Figure: m x n memory: m words, n bits per word. Memory external view: 2^k x n read and write memory with inputs r/w, enable, A0 ... Ak-1 and data lines Qn-1 ... Q0]
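The relation between words, address lines and total bits can be checked with a few lines of C. This is a minimal sketch; `address_lines` and `total_bits` are hypothetical helper names, and `address_lines` rounds up when m is not a power of two.

```c
/* Number of address input signals k such that 2^k >= words. */
unsigned address_lines(unsigned long words) {
    unsigned k = 0;
    while ((1UL << k) < words)
        k++;
    return k;
}

/* Total storage of an m x n memory in bits. */
unsigned long total_bits(unsigned long words, unsigned long bits_per_word) {
    return words * bits_per_word;
}
```

For the 4,096 x 8 memory this gives 12 address lines and 32,768 bits.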
Traditional ROM/RAM distinctions
ROM
read only, bits stored without power
RAM
read and write, lose stored bits without power
Traditional distinctions blurred
Advanced ROMs can be written to
e.g., EEPROM
Advanced RAMs can hold bits without power
e.g., NVRAM
Write ability
Manner and speed a memory can be written
Storage permanence
ability of memory to hold stored bits after they are written
[Figure: Write ability and storage permanence of memories, showing relative degrees along each axis (not to scale).
Storage permanence axis: Life of product (Mask-programmed ROM); Tens of years (OTP ROM, EPROM, EEPROM, FLASH); Battery life (10 years) (NVRAM); Near zero (SRAM/DRAM). Regions marked: Nonvolatile, In-system programmable, Ideal memory.
Write ability axis: During fabrication only; External programmer, one time only; External programmer, 1,000s of cycles; External programmer OR in-system, 1,000s of cycles; External programmer OR in-system, block-oriented writes, 1,000s of cycles; In-system, fast writes, unlimited cycles]

Write ability
Ranges of write ability

High end
processor writes to memory simply and quickly
e.g., RAM
Middle range
processor writes to memory, but slower
e.g., FLASH, EEPROM
Lower range
special equipment, programmer, must be used to write to memory
e.g., EPROM, OTP ROM
Low end
bits stored only during fabrication
e.g., Mask-programmed ROM
In-system programmable memory

Can be written to by a processor in the embedded system using the
memory
Memories in high end and middle range of write ability

Storage permanence
Range of storage permanence

High end
essentially never loses bits
e.g., mask-programmed ROM
Middle range
holds bits days, months, or years after memory's power source turned off
e.g., NVRAM
Lower range
holds bits as long as power supplied to memory
e.g., SRAM
Low end
begins to lose bits almost immediately after written
e.g., DRAM
Nonvolatile memory
Holds bits after power is no longer supplied
High end and middle range of storage permanence

ROM: Read-Only Memory
Nonvolatile memory
Can be read from but not written to, by a processor in an embedded system
Traditionally written to, programmed, before inserting to embedded system
Uses
Store software program for general-purpose processor
program instructions can be one or more ROM words
Store constant data needed by system
Implement combinational circuit
[Figure: external view of a 2^k x n ROM with inputs enable, A0 ... Ak-1 and outputs Qn-1 ... Q0]
Example: 8 x 4 ROM
Horizontal lines = words
Vertical lines = data
Lines connected only at circles
Decoder sets word 2's line to 1 if address input is 010
Data lines Q3 and Q1 are set to 1 because there is a programmed connection with word 2's line
Word 2 is not connected with data lines Q2 and Q0
Output is 1010
[Figure: internal view of the 8 x 4 ROM: a 3x8 decoder with enable and address inputs A0, A1, A2 drives the word lines (word 0, word 1, word 2, ...); programmable connections join word lines to the wired-OR data lines Q3 Q2 Q1 Q0]
Implementing combinational function
Any combinational circuit of n functions of same k variables can be done with 2^k x n ROM
Truth table
Inputs (address)    Outputs
a  b  c             y  z
0  0  0             0  0
0  0  1             0  1
0  1  0             0  1
0  1  1             1  0
1  0  0             1  0
1  0  1             1  1
1  1  0             1  1
1  1  1             1  1
[Figure: 8 x 2 ROM with enable and address inputs a, b, c; word 0 to word 7 hold the y and z columns of the truth table]
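The truth table implemented by the 8 x 2 ROM maps directly onto a constant array: the inputs form the address and each stored word holds the outputs. A sketch, assuming a is the most significant address bit and y the more significant output bit; the names are hypothetical.

```c
#include <stdint.h>

/* 8 x 2 ROM contents: bit 1 of each word is y, bit 0 is z.
   The index is the address formed from inputs a (MSB), b, c (LSB). */
static const uint8_t rom[8] = {
    0x0, /* 000 -> y=0 z=0 */
    0x1, /* 001 -> y=0 z=1 */
    0x1, /* 010 -> y=0 z=1 */
    0x2, /* 011 -> y=1 z=0 */
    0x2, /* 100 -> y=1 z=0 */
    0x3, /* 101 -> y=1 z=1 */
    0x3, /* 110 -> y=1 z=1 */
    0x3  /* 111 -> y=1 z=1 */
};

/* Reads the word selected by the three input bits. */
uint8_t rom_read(unsigned a, unsigned b, unsigned c) {
    unsigned address = ((a & 1u) << 2) | ((b & 1u) << 1) | (c & 1u);
    return rom[address];
}

unsigned rom_y(unsigned a, unsigned b, unsigned c) {
    return (rom_read(a, b, c) >> 1) & 1u;
}

unsigned rom_z(unsigned a, unsigned b, unsigned c) {
    return rom_read(a, b, c) & 1u;
}
```

No logic minimization is involved: changing the function only means changing the stored words.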
Mask-programmed ROM
Connections programmed at fabrication
set of masks
Lowest write ability

only once
Highest storage permanence

bits never change unless damaged
Typically used for final design of high-volume systems

spread out NRE cost for a low unit cost

OTP ROM: One-time programmable ROM

Connections programmed after manufacture by user
user provides file of desired contents of ROM

file input to machine called ROM programmer
each programmable connection is a fuse
ROM programmer blows fuses where connections should not exist
Very low write ability

typically written only once and requires ROM programmer device
Very high storage permanence

bits don't change unless reconnected to programmer and more fuses blown
Commonly used in final products
cheaper, harder to inadvertently modify
EPROM: Erasable programmable ROM
Programmable component is a MOS transistor
Transistor has floating gate surrounded by an insulator
(a) Negative charges form a channel between source and drain storing a logic 1
(b) Large positive voltage at gate causes negative charges to move out of channel and get trapped in floating gate storing a logic 0
(c) (Erase) Shining UV rays on surface of floating-gate causes negative charges to return to channel from floating gate restoring the logic 1
(d) An EPROM package showing quartz window through which UV light can pass
Better write ability
can be erased and reprogrammed thousands of times
Reduced storage permanence
program lasts about 10 years but is susceptible to radiation and electric noise
Typically used during design development
[Figure: floating-gate transistor with source, drain and floating gate: (a) 0V at gate; (b) +15V at gate; (c) UV erase, 5-30 min; (d) EPROM package]
EEPROM: Electrically erasable programmable ROM
Programmed and erased electronically
typically by using higher than normal voltage
can program and erase individual words
Better write ability

can be in-system programmable with built-in circuit to provide higher
than normal voltage
built-in memory controller commonly used to hide details from memory user
writes very slow due to erasing and programming

busy pin indicates to processor EEPROM still writing
can be erased and programmed tens of thousands of times
Similar storage permanence to EPROM (about 10 years)

Far more convenient than EPROMs, but more expensive
Flash Memory
Extension of EEPROM
Same floating gate principle
Same write ability and storage permanence
Fast erase
Large blocks of memory erased at once, rather than one word at a time
Blocks typically several thousand bytes large
Writes to single words may be slower

Entire block must be read, word updated, then entire block written back
Used with embedded systems storing large data items in

nonvolatile memory
e.g., digital cameras, TV set-top boxes, cell phones
RAM: Random-access memory
Typically volatile memory
bits are not held without power supply
Read and written to easily by embedded system during execution
Internal structure more complex than ROM
a word consists of several memory cells, each storing 1 bit
each input and output data line connects to each cell in its column
rd/wr connected to every cell
when row is enabled by decoder, each cell has logic that stores input data bit when rd/wr indicates write or outputs stored bit when rd/wr indicates read
[Figure: external view: 2^k x n read and write memory with inputs r/w, enable, A0 ... Ak-1 and data lines Qn-1 ... Q0. Internal view: 4 x 4 RAM with a 2x4 decoder (enable, A0, A1), input lines I3 I2 I1 I0, output lines Q3 Q2 Q1 Q0, and rd/wr routed to every memory cell]
Basic types of RAM
SRAM: Static RAM
Memory cell uses flip-flop to store bit
Requires 6 transistors
Holds data as long as power supplied
DRAM: Dynamic RAM
Memory cell uses MOS transistor and capacitor to store bit
More compact than SRAM
Refresh required due to capacitor leak
word's cells refreshed when read
Typical refresh rate 15.625 microsec.
Slower to access than SRAM
[Figure: memory cell internals: SRAM cell with Data and Data' lines; DRAM cell with Data line and word line W]
RAM variations
PSRAM: Pseudo-static RAM
DRAM with built-in memory refresh controller
Popular low-cost high-density alternative to SRAM
NVRAM: Nonvolatile RAM

Holds data after external power removed
Battery-backed RAM
SRAM with own permanently connected battery
writes as fast as reads
no limit on number of writes unlike nonvolatile ROM-based memory
SRAM with EEPROM or flash

stores complete RAM contents on EEPROM or flash before power turned off

Example: HM6264 & 27C256 RAM/ROM devices
Low-cost low-capacity memory devices
Commonly used in 8-bit microcontroller-based embedded systems
First two numeric digits indicate device type
RAM: 62
ROM: 27
Subsequent digits indicate capacity in kilobits
Device characteristics
Device    Access Time (ns)    Standby Pwr. (mW)    Active Pwr. (mW)    Vcc Voltage (V)
HM6264    85-100              .01                  15                  5
27C256    90                  .5                   100                 5
[Figure: block diagrams. HM6264: data<7...0> on pins 11-13, 15-19; addr<15...0> on pins 2, 23, 21, 24, 25, 3-10; /OE on pin 22; /WE on pin 27; /CS1 on pin 20; CS2 on pin 26. 27C256: data<7...0> on pins 11-13, 15-19; addr<15...0> on pins 27, 26, 2, 23, 21, 24, 25, 3-10; /OE on pin 22; /CS on pin 20]
[Figure: timing diagrams. Read operation: data, addr, OE, /CS1, CS2. Write operation: data, addr, WE, /CS1, CS2]
Example: TC55V2325FF-100 memory device
2-megabit synchronous pipelined burst SRAM memory device
Designed to be interfaced with 32-bit processors
Capable of fast sequential reads and writes as well as single byte I/O
Device characteristics
Device             Access Time (ns)    Standby Pwr. (mW)    Active Pwr. (mW)    Vcc Voltage (V)
TC55V2325FF-100    10                  na                   1200                3.3
[Figure: block diagram of the TC55V2325FF-100 with data<31...0>, addr<15...0>, addr<10...0>, /CS1, /CS2, CS3, /WE, /OE, MODE, /ADSP, /ADSC, /ADV and CLK]
[Figure: timing diagram of a single read operation: CLK, /ADSP, /ADSC, /ADV, addr<15...0>, /WE, /OE, /CS1 and /CS2, CS3, data<31...0>]
Composing memory
Memory size needed often differs from size of readily available memories
When available memory is larger, simply ignore unneeded high-order address bits and higher data lines
When available memory is smaller, compose several smaller memories into one larger memory
Connect side-by-side to increase width of words
Connect top to bottom to increase number of words
added high-order address line selects smaller memory containing desired word using a decoder
Combine techniques to increase number and width of words
[Figure: Increase number of words: two 2^m x n ROMs form a 2^(m+1) x n ROM; A0 ... Am-1 go to both ROMs and Am drives a 1x2 decoder that enables one of them; outputs Qn-1 ... Q0. Increase width of words: three 2^m x n ROMs share enable and the address inputs and form a 2^m x 3n ROM with outputs Q3n-1 ... Q2n-1 ... Q0. Increase number and width of words: both techniques combined]
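The two composition techniques can be modeled in C with two small ROM arrays. This sketch assumes two hypothetical 8 x 8 chips (m = 3, n = 8) with made-up contents; the high-order address bit plays the role of the 1x2 decoder when stacking, and both chips see the same address when placed side by side.

```c
#include <stdint.h>

#define M_BITS 3u                    /* each chip has 2^3 = 8 words */
#define CHIP_WORDS (1u << M_BITS)

/* Two hypothetical 8 x 8 ROM chips with illustrative contents. */
static const uint8_t rom_low[CHIP_WORDS] =
    {0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
static const uint8_t rom_high[CHIP_WORDS] =
    {0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27};

/* More words (16 x 8): address bit M_BITS acts as the decoder input that
   enables one chip; the low M_BITS bits address the word inside it. */
uint8_t read_more_words(unsigned addr) {
    unsigned chip = (addr >> M_BITS) & 1u;
    unsigned offset = addr & (CHIP_WORDS - 1u);
    return chip ? rom_high[offset] : rom_low[offset];
}

/* Wider words (8 x 16): both chips receive the same address and their
   outputs are placed side by side. */
uint16_t read_wider_words(unsigned addr) {
    unsigned offset = addr & (CHIP_WORDS - 1u);
    return (uint16_t)(((unsigned)rom_high[offset] << 8) | rom_low[offset]);
}
```

Address 10 in the taller memory selects word 2 of the second chip, while address 3 in the wider memory returns the two chips' third words concatenated.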
Memory hierarchy
Want inexpensive, fast
memory
Main memory
Large, inexpensive, slow
memory stores entire
program and data
Cache
Small, expensive, fast
memory stores copy of likely
accessed parts of larger
memory
Can be multiple levels of
cache
Processor
Registers
Cache
Main memory
Disk
Tape
Cache
Usually designed with SRAM
faster but more expensive than DRAM
Usually on same chip as processor

space limited, so much smaller than off-chip main memory
faster access ( 1 cycle vs. several cycles for main memory)
Cache operation:
Request for main memory access (read or write)
First, check cache for copy
cache hit
copy is in cache, quick access
cache miss
copy not in cache, read address and possibly its neighbors into cache
Several cache design choices

cache mapping, replacement policies, and write techniques
Cache mapping
Far fewer number of available cache addresses
Are address contents in cache?
Cache mapping used to assign main memory address to cache
address and determine hit or miss
Three basic techniques:
Direct mapping
Fully associative mapping
Set-associative mapping
Caches partitioned into indivisible blocks or lines of adjacent

memory addresses
usually 4 or 8 addresses per line
Direct mapping
Main memory address divided into 2 fields
Index
cache address
number of bits determined by cache size
Tag
compared with tag stored in cache at address indicated by index
if tags match, check valid bit
Valid bit
indicates whether data in slot has been loaded from memory
Offset
used to find particular word in cache line
[Figure: address split into Tag, Index and Offset; the index selects a V T D cache entry, the stored tag is compared (=) with the address tag, producing Valid and Data]
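The direct-mapping lookup can be sketched in C as a split of the address into offset, index and tag followed by a tag and valid-bit check. The field widths (2 offset bits for 4 words per line, 6 index bits for 64 lines) are assumptions chosen for illustration, as are all names; the data words of each line are omitted.

```c
#include <stdint.h>
#include <stdbool.h>

#define OFFSET_BITS 2u                   /* assumed: 4 addresses per line */
#define INDEX_BITS  6u                   /* assumed: 64 cache lines */
#define NUM_LINES   (1u << INDEX_BITS)

/* One cache entry: valid bit and tag (the data field is left out). */
typedef struct {
    bool valid;
    uint32_t tag;
} cache_line;

uint32_t addr_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1u);
}

uint32_t addr_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & (NUM_LINES - 1u);
}

uint32_t addr_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}

/* Hit when the line selected by the index is valid and its tag matches. */
bool cache_hit(const cache_line *cache, uint32_t addr) {
    const cache_line *line = &cache[addr_index(addr)];
    return line->valid && line->tag == addr_tag(addr);
}

/* On a miss, the line is loaded and tagged with the new address. */
void cache_fill(cache_line *cache, uint32_t addr) {
    cache_line *line = &cache[addr_index(addr)];
    line->valid = true;
    line->tag = addr_tag(addr);
}
```

Two addresses that share an index but differ in tag compete for the same line, which is why a direct-mapped cache has no replacement choice.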
Fully associative mapping
Complete main memory address stored in each cache address
All addresses stored in cache simultaneously compared with desired address
Valid bit and offset same as direct mapping
[Figure: address split into Tag and Offset; the tag is compared (=) simultaneously with the T field of every V T D cache entry, producing Valid and Data]
Set-associative mapping
Compromise between direct mapping and fully associative mapping
Index same as in direct mapping
But, each cache address contains content and tags of 2 or more memory address locations
Tags of that set simultaneously compared as in fully associative mapping
Cache with set size N called N-way set-associative
2-way, 4-way, 8-way are common
[Figure: address split into Tag, Index and Offset; the index selects a set of V T D entries whose tags are compared (=) simultaneously, producing Valid and Data]
Cache-replacement policy
Technique for choosing which block to replace
when fully associative cache is full
when set-associative cache's line is full
Direct mapped cache has no choice

Random
replace block chosen at random
LRU: least-recently used

replace block not accessed for longest time
FIFO: first-in-first-out
push block onto queue when accessed
choose block to replace by popping queue
Cache write techniques

When written, data cache must update main memory
Write-through
write to main memory whenever cache is written to

easiest to implement
processor must wait for slower main memory write
potential for unnecessary writes
Write-back
main memory only written when dirty block replaced
extra dirty bit for each block set when cache block written to
reduces number of slow main memory writes

Cache impact on system performance

Most important parameters in terms of performance:
Total size of cache
total number of data bytes cache can hold
tag, valid and other house keeping bits not included in total
Degree of associativity
Data block size
Larger caches achieve lower miss rates but higher access cost
e.g.,
2 Kbyte cache: miss rate = 15%, hit cost = 2 cycles, miss cost = 20 cycles
avg. cost of memory access = (0.85 * 2) + (0.15 * 20) = 4.7 cycles
4 Kbyte cache: miss rate = 6.5%, hit cost = 3 cycles, miss cost will not change
avg. cost of memory access = (0.935 * 3) + (0.065 * 20) = 4.105 cycles (improvement)
8 Kbyte cache: miss rate = 5.565%, hit cost = 4 cycles, miss cost will not change
avg. cost of memory access = (0.94435 * 4) + (0.05565 * 20) = 4.8904 cycles (worse)
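The average access cost used in the cache size example is a weighted sum of hit cost and miss cost. A minimal sketch, using the same weighting as the 2 Kbyte calculation; `avg_access_cost` is a hypothetical name and the miss rate is passed as a fraction.

```c
/* Average memory access cost in cycles:
   (1 - miss_rate) * hit_cost + miss_rate * miss_cost. */
double avg_access_cost(double miss_rate, double hit_cost, double miss_cost) {
    return (1.0 - miss_rate) * hit_cost + miss_rate * miss_cost;
}
```

Evaluated for the three caches (miss rates 0.15, 0.065 and 0.05565 with hit costs 2, 3 and 4 cycles and a 20-cycle miss cost), the 4 Kbyte cache comes out cheapest at about 4.1 cycles, while the 8 Kbyte cache at about 4.9 cycles is slower than the 2 Kbyte one: the lower miss rate no longer pays for the higher hit cost.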
Cache performance trade-offs

Improving cache hit rate without increasing size
Increase line size
Change set-associativity
[Figure: % cache miss (0 to 0.16) versus cache size (1 Kb to 128 Kb) for 1 way, 2 way, 4 way and 8 way set-associativity]
Advanced RAM
DRAMs commonly used as main memory in processor based
embedded systems
high capacity, low cost
Many variations of DRAMs proposed
need to keep pace with processor speeds

FPM DRAM: fast page mode DRAM
EDO DRAM: extended data out DRAM
SDRAM/ESDRAM: synchronous and enhanced synchronous DRAM
RDRAM: rambus DRAM

Basic DRAM
Address bus multiplexed between row and column components
Row and column addresses are latched in, sequentially, by strobing ras and cas signals, respectively
Refresh circuitry can be external or internal to DRAM device
strobes consecutive memory address periodically causing memory content to be refreshed
Refresh circuitry disabled during read or write operation
[Figure: basic DRAM: address feeds the Row Addr. Buffer and Col Addr. Buffer (strobed by ras and cas); Row Decoder, Col Decoder and Sense Amplifiers surround the Bit storage array; Refresh Circuit driven by cas, ras, clock; rd/wr controls the Data In Buffer and Data Out Buffer on data]
Fast Page Mode DRAM (FPM DRAM)
Each row of memory bit array is viewed as a page

Page contains multiple words
Individual words addressed by column address
Timing diagram:
row (page) address sent
3 words read consecutively by sending column address for each
Extra cycle eliminated on each read/write of words from same page

[Figure: FPM DRAM timing diagram (ras, cas, address, data): one row address followed by three column addresses, each returning a data word]
Extended data out DRAM (EDO DRAM)

Improvement of FPM DRAM
Extra latch before output buffer
allows strobing of cas before data read operation completed
Reduces read/write latency by additional cycle
[Figure: EDO DRAM timing diagram (ras, cas, address, data): one row address followed by three column addresses, each returning a data word; speedup through overlap]
(S)ynchronous and Enhanced Synchronous (ES) DRAM
SDRAM latches data on active edge of clock
Eliminates time to detect ras/cas and rd/wr signals
A counter is initialized to column address then incremented on active edge of clock to access consecutive memory locations
ESDRAM improves SDRAM
added buffers enable overlapping of column addressing
faster clocking and lower read/write latency possible
[Figure: SDRAM timing diagram (clock, ras, cas, address, data): one row address and one column address followed by consecutive data words]
Rambus DRAM (RDRAM)

More of a bus interface architecture than DRAM
architecture
Data is latched on both rising and falling edge of
clock
Broken into 4 banks each with own row decoder
can have 4 pages open at a time
Capable of very high throughput

DRAM integration problem

SRAM easily integrated on same chip as processor
DRAM more difficult
Different chip making process between DRAM and
conventional logic
Goal of conventional logic (IC) designers:
minimize parasitic capacitance to reduce signal propagation delays
and power consumption
Goal of DRAM designers:

create capacitor cells to retain stored information
Integration processes beginning to appear

Memory Management Unit (MMU)

Duties of MMU
Handles DRAM refresh, bus interface and arbitration
Takes care of memory sharing among multiple
processors
Translates logical memory addresses from processor to physical memory addresses of DRAM
Modern CPUs often come with MMU built-in

Single-purpose processors can be used


Chapter 11: Design Technology
Outline
Automation: synthesis
Verification: hardware/software co-simulation
Reuse: intellectual property cores
Design process models

Introduction
Design task
Define system functionality
Convert functionality to physical implementation while
Satisfying constrained metrics
Optimizing other design metrics
Designing embedded systems is hard

Complex functionality
Millions of possible environment scenarios
Competing, tightly constrained metrics
Productivity gap
As low as 10 lines of code or 100 transistors produced per day

Improving productivity
Design technologies developed to improve productivity
We focus on technologies advancing hardware/software unified view
Automation
Program replaces manual design
Synthesis
Reuse
Predesigned components
Cores
General-purpose and single-purpose processors on single IC
Verification
Ensuring correctness/completeness of each design step
Hardware/software co-simulation
[Figure: Specification to Implementation, supported by Automation, Verification and Reuse]
Automation: synthesis
Early design mostly hardware
Software complexity increased with advent of general-purpose processor
Different techniques for software design and hardware design
Caused division of the two fields
Design tools evolve for higher levels of abstraction
Different rate in each field
Hardware/software design fields rejoining
Both can start from behavioral description in sequential program model
30 years longer for hardware design to reach this step in the ladder
Many more design dimensions
Optimization critical
[Figure: The codesign ladder. Software side: Compilers (1960s, 1970s); Assemblers, linkers (1950s, 1960s); Microprocessor plus program bits. Hardware side: Behavioral synthesis (1990s); Register transfers; RT synthesis (1980s, 1990s); Logic synthesis (1970s, 1980s); Logic gates; VLSI, ASIC, or PLD implementation. Both sides end at Implementation]
Hardware/software parallel evolution
Software design evolution
Assemblers
convert assembly programs into machine instructions
Compilers
translate sequential programs into assembly
Hardware design evolution
Interconnected logic gates
Logic synthesis
converts logic equations or FSMs into gates
Register-transfer (RT) synthesis
converts FSMDs into FSMs, logic equations, predesigned RT components (registers, adders, etc.)
Behavioral synthesis
converts sequential programs into FSMDs
[Figure: The codesign ladder]
Increasing abstraction level
Higher abstraction level focus of hardware/software design evolution
Description smaller/easier to capture
E.g., Line of sequential program code can translate to 1000 gates
Many more possible implementations available
(a) Like flashlight, the higher above the ground, the more ground illuminated
Sequential program designs may differ in performance/transistor count by orders of magnitude
Logic-level designs may differ by only power of 2
(b) Design process proceeds to lower abstraction level, narrowing in on single implementation
[Figure: (a) and (b): abstraction levels from idea through back-of-the-envelope, sequential program, register-transfers and logic down to implementation; moving down, modeling cost increases and opportunities decrease]
Synthesis
Automatically converting system's behavioral description to a structural implementation
Complex whole formed by parts
Structural implementation must optimize design metrics
More expensive, complex than compilers

Cost = $100s to $10,000s
User controls 100s of synthesis options
Optimization critical
Otherwise could use software
Optimizations different for each user

Run time = hours, days

Gajski's Y-chart
Each axis represents type of description
Behavioral
Defines outputs as function of inputs
Algorithms but no implementation
Structural
Implements behavior by connecting components with known behavior
Physical
Gives size/locations of components and wires on chip/board
Synthesis converts behavior at given level to structure at same level or lower
E.g.,
FSM -> gates, flip-flops (same level)
FSM -> transistors (lower level)
FSM X registers, FUs (higher level)
FSM X processors, memories (higher level)
[Figure: Y-chart. Behavior axis: Sequential programs, Register transfers, Logic equations/FSM, Transfer functions. Structural axis: Processors, memories; Registers, FUs, MUXs; Gates, flip-flops; Transistors. Physical axis: Cell Layout, Modules, Chips, Boards]
Logic synthesis
Logic-level behavior to structural implementation

Logic equations and/or FSM to connected gates
Combinational logic synthesis

Two-level minimization (Sum of products/product of sums)
Best possible performance
Longest path = 2 gates (AND gate + OR gate/OR gate + AND gate)
Minimize size
Minimum cover
Minimum cover that is prime
Heuristics
Multilevel minimization
Trade performance for size
Pareto-optimal solution
Heuristics
FSM synthesis
State minimization
State encoding

Two-level minimization
Represent logic function as sum of products (or product of sums)
AND gate for each product
OR gate for each sum
Gives best possible performance
At most 2 gate delay
Goal: minimize size
Minimum cover
Minimum # of AND gates (sum of products)
Minimum cover that is prime
Minimum # of inputs to each AND gate (sum of products)
Sum of products
F = abc'd' + a'b'cd + a'bcd + ab'cd
Direct implementation
4 4-input AND gates and 1 4-input OR gate
40 transistors
[Figure: direct implementation of F from inputs a, b, c, d]
Minimum cover
Minimum # of AND gates (sum of products)
Literal: variable or its complement
a or a', b or b', etc.
Minterm: product of literals
Each literal appears exactly once
abc'd', ab'cd, a'bcd, etc.
Implicant: product of literals
Each literal appears no more than once
abc'd', a'cd, etc.
Covers 1 or more minterms
a'cd covers a'bcd and a'b'cd
Cover: set of implicants that covers all minterms of function
Minimum cover: cover with minimum # of implicants
Minimum cover: K-map approach
Karnaugh map (K-map)
1 represents minterm
Circle represents implicant
Minimum cover
Covering all 1's with min # of circles
Example: direct vs. min cover
Fewer gates
4 vs. 5
Fewer transistors
28 vs. 40
K-map: sum of products
        cd
ab      00  01  11  10
00      0   0   1   0
01      0   0   1   0
11      1   0   0   0
10      0   0   1   0
K-map: minimum cover (a'b'cd and a'bcd circled together as a'cd)
F = abc'd' + a'cd + ab'cd
Minimum cover implementation
2 4-input AND gates
1 3-input AND gate
1 3-input OR gate
28 transistors

Minimum cover that is prime
Minimum # of inputs to AND gates
Prime implicant
Implicant not covered by any other implicant
Max-sized circle in K-map
Minimum cover that is prime
Covering with min # of prime implicants
Min # of max-sized circles
Example: prime cover vs. min cover
Same # of gates
4 vs. 4
Fewer transistors
26 vs. 28
K-map: minimum cover that is prime
F = abc'd' + a'cd + b'cd
Implementation
1 4-input AND gate
2 3-input AND gates
1 3-input OR gate
26 transistors
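That the direct sum of products, the minimum cover and the prime cover all describe the same function F can be checked by brute force over the 16 input combinations. A sketch with hypothetical function names:

```c
#include <stdbool.h>

/* Direct sum of products: F = abc'd' + a'b'cd + a'bcd + ab'cd */
bool f_direct(bool a, bool b, bool c, bool d) {
    return (a && b && !c && !d) || (!a && !b && c && d) ||
           (!a && b && c && d) || (a && !b && c && d);
}

/* Minimum cover: F = abc'd' + a'cd + ab'cd */
bool f_min_cover(bool a, bool b, bool c, bool d) {
    return (a && b && !c && !d) || (!a && c && d) || (a && !b && c && d);
}

/* Minimum cover that is prime: F = abc'd' + a'cd + b'cd */
bool f_prime_cover(bool a, bool b, bool c, bool d) {
    return (a && b && !c && !d) || (!a && c && d) || (!b && c && d);
}

/* Number of input combinations on which the three forms disagree. */
int cover_mismatches(void) {
    int mismatches = 0;
    for (int x = 0; x < 16; x++) {
        bool a = (x >> 3) & 1, b = (x >> 2) & 1;
        bool c = (x >> 1) & 1, d = x & 1;
        bool ref = f_direct(a, b, c, d);
        if (f_min_cover(a, b, c, d) != ref ||
            f_prime_cover(a, b, c, d) != ref)
            mismatches++;
    }
    return mismatches;
}
```

`cover_mismatches()` returns 0, so the smaller covers change only the gate and transistor count, not the function.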
Minimum cover: heuristics

K-maps give optimal solution every time
Functions with > 6 inputs too complicated
Use computer-based tabular method
Finds all prime implicants

Finds min cover that is prime
Also optimal solution every time
Problem: 2^n minterms for n inputs
32 inputs = 4 billion minterms
Exponential complexity
Heuristic
Solution technique where optimal solution not guaranteed
Hopefully comes close
Heuristics: iterative improvement

Start with initial solution
i.e., original logic equation
Repeatedly make modifications toward better solution

Common modifications
Expand
Replace each nonprime implicant with a prime implicant covering it
Delete all implicants covered by new prime implicant
Reduce
Opposite of expand
Reshape
Expands one implicant while reducing another
Maintains total # of implicants
Irredundant
Selects min # of implicants that cover from existing implicants
Synthesis tools differ in modifications used and the order they are used
Multilevel logic minimization
Trade performance for size
Increase delay for lower # of gates
Gray area represents all possible solutions
Circle with X represents ideal solution
Generally not possible
2-level gives best performance
max delay = 2 gates
Solve for smallest size
Multilevel gives pareto-optimal solution
Minimum delay for a given size
Minimum size for a given delay
[Figure: solution space plotted as size versus delay, with the 2-level minim. solution marked]
Example
Minimized 2-level logic function:
F = adef + bdef + cdef + gh
Requires 5 gates with 18 total gate inputs
4 ANDS and 1 OR
After algebraic manipulation:
F = (a + b + c)def + gh
Requires only 4 gates with 11 total gate inputs
2 ANDS and 2 ORs
Fewer inputs per gate
Assume gate inputs = 2 transistors
Reduced by 14 transistors
36 (18 * 2) down to 22 (11 * 2)
Sacrifices performance for size
Inputs a, b, and c now have 3-gate delay
Iterative improvement heuristic commonly used
[Figure: gate-level circuits of the 2-level minimized and multilevel minimized versions of F, with inputs a through h]
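The algebraic manipulation in this example can be verified exhaustively, and the transistor saving recomputed, with a short C sketch. The names are hypothetical; the 8 inputs a to h are packed into one byte with a as the most significant bit, and the 2-transistors-per-gate-input estimate is the one stated in the example.

```c
#include <stdbool.h>

/* 2-level form: F = adef + bdef + cdef + gh */
bool f_two_level(unsigned x) {
    bool a = (x >> 7) & 1, b = (x >> 6) & 1, c = (x >> 5) & 1;
    bool d = (x >> 4) & 1, e = (x >> 3) & 1, f = (x >> 2) & 1;
    bool g = (x >> 1) & 1, h = x & 1;
    return (a && d && e && f) || (b && d && e && f) ||
           (c && d && e && f) || (g && h);
}

/* Multilevel form: F = (a + b + c)def + gh */
bool f_multilevel(unsigned x) {
    bool a = (x >> 7) & 1, b = (x >> 6) & 1, c = (x >> 5) & 1;
    bool d = (x >> 4) & 1, e = (x >> 3) & 1, f = (x >> 2) & 1;
    bool g = (x >> 1) & 1, h = x & 1;
    return ((a || b || c) && d && e && f) || (g && h);
}

/* Number of the 256 input combinations on which the two forms differ. */
int multilevel_mismatches(void) {
    int mismatches = 0;
    for (unsigned x = 0; x < 256; x++)
        if (f_two_level(x) != f_multilevel(x))
            mismatches++;
    return mismatches;
}

/* Size estimate used in the example: 2 transistors per gate input. */
unsigned transistors_for(unsigned gate_inputs) {
    return 2u * gate_inputs;
}
```

The two forms agree on all inputs, and 18 versus 11 gate inputs gives 36 versus 22 transistors.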
FSM synthesis
FSM to gates
State minimization
Reduce # of states
Identify and merge equivalent states
Outputs, next states same for all possible inputs
Tabular method gives exact solution
Table of all possible state pairs
If n states, n^2 table entries
Thus, heuristics used with large # of states
State encoding
Unique bit sequence for each state

If n states, log2(n) bits
n! possible encodings
Thus, heuristics common

Technology mapping
Library of gates available for implementation
Simple
only 2-input AND,OR gates
Complex
various-input AND,OR,NAND,NOR,etc. gates
Efficiently implemented meta-gates (i.e., AND-OR-INVERT,MUX)
Final structure consists of specified library's components only

If technology mapping integrated with logic synthesis
More efficient circuit
More complex problem
Heuristics required

Complexity impact on user
As complexity grows, heuristics used

Heuristics differ tremendously among synthesis tools
Computationally expensive
Higher quality results

Variable optimization effort settings
Long run times (hours, days)
Requires huge amounts of memory
Typically needs to run on servers, workstations
Fast heuristics
Lower quality results

Shorter run times (minutes, hours)
Smaller amount of memory required
Could run on PC
Super-linear-time (e.g., n^3) heuristics usually used

User can partition large systems to reduce run times/size
100^3 > 50^3 + 50^3 (1,000,000 > 250,000)

Integrating logic design and physical design

Past
Gate delay much greater than wire delay
Thus, performance evaluated as # of levels
of gates only
Today
Gate delay shrinking as feature size shrinking
Wire delay increasing
Performance evaluation needs wire length
Transistor placement (needed for wire length) domain of physical design
Thus, simultaneous logic synthesis and physical design required for efficient circuits
[Figure: delay plotted against reduced feature size, with curves for wire and transistor]
Register-transfer synthesis
Converts FSMD to custom single-purpose processor
Datapath
Register units to store variables
Complex data types
Functional units
Arithmetic operations
Connection units
Buses, MUXs
FSM controller
Controls datapath
Key sub problems:

Allocation
Instantiate storage, functional, connection units
Binding
Mapping FSMD operations to specific units
High-level synthesis
Converts single sequential program to single-purpose processor
Does not require the program to schedule states
Key sub problems

Allocation
Binding
Scheduling
Assign sequential program's operations to states
Conversion template given in Ch. 2
Optimizations important
Compiler
Constant propagation, dead-code elimination, loop unrolling
Advanced techniques for allocation, binding, scheduling

System synthesis
Convert 1 or more processes into 1 or more processors (system)
For complex embedded systems
Multiple processes may provide better performance/power
May be better described using concurrent sequential programs
Tasks
Transformation
Can merge 2 exclusive processes into 1 process

Can break 1 large process into separate processes
Procedure inlining
Loop unrolling
Allocation
Essentially design of system architecture
Select processors to implement processes
Also select memories and busses
System synthesis
Tasks (cont.)
Partitioning
Mapping 1 or more processes to 1 or more processors
Variables among memories
Communications among buses
Scheduling
Multiple processes on a single processor
Memory accesses
Bus communications
Tasks performed in variety of orders

Iteration among tasks common

System synthesis
Synthesis driven by constraints
E.g.,
Meet performance requirements at minimum cost
Allocate as much behavior as possible to general-purpose processor
Low-cost/flexible implementation
Minimum # of SPPs used to meet performance
System synthesis for GPP only (software)

Common for decades
Multiprocessing
Parallel processing
Real-time scheduling
Hardware/software codesign
Simultaneous consideration of GPPs/SPPs during synthesis
Made possible by maturation of behavioral synthesis in 1990s
Temporal vs. spatial thinking

Design thought process changed by evolution of synthesis
Before synthesis
Designers worked primarily in structural domain
Connecting simpler components to build more complex systems
Connecting logic gates to build controller
Connecting registers, MUXs, ALUs to build datapath
"Capture and simulate" era

Capture using CAD tools
Simulate to verify correctness before fabricating
Spatial thinking
Structural diagrams
Data sheets

Temporal vs. spatial thinking

After synthesis
Designers work primarily in behavioral domain
describe and synthesize era
Describe FSMDs or sequential programs
Synthesize into structure
Temporal thinking
States or sequential statements have relationship over time
Strong understanding of hardware structure still important

Behavioral description must synthesize to efficient structural
implementation

Verification
Ensuring design is correct and complete
Correct
Implements specification accurately
Complete
Describes appropriate output to all relevant input
Formal verification
Hard
For small designs or verifying certain key properties only
Simulation
Most common verification method

Formal verification
Analyze design to prove or disprove certain properties
Correctness example
Prove ALU structural implementation equivalent to behavioral
description
Derive Boolean equations for outputs
Create truth table for equations
Compare to truth table from original behavior
Completeness example
Formally prove elevator door can never open while elevator is moving
Derive conditions for door being open
Show conditions conflict with conditions for elevator moving

Simulation
Create computer model of design
Provide sample input
Check for acceptable output
Correctness example
ALU
Provide all possible input combinations
Check outputs for correct results
Completeness example
Elevator door closed when moving
Provide all possible input sequences
Check door always closed when elevator moving

Increases confidence
Simulating all possible input sequences impossible for most
systems
E.g., 32-bit ALU
2^32 * 2^32 = 2^64 possible input combinations
At 1 million combinations/sec
About half a million years to simulate
Sequential circuits even worse
Can only simulate tiny subset of possible inputs

Typical values
Known boundary conditions
E.g., 32-bit ALU
Both operands all 0s
Both operands all 1s
Increases confidence of correctness/completeness

Does not prove
Advantages over physical implementation

Controllability
Control time
Stop/start simulation at any time
Control data values

Inputs or internal values
Observability
Examine system/environment values at any time
Debugging
Can stop simulation at any point and:
Observe internal values
Modify system/environment values before restarting
Can step through small intervals (e.g., 500 nanoseconds)

Disadvantages
Simulation setup time
Often has complex external environments
Could spend more time modeling environment than system
Models likely incomplete

Some environment behavior undocumented if complex environment
May not model behavior correctly
Simulation speed much slower than actual execution

Sequentializing parallel design
IC: gates operate in parallel
Simulation: analyze inputs, generate outputs for each gate 1 at time
Several programs added between simulated system and real hardware

1 simulated operation:
= 10 to 100 simulator operations
= 100 to 10,000 operating system operations
= 1,000 to 100,000 hardware operations

Simulation speed
Relative speeds of different types of
simulation/emulation
1 hour actual execution of SOC
= 1.2 years instruction-set simulation
= 10,000,000 hours gate-level simulation
Platform / model                          Slowdown       Time for 1 hour of actual execution
IC                                        x1             1 hour
FPGA                                      x10            1 day
hardware emulation                        x100           4 days
throughput model                          x1,000         1.4 months
instruction-set simulation                x10,000        1.2 years
cycle-accurate simulation                 x100,000       12 years
register-transfer-level HDL simulation    x1,000,000     >1 lifetime
gate-level HDL simulation                 x10,000,000    1 millennium
Overcoming long simulation time

Reduce amount of real time simulated
1 msec execution instead of 1 hour
0.001 sec * 10,000,000 = 10,000 sec, or about 3 hours
Reduced confidence
1 msec of cruise controller operation tells us little
Faster simulator
Emulators
Special hardware for simulations
Less precise/accurate simulators

Exchange speed for observability/controllability
Reducing precision/accuracy
Don't need gate-level analysis for all simulations
E.g., cruise control
Don't care what happens at every input/output of each logic gate
Simulating RT components ~10x faster

Cycle-based simulation ~100x faster
Accurate at clock boundaries only
No information on signal changes between boundaries
Faster simulator often combined with reduction in real time

If willing to simulate for 10 hours
Use instruction-set simulator
Real execution time simulated
10 hours * 1 / 10,000
= 0.001 hour
= 3.6 seconds
Hardware/software co-simulation
Variety of simulation approaches exist
From very detailed
E.g., gate-level model
To very abstract
E.g., instruction-level model
Simulation tools evolved separately for hardware/software

Recall separate design evolution
Software (GPP)
Typically with instruction-set simulator (ISS)
Hardware (SPP)
Typically with models in HDL environment
Integration of GPP/SPP on single IC creating need for merging

simulation tools
Integrating GPP/SPP simulations

Simple/naive way
HDL model of microprocessor
Runs system software
Much slower than ISS
Less observable/controllable than ISS
HDL models of SPPs

Integrate all models
Hardware-software co-simulator
ISS for microprocessor

HDL model for SPPs
Create communication between simulators
Simulators run separately except when transferring data
Faster
Though, frequent communication between ISS and HDL model slows it down

Minimizing communication
Memory shared between GPP and SPPs
Where should memory go?
In ISS
HDL simulator must stall for memory access
In HDL?
ISS must stall when fetching each instruction
Model memory in both ISS and HDL

Most accesses by each model unrelated to the other's accesses
No need to communicate these between models
Co-simulator ensures consistency of shared data

Huge speedups (100x or more) reported with this technique

Emulators
General physical device system mapped to
Microprocessor emulator
Microprocessor IC with some monitoring, control circuitry
SPP emulator
FPGAs (10s to 100s)
Usually supports debugging tasks
Created to help solve simulation disadvantages

Mapped relatively quickly
Hours, days
Can be placed in real environment

No environment setup time
No incomplete environment
Typically faster than simulation

Hardware implementation

Disadvantages
Still not as fast as real implementations
E.g., emulated cruise-control may not respond fast enough to
keep control of car
Mapping still time consuming

E.g., mapping complex SOC to 10 FPGAs
Just partitioning into 10 parts could take weeks
Can be very expensive

Top-of-the-line FPGA-based emulator: $100,000 to $1mill
Leads to resource bottleneck
Can maybe only afford 1 emulator
Groups wait days, weeks for other group to finish using
Reuse: intellectual property cores

Commercial off-the-shelf (COTS) components
Predesigned, prepackaged ICs

Implements GPP or SPP
Reduces design/debug time
Have always been available
System-on-a-chip (SOC)
All components of system implemented on single chip
Made possible by increasing IC capacities
Changing the way COTS components sold
As intellectual property (IP) rather than actual IC
Behavioral, structural, or physical descriptions
Processor-level components known as cores
SOC built by integrating multiple descriptions

Cores
Soft core
Synthesizable behavioral description
Typically written in HDL (VHDL/Verilog)
Firm core
Structural description
Typically provided in HDL
Hard core
Physical description
Provided in variety of physical layout file formats
[Figure: Gajski's Y-chart. Behavior axis: sequential programs, register transfers, logic equations/FSM, transfer functions. Structural axis: processors and memories; registers, FUs, MUXs; gates, flip-flops; transistors. Physical axis: cell layout, modules, chips, boards.]
Advantages/disadvantages of hard core

Ease of use
Developer already designed and tested core
Can use right away
Can expect to work correctly
Predictability
Size, power, performance predicted accurately
Not easily mapped (retargeted) to different process

E.g., core available for vendor X's 0.25 micrometer CMOS process
Can't use with vendor X's 0.18 micrometer process
Can't use with vendor Y
Advantages/disadvantages of soft/firm cores

Soft cores
Can be synthesized to nearly any technology
Can optimize for particular use
E.g., delete unused portion of core
Lower power, smaller designs
Requires more design effort

May not work in technology not tested for
Not as optimized as hard core for same processor
Firm cores
Compromise between hard and soft cores
Some retargetability
Limited optimization
Better predictability/ease of use
New challenges to processor providers

Cores have dramatically changed business model
Pricing models
Past
Vendors sold product as IC to designers
Designers must buy any additional copies
Could not (economically) copy from original
Today
Vendors can sell as IP
Designers can make as many copies as needed
Vendor can use different pricing models

Royalty-based model
Similar to old IC model
Designer pays for each additional model
Fixed price model
One price for IP and as many copies as needed
Many other models used
IP protection
Past
Illegally copying IC very difficult
Reverse engineering required tremendous, deliberate effort
Accidental copying not possible
Today
Cores sold in electronic format
Deliberate/accidental unauthorized copying easier

Safeguards greatly increased
Contracts to ensure no copying/distributing
Encryption techniques
limit actual exposure to IP
Watermarking
determines if particular instance of processor was copied
whether copy authorized
New challenges to processor users
Licensing arrangements
Not as easy as purchasing IC
More contracts enforcing pricing model and IP protection
Possibly requiring legal assistance
Extra design effort

Especially for soft cores
Must still be synthesized and tested
Minor differences in synthesis tools can cause problems
Verification requirements more difficult

Extensive testing for synthesized soft cores and soft/firm cores mapped to particular
technology
Ensure correct synthesis
Timing and power vary between implementations
Early verification critical

Cores buried within IC
Cannot simply replace bad core
Design process model

Describes order that design steps are processed
Behavior description step
Behavior to structure conversion step
Mapping structure to physical implementation
step
Waterfall model
Proceed to next step only after current step completed
Spiral model
Proceed through 3 steps in order but with less detail
Repeat 3 steps gradually increasing detail
Keep repeating until desired system obtained
Becoming extremely popular (hardware & software development)
[Figure: waterfall design model and spiral design model, each over the behavioral, structural, and physical steps]
Waterfall method
Not very realistic
Bugs often found in later steps that must be fixed in earlier step
E.g., forgot to handle certain input condition
Prototype often needed to know complete desired behavior
E.g., customer adds features after product demo
System specifications commonly change
E.g., to remain competitive by reducing power, size
Certain features dropped
Unexpected iterations back through 3 steps cause missed deadlines
Lost revenues
May never make it to market
[Figure: waterfall design model (behavioral, structural, physical)]
Spiral method
First iteration of 3 steps incomplete
Much faster, though
End up with prototype
Use to test basic functions
Get idea of functions to add/remove
Original iteration experience helps in following iterations of 3 steps
Must come up with ways to obtain structure and physical implementations quickly
E.g., FPGAs for prototype, silicon for final product
May have to use more tools
Extra effort/cost
Could require more time than waterfall method
If correct implementation first time with waterfall
[Figure: spiral design model (behavioral, structural, physical)]
General-purpose processor design models

Previous slides focused on SPPs
Can apply equally to GPPs
Waterfall model
Structure developed by particular company

Acquired by embedded system designer
Designer develops software (behavior)
Designer maps application to architecture
Compilation
Manual design
Spiral-like model
Beginning to be applied by embedded system designers
Spiral-like model
Designer develops or acquires architecture

Develops application(s)
Maps application to architecture
Analyzes design metrics
Now makes choice
Modify mapping
Modify application(s) to better suit architecture
Modify architecture to better suit application(s)
Not as difficult now
Maturation of synthesis/compilers
IPs can be tuned
Continue refining to lower abstraction level until particular implementation chosen
[Figure: Y-chart with architecture, application(s), mapping, and analysis]
Summary
Design technology seeks to reduce gap between IC
capacity growth and designer productivity growth
Synthesis has changed digital design
Increased IC capacity means sw/hw components
coexist on one chip
Design paradigm shift to core-based design
Simulation essential but hard
Spiral design process is popular

Chapter 7 Digital Camera Example
Outline
Introduction to a simple digital camera

Designer's perspective
Requirements specification
Design
Four implementations

Introduction
Putting it all together
Instruction-set processor (GPP, ASIP)
Single-purpose processor
Custom
Standard
Memory
Interfacing
Knowledge applied to designing a simple digital camera
GPP/ASIP vs. single-purpose processors
Partitioning of functionality among different processor types
Introduction to a simple digital camera

Captures images
Stores images in digital format
No film
Multiple images stored in camera
Number depends on amount of memory and bits used per image
Downloads images to PC
Only recently possible
Systems-on-a-chip
Multiple processors and memories on one IC
High-capacity flash memory
Very simple description used for example

Many more features with real digital camera
Variable size images, image deletion, digital stretching, zooming in and out, etc.
Designer's perspective
Two key tasks
Processing images and storing in memory
When shutter pressed:
Image captured
Converted to digital form by charge-coupled device (CCD)
Compressed and archived in internal memory
Uploading images to PC
Digital camera attached to PC
Special software commands camera to transmit archived
images serially
Charge-coupled device (CCD)

Special sensor that captures a B/W image (8 bits/pixel, 16 bits/pixel, ...)
Light-sensitive silicon solid-state device composed of many cells
When exposed to light, each cell becomes electrically charged. This charge can then be converted to an 8-bit value where 0 represents no exposure while 255 represents very intense exposure of that cell to light.
The electromechanical shutter is activated to expose the cells to light for a brief moment.
Some of the columns are covered with a black strip of paint. The light intensity of these pixels is used for zero-bias adjustments of all the cells.
The electronic circuitry, when commanded, discharges the cells, activates the electromechanical shutter, and then reads the 8-bit charge value of each cell. These values can be clocked out of the CCD by external logic through a standard parallel bus interface.
[Figure: CCD layout showing lens area, pixel rows, pixel columns, covered columns, electromechanical shutter, and electronic circuitry]

Zero-bias error
Manufacturing errors cause cells to measure slightly above or below actual
light intensity
Error is typically the same across columns, but is different across rows
Some of the leftmost columns blocked by black paint to detect zero-bias error
Reading of other than 0 in blocked cells is zero-bias error
Each row is corrected by subtracting the average error found in blocked cells for
that row
Before zero-bias adjustment (8 pixel columns, then the 2 covered cells of each row):
136 170 155 140 144 115 112 248 | 12 14
145 146 168 123 120 117 119 147 | 12 10
144 153 168 117 121 127 118 135 |  9  9
176 183 161 111 186 130 132 133 |  0  0
144 156 161 133 192 153 138 139 |  7  7
122 131 128 147 206 151 131 127 |  2  0
121 155 164 185 254 165 138 129 |  4  4
173 175 176 183 188 184 117 129 |  5  5

Zero-bias adjustment per row: -13, -11, -9, 0, -7, -1, -4, -5

After zero-bias adjustment:
123 157 142 127 131 102  99 235
134 135 157 112 109 106 108 136
135 144 159 108 112 118 109 126
176 183 161 111 186 130 132 133
137 149 154 126 185 146 131 132
121 130 127 146 205 150 130 126
117 151 160 181 250 161 134 125
168 170 171 178 183 179 112 124
Compression
Store more images
Transmit image to PC in less time
JPEG (Joint Photographic Experts Group)
Popular standard format for representing digital images in a compressed
form
Provides for a number of different modes of operation
Mode used in this chapter provides high compression ratios using DCT
(discrete cosine transform)
Image data divided into blocks of 8 x 8 pixels
3 steps performed on each block
DCT
Quantization
Huffman encoding
DCT step
Transforms original 8 x 8 block into a cosine-frequency
domain
Upper-left corner values represent more of the essence of the image
Lower-right corner values represent finer details
Can reduce precision of these values and retain reasonable image quality
FDCT (Forward DCT) formula

C(h) = 1/\sqrt{2} if h = 0, else 1.0
Auxiliary function used in main function F(u,v)
F(u,v) = \frac{1}{4} C(u) C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} D_{xy} \cos\left(\frac{\pi (2x+1) u}{16}\right) \cos\left(\frac{\pi (2y+1) v}{16}\right)
Gives encoded pixel at row u, column v
Dxy is original pixel value at row x, column y
IDCT (Inverse DCT)

Reverses process to obtain original block (not needed for this design)
Quantization step
Achieve high compression ratio by reducing image
quality (lossy compression)
Reduce bit precision of encoded data
Fewer bits needed for encoding
One way is to divide all values by a power of 2
Simple right shifts can do this
Dequantization would reverse process for decompression

After being encoded using DCT:
1150   39  -43  -10   26  -83   11   41
 -81   -3  115  -73   -6   -2   22   -5
  14  -11    1  -42   26   -3   17  -38
   2  -61  -13  -12   36  -23  -18    5
  44   13   37   -4   10  -21    7   -8
  36  -11   -9   -4   20  -28  -21   14
 -19   -7   21   -6    3    3   12  -21
  -5  -13  -11  -17   -4   -1    7   -4

Divide each cell's value by 8

After quantization:
 144    5   -5   -1    3  -10    1    5
 -10    0   14   -9   -1    0    3   -1
   2   -1    0   -5    3    0    2   -5
   0   -8   -2   -2    5   -3   -2    1
   6    2    5   -1    1   -3    1   -1
   5   -1   -1   -1    3   -4   -3    2
  -2   -1    3   -1    0    0    2   -3
  -1   -2   -1   -2   -1    0    1   -1
Huffman encoding step

Serialize 8 x 8 block of pixels
Values are converted into single list using zigzag pattern
Perform Huffman encoding

More frequently occurring pixels assigned short binary code
Longer binary codes left for less frequently occurring pixels
Each pixel in serial list converted to Huffman encoded values

Much shorter list, thus compression

Huffman encoding example
Pixel frequencies on left
Pixel value -1 occurs 15 times
Pixel value 14 occurs 1 time
Build Huffman tree from bottom up
Create one leaf node for each pixel value and assign frequency as node's value
Create an internal node by joining any two nodes whose sum is a minimal value
This sum is internal node's value
Repeat until complete binary tree
Traverse tree from root to leaf to obtain binary code for leaf's pixel value
Append 0 for left traversal, 1 for right traversal
Pixel value   Frequency   Huffman code
-1            15x         00
0             8x          100
-2            6x          110
1             5x          010
2             5x          1110
3             5x          1010
5             5x          0110
-3            4x          11110
-5            3x          10110
-10           2x          01110
144           1x          111111
-9            1x          111110
-8            1x          101111
-4            1x          101110
6             1x          011111
14            1x          011110
[Figure: Huffman tree built from the pixel frequencies]
Huffman encoding is reversible
No code is a prefix of another code

Archive step
Record starting address and image size
Can use linked list
One possible way to archive images

If max number of images archived is N:
Set aside memory for N addresses and N image-size variables

Keep a counter for location of next available address
Initialize addresses and image-size variables to 0
Set global memory address to N x 4
Assuming addresses, image-size variables occupy N x 4 bytes
First image archived starting at address N x 4

Global memory address updated to N x 4 + (compressed image size)
Memory requirement based on N, image size, and average

compression ratio
Uploading to PC
When connected to PC and upload command received
Read images from memory
Transmit serially using UART
While transmitting
Reset pointers, image-size variables and global memory pointer
accordingly

Requirements Specification
System's requirements: what system should do
Nonfunctional requirements
Constraints on design metrics (e.g., should use 0.001 watt or less)
Functional requirements
System's behavior (e.g., output X should be input Y times 2)
Initial specification may be very general and come from marketing dept.
E.g., short document detailing market need for a low-end digital camera that:
captures and stores at least 50 low-res images and uploads to PC,

costs around $100 with single medium-size IC costing less than $25,
has as long a battery life as possible,
has expected sales volume of 200,000 if market entry < 6 months,
100,000 if between 6 and 12 months,
insignificant sales beyond 12 months

Nonfunctional requirements
Design metrics of importance based on initial specification
Performance: time required to process image

Size: number of elementary logic gates (2-input NAND gate) in IC
Power: measure of avg. electrical energy consumed while processing
Energy: battery lifetime (power x time)
Constrained metrics
Values must be below (sometimes above) certain threshold
Optimization metrics
Improved as much as possible to improve product
A metric can be both constrained and optimization

Nonfunctional requirements (cont.)
Performance
Must process image fast enough to be useful
1 sec reasonable constraint
Slower would be annoying
Faster not necessary for low-end of market
Therefore, constrained metric
Size
Must use IC that fits in reasonably sized camera
Constrained and optimization metric
Constraint could be 200,000 gates, but smaller would be cheaper
Power
Must operate below certain temperature (cooling fan not possible)
Therefore, constrained metric
Energy
Reducing power or time reduces energy
Optimized metric: want battery to last as long as possible

Informal functional specification

Flowchart breaks functionality down into simpler functions
Each function's details could then be described in English
Done earlier in chapter
Low quality image has resolution of 64 x 64 (only for example; typically 640x480 or more)
Mapping functions to a particular processor type not done at this stage
[Figure: flowchart with steps CCD input, Zero-bias adjust, DCT, Quantize, Archive in memory, and Transmit serially (serial output, e.g., 011010...), and decisions "More 8x8 blocks?" and "Done?"]
Refined functional specification

Refine informal specification into one that can actually be executed
Can use C/C++ code to describe each function
Called system-level model, prototype, or simply model
Also is first implementation
Can provide insight into operations of system
Profiling can find computationally intensive functions
Can obtain sample output used to verify correctness of final implementation
[Figure: executable model of digital camera, with an image file as input, modules CCD.C, CCDPP.C, CNTRL.C, CODEC.C, and UART.C, and an output file]
CCD module
Simulates real CCD

CcdInitialize is passed name of image file
CcdCapture reads image from file
CcdPopPixel outputs pixels one at a time
#include <stdio.h>
#define SZ_ROW 64
#define SZ_COL (64 + 2)   /* 64 image columns plus 2 covered columns */
static FILE *imageFileHandle;
static char buffer[SZ_ROW][SZ_COL];
static unsigned rowIndex, colIndex;

void CcdInitialize(const char *imageFileName) {
    imageFileHandle = fopen(imageFileName, "r");
    rowIndex = -1;
    colIndex = -1;
}

/* Reads the whole image from the file into buffer */
void CcdCapture(void) {
    int pixel;
    rewind(imageFileHandle);
    for(rowIndex=0; rowIndex<SZ_ROW; rowIndex++) {
        for(colIndex=0; colIndex<SZ_COL; colIndex++) {
            if( fscanf(imageFileHandle, "%i", &pixel) == 1 ) {
                buffer[rowIndex][colIndex] = (char)pixel;
            }
        }
    }
    rowIndex = 0;
    colIndex = 0;
}

/* Returns the next pixel in row-major order */
char CcdPopPixel(void) {
    char pixel;
    pixel = buffer[rowIndex][colIndex];
    if( ++colIndex == SZ_COL ) {
        colIndex = 0;
        if( ++rowIndex == SZ_ROW ) {
            colIndex = -1;
            rowIndex = -1;
        }
    }
    return pixel;
}
CCDPP (CCD PreProcessing) module
Performs zero-bias adjustment

CcdppCapture uses CcdCapture and CcdPopPixel to obtain
image
Performs zero-bias adjustment after each row read in
#define SZ_ROW 64
#define SZ_COL 64
static char buffer[SZ_ROW][SZ_COL];
static unsigned rowIndex, colIndex;

/* Provided by the CCD module */
void CcdCapture(void);
char CcdPopPixel(void);

void CcdppInitialize() {
    rowIndex = -1;
    colIndex = -1;
}

void CcdppCapture(void) {
    char bias;
    CcdCapture();
    for(rowIndex=0; rowIndex<SZ_ROW; rowIndex++) {
        /* read the 64 image pixels of this row */
        for(colIndex=0; colIndex<SZ_COL; colIndex++) {
            buffer[rowIndex][colIndex] = CcdPopPixel();
        }
        /* the next two values are the covered cells of this row */
        bias = (CcdPopPixel() + CcdPopPixel()) / 2;
        for(colIndex=0; colIndex<SZ_COL; colIndex++) {
            buffer[rowIndex][colIndex] -= bias;
        }
    }
    rowIndex = 0;
    colIndex = 0;
}

char CcdppPopPixel(void) {
    char pixel;
    pixel = buffer[rowIndex][colIndex];
    if( ++colIndex == SZ_COL ) {
        colIndex = 0;
        if( ++rowIndex == SZ_ROW ) {
            colIndex = -1;
            rowIndex = -1;
        }
    }
    return pixel;
}
UART module
Actually a half UART

Only transmits, does not receive
UartInitialize is passed name of file to output to
UartSend transmits (writes to output file) bytes at a time
#include <stdio.h>
static FILE *outputFileHandle;
void UartInitialize(const char *outputFileName) {
outputFileHandle = fopen(outputFileName, "w");
}
void UartSend(char d) {
fprintf(outputFileHandle, "%i\n", (int)d);
}

CODEC module
Models FDCT encoding
ibuffer holds original 8 x 8 block
obuffer holds encoded 8 x 8 block
CodecPushPixel called 64 times to fill ibuffer with original block
CodecDoFdct called once to transform 8 x 8 block
Explained in next slide
CodecPopPixel called 64 times to retrieve encoded block from obuffer

static short ibuffer[8][8], obuffer[8][8], idx;

void CodecInitialize(void) { idx = 0; }

void CodecPushPixel(short p) {
    if( idx == 64 ) idx = 0;
    ibuffer[idx / 8][idx % 8] = p; idx++;
}

void CodecDoFdct(void) {
    int x, y;
    for(x=0; x<8; x++) {
        for(y=0; y<8; y++)
            obuffer[x][y] = FDCT(x, y, ibuffer);
    }
    idx = 0;
}

short CodecPopPixel(void) {
    short p;
    if( idx == 64 ) idx = 0;
    p = obuffer[idx / 8][idx % 8]; idx++;
    return p;
}
CODEC (cont.)
Implementing FDCT formula

C(h) = 1/\sqrt{2} if h = 0, else 1.0
F(u,v) = \frac{1}{4} C(u) C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} D_{xy} \cos\left(\frac{\pi (2x+1) u}{16}\right) \cos\left(\frac{\pi (2y+1) v}{16}\right)
Only 64 possible inputs to COS, so table can be used to save performance time
Floating-point values multiplied by 32,768 and rounded to nearest integer
32,768 chosen in order to store each value using only 2 bytes of memory
Fixed-point representation explained more later
FDCT unrolls inner loop of summation, implements outer summation as two consecutive for loops

/* COS_TABLE[x][u] holds cos(pi * (2x + 1) * u / 16) scaled by 32768.
   Caution: 32768 itself exceeds the range of a 16-bit signed short (max 32767). */
static const short COS_TABLE[8][8] = {
    { 32768,  32138,  30273,  27245,  23170,  18204,  12539,   6392 },
    { 32768,  27245,  12539,  -6392, -23170, -32138, -30273, -18204 },
    { 32768,  18204, -12539, -32138, -23170,   6392,  30273,  27245 },
    { 32768,   6392, -30273, -18204,  23170,  27245, -12539, -32138 },
    { 32768,  -6392, -30273,  18204,  23170, -27245, -12539,  32138 },
    { 32768, -18204, -12539,  32138, -23170,  -6392,  30273, -27245 },
    { 32768, -27245,  12539,   6392, -23170,  32138, -30273,  18204 },
    { 32768, -32138,  30273, -27245,  23170, -18204,  12539,  -6392 }
};

static short ONE_OVER_SQRT_TWO = 23170;

static double COS(int xy, int uv) {
    return COS_TABLE[xy][uv] / 32768.0;
}

static double C(int h) {
    return h ? 1.0 : ONE_OVER_SQRT_TWO / 32768.0;
}

static int FDCT(int u, int v, short img[8][8]) {
    double s[8], r = 0; int x;
    /* inner summation over y, unrolled */
    for(x=0; x<8; x++) {
        s[x] = img[x][0] * COS(0, v) + img[x][1] * COS(1, v) +
               img[x][2] * COS(2, v) + img[x][3] * COS(3, v) +
               img[x][4] * COS(4, v) + img[x][5] * COS(5, v) +
               img[x][6] * COS(6, v) + img[x][7] * COS(7, v);
    }
    /* outer summation over x */
    for(x=0; x<8; x++) r += s[x] * COS(x, u);
    return (short)(r * .25 * C(u) * C(v));
}

CNTRL (controller) module
Heart of the system

CntrlInitialize for consistency with other modules only
CntrlCaptureImage uses CCDPP module to input
image and place in buffer
CntrlCompressImage breaks the 64 x 64 buffer into 8 x
8 blocks and performs FDCT on each block using the
CODEC module
Also performs quantization on each block
CntrlSendImage transmits encoded image serially
using UART module
#define SZ_ROW 64
#define SZ_COL 64
#define NUM_ROW_BLOCKS (SZ_ROW / 8)
#define NUM_COL_BLOCKS (SZ_COL / 8)

static short buffer[SZ_ROW][SZ_COL], i, j, k, l, temp;

void CntrlInitialize(void) {}

void CntrlCaptureImage(void) {
    CcdppCapture();
    for(i=0; i<SZ_ROW; i++)
        for(j=0; j<SZ_COL; j++)
            buffer[i][j] = CcdppPopPixel();
}

void CntrlCompressImage(void) {
    for(i=0; i<NUM_ROW_BLOCKS; i++)
        for(j=0; j<NUM_COL_BLOCKS; j++) {
            for(k=0; k<8; k++)
                for(l=0; l<8; l++)
                    CodecPushPixel(
                        (char)buffer[i * 8 + k][j * 8 + l]);
            CodecDoFdct();/* part 1 - FDCT */
            for(k=0; k<8; k++)
                for(l=0; l<8; l++) {
                    buffer[i * 8 + k][j * 8 + l] = CodecPopPixel();
                    /* part 2 - quantization */
                    buffer[i*8+k][j*8+l] >>= 6;
                }
        }
}

void CntrlSendImage(void) {
    for(i=0; i<SZ_ROW; i++)
        for(j=0; j<SZ_COL; j++) {
            temp = buffer[i][j];
            UartSend(((char*)&temp)[0]); /* send upper byte */
            UartSend(((char*)&temp)[1]); /* send lower byte */
        }
}
Putting it all together
Main initializes all modules, then uses CNTRL module to capture, compress, and transmit one image
Note: only for off-line test; no iterative real-time behavior (no while(1))
This system-level model can be used for extensive experimentation
Bugs much easier to correct here rather than in later models
int main(int argc, char *argv[]) {
char *uartOutputFileName = argc > 1 ? argv[1] : "uart_out.txt";
char *imageFileName = argc > 2 ? argv[2] : "image.txt";
/* initialize the modules */
UartInitialize(uartOutputFileName);
CcdInitialize(imageFileName);
CcdppInitialize();
CodecInitialize();
CntrlInitialize();
/* simulate functionality */
CntrlCaptureImage();
CntrlCompressImage();
CntrlSendImage();
}

Design
Determine system's architecture

Processors
Any combination of single-purpose (custom or standard) or general-purpose processors
Memories, buses
Map functionality to that architecture

Multiple functions on one processor
One function on one or more processors
Implementation
A particular architecture and mapping
Solution space is set of all implementations
Starting point
Low-end general-purpose processor connected to flash memory
All functionality mapped to software running on processor
Usually satisfies power, size, and time-to-market constraints
If timing constraint not satisfied then later implementations could:
use single-purpose processors for time-critical functions
rewrite functional specification

Implementation 1: Microcontroller alone
Low-end processor could be Intel 8051 microcontroller (core)

Total IC cost including (application) NRE about $5
Well below 200 mW power
Time-to-market about 3 months
However, one image per second not possible
12 MHz, 12 cycles per instruction
Executes one million instructions per second
CcdppCapture has nested loops resulting in 4096 (64 x 64) iterations

~100 assembly instructions each iteration
About 409,600 (4096 x 100) instructions per image
At one million instructions per second this takes roughly 0.4 s: about half of the 1-second time budget for reading image alone
Would be over budget after adding compute-intensive DCT and Huffman encoding
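The estimate above can be checked with a few lines of arithmetic. The following is a minimal sketch in C; the function names are hypothetical, and the inputs (12 MHz clock, 12 cycles per instruction, about 100 instructions per loop iteration, 64 x 64 iterations) are the figures quoted on the slide.

```c
/* Instructions executed per second for a given clock rate and
   number of clock cycles per instruction. */
static double instr_per_second(double clock_hz, double cycles_per_instr) {
    return clock_hz / cycles_per_instr;
}

/* Seconds needed to run a rows x cols loop nest that costs
   instr_per_iter instructions per iteration (an estimate only). */
static double capture_seconds(int rows, int cols, int instr_per_iter,
                              double instr_rate) {
    return (double)rows * cols * instr_per_iter / instr_rate;
}
```

With the slide's numbers, `capture_seconds(64, 64, 100, instr_per_second(12e6, 12))` comes to about 0.41 s, which is the "half of time budget" noted above.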
Implementation 2: Microcontroller and CCDPP
[Figure: block diagram labeled SOC, showing EEPROM, 8051, RAM, UART, and CCDPP]
CCDPP function implemented on custom single-purpose processor
Improves performance: fewer microcontroller cycles
Increases NRE cost and time-to-market
Easy to implement
Simple datapath
Few states in controller
Simple UART easy to implement as standard single-purpose processor also
EEPROM for program memory and RAM for data memory added as well
Microcontroller
Synthesizable version of Intel 8051 available
Written in VHDL
Captured at register transfer level (RTL)
Fetches instruction from ROM
Decodes using Instruction Decoder
ALU executes arithmetic operations
Source and destination registers reside in RAM
Special data movement instructions used to load and store externally
Special program generates VHDL description of ROM from output of C compiler/linker
[Figure: Block diagram of Intel 8051 processor core, showing 4K ROM, Instruction Decoder, Controller, 128 RAM, ALU, and a connection to the external memory bus]

UART
UART in idle mode until invoked
UART invoked when 8051 executes store instruction with UART's enable register as target address
Memory-mapped communication between 8051 and all single-purpose processors
Lower 8 bits of memory address for RAM
Upper 8 bits of memory address for memory-mapped I/O devices
Start state transmits 0 indicating start of byte transmission, then transitions to Data state
Data state sends 8 bits serially, then transitions to Stop state
Stop state transmits 1 indicating transmission done, then transitions back to idle mode
[Figure: FSMD description of UART. Idle (I = 0) moves to Start when invoked; Start transmits LOW; Data transmits data(I), then I++, repeating while I < 8; when I = 8, Stop transmits HIGH and returns to Idle]
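The FSMD above can be mirrored by a short software model that lists the line levels sent for one byte. This is only a sketch in C: `uart_frame` is a hypothetical helper, and it assumes that data(0) is the least significant bit, which the slide does not state.

```c
/* Hypothetical model of one UART transmission: fills bits[0..9] with the
   levels the FSMD would put on the line for a single byte. */
static void uart_frame(unsigned char data, unsigned char bits[10]) {
    int i;
    bits[0] = 0;                        /* Start state: transmit LOW */
    for (i = 0; i < 8; i++)             /* Data state: I = 0 .. 7 */
        bits[1 + i] = (data >> i) & 1;  /* assumption: data(0) is the LSB */
    bits[9] = 1;                        /* Stop state: transmit HIGH */
}
```

Ten levels are produced per byte: one start bit, eight data bits, and one stop bit.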
CCDPP
Hardware implementation of zero-bias operations
Interacts with external CCD chip
CCD chip resides external to our SOC mainly because combining CCD with ordinary logic not feasible
Internal buffer, B, memory-mapped to 8051
Variables R, C are buffer's row, column indices
GetRow state reads in one row from CCD to B
66 bytes: 64 pixels + 2 blacked-out pixels
ComputeBias state computes bias for that row and stores in variable Bias
FixBias state iterates over same row subtracting Bias from each element
NextRow transitions to GetRow for repeat of process on next row or to Idle state when all 64 rows completed
[Figure: FSMD description of CCDPP. Idle (R=0, C=0) moves to GetRow when invoked; GetRow performs B[R][C]=Pxl, C=C+1 while C < 66; at C = 66, ComputeBias sets Bias=(B[R][11] + B[R][10]) / 2 and C=0; FixBias performs B[R][C]=B[R][C]-Bias while C < 64; at C = 64, NextRow performs R++, C=0 and goes to GetRow while R < 64, or to Idle when R = 64]
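The ComputeBias and FixBias states amount to a small per-row computation. The sketch below is a C model under one loudly stated assumption: that the two blacked-out pixels are the last two bytes of each 66-byte row (indices 64 and 65). The FSMD labels on the slide show indices 10 and 11, so the real hardware may place them differently. `zero_bias_row` is a hypothetical name.

```c
#define ROW_PIXELS 64   /* visible pixels per row, from the slide */
#define ROW_BYTES  66   /* 64 pixels + 2 blacked-out pixels */

/* Zero-bias one row in place.
   Assumption: the blacked-out pixels sit at indices 64 and 65. */
static void zero_bias_row(int row[ROW_BYTES]) {
    /* ComputeBias: average of the two blacked-out pixels */
    int bias = (row[ROW_BYTES - 1] + row[ROW_BYTES - 2]) / 2;
    int c;
    /* FixBias: subtract the bias from each visible pixel */
    for (c = 0; c < ROW_PIXELS; c++)
        row[c] -= bias;
}
```

Calling this once per row for all 64 rows corresponds to the GetRow / ComputeBias / FixBias / NextRow loop.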

Connecting SOC components

Memory-mapped
All single-purpose processors and RAM are connected to 8051's memory bus
Read
Processor places address on 16-bit address bus

Asserts read control signal for 1 cycle
Reads data from 8-bit data bus 1 cycle later
Device (RAM or SPP) detects asserted read control signal
Checks address
Places and holds requested data on data bus for 1 cycle
Write
Processor places address and data on address and data bus

Asserts write control signal for 1 clock cycle
Device (RAM or SPP) detects asserted write control signal
Checks address bus
Reads and stores data from data bus

Software
System-level model provides majority of code

Module hierarchy, procedure names, and main program unchanged
Code for UART and CCDPP modules must be redesigned

Simply replace with memory assignments
xdata used to load/store variables over external memory bus

_at_ specifies memory address to store these variables
Byte sent to U_TX_REG by processor will invoke UART
U_STAT_REG used by UART to indicate it's ready for next byte
UART may be much slower than processor
Similar modification for CCDPP code
All other modules untouched

Rewritten UART module

static unsigned char xdata U_TX_REG _at_ 65535;
static unsigned char xdata U_STAT_REG _at_ 65534;
void UARTInitialize(void) {}
void UARTSend(unsigned char d) {
    while( U_STAT_REG == 1 ) {
        /* busy wait */
    }
    U_TX_REG = d;
}

Original code from system-level model

#include <stdio.h>
static FILE *outputFileHandle;
void UartInitialize(const char *outputFileName) {
    outputFileHandle = fopen(outputFileName, "w");
}
void UartSend(char d) {
    fprintf(outputFileHandle, "%i\n", (int)d);
}

Analysis
Entire SOC tested on VHDL simulator
Interprets VHDL descriptions and functionally simulates execution of system
Recall program code translated to VHDL description of ROM
Tests for correct functionality
Measures clock cycles to process one image (performance)
Gate-level description obtained through synthesis
Synthesis tool like compiler for SPPs
Simulate gate-level models to obtain data for power analysis
Number of times gates switch from 1 to 0 or 0 to 1
Count number of gates for chip area
[Figure: Obtaining design metrics of interest. The VHDL descriptions feed a VHDL simulator, which yields execution time, and a synthesis tool, which yields gates; the gates feed a gate-level simulator and a power equation, which yield power, and are summed to give chip area]

Implementation 2: Microcontroller and CCDPP
Analysis of implementation 2
Total execution time for processing one image: 9.1 seconds
Power consumption: 0.033 watt
Energy consumption: 0.30 joule (9.1 s x 0.033 watt)
Total chip area: 98,000 gates

Implementation 3: Microcontroller and CCDPP/Fixed-Point DCT
9.1 seconds still doesn't meet performance constraint of 1 second
DCT operation prime candidate for improvement
Execution of implementation 2 shows microprocessor spends most cycles here
Could design custom hardware like we did for CCDPP
More complex so more design effort
Instead, will speed up DCT functionality by modifying behavior

DCT floating-point cost
DCT uses ~260 floating-point operations per pixel transformation
4096 (64 x 64) pixels per image
About 1 million (260 x 4096) floating-point operations per image
No floating-point support with Intel 8051
Compiler must emulate
Generates procedures for each floating-point operation (mult, add)
Each procedure uses tens of integer operations
Thus, > 10 million integer operations per image

Procedures increase code size
Fixed-point arithmetic can improve on this

Fixed-point arithmetic
Integer used to represent a real number
Constant number of integer's bits represents fractional portion of real number
More bits, more accurate the representation
Remaining bits represent portion of real number before decimal point
Translating a real constant to a fixed-point representation
Multiply real value by 2 ^ (# of bits used for fractional part)
Round to nearest integer
E.g., represent 3.14 as 8-bit integer with 4 bits for fraction
2^4 = 16
3.14 x 16 = 50.24 ≈ 50 = 00110010
16 (2^4) possible values for fraction, each represents 0.0625 (1/16)
Last 4 bits (0010) = 2
2 x 0.0625 = 0.125
3(0011) + 0.125 = 3.125 ≈ 3.14 (more bits for fraction would increase accuracy)
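The translation steps above fit in two small functions. This is a minimal sketch in C; `to_fixed` and `from_fixed` are hypothetical names, and the rounding shown only handles non-negative values.

```c
/* Convert a non-negative real value to fixed point with frac_bits
   fractional bits: multiply by 2^frac_bits, round to nearest integer. */
static int to_fixed(double value, int frac_bits) {
    return (int)(value * (1 << frac_bits) + 0.5);
}

/* Recover the real value that a fixed-point integer represents. */
static double from_fixed(int fixed, int frac_bits) {
    return (double)fixed / (1 << frac_bits);
}
```

For the slide's example, `to_fixed(3.14, 4)` gives 50 and `from_fixed(50, 4)` gives 3.125, so the representation error with 4 fractional bits is 0.015.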

Fixed-point arithmetic operations

Addition
Simply add integer representations
E.g., 3.14 + 2.71 = 5.85
3.14 -> 50 = 0011.0010
2.71 -> 43 = 0010.1011
50 + 43 = 93 = 0101.1101
5(0101) + 13(1101) x 0.0625 = 5.8125 ≈ 5.85
Multiply
Multiply integer representations
Shift result right by # of bits in fractional part
E.g., 3.14 * 2.71 = 8.5094
50 * 43 = 2150 = 1000.01100110 (8 fractional bits before the shift)
[ = (3.14*16) * (2.71*16) = (3.14*2.71*16) * 16 ]
>> 4 = 1000.0110 = 134
8(1000) + 6(0110) x 0.0625 = 8.375 ≈ 8.5094
Range of real values used is limited by bit widths of possible resulting values
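The two operations above can be sketched directly in C. The names `fixed_add` and `fixed_mul` and the constant `FRAC_BITS` are hypothetical; the sketch assumes 4 fractional bits, as in the slide's example, and operands small enough that the product does not overflow an int.

```c
#define FRAC_BITS 4   /* fractional bits, matching the slide's example */

/* Addition: simply add the integer representations. */
static int fixed_add(int a, int b) {
    return a + b;
}

/* Multiplication: the raw product carries 2 * FRAC_BITS fractional bits,
   so shift right by FRAC_BITS to return to the original format. */
static int fixed_mul(int a, int b) {
    return (a * b) >> FRAC_BITS;
}
```

With 50 and 43 standing for 3.14 and 2.71, `fixed_add` gives 93 (5.8125) and `fixed_mul` gives 134 (8.375).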
Fixed-point implementation of CODEC
COS_TABLE gives 8-bit fixed-point representation of cosine values
6 bits used for fractional portion
Result of multiplications shifted right by 6

static const char code COS_TABLE[8][8] = {
{ 64,  62,  59,  53,  45,  35,  24,  12 },
{ 64,  53,  24, -12, -45, -62, -59, -35 },
{ 64,  35, -24, -62, -45,  12,  59,  53 },
{ 64,  12, -59, -35,  45,  53, -24, -62 },
{ 64, -12, -59,  35,  45, -53, -24,  62 },
{ 64, -35, -24,  62, -45, -12,  59, -53 },
{ 64, -53,  24,  12, -45,  62, -59,  35 },
{ 64, -62,  59, -53,  45, -35,  24, -12 }
};
static const char ONE_OVER_SQRT_TWO = 45; /* 0.7071 x 64, rounded */
static short xdata inBuffer[8][8], outBuffer[8][8], idx;

static unsigned char C(int h) { return h ? 64 : ONE_OVER_SQRT_TWO;}
static int F(int u, int v, short img[8][8]) {
    long s[8], r = 0;
    unsigned char x, j;
    for(x=0; x<8; x++) {
        s[x] = 0;
        for(j=0; j<8; j++)
            s[x] += (img[x][j] * COS_TABLE[j][v] ) >> 6;
    }
    for(x=0; x<8; x++) r += (s[x] * COS_TABLE[x][u]) >> 6;
    return (short)((((r * (((16*C(u)) >> 6) *C(v)) >> 6)) >> 6) >> 6);
}
void CodecInitialize(void) { idx = 0; }
void CodecPushPixel(short p) {
    if( idx == 64 ) idx = 0;
    inBuffer[idx / 8][idx % 8] = p << 6; idx++;
}
void CodecDoFdct(void) {
    unsigned short x, y;
    for(x=0; x<8; x++)
        for(y=0; y<8; y++)
            outBuffer[x][y] = F(x, y, inBuffer);
    idx = 0;
}
Implementation 3: Microcontroller and CCDPP/Fixed-Point DCT
Use same analysis techniques as implementation 2
Total execution time for processing one image: 1.5 seconds
Power consumption: 0.033 watt (same as 2)
Energy consumption: 0.050 joule (1.5 s x 0.033 watt)
Battery life 6x longer!!
Total chip area: 90,000 gates
8,000 fewer gates (less memory needed for code)
Implementation 4: Microcontr. and CCDPP/DCT and CODEC
[Figure: block diagram labeled SOC, showing EEPROM, CODEC, RAM, 8051, UART, and CCDPP]
Performance close but not good enough
Must resort to implementing CODEC in hardware
Single-purpose processor to perform DCT on 8 x 8 block

CODEC design
4 memory mapped registers
C_DATAI_REG/C_DATAO_REG used to push/pop 8 x 8 block into and out of CODEC
C_CMND_REG used to command CODEC
Writing 1 to this register invokes CODEC
C_STAT_REG indicates CODEC done and ready for next block
Polled in software
Direct translation of C code to VHDL for actual hardware implementation
Fixed-point version used
CODEC module in software changed similar to UART/CCDPP in implementation 2
Rewritten CODEC software

static unsigned char xdata C_STAT_REG _at_ 65527;
static unsigned char xdata C_CMND_REG _at_ 65528;
static unsigned char xdata C_DATAI_REG _at_ 65529;
static unsigned char xdata C_DATAO_REG _at_ 65530;
void CodecInitialize(void) {}
void CodecPushPixel(short p) { C_DATAO_REG = (char)p; }
short CodecPopPixel(void) {
    return ((C_DATAI_REG << 8) | C_DATAI_REG);
}
void CodecDoFdct(void) {
    C_CMND_REG = 1;
    while( C_STAT_REG == 1 ) { /* busy wait */ }
}

Implementation 4: Microcontr. and CCDPP/DCT and CODEC
Total execution time for processing one image: 0.099 seconds (well under 1 sec)
Power consumption: 0.040 watt
Increase over 2 and 3 because SOC has another processor
Energy consumption: 0.0040 joule (0.099 s x 0.040 watt)
Battery life 12x longer than previous implementation!!
Total chip area: 128,000 gates
Significant increase over previous implementations
Summary of implementations
                      Implementation 2   Implementation 3   Implementation 4
Performance (second)  9.1                1.5                0.099
Power (watt)          0.033              0.033              0.040
Size (gate)           98,000             90,000             128,000
Energy (joule)        0.30               0.050              0.0040
Implementation 3
Close in performance
Cheaper
Less time to build
Implementation 4
Great performance and energy consumption
More expensive and may miss time-to-market window
If DCT designed ourselves then increased NRE cost and time-to-market
If existing DCT purchased then increased IC cost (IP royalties)
Which is better?
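The energy figures in the comparison follow from execution time multiplied by power, and the battery-life claims are ratios of those energies. A minimal sketch in C, with hypothetical function names:

```c
/* Energy in joules = execution time in seconds x power in watts. */
static double energy_joules(double seconds, double watts) {
    return seconds * watts;
}

/* Factor by which battery life per image improves when the energy
   per image drops from energy_before to energy_after. */
static double battery_gain(double energy_before, double energy_after) {
    return energy_before / energy_after;
}
```

For example, `energy_joules(0.099, 0.040)` is about 0.0040 joule for implementation 4, and `battery_gain(0.30, 0.050)` is 6 for implementation 3 relative to implementation 2.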
Summary
Digital camera example
Specifications in English and executable language
Design metrics: performance, power and area
Several implementations
Microcontroller: too slow
Microcontroller and coprocessor: better, but still too slow
Fixed-point arithmetic: almost fast enough
Additional coprocessor for compression: fast enough, but expensive and hard to design
Tradeoffs between hw/sw: this is the main design concern
Introduction to VHDL
Slides adapted from the Introduction to VLSI course, GM University, VA, USA
VHDL
VHDL is a language for describing digital hardware used by industry worldwide
VHDL is an acronym for VHSIC (Very High Speed Integrated Circuit) Hardware Description Language
Genesis of VHDL
State of art circa 1980
Multiple design entry methods and
hardware description languages in use
No or limited portability of designs
between CAD tools from different vendors
Objective: shortening the time from a
design concept to implementation from
18 months to 6 months
A Brief History of VHDL

June 1981: Woods Hole Workshop
July 1983: contract awarded to develop VHDL
Intermetrics
IBM
Texas Instruments
August 1985: VHDL Version 7.2 released
December 1987:
VHDL became IEEE Standard 1076-1987 and in
1988 an ANSI standard
Three versions of VHDL
VHDL-87
VHDL-93
VHDL-01
VHDL for Specification
VHDL for Simulation
VHDL for Synthesis
Levels of design description
Algorithmic level
Register Transfer Level (level of description most suitable for synthesis)
Logic (gate) level
Circuit (transistor) level
Physical (layout) level
Register Transfer Logic (RTL) Design Description
[Figure: blocks of combinational logic separated by registers]
Naming and Labeling (1)

VHDL is not case sensitive
Example:
Names or labels
databus
Databus
DataBus
DATABUS
are all equivalent
Naming and Labeling (2)
General rules of thumb (according to VHDL-87)
1. All names should start with an alphabet character (a-z or A-Z)
2. Use only alphabet characters (a-z or A-Z), digits (0-9), and underscore (_)
3. Do not use any punctuation or reserved characters within a name (!, ?, ., &, +, -, etc.)
4. Do not use two or more consecutive underscore characters (__) within a name (e.g., Sel__A is invalid)
5. All names and labels in a given entity and architecture must be unique
Free Format
VHDL is a free format language
No formatting conventions, such as spacing or
indentation imposed by VHDL compilers. Space
and carriage return treated the same way.
Example:
if (a=b) then
or
if (a=b)
then
or
if (a =
b) then
are all equivalent

Comments
Comments in VHDL are indicated with
a double dash, i.e., --
Comment indicator can be placed anywhere in the
line
Any text that follows in the same line is treated as
a comment
Carriage return terminates a comment
No method for commenting a block extending over
a couple of lines
Examples:
-- main subcircuit
Data_in <= Data_bus; -- reading data from the input FIFO
Design Entity
Design Entity - most basic building block of a design.
One entity can have many different architectures.
[Figure: a design entity made up of an entity declaration and architectures 1, 2, and 3]
Entity Declaration
Entity Declaration describes the interface of the component, i.e. input and output ports.
ENTITY nand_gate IS
PORT(
a : IN STD_LOGIC;
b : IN STD_LOGIC;
z : OUT STD_LOGIC
);
END nand_gate;
[Figure annotations: entity name (nand_gate); port names (a, b, z); port modes, i.e. data flow directions (IN, OUT); port type (STD_LOGIC); reserved words; a semicolon after each port except the last]
Entity declaration simplified syntax
ENTITY entity_name IS
PORT (
port_name : signal_mode signal_type;
port_name : signal_mode signal_type;
.
port_name : signal_mode signal_type);
END entity_name;
Architecture
Describes an implementation of a design
entity.
Architecture example:
ARCHITECTURE model OF nand_gate IS

BEGIN
z <= a NAND b;
END model;
Architecture simplified syntax
ARCHITECTURE architecture_name OF entity_name IS

[ declarations ]
BEGIN
code
END architecture_name;
Entity Declaration & Architecture
nand_gate.vhd
LIBRARY ieee;
USE ieee.std_logic_1164.all;
ENTITY nand_gate IS
PORT(
a : IN STD_LOGIC;
b : IN STD_LOGIC;
z : OUT STD_LOGIC);
END nand_gate;
ARCHITECTURE model OF nand_gate IS
BEGIN
z <= a NAND b;
END model;
Mode In
Port signal
Entity
Driver resides
outside the entity
Mode out
Driver resides inside the entity
Can't read out within an entity: c <= z is not allowed when z is a port of mode out
Mode out with signal
Signal X can be read inside the entity
Driver resides inside the entity
z <= x
c <= x
Mode inout
Signal can be read inside the entity
Driver may reside both inside and outside of the entity
Mode buffer
Driver resides inside the entity
Port signal Z can be read inside the entity
c <= z
Port Modes
The Port Mode of the interface describes the direction in which data travels with
respect to the component
In: Data comes in this port and can only be read within the entity. It can
appear only on the right side of a signal or variable assignment.
Out: The value of an output port can only be updated within the entity. It
cannot be read. It can only appear on the left side of a signal
assignment.
Inout: The value of a bi-directional port can be read and updated within
the entity model. It can appear on both sides of a signal assignment.
Buffer: Used for a signal that is an output from an entity. The value of the
signal can be used inside the entity, which means that in an assignment
statement the signal can appear on the left and right sides of the <=
operator
Library declarations
LIBRARY ieee;                  -- Library declaration
USE ieee.std_logic_1164.all;   -- Use all definitions from the package std_logic_1164
ENTITY nand_gate IS
PORT(
a : IN STD_LOGIC;
b : IN STD_LOGIC;
z : OUT STD_LOGIC);
END nand_gate;
ARCHITECTURE model OF nand_gate IS
BEGIN
z <= a NAND b;
END model;
Library declarations - syntax
LIBRARY library_name;
USE library_name.package_name.package_parts;
Fundamental parts of a library

LIBRARY
PACKAGE 1
PACKAGE 2
TYPES
CONSTANTS
FUNCTIONS
PROCEDURES
COMPONENTS
TYPES
CONSTANTS
FUNCTIONS
PROCEDURES
COMPONENTS
Libraries
ieee: Specifies multi-level logic system, including STD_LOGIC and STD_LOGIC_VECTOR data types. Needs to be explicitly declared.
std: Specifies pre-defined data types (BIT, BOOLEAN, INTEGER, REAL, SIGNED, UNSIGNED, etc.), arithmetic operations, basic type conversion functions, basic text i/o functions, etc. Visible by default.
work: Current designs after compilation. Visible by default.
STD_LOGIC
LIBRARY ieee;
USE ieee.std_logic_1164.all;
ENTITY nand_gate IS
PORT(
a : IN STD_LOGIC;
b : IN STD_LOGIC;
z : OUT STD_LOGIC);
END nand_gate;
ARCHITECTURE model OF nand_gate IS
BEGIN
z <= a NAND b;
END model;
What is STD_LOGIC you ask?
STD_LOGIC type demystified
Value   Meaning
'X'     Forcing (Strong driven) Unknown
'0'     Forcing (Strong driven) 0
'1'     Forcing (Strong driven) 1
'Z'     High Impedance
'W'     Weak (Weakly driven) Unknown
'L'     Weak (Weakly driven) 0. Models a pull down.
'H'     Weak (Weakly driven) 1. Models a pull up.
'-'     Don't Care
More on STD_LOGIC Meanings
'X': contention on the bus
[Figures omitted: bus contention resolving to 'X'; a pull-up to VDD producing 'H']
'-': Do not care.
Can be assigned to outputs for the case of invalid inputs (may produce significant improvement in resource utilization after synthesis).
Use with caution
'1' = '-' gives FALSE
Resolving logic levels
     X  0  1  Z  W  L  H  -
X    X  X  X  X  X  X  X  X
0    X  0  X  0  0  0  0  X
1    X  X  1  1  1  1  1  X
Z    X  0  1  Z  W  L  H  X
W    X  0  1  W  W  W  W  X
L    X  0  1  L  W  L  W  X
H    X  0  1  H  W  W  H  X
-    X  X  X  X  X  X  X  X
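The resolution table can be read as a lookup: when two drivers put values on the same signal, the row and column select the resolved value. The following C sketch encodes the eight values listed on the slide; `resolve`, `SL_VALUES`, and `RESOLVE` are hypothetical names, and returning 'X' for an unrecognized character is an assumption made here for safety.

```c
#include <string.h>

/* The eight values from the slide, in row/column order. */
static const char SL_VALUES[] = "X01ZWLH-";

/* RESOLVE[row][col] is the value seen when two drivers collide. */
static const char RESOLVE[8][9] = {
    "XXXXXXXX",  /* X */
    "X0X0000X",  /* 0 */
    "XX11111X",  /* 1 */
    "X01ZWLHX",  /* Z */
    "X01WWWWX",  /* W */
    "X01LWLWX",  /* L */
    "X01HWWHX",  /* H */
    "XXXXXXXX"   /* - */
};

/* Resolve two driver values; unknown characters resolve to 'X'. */
static char resolve(char a, char b) {
    const char *pa = strchr(SL_VALUES, a);
    const char *pb = strchr(SL_VALUES, b);
    if (a == '\0' || b == '\0' || pa == NULL || pb == NULL)
        return 'X';
    return RESOLVE[pa - SL_VALUES][pb - SL_VALUES];
}
```

For instance, a strong '0' against a strong '1' resolves to 'X' (contention), while 'Z' against '1' resolves to '1'.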
Signals
SIGNAL a : STD_LOGIC;                      -- a: 1-bit wire
SIGNAL b : STD_LOGIC_VECTOR(7 DOWNTO 0);   -- b: 8-bit bus
Standard Logic Vectors
SIGNAL a: STD_LOGIC;
SIGNAL b: STD_LOGIC_VECTOR(3 DOWNTO 0);
SIGNAL c: STD_LOGIC_VECTOR(3 DOWNTO 0);
SIGNAL d: STD_LOGIC_VECTOR(7 DOWNTO 0);
SIGNAL e: STD_LOGIC_VECTOR(15 DOWNTO 0);
SIGNAL f: STD_LOGIC_VECTOR(8 DOWNTO 0);
.
a <= '1';
b <= "0000";          -- Binary base assumed by default
c <= B"0000";         -- Binary base explicitly specified
d <= B"0110_0111";    -- You can use '_' to increase readability
e <= X"AF67";         -- Hexadecimal base
f <= O"723";          -- Octal base
Vectors and Concatenation
SIGNAL a: STD_LOGIC_VECTOR(3 DOWNTO 0);
SIGNAL b: STD_LOGIC_VECTOR(3 DOWNTO 0);
SIGNAL c, d, e: STD_LOGIC_VECTOR(7 DOWNTO 0);
a <= "0000";
b <= "1111";
c <= a & b;                 -- c = "00001111"
d <= '0' & "0001111";       -- d <= "00001111"
e <= '0' & '0' & '0' & '0' & '1' & '1' &
     '1' & '1';             -- e <= "00001111"
VHDL Design Styles
dataflow: concurrent statements
structural: components and interconnects
behavioral: sequential statements (registers, state machines, test benches, algorithm spec.)
Subset most suitable for synthesis
xor3 Example
Entity xor3
ENTITY xor3 IS
PORT(
A : IN STD_LOGIC;
B : IN STD_LOGIC;
C : IN STD_LOGIC;
Result : OUT STD_LOGIC
);
END xor3;
Dataflow Architecture (xor3 gate)

ARCHITECTURE dataflow OF xor3 IS
SIGNAL U1_out: STD_LOGIC;
BEGIN
U1_out <=A XOR B;
Result <=U1_out XOR C;
END dataflow;
Dataflow Description
Describes how data moves through the system and the various processing steps.
Data Flow uses a series of concurrent statements to realize logic. Concurrent statements are evaluated at the same time; thus, the order of these statements doesn't matter.
Data Flow is the most useful style when a series of Boolean equations can represent the logic.
Structural Architecture (xor3 gate)
[Figure: XOR3 built from two XOR2 components (ports I1, I2, Y); the first combines A and B into U1_OUT, the second combines U1_OUT and C into Result]
ARCHITECTURE structural OF xor3 IS
SIGNAL U1_OUT: STD_LOGIC;
COMPONENT xor2 IS
PORT(
I1 : IN STD_LOGIC;
I2 : IN STD_LOGIC;
Y : OUT STD_LOGIC
);
END COMPONENT;
BEGIN
U1: xor2 PORT MAP (I1 => A,
I2 => B,
Y => U1_OUT);
U2: xor2 PORT MAP (I1 => U1_OUT,
I2 => C,
Y => Result);
END structural;
Component and Instantiation (1)

Named association connectivity
(recommended)
COMPONENT xor2 IS
PORT(
I1 : IN STD_LOGIC;
I2 : IN STD_LOGIC;
Y : OUT STD_LOGIC
);
END COMPONENT;
U1: xor2 PORT MAP (I1 => A,
I2 => B,
Y => U1_OUT);
Component and Instantiation (2)

Positional association connectivity
(not recommended)
COMPONENT xor2 IS
PORT(
I1 : IN STD_LOGIC;
I2 : IN STD_LOGIC;
Y : OUT STD_LOGIC
);
END COMPONENT;
U1: xor2 PORT MAP (A, B, U1_OUT);
Structural Description
Structural design is the simplest to understand.
This style is the closest to schematic capture and
utilizes simple building blocks to compose logic
functions.
Components are interconnected in a hierarchical
manner.
Structural descriptions may connect simple gates
or complex, abstract components.
Structural style is useful when expressing a
design that is naturally composed of sub-blocks.
Behavioral Architecture (xor3 gate)

ARCHITECTURE behavioral OF xor3 IS
BEGIN
xor3_behave: PROCESS (A,B,C)
BEGIN
IF ((A XOR B XOR C) = '1') THEN
Result <= '1';
ELSE
Result <= '0';
END IF;
END PROCESS xor3_behave;
END behavioral;
Behavioral Description
It accurately models what happens on the inputs
and outputs of the black box (no matter what is
inside and how it works).
This style uses PROCESS statements in VHDL.
Testbench Block Diagram
[Figure: the testbench contains processes generating stimuli for the Design Under Test (DUT) and receives its observed outputs]
Testbench Defined
Testbench applies stimuli (drives the inputs) to
the Design Under Test (DUT) and (optionally)
verifies expected outputs.
The results can be viewed in a waveform window
or written to a file.
Since Testbench is written in VHDL, it is not
restricted to a single simulation tool (portability).
The same Testbench can be easily adapted to
test different implementations (i.e. different
architectures) of the same design.
Testbench Anatomy
ENTITY tb IS
--TB entity has no ports
END tb;
ARCHITECTURE arch_tb OF tb IS
--Local signals and constants
COMPONENT TestComp --All Design Under Test component declarations
PORT ( );
END COMPONENT;
-----------------------------------------------------
BEGIN
testSequence: PROCESS
-- Input stimuli
END PROCESS;
DUT:TestComp PORT MAP( -- Instantiations of DUTs
);
END arch_tb;
Testbench for XOR3 (1)
LIBRARY ieee;
USE ieee.std_logic_1164.all;
ENTITY xor3_tb IS
END xor3_tb;
ARCHITECTURE xor3_tb_architecture OF xor3_tb IS
-- Component declaration of the tested unit
COMPONENT xor3
PORT(
A : IN STD_LOGIC;
B : IN STD_LOGIC;
C : IN STD_LOGIC;
Result : OUT STD_LOGIC );
END COMPONENT;
-- Stimulus signals - signals mapped to the input and inout ports of tested entity
SIGNAL test_vector: STD_LOGIC_VECTOR(2 DOWNTO 0);
SIGNAL test_result : STD_LOGIC;
Testbench for XOR3 (2)
BEGIN
UUT : xor3
PORT MAP (
A => test_vector(0),
B => test_vector(1),
C => test_vector(2),
Result => test_result);
Testing: PROCESS
BEGIN
test_vector <= "000";
WAIT FOR 10 ns;
test_vector <= "001";
WAIT FOR 10 ns;
test_vector <= "010";
WAIT FOR 10 ns;
test_vector <= "011";
WAIT FOR 10 ns;
test_vector <= "100";
WAIT FOR 10 ns;
test_vector <= "101";
WAIT FOR 10 ns;
test_vector <= "110";
WAIT FOR 10 ns;
test_vector <= "111";
WAIT FOR 10 ns;
END PROCESS;
END xor3_tb_architecture;
Constants
Syntax:
CONSTANT name : type := value;
Examples:
CONSTANT init_value : STD_LOGIC_VECTOR(3 downto 0) := "0100";
CONSTANT ANDA_EXT : STD_LOGIC_VECTOR(7 downto 0) := X"B4";
CONSTANT counter_width : INTEGER := 16;
CONSTANT buffer_address : INTEGER := 16#FFFE#;
CONSTANT clk_period : TIME := 20 ns;
CONSTANT strobe_period : TIME := 333.333 ms;
Constants - features
Constants can be declared in a
PACKAGE, ENTITY, ARCHITECTURE
When declared in a PACKAGE, the constant
is truly global, for the package can be used
in several entities.
When declared in an ARCHITECTURE, the
constant is local, i.e., it is visible only within this architecture.
When declared in an ENTITY declaration, the constant
can be used in all architectures associated with this entity.
Physical data types

Types representing physical quantities,
such as time, voltage, capacitance, etc. are
referred in VHDL as physical data types.
TIME is the only predefined physical data
type.
Value of the physical data type is called a
physical literal.
Time values (physical literals) - Examples
7 ns
1 min
min
10.65 us
10.65 fs
Each literal consists of a numeric value, a space, and a unit of time (dimension)
TIME values
Numeric value can be an integer or
a floating point number.
Numeric value is optional. If not given, 1 is
implied.
Numeric value and dimension MUST be
separated by a space.
Units of time
Unit   Definition
Base Unit
fs     femtoseconds (10^-15 seconds)
Derived Units
ps     picoseconds (10^-12 seconds)
ns     nanoseconds (10^-9 seconds)
us     microseconds (10^-6 seconds)
ms     milliseconds (10^-3 seconds)
sec    seconds
min    minutes (60 seconds)
hr     hours (3600 seconds)
Values of the type TIME

Value of a physical literal is defined in terms
of integral multiples of the base unit, e.g.
10.65 us = 10,650,000,000 fs
10.65 fs = 10 fs
Smallest available resolution in VHDL is 1 fs.
Smallest available resolution in simulation can be
set using a simulator command or parameter.
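The truncation to whole base units can be illustrated with integer arithmetic. The sketch below is in C rather than VHDL and is only a model of the rule: `time_to_fs` is a hypothetical helper that takes a decimal literal as an integer mantissa and a divisor (10.65 is passed as 1065 and 100) so that no floating-point rounding enters the picture.

```c
/* Number of whole femtoseconds in (mantissa / divisor) units, where one
   unit is fs_per_unit femtoseconds. Integer division discards any
   fraction of a femtosecond, mirroring the 1 fs resolution limit. */
static long long time_to_fs(long long mantissa, long long divisor,
                            long long fs_per_unit) {
    return mantissa * fs_per_unit / divisor;
}
```

With this model, 10.65 us becomes 10,650,000,000 fs, while 10.65 fs becomes 10 fs, matching the two examples above.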
Arithmetic operations on values of the type TIME
Examples:
7 ns + 10 ns = 17 ns
1.2 ns - 12.6 ps = 1187400 fs
5 ns * 4.3 = 21.5 ns
20 ns / 5 ns = 4
Data-flow VHDL
Major instructions
Concurrent statements
concurrent signal assignment (<=)
conditional concurrent signal assignment (when-else)
selected concurrent signal assignment (with-select-when)
generate scheme for equations (for-generate)
Data-flow VHDL: Example (Full adder)
(a) Truth table
ci xi yi | ci+1 si
0  0  0  |  0   0
0  0  1  |  0   1
0  1  0  |  0   1
0  1  1  |  1   0
1  0  0  |  0   1
1  0  1  |  1   0
1  1  0  |  1   0
1  1  1  |  1   1
(b) Karnaugh maps
si = xi XOR yi XOR ci
ci+1 = xi yi + xi ci + yi ci
(c) Circuit
[Figure: full-adder circuit with inputs xi, yi, ci and outputs si, ci+1]
Data-flow VHDL: Example (1)
LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY fulladd IS
PORT ( x    : IN  STD_LOGIC ;
       y    : IN  STD_LOGIC ;
       cin  : IN  STD_LOGIC ;
       s    : OUT STD_LOGIC ;
       cout : OUT STD_LOGIC ) ;
END fulladd ;
Data-flow VHDL: Example (2)
ARCHITECTURE fulladd_dataflow OF fulladd IS

BEGIN
s <= x XOR y XOR cin ;
cout <= (x AND y) OR (cin AND x) OR (cin AND y) ;
END fulladd_dataflow ;
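The two Boolean equations of the full adder can be cross-checked against ordinary addition. The following is a small software model in C, not VHDL; `full_add` is a hypothetical name and the inputs are assumed to be 0 or 1.

```c
/* Software model of the full adder: same equations as the dataflow
   architecture, with ^, & and | standing in for XOR, AND and OR.
   Inputs are assumed to be 0 or 1. */
static void full_add(int x, int y, int cin, int *s, int *cout) {
    *s = x ^ y ^ cin;
    *cout = (x & y) | (cin & x) | (cin & y);
}
```

For every input combination, 2 * cout + s equals x + y + cin, which is the defining property of a full adder.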
Logic Operators
Logic operators: and, or, nand, nor, xor, not, xnor (xnor only in VHDL-93)
Logic operators precedence
Highest: not
Lowest: and, or, nand, nor, xor, xnor
No Implied Precedence
Wanted: y = ab + cd
Incorrect
y <= a and b or c and d ;
equivalent to
y <= ((a and b) or c) and d ;
equivalent to
y = (ab + c)d
Correct
y <= (a and b) or (c and d) ;
Concatenation
SIGNAL a: STD_LOGIC_VECTOR(3 DOWNTO 0);
SIGNAL b: STD_LOGIC_VECTOR(3 DOWNTO 0);
SIGNAL c, d, e, f: STD_LOGIC_VECTOR(7 DOWNTO 0);
a <= "0000";
b <= "1111";
c <= a & b;                 -- c = "00001111"
d <= '0' & "0001111";       -- d <= "00001111"
e <= '0' & '0' & '0' & '0' & '1' & '1' &
     '1' & '1';             -- e <= "00001111"
f <= ('0','0','0','0','1','1','1','1') ;   -- f <= "00001111"
Rotations in VHDL
a<<<1
Before: a(3) a(2) a(1) a(0)
After:  a(2) a(1) a(0) a(3)
a_rotL <= a(2 downto 0) & a(3);
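The same rotation can be modelled on an integer to see what the concatenation does to each bit. This C sketch is only an illustration; `rotl4` is a hypothetical name and the operand is treated as a 4-bit value.

```c
/* Rotate a 4-bit value left by one position: bits 2..0 move up one place
   and the old bit 3 wraps around to bit 0, as in a(2 downto 0) & a(3). */
static unsigned rotl4(unsigned a) {
    a &= 0xFu;                          /* keep only the 4 bits of interest */
    return ((a << 1) | (a >> 3)) & 0xFu;
}
```

For example, 1000 rotates to 0001 and 1001 rotates to 0011.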
Arithmetic Operators in VHDL (1)

To use basic arithmetic operations involving
std_logic_vectors you need to include the
following library packages:
LIBRARY ieee;
USE ieee.std_logic_unsigned.all;
or
USE ieee.std_logic_signed.all;
Arithmetic Operators in VHDL (2)
You can use standard +, - operators to perform addition and subtraction:
signal A : STD_LOGIC_VECTOR(3 downto 0);
signal B : STD_LOGIC_VECTOR(3 downto 0);
signal C : STD_LOGIC_VECTOR(3 downto 0);
C <= A + B;
Conditional concurrent signal assignment
When - Else
target_signal <= value1 when condition1 else
value2 when condition2 else
. . .
valueN-1 when conditionN-1 else
valueN;
[Figure: chain of 2-to-1 multiplexers driving Target Signal; Condition 1 selects Value 1 with highest priority, then Condition 2 selects Value 2, and so on down to Condition N-1 selecting Value N-1, with Value N as the default]
Operators
Relational operators: =, /=, <, <=, >, >=
Logic and relational operators precedence
Highest: not
=, /=, <, <=, >, >=
Lowest: and, or, nand, nor, xor, xnor
Priority of logic and relational operators

compare a = bc
Incorrect
when a = b and c else
equivalent to
when (a = b) and c else
Correct
when a = (b and c) else
Tri-state Buffer example (1)
LIBRARY ieee;
USE ieee.std_logic_1164.all;
ENTITY tri_state IS
PORT ( enable: IN STD_LOGIC;
input: IN STD_LOGIC_VECTOR(7 downto 0);
output: OUT STD_LOGIC_VECTOR (7 DOWNTO 0)
);
END tri_state;
Tri-state Buffer example (2)
ARCHITECTURE tri_state_dataflow OF tri_state IS
BEGIN
output <= input WHEN (enable = '0') ELSE
(OTHERS => 'Z');
END tri_state_dataflow;
Selected concurrent signal assignment
With Select-When
with choice_expression select
target_signal <= expression1 when choices_1,
expression2 when choices_2,
. . .
expressionN when choices_N;
[Figure: multiplexer controlled by the choice expression; expression1 through expressionN are routed to target_signal according to choices_1 through choices_N]
Allowed formats of choices_k
WHEN value
WHEN value_1 to value_2
WHEN value_1 | value_2 | .... | value N
Allowed formats of choice_k - example
WITH sel SELECT

y <= a WHEN "000",
b WHEN "011" to "110",
c WHEN "001" | "111",
d WHEN OTHERS;
.. .
84
MLU: Block Diagram

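[Figure: MLU block diagram. Inputs A and B pass through optional inverters controlled by NEG_A and NEG_B to give A1 and B1; four gates produce MUX_0 to MUX_3, which feed inputs IN0 to IN3 of a 4-to-1 multiplexer (MUX_4_1) selected by L1 and L0 (SEL1, SEL0); its output Y1 passes through an optional inverter controlled by NEG_Y to give Y.]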
MLU: Entity Declaration

LIBRARY ieee;
USE ieee.std_logic_1164.all;
ENTITY mlu IS
  PORT(
    NEG_A : IN  STD_LOGIC;
    NEG_B : IN  STD_LOGIC;
    NEG_Y : IN  STD_LOGIC;
    A     : IN  STD_LOGIC;
    B     : IN  STD_LOGIC;
    L1    : IN  STD_LOGIC;
    L0    : IN  STD_LOGIC;
    Y     : OUT STD_LOGIC
  );
END mlu;
MLU: Architecture Declarative Section

ARCHITECTURE mlu_dataflow OF mlu IS
  SIGNAL A1    : STD_LOGIC;
  SIGNAL B1    : STD_LOGIC;
  SIGNAL Y1    : STD_LOGIC;
  SIGNAL MUX_0 : STD_LOGIC;
  SIGNAL MUX_1 : STD_LOGIC;
  SIGNAL MUX_2 : STD_LOGIC;
  SIGNAL MUX_3 : STD_LOGIC;
  SIGNAL L     : STD_LOGIC_VECTOR(1 DOWNTO 0);
MLU - Architecture Body

BEGIN
  A1 <= NOT A WHEN (NEG_A = '1') ELSE
        A;
  B1 <= NOT B WHEN (NEG_B = '1') ELSE
        B;
  Y  <= NOT Y1 WHEN (NEG_Y = '1') ELSE
        Y1;
  MUX_0 <= A1 AND B1;
  MUX_1 <= A1 OR B1;
  MUX_2 <= A1 XOR B1;
  MUX_3 <= A1 XNOR B1;
  L <= L1 & L0;

  WITH (L) SELECT
    Y1 <= MUX_0 WHEN "00",
          MUX_1 WHEN "01",
          MUX_2 WHEN "10",
          MUX_3 WHEN OTHERS;
END mlu_dataflow;
For Generate Statement

For - Generate
label: FOR identifier IN range GENERATE
BEGIN
{Concurrent Statements}
END GENERATE;
PARITY: Entity Declaration

LIBRARY ieee;
USE ieee.std_logic_1164.all;
ENTITY parity IS
  PORT(
    parity_in  : IN  STD_LOGIC_VECTOR(7 DOWNTO 0);
    parity_out : OUT STD_LOGIC
  );
END parity;
PARITY: Block Diagram

[Figure: chain of XOR gates over parity_in(0) to parity_in(7), with intermediate signals xor_out(1) to xor_out(6); the last gate drives parity_out.]
PARITY: Architecture
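ARCHITECTURE parity_dataflow OF parity IS
  SIGNAL xor_out: std_logic_vector (6 downto 1);
BEGIN
  xor_out(1) <= parity_in(0) XOR parity_in(1);
  xor_out(2) <= xor_out(1) XOR parity_in(2);
  xor_out(3) <= xor_out(2) XOR parity_in(3);
  xor_out(4) <= xor_out(3) XOR parity_in(4);
  xor_out(5) <= xor_out(4) XOR parity_in(5);
  xor_out(6) <= xor_out(5) XOR parity_in(6);
  parity_out <= xor_out(6) XOR parity_in(7);
END parity_dataflow;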
PARITY: Block Diagram (2)

[Figure: the same XOR chain with the intermediate signals renumbered xor_out(0) to xor_out(7).]
PARITY: Architecture
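ARCHITECTURE parity_dataflow OF parity IS
  SIGNAL xor_out: STD_LOGIC_VECTOR (7 downto 0);
BEGIN
  xor_out(0) <= parity_in(0);
  xor_out(1) <= xor_out(0) XOR parity_in(1);
  xor_out(2) <= xor_out(1) XOR parity_in(2);
  xor_out(3) <= xor_out(2) XOR parity_in(3);
  xor_out(4) <= xor_out(3) XOR parity_in(4);
  xor_out(5) <= xor_out(4) XOR parity_in(5);
  xor_out(6) <= xor_out(5) XOR parity_in(6);
  xor_out(7) <= xor_out(6) XOR parity_in(7);
  parity_out <= xor_out(7);
END parity_dataflow;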
PARITY: Architecture (2)

ARCHITECTURE parity_dataflow OF parity IS
  SIGNAL xor_out: STD_LOGIC_VECTOR (7 DOWNTO 0);
BEGIN
  xor_out(0) <= parity_in(0);
  G2: FOR i IN 1 TO 7 GENERATE
    xor_out(i) <= xor_out(i-1) XOR parity_in(i);
  END GENERATE G2;
  parity_out <= xor_out(7);
END parity_dataflow;
Left vs. right side of the assignment

Forms of assignment: <=, <= when-else, with-select <=

Left side:
Internal signals (defined
in a given architecture)
Ports of the mode
- out
- inout
- buffer

Right side:
Expressions including:
Internal signals (defined
in a given architecture)
Ports of the mode
- in
- inout
- buffer
Synthesizable arithmetic operations:
Addition, +
Subtraction, -
Comparisons, >, >=, <, <=
Multiplication, *
Division by a power of 2, /2**6
(equivalent to right shift)
Shifts by a constant, SHL, SHR
The result of synthesis of an arithmetic
operation is a
- combinational circuit
- without pipelining.
The exact internal architecture used
(and thus delay and area of the circuit)
may depend on the timing constraints specified
during synthesis (e.g., the requested maximum
clock frequency).
Operations on Unsigned Numbers

For operations on unsigned numbers
USE ieee.std_logic_unsigned.all
and
signals (inputs/outputs) of the type
STD_LOGIC_VECTOR
OR
USE ieee.std_logic_arith.all
and
signals (inputs/outputs) of the type
UNSIGNED
Operations on Signed Numbers

For operations on signed numbers
USE ieee.std_logic_signed.all
and
signals (inputs/outputs) of the type
STD_LOGIC_VECTOR
OR
USE ieee.std_logic_arith.all
and
signals (inputs/outputs) of the type
SIGNED
Signed and Unsigned Types

Behave exactly like
STD_LOGIC_VECTOR
plus, they determine whether a given vector
should be treated as a signed or unsigned number.
Require
USE ieee.std_logic_arith.all;
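
As a rough illustration of how the type decides the interpretation, the sketch below reads the same 4-bit pattern both ways. It is only a sketch under stated assumptions: it uses the IEEE-standard package ieee.numeric_std, swapped in here for the std_logic_arith package that the slides use, and the entity name interp_demo and its port names are hypothetical.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.numeric_std.all;  -- assumption: standard package used instead of std_logic_arith

ENTITY interp_demo IS      -- hypothetical entity, for illustration only
  PORT ( bits     : IN  STD_LOGIC_VECTOR(3 DOWNTO 0);
         as_unsig : OUT INTEGER RANGE 0 TO 15;
         as_sig   : OUT INTEGER RANGE -8 TO 7 );
END interp_demo;

ARCHITECTURE dataflow OF interp_demo IS
BEGIN
  -- Same bits, two interpretations:
  -- "1111" reads as 15 when treated as UNSIGNED and as -1 when treated as SIGNED.
  as_unsig <= to_integer(UNSIGNED(bits));
  as_sig   <= to_integer(SIGNED(bits));
END dataflow;
```

The type conversions UNSIGNED(bits) and SIGNED(bits) do not change any bit; they only tell the arithmetic functions how to weigh the most significant bit.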
Addition of Unsigned Numbers

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
USE ieee.std_logic_unsigned.all ;
ENTITY adder16 IS
  PORT ( Cin  : IN  STD_LOGIC ;
         X, Y : IN  STD_LOGIC_VECTOR(15 DOWNTO 0) ;
         S    : OUT STD_LOGIC_VECTOR(15 DOWNTO 0) ;
         Cout : OUT STD_LOGIC ) ;
END adder16 ;
ARCHITECTURE Behavior OF adder16 IS
  SIGNAL Sum : STD_LOGIC_VECTOR(16 DOWNTO 0) ;
BEGIN
  Sum <= ('0' & X) + Y + Cin ;
  S <= Sum(15 DOWNTO 0) ;
  Cout <= Sum(16) ;
END Behavior ;
Addition of Unsigned Numbers

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
USE ieee.std_logic_arith.all ;
ENTITY adder16 IS
  PORT ( Cin  : IN  STD_LOGIC ;
         X, Y : IN  UNSIGNED(15 DOWNTO 0) ;
         S    : OUT UNSIGNED(15 DOWNTO 0) ;
         Cout : OUT STD_LOGIC ) ;
END adder16 ;
ARCHITECTURE Behavior OF adder16 IS
  SIGNAL Sum : UNSIGNED(16 DOWNTO 0) ;
BEGIN
  Sum <= ('0' & X) + Y + Cin ;
  S <= Sum(15 DOWNTO 0) ;
  Cout <= Sum(16) ;
END Behavior ;
Addition of Signed Numbers (1)

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
USE ieee.std_logic_signed.all ;
ENTITY adder16 IS
  PORT ( Cin            : IN  STD_LOGIC ;
         X, Y           : IN  STD_LOGIC_VECTOR(15 DOWNTO 0) ;
         S              : OUT STD_LOGIC_VECTOR(15 DOWNTO 0) ;
         Cout, Overflow : OUT STD_LOGIC ) ;
END adder16 ;
ARCHITECTURE Behavior OF adder16 IS
  SIGNAL Sum : STD_LOGIC_VECTOR(16 DOWNTO 0) ;
BEGIN
  Sum <= ('0' & X) + Y + Cin ;
  S <= Sum(15 DOWNTO 0) ;
  Cout <= Sum(16) ;
  Overflow <= Sum(16) XOR X(15) XOR Y(15) XOR Sum(15) ;
END Behavior ;
Addition of Signed Numbers (2)

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
USE ieee.std_logic_arith.all ;
ENTITY adder16 IS
  PORT ( Cin            : IN  STD_LOGIC ;
         X, Y           : IN  SIGNED(15 DOWNTO 0) ;
         S              : OUT SIGNED(15 DOWNTO 0) ;
         Cout, Overflow : OUT STD_LOGIC ) ;
END adder16 ;
ARCHITECTURE Behavior OF adder16 IS
  SIGNAL Sum : SIGNED(16 DOWNTO 0) ;
BEGIN
  Sum <= ('0' & X) + Y + Cin ;
  S <= Sum(15 DOWNTO 0) ;
  Cout <= Sum(16) ;
  Overflow <= Sum(16) XOR X(15) XOR Y(15) XOR Sum(15) ;
END Behavior ;
Multiplication of signed and unsigned numbers (1)

LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.std_logic_arith.all;
entity multiply is
  port(
    a  : in  STD_LOGIC_VECTOR(15 downto 0);
    b  : in  STD_LOGIC_VECTOR(7 downto 0);
    cu : out STD_LOGIC_VECTOR(23 downto 0);
    cs : out STD_LOGIC_VECTOR(23 downto 0)
  );
end multiply;
architecture dataflow of multiply is
  SIGNAL sa   : SIGNED(15 downto 0);
  SIGNAL sb   : SIGNED(7 downto 0);
  SIGNAL sres : SIGNED(23 downto 0);
  SIGNAL ua   : UNSIGNED(15 downto 0);
  SIGNAL ub   : UNSIGNED(7 downto 0);
  SIGNAL ures : UNSIGNED(23 downto 0);
Multiplication of signed and unsigned numbers (2)

begin
  -- signed multiplication
  sa <= SIGNED(a);
  sb <= SIGNED(b);
  sres <= sa * sb;
  cs <= STD_LOGIC_VECTOR(sres);
  -- unsigned multiplication
  ua <= UNSIGNED(a);
  ub <= UNSIGNED(b);
  ures <= ua * ub;
  cu <= STD_LOGIC_VECTOR(ures);
end dataflow;
Integer Types
Operations on signals (variables)
of the integer types:
INTEGER, NATURAL,
and their subtypes, such as
TYPE day_of_month IS RANGE 0 TO 31;
are synthesizable in the range
-(2^31 - 1) .. 2^31 - 1 for INTEGERs and their subtypes
0 .. 2^31 - 1 for NATURALs and their subtypes
Integer Types
Operations on signals (variables)
of the integer types:
INTEGER, NATURAL,
are less flexible and more difficult to control
than operations on signals (variables) of the type
STD_LOGIC_VECTOR
UNSIGNED
SIGNED, and thus
are best avoided by beginners.
Addition of Signed Integers
ENTITY adder16 IS
  PORT ( X, Y : IN  INTEGER RANGE -32767 TO 32767 ;
         S    : OUT INTEGER RANGE -32767 TO 32767 ) ;
END adder16 ;
ARCHITECTURE Behavior OF adder16 IS
BEGIN
  S <= X + Y ;
END Behavior ;
VHDL Design Styles

VHDL design styles:
- dataflow: concurrent statements
- structural: components and interconnects
- behavioral: registers, state machines, test benches, algorithm spec.
Structural VHDL
Major instructions
component instantiation (port map)
generate scheme for component instantiations
(for-generate)
component instantiation with generic
(generic map, port map)
Circuit built of medium scale components

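[Figure: two 2-to-1 multiplexers (select inputs s(0) and s(1), data inputs r(0), r(1) and r(4), r(5)) drive p(0) and p(3), while r(2) and r(3) drive p(1) and p(2); p feeds the priority encoder, whose outputs q and ena drive the 2-to-4 decoder dec2to4; the decoder output z is stored in the register regn, enabled by En and clocked by Clk.]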
2-to-1 Multiplexer
[Figure: (a) graphical symbol and (b) truth table of the 2-to-1 multiplexer with data inputs w0 and w1.]
VHDL code for a 2-to-1 Multiplexer

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY mux2to1 IS
  PORT ( w0, w1, s : IN  STD_LOGIC ;
         f         : OUT STD_LOGIC ) ;
END mux2to1 ;
ARCHITECTURE dataflow OF mux2to1 IS
BEGIN
  f <= w0 WHEN s = '0' ELSE w1 ;
END dataflow ;
Priority Encoder
[Figure: graphical symbol with inputs w0 to w3 and outputs y0 and y1.]

w3 w2 w1 w0 | y1 y0 | z
 0  0  0  0 |  d  d | 0
 0  0  0  1 |  0  0 | 1
 0  0  1  x |  0  1 | 1
 0  1  x  x |  1  0 | 1
 1  x  x  x |  1  1 | 1
VHDL code for a Priority Encoder

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY priority IS
  PORT ( w : IN  STD_LOGIC_VECTOR(3 DOWNTO 0) ;
         y : OUT STD_LOGIC_VECTOR(1 DOWNTO 0) ;
         z : OUT STD_LOGIC ) ;
END priority ;
ARCHITECTURE dataflow OF priority IS
BEGIN
  y <= "11" WHEN w(3) = '1' ELSE
       "10" WHEN w(2) = '1' ELSE
       "01" WHEN w(1) = '1' ELSE
       "00" ;
  z <= '0' WHEN w = "0000" ELSE '1' ;
END dataflow ;
2-to-4 Decoder
[Figure: (a) truth table with inputs En, w1, w0 and outputs y0, y1, y2, y3; (b) graphical symbol of the 2-to-4 decoder.]
VHDL code for a 2-to-4 Decoder

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY dec2to4 IS
  PORT ( w  : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
         En : IN  STD_LOGIC ;
         y  : OUT STD_LOGIC_VECTOR(3 DOWNTO 0) ) ;
END dec2to4 ;
ARCHITECTURE dataflow OF dec2to4 IS
  SIGNAL Enw : STD_LOGIC_VECTOR(2 DOWNTO 0) ;
BEGIN
  Enw <= En & w ;
  WITH Enw SELECT
    y <= "0001" WHEN "100",
         "0010" WHEN "101",
         "0100" WHEN "110",
         "1000" WHEN "111",
         "0000" WHEN OTHERS ;
END dataflow ;
N-bit register with enable

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY regn IS
  GENERIC ( N : INTEGER := 8 ) ;
  PORT ( D             : IN  STD_LOGIC_VECTOR(N-1 DOWNTO 0) ;
         Enable, Clock : IN  STD_LOGIC ;
         Q             : OUT STD_LOGIC_VECTOR(N-1 DOWNTO 0) ) ;
END regn ;
ARCHITECTURE Behavior OF regn IS
BEGIN
  PROCESS (Clock)
  BEGIN
    IF (Clock'EVENT AND Clock = '1' ) THEN
      IF Enable = '1' THEN
        Q <= D ;
      END IF ;
    END IF;
  END PROCESS ;
END Behavior ;
Circuit built of medium scale components

[Figure: the same circuit with its signals labeled: inputs r(0) to r(5), s(0), s(1), En and Clk; internal signals p(0) to p(3), q(0), q(1), ena and z(0) to z(3); register outputs t(0) to t(3).]
Structural description example (1)

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY priority_resolver IS
  PORT (r   : IN  STD_LOGIC_VECTOR(5 DOWNTO 0) ;
        s   : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
        clk : IN  STD_LOGIC;
        en  : IN  STD_LOGIC;
        t   : OUT STD_LOGIC_VECTOR(3 DOWNTO 0) ) ;
END priority_resolver;
ARCHITECTURE structural OF priority_resolver IS
  SIGNAL p   : STD_LOGIC_VECTOR (3 DOWNTO 0) ;
  SIGNAL q   : STD_LOGIC_VECTOR (1 DOWNTO 0) ;
  SIGNAL z   : STD_LOGIC_VECTOR (3 DOWNTO 0) ;
  SIGNAL ena : STD_LOGIC ;

  COMPONENT mux2to1
    PORT (w0, w1, s : IN  STD_LOGIC ;
          f         : OUT STD_LOGIC ) ;
  END COMPONENT ;
  COMPONENT priority
    PORT (w : IN  STD_LOGIC_VECTOR(3 DOWNTO 0) ;
          y : OUT STD_LOGIC_VECTOR(1 DOWNTO 0) ;
          z : OUT STD_LOGIC ) ;
  END COMPONENT ;
  COMPONENT dec2to4
    PORT (w  : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
          En : IN  STD_LOGIC ;
          y  : OUT STD_LOGIC_VECTOR(3 DOWNTO 0) ) ;
  END COMPONENT ;

  COMPONENT regn
    GENERIC ( N : INTEGER := 8 ) ;
    PORT ( D             : IN  STD_LOGIC_VECTOR(N-1 DOWNTO 0) ;
           Enable, Clock : IN  STD_LOGIC ;
           Q             : OUT STD_LOGIC_VECTOR(N-1 DOWNTO 0) ) ;
  END COMPONENT ;

BEGIN
  u1: mux2to1 PORT MAP (w0 => r(0) ,
                        w1 => r(1),
                        s => s(0),
                        f => p(0));
  p(1) <= r(2);
  p(2) <= r(3);
  u2: mux2to1 PORT MAP (w0 => r(4) ,
                        w1 => r(5),
                        s => s(1),
                        f => p(3));
  u3: priority PORT MAP (w => p,
                         y => q,
                         z => ena);
  u4: dec2to4 PORT MAP (w => q,
                        En => ena,
                        y => z);

  u5: regn
    GENERIC MAP (N => 4)
    PORT MAP (D => z ,
              Enable => En ,
              Clock => Clk,
              Q => t );
END structural;
Named association connectivity

recommended in the majority of cases;
prevents omissions and mistakes
COMPONENT dec2to4
  PORT (w  : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
        En : IN  STD_LOGIC ;
        y  : OUT STD_LOGIC_VECTOR(0 TO 3) ) ;
END COMPONENT ;

u4: dec2to4 PORT MAP (w => q,
                      En => ena,
                      y => z);
Positional association connectivity

allowed, especially in the cases of
- a small number of ports
- multiple instantiations of the same component,
  in regular structures
COMPONENT dec2to4
  PORT (w  : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
        En : IN  STD_LOGIC ;
        y  : OUT STD_LOGIC_VECTOR(0 TO 3) ) ;
END COMPONENT ;
u4: dec2to4 PORT MAP (w, En, y);
Structural description with positional association connectivity

BEGIN
  u1: mux2to1 PORT MAP (r(0), r(1), s(0), p(0));
  p(1) <= r(2);
  p(2) <= r(3);
  u2: mux2to1 PORT MAP (r(4) , r(5), s(1), p(3));
  u3: priority PORT MAP (p, q, ena);
  u4: dec2to4 PORT MAP (q, ena, z);
  u5: regn GENERIC MAP(4) PORT MAP (z, En, Clk, t);
END structural;
Package example (1)

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
PACKAGE GatesPkg IS
  COMPONENT mux2to1
    PORT (w0, w1, s : IN  STD_LOGIC ;
          f         : OUT STD_LOGIC ) ;
  END COMPONENT ;
  COMPONENT priority
    PORT (w : IN  STD_LOGIC_VECTOR(3 DOWNTO 0) ;
          y : OUT STD_LOGIC_VECTOR(1 DOWNTO 0) ;
          z : OUT STD_LOGIC ) ;
  END COMPONENT ;
Package example (2)

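  COMPONENT dec2to4
    PORT (w  : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
          En : IN  STD_LOGIC ;
          y  : OUT STD_LOGIC_VECTOR(0 TO 3) ) ;
  END COMPONENT ;
  COMPONENT regn
    GENERIC ( N : INTEGER := 8 ) ;
    PORT ( D             : IN  STD_LOGIC_VECTOR(N-1 DOWNTO 0) ;
           Enable, Clock : IN  STD_LOGIC ;
           Q             : OUT STD_LOGIC_VECTOR(N-1 DOWNTO 0) ) ;
  END COMPONENT ;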
Package example (3)

constant ADDAB : std_logic_vector(3 downto 0) := "0000";
constant ADDAM : std_logic_vector(3 downto 0) := "0001";
constant SUBAB : std_logic_vector(3 downto 0) := "0010";
constant SUBAM : std_logic_vector(3 downto 0) := "0011";
constant NOTA : std_logic_vector(3 downto 0) := "0100";
constant NOTB : std_logic_vector(3 downto 0) := "0101";
constant NOTM : std_logic_vector(3 downto 0) := "0110";
constant ANDAB : std_logic_vector(3 downto 0) := "0111";
END GatesPkg;
Package usage (1)

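LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
USE work.GatesPkg.all;
ENTITY priority_resolver IS
  PORT (r   : IN  STD_LOGIC_VECTOR(5 DOWNTO 0) ;
        s   : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
        clk : IN  STD_LOGIC;
        en  : IN  STD_LOGIC;
        t   : OUT STD_LOGIC_VECTOR(3 DOWNTO 0) ) ;
END priority_resolver;
ARCHITECTURE structural OF priority_resolver IS
  SIGNAL p   : STD_LOGIC_VECTOR (3 DOWNTO 0) ;
  SIGNAL q   : STD_LOGIC_VECTOR (1 DOWNTO 0) ;
  SIGNAL z   : STD_LOGIC_VECTOR (3 DOWNTO 0) ;
  SIGNAL ena : STD_LOGIC ;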
Package usage (2)

BEGIN
  u1: mux2to1 PORT MAP (w0 => r(0) ,
                        w1 => r(1),
                        s => s(0),
                        f => p(0));
  p(1) <= r(2);
  p(2) <= r(3);
  u2: mux2to1 PORT MAP (w0 => r(4) ,
                        w1 => r(5),
                        s => s(1),
                        f => p(3));
  u3: priority PORT MAP (w => p,
                         y => q,
                         z => ena);
  u4: dec2to4 PORT MAP (w => q,
                        En => ena,
                        y => z);
Package usage (3)

  u5: regn
    GENERIC MAP (N => 4)
    PORT MAP (D => z ,
              Enable => En ,
              Clock => Clk,
              Q => t );
END structural;
Configuration declaration
CONFIGURATION SimpleCfg OF priority_resolver IS
FOR structural
FOR ALL: mux2to1
USE ENTITY work.mux2to1(dataflow);
END FOR;
FOR u3: priority
USE ENTITY work.priority(dataflow);
END FOR;
FOR u4: dec2to4
USE ENTITY work.dec2to4(dataflow);
END FOR;
END FOR;
END SimpleCfg;
Configuration specification
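LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
USE work.GatesPkg.all;
ENTITY priority_resolver IS
  PORT (r : IN  STD_LOGIC_VECTOR(5 DOWNTO 0) ;
        s : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
        z : OUT STD_LOGIC_VECTOR(3 DOWNTO 0) ) ;
END priority_resolver;
ARCHITECTURE structural OF priority_resolver IS
  SIGNAL p : STD_LOGIC_VECTOR (3 DOWNTO 0) ;
  SIGNAL q : STD_LOGIC_VECTOR (1 DOWNTO 0) ;
  SIGNAL ena : STD_LOGIC ;
  FOR ALL: mux2to1 USE ENTITY work.mux2to1(dataflow);
  FOR u3: priority USE ENTITY work.priority(dataflow);
  FOR u4: dec2to4 USE ENTITY work.dec2to4(dataflow);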
Example 1
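[Figure: 16-to-1 multiplexer built from five 4-to-1 multiplexers. Inputs w0 to w15 feed four first-stage multiplexers selected by s0 and s1; a fifth multiplexer selected by s2 and s3 produces the output f.]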
A 4-to-1 Multiplexer
LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY mux4to1 IS
  PORT ( w0, w1, w2, w3 : IN  STD_LOGIC ;
         s              : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
         f              : OUT STD_LOGIC ) ;
END mux4to1 ;
ARCHITECTURE Dataflow OF mux4to1 IS
BEGIN
  WITH s SELECT
    f <= w0 WHEN "00",
         w1 WHEN "01",
         w2 WHEN "10",
         w3 WHEN OTHERS ;
END Dataflow ;
Straightforward code for Example 1

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY Example1 IS
  PORT ( w : IN  STD_LOGIC_VECTOR(0 TO 15) ;
         s : IN  STD_LOGIC_VECTOR(3 DOWNTO 0) ;
         f : OUT STD_LOGIC ) ;
END Example1 ;
Straightforward code for Example 1

ARCHITECTURE Structure OF Example1 IS
  COMPONENT mux4to1
    PORT ( w0, w1, w2, w3 : IN  STD_LOGIC ;
           s              : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
           f              : OUT STD_LOGIC ) ;
  END COMPONENT ;
  SIGNAL m : STD_LOGIC_VECTOR(0 TO 3) ;
BEGIN
  Mux1: mux4to1 PORT MAP ( w(0),  w(1),  w(2),  w(3),  s(1 DOWNTO 0), m(0) ) ;
  Mux2: mux4to1 PORT MAP ( w(4),  w(5),  w(6),  w(7),  s(1 DOWNTO 0), m(1) ) ;
  Mux3: mux4to1 PORT MAP ( w(8),  w(9),  w(10), w(11), s(1 DOWNTO 0), m(2) ) ;
  Mux4: mux4to1 PORT MAP ( w(12), w(13), w(14), w(15), s(1 DOWNTO 0), m(3) ) ;
  Mux5: mux4to1 PORT MAP ( m(0),  m(1),  m(2),  m(3),  s(3 DOWNTO 2), f ) ;
END Structure ;
Modified code for Example 1

ARCHITECTURE Structure OF Example1 IS
  COMPONENT mux4to1
    PORT ( w0, w1, w2, w3 : IN  STD_LOGIC ;
           s              : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
           f              : OUT STD_LOGIC ) ;
  END COMPONENT ;
  SIGNAL m : STD_LOGIC_VECTOR(0 TO 3) ;
BEGIN
  G1: FOR i IN 0 TO 3 GENERATE
    Muxes: mux4to1 PORT MAP (
      w(4*i), w(4*i+1), w(4*i+2), w(4*i+3), s(1 DOWNTO 0), m(i) ) ;
  END GENERATE ;
  Mux5: mux4to1 PORT MAP ( m(0), m(1), m(2), m(3), s(3 DOWNTO 2), f ) ;
END Structure ;
Example 2
[Figure: 4-to-16 decoder built as a tree of five 2-to-4 decoders. One decoder, enabled by En, decodes two of the inputs w0 to w3 and enables one of four second-stage decoders, which decode the other two inputs and drive the outputs y0 to y15.]
A 2-to-4 binary decoder

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY dec2to4 IS
  PORT ( w  : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
         En : IN  STD_LOGIC ;
         y  : OUT STD_LOGIC_VECTOR(0 TO 3) ) ;
END dec2to4 ;
ARCHITECTURE Dataflow OF dec2to4 IS
  SIGNAL Enw : STD_LOGIC_VECTOR(2 DOWNTO 0) ;
BEGIN
  Enw <= En & w ;
  WITH Enw SELECT
    y <= "1000" WHEN "100",
         "0100" WHEN "101",
         "0010" WHEN "110",
         "0001" WHEN "111",
         "0000" WHEN OTHERS ;
END Dataflow ;
VHDL code for Example 2 (1)

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY dec4to16 IS
  PORT (w  : IN  STD_LOGIC_VECTOR(3 DOWNTO 0) ;
        En : IN  STD_LOGIC ;
        y  : OUT STD_LOGIC_VECTOR(0 TO 15) ) ;
END dec4to16 ;
VHDL code for Example 2 (2)

ARCHITECTURE Structure OF dec4to16 IS
  COMPONENT dec2to4
    PORT ( w  : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
           En : IN  STD_LOGIC ;
           y  : OUT STD_LOGIC_VECTOR(0 TO 3) ) ;
  END COMPONENT ;
  SIGNAL m : STD_LOGIC_VECTOR(0 TO 3) ;
BEGIN
  G1: FOR i IN 0 TO 3 GENERATE
    Dec_ri: dec2to4 PORT MAP ( w(1 DOWNTO 0), m(i), y(4*i TO 4*i+3) );
    G2: IF i=3 GENERATE
      Dec_left: dec2to4 PORT MAP ( w(i DOWNTO i-1), En, m ) ;
    END GENERATE ;
  END GENERATE ;
END Structure ;
Mixed Style Modeling

architecture ARCHITECTURE_NAME of ENTITY_NAME is
  Here you can declare signals, constants, functions, procedures
  Component declarations
  No variable declarations !!
begin
  Concurrent statements:
    Concurrent simple signal assignment
    Conditional signal assignment
    Selected signal assignment
    Generate statement
    Component instantiation statement
    Process statement
      inside process you can use only sequential statements
end ARCHITECTURE_NAME;
Anatomy of a Process
[label:] process [(sensitivity list)]
  [declaration part]
begin
  statement part
end process [label];
Items in square brackets are OPTIONAL.
Statement Part
Contains Sequential Statements to be
Executed Each Time the Process Is
Activated
Analogous to Conventional Programming
Languages
What is a PROCESS?
A process is a sequence of instructions referred to as
sequential statements.
A process can be given a unique name
using an optional LABEL.
This is followed by the keyword
PROCESS.
The keyword BEGIN is used to indicate
the start of the process.
All statements within the process are
executed SEQUENTIALLY. Hence, the
order of statements is important.
A process must end with the keywords
END PROCESS.

Testing: PROCESS
BEGIN
  test_vector <= "00";
  WAIT FOR 10 ns;
  test_vector <= "01";
  WAIT FOR 10 ns;
  test_vector <= "10";
  WAIT FOR 10 ns;
  test_vector <= "11";
  WAIT FOR 10 ns;
END PROCESS;
Execution of statements in a PROCESS
The execution of statements
continues sequentially till the
last statement in the process.
After execution of the last
statement, the control is again
passed to the beginning of the
process, i.e. to the first
statement after BEGIN.

Testing: PROCESS
BEGIN
  test_vector <= "00";
  WAIT FOR 10 ns;
  test_vector <= "01";
  WAIT FOR 10 ns;
  test_vector <= "10";
  WAIT FOR 10 ns;
  test_vector <= "11";
  WAIT FOR 10 ns;
END PROCESS;
PROCESS with a WAIT Statement
The last statement in the
PROCESS is a WAIT instead of
WAIT FOR 10 ns.
This will cause the PROCESS
to suspend indefinitely when
the WAIT statement is
executed.
This form of WAIT can be used
in a process included in a
testbench when all possible
combinations of inputs have
been tested or a non-periodical
signal has to be generated.

Testing: PROCESS
BEGIN
  test_vector <= "00";
  WAIT FOR 10 ns;
  test_vector <= "01";
  WAIT FOR 10 ns;
  test_vector <= "10";
  WAIT FOR 10 ns;
  test_vector <= "11";
  WAIT;   -- program execution stops here
END PROCESS;
WAIT FOR vs. WAIT

WAIT FOR: waveform will keep repeating
itself forever
WAIT: waveform will keep its state after
the last wait instruction.
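
The difference can be sketched with two small stimulus processes side by side. This is a minimal illustration, not part of the original slides: the entity name wait_demo and the signal names repeating and one_shot are hypothetical.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.all;

ENTITY wait_demo IS        -- hypothetical testbench-style entity, no ports
END wait_demo;

ARCHITECTURE behavioral OF wait_demo IS
  SIGNAL repeating : STD_LOGIC := '0';
  SIGNAL one_shot  : STD_LOGIC := '0';
BEGIN
  -- Ends with WAIT FOR: after the last statement the process restarts,
  -- so the 1/0 pattern on "repeating" is produced again and again.
  rep: PROCESS
  BEGIN
    repeating <= '1';
    WAIT FOR 10 ns;
    repeating <= '0';
    WAIT FOR 10 ns;
  END PROCESS;

  -- Ends with a plain WAIT: the process suspends forever,
  -- so "one_shot" pulses once and then keeps its last value ('0').
  once: PROCESS
  BEGIN
    one_shot <= '1';
    WAIT FOR 10 ns;
    one_shot <= '0';
    WAIT;
  END PROCESS;
END behavioral;
```

Because the first process never suspends for good, a simulation of this sketch has to be stopped by a simulator run-time limit.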
Sequential Statements (1)

If Statement
if boolean expression then
  statements
elsif boolean expression then
  statements
else
  statements
end if;
else and elsif are optional
If Statement - Example
SELECTOR: process
begin
  WAIT UNTIL Clock'EVENT AND Clock = '1' ;
  IF Sel = "00" THEN
    f <= x1;
  ELSIF Sel = "10" THEN
    f <= x2;
  ELSE
    f <= x3;
  END IF;
end process;
Loop Statement
FOR i IN range LOOP
  statements
END LOOP;
Repeats a Section of VHDL Code

Example: process every element in an array in
the same way
Loop Statement Example (1)
Testing: PROCESS
BEGIN
  test_vector <= "000";
  FOR i IN 0 TO 7 LOOP
    WAIT FOR 10 ns;
    test_vector <= test_vector + "001";
  END LOOP;
END PROCESS;
Loop Statement Example (2)

Testing: PROCESS
BEGIN
  test_ab <= "00";
  test_sel <= "00";
  FOR i IN 0 TO 3 LOOP
    FOR j IN 0 TO 3 LOOP
      WAIT FOR 10 ns;
      test_ab <= test_ab + "01";
    END LOOP;
    test_sel <= test_sel + "01";
  END LOOP;
END PROCESS;
PROCESS with a SENSITIVITY LIST

List of signals to which the
process is sensitive.
Whenever there is an
event on any of the
signals in the sensitivity
list, the process fires.
Every time the process
fires, it will run in its
entirety.
WAIT statements are
NOT ALLOWED in a
process with a
SENSITIVITY LIST.
label: process (sensitivity list)

declaration part
begin
statement part
end process;
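
A process with a sensitivity list behaves like a process without one that ends in a WAIT ON statement naming the same signals. The sketch below places both forms side by side; the entity name and2_demo and its ports are hypothetical and serve only as an illustration.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.all;

ENTITY and2_demo IS        -- hypothetical entity, for illustration only
  PORT ( a, b   : IN  STD_LOGIC;
         y1, y2 : OUT STD_LOGIC );
END and2_demo;

ARCHITECTURE behavioral OF and2_demo IS
BEGIN
  -- Form 1: sensitivity list; fires on any event on a or b
  -- and runs in its entirety each time.
  with_list: PROCESS (a, b)
  BEGIN
    y1 <= a AND b;
  END PROCESS;

  -- Form 2: no sensitivity list; the WAIT ON as the last statement
  -- suspends the process until an event occurs on a or b.
  with_wait: PROCESS
  BEGIN
    y2 <= a AND b;
    WAIT ON a, b;
  END PROCESS;
END behavioral;
```

In simulation y1 and y2 should always carry the same value; the first form is the one commonly used in code written for synthesis.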
Generating selected values of one input

SIGNAL test_vector : STD_LOGIC_VECTOR(2 downto 0);
BEGIN
.......
testing: PROCESS
BEGIN
WAIT FOR 10 ns;
WAIT FOR 10 ns;
WAIT FOR 10 ns;
WAIT FOR 10 ns;
WAIT FOR 10 ns;
END PROCESS;
........
END behavioral;
Generating all values of one input

SIGNAL test_vector : STD_LOGIC_VECTOR(3 downto 0):="0000";
BEGIN
.......
testing: PROCESS
BEGIN
WAIT FOR 10 ns;
test_vector <= test_vector + 1;
end process TESTING;
........
END behavioral;
Generating all possible values of two inputs

SIGNAL test_ab : STD_LOGIC_VECTOR(1 downto 0);
SIGNAL test_sel : STD_LOGIC_VECTOR(1 downto 0);
BEGIN
.......
double_loop: PROCESS
BEGIN
test_ab <="00";
test_sel <="00";
for I in 0 to 3 loop
for J in 0 to 3 loop
wait for 10 ns;
test_ab <= test_ab + 1;
end loop;
test_sel <= test_sel + 1;
end loop;
END PROCESS;
........
END behavioral;
Generating periodical signals, such as clocks

CONSTANT clk1_period : TIME := 20 ns;
CONSTANT clk2_period : TIME := 200 ns;
SIGNAL clk1 : STD_LOGIC;
SIGNAL clk2 : STD_LOGIC := '0';
BEGIN
.......
clk1_generator: PROCESS
BEGIN
  clk1 <= '0';
  WAIT FOR clk1_period/2;
  clk1 <= '1';
  WAIT FOR clk1_period/2;
END PROCESS;
clk2 <= not clk2 after clk2_period/2;
.......
END behavioral;
Generating one-time signals, such as resets

CONSTANT reset1_width : TIME := 100 ns;
CONSTANT reset2_width : TIME := 150 ns;
SIGNAL reset1 : STD_LOGIC;
SIGNAL reset2 : STD_LOGIC := '1';
BEGIN
.......
reset1_generator: PROCESS
BEGIN
  reset1 <= '1';
  WAIT FOR reset1_width;
  reset1 <= '0';
  WAIT;
END PROCESS;
reset2_generator: PROCESS
BEGIN
  WAIT FOR reset2_width;
  reset2 <= '0';
  WAIT;
END PROCESS;
.......
END behavioral;
Typical error
SIGNAL test_vector : STD_LOGIC_VECTOR(2 downto 0);
SIGNAL reset : STD_LOGIC;
BEGIN
.......
generator1: PROCESS
BEGIN
  reset <= '1';
  WAIT FOR 100 ns;
  reset <= '0';
  test_vector <= "000";
  WAIT;
END PROCESS;
generator2: PROCESS
BEGIN
  WAIT FOR 200 ns;
  WAIT FOR 600 ns;
END PROCESS;
.......
END behavioral;
Register Transfer Level (RTL) Design Description
[Figure: blocks of combinational logic separated by registers.]
Component Equivalent of a Process

priority: PROCESS (clk)
BEGIN
IF w(3) = '1' THEN
y <= "11" ;
ELSIF w(2) = '1' THEN
y <= "10" ;
ELSIF w(1) = c THEN
y <= a and b;
ELSE
z <= "00" ;
END IF ;
END PROCESS ;
[Figure: component "priority" with inputs clk, w, a, b, c and output y.]
All signals which appear on the
left of signal assignment
statement (<=) are outputs, e.g.
y, z
All signals which appear on the
right of signal assignment
statement (<=) or in logic
expressions are inputs, e.g. w, a,
b, c
All signals which appear in the
sensitivity list are inputs, e.g. clk
Note that not all inputs need to
be included in the sensitivity list
Processes in VHDL
Processes Describe Sequential Behavior
Processes in VHDL Are Very Powerful
Statements
They allow defining arbitrary behavior that may
be difficult to represent by a real circuit
Not every process can be synthesized
Use Processes with Caution in the Code to
Be Synthesized
Use Processes Freely in Testbenches and
algorithm specifications
D latch
[Figure: graphical symbol of the D latch with inputs D and Clock and output Q.]
Truth table
Clock D | Q(t+1)
  0   x | Q(t)
  1   0 | 0
  1   1 | 1
[Figure: timing diagram of Clock, D and Q at times t1 to t4.]
D flip-flop
[Figure: graphical symbol of the D flip-flop with inputs D and Clock and output Q.]
Truth table
Clk D | Q(t+1)
 ↑  0 | 0
 ↑  1 | 1
 0  x | Q(t)
 1  x | Q(t)
[Figure: timing diagram of Clock, D and Q at times t1 to t4.]
D latch
LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY latch IS
  PORT ( D, Clock : IN  STD_LOGIC ;
         Q        : OUT STD_LOGIC) ;
END latch ;
ARCHITECTURE Behavior OF latch IS
BEGIN
  PROCESS ( D, Clock )
  BEGIN
    IF Clock = '1' THEN
      Q <= D ;
    END IF ;
  END PROCESS ;
END Behavior;
D flip-flop (1)
LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY flipflop IS
  PORT ( D, Clock : IN  STD_LOGIC ;
         Q        : OUT STD_LOGIC) ;
END flipflop ;
ARCHITECTURE Behavior_1 OF flipflop IS
BEGIN
  PROCESS ( Clock )
  BEGIN
    IF Clock'EVENT AND Clock = '1' THEN
      Q <= D ;
    END IF ;
  END PROCESS ;
END Behavior_1 ;
D flip-flop (2)
LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY flipflop IS
  PORT ( D, Clock : IN  STD_LOGIC ;
         Q        : OUT STD_LOGIC) ;
END flipflop ;
ARCHITECTURE Behavior_1 OF flipflop IS
BEGIN
  PROCESS ( Clock )
  BEGIN
    IF rising_edge(Clock) THEN
      Q <= D ;
    END IF ;
  END PROCESS ;
END Behavior_1 ;
D flip-flop (3)
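LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY flipflop IS
  PORT ( D, Clock : IN  STD_LOGIC ;
         Q        : OUT STD_LOGIC) ;
END flipflop ;
ARCHITECTURE Behavior_2 OF flipflop IS
BEGIN
  PROCESS
  BEGIN
    WAIT UNTIL Clock'EVENT AND Clock = '1' ;
    Q <= D ;
  END PROCESS ;
END Behavior_2 ;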
D flip-flop (4)
LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY flipflop IS
  PORT ( D, Clock : IN  STD_LOGIC ;
         Q        : OUT STD_LOGIC) ;
END flipflop ;
ARCHITECTURE Behavior_2 OF flipflop IS
BEGIN
  PROCESS
  BEGIN
    WAIT UNTIL rising_edge(Clock) ;
    Q <= D ;
  END PROCESS ;
END Behavior_2 ;
D flip-flop with asynchronous reset

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY flipflop IS
  PORT ( D, Resetn, Clock : IN  STD_LOGIC ;
         Q                : OUT STD_LOGIC) ;
END flipflop ;
ARCHITECTURE Behavior OF flipflop IS
BEGIN
  PROCESS ( Resetn, Clock )
  BEGIN
    IF Resetn = '0' THEN
      Q <= '0' ;
    ELSIF Clock'EVENT AND Clock = '1' THEN
      Q <= D ;
    END IF ;
  END PROCESS ;
END Behavior ;
D flip-flop with synchronous reset

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY flipflop IS
  PORT ( D, Resetn, Clock : IN  STD_LOGIC ;
         Q                : OUT STD_LOGIC) ;
END flipflop ;
ARCHITECTURE Behavior OF flipflop IS
BEGIN
  PROCESS
  BEGIN
    WAIT UNTIL Clock'EVENT AND Clock = '1' ;
    IF Resetn = '0' THEN
      Q <= '0' ;
    ELSE
      Q <= D ;
    END IF ;
  END PROCESS ;
END Behavior ;
8-bit register with asynchronous reset

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY reg8 IS
  PORT ( D             : IN  STD_LOGIC_VECTOR(7 DOWNTO 0) ;
         Resetn, Clock : IN  STD_LOGIC ;
         Q             : OUT STD_LOGIC_VECTOR(7 DOWNTO 0) ) ;
END reg8 ;
ARCHITECTURE Behavior OF reg8 IS
BEGIN
  PROCESS ( Resetn, Clock )
  BEGIN
    IF Resetn = '0' THEN
      Q <= "00000000" ;
    ELSIF Clock'EVENT AND Clock = '1' THEN
      Q <= D ;
    END IF ;
  END PROCESS ;
END Behavior ;
N-bit register with asynchronous reset

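LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY regn IS
  GENERIC ( N : INTEGER := 16 ) ;
  PORT ( D             : IN  STD_LOGIC_VECTOR(N-1 DOWNTO 0) ;
         Resetn, Clock : IN  STD_LOGIC ;
         Q             : OUT STD_LOGIC_VECTOR(N-1 DOWNTO 0) ) ;
END regn ;
ARCHITECTURE Behavior OF regn IS
BEGIN
  PROCESS ( Resetn, Clock )
  BEGIN
    IF Resetn = '0' THEN
      Q <= (OTHERS => '0') ;
    ELSIF Clock'EVENT AND Clock = '1' THEN
      Q <= D ;
    END IF ;
  END PROCESS ;
END Behavior ;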
N-bit register with enable

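LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY regn IS
  GENERIC ( N : INTEGER := 8 ) ;
  PORT ( D             : IN  STD_LOGIC_VECTOR(N-1 DOWNTO 0) ;
         Enable, Clock : IN  STD_LOGIC ;
         Q             : OUT STD_LOGIC_VECTOR(N-1 DOWNTO 0) ) ;
END regn ;
ARCHITECTURE Behavior OF regn IS
BEGIN
  PROCESS (Clock)
  BEGIN
    IF (Clock'EVENT AND Clock = '1' ) THEN
      IF Enable = '1' THEN
        Q <= D ;
      END IF ;
    END IF;
  END PROCESS ;
END Behavior ;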
2-bit up-counter with synchronous reset

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
USE ieee.std_logic_unsigned.all ;
ENTITY upcount IS
  PORT ( Clear, Clock : IN     STD_LOGIC ;
         Q            : BUFFER STD_LOGIC_VECTOR(1 DOWNTO 0) ) ;
END upcount ;
ARCHITECTURE Behavior OF upcount IS
BEGIN
  upcount: PROCESS ( Clock )
  BEGIN
    IF (Clock'EVENT AND Clock = '1') THEN
      IF Clear = '1' THEN
        Q <= "00" ;
      ELSE
        Q <= Q + "01" ;
      END IF ;
    END IF;
  END PROCESS;
END Behavior ;
4-bit up-counter with asynchronous reset (1)

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
USE ieee.std_logic_unsigned.all ;
ENTITY upcount IS
  PORT ( Clock, Resetn, Enable : IN  STD_LOGIC ;
         Q                     : OUT STD_LOGIC_VECTOR (3 DOWNTO 0)) ;
END upcount ;
4-bit up-counter with asynchronous reset (2)

ARCHITECTURE Behavior OF upcount IS
  SIGNAL Count : STD_LOGIC_VECTOR (3 DOWNTO 0) ;
BEGIN
  PROCESS ( Clock, Resetn )
  BEGIN
    IF Resetn = '0' THEN
      Count <= "0000" ;
    ELSIF (Clock'EVENT AND Clock = '1') THEN
      IF Enable = '1' THEN
        Count <= Count + 1 ;
      END IF ;
    END IF ;
  END PROCESS ;
  Q <= Count ;
END Behavior ;
Shift register
[Figure: 4-bit shift register. The serial input Sin enters at Q(3) and shifts toward Q(0); all stages share Clock and Enable.]
Shift Register With Parallel Load

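[Figure: 4-bit shift register with parallel load. Load selects between the parallel inputs D(3) to D(0) and the shifted data with serial input Sin; all stages share Clock and Enable; outputs Q(3) to Q(0).]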
4-bit shift register with parallel load (1)

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY shift4 IS
  PORT ( D      : IN     STD_LOGIC_VECTOR(3 DOWNTO 0) ;
         Enable : IN     STD_LOGIC ;
         Load   : IN     STD_LOGIC ;
         Sin    : IN     STD_LOGIC ;
         Clock  : IN     STD_LOGIC ;
         Q      : BUFFER STD_LOGIC_VECTOR(3 DOWNTO 0) ) ;
END shift4 ;
4-bit shift register with parallel load (2)

ARCHITECTURE Behavior_1 OF shift4 IS
BEGIN
  PROCESS (Clock)
  BEGIN
    IF Clock'EVENT AND Clock = '1' THEN
      IF Load = '1' THEN
        Q <= D ;
      ELSIF Enable = '1' THEN
        Q(0) <= Q(1) ;
        Q(1) <= Q(2);
        Q(2) <= Q(3) ;
        Q(3) <= Sin;
      END IF ;
    END IF ;
  END PROCESS ;
END Behavior_1 ;
N-bit shift register with parallel load (1)

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY shiftn IS
  GENERIC ( N : INTEGER := 8 ) ;
  PORT ( D      : IN     STD_LOGIC_VECTOR(N-1 DOWNTO 0) ;
         Enable : IN     STD_LOGIC ;
         Load   : IN     STD_LOGIC ;
         Sin    : IN     STD_LOGIC ;
         Clock  : IN     STD_LOGIC ;
         Q      : BUFFER STD_LOGIC_VECTOR(N-1 DOWNTO 0) ) ;
END shiftn ;
N-bit shift register with parallel load (2)

ARCHITECTURE Behavior OF shiftn IS
BEGIN
  PROCESS (Clock)
  BEGIN
    IF (Clock'EVENT AND Clock = '1') THEN
      IF Load = '1' THEN
        Q <= D ;
      ELSIF Enable = '1' THEN
        Genbits: FOR i IN 0 TO N-2 LOOP
          Q(i) <= Q(i+1) ;
        END LOOP ;
        Q(N-1) <= Sin ;
      END IF;
    END IF ;
  END PROCESS ;
END Behavior ;
Variable Example (1)

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY Numbits IS
  PORT ( X     : IN  STD_LOGIC_VECTOR(1 TO 3) ;
         Count : OUT INTEGER RANGE 0 TO 3) ;
END Numbits ;
Variable Example (2)

ARCHITECTURE Behavior OF Numbits IS
BEGIN
  PROCESS(X) -- count the number of bits in X equal to 1
    VARIABLE Tmp: INTEGER;
  BEGIN
    Tmp := 0;
    FOR i IN 1 TO 3 LOOP
      IF X(i) = '1' THEN
        Tmp := Tmp + 1;
      END IF;
    END LOOP;
    Count <= Tmp;
  END PROCESS;
END Behavior ;
Variables - features
Can only be declared within processes and
subprograms (functions & procedures)
Initial value can be explicitly specified in the
declaration
When assigned take an assigned value
immediately
Variable assignments represent the desired
behavior, not the structure of the circuit
Should be avoided, or at least used with
caution in a synthesizable code
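
The immediate update of a variable, as opposed to the scheduled update of a signal, can be sketched in a clocked process. This is only an illustration under stated assumptions: the entity name var_vs_sig and all port and object names are hypothetical.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.all;

ENTITY var_vs_sig IS       -- hypothetical entity, for illustration only
  PORT ( clk   : IN  STD_LOGIC;
         d     : IN  STD_LOGIC;
         q_var : OUT STD_LOGIC;
         q_sig : OUT STD_LOGIC );
END var_vs_sig;

ARCHITECTURE behavior OF var_vs_sig IS
  SIGNAL s : STD_LOGIC;
BEGIN
  PROCESS (clk)
    VARIABLE v : STD_LOGIC;  -- variables are declared inside the process
  BEGIN
    IF rising_edge(clk) THEN
      v := d;        -- variable: takes the new value immediately
      q_var <= v;    -- so q_var receives the current d (one register stage)

      s <= d;        -- signal: update is scheduled, s still holds its old value here
      q_sig <= s;    -- so q_sig receives the previous d (two register stages)
    END IF;
  END PROCESS;
END behavior;
```

The two outputs differ by one clock cycle even though the two pairs of statements look alike, which is one reason for using variables with caution in code meant for synthesis.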
Delays
Delays are not synthesizable
Statements, such as
wait for 5 ns
a <= b after 10 ns
will not produce the required delay, and
should not be used in the code intended
for synthesis.
Initializations
Declarations of signals (and variables)
with initialized values, such as
SIGNAL a : STD_LOGIC := '0';
cannot be synthesized, and thus should
be avoided.
If present, they will be ignored by the
synthesis tools.
Use set and reset signals instead.
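
One way to follow this advice is to give the register an explicit reset instead of relying on a declaration initializer. The sketch below reuses the asynchronous-reset pattern from the flip-flop slides; the entity name init_by_reset and its port names are hypothetical.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.all;

ENTITY init_by_reset IS    -- hypothetical entity, for illustration only
  PORT ( clk, resetn, d : IN  STD_LOGIC;
         a              : OUT STD_LOGIC );
END init_by_reset;

ARCHITECTURE behavior OF init_by_reset IS
BEGIN
  PROCESS (resetn, clk)
  BEGIN
    IF resetn = '0' THEN
      a <= '0';                 -- known starting value comes from the reset,
                                -- not from an initializer in the declaration
    ELSIF rising_edge(clk) THEN
      a <= d;
    END IF;
  END PROCESS;
END behavior;
```

With this form, simulation and the synthesized circuit start from the same value as soon as resetn has been asserted once.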
Reports and asserts

Reports and asserts, such as
report "Initialization complete";
assert initial_value <= max_value
report "initial value too large"
severity error;
cannot be synthesized, but they
can be freely used in the code intended for
synthesis.
They will be used during simulation and
ignored during synthesis.
Floating-point operations
Operations on signals (and variables)
of the type
real
are not synthesizable by the
current generation of synthesis tools.
Records Examples (1)

type opcodes is (add, sub, and_op, or_op);  -- and, or are reserved words and cannot be used as literals
type reg_number is range 0 to 8;
type instruction is record
opcode: opcodes;
source_reg1: reg_number;
source_reg2: reg_number;
dest_reg: reg_number;
displacement: integer;
end record instruction;
Records Examples (2)

type word is record
  instr: instruction;
  data: bit_vector(31 downto 0);
end record word;
constant add_instr_1_3: instruction:=
  (opcode => add,
   source_reg1 | dest_reg => 1,
   source_reg2 => 3,
   displacement => 0);
2-to-4 Decoder
[Figure: (a) truth table with inputs En, w1, w0 and outputs y0, y1, y2, y3; (b) graphical symbol of the 2-to-4 decoder.]
Describing combinational logic using processes

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY dec2to4 IS
  PORT ( w  : IN  STD_LOGIC_VECTOR(1 DOWNTO 0) ;
         En : IN  STD_LOGIC ;
         y  : OUT STD_LOGIC_VECTOR(0 TO 3) ) ;
END dec2to4 ;
ARCHITECTURE Behavior OF dec2to4 IS
BEGIN
  PROCESS ( w, En )
  BEGIN
    IF En = '1' THEN
      CASE w IS
        WHEN "00" => y <= "1000" ;
        WHEN "01" => y <= "0100" ;
        WHEN "10" => y <= "0010" ;
        WHEN OTHERS => y <= "0001" ;
      END CASE ;
    ELSE
      y <= "0000" ;
    END IF ;
  END PROCESS ;
END Behavior ;

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY seg7 IS
  PORT ( bcd  : IN  STD_LOGIC_VECTOR(3 DOWNTO 0) ;
         leds : OUT STD_LOGIC_VECTOR(1 TO 7) ) ;
END seg7 ;
ARCHITECTURE Behavior OF seg7 IS
BEGIN
  PROCESS ( bcd )
  BEGIN
    CASE bcd IS                     --  abcdefg
      WHEN "0000" => leds <= "1111110" ;
      WHEN "0001" => leds <= "0110000" ;
      WHEN "0010" => leds <= "1101101" ;
      WHEN "0011" => leds <= "1111001" ;
      WHEN "0100" => leds <= "0110011" ;
      WHEN "0101" => leds <= "1011011" ;
      WHEN "0110" => leds <= "1011111" ;
      WHEN "0111" => leds <= "1110000" ;
      WHEN "1000" => leds <= "1111111" ;
      WHEN "1001" => leds <= "1110011" ;
      WHEN OTHERS => leds <= "-------" ;
    END CASE ;
  END PROCESS ;
END Behavior ;

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;
ENTITY compare1 IS
  PORT ( A, B : IN  STD_LOGIC ;
         AeqB : OUT STD_LOGIC ) ;
END compare1 ;
ARCHITECTURE Behavior OF compare1 IS
BEGIN
  PROCESS ( A, B )
  BEGIN
    AeqB <= '0' ;
    IF A = B THEN
      AeqB <= '1' ;
    END IF ;
  END PROCESS ;
END Behavior ;
Incorrect code for combinational logic - Implied latch (1)

LIBRARY ieee ;
USE ieee.std_logic_1164.all ;

ENTITY implied IS
    PORT ( A, B : IN  STD_LOGIC ;
           AeqB : OUT STD_LOGIC ) ;
END implied ;

ARCHITECTURE Behavior OF implied IS
BEGIN
    PROCESS ( A, B )
    BEGIN
        -- No value is assigned when A /= B, so AeqB must hold its old value
        IF A = B THEN
            AeqB <= '1' ;
        END IF ;
    END PROCESS ;
END Behavior ;

Incorrect code for combinational logic - Implied latch (2)

[Figure: circuit with inputs A and B and output AeqB]


Rules that need to be followed:
1. All inputs to the combinational circuit should be included
in the sensitivity list
2. No other signals should be included
in the sensitivity list
3. None of the statements within the process
should be sensitive to rising or falling edges
4. All possible cases need to be covered in the internal
IF and CASE statements in order to avoid
implied latches
Covering all cases in the IF statement

Using ELSE

IF A = B THEN
    AeqB <= '1' ;
ELSE
    AeqB <= '0' ;
END IF ;

Using default values

AeqB <= '0' ;
IF A = B THEN
    AeqB <= '1' ;
END IF ;

Covering all cases in the CASE statement

Using WHEN OTHERS

CASE y IS
    WHEN S1 => Z <= "10";
    WHEN S2 => Z <= "01";
    WHEN OTHERS => Z <= "00";
END CASE;

CASE y IS
    WHEN S1 => Z <= "10";
    WHEN S2 => Z <= "01";
    WHEN S3 => Z <= "00";
    WHEN OTHERS => Z <= "--";
END CASE;

Using default values

Z <= "00";
CASE y IS
    WHEN S1 => Z <= "10";
    WHEN S2 => Z <= "10";
END CASE;

One-dimensional arrays Examples (1)

type word_asc is array(0 to 31) of std_logic;
type word_desc is array(31 downto 0) of std_logic;
..
signal buffer_register: word_desc;
..
buffer_register(6) <= '1';
..
variable tmp : word_asc;
..
tmp(5) := '0';

One-dimensional arrays Examples (2)

type controller_state is (initial, idle, active, error);
type state_counts_imp is array(idle to error) of natural;
type state_counts_exp is array(controller_state range idle
to error) of natural;
type state_counts_full is array(controller_state) of natural;
..
variable counters: state_counts_exp;
..
counters(active) := 0;
..
counters(active) := counters(active) + 1;
Predefined Unconstrained Array Types

Predefined:
bit_vector        array of bits
string            array of characters

Defined in the ieee.std_logic_1164 package:
std_logic_vector  array of std_logic

Predefined Unconstrained Array Types
subtype byte is bit_vector(7 downto 0);

.
variable channel_busy : bit_vector(1 to 4);
.
constant ready_message : string := "ready";
.
signal memory_bus: std_logic_vector (31 downto 0);

User-defined Unconstrained Array Types

type sample is array (natural range <>) of
integer;
.
variable long_sample : sample(0 to 255);
.
constant look_up_table_1: sample :=
    (127, -45, 63, 23, 76);

Array Attributes
A'left(N)           left bound of index range of dimension N of A
A'right(N)          right bound of index range of dimension N of A
A'low(N)            lower bound of index range of dimension N of A
A'high(N)           upper bound of index range of dimension N of A
A'range(N)          index range of dimension N of A
A'reverse_range(N)  reversed index range of dimension N of A
A'length(N)         length of index range of dimension N of A
A'ascending(N)      true if index range of dimension N of A is an ascending range, false otherwise

Array Attributes - Examples

type A is array (1 to 4, 31 downto 0) of boolean;

A'left(1)       = 1
A'right(2)      = 0
A'low(1)        = 1
A'high(2)       = 31
A'range(1)      = 1 to 4
A'length(2)     = 32
A'ascending(2)  = false
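
The attributes above are most useful inside subprograms that must work for any index range. The following is a minimal sketch, not part of the original slides: the entity name count_ones_demo and the function count_ones are hypothetical, and VHDL-93 syntax is assumed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity count_ones_demo is
end entity count_ones_demo;

architecture sim of count_ones_demo is
    -- Hypothetical helper: counts the '1' elements of a vector of any range.
    function count_ones (v : std_logic_vector) return natural is
        variable n : natural := 0;
    begin
        for i in v'range loop      -- follows the index range of the actual parameter
            if v(i) = '1' then
                n := n + 1;
            end if;
        end loop;
        return n;
    end function count_ones;

    constant asc  : std_logic_vector(0 to 3)     := "1011";
    constant desc : std_logic_vector(7 downto 0) := "00010001";
begin
    check : process
    begin
        -- Bounds reported by the attributes for an ascending and a descending range
        assert asc'left = 0 and asc'right = 3 and asc'ascending
            report "unexpected bounds for asc" severity failure;
        assert desc'left = 7 and desc'low = 0 and desc'length = 8
            report "unexpected bounds for desc" severity failure;
        -- The same function body handles both index directions
        assert count_ones(asc) = 3 and count_ones(desc) = 2
            report "count_ones mismatch" severity failure;
        wait;
    end process check;
end architecture sim;
```

Because the loop iterates over v'range rather than fixed bounds, the function does not depend on whether the caller declared the vector with TO or DOWNTO.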
Subprograms
Include
functions and procedures

Commonly used pieces of code
Can be placed in a library, and then reused and
shared among various projects
Abstract operations that are repeatedly
performed
Type conversions
Use only sequential statements, the same as
processes
Typical locations of subprograms

FUNCTION / PROCEDURE declared in:
PACKAGE / PACKAGE BODY (in a LIBRARY): global
ENTITY: local for all architectures of a given entity
ARCHITECTURE (declarative part): local for a given architecture

Functions basic features

Functions
always return a single value as a result
Are called using formal and actual parameters the same
way as components
never modify parameters passed to them
parameters can only be constants (including generics) and
signals (including ports);
variables are not allowed; the default is a CONSTANT
when passing parameters, no range specification should be
included (for example no RANGE for INTEGERS, or
TO/DOWNTO for STD_LOGIC_VECTOR)
are always used in some expression, and not called on their
own
Function syntax
FUNCTION function_name
(<parameter_list>)
RETURN data_type IS
[declarations]
BEGIN
(sequential statements)
END function_name;
Function parameters - example

FUNCTION f1
(a, b: INTEGER; SIGNAL c: STD_LOGIC_VECTOR)
RETURN BOOLEAN IS
BEGIN
(sequential statements)
END f1;
Function calls - examples

x <= conv_integer(a);
IF x > maximum(a, b) THEN ....
WHILE minimum(a, b) LOOP
.......
Function Example 1
LIBRARY ieee;
PACKAGE my_package IS
FUNCTION log2_ceil (CONSTANT s: INTEGER) RETURN INTEGER;
END my_package;
PACKAGE body my_package IS
FUNCTION log2_ceil (CONSTANT s: INTEGER) RETURN INTEGER IS
VARIABLE m,n : INTEGER;
BEGIN
m := 0;
n := 1;
WHILE (n < s) LOOP
m := m + 1;
n := n*2;
END LOOP;
RETURN m;
END log2_ceil;
END my_package;
Function call Example 1

LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.std_logic_arith.all;     -- provides conv_std_logic_vector
USE ieee.std_logic_unsigned.all;  -- provides "*" on STD_LOGIC_VECTOR
USE work.my_package.all;

ENTITY log2_int IS
    GENERIC (m: INTEGER := 20);
    PORT (x: IN STD_LOGIC_VECTOR(3 DOWNTO 0);
          y: OUT STD_LOGIC_VECTOR(7 DOWNTO 0)
    );
END log2_int;

ARCHITECTURE log2_int OF log2_int IS
    CONSTANT l2m : INTEGER := log2_ceil (m);
    SIGNAL r : STD_LOGIC_VECTOR(3 DOWNTO 0);
BEGIN
    r <= conv_std_logic_vector(l2m, 4);
    y <= x*r;
END log2_int;

Function Example 2
library IEEE;
use IEEE.std_logic_1164.all;

ENTITY powerOfFour IS
    PORT(
        X : IN INTEGER;
        Y : OUT INTEGER
    );
END powerOfFour;

Function Example 2
ARCHITECTURE behavioral OF powerOfFour IS
FUNCTION Pow ( SIGNAL N:INTEGER; Exp : INTEGER)
RETURN INTEGER IS
VARIABLE Result : INTEGER := 1;
BEGIN
FOR i IN 1 TO Exp LOOP
Result := Result * N;
END LOOP;
RETURN( Result );
END Pow;
BEGIN
Y <= Pow(X, 4);
END behavioral;
Package containing a function (1)

LIBRARY IEEE;
USE IEEE.std_logic_1164.all;
PACKAGE specialFunctions IS
FUNCTION Pow( SIGNAL N: INTEGER; Exp : INTEGER)
RETURN INTEGER;
END specialFunctions;

Package containing a function (2)

PACKAGE BODY specialFunctions IS
FUNCTION Pow(SIGNAL N: INTEGER; Exp : INTEGER)
RETURN INTEGER IS
VARIABLE Result : INTEGER := 1;
BEGIN
FOR i IN 1 TO Exp LOOP
Result := Result * N;
END LOOP;
RETURN( Result );
END Pow;
END specialFunctions;

Type conversion function (1)
LIBRARY ieee;
USE ieee.std_logic_1164.all;
--------------------------------------------------
PACKAGE my_package IS
    FUNCTION conv_integer (SIGNAL vector: STD_LOGIC_VECTOR)
        RETURN INTEGER;
END my_package;
--------------------------------------------------


PACKAGE BODY my_package IS
    FUNCTION conv_integer (SIGNAL vector: STD_LOGIC_VECTOR)
        RETURN INTEGER IS
        VARIABLE result: INTEGER RANGE 0 TO 2**vector'LENGTH - 1;
    BEGIN
        -- Start from the most significant bit
        IF (vector(vector'HIGH) = '1') THEN result := 1;
        ELSE result := 0;
        END IF;
        -- Shift in the remaining bits, from high to low
        FOR i IN (vector'HIGH-1) DOWNTO (vector'LOW) LOOP
            result := result*2;
            IF (vector(i) = '1') THEN result := result+1;
            END IF;
        END LOOP;
        RETURN result;
    END conv_integer;
END my_package;


LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE work.my_package.all;
--------------------------------------------------
ENTITY conv_int2 IS
    PORT ( a: IN STD_LOGIC_VECTOR (3 DOWNTO 0);
           y: OUT INTEGER RANGE 0 TO 15);
END conv_int2;
--------------------------------------------------
ARCHITECTURE my_arch OF conv_int2 IS
BEGIN
    y <= conv_integer(a);
END my_arch;

Procedures basic features

Procedures
do not return a value
are called using formal and actual parameters the same way
as components
may modify parameters passed to them
each parameter must have a mode: IN, OUT, INOUT
parameters can be constants (including generics), signals
(including ports), and variables;
the default for inputs (mode in) is a constant, the default for
outputs (modes out and inout) is a variable
when passing parameters, range specification should be
included (for example RANGE for INTEGERS, and
TO/DOWNTO for STD_LOGIC_VECTOR)
Procedure calls are statements on their own
Procedure syntax
PROCEDURE procedure_name
(<parameter_list>) IS
[declarations]
BEGIN
(sequential statements)
END procedure_name;

Procedure parameters - example

PROCEDURE p1
    (a, b: IN INTEGER; SIGNAL c: OUT STD_LOGIC_VECTOR(7 DOWNTO 0)) IS
BEGIN
    (sequential statements)
END p1;

Procedure calls - examples

compute_min_max(in1, in2, in3, out1, out2);
divide(dividend, divisor, quotient, remainder);
IF (a > b) THEN
compute_min_max(in1, in2, in3, out1, out2);
.......
Procedure example (1)

LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE work.decProcs.all;

ENTITY decoder IS port (
    decIn: IN STD_LOGIC_VECTOR(1 DOWNTO 0);
    decOut: OUT STD_LOGIC_VECTOR(3 DOWNTO 0)
    );
END decoder;

Procedure example (2)

ARCHITECTURE simple OF decoder IS
    -- decode is declared as a SIGNAL so that the port decOut can be passed to it
    PROCEDURE DEC2x4 (inputs : in STD_LOGIC_VECTOR(1 downto 0);
                      SIGNAL decode: out STD_LOGIC_VECTOR(3 downto 0)
                     ) IS
    BEGIN
        CASE inputs IS
            WHEN "11" =>
                decode <= "1000";
            WHEN "10" =>
                decode <= "0100";
            WHEN "01" =>
                decode <= "0010";
            WHEN "00" =>
                decode <= "0001";
            WHEN others =>
                decode <= "0001";
        END case;
    END DEC2x4;
BEGIN
    DEC2x4(decIn, decOut);
END simple;

Operator as a function (1)

LIBRARY ieee;
USE ieee.std_logic_1164.all;
--------------------------------------------------
PACKAGE my_package IS
    FUNCTION "+" (a, b: STD_LOGIC_VECTOR)
        RETURN STD_LOGIC_VECTOR;
END my_package;
--------------------------------------------------

Operator as a function (2)

PACKAGE BODY my_package IS
    FUNCTION "+" (a, b: STD_LOGIC_VECTOR)
        RETURN STD_LOGIC_VECTOR IS
        VARIABLE result: STD_LOGIC_VECTOR(a'RANGE);  -- a and b are assumed to share the same range
        VARIABLE carry: STD_LOGIC;
    BEGIN
        carry := '0';
        -- Ripple-carry addition, starting from the rightmost bit
        FOR i IN a'REVERSE_RANGE LOOP
            result(i) := a(i) XOR b(i) XOR carry;
            carry := (a(i) AND b(i)) OR (a(i) AND carry) OR (b(i) AND carry);
        END LOOP;
        RETURN result;
    END "+" ;
END my_package;

Operator overloading
Operator overloading allows different
argument types for a given operation
(function)
The VHDL tools resolve which of these
functions to select based on the types of
the inputs
This selection is transparent to the user as
long as the function has been defined for
the given argument types.
Different declarations for the same operator Example

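Declarations in the package ieee.std_logic_unsigned:

function "+" (L: std_logic_vector; R: std_logic_vector) return std_logic_vector;
function "+" (L: std_logic_vector; R: integer)          return std_logic_vector;
function "+" (L: std_logic_vector; R: std_logic)        return std_logic_vector;
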
Different declarations for the same operator Example

signal count: std_logic_vector(7 downto 0);

You can use:
count <= count + "00000001";
or
count <= count + 1;
or
count <= count + '1';

Notion of type
Type defines a set of values and a set of
applicable operations
Declaration of a type determines which values
can be stored in an object (signal, variable,
constant) of a given type
Every object can only assume values of its
nominated type
Each operation (e.g., and, +, *) includes the types
of values to which the operation may be applied,
and the type of the result
The goal of strong typing is a detection of errors
at an early stage of the design process
Example of strong typing

architecture incorrect of example1 is
type apples is range 0 to 100;
type oranges is range 0 to 100;
signal apple1: apples;
signal orange1: oranges;
begin
apple1 <= orange1;
end incorrect;
Integer type
Name:      integer
Status:    predefined
Contents:  all integer numbers representable on a particular host computer, but at least numbers in the range $-(2^{31}-1)$ .. $2^{31}-1$

User defined integer types - Examples

type day_of_month is range 0 to 31;
type year is range 0 to 2100;
type set_index_range is range 999 downto 100;
constant number_of_bits: integer :=32;
type bit_index is range 0 to number_of_bits-1;
Values of bounds can be expressions, but
need to be known when the model is analyzed.
Predefined enumeration types (1)

boolean    (true, false)
bit        ('0', '1')
character  VHDL-87: 128 7-bit ASCII characters
           VHDL-93: 256 ISO 8859 Latin-1 8-bit characters

Predefined enumeration types (2)

severity_level
(note, warning, error, failure)
Predefined in VHDL-93 only:

file_open_kind
(read_mode, write_mode, append_mode)
file_open_status
(open_ok, status_error,
name_error, mode_error)
User-defined enumeration types Examples

type state is (S0, S1);
type alu_function is (disable, pass, add, subtract,
multiply, divide);
type octal_digit is ('0', '1', '2', '3', '4', '5', '6', '7');
type mixed is (lf, cr, ht, '-', '/', '\');
Each value in an enumeration type must be either an identifier or a character literal

Floating point types

Used to represent real numbers
Numbers are represented using a significand
(mantissa) part and an exponent part
Conform to the IEEE standard 754 or 854
Minimum size of representation that must be
supported by the implementation of the VHDL
standard:
VHDL-2001:
64-bit representation
VHDL-87, VHDL-93: 32-bit representation
Real literals - examples

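Decimal literals:
23.1         = 23.1
46E5         = $46 \times 10^{5}$
1E+12        = $1 \times 10^{12}$
1.234E09     = $1.234 \times 10^{9}$
34.0e-08     = $34.0 \times 10^{-8}$

Based literals:
2#0.101#E5   = $0.101_{2} \times 2^{5} = (2^{-1} + 2^{-3}) \times 2^{5}$
8#0.4#E-6    = $0.4_{8} \times 8^{-6} = (4 \times 8^{-1}) \times 8^{-6}$
16#0.a5#E-8  = $0.\mathrm{a5}_{16} \times 16^{-8} = (10 \times 16^{-1} + 5 \times 16^{-2}) \times 16^{-8}$
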
The ANSI/IEEE standard floating-point number representation formats

[Figure: floating-point number representation formats]

User-defined floating-point types Examples

type input_level is range -10.0 to +10.0;
type probability is range 0.0 to 1.0;
constant max_output: real := 1.0E6;
constant min_output: real := 1.0E-6;
type output_range is range max_output downto min_output;

Attributes of all scalar types

T'left       first (leftmost) value in T
T'right      last (rightmost) value in T
T'low        least value in T
T'high       greatest value in T

Not available in VHDL-87:
T'ascending  true if T is an ascending range, false otherwise
T'image(x)   a string representing the value of x
T'value(s)   the value in T that is represented by s

Attributes of all scalar types - examples

type index_range is range 21 downto 11;

index_range'left         = 21
index_range'right        = 11
index_range'low          = 11
index_range'high         = 21
index_range'ascending    = false
index_range'image(14)    = "14"
index_range'value("20")  = 20

Attributes of discrete types

T'pos(x)      position number of x in T
T'val(n)      value in T at position n
T'succ(x)     value in T at position one greater than position of x
T'pred(x)     value in T at position one less than position of x
T'leftof(x)   value in T at position one to the left of x
T'rightof(x)  value in T at position one to the right of x

Attributes of discrete types - examples

type logic_level is (unknown, low, undriven, high);

logic_level'pos(unknown)       = 0
logic_level'val(3)             = high
logic_level'succ(unknown)      = low
logic_level'pred(undriven)     = low
logic_level'leftof(unknown)    error
logic_level'rightof(undriven)  = high
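
Since 'succ raises an error at the last value of a type, code that steps through an enumeration usually needs a wrap-around check. The following is a minimal sketch under that assumption; the entity logic_level_demo and the function next_level are hypothetical names, not part of the original slides.

```vhdl
entity logic_level_demo is
end entity logic_level_demo;

architecture sim of logic_level_demo is
    type logic_level is (unknown, low, undriven, high);

    -- Hypothetical helper: step to the next value, wrapping after the last one.
    function next_level (x : logic_level) return logic_level is
    begin
        if x = logic_level'high then
            return logic_level'low;       -- 'succ would be an error here
        else
            return logic_level'succ(x);
        end if;
    end function next_level;
begin
    check : process
    begin
        assert logic_level'pos(undriven) = 2
            report "unexpected position number" severity failure;
        assert logic_level'val(1) = low
            report "unexpected value at position 1" severity failure;
        assert next_level(low) = undriven
            report "next_level(low) mismatch" severity failure;
        assert next_level(high) = unknown
            report "next_level did not wrap around" severity failure;
        wait;
    end process check;
end architecture sim;
```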
Subtype
Defines a subset of a base type values
A condition that is used to determine which
values are included in the subtype is called
a constraint
All operations that are applicable to the
base type also apply to any of its subtypes
Base type and subtype can be mixed in the
operations, but the result must belong to
the subtype, otherwise an error is
generated.
Predefined subtypes
natural       integers ≥ 0
positive      integers > 0

Not predefined in VHDL-87:
delay_length  time ≥ 0

User-defined subtypes - Examples
subtype bit_index is integer range 31 downto 0;

subtype input_range is real range 1.0E-9 to
1.0E+12;
Operators (1)
Operators (2)
Operators (3)
Propagation delay in VHDL - Example

entity MAJORITY is
port (A_IN, B_IN, C_IN : in  STD_LOGIC;
      Z_OUT            : out STD_LOGIC);
end MAJORITY;
architecture DATA_FLOW of MAJORITY is
begin
Z_OUT <= (not A_IN and B_IN and C_IN) or
(A_IN and not B_IN and C_IN) or
(A_IN and B_IN and not C_IN) or
(A_IN and B_IN and C_IN) after 20 ns;
end DATA_FLOW;
Propagation delay - Example
Inertial delay model

Short pulses (spikes) are not passed to the
outputs of logic gates due to the inertia of
physical systems.
Logic gates behave like low pass filters and
effectively filter out high frequency input
changes as if they never occurred.
Inertial delay model - Example

SIG_OUT <= not SIG_IN after 7 ns;

VHDL-87 Inertial delay model

Any input signal change that does not persist
for at least a propagation delay of the device
is not reflected at the output.
inertial delay (pulse rejection limit) =
propagation delay
VHDL-93 Enhanced inertial delay model

VHDL-93 allows the inertial delay model to be declared
explicitly as well as implicitly.
Explicitly:
Z_OUT <= inertial (not A_IN and B_IN and C_IN) or
Implicitly:
Z_OUT <= (not A_IN and B_IN and C_IN) or
VHDL-93 Enhanced inertial delay model
VHDL-93 allows the inertial delay, also called a pulse rejection limit, to be different from the propagation delay.
SIG_OUT <= reject 5 ns inertial not SIG_IN after 7 ns;

Transport delay model

With a transport delay model, all input signal
changes are reflected at the output, regardless of
how long the signal changes persist.
Transport delay model must be declared explicitly using the
keyword transport.
Inertial delay model is a default delay model because it
reflects better the actual behavior of logic components.
Transport delay model is used for high-level modeling.
Transport delay model - Example

SIG_OUT <= transport not SIG_IN after 7 ns;
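
The three delay models can be compared side by side in simulation. The following is a minimal self-checking sketch, assuming VHDL-93; the entity name delay_models_tb, the signal names, the 3 ns rejection limit and the pulse widths are choices made for illustration and do not come from the slides.

```vhdl
entity delay_models_tb is
end entity delay_models_tb;

architecture sim of delay_models_tb is
    signal sig_in        : bit := '0';
    -- Outputs start at '1' (= not '0') so that no start-up event occurs.
    signal out_inertial  : bit := '1';
    signal out_reject    : bit := '1';
    signal out_transport : bit := '1';
begin
    -- Default inertial model: pulse rejection limit = propagation delay (7 ns)
    out_inertial  <= not sig_in after 7 ns;
    -- Enhanced inertial model: pulses shorter than 3 ns are rejected
    out_reject    <= reject 3 ns inertial not sig_in after 7 ns;
    -- Transport model: every input change is passed on
    out_transport <= transport not sig_in after 7 ns;

    stimulus : process
    begin
        -- 4 ns input pulse, from 10 ns to 14 ns
        wait for 10 ns; sig_in <= '1';
        wait for 4 ns;  sig_in <= '0';
        wait for 5 ns;  -- now 19 ns, inside the delayed pulse window (17 ns to 21 ns)
        assert out_inertial = '1'
            report "inertial model should reject a 4 ns pulse" severity failure;
        assert out_reject = '0' and out_transport = '0'
            report "4 ns pulse should pass reject-3ns and transport" severity failure;

        -- 2 ns input pulse, from 30 ns to 32 ns
        wait for 11 ns; sig_in <= '1';
        wait for 2 ns;  sig_in <= '0';
        wait for 6 ns;  -- now 38 ns, inside the delayed pulse window (37 ns to 39 ns)
        assert out_inertial = '1' and out_reject = '1'
            report "2 ns pulse should be rejected by both inertial models" severity failure;
        assert out_transport = '0'
            report "transport model should pass a 2 ns pulse" severity failure;
        wait;
    end process stimulus;
end architecture sim;
```

Only the transport output reproduces both pulses; the 4 ns pulse survives the 3 ns rejection limit but not the default 7 ns limit.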
Event-driven simulation
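[Figure: event list ordered by time; each entry holds the list of events (signal, new value) scheduled to occur at time tq]

Event list as an array: Timing wheel

[Figure: array indexed by time, with empty slots marked "no events"; each occupied slot holds the list of events (signal, new value) scheduled to occur at time tc]
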
Delta delay
A propagation delay of 0 time units is
equivalent to omitting the after clause and is
called a delta delay.
Used for functional simulation.
Two-dimensional aspect of time
Simulation engine algorithm

while (event list not empty)
begin
t = next time in list
process entries for time t
end
If the next time in the list equals the previous time, then the previous iteration of the loop has advanced time by one delta delay.

Signals vs Variables
architecture DUMMY_1 of JUNK is
    signal Y : bit := '0';
begin
    process
        variable X : bit := '0';
    begin
        wait for 10 ns;
        X := '1';
        Y <= X;
        wait for 10 ns;
        -- What is Y at this point ? '1'
        ...
    end process;
end DUMMY_1;

architecture DUMMY_2 of JUNK is
    signal X, Y : bit := '0';
begin
    process
    begin
        wait for 10 ns;
        X <= '1';
        Y <= X;
        wait for 10 ns;
        -- What is Y at this point ? '0'
        ...
    end process;
end DUMMY_2;

Variable assignment is immediate; signal assignment with 0 delay takes effect only after a delta delay, i.e., in the next simulation cycle.

Properties of signals
Signals represent a time-ordered list of values
denoting past, present and future values.
This time history of a signal is called a waveform.
A value/time pair (v, t) is called a transaction.
If a transaction changes value of a signal, it is
called an event.
Signal attributes (1)

S'transaction - A signal of type bit that changes value from '0' to '1' or vice versa each time there is a transaction on S.
S'event - True if there is an event on S in the current simulation cycle, false otherwise.
S'active - True if there is a transaction on S in a given simulation cycle, false otherwise.
S'last_event - The time interval since the last event on S.
S'last_active - The time interval since the last transaction on S.
S'last_value - The value of S just before the last event on S.
S'delayed(T) - A signal that takes on the same value as S, but is delayed by time T.
S'stable(T) - A Boolean signal that is true if there has been no event on S in the time interval T up to the current time, otherwise false.
S'quiet(T) - A Boolean signal that is true if there has been no transaction on S in the time interval T up to the current time, otherwise false.
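
A common use of these attributes is detecting a clock edge and checking timing around it. The following is a minimal sketch, not part of the original slides: the entity attr_demo, the signal names, the 2 ns setup time and all timing values are assumptions chosen for illustration.

```vhdl
entity attr_demo is
end entity attr_demo;

architecture sim of attr_demo is
    signal clk : bit := '0';
    signal d   : bit := '0';
    signal q   : bit := '0';
begin
    -- Clock with rising edges at 5, 15 and 25 ns
    clock : process
    begin
        for i in 1 to 6 loop
            wait for 5 ns;
            clk <= not clk;
        end loop;
        wait;
    end process clock;

    -- Data input changes once, at 12 ns
    d <= '1' after 12 ns;

    -- Rising-edge register with a setup check (2 ns is an assumed setup time)
    reg : process (clk)
    begin
        if clk'event and clk = '1' then
            assert d'stable(2 ns)
                report "setup violation on d" severity error;
            q <= d;
        end if;
    end process reg;

    check : process
    begin
        wait for 16 ns;
        -- d changed at 12 ns and was captured by the rising edge at 15 ns
        assert q = '1'
            report "q did not capture d" severity failure;
        assert d'last_event = 4 ns
            report "unexpected d'last_event" severity failure;
        -- The last event on clk (at 15 ns) was a change from '0' to '1'
        assert clk'last_value = '0'
            report "unexpected clk'last_value" severity failure;
        wait;
    end process check;
end architecture sim;
```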