
Advanced Computer Architecture

Part II: Embedded Computing


Embedded Memory Systems

Paolo.Ienne@epfl.ch
EPFL I&C LAP
(Largely based on slides by P. R. Panda, IITD
and P. Marwedel, University of Dortmund)

Motivation
Memories are the limiting performance factor
System-on-Chip memories and SRAMs embedded in
FPGAs are fast (1-2 cycles access) but:
On-chip memory might not be enough
eDRAM or eFLASH may be coming into the picture

Memories are a key energy consumer on SoCs

Memory systems in embedded systems can be
customised
Large design space to exploit for optimisation
SoC and FPGA technologies support irregular
memory systems to a good extent

AdvCompArch Embedded Memory Systems

Ienne 2003-08

Importance of Memory in SoCs


Some rough rule-of-thumb figures:
Area
50-70% of chip area may be memory

Performance
10-90% of system performance may be memory related

Power
25-40% of system power may be memory related


Sub-banking

[Figure: energy and access times vs. growing memory size]

Applications are getting larger and larger
The energy cost of keeping access times low is very high


Source: Marwedel, 2007

Things Are Only Getting Worse

Some More Recent Estimations

[Pie charts:
Cacheless monoprocessor: Main Mem. Energy 71%, Processor Energy 29%
Multiprocessor with I and D caches: Proc. Energy 51.9%,
I-Cache Energy 28.1%, D-Cache Energy 14.8%, Main Mem. Energy 5.2%]

Average of over 200 benchmarks

Source: Verma and Marwedel, Springer 2007

Outline
Memory data layout
Scratchpad memory
Custom memory architectures


Memory Data Layout

Problem statement
Optimise the placement of data in memory to
maximise cache effectiveness with minimal resources

Normally compilers place data following
language conventions and program order
Can they do any better if they see the complete
embedded application?

Similarity
This reminds of the study of placement for DSP
variables (but problem and strategy were different)


Array Layout and Data Cache

int a[1024];
int b[1024];
int c[1024];
...
for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];

[Figure: arrays a, b, c laid out contiguously in memory; a[i], b[i],
and c[i] all map to the same line of the direct-mapped, 512-word
data cache]

Problem: Every access leads to a cache miss!


Aliasing Example
Cache size C, line size M, array size N
Addresses and cache positions:
a[i]: i      → (i mod C) / M
b[i]: i + N  → ((i + N) mod C) / M
c[i]: i + 2N → ((i + 2N) mod C) / M

If N = kC, all cache positions are identical!

C is normally a power of 2
N is often a power of 2, too

Solutions?
Set-associative cache: Costly!
Make C larger than N: ?!


Source: Banakar et al., IEEE 2002

Energy Cost of Associativity


A Solution: Array Padding

int a[1024];
int b[1024];
int c[1024];
...
for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];

[Figure: arrays a, b, c in memory, each followed by an M-word DUMMY
pad; a[i], b[i], and c[i] now map to different lines of the
direct-mapped, 512-word data cache]

Data alignment avoids cache conflicts

Classic Transformation
Loop Blocking
Modify loop exploration space in blocks (or tiles) so
that all elements accessed at once fit the cache

Original Code
for i = 1 to N
  for k = 1 to N
    r = X[i,k]
    for j = 1 to N
      Z[i,j] += r * Y[k,j]

Blocked Code
for kk = 1 to N step B
  for jj = 1 to N step B
    for i = 1 to N
      for k = kk to min(kk+B-1, N)
        r = X[i,k]
        for j = jj to min(jj+B-1, N)
          Z[i,j] += r * Y[k,j]

[Figure: N x N array traversed in B x B tiles]


Idea:
Split the array in blocks or tiles and group tiles of
each array which are accessed at once
If the tiles are small enough, the set of tiles
accessed at once will fit into the cache
Since they are adjacent in data memory, they will not
conflict in the cache


Source: Panda et al., IEEE 2001

Array Tiling Reduces Aliasing Too

Real-World Example: FFT

double sigreal[2048]

le = le / 2
for (i = j; i < 2048; i += 2*le)
{
    ... = sigreal[i]
    ... = sigreal[i + le]
    sigreal[i] = ...
    sigreal[i + le] = ...
}

[Figure: 1st outer loop iteration: sigreal[i] (offset 0) and
sigreal[i + le] (offset 1024) map to the same line of the
512-word cache]

Padded FFT

double sigreal[2048 + 16]

le = le / 2; le = le + le / 128
for (i = j; i < 2048; i += 2*le) {
    i = i + i / 128
    ... = sigreal[i]
    ... = sigreal[i + le]
    sigreal[i] = ...
    sigreal[i + le] = ...
}

Pads (~1 cache line, every cache size)

[Figure: 1st outer loop iteration: with the pads, sigreal[i]
(offset 0) and sigreal[i + le] (offset 1032) map to different
lines of the 512-word cache]

Padding: 15% speed-up on a SPARC 5



Algorithms to Decide Data Layout

Need to make the following decisions:
Tile Size Computation
Largest possible tiles such that the working set fits the
cache

Pad Size Computation
Minimum pad size which eliminates aliasing

Interleaving of Tiled Arrays
Arrangement of multiple arrays so that there is no
aliasing among arrays and all working sets fit the
cache


Algorithms to Decide Data Layout

Matrix Multiplication (Array Sizes 35-350)

[Figure: tile sizes chosen by TSS, ESS, LRW, and DAT]

DAT uses fixed tile dimensions
Others use widely varying sizes


Outline
Memory data layout
Scratchpad memory
Custom memory architectures


Scratchpad Memory Idea

[Figure: addresses 0 to P-1 of the addressable memory map to an
on-chip scratchpad (1-cycle access); addresses P to N-1 map to
off-chip memory (10-20 cycles), accessed by the CPU through an
on-chip data cache (1 cycle)]

Scratchpad Memory Advantages

Architecturally visible static software-managed cache
Avoid aliasing problems
Decide explicitly which data will be reused
Avoid evicting useful data from the cache

Increase determinism
Data are always in the "cache" when needed

Save power
Avoid energy cost of high associativity
Avoid caching irrelevant data
Avoid using 2 levels of the memory system for temporaries


Energy Cost of Hardware Caches

[Figure: energy per access [nJ] vs. memory size (256-16384) for a
scratchpad and for 2-way caches with 1 MB, 16 MB, and 4 GB address
spaces; the scratchpad is the cheapest at every size]

Energy consumption in tags, comparators, and muxes is significant



Source: Banakar et al., IEEE 2002

Timing Predictability

Most memory hierarchies (i.e., caches) for PC-like
systems are designed for good average-case, not for
good worst-case behaviour

Worst-case execution time (WCET) larger than
without cache

[Figure: G.721 using a unified cache on an ARM7TDMI]

Source: Marwedel, 2007

Many embedded systems are real-time systems: computations must be
finished in a given amount of time

Scratchpad Memory
Embedded processor-based system
Processor core
Embedded memory
Instruction and Data Cache
Embedded SRAM
Embedded DRAM
Scratch Pad Memory

Design problems
1. How much on-chip memory?
2. Partitioning of on-chip memory in cache and scratchpad?
3. Which variables/arrays in the scratchpad?

Goals
Improve performance
Save power


Architecture Exploration
Explore exhaustively the design space
Requires an algorithm to perform partitioning between
on- and off-chip memory

Algorithm Memory Explore

for On-chip Memory Size T (in powers of 2)
  for Cache Size C (in powers of 2, < T)
    SRAM Size S = T - C
    Data Partition (S)
    for Line Size L (in powers of 2, < C, < MaxLine)
      Estimate Memory Performance (T, C, S)
Select (T, C, S, L) which maximises optimisation goals

[Example: Histogram]

Variation of On-chip Memory Allocation

[Figure: effect of different ratios of scratchpad/cache sizes;
total on-chip memory size = 2 KB]

[Example: Histogram]

Variation of Total On-chip Memory

[Figure: effect of total on-chip memory size]

Data Partitioning (I)

procedure Histogram_Evaluation
char BrightnessLevel[512][512];
int  Hist[256];

for (i = 0; i < 512; i++)
    for (j = 0; j < 512; j++) {
        /* for each pixel (i,j) in image */
        level = BrightnessLevel[i][j];
        Hist[level] += 1;
    }

BrightnessLevel: Regular Access → Off-chip + Cache
Hist: Irregular Access → Scratchpad

Data Partitioning (II)

procedure Convolution
int source[128][128], dest[128][128];
int mask[4][4];
for (all points x,y of source)
    new = 0;
    for (i scanning the mask horizontally)
        for (j scanning the mask vertically)
            new += source[x+i][y+j] * mask[i,j];
    dest[x][y] = new / norm;

mask: Small → Scratchpad
source + dest: Large and Regular → Off-chip + Cache

[Figure: the mask window sliding over source at iterations
(0,0) and (0,1)]

Data Partitioning
Pre-Partitioning Scratchpad/Off-chip
Scalar variables and constants to scratchpad
Large arrays to off-chip memory

Detailed Partitioning
Identify critical data for scratchpad
Criteria:
Life-times of arrays
Access frequency of arrays
Loop conflicts

Similar reasoning can be applied to code



Global Placement Optimization

[Figure: program objects (for-loops, while-loops, functions, arrays,
scalars, main) placed either in main memory or in a scratchpad
memory of capacity S_SP next to the processor]

Which object (array, loop, etc.) to be
stored in a scratchpad?

Non-overlaying allocation
Gain g_k and size s_k for each object k
Maximise gain G = Σ g_k, respecting the
scratchpad size S_SP ≥ Σ s_k
Solution: knapsack algorithm

Overlaying allocation
Moving objects back and forth
between hierarchy levels
Solution: more complex...


Source: Steinke et al., IEEE 2002


Integer Linear Programming

Symbols:
S(var_k) = size of variable k
n(var_k) = number of accesses to variable k
e(var_k) = energy saved per variable access, if var_k is migrated
E(var_k) = energy saved if variable var_k is migrated
           (= e(var_k) · n(var_k))
x(var_k) = decision variable:
           = 1 if variable k is migrated to scratchpad, = 0 otherwise
K = set of variables
Similar for functions F_i, i ∈ I

Integer programming formulation:
Maximise Σ_{k∈K} x(var_k) E(var_k) + Σ_{i∈I} x(F_i) E(F_i)
Subject to the constraint
Σ_{i∈I} S(F_i) x(F_i) + Σ_{k∈K} S(var_k) x(var_k) ≤ S_SP

Source: Steinke et al., IEEE 2002

Reduction in Energy and Runtime

[Figure: cycles [x100] and energy [J] with and without scratchpad]

multi_sort benchmark (mix of sorting algorithms)

Numbers will change with technology, but the algorithms will
remain unchanged

Outline
Memory data layout
Scratchpad memory
Custom memory architectures


Array-to-Memory Assignment and Memory Banking

Exploit the possibility of designing a memory system
not for general use (e.g., completely uniform) and not
of standard components (e.g., off-the-shelf DRAMs)

Clustering of arrays in memories
Exploit features of eDRAMs
Trade-offs power/energy/area

Ad-hoc bit widths
E.g., 32-bit word-addressed architecture with
specific 6-bit arrays

Multiple accesses per cycle
E.g., allow concurrent accesses by coprocessors
Multiple CPU buses (e.g., DSPs)

[Figure: address space split across Bank #1, Bank #2 (small
bitwidth), and Bank #3 (accessible at once)]


Memory Banking Motivation

(e)DRAMs

for (i = 0; i < 1000; i++)
    { A[i] += B[i] * C[2*i]; }

[Figure: a single DRAM with row address Addr[15:8], column address
Addr[7:0], and one page buffer; arrays A, B, and C share pages]


Memory Banking Motivation

(e)DRAMs

for (i = 0; i < 1000; i++)
    { A[i] += B[i] * C[2*i]; }

[Figure: three banks, one per array; each bank has its own row and
column address and its own page buffer feeding the datapath, so the
rows of A[i], B[i], and C[2i] stay open simultaneously]

Typical DRAM Tradeoffs Also in High-End Servers

DRAMs are complex objects:
Multiple interleaved DRAM banks in a system
Large premium for burst accesses
Tradeoff between leaving a page open (making
neighbouring accesses faster) and closing it (avoiding
the precharge time for far accesses)
In servers, optimisations are usually of a dynamic nature,
performed by the memory controller subsystem and
controlled by the BIOS/OS
In embedded computers, similar optimisations can be
done statically and application-specific

Minimal Number of Banks to Remove Access Conflicts

Can be extended to model multiple simultaneous
accesses to the same array (→ multi-ported memories)

[Figure: DFG → schedule → conflict graph → minimal bank allocation]

Modified from Panda et al., ACM 2001

Memory Allocation Exploration

[Figure: useful exploration space]

Source: Panda et al., ACM 2001


Summary
In SoCs and FPGAs the situation is different from
general-purpose computers
Different design space (fast memories almost as fast as logic)
Fewer constraints to use standard components: any size possible,
more types of memory available (e.g., dual port), etc.
More bandwidth exploitable (no pins)

Hence a different world where many more things are
possible (whereas in classic computing or normal
chip-based embedded systems there is not much freedom)
Companion situation to the customisation of processors:
Optimisations tailored to the data cache
Memory Data Layout

Memory architecture customised to a given application
Scratchpad Memory
Memory Banking

References
P. R. Panda et al., Data and Memory Optimization
Techniques for Embedded Systems, ACM Transactions
on Design Automation of Electronic Systems,
6(2):149-206, April 2001
M. Verma and P. Marwedel, Advanced Memory
Optimization Techniques for Low Power Embedded
Processors, Springer, 2007
P. R. Panda (ed.), Memory Issues in Embedded
Systems-on-Chip, Kluwer Academic, 1999
IEEE Design & Test of Computers, Special Issue on Large
Embedded Memories, May-June 2001
