
University of Dortmund

Compiler based Optimization Techniques for Scratchpad Memory


Manish Verma, Peter Marwedel
Department of Computer Science XII, University of Dortmund, Germany

Outline
- Introduction
- Motivation
- Static Allocation Approach
  - Scratchpad only architecture [1]
  - Cache + Scratchpad architecture
- Dynamic Allocation Approach
  - Scratchpad only architecture
- Conclusion & Future Work

1 : [S. Steinke DATE, 2002]


Manish Verma, Computer Science XII, Univ. Dortmund, 2004

Embedded Systems
Embedded systems (ES) = information processing systems embedded into a larger product
- Main reason for buying is not information processing
- Transportation (e.g. ABS)
- Telecommunication (e.g. mobile phone)
- Manufacturing (incl. robotics)
- Medical instruments (e.g. artificial eye)

www.dobelle.com

Power Issues

Power is considered the most important constraint in embedded systems [in: Eggermont (ed): Embedded Systems Roadmap 2002, STW]

Skadron et al., 30th ISCA


Power Distribution
- Memory subsystem consumes > 50% of total energy budget [1]
- Memory hierarchy: Cache vs. Scratchpad
  - Power [2]
  - Performance [2]
  - Predictability [3]
  - Software Support

1 : [S. Segars ISSCC, 2001]
2 : [S. Steinke DATE, 2002]
3 : [P. Marwedel ASPDAC, 2004]


Focus on memory- & energy-aware compilation: Scratchpad memories (SPM)
- Small; no tag memory
- Fast, energy-efficient, timing-predictable
- Example: ARM7TDMI cores, well known for low power consumption

[Figure: processor connected to scratchpad and main memory]

Scratchpad vs. main memory energy


Example: Atmel ARM evaluation board
Energy of a 32-bit load instruction (Thumb), in nJ:

  Program     Data        Energy (nJ)
  off-chip    off-chip    115.8
  off-chip    on-chip      76.5
  on-chip     off-chip     51.6
  on-chip     on-chip      16.4

(on-chip = SPM)
Energy reduction: / 7.06 (115.8 nJ vs. 16.4 nJ)
100% predictable

Static Allocation (Scratchpad only)


Which objects (functions, variables) should be stored in the SPM?
- Gain $g_m$ and size $s_m$ for each object m.
- Maximise gain $G = \sum_m g_m$, respecting the constraint $K \ge \sum_m s_m$ (sums over the objects placed in the SPM).

Static memory allocation: knapsack problem

Example objects: int nat(), real sin(), char ch(), int wh(), int p[], real a[], int c[]

[Figure: each object is placed either in "main" memory or in the SPM of capacity K]

Static Allocation (Scratchpad only)


Symbols:
- $s(var_m)$ = size of variable m
- $n(var_m)$ = number of accesses to variable m
- $e(var_m)$ = energy saved per variable access, if $var_m$ is migrated
- $E(var_m)$ = energy saved if $var_m$ is migrated ($= e(var_m) \cdot n(var_m)$)
- $x(var_m)$ = 1 if variable m is migrated to SPM, else 0
- $M$ = set of variables; similar for functions $F_i$, $i \in I$.

Integer programming formulation:

Maximize $\sum_{i \in I} x(F_i)\, E(F_i) + \sum_{m \in M} x(var_m)\, E(var_m)$

subject to the constraint $\sum_{i \in I} s(F_i)\, x(F_i) + \sum_{m \in M} s(var_m)\, x(var_m) \le K$
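The formulation above is a 0/1 knapsack problem. The following is a minimal Python sketch of it, assuming integer object sizes and using dynamic programming in place of an ILP solver. The function name `allocate_spm` and all sizes and energy gains in the example are hypothetical and are not taken from the slides; only the object names echo the earlier example.

```python
def allocate_spm(objects, capacity):
    """Select memory objects for the scratchpad (0/1 knapsack).

    objects: list of (name, size, gain) with integer size s and energy
             gain E = e * n (energy saved per access times access count).
    capacity: scratchpad capacity K, in the same unit as the sizes.
    Returns (total_gain, names of the objects with x = 1).
    """
    # best[k] = (gain, chosen names) achievable with capacity k
    best = [(0, [])] * (capacity + 1)
    for name, size, gain in objects:
        # iterate capacities downwards so each object is used at most once
        for k in range(capacity, size - 1, -1):
            cand_gain = best[k - size][0] + gain
            if cand_gain > best[k][0]:
                best[k] = (cand_gain, best[k - size][1] + [name])
    return best[capacity]


# Hypothetical sizes (bytes) and energy gains, for illustration only.
example = [("nat", 40, 300), ("sin", 60, 500), ("p", 30, 250), ("a", 50, 100)]
total_gain, in_spm = allocate_spm(example, capacity=100)
print(total_gain, in_spm)
```

With these assumed numbers the sketch selects `nat` and `sin`, which together fill the 100-byte scratchpad exactly.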

Results (Energy & Runtime)

Benchmark: Multi_sort (mix of sort algorithms)

[Figure: energy and cycles for Multi_sort]


Static Allocation (Cache + Scratchpad)


- Caches + Scratchpads
  - I-Mem subsystem
- Trace Generation
  - memory objects (MO)
- Conflict Graph
  - models I-Cache behavior
  - interaction of MOs
- Fine Grained Energy Model
  - cache hits
  - cache misses

[Figure: processor with scratchpad, D-Cache and I-Cache]

Example
Execution sequence: B1 ((B2 B5 B6 B7)^9 (B2 B3 B4 B7))^10 B8

[Figure: control flow graph of basic blocks B1 to B8 with edge execution counts, and the mapping of the blocks from I-Mem onto I-Cache lines 0 to 7, annotated with [hits, misses] per line]

Total Cache Misses: 40

Trace Generation
- Min #jumps across traces
  - NP-complete problem
- Greedy approach
  - Coalesce most frequently executed basic blocks (BB)
  - Size of trace <= Scratchpad Size
- Append NOPs
  - Reduce I-cache misses
  - Improve processor cycles

[Figure: control flow graph B1 to B8 partitioned into traces T1 to T5]
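The greedy approach can be sketched as follows. This is one possible reading of the bullets above, not the authors' implementation: edges are visited from most to least frequently executed, and two traces are joined only if the edge connects the end of one to the start of the other and the result still fits the size bound. The block sizes and edge counts in the example are hypothetical, and NOP padding is left out.

```python
def generate_traces(block_sizes, edges, max_size):
    """Greedily coalesce basic blocks into traces.

    block_sizes: dict block -> size in bytes.
    edges: dict (src, dst) -> execution count of the control flow edge.
    max_size: upper bound on the size of a trace (scratchpad size).
    Returns a list of traces, each a list of blocks in layout order.
    """
    trace_of = {b: [b] for b in block_sizes}  # every block starts alone
    # visit edges from most to least frequently executed
    for (src, dst), _count in sorted(edges.items(), key=lambda e: -e[1]):
        head, tail = trace_of[src], trace_of[dst]
        if head is tail:
            continue  # both blocks are already in the same trace
        # only join if src ends its trace and dst starts its trace
        if head[-1] != src or tail[0] != dst:
            continue
        if sum(block_sizes[b] for b in head + tail) > max_size:
            continue  # merged trace would exceed the size bound
        head.extend(tail)
        for b in tail:
            trace_of[b] = head
    # collect the distinct trace objects
    unique = []
    for t in trace_of.values():
        if not any(t is u for u in unique):
            unique.append(t)
    return unique


# Hypothetical block sizes and edge counts, for illustration only.
sizes = {f"B{i}": 8 for i in range(1, 9)}
cfg = {("B1", "B2"): 1, ("B2", "B3"): 10, ("B2", "B5"): 90,
       ("B3", "B4"): 10, ("B4", "B7"): 10, ("B5", "B6"): 90,
       ("B6", "B7"): 90, ("B7", "B2"): 99, ("B7", "B8"): 1}
print(generate_traces(sizes, cfg, max_size=16))
```

With a 16-byte bound and these assumed inputs, the sketch pairs B5 with B6 and B3 with B4 and leaves B1 and B8 as single-block traces.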

Conflict Graph
Execution sequence: T4 ((T1 T2 T1)^9 (T1 T3 T1))^10 T5

Conflict graph: weighted directed graph
- Nodes (traces): execution frequency
- Edges (conflict relationship): # conflict misses

[Figure: traces in I-Mem mapped to I-Cache lines 0 to 7 with [hits, misses]: T1 [200, 0], T2 [180, 20], T3 [20, 20]; conflict graph with nodes T1 (200), T2 (180), T3 (20), T4 (1), T5 (1) and two edges of weight 20]
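One way to obtain such a graph is to replay the execution sequence against a model of a direct-mapped cache. The sketch below does this under several assumptions: the trace-to-cache-line mapping `layout` is hypothetical, each occurrence of a trace fetches all of its lines, and cold misses are not counted as conflict edges, so the edge weights it reports need not equal the numbers in the figure.

```python
from collections import defaultdict


def conflict_graph(sequence, lines_of):
    """Count conflict misses between traces in a direct-mapped I-cache.

    sequence: executed traces in order, e.g. ["T4", "T1", "T2", ...].
    lines_of: dict trace -> list of cache line indices the trace occupies.
    Returns (freq, edges): freq[t] is the number of line fetches of t,
    edges[(t, u)] the number of misses of t caused by u evicting it.
    """
    owner = {}        # cache line -> trace currently stored there
    evicted_by = {}   # (trace, line) -> trace that displaced it
    freq = defaultdict(int)
    edges = defaultdict(int)
    for t in sequence:
        for line in lines_of[t]:
            freq[t] += 1
            if owner.get(line) == t:
                continue  # cache hit
            if (t, line) in evicted_by:  # conflict miss, not a cold miss
                edges[(t, evicted_by.pop((t, line)))] += 1
            if line in owner:  # remember who displaced the old content
                evicted_by[(owner[line], line)] = t
            owner[line] = t
    return dict(freq), dict(edges)


# Execution sequence from the slide; the line mapping is an assumption.
seq = ["T4"] + (["T1", "T2", "T1"] * 9 + ["T1", "T3", "T1"]) * 10 + ["T5"]
layout = {"T4": [0], "T1": [1], "T2": [2, 3], "T3": [2, 3], "T5": [4]}
freq, edges = conflict_graph(seq, layout)
print(freq, edges)
```

Because T2 and T3 share cache lines in the assumed mapping, they are the only traces that evict each other, so the resulting graph has exactly one edge in each direction between them.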

Energy Model

[Figure: energy model equation, split into a constant part and a variable part that depends on the program layout]

Problem Formulation
Formal Problem Formulation
- Given: conflict graph (G), scratchpad, I-cache, energy model
- Determine: min. energy mapping
- Assumption: no new edges; copying traces

NP-complete:
- Knapsack (no edges)
- Maximum Independent Set (E_SP_Hit = E_Cache_Hit)

Approaches: Integer Linear Programming / Greedy Heuristic

[Figure: conflict graph of traces T1 to T5, annotated with execution frequencies and bracketed weights]

Energy Consumption (I-Cache)


[Figure: energy consumption (uJ, axis 0 to 21000) versus cache size (1024 to 16384 bytes) for Direct Mapped, 2-Way and 4-Way caches; MPEG benchmark]

8kB: Most Energy Efficient I-Cache

Energy Consumption (Cache + Scratchpad)


[Figure: energy consumption (%, axis 0% to 140%) versus I-Cache configuration: 1kB, 2kB, 4kB (DM), 1kB, 2kB, 4kB (2-way), 1kB, 2kB (4-way); series: I-Cache + SP (512B), I-Cache + SP (1024B), 8kB DM I-Cache; MPEG benchmark]

Energy Consumption (Cache + Scratchpad)


[Figure: energy consumption (axis 0% to 160%) versus scratchpad size (128 to 1024 bytes); series: I-Cache Access, Scratchpad Access, I-Cache Miss, I-Mem Energy; label: Static Allocation (Scratchpad only); MPEG: 20kB, Cache: 2K DM]


Motivation (Dynamic Allocation)


    SPILL_LOAD(A);
    for (i = 0; i < 100; i++) { A[i] = ...; }
    for (j = 0; j < 100; j++) { ... = A[j]; }
    SPILL_STORE(A);
    SPILL_LOAD(B);
    for (i = 0; i < 100; i++) { B[i] = ...; }
    for (j = 0; j < 100; j++) { ... = B[j]; }
    SPILL_STORE(B);

[Figure: arrays A and B placed in main memory and scratchpad memory]

Dynamic Allocation (Scratchpad Overlay)
- increased scratchpad utilization
- overhead due to spill routines
- similar to register allocation

Comparison against Register Allocation


[Figure: processor data path and register file for RISC and CISC, with scratchpad]

- Scarce resource (register file / scratchpad)
- Lifetime of variables (temp. regs. / vars + code)
- Similar to RA for CISC, not for RISC processors
- Memory objects (vars + code) are of various sizes

Workflow (Scratchpad Overlay)


1. Memory Object Determination
2. Liveness Analysis
3. Memory Assignment
4. Onchip Address Assignment
5. Code Generation

Scratchpad Overlay: Memory Assignment and Onchip Address Assignment (steps 3 and 4)

Memory Object Determination


Memory Objects:
- Global Variables (A)
- Non-Scalar Local Variables
- Traces (T1, T2, T3, T4)

MO = {A, T1, T2, T3, T4}

[Figure: control flow graph B1 to B8 partitioned into traces T1 to T4]

Liveness Analysis
DEF-MOD-USE:
- Vars: Profiling Info.
- Traces: Static Analysis

LiveRange: fixed-point iterative method

[Figure: control flow graph B1 to B8 annotated with DEF, MOD and USE of A, T3 and T4]
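A fixed-point iterative liveness computation can be sketched as below. It is the textbook backward dataflow analysis applied to memory objects, not necessarily the exact method behind the slides. Two assumptions are made: MOD is treated like USE because a partial modification still needs the old contents, and the small control flow graph in the example is hypothetical.

```python
def live_ranges(succ, use, defs):
    """Backward liveness analysis by fixed-point iteration.

    succ: dict block -> list of successor blocks.
    use:  dict block -> set of memory objects read (USE or MOD) in the block.
    defs: dict block -> set of memory objects fully overwritten (DEF).
    Returns (live_in, live_out), each a dict block -> set of live objects.
    """
    live_in = {b: set() for b in succ}
    live_out = {b: set() for b in succ}
    changed = True
    while changed:  # repeat until no set changes any more
        changed = False
        for b in succ:
            out = set()
            for s in succ[b]:
                out |= live_in[s]
            new_in = use.get(b, set()) | (out - defs.get(b, set()))
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out


# Hypothetical CFG: B1 defines A, the loop B2 <-> B3 uses it, B4 exits.
cfg = {"B1": ["B2"], "B2": ["B3"], "B3": ["B2", "B4"], "B4": []}
live_in, live_out = live_ranges(cfg, use={"B3": {"A"}}, defs={"B1": {"A"}})
print(live_in, live_out)
```

In this example A is live from the end of B1 through the loop and dead on entry to B4, which is the kind of interval a memory assignment step can exploit.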

Memory Assignment
- Given: MOs, LiveRanges, Scratchpad
- Determine: Memory Assignment of MOs
- Assumption: Onchip address to MOs can be assigned
- Discussion: NP-complete, reduces to register allocation
- Solutions:
  - Optimal: ILP formulation (16 sec.)
  - Near Optimal: Heuristic

[Figure: processor, scratchpad and main memory]

Memory Assignment (Solution)


MO = {A, T1, T2, T3, T4}
SP Size = |A| = |T1| = |T4|

Solution: A -> SP & T3 -> SP

Inserted spill code:
- B9: SPILL_STORE(A); SPILL_LOAD(T3);
- B10: SPILL_LOAD(A);

[Figure: control flow graph B1 to B8 with DEF/MOD/USE annotations for A and T3, plus basic blocks B9 and B10 holding the spill code]

Onchip Address Assignment


Fragmentation Problem

[Figure: memory objects MO1, MO2 and MO3 placed at scratchpad offsets between 0 and 60]

- Given: Memory Assignment, Scratchpad
- Determine: Onchip Address (Offset) of MOs
- Discussion: NP-complete, reduces to ShipBuilding problem
- Solution:
  - Optimal: MIP formulation (~4 hours)
  - Near Optimal: First-fit, Best-fit heuristic
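A first-fit heuristic for the offset assignment might look like the sketch below. It assumes each memory object comes with one live range, given as a half-open interval of hypothetical program points, and places objects in the order given. The function name, sizes and live ranges are illustrative and not taken from the slides.

```python
def first_fit_offsets(objects, spm_size):
    """Assign scratchpad offsets with a first-fit heuristic.

    objects: list of (name, size, start, end); [start, end) is the live
             range during which the object must reside in the scratchpad.
    spm_size: scratchpad capacity in bytes.
    Returns dict name -> offset; raises ValueError if an object
    cannot be placed.
    """
    placed = []  # (offset, size, start, end) of objects placed so far
    offsets = {}
    for name, size, start, end in objects:
        # address ranges blocked by objects that are live at the same time
        busy = sorted((o, o + s) for o, s, st, en in placed
                      if st < end and start < en)
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:
                break  # the gap before this busy range is large enough
            offset = max(offset, hi)
        if offset + size > spm_size:
            raise ValueError(f"{name} does not fit into the scratchpad")
        placed.append((offset, size, start, end))
        offsets[name] = offset
    return offsets


# Hypothetical sizes and live ranges, for illustration only.
mos = [("MO1", 20, 0, 10), ("MO2", 20, 0, 20), ("MO3", 20, 10, 20)]
print(first_fit_offsets(mos, spm_size=60))
```

Here MO3 reuses the addresses of MO1 because their live ranges do not overlap. A best-fit variant would instead choose the smallest gap that is large enough.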

Results (Edge Detection)


[Figure: energy consumption (uJ, axis 0 to 10000) versus scratchpad size (0 to 1024 bytes); series: Total Energy (SO), Total Energy (SA); SO = Scratchpad Overlay, SA = Static Allocation]

Annotation: 1/8th Scratchpad

Results (SO vs. SA)

[Figure: Processor Energy, Memory Energy, Total Energy and Execution Time (axis 0.0 to 1.0) for scratchpad sizes 64, 100, 128, 200 and 256 bytes and the average; labels: Static Allocation, 64%, 22%, 21%, 43%; Edge Detection benchmark]

Results (SO vs. SA)


[Figure: Total Energy, Execution Time and Code Size (axis 0.00 to 2.00) for the benchmarks edge_detection, mpeg, adpcm, histogram, multisort and the average; labels: Static Allocation, 36%, 34%]

Conclusion & Future Work


- Scratchpads are energy efficient memories.
- Software allocation methods
  - Static Allocation Approach
    - avg. 30% reduction in energy consumption
    - SP + I-Cache is better than best I-Cache
  - Dynamic Allocation Approach
    - avg. 30% reduction in energy consumption
- Future Work
  - Multi-memory / Multi-Process.
  - Near-optimal solutions.

