
Low Power System Design

Feipei Lai, 33664924, flai@ntu.edu.tw, CSIE 419
Grade: Mid-term 30%, Paper presentation 40%, Final 30%
1

Key references
Intl Conf. on Computer-Aided Design (ICCAD)
Intl Symp. on Low Power Electronics and Design (ISLPED)
IEEE Trans. on Computer-Aided Design
ACM Trans. on Design Automation of Electronic Systems
IEEE/ACM Design Automation Conference (DAC)
Intl Symp. on Circuits and Systems (ISCAS)
IEEE Intl Solid-State Circuits Conference (ISSCC)
2

Outline
1. Low-Power CMOS VLSI Design
2. Physics of Power Dissipation in CMOS FET Devices
3. Power Estimation
4. Synthesis for Lower Power
5. Low Voltage CMOS Circuits
6. Low-Power SRAM Architectures
7. Energy Recovery Techniques
8. Software Design for Low Power
9. Low Power SOC Design
10. Embedded Software
3

Motivation
Energy-efficient computing is required by:
Mobile electronic systems Large-scale electronic systems

The quest for energy efficiency affects all aspects of system design
Packaging costs; cooling costs
Power supply rail design
Noise immunity
4

Technology directions

Year               1999   2002   2005   2008   2011   2014
Feature size (nm)   180    130    100     70     50     35
M trans/cm2           7     26     47    115    284    701
Chip size (mm2)     170    214    235    269    308    354
Signal pins         768   1024   1204   1280   1408   1472
Clock rate (MHz)    600    800   1100   1400   1800   2200
Wiring levels         7      8      9      9     10     10
Voltage (V)         1.8    1.5    1.2    0.9    0.6    0.6
Power (W)            90    130    160    170    174    183
5

Just as when CMOS replaced HBTs (heterojunction bipolar transistors), a lower-performance, lower-power technology ultimately will deliver superior system throughput because of the higher integration it enables.

The International Technology Roadmap for Semiconductors (ITRS) projects that MOSFETs with an equivalent oxide thickness of 5 Å and junction depths of less than 10 nm will be in production in the next decade. While 6 nm gate-length MOSFETs have been demonstrated, performance and manufacturability problems remain.

Electronic system design


Conceptualization and modeling:
From idea to model

Design:
HW: computation, storage and communication SW: application and system software

Run-time management:
Run-time system management and control of all units including peripherals
8

Examples
Modeling:
Choice of algorithm
Application-specific hardware vs. programmable hardware (software) implementation
Word-width and precision

Design:
Structural trade-off
Resource sharing and logic supplies

Management:
Operating system Dynamic power management
9

10

System models
Modeling is an abstraction:
Represent important features and hide unnecessary details

Functional models:
Capture functionality and requirements

Executable models:
Support hw and/or sw compilation and simulation

Implementation models:
Describe target realization
11

Algorithm selection
Inputs
A target macro-architecture
Abstract functional/executable spec.
Constraints
Library of algorithms

Objective
Select the most energy-efficient algorithm that satisfies constraints
12

Issues in algorithm selection


Applicable only to general-purpose primitives with many alternative implementations
Pre-characterization on the target architecture
Limited search-space exploration

13

Approximate processing
Introducing well-controlled errors can be advantageous for power
Reduced data width (coarse discretization)
Layered algorithms (successive approximations)
Lossy communication

14

Processing elements
Several classes of PEs
General-purpose processors (e.g. RISC core)
Digital signal processors (e.g. VLIW core)
Programmable logic (e.g. LUT-based FPGA)
Specialized processors (e.g. custom DCT core)

Tradeoff flexibility vs. efficiency


Specialized is faster and more power-efficient
General-purpose is flexible and inexpensive
15

Constrained optimization
Design space
Who does what and when (binding & scheduling)
Supply voltage of the various PEs:
TCLK = K·Vdd/(Vdd − Vt)²

Design target
Minimize power
Performance constraint (e.g. Titeration = 21 sec)
16
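The voltage/delay tradeoff on this slide can be sketched numerically. A minimal Python sketch of TCLK = K·Vdd/(Vdd − Vt)², assuming illustrative values K = 1 and Vt = 0.6 V (these constants are not from the slides):

```python
def clock_period(vdd, vt=0.6, k=1.0):
    """Delay model from the slide: TCLK = K * Vdd / (Vdd - Vt)**2."""
    return k * vdd / (vdd - vt) ** 2

# Scaling Vdd from 3.3 V down to 1.5 V (Vt = 0.6 V assumed):
slowdown = clock_period(1.5) / clock_period(3.3)   # clock gets ~4x slower
power_gain = (3.3 / 1.5) ** 2                      # dynamic power drops ~4.8x
```

This is the core tension in voltage scaling: power falls quadratically with Vdd, but the circuit slows down, which the binding/scheduling step must absorb.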

Datasheet Analysis (PDA)

Component   #Comp   Vdd   Iidle    Ion   %on    %idle   I(mA)
Processor       1   3.3     0.5     50   0.7    0.3     36.15
DRAM            1   3.3     0.1     12   0.7    0.3      8.43
FLASH           5   3.3     0.0      9   0.7    0.3     31.5
IR              1   3.3     0.0     64   0.05   0.95     3.2
RTC             1   3.3     0.0    0.1   1      0        0.1
DC-DC           1   3.3     0.1    5.5   0.99   0.01     5.44
17
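Each I(mA) entry in the table is a time-weighted average of the on and idle currents, scaled by the component count. A sketch reproducing the DRAM and FLASH rows (function name is mine):

```python
def avg_current_mA(n_comp, i_on, pct_on, i_idle, pct_idle):
    """Time-weighted average current for n_comp identical components (mA)."""
    return n_comp * (i_on * pct_on + i_idle * pct_idle)

# DRAM: 12 mA when on (70% of the time), 0.1 mA when idle (30%)
dram = avg_current_mA(1, 12, 0.7, 0.1, 0.3)    # 8.43 mA, matching the table
# FLASH: five chips at 9 mA on, active 70% of the time
flash = avg_current_mA(5, 9, 0.7, 0.0, 0.3)    # 31.5 mA
```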

System Design
Input
The output of the conceptualization phase
A macro-architectural template
A hardware-software partition
Component-by-component constraints

Output
Complete hardware design

18

Design process
Specify computation, storage, template components, and software
Synergic process

Fundamental tradeoff: general-purpose vs. application-specific


Flexibility has a cost in terms of power

19

Application-specific computational units


Synthesized from high-level executable specification (behavioral synthesis)
Supply voltage reduction
Load capacitance reduction
Minimization of switching activity

20

CMOS Gate Power equations


P = CL·VDD²·f0→1 + tsc·VDD·Ipeak·f0→1 + VDD·Ileak

Dynamic term: CL·VDD²·f0→1
Short-circuit term: tsc·VDD·Ipeak·f0→1
Leakage term: VDD·Ileak

21
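A small sketch evaluating the three power terms separately; all device values below are hypothetical, chosen only to illustrate typical relative magnitudes at a 0.18 µm-class node:

```python
def gate_power(c_load, vdd, f01, t_sc, i_peak, i_leak):
    """P = CL*VDD^2*f01 + tsc*VDD*Ipeak*f01 + VDD*Ileak (SI units, watts)."""
    dynamic = c_load * vdd ** 2 * f01
    short_circuit = t_sc * vdd * i_peak * f01
    leakage = vdd * i_leak
    return dynamic, short_circuit, leakage

# Hypothetical gate: 50 fF load, 1.8 V supply, 100 MHz effective 0->1 rate,
# 50 ps short-circuit window, 1 mA peak current, 1 nA leakage current.
dyn, sc, leak = gate_power(50e-15, 1.8, 100e6, 50e-12, 1e-3, 1e-9)
# dynamic dominates, short-circuit is comparable, leakage is tiny at this node
```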

Power-driven voltage scaling


From faster to power efficient by scaling down voltage supply
Traditional speed-enhancing transformations can be exploited for low power design
Pipelining
Parallelization
Loop unrolling
Re-timing
22

Advanced voltage scaling


Multiple voltages
Slow down non-critical paths with a lower supply voltage
Two or more power grids
High-efficiency voltage converters

23

Clock frequency reduction


Reducing fclk does not decrease energy
But it reduces power
And it may increase battery life

Multi-frequency clocks

24

Reducing load capacitance


Reduce wiring capacitance
Reduce local loads
Reduce global interconnect
Global interconnect can be reduced by improving spatial locality: trade off communication for computation

25

Reduce switching activity


Improve correlation between consecutive inputs to functional macros
Reduce glitching
All basic high-level-synthesis steps have been modified
A synergic approach leads to the best results

26

Application-specific processors
Parameterized processors tailored to a specific application
Optimally exploit parallelism
Eliminate unneeded features

Applied to different architectures


Single-issue cores: instruction subsetting
Superscalar cores: number and type of functional units
VLIW cores: functional units and compiler
27

Low power core processors


Low voltage
Reduce wasted switching
Specialized modes of operation/instructions
Variable voltage supply

28

Exploiting variable supply


Supply voltage can be dynamically changed during system operation
Quadratic power savings
Circuit slowdown

Just-in-time computation
Stretch execution time up to the max tolerable

29

Variable-supply architecture
High-efficiency adjustable DC-DC converter
Adjustable synchronization:
Variable-frequency clock generator
Self-timed circuits

30

Memory optimization
Custom data processors
Computation is less critical than data storage (for data-dominated applications)

General-purpose processors
A significant fraction of system power is consumed by memories

Key idea: exploit locality


Hierarchical memory Partitioned memory
31

Optimization approaches
Fixed memory access patterns
Optimize memory architecture

Fixed memory architecture


Optimize memory access patterns

Concurrently optimize memory architecture and accesses

32

Optimize memory architecture


Data replication to localize accesses
Implicit: multi-level caches
Explicit: buffers

Partitioning to minimize cost per access


Multi-bank caches Partitioned memories

33

Optimize memory accesses


Sequentialize memory accesses
Reduce address bus transitions Exploit multiple small memories

Localize program execution


Fit frequently executed code into a small instruction buffer (or cache)

Reduce storage requirements


34

Design of communication units


Trends:
Faster computation blocks, larger chips
Communication speed is critical
Energy cost of communication is significant

Multifaceted design approach:


On-chip, networks, wireless
Protocol stack
35

Optimize memory architecture and access patterns


Two phase-process
Specification (program) transformations
Reduce memory requirements Improve regularity of accesses

Build optimized memory architecture

36

Data encoding
Theoretical results:
Bounds on transition activity reduction:
The higher the entropy rate of the source, the lower the gain achievable by coding

Practical applications:
Processor-memory (and other) busses
Data busses, address busses

Transition activity reduction does not guarantee energy savings


37

Bus-Invert coding for data busses


Add redundant line INV to bus
When INV = 0
Data is equal to remaining bus lines

When INV=1
Data is complement of remaining bus lines

Performance:
Peak: at most n/2 bus lines switch
Average: code is optimal; no other code with 1-bit redundancy can do better
38

Average switching reduction is bus-width dependent:


Ex: 3.27 for an 8-bit bus

Average switching per line decreases as busses get wider


Use partitioned codes
No longer optimal (among redundant codes)

Implementation issues:
Difference (XOR) of two data samples and majority vote
39
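The bus-invert encoder described above can be sketched in a few lines; an 8-bit bus and the function/variable names are my assumptions:

```python
def bus_invert(prev_lines, data, width=8):
    """Drive data or its complement, whichever flips fewer bus lines.

    Returns (lines, inv): 'lines' is what is driven on the bus and 'inv'
    is the redundant INV line. At most width/2 data lines ever switch.
    """
    mask = (1 << width) - 1
    flips = bin((prev_lines ^ data) & mask).count("1")
    if flips > width // 2:
        return (~data) & mask, 1   # inverting is cheaper
    return data & mask, 0

lines, inv = bus_invert(0x00, 0xFF)   # all 8 bits would flip -> invert
# lines == 0x00, inv == 1: only the INV line toggles
```

The receiver re-inverts the lines whenever INV = 1, so decoding needs no extra state beyond the INV wire itself.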

Encoding instruction addresses


Most instruction addresses are consecutive
Use Gray code

Word-oriented machines:
Increments by 4 (32-bit) or by 8 (64-bit)
Modify Gray code to switch 1 bit per increment
Gray-code adder for jumps:
Harder to partition
Convert to Gray code after update
40
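A sketch of plain binary-reflected Gray coding, also showing why word-oriented machines (increments of 4) need the modified code mentioned above; helper names are mine:

```python
def to_gray(n):
    """Binary-reflected Gray code: consecutive integers differ in exactly one bit."""
    return n ^ (n >> 1)

def bit_flips(a, b):
    """Number of bus lines that toggle between codewords a and b."""
    return bin(a ^ b).count("1")

# Byte-addressed sequential fetch (+1): always a single transition
assert all(bit_flips(to_gray(a), to_gray(a + 1)) == 1 for a in range(64))

# Word-addressed fetch (+4): plain Gray code no longer gives 1-bit steps,
# which is why the slide modifies the code for word-oriented machines.
flips_step4 = bit_flips(to_gray(0), to_gray(4))   # 2 transitions
```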

T0 Code
Add redundant line INC to bus
When INC = 0
Address is equal to remaining bus lines

When INC = 1
Transmitter freezes the other bus lines
Receiver increments the previously transmitted address by a parameter called the stride

Asymptotically zero transitions for sequences


Better than Gray code
41
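The T0 scheme above can be sketched as a small stateful encoder; the class name and interface are my assumptions:

```python
class T0Encoder:
    """T0 bus encoder: freeze the bus and assert INC for in-sequence addresses."""

    def __init__(self, stride=1):
        self.stride = stride
        self.prev = None    # last address communicated to the receiver
        self.lines = 0      # current state of the address lines

    def send(self, addr):
        """Return (address lines, INC) for one bus cycle."""
        if self.prev is not None and addr == self.prev + self.stride:
            self.prev = addr
            return self.lines, 1   # lines frozen, INC=1: zero line transitions
        self.prev = addr
        self.lines = addr
        return self.lines, 0       # out-of-sequence: drive address in-band

enc = T0Encoder(stride=4)
sent = [enc.send(a) for a in (0x100, 0x104, 0x108, 0x200)]
# sequential fetches keep the bus frozen; only the jump to 0x200 drives new lines
```

For a purely sequential stream the address lines never toggle, which is the "asymptotically zero transitions" claim on the slide.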

Mixed bus encoding techniques


T0_BI:
Use two redundant lines: INC and INV
Good for shared address/data busses

Dual encoding:
Good for time-multiplexed address busses
Use redundant line SEL:
SEL = 1 denotes addresses
SEL is already present in the bus interface

Dual T0:
Use T0 code when SEL is asserted.

Dual T0_BI:
Use T0 when SEL is asserted; otherwise use BI
42

Impact of software
For a given hardware platform, the energy to realize a function depends on software
Operating system
Different algorithms to embody a function
Different coding styles
Application software compilation

43

Coding styles
Use processor-specific instruction style:
Function-call style
Conditionalized instructions (for ARM)

Follow general guidelines for software coding


Use table look-up instead of conditionals
Make local copies of global variables so that they can be assigned to registers
Avoid multiple memory look-ups with pointer chains

44

Example: ARM variable types


The default int variable type is 18.2% more energy efficient than char or short
Sign or zero extension is needed for shorter variable types

45

ARM conditional execution


All ARM instructions are conditional
Conditional execution reduces the number of branches

46

Instruction-level analysis
Analyze loop execution containing specific instructions
The loop should be long enough to neglect overhead and short enough to avoid cache misses
About 200 instructions

Measure instruction base cost Measure inter-instruction effects


47

Compilation for low-power operation scheduling


Reorder instructions:
Reduce inter-instruction effects Switching in control part

Cold scheduling:
Reorder instructions to reduce inter-instruction effects on instruction bus
Consider instruction op-codes
Inter-instruction cost is op-code Hamming distance
Use a list scheduler whose priority criterion is tied to Hamming distance
48
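The cold-scheduling priority function can be sketched as a greedy list scheduler; the 8-bit opcodes and dependence sets below are hypothetical:

```python
def cold_schedule(ops, deps):
    """Greedy list scheduling: among ready ops, issue the one whose opcode
    has the smallest Hamming distance to the previously issued opcode.

    ops:  {name: opcode_int}
    deps: {name: set of op names that must issue first}
    A sketch of the priority function only, not a full scheduler.
    """
    def hamming(a, b):
        return bin(a ^ b).count("1")

    order, done, last = [], set(), 0
    while len(order) < len(ops):
        ready = [n for n in ops if n not in done and deps.get(n, set()) <= done]
        best = min(ready, key=lambda n: hamming(ops[n], last))
        order.append(best)
        done.add(best)
        last = ops[best]
    return order

# Hypothetical opcodes; "b" depends on "a". "c" is issued first because its
# opcode is closest (Hamming distance 3) to the initial bus state of 0.
ops = {"a": 0b00001111, "b": 0b11110000, "c": 0b00001110}
order = cold_schedule(ops, {"b": {"a"}})
```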

Scheduling to reduce off-chip traffic


Schedule instructions to minimize Hamming distance
Scheduling algorithm:
Operates within each basic block
Searches for linear orders consistent with dataflow
Prunes the search space by avoiding redundant solutions (hashed sub-trees) and heuristically limits the number of sub-trees
49

Compilation for low-power register assignment


Minimize spills to memory
Register labeling:
Reduce switching in the instruction register/bus and register-file decoder by encoding
Reduce Hamming distance between addresses of consecutively accessed registers
This approach is complementary to cold scheduling
50

Other compiler optimizations


Loop unrolling to reduce overhead
Contra: increased code space

Software pipelining
Decreases the number of stalls by fetching instructions from different iterations

Eliminate tail recursion


Reduce overhead and use of stack
51

Dynamic power management


Systems are:
Designed to deliver peak performance
Not needing peak performance most of the time

Components are idle at times

Dynamic power management (DPM):


Put idle components in low-power non-operational states when idle

Power manager:
Observes and controls the system
Power consumption of the power manager is negligible
52

Structure of power-manageable systems


Systems consists of several components:
E.g., laptop: processor, memory, disk, display
E.g., SOC: CPU, DSP, FPU, RF unit

Components may:
Self-manage state transitions Be controlled externally

Power manager:
Abstraction of power control unit May be realized in hardware or software
53

Power manageable components


Components with several internal states
Corresponding to power and service levels

Abstracted as a power state machine


State diagram with:
Power and service annotation on states Power and delay annotation on edges

54

Predictive techniques
Observe time-varying workload
Predict idle period: Tpred ≈ Tidle
Go to the sleep state if Tpred is long enough to amortize the state-transition cost

Main issue: prediction accuracy

55
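The amortization condition above is commonly written as a break-even test; a sketch with hypothetical disk-drive numbers (the exact model and values are not from the slides):

```python
def should_sleep(t_pred, t_transition, p_on, p_sleep, e_transition):
    """Sleep only if the predicted idle time amortizes the transition cost.

    Energy saved while asleep, (p_on - p_sleep) * t_pred, must exceed the
    combined shutdown + wakeup energy overhead e_transition.
    """
    if t_pred <= t_transition:
        return False   # not even enough time to go down and come back up
    return (p_on - p_sleep) * t_pred > e_transition

# Hypothetical disk: 2.0 W spinning, 0.1 W asleep, 0.5 s transition, 3.0 J overhead
long_idle = should_sleep(5.0, 0.5, 2.0, 0.1, 3.0)    # 9.5 J saved > 3 J cost
short_idle = should_sleep(1.0, 0.5, 2.0, 0.1, 3.0)   # 1.9 J saved < 3 J cost
```

This makes the slide's "main issue" concrete: an over-optimistic Tpred sleeps too eagerly and pays the transition energy without recouping it.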

When to use predictive techniques


When the workload has memory

Implementing predictive schemes:
Predictor families must be chosen based on workload type
Predictor parameters must be tuned to the instance-specific workload statistics
When the workload is non-stationary or unknown, on-line adaptation is required
56

Operating system-based power management


In systems with an operating system (OS)
The OS knows which tasks are running and waiting
The OS should make the DPM decisions

Advanced Configuration and Power Interface (ACPI)


Open standard to facilitate design of OS-based power management
57

58

Implementations of DPM
Shut down idle components
Gate the clock of idle units
Clock and voltage setting:
Support multiple-voltage, multiple-frequency components
Components with multiple working power states
59
