Академический Документы
Профессиональный Документы
Культура Документы
l Today’
s Situation: Microprocessor & DRAM
l IRAM Opportunities
l Initial Explorations
l Energy Efficiency
l Directions for New Architectures
l Vector Processing
l Serial I/O
l IRAM Potential, Challenges, & Industrial Impact
1000 µProc
“Moore’s Law”
60%/yr.
100 Processor-Memory
Performance Gap:
(grows 50% / year)
10
DRAM
7%/yr.
1
1980
1984
1985
1986
1987
1988
1989
1990
1991
1992
1996
1997
1998
1999
2000
1981
1982
1983
1993
1994
1995
Time
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 4
Processor-Memory
Performance Gap “Tax”
$10,000 $7B
$5,000
$0
94
94
96
96
96
96
97
94
94
95
95
95
95
1Q
3Q
1Q
2Q
3Q
4Q
1Q
2Q
4Q
1Q
2Q
3Q
4Q
16 4 DRAM growth
8 MB @ 60% / year
16 MB 8 2
32 MB 4 1
Memory per 8 2
64 MB
System growth
128 MB @ 25%-30% / year 4 1
256 MB 8 2
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 9
Multiple Motivations for IRAM
l Some apps: energy, board area, memory size
l Gap means performance challenge is memory
SCALAR VECTOR
(1 operation) (N operations)
r1 r2 v1 v2
+ +
r3 v3 vector
length
General vr
vr1
0
Control
Purpose Registers
Registers
(32) vr31 vcr0
$vpw vcr1
vf0
Flag vf1
vcr31
Registers
(32) 32b
vf31
1b
l Conditional Execution
l Most operations can be masked
l No need for conditional move instructions
l Exception Processing
l Integer: overflow, saturation
l IEEE Floating point: Inexact, Underflow, Overflow,
divide by Zero, inValid operation
l Memory: speculative loads
Scalar Vector
l N ops per cycle l N ops per cycle
⇒ O(N2) circuitry ⇒ O(N + εN2) circuitry
l HP PA-8000 l T0 vector micro*
l 4-way issue l 24 ops per cycle
l reorder buffer: l 730K transistors total
850K transistors l only 23 5-bit register
l incl. 6,720 5-bit register number comparators
number comparators
*See http://www.icsi.berkeley.edu/real/spert/t0-intro.html
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 39
MIPS R10000 vs. T0
l 0.13 µm,
1 Gbit DRAM
Memory (512 Mbits / 64 MBytes)
l >1B Xtors:
98% Memory,
Xbar, Vector ⇒
C regular design
P I
8 Vector Lanes (+ 1 spare) U O l Spare Lane &
+$ Memory ⇒
90% die
repairable
Memory (512 Mbits / 64 MBytes) l Short signal
distance ⇒
speed scales
<0.1 µm
Crossbar Switch
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 45
Tentative V-IRAM-1 Floorplan
0.18 µm DRAM, l
32 MB in
Memory (128 Mbits / 16 MBytes) 16 banks x 256b,
128 subbanks
l 0.25 µm,
C 5 Metal Logic
P l 200 MHz CPU
4 Vector Pipes/Lanes
U I/O 4K I$, 4K D$
+$ l 4 Float. Pt./Integer
vector units
Memory (128 Mbits / 16 MBytes) l die: 16 x 16 mm
l xtors: 270M
l Serial
I/O at 2-4 GHz today (v. 0.1 GHz bus)
l IRAM: 2-4 GIPS + 2 * 2-4Gb/s I/O + 2 * 2-4Gb/s Net
l ISIMM: 16 IRAMs + net switch + FC-AL links (+ disks)
l 1 IRAM sorts 9 GB, SmartSIMM sorts 100 GB
See “IRAM and SmartSIMM: Overcoming the I/O Bus Bottleneck”, Workshop on
Mixing Logic and DRAM, 24th Int’l Symp. on Computer Architecture, June 1997
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 49
Why IRAM now?
Lower risk than before
l Faster
Logic + DRAM available now/soon?
l DRAM manufacturers now willing to listen
l Before not interested, so early IRAM = SRAM
l Past efforts memory limited ⇒ multiple chips
⇒ 1st solve the unsolved (parallel processing)
l Gigabit DRAM ⇒ ~100 MB; OK for many apps?
l Systems headed to 2 chips: CPU + memory
l Embedded apps leverage energy efficiency,
adjustable memory capacity, smaller board area
⇒ 115M embedded 32b RISC in 1996 [Microproc. Report]
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 50
IRAM Challenges
l Chip
l Good performance and reasonable power?
l Speed, area, power, yield, cost in DRAM process?
l Architecture
l How to turn high memory bandwidth into
performance for real applications?
l Extensible IRAM: Large program/data solution?
(e.g., external DRAM, clusters, CC-NUMA, IDISK ...)
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 51
IRAM Conclusion
l Contact us if you’
re interested:
http://iram.cs.berkeley.edu/
rfromm@cs.berkeley.edu
patterson@cs.berkeley.edu
l Thanks for advice/support:
l DARPA, California MICRO, ARM, IBM, Intel,
LG Semiconductor, Microsoft, Neomagic,
Samsung, SGI/Cray, Sun Microsystems
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 53
Backup Slides
Steps:
row decoder
Word Line
1. Precharge
1k rows
bitline
2. Data-Readout
sense-amp
3. Data-Restore
4. Column Access column decoder
l More banks
l Each bank can independently process a separate address stream
l Independent Sub-Banks
l Hides memory latency
l Increases effective cache size (sense-amps)
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 59
Possible DRAM Innovations #2
l Sub-rows
l Save energy when not accessing all bits within a row
l Row buffers
l Increase access bandwidth by overlapping precharge
and read of next row access with col accss of prev row
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 61
Commercial IRAM highway is
governed by memory per IRAM?
Laptop 32 MB
Network Computer
Super PDA/Phone
8 MB
Video Games
Graphics 2 MB
Acc.
l Architectural models
l Simple CPU
l IRAM and Conventional memory configurations for two different
die sizes (Small ≈StrongARM; Large ≈64 Mb DRAM)
l Record base CPI and activity at each level of memory
hierarchy with shade
l Estimate performance based on access CPU speed
and access times to each level of memory hierarchy
l Benchmarks: hsfsys, noway, nowsort, gs, ispell,
compress, go, perl
Data cache
Instruction cache
0.80
3.00
2.00 0.74
0.59
0.29 0.25
0.38 1.16
1.00 0.90 0.56 0.63
0.46 0.68
0.00
L-C-32
L-C-32
L-C-16
L-C-16
L-C-32
L-C-16
S-I-16
S-I-16
S-I-16
S-I-32
S-I-32
S-I-32
S-C
S-C
S-C
L-I
L-I
L-I
Data cache
Instruction cache
3.00
0.60
2.00
0.41 0.92
0.44 0.58
0.61 0.66
1.00 0.76
0.00
L-C-32
L-C-16
L-C-32
L-C-16
S-I-16
S-I-16
S-I-32
S-I-32
S-C
S-C
L-I
L-I
Model Model
2.00 0.60
0.57
0.00
L-C-32
L-C-16
S-I-16
L-C-32
L-C-32
S-I-32
L-C-16
L-C-16
S-I-16
S-I-16
S-I-32
S-I-32
S-C
S-C
S-C
L-I
L-I
L-I
Model Model Model
…
Memory Latency: ~100 cycles ld.v
mem
VLOAD A T VW add.v
Load -> ALU RAW: ~100 cycles st.v
ld.v
VALU VR X1 X2 … XN VW
mem
add.v
VSTORE A T VR st.v
…
l Load → ALU sees full memory latency (large)
…
ld.v
VLOAD A T VW add.v
Load -> ALU RAW: ~6 cycles st.v
ld.v
VALU FIFO VR X1 X2 … XN VW
add.v
st.v
VSTORE A T VR
…
l Delay ALU instructions until memory data returns
l Load → ALU sees functional unit latency (small)
l MPEG video
l Motion estimation
(Cedric Krumbein, UCB)
l MPEG audio
l FFTs, filtering
l Image Processing
l Adobe Photoshop
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 83
Speech and Handwriting Recognition
l Speech recognition
l Front-end: filters/FFTs
l Phoneme probabilities: Neural net
l Back-end: Viterbi/Beam Search
l Image/video serving
l Format conversion
l Query by image content
l Structure copying
l Standard C libraries: mem*, str*
l Dhrystone 1.1 on T0: 1.98 speedup with vectors
l Dhrystone 2.1 on T0: 1.63 speedup with vectors
l Garbage Collection
l “Vectorized Garbage Collection”, Appel and
Bendiksen, Journal Supercomputing, 3:151-160
l Vector GC 9x faster than scalar GC on Cyber 205
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 114
Operation & Instruction Count:
RISC v. Vector Processor
Spec92fp Operations (M) Instructions (M)
Program RISC Vector R / V RISC Vector R / V
swim256 115 95 1.1x 115 0.8 142x
hydro2d 58 40 1.4x 58 0.8 71x
nasa7 69 41 1.7x 69 2.2 31x
su2cor 51 35 1.4x 51 1.8 29x
tomcatv 15 10 1.4x 15 1.3 11x
wave5 27 25 1.1x 27 7.2 4x
mdljdp2 32 52 0.6x 32 15.8 2x
Vectors reduce ops by 1.2X, instructions by 20X!
from F. Ouintana, University of Barcelona
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 115
V-IRAM-1 Tentative Plan
l Phase I: Feasibility stage (H1’98)
l Test chip, CAD agreement, architecture defined
l Phase 2: Design Stage (H2’98)
l Simulated design
l Phase 3: Layout & Verification (H2’99)
l Tape-out
l Phase 4: Fabrication,Testing, and
Demonstration (H1’00)
l Functional integrated circuit
l First microprocessor ≈250M transistors!
Richard Fromm, IRAM tutorial, ASP-DAC ‘98, February 10, 1998 117