DRAM TUTORIAL — ISCA 2002

DRAM: Architectures, Interfaces, and Systems
Bruce Jacob and David Wang, University of Maryland

DRAM: why bother? (I mean, besides the "memory wall" thing? ... Is it just a performance issue?)
Outline

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• DRAM Evolution: Interface Path
• Future Interface Trends & Research Areas
• Performance Modeling: Architectures, Systems, Embedded
Basics: DRAM ORGANIZATION

[Figure: DRAM array organization — row decoder, bit lines, and the storage cell with its capacitor and switching element; the cell's charge is shared onto the bit lines.]
Basics: BUS TRANSMISSION

[Figure: the address travels across the bus from the CPU/controller to the DRAM's row decoder.]
Basics: [PRECHARGE and] ROW ACCESS

[Figure: precharge and row access — the bit lines are precharged, then the selected row's cells share their charge onto the bit lines, where the sense amps resolve it.]
Basics: COLUMN ACCESS

[Figure: column access — the column address selects a subset of the open row in the sense amps to be read by the memory controller.]

READ Command, or CAS: Column Address Strobe
Basics: DATA TRANSFER

[Figure: the selected data is driven from the sense amps out to the data pins.]
Basics: BUS TRANSMISSION

[Figure: data travels back across the bus from the DRAM (column decoder / row decoder) to the CPU/memory controller.]
Basics: DRAM LATENCY

DRAM "latency" isn't deterministic: an access may require only a CAS, or a RAS+CAS, and there may be significant queuing delays within the CPU and the memory controller.

Each transaction has some overhead, and some types of overhead cannot be pipelined. This means that, in general, longer bursts are more efficient.

[Figure: request path from CPU through memory controller to DRAM and back, with stages labeled A–F.]

A: Transaction request may be delayed in queue
B: Transaction request sent to memory controller
C: Transaction converted to command sequences (may be queued)
D: Command(s) sent to DRAM
E1: Requires only a CAS, or
E2: Requires RAS + CAS, or
E3: Requires PRE + RAS + CAS
F: Transaction sent back to CPU

"DRAM Latency" = A + B + C + D + E + F
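A minimal sketch of that decomposition, assuming made-up per-stage delays (none of these numbers come from the slide; only the A–F structure does):

```python
# Sketch of the A..F latency decomposition above.
# All per-stage delays are illustrative assumptions, not measured values.

def dram_latency_ns(row_state: str,
                    queue_ns: float = 10.0,    # A: request queued at the CPU
                    request_ns: float = 5.0,   # B: request to memory controller
                    convert_ns: float = 5.0,   # C: converted to command sequence
                    command_ns: float = 10.0,  # D: command(s) sent to DRAM
                    return_ns: float = 10.0):  # F: data returned to CPU
    # E depends on the state of the addressed bank/row:
    core = {
        "open":     15.0,                 # E1: CAS only (row already open)
        "closed":   15.0 + 15.0,          # E2: RAS + CAS
        "conflict": 15.0 + 15.0 + 15.0,   # E3: PRE + RAS + CAS
    }[row_state]
    return queue_ns + request_ns + convert_ns + command_ns + core + return_ns

for state in ("open", "closed", "conflict"):
    print(state, dram_latency_ns(state), "ns")
```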
Basics: PHYSICAL ORGANIZATION

[Figure: x2, x4, and x8 DRAM parts — each has the same memory array, row decoder, and column decoder; only the data width per chip differs.]
Basics: Read Timing for Conventional DRAM

[Timing diagram: the address bus carries Row Address then Column Address for each access; DQ carries Valid Dataout after each pair.]
DRAM Evolutionary Tree

Since DRAM's inception, there has been a stream of changes to the design, from FPM to EDO to Burst EDO to SDRAM. The changes are largely minor structural modifications that target THROUGHPUT.

Everything up to and including SDRAM has been relatively inexpensive, especially considering the pay-off (FPM was essentially free, EDO cost a latch, PBEDO cost a counter, SDRAM cost a slight re-design). However, we've run out of "free" ideas, and now all changes are considered expensive ... thus there is no consensus on new directions, and a myriad of choices has appeared.

[Figure: evolutionary tree — Conventional DRAM branches into (mostly) structural modifications targeting throughput (FPM, EDO, P/BEDO, SDRAM, ESDRAM, VCDRAM) and structural modifications targeting latency (FCRAM, MOSYS).]
DRAM Evolution: Read Timing for Conventional DRAM

[Timing diagram: RAS and CAS strobes with Row Address / Column Address on the address bus and Valid Dataout on DQ; the legend marks Row Access, Column Access, Transfer Overlap, and Data Transfer. Every access requires a full RAS/CAS pair.]
DRAM Evolution: Read Timing for Fast Page Mode

FPM allows you to keep the sense amps active for multiple CAS commands ... much better throughput.

[Timing diagram: one Row Address followed by several Column Addresses under a single RAS; Column Access, Transfer Overlap, and Data Transfer repeat per CAS.]
DRAM Evolution: Read Timing for Extended Data Out

Solution to that problem: instead of simple tri-state buffers, use a latch as well. By putting a latch after the column mux, the next column address command can begin sooner.

[Timing diagram: as with FPM, but data is held in the output latch, so the CAS for the next column can start while the previous data transfer completes.]
DRAM Evolution: Read Timing for Burst EDO

By driving the column-address latch from an internal counter rather than an external signal, the minimum cycle time for driving the output bus was reduced by roughly 30%.

[Timing diagram: a single Row Address and Column Address; subsequent CAS toggles burst out sequential data without new column addresses.]
DRAM Evolution: Read Timing for Pipeline Burst EDO

"Pipeline" refers to the setting up of the read pipeline: the first CAS\ toggle latches the column address, and all following CAS\ toggles drive data out onto the bus. Therefore, data stops coming when the memory controller stops toggling CAS\.

[Timing diagram: Row Address and one Column Address, then data transfers pipelined behind successive CAS\ toggles.]
DRAM Evolution: Read Timing for Synchronous DRAM

Main benefit: it frees the CPU or memory controller from having to control the DRAM's internal latches directly ... the controller/CPU can go off and do other things during the idle cycles instead of waiting. Even though the time-to-first-word latency actually gets worse, the scheme increases system throughput.

[Timing diagram: a free-running Clock; ACT then READ on the command bus with Row Addr and Col Addr; data bursts out synchronously a CAS latency later.]
DRAM Evolution: Inter-Row Read Timing for ESDRAM

The output latch on EDO allowed you to start CAS sooner for the next access (to the same row).

[Timing diagram: regular CAS-2 SDRAM, read/read to the same bank — Row Addr / Col Addr for the first access, then Bank / Row / Col for the second.]
DRAM Evolution: Write-Around in ESDRAM

A neat feature of this type of buffering: write-around.

[Timing diagram: regular CAS-2 SDRAM vs. ESDRAM, R/W/R to the same bank, rows 0/1/0 — with write-around, the second read to row 0 proceeds without re-opening the row, so the data bursts pack more tightly on DQ.]
DRAM Evolution: Internal Structure of Virtual Channel

Main thing: it is like having a bunch of open row buffers (a la Rambus), but the problem is that you must deal with the cache directly (move data into and out of it), not the DRAM banks ... this adds an extra couple of cycles of latency. However, you get good bandwidth if the data you want is in the cache, and you can "prefetch" into the cache ahead of when you want it. Originally targeted at reducing latency; now that SDRAM is CAS-2 and RCD-2, this makes sense only in a throughput way.

[Figure: banks A and B share 16 channels (2 Kbit segments) between the sense amps and the input/output buffer; a selector/decoder connects segments to the DQs.]
DRAM Evolution: Internal Structure of Fast Cycle RAM

FCRAM opts to break up the data array: it activates only a portion of the word line. Where SDRAM decodes 13 row-address bits to select a full row (assuming the array is 8K x 1K), FCRAM uses 15 bits to select a smaller sub-array (the data sheet does not specify the exact organization).

[Figure: SDRAM 8M array (8K rows x 1K bits, 13-bit row decode) vs. FCRAM 8M array with a 15-bit decode.]
DRAM Evolution: Internal Structure of MoSys 1T-SRAM

[Figure: MoSys 1T-SRAM internal organization, driving the DQs.]
DRAM Evolution: Comparison of Low-Latency DRAM Cores

Here's an idea of how the designs compare. Bus speed == CAS-to-CAS rate; RAS–CAS == time to read data from the capacitors into the sense amps; RAS–DQ == RAS to valid data.

DRAM Type     Data Bus Speed (MHz)  Bus Width (per chip)  Peak BW (per chip)  RAS–CAS (tRCD)  RAS–DQ (tRAC)
PC133 SDRAM   133                   16                    266 MB/s            15 ns           30 ns
VCDRAM        133                   16                    266 MB/s            30 ns           45 ns
FCRAM         200 * 2               16                    800 MB/s            5 ns            22 ns
1T-SRAM       200                   32                    800 MB/s            —               10 ns
Outline

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• Memory System Details (Lots)
What Does This All Mean?

Some technologies have legs, some do not, and some have gone belly up. We'll start by examining the fundamental technologies (I/O, packaging, etc.), then explore some of these technologies in depth a bit later.

[Figure: the field of contenders — FPM, EDO, BEDO, SDRAM, DDR SDRAM, ESDRAM, FCRAM, RLDRAM, D-RDRAM, SLDRAM, netDRAM, DDR II, xDDR II.]
Cost–Benefit Criterion

What is a "good" system?

[Figure: cost–benefit trade-off, including package cost.]
Memory System Design

[Figure: the design components — clock network, I/O technology.]
DRAM Interfaces: The Digital Fantasy

But...
The Real World: Read from FCRAM™ @ 400 MHz DDR (the non-termination case)

[Oscilloscope plot: FCRAM-side VDDQ (pad) waveforms; measured skew = 158 ps and 102 ps.]
Signal Propagation

[Figure: a signal propagating along a bus from point A to point B.]
Clocking Issues

H-tree?

[Figure 1: a clock source (Clk SRC) driving loads 0 through N along a line. Figure 2: H-tree clock distribution.]
Clocking Issues

[Figure: Clk SRC and the signal direction along the bus — loads see the clock at different times depending on position.]
Path Length Differential

[Figure: a controller driving bus signals 1 and 2 over paths #1, #2, and #3, through intermodule connectors, to loads A and B.]

We purposefully routed path #2 to be a bit longer than path #1 to illustrate the point about signal path length differentials. As illustrated, signals will reach load B at a later time than load A, simply because B is farther away from the controller than load A.
Subdividing Wide Busses

It's hard to bring a wide parallel bus from point A to point B, but it's easier to bring smaller groups of signals from A to B. To ensure proper timing, we also send along a source-synchronous clock signal that is path-length matched with the signal group it covers. In this figure, signal groups 1, 2, and 3 may have some timing skew with respect to each other, but within a group the signals will have minimal skew. (A smaller channel can be clocked higher.)

[Figure: a wide bus from A to B split into narrow channels 1, 2, and 3 routed around an obstruction, each with its own source-synchronous local clock signal.]
Why Subdivision Helps

Analogy: it's a lot harder to schedule 8 people for one meeting than to schedule 2 meetings with 4 people each. The results of the two meetings can be correlated later.

[Figure: the worst-case skew of {Chan 1 + Chan 2} for one wide channel vs. the separate, smaller worst-case skews of sub-channel 1 and sub-channel 2.]
Timing Variations

[Figure: a controller with 4 loads; the waveform of a command driven to 1 load vs. to 4 loads differs in timing.]
Loading Balance

To ensure that a lightly loaded system and a fully loaded system do not differ significantly in timing, we either send duplicate signals to different memory modules, or we use the same signal line but drive the I/O pads with variable strength, depending on whether the system has 1, 2, 3, or 4 loads.

[Figure: one controller with duplicate signal lines per module vs. one controller with variable signal drive strength.]
Topology
SDRAM Topology Example

Very simple topology. The clock signal that turns around is very nice; it solves the problem of needing multiple clocks.

[Figure: a single-channel SDRAM controller drives command & address to eight x16 DRAM chips; the 64-bit data bus is split 16 bits per chip column. Note the loading imbalance: the command/address bus is loaded by every chip, while each data line sees only a subset.]
RDRAM Topology Example

All signals in this topology — Addr/Cmd/Data/Clock — are sent point to point on channels that are path-length matched by definition. Packets travel down parallel paths; skew is minimal by design.

[Figure: the RDRAM controller's channel snakes through every chip; the controller clock turns around at the far end.]
I/O Technology

RSL vs. SSTL-2, etc. (like ECL vs. TTL of another era). Δt is on the order of ns; we want it to be on the order of ps.

[Figure: a signal swinging through Δv around the logic-high threshold.]

Slew Rate = Δv / Δt
Smaller Δv ⇒ smaller Δt at the same slew rate ⇒ increased rate of bits/s/pin
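A quick back-of-the-envelope version of that argument; the swing values and the slew rate below are illustrative assumptions, not figures from the slide:

```python
# Illustrative numbers only: how a smaller voltage swing shortens the
# transition time (and so raises the achievable bit rate) at a fixed slew rate.

slew_rate_v_per_ns = 1.0   # assumed fixed by the driver technology

for swing_v in (2.5, 0.8, 0.2):           # roughly LVTTL-, SSTL-, RSL-class swings
    dt_ns = swing_v / slew_rate_v_per_ns  # Δt = Δv / slew rate
    print(f"swing {swing_v:4.1f} V -> transition {dt_ns * 1000:6.0f} ps")
```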
I/O — Differential Pair

[Figure: differential-pair signaling.]
I/O — Multi-Level Logic

[Figure: a 4-level (2 bits per symbol) signal — voltage vs. time, divided by references Vref_2, Vref_1, and Vref_0 into ranges for logic 10 (top), 11, 01, and 00 (bottom).]
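A minimal sketch of the receive side, assuming the level ordering shown in the figure (00 lowest, then 01, 11, 10); the reference voltages are made-up values:

```python
# Decode a sampled voltage into a 2-bit symbol by comparing against three
# reference levels, using the ordering from the figure. Vref values are
# assumptions for illustration.

VREF_0, VREF_1, VREF_2 = 0.5, 1.0, 1.5   # volts, illustrative

def decode(sample_v: float) -> str:
    if sample_v < VREF_0:
        return "00"
    elif sample_v < VREF_1:
        return "01"
    elif sample_v < VREF_2:
        return "11"
    else:
        return "10"

for v in (0.2, 0.7, 1.2, 1.8):
    print(v, "->", decode(v))
```

Note the Gray-coded ordering in the figure: adjacent voltage levels differ in only one bit, so a one-level sampling error corrupts only one of the two bits.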
Packaging

Different packaging types impact cost and speed. Slow parts can use the cheapest packaging available; faster parts may have to use more expensive packaging. This has long been accepted in the higher-margin processor world, but in DRAM, each cent has to be hard fought for. To some extent, the demand for higher performance is pushing memory makers toward more expensive packages.

[Figure: package evolution — DIP ("good old days"), SOJ (Small Outline J-lead), TSOP (Thin Small Outline Package), LQFP, FBGA (Fine Ball Grid Array).]

Features / target specification:
• Speed: 800 Mbps (FBGA), 550 Mbps (LQFP)
• Vdd/Vddq: 2.5 V / 2.5 V (1.8 V)
• Interface: SSTL_2
Access Protocol

If I have a 16-bit-wide command to send from A to B, I need 16 pins; if I have fewer than 16, I need multiple cycles. How many bits do I need to send from point A to point B? How many pins do I get? Cycles = Bits / Pins.

[Timing diagram: single-cycle command — Cmd r0 in one cycle, data burst d0 d0 d0 d0 follows; multiple-cycle command — Cmd r0 spread over four cycles, then the d0 burst.]
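The arithmetic as a sketch, using the slide's 16-bit command as the example:

```python
import math

# Cycles = Bits / Pins, rounded up to whole bus cycles.
def command_cycles(command_bits: int, pins: int) -> int:
    return math.ceil(command_bits / pins)

print(command_cycles(16, 16))  # 1 cycle  (single-cycle command)
print(command_cycles(16, 4))   # 4 cycles (multiple-cycle command)
```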
Access Protocol (r/r)

There is inherent latency between the issuance of a read command and the response of the chip with data. To increase efficiency, a pipelined structure is necessary to obtain full utilization of the command, address, and data busses. Unlike an "ordinary" pipeline on a processor, a memory pipeline has data flowing in both directions.

Architecture-wise, we should be concerned with full utilization everywhere, so we can use the least number of pins for the greatest benefit; in actual use, however, we are usually concerned with full utilization of the data bus.

[Timing diagram: consecutive cache-line read requests to the same DRAM row — Row a0, then Col r0 and r1, with data bursts d0 d0 d0 d0 d1 d1 d1 d1 back to back; RAS latency, CAS latency, and the pipelined access are marked.]

a = Active (open page); r = Read (column read); d = Data (data chunk)
Access Protocol (r/w)

One datapath, two commands: the DRAM chips determine the latency of data after a read command is received, but the controller determines the timing relationship between the write command and the data being written to the DRAM chips. (If the DRAM device cannot handle pipelined R/W, gaps appear.)

[Figure: the DRAM's data in/out buffers, column decoder, and sense amps share one datapath for reads and writes.]

Case 1: the controller sends write data at the same time as the write command; a read following the write to a different DRAM device can be pipelined.
[Timing diagram: Col w0 then r1; write burst d0 d0 d0 d0 followed by read burst d1 d1 d1 d1.]

Solution: delay the data of the write command to match the read latency.
[Timing diagram: with the write data delayed, the w0 and r1 bursts pack onto the data bus without a gap.]
Access Protocol (pipelines)

To increase "efficiency", CAS pipelining is required. How many commands must one device support concurrently? 2? 3? 4? (Depends on what?)

[Timing diagram: three back-to-back pipelined read commands — Col r0 r1 r2 with data bursts d0 x4, d1 x4, d2 x4 packed on the data bus; CAS latency marked.]
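One way to answer "depends on what?": the device must track every command issued but not yet retired, which is set by the CAS latency and the burst duration. A sketch, assuming a new command is issued once per burst:

```python
import math

# How many read commands are in flight at once if a new command is issued
# every burst, and each command's data arrives cas_latency cycles later?
def commands_in_flight(cas_latency_cycles: int, burst_cycles: int) -> int:
    # A command issued at time t occupies the device until t + CAS latency + burst.
    return math.ceil((cas_latency_cycles + burst_cycles) / burst_cycles)

print(commands_in_flight(cas_latency_cycles=3, burst_cycles=4))  # -> 2
print(commands_in_flight(cas_latency_cycles=8, burst_cycles=4))  # -> 3
```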
Outline

The 440BX used 132 pins to control a single SDRAM channel, not counting power & ground; the 845 chipset now uses only 102.

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
SDRAM System In Detail

[Figure: single-channel SDRAM controller driving DIMM1 through DIMM4.]
SDRAM Chip

SDRAM: inexpensive packaging, lowest cost (LVTTL signaling), standard 3.3 V supply voltage. The DRAM core and I/O share the same power supply.

• 133 MHz (7.5 ns cycle time)
• 256 Mbit
• Multiplexed command/address bus
• 54-pin TSOP
• Programmable burst length: 1, 2, 4, or 8
• Quad banks internally
• Supply voltage of 3.3 V
SDRAM Access Protocol (r/r)

We've spent some time discussing pipelined back-to-back read commands sent to the same chip; now let's try to pipeline commands to different chips.

[Figure: memory controller with SDRAM chips #0 and #1 on a shared bus; steps 1–4 show the read command and address assertion to each chip and the resulting data bus utilization, CASL = 3.]
SDRAM Access Protocol (w/r)

[Figure 1: consecutive reads from chips 0 through N — worst case = Dist(N) - Dist(0). Figure 2: a read after a write — write to the (N-1)th chip, then read from the 0th.]

I show different paths, but these signals share the bi-directional data bus. For a read to follow a write to a different chip, the worst case is when we write to the (N-1)th chip and then expect to pipeline a read command in the next cycle right behind it; the worst-case signal path skew is the sum of the distances. Isn't N to N even worse? No — SDRAM does not support a pipelined read behind a write on the same chip. Also, it's not as bad as I project here, since read cycles are center-aligned and writes are edge-aligned, so in essence we get 1 1/2 cycles to pipeline this case instead of just 1 cycle. Still, this problem limits the frequency scalability of SDRAM; idle cycles may be inserted to meet timing.
SDRAM Access Protocol (w/r)

Timing bubbles: more dead cycles.

[Figure: memory controller with SDRAM chips #0 and #1; write w0 to one chip followed by read r1 from the other leaves dead cycles on the shared data bus between the d0 and d1 bursts.]
DDR SDRAM System

[Figure: DDR SDRAM controller driving DIMM1 through DIMM3.]
DDR SDRAM Chip

Slightly larger package, same width for Addr and Data; the new pins are 2 DQS, Vref, and now differential clocks. Lower supply voltage.

• 133 MHz (7.5 ns cycle time)
• 256 Mbit
• Multiplexed command/address bus
• 66-pin TSOP
• Programmable burst lengths: 2, 4, or 8*
• Quad banks internally
• Supply voltage of 2.5 V*
• DQS pre-amble
DDR SDRAM Protocol (r/r)

Here we see that two consecutive column read commands to different chips on the DDR memory channel cannot be placed back to back on the data bus, due to the DQS signal hand-off issue. They may be pipelined with one idle cycle in between bursts.

[Figure: memory controller with SDRAM chips #0 and #1; reads r0 and r1 to different chips produce d0 and d1 bursts separated by one idle cycle, CASL = 2.]
RDRAM System

Very different from SDRAM: everything is sent around in 8-half-cycle packets. Most systems now run at 400 MHz, but since everything is DDR, it's called "800 MHz". The only difference is that packets can only be initiated at the rising edge of the clock; other than that, there's no difference between 400 DDR and 800.

Very clean topology, very clever clocking scheme: no clock hand-off issue, high efficiency. The write delay improves matching with the read latency (not perfectly, as shown). Since the data bus is 16 bits wide, each read command gets 16*8 = 128 bits back, so each cacheline fetch takes multiple packets. Up to 32 devices.

[Timing diagram: two write commands followed by a read command — Column Cmd w0 w1 r2, Data Bus d0 d1 d2; tCWD (write delay), tCAC (CAS access delay), and the mismatch tCAC - tCWD are marked.]

Packet protocol: everything in 8 (half-cycle) packets.
Direct RDRAM Chip

RDRAM packets do not re-order the data inside the packet. To compute RDRAM latency, we must add in the command packet transmission time as well as the data packet transmission time. RDRAM relies on its multitude of banks to try to ensure that a high percentage of requests hit open pages, and only incur the cost of a CAS instead of a RAS + CAS.

• 400 MHz (2.5 ns cycle time)
• 256 Mbit
• Separate row/column command busses
• 86-pin FBGA
• Burst length = 8*
• 4/16/32 banks internally*
• Supply voltage of 2.5 V*
• Low latency: CAS = 4 to 6 full cycles*
• RSL signaling (Vref +/- 0.2 V; 800 mV rail to rail)

Pins: 49 Pwr/Gnd*, 16 Data, 8 Addr/Cmd, 6 CTL*, 4 Clk*, 2 NC, 1 Vref*
RDRAM Drawbacks

RDRAM provides high bandwidth, but what are the costs? RAMBUS pushed in many different areas simultaneously. The drawback was that, with a whole new set of infrastructure, the costs for first-generation products were exorbitant.

• High-frequency I/O: test and package cost
• RSL: separate power plane
• ~30% die cost for logic at the 64 Mbit node
• Control logic: active decode logic + open row buffers (high power for the "quiet" state)
• A single chip provides all data bits for each packet (power)
System Comparison

Low pin count, higher latency: in general terms, the system comparison simply points out the areas where RDRAM excels, i.e. high bandwidth and low pin count. But it also has longer latency, since it takes 10 ns just to move the command from the controller onto the DRAM chip, and another 10 ns to get the data from the DRAM chips back onto the controller interface.

                                              SDRAM    DDR      RDRAM
Frequency (MHz)                               133      133*2    400*2
Pin count (data bus)                          64       64       16
Pin count (controller)                        102      101      33
Theoretical bandwidth (MB/s)                  1064     2128     1600
Theoretical efficiency (data bits/cycle/pin)  0.63     0.63     0.48
Sustained BW (MB/s)*                          655      986      1072
Sustained efficiency* (data bits/cycle/pin)   0.39     0.29     0.32
RAS + CAS (tRAC) (ns)                         45 ~ 50  45 ~ 50  57 ~ 67
CAS latency (ns)**                            22 ~ 30  22 ~ 30  40 ~ 50

*StreamAdd.  **Load-to-use latency. (133 MHz P6 chipset + SDRAM CAS latency ~ 80 ns.)
Differences of Philosophy

RDRAM moves complexity from the interface into the DRAM chips. Is this a good trade-off? What does the future look like?

[Figure: SDRAM variants vs. RDRAM variants — controller, interconnect, and DRAM chips, with the labels "complex interconnect", "inexpensive interface", and "simple logic".]
Technology Roadmap (ITRS)
Choices for Future

With the controller on die, the frequency will be much higher, and the command-data path will only cross chip boundaries twice instead of 4 times. But interfacing with memory chips directly means that you are limited by the lowest common denominator. To get the highest bandwidth (for a given number of pins) AND the lowest latency, we'll need custom RAM — might as well be SRAM — but it will be prohibitively expensive.

[Figure: a CPU directly connected to surrounding DRAMs, under three options —
• Direct connect, custom DRAM: highest bandwidth + low latency
• Semi-commodity DRAM: high bandwidth + low/moderate latency
• Direct connect, commodity DRAM: low bandwidth + low latency; inexpensive DRAM, highest latency]
EV7 + RDRAM (Compaq/HP)

Two RDRAM controllers means 2 independent channels. Only 1 packet has to be generated for each 64-byte cache line transaction request.

• RDRAM memory (2 controllers)
• Direct connection to processor
• 75 ns load-to-use latency
What if EV7 Used DDR?

The EV7 cacheline is 64 bytes, so each 4-channel ganged RDRAM can fetch 64 bytes with 1 single packet. Each DDR SDRAM channel can fetch 64 bytes by itself, so we need 6 controllers; if we gang two DDR SDRAMs together into one channel, we have to reduce the burst length from 8 to 4. Shorter bursts are less efficient, so sustainable bandwidth drops.

• Peak bandwidth 12.8 GB/s == 6 channels of 133*2 MHz DDR SDRAM
• 6 controllers of 64-bit-wide channels, or
• 3 controllers of 128-bit-wide channels

System            EV7 + RDRAM        EV7 + 6-ctrl DDR SDRAM   EV7 + 3-ctrl DDR SDRAM
Latency           75 ns              ~50 ns*                  ~50 ns*
Pin count         ~265** + Pwr/Gnd   ~600** + Pwr/Gnd         ~600** + Pwr/Gnd
Controller count  2                  6***                     3***
Open pages        2048               144                      72
What's Next?

DDR SDRAM was an advancement over SDRAM, with lowered Vdd, a new electrical signal interface (SSTL), and a new protocol, but fundamentally the same tRC ~= 60 ns. RDRAM has a tRC of ~70 ns. All are comparable in row recovery time. So what's next? What's on the horizon? What are these, and what do they bring to the table?

• DDR II
• FCRAM
• RLDRAM
• RDRAM (Yellowstone etc.)
• Kentron QBM
DDR II — DDR Next Gen

DDR II is a follow-on to DDR; the DDR II command set is a superset of the DDR SDRAM commands. Lower I/O voltage means lower power for I/O and possibly faster signal switching due to the lower rail-to-rail voltage. The DRAM core now operates at 1:4 of the data bus frequency; a valid command may be latched on any given rising clock edge, but may be delayed a cycle, since the command bus now runs at 1:2 frequency to the core. In a memory system it can run at 400 Mbps per pin, while it can be cranked up to 800 Mbps per pin in an embedded system without connectors. DDR II eliminates the transfer-until-interrupted commands and limits the burst length to 4 only (simple to test).

• Lower I/O voltage (1.8 V)
• DRAM core operates at 1:4 the data bus frequency (SDRAM 1:1, DDR 1:2)
• Backward compatible to DDR at 400 Mbps (common multidrop modules possible)
• 800 Mbps point-to-point
• No more page-transfer-until-interrupted commands
• FPBGA package (removes a speedpath)
• Burst length == 4 only!
• 4 banks internally (same as SDRAM and DDR)
• Write latency = CAS - 1 (increased bus utilization)
DDR II — Continued: Posted Commands

Instead of a controller that keeps track of cycles, we can now have a "dumber" controller. Control is now simple, kind of like SRAM: part I of the address one cycle, part II the next cycle.

SDRAM & DDR: Active (RAS), then Read (CAS) after tRCD, then data. SDRAM & DDR rely on the memory controller to know tRCD and issue the CAS after tRCD for lowest latency.

DDR II, posted CAS: the controller can issue the CAS right behind the RAS; an internal counter delays the CAS command, and the DRAM chip issues the "real" command after tRCD for lowest latency.
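A sketch of the two control styles, with an assumed illustrative tRCD of 3 command-bus cycles:

```python
# Contrast of the two control styles above (cycle counts are illustrative).
TRCD = 3

# SDRAM/DDR style: the controller itself counts cycles and issues CAS late.
def controller_managed(issue_cycle: int):
    ras_at = issue_cycle
    cas_at = ras_at + TRCD            # controller must know tRCD
    return ras_at, cas_at

# DDR II posted-CAS style: the controller issues CAS right behind RAS; the
# DRAM's internal counter holds it until tRCD has elapsed.
def posted_cas(issue_cycle: int):
    ras_at = issue_cycle
    cas_issued_at = ras_at + 1        # back to back on the command bus
    cas_applied_at = ras_at + TRCD    # DRAM releases the "real" CAS internally
    return ras_at, cas_issued_at, cas_applied_at

print(controller_managed(0))  # (0, 3)
print(posted_cas(0))          # (0, 1, 3) -- same effective timing, dumber controller
```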
FCRAM — Fast Cycle RAM (aka Network-DRAM)

FCRAM is a trademark of Fujitsu; Toshiba manufactures under this trademark, and Samsung sells Network DRAM. Same thing. Extra die area is devoted to circuits that lower the row cycle to half that of DDR, and the random access (tRAC) latency down to 22 to 26 ns. Writes are delay-matched with CASL, for better bus utilization.

Features              DDR SDRAM       FCRAM/Network-DRAM
Vdd, Vddq             2.5 +/- 0.2 V   2.5 +/- 0.15 V
Electrical interface  SSTL-2          SSTL-2
Clock frequency       100~167 MHz     154~200 MHz
tRAC                  ~40 ns          22~26 ns
# Banks               4               4
FCRAM Continued

With a faster DRAM turnaround time on tRC, an access stream that hits the same page over and over again will have higher bus utilization, even with random R/W accesses. This is also why peak BW != sustained BW: deviations from peak bandwidth can be due to architecture-related issues such as tRC (you cannot cycle the DRAM array fast enough to grab data out of the same array and re-use the sense amps).
RLDRAM

Another variant, but RLDRAM is targeted toward embedded systems. There are no connector specifications, so it can target a higher frequency off the bat.

DRAM Type     Frequency (MHz)  Bus Width (per chip)  Peak Bandwidth (per chip)  Random Access Time (tRAC)  Row Cycle Time (tRC)
PC133 SDRAM   133              16                    266 MB/s                   45 ns                      60 ns
DDR 266       133 * 2          16                    532 MB/s                   45 ns                      60 ns
PC800 RDRAM   400 * 2          16                    1.6 GB/s                   60 ns                      70 ns
RAMBUS Yellowstone

Unlike other DRAMs, Yellowstone is only a voltage and I/O specification — no DRAM, AFAIK. RAMBUS has learned their lesson: they used expensive packaging, 8-layer motherboards, and added cost everywhere. Now the new pitch is "higher performance with the same infrastructure".

• Bi-directional differential signals
• Ultra-low 200 mV p-p signal swings
• 8 data bits transferred per clock
• 400 MHz system clock
• 3.2 GHz effective data frequency
• Cheap 4-layer PCB
• Commodity packaging

[Figure: system clock and data waveforms swinging between 1.0 V and 1.2 V.]
Kentron QBM

QBM uses FET switches to control which DIMM sends output: two DDR memory chips are interleaved to get "quad" memory. Advantages: it uses standard DDR chips, so the extra cost is low — only the wrapper electronics. A modification to the memory controller is required, but it is minimal; the controller has to understand that data is being burst back at 4X the clock frequency. It does not improve efficiency, but it is cheap bandwidth. It supports more loads than "ordinary DDR", so more capacity.

[Figure: controller and clock with FET switches selecting between interleaved DDR chips; DDR A bursts d1 d1 d1 d1, DDR B bursts d0 d0 d0 d0, and the output interleaves d0 d1 d0 d1 d0 d1 d0 d1.]
A Different Perspective: Everything is Bandwidth

Instead of thinking about things from a strict latency-vs-bandwidth perspective, it might be more helpful to think in terms of a latency vs. pin-transition-efficiency perspective.

[Figure: each pin group as its own bandwidth resource — clock, row cmd/addr bandwidth, column cmd/addr bandwidth, write data bandwidth, read data bandwidth.]
Research Areas: Topology

A DRAM system is basically a networking system with a smart master controller and a large number of "dumb" slave devices. If we are concerned about "efficiency" at a bits/pin/sec level, it might behoove us to draw inspiration from network interfaces and design something like this: unidirectional command and write packets from the controller to the DRAM chips, and a unidirectional bus from the DRAM chips back to the controller. Then it looks like a network system with a slotted-ring interface, with no need to deal with bus turn-around issues.

Unidirectional topology:
• Write packets sent on command bus
• Pins used for command/address/data
• Further increase of logic on DRAM chips
Memory Commands?

Certain things simply do not make sense to do — such as the various STREAM components: move multi-megabyte arrays from DRAM to CPU just to perform a simple "add" function, then move those multi-megabyte arrays right back. In such extremely bandwidth-constrained applications, it would be beneficial to have some logic or hardware on the DRAM chips that can perform simple computation. This is tricky, since we do not want to add so much logic as to make the DRAM chips prohibitively expensive to manufacture (logic overhead decreases with each generation, so adding logic is not an impossible dream). Also, we do not want to add logic into the critical path of a DRAM access; that would slow down a general access in terms of the "real latency" in ns.

• Instead of A[ ] = 0, do "write 0" (Act, Write 0)
• Why do A[ ] = B[ ] in the CPU? Move data inside of a DRAM or between DRAMs.
• Why do STREAMadd (A[ ] = B[ ] + C[ ]) in the CPU?
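A sketch of what such a controller-visible command set might look like — purely hypothetical; the slide proposes the idea, not any concrete interface:

```python
# Hypothetical memory-side command set for the ideas above; nothing here is
# from a real DRAM spec. Each "command" operates on rows inside the DRAM,
# so the large arrays never cross the pins.
class ComputeDRAM:
    def __init__(self, rows: int, row_bytes: int):
        self.rows = [bytearray(row_bytes) for _ in range(rows)]

    def write_zero(self, row: int):              # A[] = 0, without data traffic
        self.rows[row][:] = bytes(len(self.rows[row]))

    def copy_row(self, dst: int, src: int):      # A[] = B[], inside the DRAM
        self.rows[dst][:] = self.rows[src]

    def stream_add(self, dst: int, a: int, b: int):  # A[] = B[] + C[]
        self.rows[dst][:] = bytes((x + y) & 0xFF
                                  for x, y in zip(self.rows[a], self.rows[b]))

dram = ComputeDRAM(rows=4, row_bytes=8)
dram.rows[1][:] = bytes(range(8))
dram.copy_row(2, 1)
dram.stream_add(3, 1, 2)
print(dram.rows[3])  # element-wise sums, computed "in memory"
```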
Address Mapping

[Figure: the bits of a physical address carved into Device ID, Row Addr, Col Addr, and Bank ID fields.]

For a given physical address, there are a number of ways to map the bits of the physical address to generate the "memory address" in terms of device ID, row/column address, and bank ID. The mapping policies can impact performance, since badly mapped systems can cause bank conflicts on consecutive accesses.

Now, mapping policies must also take temperature control into account, as consecutive accesses that hit the same DRAM chip can potentially create undesirable hot spots. One reason for the additional cost of RDRAM initially was the use of heat spreaders on the memory modules to prevent hotspots from building up.
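A minimal sketch of one such mapping; the field widths and their ordering are assumptions — the whole point of the slide is that many orderings are possible:

```python
# One possible physical-address-to-memory-address mapping. Field widths and
# ordering are illustrative; real controllers choose these bits to spread
# consecutive accesses across banks and devices.

COL_BITS, BANK_BITS, ROW_BITS, DEV_BITS = 10, 2, 13, 2

def map_address(paddr: int):
    col  = paddr & ((1 << COL_BITS) - 1);  paddr >>= COL_BITS
    bank = paddr & ((1 << BANK_BITS) - 1); paddr >>= BANK_BITS
    row  = paddr & ((1 << ROW_BITS) - 1);  paddr >>= ROW_BITS
    dev  = paddr & ((1 << DEV_BITS) - 1)
    return dev, row, bank, col

# Placing the bank bits just above the column bits makes consecutive
# row-sized blocks land in different banks, reducing back-to-back conflicts.
print(map_address(0x05AE5700))
```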
Example: Bank Conflicts

Each memory system consists of one or more memory chips, and most of the time accesses to these chips can be pipelined. Each chip also has a multitude of banks, and most of the time accesses to these banks can also be pipelined. (The key to efficiency is to pipeline commands.)

[Figure: stacked banks, each with its own memory array, row decoder, bit lines, sense amps, and column decoder — multiple banks to reduce access conflicts.]
Example: Access Reordering

1  Read 05AE5700  Device id 3, Row id 266, Bank id 0
2  Read 023BB880  Device id 3, Row id 1BA, Bank id 0
3  Read 05AE5780  Device id 3, Row id 266, Bank id 0
4  Read 00CBA2C0  Device id 1, Row id 052, Bank id 1

Each load command is translated to a row command and a column command. If two commands are mapped to the same bank, one must be completed before the other can start. Or, if we can re-order the sequence, then the entire sequence can be completed faster. By allowing Read 3 to bypass Read 2, we do not need to generate another row activation command. Read 4 may also bypass Read 2, since it operates on a different device/bank entirely.

[Timing diagrams: strict ordering — Act 1, Prec, Act 2, Prec, Act 3 serialized by tRC; re-ordered — Act 1 and Act 4 issued early, so reads 1, 3, and 4 complete before row 1BA is opened for read 2.]

DRAM can now do auto-precharge, but I put in the precharge explicitly to show that two rows cannot be active in the same bank within the tRC (DRAM architecture) constraint.

Act = Activate page (data moved from DRAM cells to row buffer)
Read = Read data (data moved from row buffer to memory controller)
Prec = Precharge (close page / evict data in row buffer / sense amps)
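A sketch of the bypass rule the example uses — a younger read may jump ahead of an older one if it hits an already-open row or targets a different device/bank. The request tuples are the slide's example; the scheduler itself is an illustrative simplification that ignores many real timing constraints:

```python
# Simplified open-page scheduler for the example above.
requests = [  # (id, device, row, bank) from the slide
    (1, 3, 0x266, 0),
    (2, 3, 0x1BA, 0),
    (3, 3, 0x266, 0),
    (4, 1, 0x052, 1),
]

open_rows = {}        # (device, bank) -> currently open row
schedule = []
pending = list(requests)
while pending:
    for req in pending:
        rid, dev, row, bank = req
        conflict = open_rows.get((dev, bank)) not in (None, row)
        if not conflict:          # open-page hit or untouched bank: issue now
            open_rows[(dev, bank)] = row
            schedule.append(rid)
            pending.remove(req)
            break
    else:                         # everything conflicts: take oldest, re-open row
        rid, dev, row, bank = pending.pop(0)
        open_rows[(dev, bank)] = row
        schedule.append(rid)

print(schedule)  # -> [1, 3, 4, 2]: reads 3 and 4 bypass read 2
```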
Outline

Now: talk about performance issues.

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• DRAM Evolution: Interface Path
• Future Interface Trends & Research Areas
• Performance Modeling: Architectures, Systems, Embedded
Simulator Overview

CPU: SimpleScalar v3.0a
• 8-way out-of-order
DRAM Configurations

[Figure: FPM, EDO, SDRAM, ESDRAM, DDR — CPU and caches connect to the memory controller over a 128-bit 100 MHz bus; the controller drives a DIMM of x16 DRAM parts. Rambus, Direct Rambus, SLDRAM — the same CPU/controller drives a single narrow channel of DRAM devices.]
DRAM Configurations

[Figure: Rambus & SLDRAM dual-channel — two fast, narrow channels off the controller (128-bit 100 MHz CPU bus); and multiple parallel channels, one DRAM per channel.]
First … Refresh Matters

[Bar chart: average access time for compress across DRAM configurations (FPM1–3, EDO1–2, SDRAM1, ESDRAM, SLDRAM, RDRAM, DRDRAM), broken into bus wait time, refresh time, data transfer time, data transfer time overlap, and column access time.]
Overhead: Memory vs. CPU

[Bar chart: total execution time in CPI with SDRAM for Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, and Vortex.]
Definitions (var. on Burger, et al.)

• tPROC — processor with perfect memory
• tREAL — realistic configuration
• tBW — CPU with wide memory paths
• tDRAM — time seen by the DRAM system

Stalls due to LATENCY = tBW - tPROC
CPU–memory OVERLAP = tPROC - (tREAL - tDRAM)
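A worked example with made-up CPI numbers, just to exercise the definitions:

```python
# Made-up CPI values, only to exercise the definitions above.
t_proc = 1.0   # CPI with perfect memory
t_bw   = 1.6   # CPI with infinitely wide memory paths (latency still present)
t_real = 2.4   # CPI of the realistic configuration
t_dram = 1.8   # portion of execution time the DRAM system is busy

latency_stalls = t_bw - t_proc           # 0.6 CPI stalled purely on latency
overlap = t_proc - (t_real - t_dram)     # 0.4 CPI of memory time hidden under execution
print(round(latency_stalls, 2), round(overlap, 2))
```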
Memory & CPU — PERL

Bandwidth-Enhancing Techniques I:

[Bar chart: execution time in CPI for PERL across the newer DRAMs, broken into stalls due to memory bandwidth, stalls due to memory latency, overlap between execution & memory, and processor execution; reference lines mark yesterday's, today's, and tomorrow's CPUs.]
Memory & CPU — PERL

Bandwidth-Enhancing Techniques II:

[Bar chart: execution time in CPI for PERL, same breakdown as above.]
Average Latency of DRAMs

[Bar chart: average time per access (ns, 0–500) for FPM, EDO, SLDRAM, RDRAM, SDRAM, DRDRAM, ESDRAM, and DDR, broken into bus wait time, refresh time, data transfer time, data transfer time overlap, column access time, row access time, and bus transmission time.]
DDR2 Study Results

[Bar chart: CPI (0–1.5) per benchmark — cc1, compress, go, ijpeg, li, linear_walk, mpeg2dec, mpeg2enc, pegwit, perl, random_walk, stream, stream_no_un….]
Row-Buffer Hit Rates

[Charts: hit rate in row buffers (%, no L2 cache) for FPMDRAM, EDODRAM, SDRAM, ESDRAM, DDRSDRAM, and SLDRAM.]
Row-Buffer Hit Rates

[Charts: hits vs. depth in a victim-row FIFO buffer (depth 0–10) for Go, Li, and Vortex; and access inter-arrival time distributions (CPU clocks, log scale) for Compress and Vortex.]
Row Buffers as L2 Cache

[Bar chart: CPI for Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, and Vortex, broken into stalls due to memory bandwidth, stalls due to memory latency, overlap between execution & memory, and processor execution.]
Row Buffer Management

[Figure: ROW ACCESS (RAS) and COLUMN ACCESS (CAS) stages — data in/out buffers, column decoder, sense amps, bit lines, row decoder.]

Each memory transaction breaks down into a two-part access: a row access and a column access. In essence, the row buffer / sense amp is acting as a cache: a page is brought in from the memory array and stored in the buffer, and the second step moves that data from the row buffers back to the memory controller. From a certain perspective, it makes sense to speculatively ...
Cost-Performance

FPM, EDO, SDRAM, ESDRAM:
• Lower latency => wide/fast bus
Conclusions

The 100 MHz / 128-bit bus is the current bottleneck.
• Solution: fast bus/es & memory controller on the CPU (e.g. Alpha 21364, Emotion Engine, …)
Outline

Now let's talk about DRAM performance at the SYSTEM level. Previous studies show the MEMORY BUS is a significant bottleneck in today's high-performance systems.

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• DRAM Evolution: Interface Path
Motivation

Even when we restrict our focus …
Motivation

... the design space is highly non-linear …

Even in this restricted design space, we find EXTREMELY COMPLEX results.

[Chart: GCC execution time vs. system bandwidth (GB/s = channels * width * 800 MHz) for 32-, 64-, and 128-byte bursts.]
Motivation

... and the cost of poor judgment is high.

So we have the worst possible scenario: a design space that is very sensitive to changes in parameters, and execution times that can vary by a FACTOR OF THREE from worst case to best. Clearly, we would be well served to understand this design space.

[Bar chart: cycles per instruction (0–10) for the worst, average, and best organization on bzip, gcc, mcf, parser, perl, vpr, and their average (SPEC 2000).]
System-Level Model

[Figure: SDRAM timing.]
System-Level Model

Timing diagrams are at the DRAM level … not the system level.

This gives the picture of what is happening at the SYSTEM LEVEL: the CPU-to-memory-controller transfer is shown as "ABUS Active", the memory-controller-to-DRAM activity is shown as "DRAM BANK Active", and the data read-out is shown as "DBUS Active".

[Timing diagram: clock; ACT and READ with Row Addr / Col Addr; DQ carries four beats of valid data; row access, column access, transfer overlap, and data transfer are marked.]
System-Level Model

Timing diagrams are at the DRAM level … not the system level.

If the DRAM's pins do not connect directly to the CPU (e.g. in a hierarchical bus organization, or if the data is funneled through the memory controller like a northbridge chipset), then there is yet another "DBUS2 Active" timing slot that follows below and to the right ... this can continue to extend to any number of hierarchical levels, as seen in huge server systems with hundreds of GB of DRAM.

[Timing diagram: same as above, with an additional DBUS2 Active slot.]
Request Timing
Read/Write Request Shapes

Such a model gives us these types of request shapes for reads and writes. This shows a few example bus/burst configurations — in particular, a 4-byte bus with burst sizes of 32, 64, and 128 bytes per burst.

READ REQUESTS (from time t0):
• ADDRESS BUS 10 ns; DRAM BANK 90 ns; DATA BUS 70 ns + 10 ns of transfer
• ADDRESS BUS 10 ns; DRAM BANK 90 ns; DATA BUS 70 ns + 20 ns of transfer
• ADDRESS BUS 10 ns; DRAM BANK 100 ns; DATA BUS 70 ns + 40 ns of transfer

WRITE REQUESTS (from time t0):
• ADDRESS BUS 10 ns; DRAM BANK 90 ns; DATA BUS 40 ns + 10 ns of transfer
• ADDRESS BUS 10 ns; DRAM BANK 90 ns; DATA BUS 40 ns + 20 ns of transfer
• ADDRESS BUS 10 ns; DRAM BANK 90 ns; DATA BUS 40 ns + 40 ns of transfer
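A sketch of the shape arithmetic, assuming the 4-byte bus from the note at 800 MHz (1.25 ns per beat — inferred from the 32-byte/10 ns case, and consistent with the study's "width * 800 MHz" bandwidth formula):

```python
# Shape arithmetic for the read-request shapes above: a 4-byte-wide channel
# at 800 MHz. The 10/20/40 ns transfer tails correspond to 32-, 64-, and
# 128-byte bursts. (The slide's 128-byte case also stretches the bank slot
# to 100 ns; the 90 ns default here covers the other cases.)

BUS_BYTES = 4
BEAT_NS = 1.25          # 800 MHz data bus

def read_shape(burst_bytes: int,
               abus_ns: float = 10.0,      # address bus occupancy
               bank_ns: float = 90.0,      # DRAM bank occupancy
               first_word_ns: float = 70.0):
    transfer_ns = (burst_bytes / BUS_BYTES) * BEAT_NS
    dbus_ns = first_word_ns + transfer_ns   # data bus slot: latency + burst
    return {"ABUS": abus_ns, "BANK": bank_ns, "DBUS": dbus_ns,
            "transfer": transfer_ns}

for burst in (32, 64, 128):
    print(burst, "bytes:", read_shape(burst))
```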
Pipelined/Split Transactions

The bus is PIPELINED and supports SPLIT TRANSACTIONS, where one request can be NESTLED inside another if the timing is right. What we're trying to do is fit these 2D puzzle pieces together in TIME.

[Timing diagrams: (a) two reads overlapped — legal if R/R to different banks; (b) writes nestled inside reads — legal if R/W to different banks, one case legal if turnaround <= 10 ns, another legal only with no turnaround.]
Channels & Banks

As for physical connections, here are the ways we modeled independent DRAM channels and independent BANKS per CHANNEL. The figure shows a few of the parameters that we study; in addition, we look at turnaround time (0 or 1 cycle).

[Figure: one independent channel vs. two independent channels, each with banking degrees of 1, 2, 4, ….]
Burst Scheduling

[Timing diagram: back-to-back read requests with 128-byte bursts.]
New Bar-Chart Definition

We run a series of different simulations to get breakdowns: CPU activity; memory activity overlapped with CPU; non-overlapped, due to SYSTEM; and non-overlapped, due to DRAM. The top two are MEMORY STALL CYCLES.

Stalls due to DRAM latency = tREAL - tBW
Stalls due to SYSTEM (queue, bus, ...) = tSYS - tPROC
CPU–memory OVERLAP = tPROC - (tREAL - tDRAM)
System Overhead

[Bar chart: CPI for 1, 2, 4, and 8 banks per channel, comparing the regular bus organization against a 0-cycle bus turnaround.]
System Overhead

Turnaround is relatively insignificant (however, remember that this is an 800 MHz bus system ...).

[Bar chart: same comparison as above.]
Concurrency Effects

[Bar chart: BZIP (SPEC 2000), 32-byte burst, 16-bit bus — CPI vs. system bandwidth (1.6, 3.2, 6.4 GB/s = channels * width * speed) for 1, 2, 4, and 8 banks per channel.]
Bandwidth vs. Burst Width

MEMORY OVERHEAD is substantial.

[Chart: CPI vs. system bandwidth (GB/s = channels * width * 800 MHz).]
Bandwidth vs. Burst Width

[Chart: CPI vs. system bandwidth (0.8, 1.6, 3.2, 6.4, 12.8, 25.6 GB/s = channels * width * 800 MHz) for configurations from 1 channel x 1 byte up to 4 channels x 8 bytes.]
Bandwidth vs. Burst Width

So, if CONCURRENCY were all-important, we would expect small bursts to be best, because they would allow a LOWER AVERAGE TIME-TO-CRITICAL-WORD for a larger number of simultaneous requests. What we actually see is that the optimal burst width scales with the bus width, suggesting an optimal number of DATA TRANSFERS per BANK ACTIVATION/PRECHARGE cycle. I'll illustrate that ...

[Chart: same CPI vs. system bandwidth data as above.]
Burst Width Scales with Bus

This figure shows the entire range of burst widths that we modeled. Note that some of the shapes represent several different combinations ... for example, the one a third down from the top is a 2-byte channel + 32-byte burst, a 4-byte channel + 64-byte burst, or an 8-byte channel + 128-byte burst.

Range of burst widths modeled:
• ADDRESS BUS 10 ns; DRAM BANK 90 ns; DATA BUS 70 + 5 ns — 64-bit channel x 32-byte burst
• ADDRESS BUS 10 ns; DRAM BANK 90 ns; DATA BUS 70 + 10 ns — 64-bit x 64-byte; 32-bit x 32-byte
• ADDRESS BUS 10 ns; DRAM BANK 90 ns; DATA BUS 70 + 20 ns — 64-bit x 128-byte; 32-bit x 64-byte; 16-bit x 32-byte
Burst Width Scales with Bus

The optimal configurations are in the middle (marked OPTIMAL BURST WIDTHS below), suggesting an optimal number of DATA TRANSFERS per BANK ACTIVATION/PRECHARGE cycle. At the bottom, too many transfers per burst crowds out other requests; at the top, too few transfers per request lets the bank overhead (the activation/precharge cycle) dominate. However, though this tells us how best to organize a channel with a given bandwidth, these rules of thumb do not say anything about how the different configurations (wide/narrow/medium channels) compare to each other ...

• DATA BUS 70 + 5 ns — 64-bit channel x 32-byte burst
• DATA BUS 70 + 10 ns — 64-bit x 64-byte; 32-bit x 32-byte  <- OPTIMAL BURST WIDTHS
• DATA BUS 70 + 20 ns — 64-bit x 128-byte; 32-bit x 64-byte; 16-bit x 32-byte
• DATA BUS 70 + 40 ns (DRAM BANK 100 ns) — 32-bit x 128-byte; 16-bit x 64-byte; 8-bit x 32-byte
• DATA BUS 70 + 80 ns (DRAM BANK 140 ns) — 16-bit x 128-byte; 8-bit x 64-byte
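A sketch of the "transfers per bank activation" quantity these rules of thumb point at; the 800 MHz beat rate is the study's, while the pairing below simply reproduces the shapes listed above:

```python
# How many data-bus beats does each configuration spend per bank
# activation/precharge cycle? This is the knob the slide says has an optimum.

BEAT_NS = 1.25  # 800 MHz data bus

def beats_per_activation(channel_bytes: int, burst_bytes: int):
    beats = burst_bytes // channel_bytes
    transfer_ns = beats * BEAT_NS
    return beats, transfer_ns

for ch, burst in [(8, 32), (8, 64), (4, 32), (8, 128), (4, 64), (2, 32)]:
    beats, t = beats_per_activation(ch, burst)
    print(f"{8 * ch}-bit channel x {burst}-byte burst: {beats} beats, {t} ns")
```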
Focus on 3.2 GB/s — MCF

Like before, we see that more banks is better, but not always by much.

[Bar chart: MCF CPI at 3.2 GB/s for 32-, 64-, and 128-byte bursts across 1 chan x 4 bytes, 2 chan x 2 bytes, and 4 chan x 1 byte, each with 1, 2, 4, or 8 banks per channel.]
Focus on 3.2 GB/s — MCF

With large bursts, there is less interleaving of requests, so extra banks matter less. With multiple independent channels, you have a degree of concurrency. However, trying to reduce execution time via multiple channels ... Another way to look at it: 4 x 1-byte channels vs. 2 x 2-byte channels vs. 1 x 4-byte channel.

[Bar chart: same MCF comparison as above.]
Focus on 3.2 GB/s — BZIP

We see the same trends in all the benchmarks surveyed.

[Bar chart: BZIP CPI at 3.2 GB/s system bandwidth (channels x width x speed), same configurations as the MCF chart.]
Focus on 3.2 GB/s — BZIP

For example, in BZIP the best configurations are at smaller burst sizes.
Queue Size & Reordering

[Bar chart: BZIP at 1.6 GB/s (1 channel) — CPI with an infinite queue for 1, 2, 4, and 8 banks per channel, at 32-, 64-, and 128-byte bursts.]
Conclusions

DESIGN SPACE is NON-LINEAR; COST of MISJUDGING is HIGH — we have a complex design space where neighboring designs differ significantly.

CAREFUL TUNING YIELDS 30–40% GAIN — if you are careful, you can beat the performance of the average organization by 30–40%.

MORE CONCURRENCY == BETTER (but not at the expense of LATENCY) — supporting memory concurrency improves system performance, as long as it is not done at the expense of memory latency.
Outline

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• DRAM Evolution: Interface Path
• Future Interface Trends & Research Areas
• Performance Modeling: Architectures, Systems, Embedded
Embedded DRAM Primer

[Figure: CPU core and DRAM array on the same die (embedded) vs. CPU core and DRAM array on separate chips (not embedded).]
Whither Embedded DRAM?

Microprocessor Report, August 1996: "[Five] Architects Look to Processors of Future"

• Two predict imminent merger of CPU and DRAM
• Another states we cannot keep cramming more data over the pins at faster rates (implication: embedded DRAM)
• A fourth wants a gigantic on-chip L3 cache (perhaps a DRAM L3 implementation?)

SO WHAT HAPPENED?
Embedded DRAM for DSPs: MOTIVATION

TAGLESS SRAM: SOFTWARE manages the movement of data. The address space includes both "cache" and primary memory (and memory-mapped I/O). A move from memory space to "cache" space creates a new, equivalent data object, not a mere copy of the original.

TRADITIONAL CACHE (hardware-managed): the cache "covers" the entire address space — any datum in the space may be cached. HARDWARE manages the movement of data. The address space includes only primary memory (and memory-mapped I/O). Copying from memory to cache creates a subordinate copy of the datum that is kept consistent with the datum still in memory; hardware ensures consistency.
DSP Buffer Organization

[Figure: the organization used for the study — a fully associative 4-block cache (buffer-0, buffer-1, victim-0, victim-1) fed by the DSP's two load/store units (LdSt0, LdSt1) through S0/S1.]
E-DRAM Performance

[Chart: embedded networking benchmark Patricia — 200 MHz C6000 DSP with 50, 100, and 200 MHz memory; CPI vs. bandwidth (0.4 to 25.6 GB/s) for cache line sizes of 32, 64, 128, 256, 512, and 1024 bytes.]
E-DRAM Performance

[Charts: the same Patricia CPI curves shown separately for 50 MHz, 100 MHz, and 200 MHz memory; along each curve, moving right corresponds to increasing bus width.]
Performance-Data Sources

"A Performance Study of Contemporary DRAM Architectures," Proc. ISCA '99. V. Cuppu, B. Jacob, B. Davis, and T. Mudge.

"Organizational Design Trade-Offs at the DRAM, Memory Bus, and Memory Controller Level: Initial Results," University of Maryland Technical Report UMD-SCA-TR-1999-2. V. Cuppu and B. Jacob.
Acknowledgments

The preceding work was supported in part by the following sources:
• NSF CAREER Award CCR-9983618
• NSF grant EIA-9806645
• NSF grant EIA-0000439
• DOD award AFOSR-F496200110374
• … and by Compaq and IBM.
CONTACT INFO

Bruce Jacob
David Wang
UNIVERSITY OF MARYLAND