
DRAM TUTORIAL
ISCA 2002

DRAM: Architectures, Interfaces, and Systems
A Tutorial

Bruce Jacob and David Wang
Electrical & Computer Engineering Dept.
University of Maryland at College Park
http://www.ece.umd.edu/~blj/DRAM/

UNIVERSITY OF MARYLAND

DRAM: why bother? (I mean, besides the "memory wall" thing ... is it just a performance issue?) Think about embedded systems: think cellphones, think printers, think switches ... nearly every embedded product that used to be expensive is now cheap. Why? For one thing, rapid turnover from high performance to obsolescence guarantees a generous supply of CHEAP, HIGH-PERFORMANCE embedded processors to suit nearly any design need. What does the "memory wall" mean in this context? Perhaps it will take longer for a high-performance design to become obsolete?
Outline

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• DRAM Evolution: Interface Path
• Future Interface Trends & Research Areas
• Performance Modeling: Architectures, Systems, Embedded

Break at 10 a.m. — Stop us or starve


Basics
DRAM ORGANIZATION

First off -- what is DRAM? An array of storage elements (capacitor-transistor pairs). "DRAM" is an acronym (explain). Why "dynamic"? (the capacitor) -- capacitors are not perfect, so they need recharging. These are very dense parts; the cells are very small and the capacitors hold very little charge. Thus the bit lines are charged up to the 1/2 voltage level, and the sense amps detect the minute change on the lines, then recover the full signal.

[Figure: DRAM organization -- a memory array of storage elements (capacitor plus switching transistor) at the word-line/bit-line crossings; the row decoder drives the word lines, sense amps sit on the bit lines, and the column decoder selects bits into the data in/out buffers.]
Basics
BUS TRANSMISSION

So how do you interact with this thing? Let's look at a traditional organization first: the CPU connects over a bus to a memory controller, which connects to the DRAM itself. Let's look at a read operation.

[Figure: CPU -- bus -- memory controller -- DRAM (column decoder, data in/out buffers, sense amps, bit lines, row decoder, word lines, memory array).]
Basics
[PRECHARGE and] ROW ACCESS

At this point, all bit lines are at the 1/2 voltage level. The read discharges the capacitors onto the bit lines; this pulls the lines just a little bit high or a little bit low, and the sense amps detect the change and recover the full signal.

The read is destructive -- the capacitors have been discharged. However, when the sense amps pull the lines to the full logic level (either high or low), the transistors are kept open and so allow their attached capacitors to become recharged (if they hold a '1' value).

[Figure: CPU -- bus -- memory controller -- DRAM, with the row decoder driving one word line of the memory array onto the sense amps.]

AKA: OPEN a DRAM Page/Row
or ACT (Activate a DRAM Page/Row)
or RAS (Row Address Strobe)
Basics
COLUMN ACCESS

Once the data is valid on ALL of the bit lines, you can select a subset of the bits and send them to the output buffers; CAS picks one of the bits.

Big point: you cannot do another RAS or precharge of the lines until you have finished reading the column data -- you can't change the values on the bit lines or the output of the sense amps until they have been read by the memory controller.

[Figure: CPU -- bus -- memory controller -- DRAM, with the column decoder selecting from the sense amps into the data in/out buffers.]

READ Command
or CAS: Column Address Strobe
Basics
DATA TRANSFER

Then the data is valid on the data bus. Depending on what you are using for in/out buffers, you might be able to overlap a little or a lot of the data transfer with the next CAS to the same page (this is PAGE MODE).

[Figure: CPU -- bus -- memory controller -- DRAM, with data flowing out of the data in/out buffers.]

Data Out ... with optional additional CAS: Column Address Strobe
Note: page mode enables overlap with CAS
Basics
BUS TRANSMISSION

[Figure: CPU -- bus -- memory controller -- DRAM (same organization as before), with the data returning over the bus to the CPU.]
Basics

DRAM "latency" isn't deterministic, because an access may need only a CAS or may need RAS+CAS, and there may be significant queuing delays within the CPU and the memory controller. Each transaction has some overhead, and some types of overhead cannot be pipelined. This means that, in general, longer bursts are more efficient.

[Figure: CPU -- memory controller -- DRAM, with the steps of a transaction labeled A through F.]

A: Transaction request may be delayed in queue
B: Transaction request sent to memory controller
C: Transaction converted to command sequences (may be queued)
D: Command(s) sent to DRAM
E1: Requires only a CAS, or
E2: Requires RAS + CAS, or
E3: Requires PRE + RAS + CAS
F: Transaction sent back to CPU

"DRAM Latency" = A + B + C + D + E + F
Basics
PHYSICAL ORGANIZATION

[Figure: three DRAM parts of different data widths -- x2, x4, and x8 -- each with its own row decoder, memory array, sense amps, column decoder, and data buffers; the only difference is how many data bits each chip drives.]

This is per bank ... typical DRAMs have 2+ banks.
Basics
Read Timing for Conventional DRAM

Let's look at the interface another way -- the way the data sheets portray it. [Explain.] Main point: the RAS\ and CAS\ signals directly control the latches that hold the row and column addresses.

[Timing diagram: RAS, CAS, Address, and DQ. Each access presents a row address and then a column address (row access, column access), and valid data appears on DQ (data transfer) before the next row address can be presented.]
DRAM Evolutionary Tree

Since DRAM's inception there has been a stream of changes to the design, from FPM to EDO to Burst EDO to SDRAM. The changes are largely structural modifications -- minor ones -- that target THROUGHPUT. [Discuss FPM up to SDRAM.]

Everything up to and including SDRAM has been relatively inexpensive, especially when considering the pay-off (FPM was essentially free, EDO cost a latch, PBEDO cost a counter, SDRAM cost a slight re-design). However, we've run out of "free" ideas, and now all changes are considered expensive; thus there is no consensus on new directions, and a myriad of choices has appeared.

[Do the LATENCY mods starting with ESDRAM ... and then the INTERFACE mods.]

[Figure: evolutionary tree. Conventional DRAM leads through (mostly) structural modifications targeting throughput to FPM, EDO, P/BEDO, SDRAM, and ESDRAM; structural modifications targeting latency lead to FCRAM, VCDRAM, and MOSYS; interface modifications targeting throughput lead to Rambus, DDR/2, and future trends.]
DRAM Evolution
Read Timing for Conventional DRAM

[Timing diagram: RAS, CAS, Address, DQ. Legend: row access, column access, transfer overlap, data transfer. Each access requires a full row address / column address pair before valid data appears on DQ.]
DRAM Evolution
Read Timing for Fast Page Mode

FPM allows you to keep the sense amps active for multiple CAS commands -- much better throughput.

Problem: you cannot latch a new value in the column address buffer until the read-out of the data is complete.

[Timing diagram: RAS held low for one row address while CAS strobes three successive column addresses, each producing valid data on DQ (row access, column access, transfer overlap, data transfer).]
DRAM Evolution
Read Timing for Extended Data Out

The solution to that problem: instead of simple tri-state buffers, use a latch as well. By putting a latch after the column mux, the next column address command can begin sooner.

[Timing diagram: as in FPM, but the output latch lets the CAS for the next column address overlap the data transfer of the previous one.]
DRAM Evolution
Read Timing for Burst EDO

By driving the column-address latch from an internal counter rather than an external signal, the minimum cycle time for driving the output bus was reduced by roughly 30%.

[Timing diagram: one row address and one column address are presented; successive CAS toggles burst out consecutive data words on DQ.]
DRAM Evolution
Read Timing for Pipeline Burst EDO

"Pipeline" refers to the setting up of the read pipeline: the first CAS\ toggle latches the column address, and all following CAS\ toggles drive data out onto the bus. Therefore data stops coming when the memory controller stops toggling CAS\.

[Timing diagram: one row address and one column address; each subsequent CAS\ toggle produces another valid data word on DQ.]
DRAM Evolution
Read Timing for Synchronous DRAM

Main benefit: it frees the CPU or memory controller from having to control the DRAM's internal latches directly. The controller/CPU can go off and do other things during the idle cycles instead of "waiting". Even though the time-to-first-word latency actually gets worse, the scheme increases system throughput.

[Timing diagram: Clock, Command, Address, DQ. An ACT command with a row address is followed by a READ command with a column address; a burst of valid data words then appears on DQ. (RAS + CAS + OE ... == Command Bus.)]
DRAM Evolution
Inter-Row Read Timing for ESDRAM

The output latch on EDO allowed you to start CAS sooner for the next access (to the same row). Latching the whole row in ESDRAM allows you to start the precharge and RAS sooner for the next page access -- HIDE THE PRECHARGE OVERHEAD.

[Timing diagram 1: Regular CAS-2 SDRAM, read/read to the same bank -- ACT, READ, PRE, ACT, READ; the second burst must wait for the precharge and activate.]

[Timing diagram 2: ESDRAM, read/read to the same bank -- the same command sequence, but because the row is held in the latch, the precharge and second activate can be issued while the first burst is still draining, so the second burst starts sooner.]
DRAM Evolution
Write-Around in ESDRAM

A neat feature of this type of buffering: write-around.

[Timing diagram 1: Regular CAS-2 SDRAM, read/write/read to the same bank, rows 0/1/0 -- ACT, READ, PRE, ACT, WRITE, PRE, ACT, READ; the final read must re-open row 0.]

[Timing diagram 2: ESDRAM, read/write/read to the same bank, rows 0/1/0 -- ACT, READ, PRE, ACT, WRITE, READ; the write to row 1 goes around the latched row 0, so the final read is serviced without another activate. (Can the second READ be this aggressive?)]
DRAM Evolution
Internal Structure of Virtual Channel

Main thing: it is like having a bunch of open row buffers (a la Rambus), but the problem is that you must deal with the cache directly (move data into and out of it), not the DRAM banks. That adds an extra couple of cycles of latency. However, you get good bandwidth if the data you want is in the cache, and you can "prefetch" into the cache ahead of when you want it. Originally targeted at reducing latency; now that SDRAM is CAS-2 and RCD-2, this makes sense only in a throughput way.

[Figure: two DRAM banks (A and B) feed, via the sense amps and 2Kbit-wide Prefetch/Restore paths, a set of 16 channels (2Kb segments); a select/decode stage connects the segments to the DQs through the input/output buffer. Operations: Activate, Prefetch, Restore, Read, Write.]

The segment cache is software-managed and reduces energy.
DRAM Evolution
Internal Structure of Fast Cycle RAM

FCRAM opts to break up the data array -- it activates only a portion of the word line. 8K rows require 13 bits to select; FCRAM uses 15 (assuming the array is 8K x 1K ... the data sheet does not specify).

[Figure: SDRAM -- the row decoder takes 13 bits into an 8M array (8Kr x 1Kb) over full-width sense amps, tRCD = 15ns (two clocks). FCRAM -- the row decoder takes 15 bits and activates only a subsection of the 8M array over narrower sense amps, tRCD = 5ns (one clock).]

Reduces access time and energy/access.
DRAM Evolution
Internal Structure of MoSys 1T-SRAM

MoSys takes this one step further: DRAM with an SRAM interface and SRAM speed, but DRAM energy. [Physical partitioning: 72 banks.]

Auto refresh -- how do you do this transparently? The refresh logic moves through the arrays, refreshing them when they are not active. But what if one bank gets repeated accesses for a long duration? All the other banks will be refreshed, but that one will not. Solution: they have a bank-sized CACHE of lines; in theory, you should never have a problem (magic).

[Figure: the address feeds a bank select across the many small banks, with auto-refresh logic cycling through the banks and a line cache in front of the DQs.]
DRAM Evolution
Comparison of Low-Latency DRAM Cores

Here's an idea of how the designs compare. Bus speed == CAS-to-CAS; RAS-CAS == time to read data from the capacitors into the sense amps; RAS-DQ == RAS to valid data.

DRAM Type      Data Bus Speed   Bus Width (per chip)   Peak BW (per chip)   RAS-CAS (tRCD)   RAS-DQ (tRAC)
PC133 SDRAM    133              16                     266 MB/s             15 ns            30 ns
VCDRAM         133              16                     266 MB/s             30 ns            45 ns
FCRAM          200 * 2          16                     800 MB/s             5 ns             22 ns
1T-SRAM        200              32                     800 MB/s             —                10 ns
DDR 266        133 * 2          16                     532 MB/s             20 ns            45 ns
DRDRAM         400 * 2          16                     1.6 GB/s             22.5 ns          60 ns
RLDRAM         300 * 2          32                     2.4 GB/s             ???              25 ns
Outline

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• Memory System Details (Lots)
• DRAM Evolution: Interface Path
• Future Interface Trends & Research Areas
• Performance Modeling: Architectures, Systems, Embedded
What Does This All Mean?

Some technologies have legs, some do not, and some have gone belly up. We'll start by examining the fundamental technologies (I/O, packaging, etc.) and then explore some of these technologies in depth a bit later.

[Figure: a scatter of past, present, and proposed DRAM technologies -- FPM, EDO, BEDO, SDRAM, ESDRAM, DDR SDRAM, DDR II, xDDR II, netDRAM, RLDRAM, FCRAM, D-RDRAM, SLDRAM.]
Cost - Benefit Criterion

What is a "good" system? It's all about the cost of a system. This is a multi-dimensional tradeoff problem, and it is especially tough when the relative cost factors of pins, die area, and the demands of bandwidth and latency keep on changing. Good decisions for one generation may not be good for future generations. This is why we don't keep a DRAM protocol for a long time: FPM lasted a while, but we've quickly progressed through EDO, SDRAM, DDR/RDRAM, and now DDR II and whatever else is on the horizon.

[Figure: DRAM system design at the center of competing factors -- package cost, interconnect cost, bandwidth, latency, power consumption, logic overhead, and test and implementation.]
Memory System Design

Now we'll really get our hands dirty and try to become DRAM designers. That is, we want to understand the tradeoffs and design our own memory system with DRAM cells. By doing this, we can gain some insight into the basis of claims made by proponents of the various DRAM memory systems.

A memory system is a system with many parts -- a set of technologies and design decisions. All of the parts are inter-related, but for the sake of discussion we'll split the components into the ovals seen here and try to examine each part of a DRAM system separately.

[Figure: "DRAM Memory System" surrounded by its components -- clock network, I/O technology, topology, chip packaging, DRAM chip architecture, pin count, address mapping, access protocol, and row buffer management.]
DRAM Interfaces
The Digital Fantasy

Professor Jacob has shown you some nice timing diagrams. I, too, will show you some nice timing diagrams, but timing diagrams are a simplification that hides the details of implementation. Why don't they just run the system at XXX MHz like the other guy? Then the latency would be much better, and the bandwidth would be extreme. Perhaps they can't, and we'll explain why. To understand why some systems can operate at XXX MHz while others cannot, we must dig past the nice timing diagrams and the architectural block diagrams and see what turns up underneath. So underneath the timing diagram, we find this....

[Timing diagram: row command a0, column commands r0 and r1, and data bursts d0/d1 on the data bus, annotated with RAS latency, CAS latency, and pipelined access.]

Pretend that the world looks like this. But...
The Real World

[Oscilloscope capture: "Read from FCRAM(TM) @ 400 MHz DDR (non-termination case)" -- VDDQ and VSSQ pads, DQS and DQ0-15 pins on the FCRAM side and on the controller side, showing skews of 158 ps and 102 ps.]

We don't get nice square or even nicely shaped waveforms: jitter, skew, etc. Vddq and Vssq are the voltage supplies to the I/O pads on the DRAM chips. The signal bounces around and is very non-ideal. So what are the problems, and what are the solutions used to solve them? (Note the 158 ps skew on parallel data channels: if your cycle time is 10 ns, or 10,000 ps, a skew of 158 ps is no big deal; but if your cycle time is 1 ns, or 1,000 ps, then a skew of 158 ps is a big deal.) Already we see hints of some problems as we try to push systems to higher and higher clock frequencies.

*Toshiba Presentation, Denali MemCon 2002
Signal Propagation

First, we have to introduce the concept that signal propagation takes finite time. We are limited by the speed of light; on an ideal transmission line we get roughly 2/3 the speed of light, which is about 20 cm/ns. All signals, including system-wide clock signals, have to be sent on a system board, so if you send a clock signal from point A to point B on an ideal signal line, point B won't be able to tell that the clock has changed until, at the earliest, (1/20 ns/cm * distance) after the clock has risen.

Then again, PC boards are not exactly ideal transmission lines (ringing effects, drive strength, etc.). The concept of "synchronous" breaks down when different parts of the system observe different clocks -- kind of like relativity.

[Figure: a signal path from point A to point B.]

Ideal transmission line: ~0.66c = 20 cm/ns
PC board + module connectors + varying electrical loads = rather non-ideal transmission line
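
As a quick sanity check on the 20 cm/ns figure, here is a trivial sketch (the 15 cm trace length is a hypothetical example, not a number from the tutorial) that converts path length into propagation delay and compares it with a bus cycle time:

    # Ideal transmission line: ~2/3 the speed of light, i.e. ~20 cm/ns.
    PROP_CM_PER_NS = 20.0

    def flight_time_ns(path_cm):
        """Earliest time a receiver can see an edge launched path_cm away."""
        return path_cm / PROP_CM_PER_NS

    # Hypothetical example: a 15 cm trace on a 133 MHz (7.5 ns) bus.
    delay = flight_time_ns(15.0)        # 0.75 ns
    print(delay, delay / 7.5)           # ~10% of the bus cycle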


Clocking Issues

When we build a "synchronous system" on a PCB, how do we distribute the clock signal? Do we want a sliding time domain? Is an H-tree do-able to N modules in parallel? Skew compensation?

[Figure 1: "Sliding Time" -- the clock source daisy-chained from module 0 to module N, so each module sees the edge slightly later.]

[Figure 2: "H Tree?" -- the clock source fanned out through a balanced tree to modules 0 through N.]

What kind of clocking system?
Clocking Issues

We would want the chips to be on a "global clock", with everyone perfectly synchronous; but since clock signals are delivered through wires, different chips in the system will see the rising edge of a clock a little bit earlier or later than other chips. While an H-tree may work for a low-frequency system, we really need one clock for sending (writing) signals from the controller to the chips, and another one for sending signals from the chips to the controller (reading).

[Figure 1: Write Data -- the clock and signals travel from the controller (module 0) toward module N, in the same direction as the data.]

[Figure 2: Read Data -- the clock travels with the read data from the far module back toward the controller.]

We need different "clocks" for reads and writes.
Path Length Differential

We purposefully routed path #2 to be a bit longer than path #1 to illustrate the point about signal path length differentials. As illustrated, signals will reach load B later than load A simply because B is farther away from the controller than load A.

It is also difficult to do path-length and impedance matching on a system board. Sometimes heroic efforts must be used to get a nice "parallel" bus.

[Figure: a controller driving bus signals 1 and 2 along paths #1, #2, and #3 through inter-module connectors to loads A and B.]

High-frequency AND wide parallel busses are difficult to implement.
Subdividing Wide Busses

It's hard to bring a wide parallel bus from point A to point B, but it's easier to bring smaller groups of signals from A to B. To ensure proper timing, we also send along a source-synchronous clock signal that is path-length matched with the signal group it covers. In this figure, signal groups 1, 2, and 3 may have some timing skew with respect to each other, but within each group the signals will have minimal skew. (A smaller channel can be clocked higher.)

[Figure: the wide bus from A to B is split into groups 1, 2, and 3 that route around an obstruction, each group carrying its own source-synchronous local clock signal.]

Narrow channels, source-synchronous local clock signals.
Why Subdivision Helps

Analogy: it's a lot harder to schedule 8 people for one meeting, but a lot easier to schedule 2 meetings with 4 people each. The results of the two meetings can be correlated later.

[Figure: one wide channel must budget for the worst-case skew of {Chan 1 + Chan 2}; splitting it into sub-channel 1 and sub-channel 2 means each only budgets for its own worst-case skew.]

Worst-case skew must be considered in system timing.
Timing Variations

A "system" is a hard thing to design -- especially one that allows end users to perform configurations that will impact timing. To guarantee functional correctness of the system, all corner cases of variances in loading and timing must be accounted for.

[Figure: the same controller driving 4 loads versus 1 load; the waveforms "Cmd to 1 Load" and "Cmd to 4 Loads" rise at different rates relative to the clock.]

How many DIMMs in the system? How many devices on each DIMM? Who built the memory module? Infinite variations on timing!
Loading Balance

To ensure that a lightly loaded system and a fully loaded system do not differ significantly in timing, we either send duplicate signal lines to the different memory modules, or we use the same signal line but drive the I/O pads with variable strength, depending on whether the system has 1, 2, 3, or 4 loads.

[Figure: top -- a controller with duplicate signal lines, one per module; bottom -- a controller with a single signal line and variable drive strength.]
Topology

Self-explanatory: topology determines loading and signal propagation lengths.

[Figure: a controller facing a 4x4 grid of DRAM chips, with a question mark over how they are connected.]

DRAM system topology determines electrical loading conditions and signal propagation lengths.
SDRAM Topology Example

Very simple topology. The clock signal that turns around is very nice -- it solves the problem of needing multiple clocks.

[Figure: a single-channel SDRAM controller driving a shared command & address bus and a 64-bit data bus (16 bits per chip) to several ranks of x16 DRAM chips.]

Loading imbalance.
RDRAM Topology Example

All signals in this topology -- address, command, data, and clock -- are sent from point to point on channels that are path-length matched by definition.

[Figure: an RDRAM controller at the head of a narrow channel that snakes past every chip; the clock turns around at the far end, so packets travel down parallel paths and skew is minimal by design.]
I/O Technology

RSL vs. SSTL2, etc. (like ECL vs. TTL of another era). What is "logic low", what is "logic high"? Different electrical signalling protocols differ on voltage swing, high/low levels, etc. Today ∆t is on the order of ns; we want it to be on the order of ps.

[Figure: a signal swinging between logic low and logic high; the swing ∆v and the transition time ∆t define the slew rate.]

Slew Rate = ∆v / ∆t

Smaller ∆v = smaller ∆t at the same slew rate = increased rate of bits/s/pin.
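
As a worked example of that last line (the rail-to-rail swings are the ones quoted later in this tutorial for LVTTL and RSL; the assumption here is simply that the slew rate stays the same):

\[
\Delta t = \frac{\Delta v}{\text{slew rate}}, \qquad
\frac{\Delta t_{\text{RSL}}}{\Delta t_{\text{LVTTL}}}
  = \frac{\Delta v_{\text{RSL}}}{\Delta v_{\text{LVTTL}}}
  \approx \frac{0.8\ \text{V}}{3.3\ \text{V}} \approx 0.24
\]

So shrinking the swing buys a proportionally shorter transition time, and hence more bit times per second on the same pin.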
I/O - Differential Pair

Used in clocking systems, e.g. RDRAM (all clock signals are pairs, clk and clk#). Differential pairs have the highest noise tolerance and do not need as many ground signals, whereas single-ended signals need many ground connections. Also, differential-pair signals may be clocked even higher, so the pin-bandwidth disadvantage is not nearly the 2:1 implied by the diagram.

[Figure: a single-ended transmission line versus a differential-pair transmission line.]

Increase rate of bits/s/pin? Cost per pin? Pin count?
I/O - Multi Level Logic

One of several ways on the table to further increase the bit rate of the interconnects.

[Figure: a waveform quantized against three reference voltages (Vref_0, Vref_1, Vref_2), dividing the voltage range into four regions -- logic 00, 01, 11, and 10 from lowest to highest -- so each symbol carries two bits.]

Increase rate of bits/s/pin.
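
To make the idea concrete, here is a minimal decoder sketch. It follows the Gray-coded level ordering shown in the figure (00, 01, 11, 10 from lowest to highest voltage); the reference voltages themselves are made-up placeholders, not values from any real multi-level signalling spec.

    # Hypothetical reference voltages (volts), lowest to highest.
    VREF_0, VREF_1, VREF_2 = 0.6, 1.2, 1.8

    def decode_symbol(v):
        """Map one sampled voltage to the two bits it encodes."""
        if v < VREF_0:
            return "00"
        elif v < VREF_1:
            return "01"
        elif v < VREF_2:
            return "11"
        else:
            return "10"

    # Two bits per sampled symbol instead of one -> double the bits/s/pin
    # at the same symbol rate.
    print([decode_symbol(v) for v in (0.3, 0.9, 1.5, 2.1)])  # 00 01 11 10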


Packaging

Different packaging types impact cost and speed. Slow parts can use the cheapest packaging available; faster parts may have to use more expensive packaging. This has long been accepted in the higher-margin processor world, but for DRAM, each cent has to be hard fought for. To some extent, the demand for higher performance is pushing memory makers to use more expensive packaging to accommodate higher-frequency parts. When RAMBUS first spec'ed FBGA, module makers complained, since they had to purchase expensive equipment to validate that chips were properly soldered to the module board, whereas something like TSOP can be checked with visual inspection.

[Figure: package types -- DIP ("good old days"), SOJ (Small Outline J-lead), TSOP (Thin Small Outline Package), LQFP (Low Profile Quad Flat Package), FBGA (Fine Ball Grid Array).]

Memory roadmap for Hynix NetDDR II -- target specification:
Speed: 800 Mbps / 550 Mbps
Vdd/Vddq: 2.5V/2.5V (1.8V)
Interface: SSTL_2
Row Cycle Time (tRC): 35ns
Access Protocol

If I have a 16-bit-wide command to send from A to B, I need 16 pins; if I have fewer than 16, I need multiple cycles. How many bits do I need to send from point A to point B? How many pins do I get? Cycles = Bits / Pins.

[Timing diagram 1: single-cycle command -- the command r0 occupies one cycle on a wide command bus, followed by the data burst d0.]

[Timing diagram 2: multiple-cycle command -- the same command r0 is transmitted over several cycles on a narrower bus before the data burst d0.]
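
A one-line way to state the tradeoff (the pin counts below are just illustrative):

    import math

    def command_cycles(command_bits, pins):
        """Cycles needed to move one command across the available pins."""
        return math.ceil(command_bits / pins)

    print(command_cycles(16, 16))  # 1 cycle  (single-cycle command)
    print(command_cycles(16, 4))   # 4 cycles (multiple-cycle command)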


Access Protocol (r/r)

There is inherent latency between the issuance of a read command and the response of the chip with data. To increase efficiency, a pipelined structure is necessary to obtain full utilization of the command, address, and data busses. Unlike an "ordinary" pipeline on a processor, a memory pipeline has data flowing in both directions.

Architecture-wise, we should be concerned with full utilization everywhere, so we can use the least number of pins for the greatest benefit; but in actual use, we are usually concerned with full utilization of the data bus.

[Figure: a controller and four DRAM chips on one bus. Timing diagram: row command a0, column reads r0 and r1, and data bursts d0/d1, annotated with RAS latency, CAS latency, and pipelined access -- consecutive cache-line read requests to the same DRAM row.]

Commands: a = Activate (open page), r = Read (column read), d = Data (data chunk).
Access Protocol (r/w)

One datapath, two commands: the DRAM chip determines the latency of data after a read command is received, but the controller determines the timing relationship between the write command and the data being written to the DRAM chips. (If the DRAM device cannot handle pipelined R/W, then...)

Case 1: the controller sends write data at the same time as the write command, and the following read goes to a different device -- the accesses pipeline.

Case 2: the controller sends write data at the same time as the write command, and the following read goes to the same device -- the accesses do not pipeline.

Solution: delay the data of the write command to match the read latency.

[Timing diagrams: column commands w0 then r1 with data bursts d0/d1 for Case 1 (read following a write to different DRAM devices), Case 2 (read following a write to the same DRAM device), and the write-delay solution.]
Access Protocol (pipelines)

To increase "efficiency", CAS pipelining is required. How many commands must one device support concurrently? 2? 3? 4? (Depends on what?)

Imagine we must increase the data rate (higher pin frequency) but allow the DRAM core to operate slightly slower (2X pin frequency, same core latency). This issue ties the access protocol to internal DRAM architecture issues.

[Timing diagram 1: three back-to-back pipelined read commands r0, r1, r2, each data burst following after the CAS latency.]

[Timing diagram 2: "same" latency at 2X pin frequency -- the CAS latency now spans more bus cycles, so a deeper pipeline of commands is needed to keep the data bus full.]

When pin frequency increases, chips must either reduce "real latency", or support longer bursts, or pipeline more commands.
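
A small back-of-the-envelope sketch of that last point (the numbers are illustrative, not from a datasheet, and the model ignores everything except CAS latency and burst length): if the core latency in nanoseconds stays fixed while the pin clock doubles, the latency measured in bus cycles doubles, so the controller must keep more commands in flight (or burst longer) to keep the data bus busy.

    def commands_in_flight(cas_latency_ns, bus_freq_mhz, burst_cycles):
        """Outstanding commands needed to cover the CAS latency."""
        cycle_ns = 1000.0 / bus_freq_mhz
        latency_cycles = cas_latency_ns / cycle_ns
        return latency_cycles / burst_cycles

    # Hypothetical: 20 ns CAS latency, 4-cycle bursts.
    print(commands_in_flight(20, 133, 4))   # ~0.7 at 133 MHz
    print(commands_in_flight(20, 266, 4))   # ~1.3 at 266 MHz -> deeper pipeline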
Outline

The 440BX used 132 pins to control a single SDRAM channel, not counting power & ground; the 845 chipset uses only 102. There are also slower versions (66/100), and a page-burst mode (an entire page). Burst length is programmed to match the cache-line size (e.g., a 32-byte line = 256 bits = 4 cycles of 64 bits). Latency as seen by the controller is really CAS + 1 cycles.

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• DRAM Evolution: Interface Path
• SDRAM, DDR SDRAM, RDRAM Memory System Comparisons
• Processor-Memory System Trends
• RLDRAM, FCRAM, DDR II Memory Systems Summary
• Future Interface Trends & Research Areas
• Performance Modeling: Architectures, Systems, Embedded
SDRAM System In Detail

[Figure: a single-channel SDRAM controller connected to DIMM1-DIMM4 -- a shared address & command bus and data bus in a "mesh topology", with per-DIMM chip (DIMM) select lines.]
SDRAM Chip

SDRAM: inexpensive packaging, lowest cost (LVTTL signaling), standard 3.3V supply voltage. The DRAM core and I/O share the same power supply. The "power cost" is between 560 mW and 1 W per chip for the duration of the cache-line burst. (Note: it costs power just to keep lines stored in the sense amps/row buffers -- something for row-buffer management policies to think about.)

About 2/7 of the pins are used for address, 2/7 for data, 2/7 for power/ground, and 1/7 for command signals. Row commands and column commands are sent on the same bus, so they have to be demultiplexed inside the DRAM and decoded.

256 Mbit, 133 MHz (7.5 ns cycle time), 54-pin TSOP
• Multiplexed command/address bus
• Programmable burst length: 1, 2, 4, or 8
• Quad banks internally
• Supply voltage of 3.3V
• Low latency, CAS = 2, 3
• LVTTL signaling (0.8V to 2.0V thresholds; 0 to 3.3V rail to rail)

Pin budget: 14 Pwr/Gnd, 16 Data, 15 Addr, 7 Cmd, 1 Clk, 1 NC

Condition             Specification        Cur.    Pwr
Operating (Active)    Burst = Continuous   300mA   1W
Operating (Active)    Burst = 2            170mA   560mW
Standby (Active)      All banks active     60mA    200mW
Standby (powerdown)   All banks inactive   2mA     6.6mW
SDRAM Access Protocol (r/r)

We've spent some time discussing pipelined back-to-back read commands sent to the same chip; now let's try to pipeline commands to different chips.

In order for the memory controller to latch data on the data bus on consecutive cycles, chip #0 has to hold the data value past the rising edge of the clock to satisfy the hold-time requirement; then chip #0 has to stop and allow the bus to go "quiet"; then chip #1 can start to drive the data bus at least some "setup time" ahead of the rising edge of the next clock. Clock cycles have to be long enough to tolerate all of these timing requirements.

[Figure: memory controller and SDRAM chips #0 and #1 sharing a data bus (CASL 3): (1) read command and address assertion to chip #0, (2) data bus utilization by chip #0, (3) read command to chip #1, (4) data return from chip #1. The waveform shows chip #0's data held past the clock edge (hold time), a brief bus-idle gap, then chip #1 driving the bus a setup time before the next edge.]

Back-to-back memory read accesses to different chips in SDRAM: clock cycles are still long enough to allow pipelined back-to-back reads.
SDRAM Access Protocol (w/r)

I show different paths, but these signals share the bi-directional data bus. For a read following a write to a different chip, the worst case is when we write to the (N-1)th chip and then expect to pipeline a read command in the next cycle right behind it: the worst-case signal path skew is the sum of the distances. Isn't N to N even worse? No -- SDRAM does not support a pipelined read behind a write on the same chip. Also, it's not as bad as I project here, since read cycles are center-aligned and writes are edge-aligned, so in essence we get 1 1/2 cycles to pipeline this case instead of just 1 cycle. Still, this problem limits the frequency scalability of SDRAM; idle cycles may be inserted to meet timing.

[Figure 1: consecutive reads -- data flows from chips 0 through N back to the controller; worst case = Dist(N) - Dist(0).]

[Figure 2: read after write -- write data flows out to chip (N-1) while the read data must come back from chip N; worst case = Dist(N) + Dist(N-1). Bus turn-around.]
SDRAM Access Protocol (w/r)

Timing bubbles: more dead cycles.

[Figure: memory controller with SDRAM chips #0 and #1: (1) write command, (2) write data d0, (3) read command, (4) read data d1. Timing diagram: Col w0, r1; Data d0 d0 d0 d0 ... d1 d1 d1 d1, with a gap for the bus turn-around. Read following a write command to the same SDRAM device.]
DDR SDRAM System

Since the data bus has a much lighter load, if we can use better signaling technology, perhaps we can run just the data bus at a higher frequency. At the higher frequency, the skews we talked about would be terrible with a 64-bit-wide data bus, so we use source-synchronous strobe signals (called DQS) that are routed parallel to each 8-bit-wide sub-channel. DDR is newer, so let's use a lower core voltage -- it saves on power too!

[Figure: a single-channel DDR SDRAM controller connected to DIMM1-DIMM3 -- the same topology as SDRAM, with the address & command bus, data bus, DQS (data strobe) lines, and chip (DIMM) select.]
DDR SDRAM Chip

Slightly larger package, same pin width for address and data; the new pins are 2 DQS, Vref, and now differential clocks. Lower supply voltage. Low voltage swing, now referenced to Vref instead of the (0.8 to 2.0V) thresholds. No power discussion here, because the (Micron) data sheet is incomplete.

Read data returned from the DRAM chips now gets latched with respect to the timing of the DQS signals, sent by the DRAM chips in parallel with the data itself. The use of DQS introduces "bubbles" between bursts from different chips and reduces bandwidth efficiency.

256 Mbit, 133 MHz (7.5 ns cycle time), 66-pin TSOP
• Multiplexed command/address bus
• Programmable burst lengths: 2, 4, or 8*
• Quad banks internally
• Supply voltage of 2.5V*
• Low latency, CAS = 2, 2.5, 3*
• SSTL-2 signaling (Vref +/- 0.15V; 0 to 2.5V rail to rail)

Pin budget: 16 Pwr/Gnd*, 16 Data, 15 Addr, 7 Cmd, 2 Clk*, 2 DQS*, 1 Vref*, 7 NC*

[Timing diagram: Clk, Cmd (Read), DQS, and Data for CASL = 2, showing the DQS pre-amble before the burst and the DQS post-amble after it.]
DDR SDRAM Protocol (r/r)

Here we see that two consecutive column read commands to different chips on the DDR memory channel cannot be placed back to back on the data bus, due to the DQS hand-off issue. They may be pipelined with one idle cycle in between bursts. This is true for all consecutive accesses to different chips -- r/r, r/w, w/r (except w/w, when the controller keeps control of the DQS signal and just changes target chips).

Because of this overhead, short bursts are inefficient on DDR; longer bursts are more efficient. (32-byte cache line = burst of 4; 64-byte line = burst of 8.)

[Figure: memory controller with chips #0 and #1 -- read r0 to chip #0 returns d0, read r1 to chip #1 returns d1. Timing diagram (CASL = 2): Clk, Cmd (r0, r1), DQS, Data, with the DQS pre-amble and post-amble forcing an idle cycle between the two bursts.]

Back-to-back memory read accesses to different chips in DDR SDRAM.
RDRAM System

Very different from SDRAM: everything is sent around in 8-(half-cycle) packets. Most systems now run at 400 MHz, but since everything is DDR, it's called "800 MHz". The only difference is that packets can only be initiated at the rising edge of the clock; other than that, there's no difference between 400 DDR and 800.

Very clean topology, very clever clocking scheme: no clock hand-off issue, high efficiency. The write delay improves matching with the read latency (not perfectly, as shown). Since the data bus is 16 bits wide, each read command gets 16*8 = 128 bits back, so each cache-line fetch = multiple packets. Up to 32 devices.

[Figure: an RDRAM controller on a snaking channel. Timing diagram: bus clock, column commands w0, w1, r2, and data packets d0, d1, d2, annotated with tCWD (write delay), tCAC (CAS access delay), and the gap tCAC - tCWD -- two write commands followed by a read command.]

Packet protocol: everything in 8 (half-cycle) packets.
Direct RDRAM Chip

RDRAM packets do not re-order the data inside the packet. To compute RDRAM latency, we must add in the command-packet transmission time as well as the data-packet transmission time. RDRAM relies on its multitude of banks to try to make sure that a high percentage of requests hit open pages and only incur the cost of a CAS, instead of a RAS + CAS.

256 Mbit, 400 MHz (2.5 ns cycle time), 86-pin FBGA
• Separate row and column command busses
• Burst length = 8*
• 4/16/32 banks internally*
• Supply voltage of 2.5V*
• Low latency, CAS = 4 to 6 full cycles*
• RSL signaling (Vref +/- 0.2V; 800 mV rail to rail)

Pin budget: 49 Pwr/Gnd*, 16 Data, 8 Addr/Cmd, 4 Clk*, 6 CTL*, 2 NC, 1 Vref*

[Timing diagram: Activate, Activate, precharge on the row bus; four read packets on the column bus; four data packets on the data bus.]

All packets are 8 (half) cycles in length; the protocol allows near-100% bandwidth utilization on all channels (Addr/Cmd/Data).
RDRAM Drawbacks

RDRAM provides high bandwidth, but what are the costs? RAMBUS pushed in many different areas simultaneously. The drawback was that, with a new set of infrastructure, the costs for first-generation products were exorbitant.

[Figure: sources of first-generation cost -- high-frequency I/O test and package cost; RSL requiring a separate power plane; roughly 30% die cost for logic at the 64 Mbit node; control logic, active decode logic, and open row buffers (high power for the "quiet" state); and a single chip providing all data bits for each packet (power).]

Significant cost delta for the first generation.
System Comparison

Low pin count, higher latency: in general terms, the system comparison simply points out the areas where RDRAM excels, i.e. high bandwidth and low pin count. But it also has longer latency, since it takes 10 ns just to move the command from the controller onto the DRAM chip, and another 10 ns to get the data from the DRAM chips back onto the controller interface.

                                     SDRAM     DDR       RDRAM
Frequency (MHz)                      133       133*2     400*2
Pin Count (Data Bus)                 64        64        16
Pin Count (Controller)               102       101       33
Theoretical Bandwidth (MB/s)         1064      2128      1600
Theoretical Efficiency
  (data bits/cycle/pin)              0.63      0.63      0.48
Sustained BW (MB/s)*                 655       986       1072
Sustained Efficiency*
  (data bits/cycle/pin)              0.39      0.29      0.32
RAS + CAS (tRAC) (ns)                45 ~ 50   45 ~ 50   57 ~ 67
CAS Latency (ns)**                   22 ~ 30   22 ~ 30   40 ~ 50

133 MHz P6 Chipset + SDRAM CAS Latency ~ 80 ns
*StreamAdd    **Load-to-use latency
Differences of Philosophy

RDRAM moves complexity from the interface into the DRAM chips. Is this a good trade-off? What does the future look like?

[Figure: SDRAM variants -- controller, complex interconnect, inexpensive interface, simple logic on the DRAM chips. RDRAM variants -- controller, simplified interconnect, expensive interface, complex logic on the DRAM chips.]

Complexity moved to the DRAM.
Technology Roadmap (ITRS)

To begin with, we look in a crystal ball for trends that will cause changes or limit scalability in areas we are interested in. ITRS = International Technology Roadmap for Semiconductors. Transistor frequencies are supposed to nearly double every generation, and the transistor budget (as indicated by million logic transistors per cm^2) is projected to double. Interconnects between chips are a different story: measured in cents/pin, pin cost decreases only slowly, and the pin budget grows slowly each generation.

Punchline: in the future, free transistors and costly interconnects.

                                  2004        2007        2010        2013        2016
Semi Generation (nm)              90          65          45          32          22
CPU MHz                           3990        6740        12000       19000       29000
MLogicTransistors/cm^2            77.2        154.3       309         617         1235
High Perf chip pin count          2263        3012        4009        5335        7100
High Perf chip cost (cents/pin)   1.88        1.61        1.68        1.44        1.22
Memory pin cost (cents/pin)       0.34-1.39   0.27-0.84   0.22-0.34   0.19-0.39   0.19-0.33
Memory pin count                  48-160      48-160      62-208      81-270      105-351

Trend: Free Transistors & Costly Interconnects
Choices for Future

So we have some choices to make. Integration of the memory controller will move the controller on-die, and frequency will be much higher; the command-data path will cross chip boundaries only twice instead of four times. But interfacing with memory chips directly means that you are limited by the lowest common denominator. To get the highest bandwidth (for a given number of pins) AND the lowest latency, we'd need custom RAM -- it might as well be SRAM, but it would be prohibitively expensive.

[Figure: four options --
• Direct connect, custom DRAM: highest bandwidth + low latency.
• Direct connect, semi-commodity DRAM: high bandwidth + low/moderate latency.
• Direct connect, commodity DRAM: low bandwidth + low latency.
• Indirect connection through an external memory controller to commodity DRAM: highest bandwidth, inexpensive DRAM, highest latency.]
EV7 + RDRAM (Compaq/HP)

Two RDRAM controllers means two independent channels. Only one packet has to be generated for each 64-byte cache-line transaction request. (An extra channel stores cache-coherence data, i.e. "I belong to CPU #2, exclusively.") Very aggressive use of the available pages in RDRAM memory.

• RDRAM memory (2 controllers)
• Direct connection to processor
• 75 ns load-to-use latency
• 12.8 GB/s peak bandwidth
• 6 GB/s read or write bandwidth
• 2048 open pages (2 * 32 * 32)

[Figure: two memory controllers (MC), each 64 bits wide, fanning out to four 16-bit RDRAM channels apiece; each column read fetches 128 * 4 = 512 bits of data.]
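
The peak-bandwidth figure can be cross-checked from the per-channel numbers quoted earlier in the tutorial (16-bit Direct Rambus channels at 400 MHz DDR, i.e. 800 Mtransfers/s and 1.6 GB/s per channel):

\[
2\ \text{controllers} \times 4\ \text{channels} \times 16\ \tfrac{\text{bits}}{\text{transfer}} \times 800\ \tfrac{\text{Mtransfers}}{\text{s}}
  = 8 \times 1.6\ \text{GB/s} = 12.8\ \text{GB/s}
\]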
What if EV7 Used DDR?

The EV7 cache line is 64 bytes, so each 4-channel ganged RDRAM controller can fetch 64 bytes with a single packet. Each DDR SDRAM channel can fetch 64 bytes by itself, so we need 6 controllers; if we gang two DDR SDRAM channels together into one, we have to reduce the burst length from 8 to 4. Shorter bursts are less efficient, so sustainable bandwidth drops.

• Peak bandwidth 12.8 GB/s
• 6 channels of 133*2 MHz DDR SDRAM ==
• 6 controllers of 6 64-bit-wide channels, or
• 3 controllers of 3 128-bit-wide channels

System             EV7 + RDRAM        EV7 + 6-controller DDR SDRAM   EV7 + 3-controller DDR SDRAM
Latency            75 ns              ~50 ns*                        ~50 ns*
Pin count          ~265** + Pwr/Gnd   ~600** + Pwr/Gnd               ~600** + Pwr/Gnd
Controller count   2                  6***                           3***
Open pages         2048               144                            72

* page-hit CAS + memory controller latency.
** including all signals (address, command, data, clock), not including ECC or parity.
*** the 3-controller design is less bandwidth efficient.
What's Next?

DDR SDRAM was an advancement over SDRAM, with lowered Vdd, a new electrical signaling interface (SSTL), and a new protocol, but fundamentally the same tRC of ~60 ns; RDRAM has a tRC of ~70 ns. All are comparable in row recovery time. So what's next? What's on the horizon? DDR II / FCRAM / RLDRAM / RDRAM-nextGen / Kentron? What are they, and what do they bring to the table?

• DDR II
• FCRAM
• RLDRAM
• RDRAM (Yellowstone etc.)
• Kentron QBM
DDR II - DDR Next Gen

DDR II is a follow-on to DDR; the DDR II command set is a superset of the DDR SDRAM commands. Lower I/O voltage means lower power for I/O and possibly faster signal switching due to the lower rail-to-rail voltage. The DRAM core now operates at 1:4 of the data bus frequency; a valid command may be latched on any given rising edge of the clock, but may be delayed a cycle since the command bus now runs at 1:2 frequency relative to the core. In a memory system it can run at 400 Mbps per pin, while it can be cranked up to 800 Mbps per pin in an embedded system without connectors. DDR II eliminates the transfer-until-interrupted commands and limits the burst length to 4 only (simpler to test).

• Lower I/O voltage (1.8V)
• DRAM core operates at 1:4 of data bus frequency (SDRAM 1:1, DDR 1:2)
• Backward compatible with DDR (common multidrop modules possible): 400 Mbps
• Point to point: 800 Mbps
• No more page-transfer-until-interrupted commands
• FBGA package (removes speedpath)
• Burst length == 4 only
• 4 banks internally (same as SDRAM and DDR)
• Write latency = CAS - 1 (increased bus utilization)
DDR II - Continued

Posted commands: instead of a controller that keeps track of cycles, we can now have a "dumber" controller. Control is now simple, kind of like SRAM: part I of the address one cycle, part II the next cycle.

[Timing diagram 1: SDRAM & DDR -- Active (RAS), then the controller waits tRCD before issuing Read (CAS), then data. SDRAM & DDR rely on the memory controller to know tRCD and issue the CAS after tRCD for lowest latency.]

[Timing diagram 2: DDR II posted CAS -- Active (RAS) and Read (CAS) are issued back to back; an internal counter delays the CAS command, and the DRAM chip issues the "real" command after tRCD for lowest latency.]
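
A minimal sketch of the scheduling difference (the cycle counts are illustrative, and "additive latency" is used here as an assumed name for the internal CAS delay rather than a term quoted from the tutorial): both schemes deliver data at the same time, but posted CAS lets the controller issue the two commands back to back.

    T_RCD = 3   # ACT-to-READ delay, in command-bus cycles (hypothetical)
    T_CL  = 3   # CAS latency, in cycles (hypothetical)

    def data_cycle_controller_timed(act_cycle):
        """SDRAM/DDR: controller counts tRCD itself, then issues READ."""
        read_cycle = act_cycle + T_RCD
        return read_cycle + T_CL

    def data_cycle_posted_cas(act_cycle):
        """DDR II: READ issued right after ACT; the chip delays it internally."""
        read_issue = act_cycle + 1              # back-to-back on the bus
        additive_latency = T_RCD - 1            # internal counter
        return read_issue + additive_latency + T_CL

    print(data_cycle_controller_timed(0))  # 6
    print(data_cycle_posted_cas(0))        # 6 -- same data time, simpler controller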
FCRAM

Fast Cycle RAM (aka Network-DRAM). FCRAM is a trademark of Fujitsu; Toshiba manufactures under this trademark, and Samsung sells "Network DRAM" -- the same thing. Extra die area is devoted to circuits that lower the row cycle time to half that of DDR, and the random-access (tRAC) latency down to 22 to 26 ns. Writes are delay-matched with CASL, for better bus utilization.

Features                DDR SDRAM       FCRAM/Network-DRAM
Vdd, Vddq               2.5 +/- 0.2V    2.5 +/- 0.15V
Electrical Interface    SSTL-2          SSTL-2
Clock Frequency         100~167 MHz     154~200 MHz
tRAC                    ~40ns           22~26ns
tRC                     ~60ns           25~30ns
# Banks                 4               4
Burst Length            2,4,8           2,4
Write Latency           1 Clock         CASL - 1

FCRAM/Network-DRAM looks like DDR+
FCRAM Continued

With the faster DRAM turn-around time on tRC, a random access stream that hits the same DRAM array over and over again achieves higher bus utilization (even with random R/W accesses). This is also why peak BW != sustained BW: deviations from peak bandwidth can be due to architecture-related issues such as tRC (you cannot cycle a DRAM array fast enough to grab data out of the same array and re-use the sense amps).

Faster tRC allows Samsung to claim higher bus efficiency.

*Samsung Electronics, Denali MemCon 2002
RLDRAM

Another variant, but RLDRAM is targeted toward embedded systems. There are no connector specifications, so it can target a higher frequency off the bat.

DRAM Type      Frequency   Bus Width (per chip)   Peak Bandwidth (per chip)   Random Access Time (tRAC)   Row Cycle Time (tRC)
PC133 SDRAM    133         16                     200 MB/s                    45 ns                       60 ns
DDR 266        133 * 2     16                     532 MB/s                    45 ns                       60 ns
PC800 RDRAM    400 * 2     16                     1.6 GB/s                    60 ns                       70 ns
FCRAM          200 * 2     16                     0.8 GB/s                    25 ns                       25 ns
RLDRAM         300 * 2     32                     2.4 GB/s                    25 ns                       25 ns

Comparable to FCRAM in latency. Higher frequency (no connectors). Non-multiplexed address (SRAM-like).
RLDRAM Continued

RLDRAM Applications -- L3 Cache (high-end PC and server): Infineon proposes that RLDRAM could be integrated onto the motherboard as an L3 cache -- 64 MB of L3. This shaves 25 ns off tRAC compared with going to SDRAM or DDR SDRAM. It is not to be used as main memory, due to capacity constraints.

[Figure: a Northbridge containing the processor and memory controller connected to two 256Mb x32 RLDRAM chips at 2.4 GB/s each.]

"RLDRAM is a great replacement for SRAM in L3 cache applications because of its high density, low power and low cost."

*Infineon Presentation, Denali MemCon 2002
RAMBUS Yellowstone

Unlike the other DRAMs, Yellowstone is only a voltage and I/O specification -- no DRAM, AFAIK. RAMBUS has learned their lesson: they used expensive packaging, 8-layer motherboards, and added cost everywhere. Now the new pitch is "higher performance with the same infrastructure".

• Bi-directional differential signals
• Ultra-low 200mV peak-to-peak signal swings
• 8 data bits transferred per clock
• 400 MHz system clock
• 3.2 GHz effective data frequency
• Cheap 4-layer PCB
• Commodity packaging

[Figure: Octal Data Rate (ODR) signaling -- 8 data bits per system clock cycle, with the data signal swinging between 1.0 V and 1.2 V.]
Kentron QBM (Quad Band Memory)

QBM uses FET switches to control which DIMM drives the output; two DDR memory chips are interleaved to get quad-rate memory. Advantages: it uses standard DDR chips, and the extra cost is low -- only the wrapper electronics. A modification to the memory controller is required, but it is minimal: the controller has to understand that data is being burst back at 4X the clock frequency. It does not improve efficiency, but it is cheap bandwidth. It supports more loads than "ordinary DDR", so more capacity.

[Figure: pairs of DDR chips (DDR A and DDR B) behind FET switches; the switches alternate between the two chips so the output interleaves their bursts -- DDR A: d1 d1 d1 d1, DDR B: d0 d0 d0 d0, Output: d0 d1 d0 d1 d0 d1 d0 d1.]

"Wrapper electronics around DDR memory": generates 4 data bits per cycle instead of 2.
A Different Perspective

Instead of thinking about things from a strict latency-bandwidth perspective, it might be more helpful to think in terms of latency versus pin-transition efficiency.

[Figure: everything is bandwidth -- clock, row cmd/addr bandwidth, column cmd/addr bandwidth, write data bandwidth, read data bandwidth.]

Latency and bandwidth → pin-bandwidth and pin-transition efficiency (bits/cycle/sec).
Research Areas: Topology

A DRAM system is basically a networking system with a smart master controller and a large number of "dumb" slave devices. If we are concerned about "efficiency" at a bits/pin/sec level, it might behoove us to draw inspiration from network interfaces and design something like this: unidirectional command and write packets from the controller to the DRAM chips, and a unidirectional bus from the DRAM chips back to the controller. Then it looks like a network system with a slotted-ring interface, and there is no need to deal with bus turn-around issues.

Unidirectional topology:
• Write packets sent on command bus
• Pins used for command/address/data
• Further increase of logic on DRAM chips
Memory Commands?

Certain things simply do not make sense to do -- such as the various STREAM components: moving multi-megabyte arrays from DRAM to the CPU just to perform a simple "add" function, then moving those multi-megabyte arrays right back. In such extremely bandwidth-constrained applications, it would be beneficial to have some logic or hardware on the DRAM chips that can perform simple computation. This is tricky, since we do not want to add so much logic as to make the DRAM chips prohibitively expensive to manufacture (logic overhead decreases with each generation, so adding logic is not an impossible dream). Also, we do not want to add logic into the critical path of a DRAM access -- that would slow down a general access in terms of "real latency" in ns.

Instead of A[ ] = 0, do "write 0" (Act, Write 0) rather than Act, Write 000000...
Why do A[ ] = B[ ] in the CPU? Move data inside of a DRAM or between DRAMs.
Why do STREAMadd in the CPU?  A[ ] = B[ ] + C[ ]

Active Pages (Chong et al., ISCA '98)
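
A rough way to quantify the motivation (the array size, bus width, and command size below are arbitrary placeholders): a CPU-side A[ ] = B[ ] copy crosses the memory bus twice per element, while a hypothetical in-DRAM copy command would cross it only once, for the command itself.

    ARRAY_BYTES = 8 * 1024 * 1024     # hypothetical multi-megabyte array
    BUS_BYTES_PER_TRANSFER = 8        # 64-bit data bus

    def cpu_copy_transfers():
        """A[] = B[] done by the CPU: read every byte, then write it back."""
        return 2 * ARRAY_BYTES // BUS_BYTES_PER_TRANSFER

    def in_dram_copy_transfers(command_bytes=16):
        """Hypothetical 'copy' memory command: only the command crosses the bus."""
        return command_bytes // BUS_BYTES_PER_TRANSFER

    print(cpu_copy_transfers())       # 2,097,152 bus transfers
    print(in_dram_copy_transfers())   # 2 bus transfers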


Address Mapping

For a given physical address, there are a number of ways to map the bits of the physical address to generate the "memory address" in terms of device ID, row/column address, and bank ID. The mapping policy can impact performance, since a badly mapped system can cause bank conflicts on consecutive accesses.

Mapping policies must now also take temperature control into account, as consecutive accesses that hit the same DRAM chip can potentially create undesirable hot spots. One reason for the additional cost of RDRAM initially was the use of heat spreaders on the memory modules to prevent hotspots from building up.

[Figure: a physical address split into fields -- Device Id, Row Addr, Col Addr, Bank Id.]

Access distribution for temperature control. Avoid bank conflicts. Access reordering for performance.
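
Here is one illustrative way to carve up a physical address (the field widths and their ordering are assumptions made for this sketch, not the mapping used by any particular controller -- real policies choose them to spread consecutive lines across banks and devices):

    # Hypothetical field layout, low bits to high bits:
    #   [ column : 10 ][ bank : 2 ][ device : 3 ][ row : rest ]
    COL_BITS, BANK_BITS, DEV_BITS = 10, 2, 3

    def map_address(paddr):
        col    = paddr & ((1 << COL_BITS) - 1)
        paddr >>= COL_BITS
        bank   = paddr & ((1 << BANK_BITS) - 1)
        paddr >>= BANK_BITS
        device = paddr & ((1 << DEV_BITS) - 1)
        row    = paddr >> DEV_BITS
        return device, row, bank, col

    print(map_address(0x05AE5700))   # one of the example addresses on the next slide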
Example: Bank Conflicts

Each memory system consists of one or more memory chips, and most of the time accesses to these chips can be pipelined. Each chip also has a multitude of banks, and most of the time accesses to these banks can also be pipelined. (The key to efficiency is to pipeline commands.)

[Figure: several DRAM chips, each with multiple banks (row decoder, memory array, sense amps, column decoder) -- multiple banks to reduce access conflicts.]

Read 05AE5700 → Device id 3, Row id 266, Bank id 0
Read 023BB880 → Device id 3, Row id 1BA, Bank id 0
Read 05AE5780 → Device id 3, Row id 266, Bank id 0
Read 00CBA2C0 → Device id 3, Row id 052, Bank id 1

More banks per chip == performance == logic overhead.
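
A tiny conflict checker over decoded requests like the ones above (the device/row/bank decode is taken from the slide; only the checking logic is added here, and it models nothing beyond "same bank, different row means precharge + activate"):

    # (device, row, bank) as decoded on the slide
    requests = [
        (3, 0x266, 0),   # Read 05AE5700
        (3, 0x1BA, 0),   # Read 023BB880
        (3, 0x266, 0),   # Read 05AE5780
        (3, 0x052, 1),   # Read 00CBA2C0
    ]

    open_rows = {}   # (device, bank) -> currently open row
    for dev, row, bank in requests:
        key = (dev, bank)
        if key not in open_rows:
            print("row miss (bank idle):", key, hex(row))
        elif open_rows[key] == row:
            print("row hit:             ", key, hex(row))
        else:
            print("bank conflict:       ", key, hex(row))   # PRE + ACT needed
        open_rows[key] = row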


Example: Access Reordering

Each load command is translated into a row command and a column command. If two commands are mapped to the same bank, one must be completed before the other can start. Or, if we can re-order the sequence, the entire sequence can be completed faster. By allowing Read 3 to bypass Read 2, we do not need to generate another row-activation command. Read 4 may also bypass Read 2, since it operates on a different device/bank entirely.

DRAM can now do auto-precharge, but the precharge is shown explicitly here to show that two rows cannot be active in the same bank within the tRC (DRAM architecture) constraint.

1 Read 05AE5700 → Device id 3, Row id 266, Bank id 0
2 Read 023BB880 → Device id 3, Row id 1BA, Bank id 0
3 Read 05AE5780 → Device id 3, Row id 266, Bank id 0
4 Read 00CBA2C0 → Device id 1, Row id 052, Bank id 1

[Timing diagrams: strict ordering -- Act 1, Read, Data, Prec, Act 2, Read, Data, Prec, Act 3, with each row activation to the same bank separated by tRC. Memory access re-ordered -- Act 1 and Act 4 open their rows, Reads 1, 3, and 4 complete, then Prec and Act 2 follow; the whole sequence finishes sooner.]

Act = Activate page (data moved from DRAM cells to row buffer)
Read = Read data (data moved from row buffer to memory controller)
Prec = Precharge (close page / evict data in row buffer / sense amp)
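
A much-simplified sketch of the reordering idea (this models only the bypass rule described above -- a younger request may move ahead of an older one if it hits an already-open row or targets an untouched device/bank -- and ignores real scheduling constraints such as tRC and data-bus contention):

    def reorder(requests):
        """requests: list of (name, device, row, bank), oldest first."""
        open_rows, scheduled, pending = {}, [], list(requests)
        while pending:
            # Prefer the oldest request that hits an open row or a fresh bank.
            pick = next((r for r in pending
                         if open_rows.get((r[1], r[3])) in (None, r[2])),
                        pending[0])
            pending.remove(pick)
            scheduled.append(pick[0])
            open_rows[(pick[1], pick[3])] = pick[2]
        return scheduled

    reqs = [("R1", 3, 0x266, 0), ("R2", 3, 0x1BA, 0),
            ("R3", 3, 0x266, 0), ("R4", 1, 0x052, 1)]
    print(reorder(reqs))   # ['R1', 'R3', 'R4', 'R2'] -- R3 and R4 bypass R2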
Outline

Now -- talk about performance issues.

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• DRAM Evolution: Interface Path
• Future Interface Trends & Research Areas
• Performance Modeling: Architectures, Systems, Embedded
DRAM TUTORIAL

ISCA 2002
Simulator Overview
Bruce Jacob
David Wang

University of
CPU: SimpleScalar v3.0a
Maryland
• 8-way out-of-order
NOTE

• L1 cache: split 64K/64K, lockup free x32


• L2 cache: unified 1MB, lockup free x1
• L2 blocksize: 128 bytes

Main Memory: 8 64Mb DRAMs


• 100MHz/128-bit memory bus
• Optimistic open-page policy

Benchmarks: SPEC ’95


DRAM TUTORIAL

ISCA 2002
DRAM Configurations
Bruce Jacob
David Wang
University of
Maryland
NOTE

FPM, EDO, SDRAM, ESDRAM, DDR:
[Figure: CPU and caches connected by a 128-bit 100MHz bus to a Memory Controller driving a DIMM of eight x16 DRAMs]

Rambus, Direct Rambus, SLDRAM:
[Figure: CPU and caches connected by a 128-bit 100MHz bus to a Memory Controller driving eight DRAMs on a fast, narrow channel]

Note: TRANSFER WIDTH of Direct Rambus Channel


• equals that of ganged FPM, EDO, etc.
• is 2x that of Rambus & SLDRAM
DRAM TUTORIAL

ISCA 2002
DRAM Configurations
Bruce Jacob
David Wang
University of
Maryland
NOTE

Rambus & SLDRAM dual-channel:
[Figure: CPU and caches connected by a 128-bit 100MHz bus to a Memory Controller driving two fast, narrow channels of eight DRAMs each]

Strawman: Rambus, etc.:
[Figure: CPU and caches connected by a 128-bit 100MHz bus to a Memory Controller driving multiple parallel channels of DRAMs]
DRAM TUTORIAL

ISCA 2002
First … Refresh Matters
Bruce Jacob
David Wang

University of
Maryland
NOTE

[Bar chart: Time per Access (ns), 0 to 1200, for the compress benchmark across DRAM configurations (FPM1, FPM2, FPM3, EDO1, EDO2, SDRAM1, ESDRAM, SLDRAM, RDRAM, DRDRAM), broken into Bus Wait Time, Refresh Time, Data Transfer Time, Data Transfer Time Overlap, Column Access Time, Row Access Time, and Bus Transmission Time]

Assumes refresh of each bank every 64ms


DRAM TUTORIAL

ISCA 2002
Overhead: Memory vs. CPU
Bruce Jacob
David Wang
University of
Maryland
NOTE

Total Execution Time in CPI — SDRAM

[Bar chart: Clocks Per Instruction (CPI) for Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, and Vortex, with bars for Yesterday's, Today's, and Tomorrow's CPU, broken into Stalls due to Memory Access Time, Overlap between Execution & Memory, and Processor Execution (includes caches)]

Variable: speed of processor & caches


DRAM TUTORIAL

ISCA 2002
Definitions (var. on Burger, et al)
Bruce Jacob
David Wang
University of
Maryland
NOTE

• tPROC — processor with perfect memory
• tREAL — realistic configuration
• tBW — CPU with wide memory paths
• tDRAM — time seen by DRAM system

[Execution-time breakdown (components of tREAL):
  Stalls Due to BANDWIDTH = tREAL - tBW
  Stalls Due to LATENCY = tBW - tPROC
  CPU-Memory OVERLAP = tPROC - (tREAL - tDRAM)
  CPU+L1+L2 Execution = tREAL - tDRAM]
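As a quick sanity check (my arithmetic, not part of the tutorial), the four components above partition the realistic execution time exactly:

(tREAL - tBW) + (tBW - tPROC) + (tPROC - (tREAL - tDRAM)) + (tREAL - tDRAM) = tREAL

i.e. bandwidth stalls, latency stalls, CPU-memory overlap, and CPU+L1+L2 execution sum to tREAL.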
DRAM TUTORIAL

ISCA 2002
Memory & CPU — PERL
Bruce Jacob
David Wang

University of
Maryland
NOTE

Bandwidth-Enhancing Techniques I: Newer DRAMs

[Bar chart: Cycles Per Instruction (CPI) for PERL across DRAM configurations (FPM, EDO, SLDRAM, RDRAM, SDRAM, DRDRAM, ESDRAM, DDR), with bars for Yesterday's, Today's, and Tomorrow's CPU, broken into Stalls due to Memory Bandwidth, Stalls due to Memory Latency, Overlap between Execution & Memory, and Processor Execution]
DRAM TUTORIAL

ISCA 2002
Memory & CPU — PERL
Bruce Jacob
David Wang

University of
Maryland
NOTE

Bandwidth-Enhancing Techniques II: Execution Time in CPI — PERL

[Bar chart: Cycles Per Instruction (CPI) for PERL with 10GHz CPUs across DRAM configurations (FPM/interleaved, EDO/interleaved, SDRAM & DDR, SLDRAM x1/x2, RDRAM x1/x2), broken into Stalls due to Memory Bandwidth, Stalls due to Memory Latency, Overlap between Execution & Memory, and Processor Execution]


DRAM TUTORIAL

ISCA 2002
Average Latency of DRAMs
Bruce Jacob
David Wang
University of
Maryland
NOTE

[Bar chart: Avg Time per Access (ns), 0 to 500, across DRAM configurations (FPM, EDO, SLDRAM, RDRAM, SDRAM, DRDRAM, ESDRAM, DDR), broken into Bus Wait Time, Refresh Time, Data Transfer Time, Data Transfer Time Overlap, Column Access Time, Row Access Time, and Bus Transmission Time]

note: SLDRAM & RDRAM 2x data transfers


DRAM TUTORIAL

ISCA 2002
DDR2 Study Results
Bruce Jacob
David Wang

University of
Maryland
NOTE

Architectural Comparison

[Bar chart: Normalized Execution Time (DDR2 study) comparing pc100, ddr133, drd, ddr2, ddr2ems, and ddr2vc across benchmarks (cc1, compress, go, ijpeg, li, linear_walk, mpeg2dec, mpeg2enc, pegwit, perl, random_walk, stream, stream_no_unroll)]
DRAM TUTORIAL

ISCA 2002
DDR2 Study Results
Bruce Jacob
David Wang

University of
Maryland
NOTE

Perl Runtime

[Bar chart: Execution Time (sec.) for Perl at 1 GHz, 5 GHz, and 10 GHz Processor Frequency, comparing pc100, ddr133, drd, ddr2, ddr2ems, and ddr2vc]
DRAM TUTORIAL

ISCA 2002
Row-Buffer Hit Rates
Bruce Jacob
David Wang
University of
Maryland
NOTE

[Six-panel chart: Hit rate in row buffers (0 to 100%) for FPMDRAM, EDODRAM, SDRAM, ESDRAM, DDRSDRAM, SLDRAM, RDRAM, and DRDRAM. Left column: SPEC INT 95 Benchmarks (Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, Vortex); right column: ETCH Traces (Acroread, Compress, Gcc, Go, Netscape, Perl, Photoshop, Powerpoint, Winword). Rows: No L2 Cache, 1MB L2 Cache, 4MB L2 Cache]
DRAM TUTORIAL

ISCA 2002
Row-Buffer Hit Rates
Bruce Jacob
David Wang
University of
Maryland
NOTE

Hits vs. Depth in Victim-Row FIFO Buffer

[Plots: number of hits vs. depth (0 to 10) in the victim-row FIFO buffer for Go, Li, Vortex, Compress, Ijpeg, and Perl, plus hit counts vs. inter-arrival time (CPU Clocks, 2000 to 10000) for Compress and Vortex]
DRAM TUTORIAL

ISCA 2002
Row Buffers as L2 Cache
Bruce Jacob
David Wang

University of
Maryland
NOTE

[Bar chart: Clocks Per Instruction (CPI) for Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, and Vortex, broken into Stalls due to Memory Bandwidth, Stalls due to Memory Latency, Overlap between Execution & Memory, and Processor Execution]
DRAM TUTORIAL

ISCA 2002
Row Buffer Management
Bruce Jacob
David Wang
University of
Maryland

ROW ACCESS (RAS) / COLUMN ACCESS (CAS)
[Figure: two copies of the DRAM block diagram. Row access: RAS drives the row decoder, moving a page from the memory array into the sense amps. Column access: CAS drives the column decoder, moving data from the sense amps to the data in/out buffers.]

Each memory transaction has to break down into a two-part access, a row access and a column access. In essence the row buffer/sense amp is acting as a cache: a page is brought in from the memory array and stored in the buffer, and the second step moves that data from the row buffers back into the memory controller. From a certain perspective, it makes sense to speculatively move pages from memory arrays into the row buffers to maximize the page-hit rate of a column access and reduce latency. The cost of a speculative row-activation command is the ~20 bits of bandwidth sent on the command channel from controller to DRAM. Instead of prefetching into DRAM, we're just prefetching inside of DRAM. Row-buffer hit rates are 40~90%, depending on application, and *could* be near 100% if the memory system gets speculative row-buffer management commands. (This only makes sense if the memory controller is integrated.)

RAS is like Cache Access

Why not Speculate?
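A minimal sketch, in C, of the row-buffer-as-cache behavior described in the note; the open-page policy and the tCAS/tRCD/tRP values are illustrative assumptions:

enum { ROW_CLOSED = -1 };

typedef struct { int open_row; } bank_state_t;

/* Latency (ns) of a column access under an open-page policy.  A hit in the row
 * buffer pays only the column access; a miss pays precharge (if a different row
 * is latched) plus a new row activation before the column access. */
static int access_latency(bank_state_t *b, int row)
{
    const int tCAS = 20, tRCD = 30, tRP = 30;          /* assumed timings */

    if (b->open_row == row)
        return tCAS;                                    /* row-buffer hit */

    int lat = (b->open_row == ROW_CLOSED)
                  ? tRCD + tCAS                         /* bank idle: ACT + CAS */
                  : tRP + tRCD + tCAS;                  /* conflict: PRE + ACT + CAS */
    b->open_row = row;                                  /* new row now in sense amps */
    return lat;
}

A speculative "open this row" command would simply set open_row ahead of time, turning the later column access into the cheap first case.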
DRAM TUTORIAL

ISCA 2002
Cost-Performance
Bruce Jacob
David Wang

University of
FPM, EDO, SDRAM, ESDRAM:
Maryland
• Lower Latency => Wide/Fast Bus
NOTE

• Increase Capacity => Decrease Latency


• Low System Cost

Rambus, Direct Rambus, SLDRAM:


• Lower Latency => Multiple Channels
• Increase Capacity => Increase Capacity
• High System Cost

However, 1 DRDRAM = Multiple SDRAM


DRAM TUTORIAL

ISCA 2002
Conclusions
Bruce Jacob
David Wang

University of
100MHz/128-bit Bus is Current Bottleneck
Maryland
• Solution: Fast Bus/es & MC on CPU
NOTE
(e.g. Alpha 21364, Emotion Engine, …)

Current DRAMs Solving Bandwidth Problem


(but not Latency Problem)
• Solution: New cores with on-chip SRAM
(e.g. ESDRAM, VCDRAM, …)
• Solution: New cores with smaller banks
(e.g. MoSys “SRAM”, FCRAM, …)

Direct Rambus seems to scale best for future


high-speed CPUs
DRAM TUTORIAL

ISCA 2002
Outline
Bruce Jacob
David Wang

University of
• Basics
Maryland
• DRAM Evolution: Structural Path
now -- let’s talk about DRAM
performance at the SYSTEM
level. • Advanced Basics
previous studies show
MEMORY BUS is significant
bottleneck in today’s high-
• DRAM Evolution: Interface Path
performance systems

- Schumann reports that in


• Future Interface Trends & Research Areas
Alpha workstations, 30-60% of
PRIMARY MEMORY LATENCY
is due to SYSTEM OVERHEAD
• Performance Modeling:
other than DRAM latency
Architectures, Systems, Embedded
- Harvard study cites BUS
TURNAROUND as responsible
for factor-of-two difference
between PREDICTED and
MEASURED performance in P6
systems

- our previous work shows


today’s busses (1999’s busses)
are bottlenecks for tomorrow’s
DRAMs

so -- look at bus, model system


overhead
DRAM TUTORIAL

ISCA 2002
Motivation
Bruce Jacob
David Wang

University of
Even when we restrict our focus …
Maryland

at the SYSTEM LEVEL -- i.e.


outside the CPU -- we find a
large number of parameters SYSTEM-LEVEL PARAMETERS
this study only VARIES a
handful, but it still yields a fairly
large space
• Number of channels Width of channels
the parameters we VARY are in • Channel latency Channel bandwidth
blue & green
• Banks per channel Turnaround time
by “partially” independent, i
mean that we looked at a small • Request-queue size Request reordering
number of possibilities:
• Row-access Column-access
- turnaround is 0/1 cycle on
800MHz bus • DRAM precharge CAS-to-CAS latency
- request ordering is
INTERLEAVED or NOT
• DRAM buffering L2 cache blocksize
• Number of MSHRs Bus protocol

Fully | partially | not independent (this study)


DRAM TUTORIAL

ISCA 2002
Motivation
Bruce Jacob
David Wang

University of
... the design space is highly non-linear …
Maryland
and yet, even in this restricted design space, we find EXTREMELY COMPLEX results: the SYSTEM is very SENSITIVE to CHANGES in these parameters. [discuss graph] if you hold all else constant and vary one parameter, you can see extremely large changes in end performance ... up to 40% difference by changing ONE PARAMETER by a FACTOR OF TWO (e.g. doubling the number of banks, doubling the size of the burst, doubling the number of channels, etc.)

[Bar chart: Cycles per Instruction (CPI) for GCC with 32-, 64-, and 128-Byte Bursts vs. System Bandwidth from 0.8 to 25.6 GB/s (GB/s = Channels * Width * 800MHz); configurations range from 1 channel x 1 byte to 4 channels x 8 bytes]
DRAM TUTORIAL

ISCA 2002
Motivation
Bruce Jacob
David Wang

University of
... and the cost of poor judgment is high.
Maryland
so -- we have the worst possible scenario: a design space that is very sensitive to changes in parameters and execution times that can vary by a FACTOR OF THREE from worst-case to best. clearly, we would be well-served to understand this design space

[Bar chart: Cycles per Instruction (CPI), up to ~10, for the Worst, Average, and Best Organization on the SPEC 2000 Benchmarks (bzip, gcc, mcf, parser, perl, vpr, average)]
DRAM TUTORIAL

ISCA 2002
System-Level Model
Bruce Jacob
David Wang

University of
SDRAM Timing
Maryland

so by now we’re very familiar with this picture ... we cannot use it in this study, because this represents the interface between the DRAM and the MEMORY CONTROLLER. typically, the CPU’s interface is much simpler: the CPU sends all of the address bits at once with CONTROL INFO (r/w), and the memory controller handles the bit addressing and the RAS/CAS timing

[Timing diagram: Clock; Address bus carries Row Addr (ACT) then Col Addr (READ); DQ returns four Valid Data beats; phases labeled Row Access, Column Access, Command, Transfer Overlap, Data Transfer]
DRAM TUTORIAL

ISCA 2002
System-Level Model
Bruce Jacob
David Wang

University of
Timing diagrams are at the DRAM level
Maryland
… not the system level
this gives the picture of what is happening at the SYSTEM LEVEL: the CPU-to-memory-controller activity is shown as “ABUS Active”, the memory-controller-to-DRAM activity is shown as “DRAM Bank Active”, and the data read-out is shown as “DBUS Active”

[Timing diagram: the same DRAM-level timing as before (ACT/READ on the address bus, four Valid Data beats on DQ), annotated with the system-level phases ABUS Active, DRAM Bank Active, and DBUS Active]


DRAM TUTORIAL

ISCA 2002
System-Level Model
Bruce Jacob
David Wang

University of
Timing diagrams are at the DRAM level
Maryland
… not the system level
if the DRAM’s pins do not connect directly to the CPU (e.g. in a hierarchical bus organization, or if the data is funnelled through the memory controller like the northbridge chipset), then there is yet another DBUS ACTIVE timing slot that follows below and to the right ... this can continue to extend to any number of hierarchical levels, as seen in huge server systems with hundreds of GB of DRAM

[Timing diagram: as before, but the data transfer now occupies ABUS Active, DRAM Bank Active, DBUS1 Active, and then DBUS2 Active in sequence]
DRAM TUTORIAL

ISCA 2002
Request Timing
Bruce Jacob
David Wang

University of
Maryland

so let’s formalize this system-level interface. here’s the request timing in a slightly different way, as well as an example system model taken from Schumann’s paper describing the 21174 memory controller. the DRAM’s data pins are connected directly to the CPU (simplest possible model), the memory controller handles the RAS/CAS timing, and the CPU and memory controller only talk in terms of addresses and control information

[System diagram: CPU with a backside bus to the Cache; the frontside bus carries Address and an 800 MHz Data bus; the memory controller (MC) drives Row/Column Addresses & Control (800 MHz) to four DRAMs, whose data bus connects back to the CPU]

READ REQUEST TIMING:
t0
ADDRESS BUS
DRAM BANK <ROW> <COL> <PRE>
DATA BUS <DB0><DB1><DB2><DB3>
DRAM TUTORIAL

ISCA 2002
Read/Write Request Shapes
Bruce Jacob
David Wang
University of
Maryland

such a model gives us these types of request shapes for reads and writes. this shows a few example bus/burst configurations, in particular: a 4-byte bus with burst sizes of 32, 64, and 128 bytes per burst

READ REQUESTS:
t0
ADDRESS BUS 10ns
DRAM BANK 90ns
DATA BUS 70ns 10ns

ADDRESS BUS 10ns
DRAM BANK 90ns
DATA BUS 70ns 20ns

ADDRESS BUS 10ns
DRAM BANK 100ns
DATA BUS 70ns 40ns

WRITE REQUESTS:
t0
ADDRESS BUS 10ns
DRAM BANK 90ns
DATA BUS 40ns 10ns

ADDRESS BUS 10ns
DRAM BANK 90ns
DATA BUS 40ns 20ns

ADDRESS BUS 10ns
DRAM BANK 90ns
DATA BUS 40ns 40ns
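The data-bus portions of those shapes follow from simple arithmetic; here is a minimal sketch in C (the 800 MHz transfer rate and the 4-byte bus come from the slide, the function itself is mine):

#include <stdio.h>

/* Time the data bus is busy: one transfer of width_bytes per 800 MHz clock. */
static double burst_ns(int burst_bytes, int width_bytes)
{
    int beats = burst_bytes / width_bytes;   /* number of bus transfers */
    return beats * (1000.0 / 800.0);         /* 1.25 ns per transfer */
}

int main(void)
{
    /* 4-byte bus, as in the shapes above: 32/64/128-byte bursts -> 10/20/40 ns. */
    for (int burst = 32; burst <= 128; burst *= 2)
        printf("%3d-byte burst on a 4-byte bus: data bus busy %.0f ns\n",
               burst, burst_ns(burst, 4));
    return 0;
}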
DRAM TUTORIAL

ISCA 2002
Pipelined/Split Transactions
Bruce Jacob
David Wang
University of
Maryland

the bus is PIPELINED and supports SPLIT TRANSACTIONS where one request can be NESTLED inside of another if the timing is right. [explain examples] what we’re trying to do is to fit these 2D puzzle pieces together in TIME

(a) Legal if R/R to different banks:
10ns / Read: 90ns / 70ns 20ns
20ns 10ns / Read: 90ns / 70ns 20ns

(b) Nestling of writes inside reads is legal if R/W to different banks:
Legal if turnaround <= 10ns:
10ns / Read: 90ns / 70ns 10ns
10 10ns / Write: 90ns / 40ns 10ns
Legal if no turnaround:
10ns / Read: 90ns / 70ns 20ns
10 10ns / Write: 90ns / 40ns 20ns

(c) Back-to-back R/W pair that cannot be nestled:
10ns / Read: 100ns / 70ns 40ns
10 10ns / Write: 90ns / 40ns 40ns
DRAM TUTORIAL

ISCA 2002
Channels & Banks
Bruce Jacob
David Wang
University of
Maryland

as for physical connections, here are the ways we modeled independent DRAM channels and independent BANKS per CHANNEL. the figure shows a few of the parameters that we study. in addition, we look at:
- turnaround time (0, 1 cycle)
- queue size (0, 1, 2, 4, 8, 16, 32, infinite requests per channel)

[Figure: memory controller (C) and DRAMs (D) arranged as one, two, or four independent channels, each with banking degrees of 1, 2, 4, ...]

1, 2, 4 800 MHz Channels


8, 16, 32, 64 Data Bits per Channel
1, 2, 4, 8 Banks per Channel (Indep.)
32, 64, 128 Bytes per Burst
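For scale, here is a minimal sketch in C that enumerates the organizations implied by those four lists (the loop bounds mirror the slide; the GB/s formula is the one used throughout, Channels * Width * 800 MHz):

#include <stdio.h>

int main(void)
{
    int count = 0;
    for (int chan = 1; chan <= 4; chan *= 2)            /* 1, 2, 4 channels    */
        for (int bits = 8; bits <= 64; bits *= 2)       /* 8..64 data bits     */
            for (int banks = 1; banks <= 8; banks *= 2) /* 1..8 banks/channel  */
                for (int burst = 32; burst <= 128; burst *= 2) {  /* bytes/burst */
                    double gbps = chan * (bits / 8) * 0.8;        /* 800 MHz   */
                    printf("%d chan x %2d bits, %d banks, %3dB burst -> %4.1f GB/s\n",
                           chan, bits, banks, burst, gbps);
                    count++;
                }
    printf("%d organizations (before varying queue size and turnaround)\n", count);
    return 0;
}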
DRAM TUTORIAL

ISCA 2002
Burst Scheduling
Bruce Jacob
David Wang
University of
Maryland

(Back-to-Back Read Requests)
[Figure: back-to-back read requests chunked as 128-Byte Bursts, 64-Byte Bursts, and 32-Byte Bursts]

how do you chunk up a cache block? (L2 caches use 128-byte blocks) [read the bullets] LONGER BURSTS amortize the cost of activating and precharging the row over more data transferred. SHORTER BURSTS allow the critical word of a FOLLOWING REQUEST to be serviced sooner. so -- this is not novel, but it is fairly aggressive. NOTE: we use a close-page, autoprecharge policy with ESDRAM-style buffering of the ROW in SRAM. result: we get the best possible precharge overlap AND multiple burst requests to the same row will not re-invoke a RAS cycle unless an intervening READ request goes to a different ROW in the same BANK.

• Critical-burst-first
• Non-critical bursts are promoted
• Writes have lowest priority
  (tend to back up in request queue …)
• Tension between large & small bursts:
  amortization vs. faster time to data
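A minimal sketch in C of the priority order in those bullets (the enum encoding and the FIFO tie-break are my assumptions about one reasonable implementation):

#include <stdbool.h>

/* Smaller value = higher scheduling priority. */
typedef enum { CRITICAL_BURST, PROMOTED_BURST, WRITE_BURST } burst_class_t;

typedef struct {
    burst_class_t cls;
    unsigned long arrival;        /* used to keep FIFO order within a class */
} burst_t;

/* True if burst a should go on the bus before burst b: critical bursts first,
 * then promoted non-critical bursts, writes last (they back up in the queue). */
static bool schedule_before(const burst_t *a, const burst_t *b)
{
    if (a->cls != b->cls)
        return a->cls < b->cls;
    return a->arrival < b->arrival;
}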
DRAM TUTORIAL

ISCA 2002
New Bar-Chart Definition
Bruce Jacob
David Wang
University of
Maryland

we run a series of different simulations to get break-downs:
- CPU activity
- memory activity overlapped with CPU
- non-overlapped - SYSTEM
- non-overlapped - due to DRAM
so the top two are MEMORY STALL CYCLES; the bottom two are PERFECT-MEMORY execution. Note: MEMORY LATENCY is not further divided into latency/bandwidth/etc.

[Execution-time breakdown (components of tREAL):
  Stalls Due to DRAM Latency = tREAL - tSYS
  Stalls Due to SYSTEM (Queue, Bus, ...) = tSYS - tPROC
  CPU-Memory OVERLAP = tPROC - (tREAL - tDRAM)
  CPU+L1+L2 Execution = tREAL - tDRAM]

• tPROC — CPU with 1-cycle L2 miss
• tREAL — realistic CPU/DRAM config
• tSYS — CPU with 1-cycle DRAM latency
• tDRAM — time seen by DRAM system
DRAM TUTORIAL

ISCA 2002
System Overhead
Bruce Jacob
David Wang

University of
Maryland

so -- we’re modeling a memory system that is fairly aggressive in terms of scheduling policies and support for concurrency, and we’re trying to find which of the following is to blame for the most overhead: concurrency, latency, or system (queueing, precharge, chunks, etc.)

[Bar chart: Cycles per Instruction (CPI) for the Regular Bus Organization vs. 0-Cycle Bus Turnaround, with the Perfect Memory level marked, for 1, 2, 4, and 8 banks/channel at 1.6, 3.2, and 6.4 GB/s System Bandwidth (GB/s = Channels * Width * Speed)]
Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus
DRAM TUTORIAL

ISCA 2002
System Overhead
Bruce Jacob
David Wang

University of
Maryland

the figure shows:
- system overhead is significant (usually 20-40% of the total memory overhead)
- the most significant overhead tends to be the DRAM latency
- turnaround is relatively insignificant (however, remember that this is an 800MHz bus system ...)

System overhead 10–100% over perfect memory

[Bar chart: same as the previous slide: CPI for Regular Bus Organization vs. 0-Cycle Bus Turnaround for 1, 2, 4, and 8 banks/channel at 1.6, 3.2, and 6.4 GB/s System Bandwidth (GB/s = Channels * Width * Speed)]
Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus
DRAM TUTORIAL

ISCA 2002
Concurrency Effects
Bruce Jacob
David Wang
University of
Maryland

the figure also shows that increasing BANKS per CHANNEL gives you almost as much benefit as increasing CHANNEL BANDWIDTH, which is much more costly to implement => clearly, there are some concurrency effects going on, and we’d like to quantify and better understand them

[Bar chart: Cycles per Instruction (CPI) vs. System Bandwidth (1.6, 3.2, and 6.4 GB/s; GB/s = Channels * Width * Speed) for 1, 2, 4, and 8 Banks per Channel]
Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus

Banks/channel as significant as channel BW


DRAM TUTORIAL

ISCA 2002
Bandwidth vs. Burst Width
Bruce Jacob
David Wang

University of
Maryland

take a look at a larger portion of the design space. x-axis: SYSTEM BANDWIDTH, which == channels x channel width x 800MHz. y-axis: execution time of the application on the given configuration. different colored bars represent different burst widths. MEMORY OVERHEAD is substantial. obvious trend: more bandwidth is better. another obvious trend: more bandwidth is NOT NECESSARILY better ...

[Bar chart: Cycles per Instruction (CPI) for 32-, 64-, and 128-Byte Bursts vs. System Bandwidth from 0.8 to 25.6 GB/s (GB/s = Channels * Width * 800MHz); configurations range from 1 channel x 1 byte to 4 channels x 8 bytes]

Benchmark = GCC (SPEC 2000), 2 banks/channel


DRAM TUTORIAL

ISCA 2002
Bandwidth vs. Burst Width
Bruce Jacob
David Wang

University of
Maryland

so -- there are some obvious trade-offs related to BURST SIZE, which can affect the TOTAL EXECUTION TIME by 30% or more, keeping all else constant.

[Bar chart: same axes as the previous slide: CPI for 32-, 64-, and 128-Byte Bursts vs. System Bandwidth from 0.8 to 25.6 GB/s]

Benchmark = GCC (SPEC 2000), 2 banks/channel


DRAM TUTORIAL

ISCA 2002
Bandwidth vs. Burst Width
Bruce Jacob
David Wang

University of
Maryland

if we look more closely at individual system organizations, there are some clear RULES of THUMB that appear ...

[Bar chart: same as previous slide]

Benchmark = GCC (SPEC 2000), 2 banks/channel


DRAM TUTORIAL

ISCA 2002
Bandwidth vs. Burst Width
Bruce Jacob
David Wang

University of
Maryland

for one, LARGE BURSTS are optimal for WIDER CHANNELS

Wide channels (32/64-bit) want large bursts

[Bar chart: same as previous slide]

Benchmark = GCC (SPEC 2000), 2 banks/channel


DRAM TUTORIAL

ISCA 2002
Bandwidth vs. Burst Width
Bruce Jacob
David Wang

University of
Maryland

for another, SMALL BURSTS are optimal for NARROW CHANNELS

Narrow channels (8-bit) want small bursts

[Bar chart: same as previous slide]

Benchmark = GCC (SPEC 2000), 2 banks/channel


DRAM TUTORIAL

ISCA 2002
Bandwidth vs. Burst Width
Bruce Jacob
David Wang

University of
Maryland

and MEDIUM BURSTS are optimal for MEDIUM CHANNELS

Medium channels (16-bit) want medium bursts

so -- if CONCURRENCY were all-important, we would expect small bursts to be best, because they would allow a LOWER AVERAGE TIME-TO-CRITICAL-WORD for a larger number of simultaneous requests. what we actually see is that the optimal burst width scales with the bus width, suggesting an optimal number of DATA TRANSFERS per BANK ACTIVATION/PRECHARGE cycle. i’ll illustrate that ...

[Bar chart: same as previous slide]

Benchmark = GCC (SPEC 2000), 2 banks/channel


DRAM TUTORIAL

ISCA 2002
Burst Width Scales with Bus
Bruce Jacob
David Wang

University of
Range of Burst-Widths Modeled
Maryland
this figure shows the entire range of burst widths that we modeled. note that some of the rows represent several different combinations ... for example, the one THIRD DOWN FROM TOP is 2-byte channel + 32-byte burst, 4-byte channel + 64-byte burst, or 8-byte channel + 128-byte burst

64-bit channel x 32-byte burst: ADDRESS BUS 10ns, DRAM BANK 90ns, DATA BUS 70ns + 5ns
64-bit x 64-byte; 32-bit x 32-byte: ADDRESS BUS 10ns, DRAM BANK 90ns, DATA BUS 70ns + 10ns
64-bit x 128-byte; 32-bit x 64-byte; 16-bit x 32-byte: ADDRESS BUS 10ns, DRAM BANK 90ns, DATA BUS 70ns + 20ns
32-bit x 128-byte; 16-bit x 64-byte; 8-bit x 32-byte: ADDRESS BUS 10ns, DRAM BANK 100ns, DATA BUS 70ns + 40ns
16-bit x 128-byte; 8-bit x 64-byte: ADDRESS BUS 10ns, DRAM BANK 140ns, DATA BUS 70ns + 80ns
8-bit x 128-byte: ADDRESS BUS 10ns, DRAM BANK 220ns, DATA BUS 70ns + 160ns
DRAM TUTORIAL

ISCA 2002
Burst Width Scales with Bus
Bruce Jacob
David Wang

University of
Range of Burst-Widths Modeled
Maryland
the optimal configurations are in the middle, suggesting an optimal number of DATA TRANSFERS per BANK ACTIVATION/PRECHARGE cycle. [BOTTOM] -- too many transfers per burst crowds out other requests. [TOP] -- too few transfers per request lets the bank overhead (activation/precharge cycle) dominate. however, though this tells us how to best organize a channel with a given bandwidth, these rules of thumb do not say anything about how the different configurations (wide/narrow/medium channels) compare to each other ... so let’s focus on multiple configurations for ONE BANDWIDTH CLASS ...

[Figure: the same range of burst widths as the previous slide, with the OPTIMAL BURST WIDTHS highlighted in the middle rows]
DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — MCF
Bruce Jacob
David Wang
University of
Maryland

this is like saying: i have 4 800MHz 8-bit Rambus channels ... what should i do? gang them together? keep them independent? something in between? like before, we see that more banks is better, but not always by much

[Bar chart: Cycles per Instruction (CPI) for MCF at 3.2 GB/s System Bandwidth (channels x width x speed); groups for 32-, 64-, and 128-Byte Bursts, each with 1 chan x 4 bytes, 2 chan x 2 bytes, and 4 chan x 1 byte, and bars for 1, 2, 4, and 8 Banks per Channel]


DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — MCF
Bruce Jacob
David Wang
University of
Maryland

with large bursts, there is less interleaving of requests, so extra banks are not needed

#Banks not particularly important given large burst sizes ...

[Bar chart: same as previous slide]


DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — MCF
Bruce Jacob
David Wang
University of
Maryland

with multiple independent channels, you have a degree of concurrency that, to some extent, OBVIATES THE NEED for BANKING

... even less so with multi-channel systems

[Bar chart: same as previous slide]


DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — MCF
Bruce Jacob
David Wang
University of
Maryland

however, trying to reduce execution time via multiple channels is a bit risky: it is very SENSITIVE to BURST SIZE, because multiple channels give the longest possible latency to EVERY REQUEST in the system, and having LONG BURSTS exacerbates that problem, by increasing the length of time that a request can be delayed by waiting for another ahead of it in the queue

Multi-channel systems sometimes (but not always) a good idea

[Bar chart: same as previous slide]


DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — MCF
Bruce Jacob
David Wang
University of
Maryland

another way to look at it: how does the choice of burst size affect the NUMBER or WIDTH of channels?
=> WIDE CHANNELS: an improvement is seen by increasing the burst size
=> NARROW CHANNELS: either no improvement is seen, or a slight degradation is seen by increasing burst size

[Bar chart: same as previous slide, with the three channel organizations labeled 4x 1-byte channels, 2x 2-byte channels, and 1x 4-byte channels]


DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — BZIP
Bruce Jacob
David Wang
University of
Maryland

we see the same trends in all the benchmarks surveyed. the only thing that changes is some of the relations between parameters ...

[Bar chart: Cycles per Instruction (CPI) for BZIP at 3.2 GB/s System Bandwidth (channels x width x speed); groups for 32-, 64-, and 128-Byte Bursts, each with 1 chan x 4 bytes, 2 chan x 2 bytes, and 4 chan x 1 byte, and bars for 1, 2, 4, and 8 Banks per Channel]
DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — BZIP
Bruce Jacob
David Wang
University of
Maryland

for example, in BZIP, the best configurations are at smaller burst sizes than MCF. however, though THE OPTIMAL CONFIG changes from benchmark to benchmark, there are always several configs that are within 5-10% of the optimal config -- IN ALL BENCHMARKS

BEST CONFIGS are at SMALLER BURST SIZES

[Bar chart: same as previous slide]
DRAM TUTORIAL

ISCA 2002
Queue Size & Reordering
Bruce Jacob
David Wang
BZIP: 1.6 GB/s (1 channel)
University of
Maryland
[Bar chart: Cycles per Instruction (CPI) for BZIP at 1.6 GB/s (1 channel), comparing an Infinite Queue, a 32-Entry Queue, a 1-Entry Queue, and No Queue, for 1 to 8 banks/channel at 32-, 64-, and 128-byte bursts]
DRAM TUTORIAL

ISCA 2002
Conclusions
Bruce Jacob
David Wang

University of
Maryland

we have a complex design space where neighboring designs differ significantly. if you are careful, you can beat the performance of the average organization by 30-40%. supporting memory concurrency improves system performance, as long as it is not done at the expense of memory latency: using MULTIPLE CHANNELS is good, but not the best solution; multiple banks/channel is always a good idea; trying to interleave small bursts is intuitively appealing, but it doesn’t work; MSHRs: always a good idea. In general, bursts should be large enough to amortize the precharge cost. Direct Rambus = 16 bytes, DDR2 = 16/32 bytes; THIS IS NOT ENOUGH.

DESIGN SPACE is NON-LINEAR, COST of MISJUDGING is HIGH

CAREFUL TUNING YIELDS 30–40% GAIN

MORE CONCURRENCY == BETTER (but not at expense of LATENCY)
• Via Channels → NOT w/ LARGE BURSTS
• Via Banks → ALWAYS SAFE
• Via Bursts → DOESN’T PAY OFF
• Via MSHRs → NECESSARY

BURSTS AMORTIZE COST OF PRECHARGE
• Typical Systems: 32 bytes (even DDR2) → THIS IS NOT ENOUGH
DRAM TUTORIAL

ISCA 2002
Outline
Bruce Jacob
David Wang

University of
• Basics
Maryland
• DRAM Evolution: Structural Path
NOTE

• Advanced Basics
• DRAM Evolution: Interface Path
• Future Interface Trends & Research Areas
• Performance Modeling:
Architectures, Systems, Embedded
DRAM TUTORIAL

ISCA 2002
Embedded DRAM Primer
Bruce Jacob
David Wang

University of
Maryland

NOTE
[Figure: CPU Core and DRAM Array on the same die (Embedded) vs. CPU Core and DRAM Array on separate chips (Not Embedded)]
DRAM TUTORIAL

ISCA 2002
Whither Embedded DRAM?
Bruce Jacob
David Wang

University of
Microprocessor Report, August 1996: “[Five]
Maryland
Architects Look to Processors of Future”
NOTE
• Two predict imminent merger
of CPU and DRAM
• Another states we cannot keep cramming
more data over the pins at faster rates
(implication: embedded DRAM)
• A fourth wants gigantic on-chip L3 cache
(perhaps DRAM L3 implementation?)

SO WHAT HAPPENED?
DRAM TUTORIAL

ISCA 2002
Embedded DRAM for DSPs
Bruce Jacob
David Wang

University of
MOTIVATION
Maryland
NOTE

TAGLESS SRAM (software-managed):
SOFTWARE manages this movement of data. A move from memory space to “cache” space creates a new, equivalent data object, not a mere copy of the original. The address space includes both “cache” and primary memory (and memory-mapped I/O).
NON-TRANSPARENT addressing; EXPLICITLY MANAGED contents.

TRADITIONAL CACHE (hardware-managed):
HARDWARE manages this movement of data. The cache “covers” the entire address space: any datum in the space may be cached. The address space includes only primary memory (and memory-mapped I/O). Copying from memory to cache creates a subordinate copy of the datum that is kept consistent with the datum still in memory. Hardware ensures consistency.
TRANSPARENT addressing; TRANSPARENTLY MANAGED contents.

DSP Compilers => Transparent Cache Model
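A minimal sketch in C of the contrast (the scratch array stands in for the separately addressed on-chip SRAM, and the copies stand in for DMA transfers; none of this is the C6000's actual API):

#include <string.h>
#include <stdint.h>

#define N 256
static int16_t scratch[2 * N];   /* stands in for the on-chip, separately addressed SRAM */

/* Tagless SRAM: software explicitly creates a new copy of the working set in the
 * "cache" address space and operates on that object. */
static int32_t dot_scratchpad(const int16_t *a, const int16_t *b)
{
    memcpy(&scratch[0], a, N * sizeof *a);
    memcpy(&scratch[N], b, N * sizeof *b);
    int32_t acc = 0;
    for (int i = 0; i < N; i++)
        acc += scratch[i] * scratch[N + i];
    return acc;
}

/* Traditional cache: the same loop touches the original addresses and the
 * hardware transparently decides what is cached. */
static int32_t dot_cached(const int16_t *a, const int16_t *b)
{
    int32_t acc = 0;
    for (int i = 0; i < N; i++)
        acc += a[i] * b[i];
    return acc;
}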


DRAM TUTORIAL

ISCA 2002
DSP Buffer Organization
Bruce Jacob
David Wang
University of
Maryland
NOTE

[Figure: two DSP data-memory organizations. Used for the study: a fully associative 4-block cache made of buffer-0/buffer-1 and victim-0/victim-1 (backed by banks S0 and S1) sitting between the DSP's two load/store ports (LdSt0, LdSt1) and memory; compared with a DSP whose LdSt0/LdSt1 ports go to memory directly]

Bandwidth vs. Die-Area Trade-Off


for DSP Performance
DRAM TUTORIAL

ISCA 2002
E-DRAM Performance
Bruce Jacob
David Wang
Embedded Networking Benchmark - Patricia
University of
Maryland
NOTE

200MHz C6000 DSP : 50, 100, 200 MHz Memory

[Plot: CPI vs. Bandwidth (0.4 to 25.6 GB/s) for Cache Line Sizes of 32, 64, 128, 256, 512, and 1024 bytes]
DRAM TUTORIAL

ISCA 2002
E-DRAM Performance
Bruce Jacob
David Wang
Embedded Networking Benchmark - Patricia
University of
Maryland
NOTE

200MHz C6000 DSP : 50MHz Memory

[Plot: CPI vs. Bandwidth (0.4 to 25.6 GB/s) for Cache Line Sizes of 32 to 1024 bytes; increasing bus width moves to the right along the x-axis]
DRAM TUTORIAL

ISCA 2002
E-DRAM Performance
Bruce Jacob
David Wang
Embedded Networking Benchmark - Patricia
University of
Maryland
NOTE

200MHz C6000 DSP : 100MHz Memory

[Plot: CPI vs. Bandwidth (0.4 to 25.6 GB/s) for Cache Line Sizes of 32 to 1024 bytes; increasing bus width moves to the right along the x-axis]
DRAM TUTORIAL

ISCA 2002
E-DRAM Performance
Bruce Jacob
David Wang
Embedded Networking Benchmark - Patricia
University of
Maryland
NOTE

200MHz C6000 DSP : 200MHz Memory

[Plot: CPI vs. Bandwidth (0.4 to 25.6 GB/s) for Cache Line Sizes of 32 to 1024 bytes; increasing bus width moves to the right along the x-axis]
DRAM TUTORIAL

ISCA 2002
Performance-Data Sources
Bruce Jacob
David Wang
“A Performance Study of Contemporary DRAM Architectures,”
University of Proc. ISCA ’99. V. Cuppu, B. Jacob, B. Davis, and T. Mudge.
Maryland
“Organizational Design Trade-Offs at the DRAM, Memory Bus, and
Memory Controller Level: Initial Results,” University of Maryland
Technical Report UMD-SCA-TR-1999-2. V. Cuppu and B. Jacob.

“DDR2 and Low Latency Variants,” Memory Wall Workshop 2000, in


conjunction w/ ISCA ’00. B. Davis, T. Mudge, V. Cuppu, and B. Jacob.

“Concurrency, Latency, or System Overhead: Which Has the Largest


Impact on DRAM-System Performance?”
Proc. ISCA ’01. V. Cuppu and B. Jacob.

“Transparent Data-Memory Organizations for Digital Signal Processors,”


Proc. CASES ’01. S. Srinivasan, V. Cuppu, and B. Jacob.

“High Performance DRAMs in Workstation Environments,”


IEEE Transactions on Computers, November 2001.
V. Cuppu, B. Jacob, B. Davis, and T. Mudge.

Recent experiments by Sadagopan Srinivasan, Ph.D. student at


University of Maryland.
DRAM TUTORIAL

ISCA 2002
Acknowledgments
Bruce Jacob
David Wang

University of
The preceding work was supported
Maryland
in part by the following sources:
• NSF CAREER Award CCR-9983618
• NSF grant EIA-9806645
• NSF grant EIA-0000439
• DOD award AFOSR-F496200110374
• … and by Compaq and IBM.
DRAM TUTORIAL

ISCA 2002
CONTACT INFO
Bruce Jacob
David Wang

University of
Bruce Jacob
Maryland

Electrical & Computer Engineering


University of Maryland, College Park
http://www.ece.umd.edu/~blj/
blj@eng.umd.edu

Dave Wang

Electrical & Computer Engineering


University of Maryland, College Park
http://www.wam.umd.edu/~davewang/
davewang@wam.umd.edu

UNIVERSITY OF MARYLAND
