
DRAM TUTORIAL
ISCA 2002

DRAM: Architectures, Interfaces, and Systems
A Tutorial

Bruce Jacob and David Wang
Electrical & Computer Engineering Dept.
University of Maryland at College Park
http://www.ece.umd.edu/~blj/DRAM/

UNIVERSITY OF MARYLAND

DRAM: why bother? (I mean, besides the "memory wall" thing ... is it just a performance issue?) Think about embedded systems: think cellphones, think printers, think switches ... nearly every embedded product that used to be expensive is now cheap. Why? For one thing, rapid turnover from high performance to obsolescence guarantees a generous supply of CHEAP, HIGH-PERFORMANCE embedded processors to suit nearly any design need. What does the "memory wall" mean in this context? Perhaps it will take longer for a high-performance design to become obsolete?
Outline

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• DRAM Evolution: Interface Path
• Future Interface Trends & Research Areas
• Performance Modeling: Architectures, Systems, Embedded

Break at 10 a.m. — Stop us or starve


Basics
DRAM ORGANIZATION

First off -- what is DRAM? An array of storage elements (capacitor-transistor pairs). "DRAM" is an acronym (explain). Why "dynamic"? (the capacitor) -- capacitors are not perfect, so they need recharging. These are very dense parts; the cells are very small and the capacitors hold very little charge. Thus the bit lines are charged up to the 1/2 voltage level, and the sense amps detect the minute change on the lines, then recover the full signal.

[Figure: DRAM organization -- a memory array of storage elements (capacitor plus switching transistor) at the word-line/bit-line crossings; the row decoder drives the word lines, sense amps sit on the bit lines, and the column decoder selects bits into the data in/out buffers.]
Basics
BUS TRANSMISSION

So how do you interact with this thing? Let's look at a traditional organization first: the CPU connects over a bus to a memory controller, which connects to the DRAM itself. Let's look at a read operation.

[Figure: CPU -- bus -- memory controller -- DRAM (column decoder, data in/out buffers, sense amps, bit lines, row decoder, word lines, memory array).]
Basics
[PRECHARGE and] ROW ACCESS

At this point, all bit lines are at the 1/2 voltage level. The read discharges the capacitors onto the bit lines; this pulls the lines just a little bit high or a little bit low, and the sense amps detect the change and recover the full signal.

The read is destructive -- the capacitors have been discharged. However, when the sense amps pull the lines to the full logic level (either high or low), the transistors are kept open and so allow their attached capacitors to become recharged (if they hold a '1' value).

[Figure: CPU -- bus -- memory controller -- DRAM, with the row decoder driving one word line of the memory array onto the sense amps.]

AKA: OPEN a DRAM Page/Row
or ACT (Activate a DRAM Page/Row)
or RAS (Row Address Strobe)
Basics
COLUMN ACCESS

Once the data is valid on ALL of the bit lines, you can select a subset of the bits and send them to the output buffers; CAS picks one of the bits.

Big point: you cannot do another RAS or precharge of the lines until you have finished reading the column data -- you can't change the values on the bit lines or the output of the sense amps until they have been read by the memory controller.

[Figure: CPU -- bus -- memory controller -- DRAM, with the column decoder selecting from the sense amps into the data in/out buffers.]

READ Command
or CAS: Column Address Strobe
Basics
DATA TRANSFER

Then the data is valid on the data bus. Depending on what you are using for in/out buffers, you might be able to overlap a little or a lot of the data transfer with the next CAS to the same page (this is PAGE MODE).

[Figure: CPU -- bus -- memory controller -- DRAM, with data flowing out of the data in/out buffers.]

Data Out ... with optional additional CAS: Column Address Strobe
Note: page mode enables overlap with CAS
Basics
BUS TRANSMISSION

[Figure: CPU -- bus -- memory controller -- DRAM (same organization as before), with the data returning over the bus to the CPU.]
Basics

DRAM "latency" isn't deterministic, because an access may need only a CAS or may need RAS+CAS, and there may be significant queuing delays within the CPU and the memory controller. Each transaction has some overhead, and some types of overhead cannot be pipelined. This means that, in general, longer bursts are more efficient.

[Figure: CPU -- memory controller -- DRAM, with the steps of a transaction labeled A through F.]

A: Transaction request may be delayed in queue
B: Transaction request sent to memory controller
C: Transaction converted to command sequences (may be queued)
D: Command(s) sent to DRAM
E1: Requires only a CAS, or
E2: Requires RAS + CAS, or
E3: Requires PRE + RAS + CAS
F: Transaction sent back to CPU

"DRAM Latency" = A + B + C + D + E + F
Basics
PHYSICAL ORGANIZATION

[Figure: three DRAM parts of different data widths -- x2, x4, and x8 -- each with its own row decoder, memory array, sense amps, column decoder, and data buffers; the only difference is how many data bits each chip drives.]

This is per bank ... typical DRAMs have 2+ banks.
Basics
Read Timing for Conventional DRAM

Let's look at the interface another way -- the way the data sheets portray it. [Explain.] Main point: the RAS\ and CAS\ signals directly control the latches that hold the row and column addresses.

[Timing diagram: RAS, CAS, Address, and DQ. Each access presents a row address and then a column address (row access, column access), and valid data appears on DQ (data transfer) before the next row address can be presented.]
DRAM Evolutionary Tree

Since DRAM's inception there has been a stream of changes to the design, from FPM to EDO to Burst EDO to SDRAM. The changes are largely structural modifications -- minor ones -- that target THROUGHPUT. [Discuss FPM up to SDRAM.]

Everything up to and including SDRAM has been relatively inexpensive, especially when considering the pay-off (FPM was essentially free, EDO cost a latch, PBEDO cost a counter, SDRAM cost a slight re-design). However, we've run out of "free" ideas, and now all changes are considered expensive; thus there is no consensus on new directions, and a myriad of choices has appeared.

[Do the LATENCY mods starting with ESDRAM ... and then the INTERFACE mods.]

[Figure: evolutionary tree. Conventional DRAM leads through (mostly) structural modifications targeting throughput to FPM, EDO, P/BEDO, SDRAM, and ESDRAM; structural modifications targeting latency lead to FCRAM, VCDRAM, and MOSYS; interface modifications targeting throughput lead to Rambus, DDR/2, and future trends.]
DRAM Evolution
Read Timing for Conventional DRAM

[Timing diagram: RAS, CAS, Address, DQ. Legend: row access, column access, transfer overlap, data transfer. Each access requires a full row address / column address pair before valid data appears on DQ.]
DRAM Evolution
Read Timing for Fast Page Mode

FPM allows you to keep the sense amps active for multiple CAS commands -- much better throughput.

Problem: you cannot latch a new value in the column address buffer until the read-out of the data is complete.

[Timing diagram: RAS held low for one row address while CAS strobes three successive column addresses, each producing valid data on DQ (row access, column access, transfer overlap, data transfer).]
DRAM Evolution
Read Timing for Extended Data Out

The solution to that problem: instead of simple tri-state buffers, use a latch as well. By putting a latch after the column mux, the next column address command can begin sooner.

[Timing diagram: as in FPM, but the output latch lets the CAS for the next column address overlap the data transfer of the previous one.]
DRAM Evolution
Read Timing for Burst EDO

By driving the column-address latch from an internal counter rather than an external signal, the minimum cycle time for driving the output bus was reduced by roughly 30%.

[Timing diagram: one row address and one column address are presented; successive CAS toggles burst out consecutive data words on DQ.]
DRAM Evolution
Read Timing for Pipeline Burst EDO

"Pipeline" refers to the setting up of the read pipeline: the first CAS\ toggle latches the column address, and all following CAS\ toggles drive data out onto the bus. Therefore data stops coming when the memory controller stops toggling CAS\.

[Timing diagram: one row address and one column address; each subsequent CAS\ toggle produces another valid data word on DQ.]
DRAM Evolution
Read Timing for Synchronous DRAM

Main benefit: it frees the CPU or memory controller from having to control the DRAM's internal latches directly. The controller/CPU can go off and do other things during the idle cycles instead of "waiting". Even though the time-to-first-word latency actually gets worse, the scheme increases system throughput.

[Timing diagram: Clock, Command, Address, DQ. An ACT command with a row address is followed by a READ command with a column address; a burst of valid data words then appears on DQ. (RAS + CAS + OE ... == Command Bus.)]
DRAM Evolution
Inter-Row Read Timing for ESDRAM

The output latch on EDO allowed you to start CAS sooner for the next access (to the same row). Latching the whole row in ESDRAM allows you to start the precharge and RAS sooner for the next page access -- HIDE THE PRECHARGE OVERHEAD.

[Timing diagram 1: Regular CAS-2 SDRAM, read/read to the same bank -- ACT, READ, PRE, ACT, READ; the second burst must wait for the precharge and activate.]

[Timing diagram 2: ESDRAM, read/read to the same bank -- the same command sequence, but because the row is held in the latch, the precharge and second activate can be issued while the first burst is still draining, so the second burst starts sooner.]
DRAM Evolution
Write-Around in ESDRAM

A neat feature of this type of buffering: write-around.

[Timing diagram 1: Regular CAS-2 SDRAM, read/write/read to the same bank, rows 0/1/0 -- ACT, READ, PRE, ACT, WRITE, PRE, ACT, READ; the final read must re-open row 0.]

[Timing diagram 2: ESDRAM, read/write/read to the same bank, rows 0/1/0 -- ACT, READ, PRE, ACT, WRITE, READ; the write to row 1 goes around the latched row 0, so the final read is serviced without another activate. (Can the second READ be this aggressive?)]
DRAM Evolution
Internal Structure of Virtual Channel

Main thing: it is like having a bunch of open row buffers (a la Rambus), but the problem is that you must deal with the cache directly (move data into and out of it), not the DRAM banks. That adds an extra couple of cycles of latency. However, you get good bandwidth if the data you want is in the cache, and you can "prefetch" into the cache ahead of when you want it. Originally targeted at reducing latency; now that SDRAM is CAS-2 and RCD-2, this makes sense only in a throughput way.

[Figure: two DRAM banks (A and B) feed, via the sense amps and 2Kbit-wide Prefetch/Restore paths, a set of 16 channels (2Kb segments); a select/decode stage connects the segments to the DQs through the input/output buffer. Operations: Activate, Prefetch, Restore, Read, Write.]

The segment cache is software-managed and reduces energy.
DRAM Evolution
Internal Structure of Fast Cycle RAM

FCRAM opts to break up the data array -- it activates only a portion of the word line. 8K rows require 13 bits to select; FCRAM uses 15 (assuming the array is 8K x 1K ... the data sheet does not specify).

[Figure: SDRAM -- the row decoder takes 13 bits into an 8M array (8Kr x 1Kb) over full-width sense amps, tRCD = 15ns (two clocks). FCRAM -- the row decoder takes 15 bits and activates only a subsection of the 8M array over narrower sense amps, tRCD = 5ns (one clock).]

Reduces access time and energy/access.
DRAM Evolution
Internal Structure of MoSys 1T-SRAM

MoSys takes this one step further: DRAM with an SRAM interface and SRAM speed, but DRAM energy. [Physical partitioning: 72 banks.]

Auto refresh -- how do you do this transparently? The refresh logic moves through the arrays, refreshing them when they are not active. But what if one bank gets repeated accesses for a long duration? All the other banks will be refreshed, but that one will not. Solution: they have a bank-sized CACHE of lines; in theory, you should never have a problem (magic).

[Figure: the address feeds a bank select across the many small banks, with auto-refresh logic cycling through the banks and a line cache in front of the DQs.]
DRAM Evolution
Comparison of Low-Latency DRAM Cores

Here's an idea of how the designs compare. Bus speed == CAS-to-CAS; RAS-CAS == time to read data from the capacitors into the sense amps; RAS-DQ == RAS to valid data.

DRAM Type      Data Bus Speed   Bus Width (per chip)   Peak BW (per chip)   RAS-CAS (tRCD)   RAS-DQ (tRAC)
PC133 SDRAM    133              16                     266 MB/s             15 ns            30 ns
VCDRAM         133              16                     266 MB/s             30 ns            45 ns
FCRAM          200 * 2          16                     800 MB/s             5 ns             22 ns
1T-SRAM        200              32                     800 MB/s             —                10 ns
DDR 266        133 * 2          16                     532 MB/s             20 ns            45 ns
DRDRAM         400 * 2          16                     1.6 GB/s             22.5 ns          60 ns
RLDRAM         300 * 2          32                     2.4 GB/s             ???              25 ns
Outline

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• Memory System Details (Lots)
• DRAM Evolution: Interface Path
• Future Interface Trends & Research Areas
• Performance Modeling: Architectures, Systems, Embedded
What Does This All Mean?

Some technologies have legs, some do not, and some have gone belly up. We'll start by examining the fundamental technologies (I/O, packaging, etc.) and then explore some of these technologies in depth a bit later.

[Figure: a scatter of past, present, and proposed DRAM technologies -- FPM, EDO, BEDO, SDRAM, ESDRAM, DDR SDRAM, DDR II, xDDR II, netDRAM, RLDRAM, FCRAM, D-RDRAM, SLDRAM.]
Cost - Benefit Criterion

What is a "good" system? It's all about the cost of a system. This is a multi-dimensional tradeoff problem, and it is especially tough when the relative cost factors of pins, die area, and the demands of bandwidth and latency keep on changing. Good decisions for one generation may not be good for future generations. This is why we don't keep a DRAM protocol for a long time: FPM lasted a while, but we've quickly progressed through EDO, SDRAM, DDR/RDRAM, and now DDR II and whatever else is on the horizon.

[Figure: DRAM system design at the center of competing factors -- package cost, interconnect cost, bandwidth, latency, power consumption, logic overhead, and test and implementation.]
Memory System Design

Now we'll really get our hands dirty and try to become DRAM designers. That is, we want to understand the tradeoffs and design our own memory system with DRAM cells. By doing this, we can gain some insight into the basis of claims made by proponents of the various DRAM memory systems.

A memory system is a system with many parts -- a set of technologies and design decisions. All of the parts are inter-related, but for the sake of discussion we'll split the components into the ovals seen here and try to examine each part of a DRAM system separately.

[Figure: "DRAM Memory System" surrounded by its components -- clock network, I/O technology, topology, chip packaging, DRAM chip architecture, pin count, address mapping, access protocol, and row buffer management.]
DRAM Interfaces
The Digital Fantasy

Professor Jacob has shown you some nice timing diagrams. I, too, will show you some nice timing diagrams, but timing diagrams are a simplification that hides the details of implementation. Why don't they just run the system at XXX MHz like the other guy? Then the latency would be much better, and the bandwidth would be extreme. Perhaps they can't, and we'll explain why. To understand why some systems can operate at XXX MHz while others cannot, we must dig past the nice timing diagrams and the architectural block diagrams and see what turns up underneath. So underneath the timing diagram, we find this....

[Timing diagram: row command a0, column commands r0 and r1, and data bursts d0/d1 on the data bus, annotated with RAS latency, CAS latency, and pipelined access.]

Pretend that the world looks like this. But...
The Real World

[Oscilloscope capture: "Read from FCRAM(TM) @ 400 MHz DDR (non-termination case)" -- VDDQ and VSSQ pads, DQS and DQ0-15 pins on the FCRAM side and on the controller side, showing skews of 158 ps and 102 ps.]

We don't get nice square or even nicely shaped waveforms: jitter, skew, etc. Vddq and Vssq are the voltage supplies to the I/O pads on the DRAM chips. The signal bounces around and is very non-ideal. So what are the problems, and what are the solutions used to solve them? (Note the 158 ps skew on parallel data channels: if your cycle time is 10 ns, or 10,000 ps, a skew of 158 ps is no big deal; but if your cycle time is 1 ns, or 1,000 ps, then a skew of 158 ps is a big deal.) Already we see hints of some problems as we try to push systems to higher and higher clock frequencies.

*Toshiba Presentation, Denali MemCon 2002
Signal Propagation

First, we have to introduce the concept that signal propagation takes finite time. We are limited by the speed of light; on an ideal transmission line we get roughly 2/3 the speed of light, which is about 20 cm/ns. All signals, including system-wide clock signals, have to be sent on a system board, so if you send a clock signal from point A to point B on an ideal signal line, point B won't be able to tell that the clock has changed until, at the earliest, (1/20 ns/cm * distance) after the clock has risen.

Then again, PC boards are not exactly ideal transmission lines (ringing effects, drive strength, etc.). The concept of "synchronous" breaks down when different parts of the system observe different clocks -- kind of like relativity.

[Figure: a signal path from point A to point B.]

Ideal transmission line: ~0.66c = 20 cm/ns
PC board + module connectors + varying electrical loads = rather non-ideal transmission line
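
As a quick sanity check on the 20 cm/ns figure, here is a trivial sketch (the 15 cm trace length is a hypothetical example, not a number from the tutorial) that converts path length into propagation delay and compares it with a bus cycle time:

    # Ideal transmission line: ~2/3 the speed of light, i.e. ~20 cm/ns.
    PROP_CM_PER_NS = 20.0

    def flight_time_ns(path_cm):
        """Earliest time a receiver can see an edge launched path_cm away."""
        return path_cm / PROP_CM_PER_NS

    # Hypothetical example: a 15 cm trace on a 133 MHz (7.5 ns) bus.
    delay = flight_time_ns(15.0)        # 0.75 ns
    print(delay, delay / 7.5)           # ~10% of the bus cycle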


Clocking Issues

When we build a "synchronous system" on a PCB, how do we distribute the clock signal? Do we want a sliding time domain? Is an H-tree do-able to N modules in parallel? Skew compensation?

[Figure 1: "Sliding Time" -- the clock source daisy-chained from module 0 to module N, so each module sees the edge slightly later.]

[Figure 2: "H Tree?" -- the clock source fanned out through a balanced tree to modules 0 through N.]

What kind of clocking system?
Clocking Issues

We would want the chips to be on a "global clock", with everyone perfectly synchronous; but since clock signals are delivered through wires, different chips in the system will see the rising edge of a clock a little bit earlier or later than other chips. While an H-tree may work for a low-frequency system, we really need one clock for sending (writing) signals from the controller to the chips, and another one for sending signals from the chips to the controller (reading).

[Figure 1: Write Data -- the clock and signals travel from the controller (module 0) toward module N, in the same direction as the data.]

[Figure 2: Read Data -- the clock travels with the read data from the far module back toward the controller.]

We need different "clocks" for reads and writes.
Path Length Differential

We purposefully routed path #2 to be a bit longer than path #1 to illustrate the point about signal path length differentials. As illustrated, signals will reach load B later than load A simply because B is farther away from the controller than load A.

It is also difficult to do path-length and impedance matching on a system board. Sometimes heroic efforts must be used to get a nice "parallel" bus.

[Figure: a controller driving bus signals 1 and 2 along paths #1, #2, and #3 through inter-module connectors to loads A and B.]

High-frequency AND wide parallel busses are difficult to implement.
Subdividing Wide Busses

It's hard to bring a wide parallel bus from point A to point B, but it's easier to bring smaller groups of signals from A to B. To ensure proper timing, we also send along a source-synchronous clock signal that is path-length matched with the signal group it covers. In this figure, signal groups 1, 2, and 3 may have some timing skew with respect to each other, but within each group the signals will have minimal skew. (A smaller channel can be clocked higher.)

[Figure: the wide bus from A to B is split into groups 1, 2, and 3 that route around an obstruction, each group carrying its own source-synchronous local clock signal.]

Narrow channels, source-synchronous local clock signals.
Why Subdivision Helps

Analogy: it's a lot harder to schedule 8 people for one meeting, but a lot easier to schedule 2 meetings with 4 people each. The results of the two meetings can be correlated later.

[Figure: one wide channel must budget for the worst-case skew of {Chan 1 + Chan 2}; splitting it into sub-channel 1 and sub-channel 2 means each only budgets for its own worst-case skew.]

Worst-case skew must be considered in system timing.
Timing Variations

A "system" is a hard thing to design -- especially one that allows end users to perform configurations that will impact timing. To guarantee functional correctness of the system, all corner cases of variances in loading and timing must be accounted for.

[Figure: the same controller driving 4 loads versus 1 load; the waveforms "Cmd to 1 Load" and "Cmd to 4 Loads" rise at different rates relative to the clock.]

How many DIMMs in the system? How many devices on each DIMM? Who built the memory module? Infinite variations on timing!
Loading Balance

To ensure that a lightly loaded system and a fully loaded system do not differ significantly in timing, we either send duplicate signal lines to the different memory modules, or we use the same signal line but drive the I/O pads with variable strength, depending on whether the system has 1, 2, 3, or 4 loads.

[Figure: top -- a controller with duplicate signal lines, one per module; bottom -- a controller with a single signal line and variable drive strength.]
Topology

Self-explanatory: topology determines loading and signal propagation lengths.

[Figure: a controller facing a 4x4 grid of DRAM chips, with a question mark over how they are connected.]

DRAM system topology determines electrical loading conditions and signal propagation lengths.
SDRAM Topology Example

Very simple topology. The clock signal that turns around is very nice -- it solves the problem of needing multiple clocks.

[Figure: a single-channel SDRAM controller driving a shared command & address bus and a 64-bit data bus (16 bits per chip) to several ranks of x16 DRAM chips.]

Loading imbalance.
RDRAM Topology Example

All signals in this topology -- address, command, data, and clock -- are sent from point to point on channels that are path-length matched by definition.

[Figure: an RDRAM controller at the head of a narrow channel that snakes past every chip; the clock turns around at the far end, so packets travel down parallel paths and skew is minimal by design.]
I/O Technology

RSL vs. SSTL2, etc. (like ECL vs. TTL of another era). What is "logic low", what is "logic high"? Different electrical signalling protocols differ on voltage swing, high/low levels, etc. Today ∆t is on the order of ns; we want it to be on the order of ps.

[Figure: a signal swinging between logic low and logic high; the swing ∆v and the transition time ∆t define the slew rate.]

Slew Rate = ∆v / ∆t

Smaller ∆v = smaller ∆t at the same slew rate = increased rate of bits/s/pin.
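
As a worked example of that last line (the rail-to-rail swings are the ones quoted later in this tutorial for LVTTL and RSL; the assumption here is simply that the slew rate stays the same):

\[
\Delta t = \frac{\Delta v}{\text{slew rate}}, \qquad
\frac{\Delta t_{\text{RSL}}}{\Delta t_{\text{LVTTL}}}
  = \frac{\Delta v_{\text{RSL}}}{\Delta v_{\text{LVTTL}}}
  \approx \frac{0.8\ \text{V}}{3.3\ \text{V}} \approx 0.24
\]

So shrinking the swing buys a proportionally shorter transition time, and hence more bit times per second on the same pin.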
I/O - Differential Pair

Used in clocking systems, e.g. RDRAM (all clock signals are pairs, clk and clk#). Differential pairs have the highest noise tolerance and do not need as many ground signals, whereas single-ended signals need many ground connections. Also, differential-pair signals may be clocked even higher, so the pin-bandwidth disadvantage is not nearly the 2:1 implied by the diagram.

[Figure: a single-ended transmission line versus a differential-pair transmission line.]

Increase rate of bits/s/pin? Cost per pin? Pin count?
I/O - Multi Level Logic

One of several ways on the table to further increase the bit rate of the interconnects.

[Figure: a waveform quantized against three reference voltages (Vref_0, Vref_1, Vref_2), dividing the voltage range into four regions -- logic 00, 01, 11, and 10 from lowest to highest -- so each symbol carries two bits.]

Increase rate of bits/s/pin.
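
To make the idea concrete, here is a minimal decoder sketch. It follows the Gray-coded level ordering shown in the figure (00, 01, 11, 10 from lowest to highest voltage); the reference voltages themselves are made-up placeholders, not values from any real multi-level signalling spec.

    # Hypothetical reference voltages (volts), lowest to highest.
    VREF_0, VREF_1, VREF_2 = 0.6, 1.2, 1.8

    def decode_symbol(v):
        """Map one sampled voltage to the two bits it encodes."""
        if v < VREF_0:
            return "00"
        elif v < VREF_1:
            return "01"
        elif v < VREF_2:
            return "11"
        else:
            return "10"

    # Two bits per sampled symbol instead of one -> double the bits/s/pin
    # at the same symbol rate.
    print([decode_symbol(v) for v in (0.3, 0.9, 1.5, 2.1)])  # 00 01 11 10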


Packaging

Different packaging types impact cost and speed. Slow parts can use the cheapest packaging available; faster parts may have to use more expensive packaging. This has long been accepted in the higher-margin processor world, but for DRAM, each cent has to be hard fought for. To some extent, the demand for higher performance is pushing memory makers to use more expensive packaging to accommodate higher-frequency parts. When RAMBUS first spec'ed FBGA, module makers complained, since they had to purchase expensive equipment to validate that chips were properly soldered to the module board, whereas something like TSOP can be checked with visual inspection.

[Figure: package types -- DIP ("good old days"), SOJ (Small Outline J-lead), TSOP (Thin Small Outline Package), LQFP (Low Profile Quad Flat Package), FBGA (Fine Ball Grid Array).]

Memory roadmap for Hynix NetDDR II -- target specification:
Speed: 800 Mbps / 550 Mbps
Vdd/Vddq: 2.5V/2.5V (1.8V)
Interface: SSTL_2
Row Cycle Time (tRC): 35ns
Access Protocol

If I have a 16-bit-wide command to send from A to B, I need 16 pins; if I have fewer than 16, I need multiple cycles. How many bits do I need to send from point A to point B? How many pins do I get? Cycles = Bits / Pins.

[Timing diagram 1: single-cycle command -- the command r0 occupies one cycle on a wide command bus, followed by the data burst d0.]

[Timing diagram 2: multiple-cycle command -- the same command r0 is transmitted over several cycles on a narrower bus before the data burst d0.]
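
A one-line way to state the tradeoff (the pin counts below are just illustrative):

    import math

    def command_cycles(command_bits, pins):
        """Cycles needed to move one command across the available pins."""
        return math.ceil(command_bits / pins)

    print(command_cycles(16, 16))  # 1 cycle  (single-cycle command)
    print(command_cycles(16, 4))   # 4 cycles (multiple-cycle command)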


Access Protocol (r/r)

There is inherent latency between the issuance of a read command and the response of the chip with data. To increase efficiency, a pipelined structure is necessary to obtain full utilization of the command, address, and data busses. Unlike an "ordinary" pipeline on a processor, a memory pipeline has data flowing in both directions.

Architecture-wise, we should be concerned with full utilization everywhere, so we can use the least number of pins for the greatest benefit; but in actual use, we are usually concerned with full utilization of the data bus.

[Figure: a controller and four DRAM chips on one bus. Timing diagram: row command a0, column reads r0 and r1, and data bursts d0/d1, annotated with RAS latency, CAS latency, and pipelined access -- consecutive cache-line read requests to the same DRAM row.]

Commands: a = Activate (open page), r = Read (column read), d = Data (data chunk).
Access Protocol (r/w)

One datapath, two commands: the DRAM chip determines the latency of data after a read command is received, but the controller determines the timing relationship between the write command and the data being written to the DRAM chips. (If the DRAM device cannot handle pipelined R/W, then...)

Case 1: the controller sends write data at the same time as the write command, and the following read goes to a different device -- the accesses pipeline.

Case 2: the controller sends write data at the same time as the write command, and the following read goes to the same device -- the accesses do not pipeline.

Solution: delay the data of the write command to match the read latency.

[Timing diagrams: column commands w0 then r1 with data bursts d0/d1 for Case 1 (read following a write to different DRAM devices), Case 2 (read following a write to the same DRAM device), and the write-delay solution.]
Access Protocol (pipelines)

To increase "efficiency", CAS pipelining is required. How many commands must one device support concurrently? 2? 3? 4? (Depends on what?)

Imagine we must increase the data rate (higher pin frequency) but allow the DRAM core to operate slightly slower (2X pin frequency, same core latency). This issue ties the access protocol to internal DRAM architecture issues.

[Timing diagram 1: three back-to-back pipelined read commands r0, r1, r2, each data burst following after the CAS latency.]

[Timing diagram 2: "same" latency at 2X pin frequency -- the CAS latency now spans more bus cycles, so a deeper pipeline of commands is needed to keep the data bus full.]

When pin frequency increases, chips must either reduce "real latency", or support longer bursts, or pipeline more commands.
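
A small back-of-the-envelope sketch of that last point (the numbers are illustrative, not from a datasheet, and the model ignores everything except CAS latency and burst length): if the core latency in nanoseconds stays fixed while the pin clock doubles, the latency measured in bus cycles doubles, so the controller must keep more commands in flight (or burst longer) to keep the data bus busy.

    def commands_in_flight(cas_latency_ns, bus_freq_mhz, burst_cycles):
        """Outstanding commands needed to cover the CAS latency."""
        cycle_ns = 1000.0 / bus_freq_mhz
        latency_cycles = cas_latency_ns / cycle_ns
        return latency_cycles / burst_cycles

    # Hypothetical: 20 ns CAS latency, 4-cycle bursts.
    print(commands_in_flight(20, 133, 4))   # ~0.7 at 133 MHz
    print(commands_in_flight(20, 266, 4))   # ~1.3 at 266 MHz -> deeper pipeline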
Outline

The 440BX used 132 pins to control a single SDRAM channel, not counting power & ground; the 845 chipset uses only 102. There are also slower versions (66/100), and a page-burst mode (an entire page). Burst length is programmed to match the cache-line size (e.g., a 32-byte line = 256 bits = 4 cycles of 64 bits). Latency as seen by the controller is really CAS + 1 cycles.

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• DRAM Evolution: Interface Path
• SDRAM, DDR SDRAM, RDRAM Memory System Comparisons
• Processor-Memory System Trends
• RLDRAM, FCRAM, DDR II Memory Systems Summary
• Future Interface Trends & Research Areas
• Performance Modeling: Architectures, Systems, Embedded
SDRAM System In Detail

[Figure: a single-channel SDRAM controller connected to DIMM1-DIMM4 -- a shared address & command bus and data bus in a "mesh topology", with per-DIMM chip (DIMM) select lines.]
SDRAM Chip

SDRAM: inexpensive packaging, lowest cost (LVTTL signaling), standard 3.3V supply voltage. The DRAM core and I/O share the same power supply. The "power cost" is between 560 mW and 1 W per chip for the duration of the cache-line burst. (Note: it costs power just to keep lines stored in the sense amps/row buffers -- something for row-buffer management policies to think about.)

About 2/7 of the pins are used for address, 2/7 for data, 2/7 for power/ground, and 1/7 for command signals. Row commands and column commands are sent on the same bus, so they have to be demultiplexed inside the DRAM and decoded.

256 Mbit, 133 MHz (7.5 ns cycle time), 54-pin TSOP
• Multiplexed command/address bus
• Programmable burst length: 1, 2, 4, or 8
• Quad banks internally
• Supply voltage of 3.3V
• Low latency, CAS = 2, 3
• LVTTL signaling (0.8V to 2.0V thresholds; 0 to 3.3V rail to rail)

Pin budget: 14 Pwr/Gnd, 16 Data, 15 Addr, 7 Cmd, 1 Clk, 1 NC

Condition             Specification        Cur.    Pwr
Operating (Active)    Burst = Continuous   300mA   1W
Operating (Active)    Burst = 2            170mA   560mW
Standby (Active)      All banks active     60mA    200mW
Standby (powerdown)   All banks inactive   2mA     6.6mW
SDRAM Access Protocol (r/r)

We've spent some time discussing pipelined back-to-back read commands sent to the same chip; now let's try to pipeline commands to different chips.

In order for the memory controller to latch data on the data bus on consecutive cycles, chip #0 has to hold the data value past the rising edge of the clock to satisfy the hold-time requirement; then chip #0 has to stop and allow the bus to go "quiet"; then chip #1 can start to drive the data bus at least some "setup time" ahead of the rising edge of the next clock. Clock cycles have to be long enough to tolerate all of these timing requirements.

[Figure: memory controller and SDRAM chips #0 and #1 sharing a data bus (CASL 3): (1) read command and address assertion to chip #0, (2) data bus utilization by chip #0, (3) read command to chip #1, (4) data return from chip #1. The waveform shows chip #0's data held past the clock edge (hold time), a brief bus-idle gap, then chip #1 driving the bus a setup time before the next edge.]

Back-to-back memory read accesses to different chips in SDRAM: clock cycles are still long enough to allow pipelined back-to-back reads.
SDRAM Access Protocol (w/r)

I show different paths, but these signals share the bi-directional data bus. For a read following a write to a different chip, the worst case is when we write to the (N-1)th chip and then expect to pipeline a read command in the next cycle right behind it: the worst-case signal path skew is the sum of the distances. Isn't N to N even worse? No -- SDRAM does not support a pipelined read behind a write on the same chip. Also, it's not as bad as I project here, since read cycles are center-aligned and writes are edge-aligned, so in essence we get 1 1/2 cycles to pipeline this case instead of just 1 cycle. Still, this problem limits the frequency scalability of SDRAM; idle cycles may be inserted to meet timing.

[Figure 1: consecutive reads -- data flows from chips 0 through N back to the controller; worst case = Dist(N) - Dist(0).]

[Figure 2: read after write -- write data flows out to chip (N-1) while the read data must come back from chip N; worst case = Dist(N) + Dist(N-1). Bus turn-around.]
SDRAM Access Protocol (w/r)

Timing bubbles: more dead cycles.

[Figure: memory controller with SDRAM chips #0 and #1: (1) write command, (2) write data d0, (3) read command, (4) read data d1. Timing diagram: Col w0, r1; Data d0 d0 d0 d0 ... d1 d1 d1 d1, with a gap for the bus turn-around. Read following a write command to the same SDRAM device.]
DDR SDRAM System

Since the data bus has a much lighter load, if we can use better signaling technology, perhaps we can run just the data bus at a higher frequency. At the higher frequency, the skews we talked about would be terrible with a 64-bit-wide data bus, so we use source-synchronous strobe signals (called DQS) that are routed parallel to each 8-bit-wide sub-channel. DDR is newer, so let's use a lower core voltage -- it saves on power too!

[Figure: a single-channel DDR SDRAM controller connected to DIMM1-DIMM3 -- the same topology as SDRAM, with the address & command bus, data bus, DQS (data strobe) lines, and chip (DIMM) select.]
DDR SDRAM Chip

Slightly larger package, same pin width for address and data; the new pins are 2 DQS, Vref, and now differential clocks. Lower supply voltage. Low voltage swing, now referenced to Vref instead of the (0.8 to 2.0V) thresholds. No power discussion here, because the (Micron) data sheet is incomplete.

Read data returned from the DRAM chips now gets latched with respect to the timing of the DQS signals, sent by the DRAM chips in parallel with the data itself. The use of DQS introduces "bubbles" between bursts from different chips and reduces bandwidth efficiency.

256 Mbit, 133 MHz (7.5 ns cycle time), 66-pin TSOP
• Multiplexed command/address bus
• Programmable burst lengths: 2, 4, or 8*
• Quad banks internally
• Supply voltage of 2.5V*
• Low latency, CAS = 2, 2.5, 3*
• SSTL-2 signaling (Vref +/- 0.15V; 0 to 2.5V rail to rail)

Pin budget: 16 Pwr/Gnd*, 16 Data, 15 Addr, 7 Cmd, 2 Clk*, 2 DQS*, 1 Vref*, 7 NC*

[Timing diagram: Clk, Cmd (Read), DQS, and Data for CASL = 2, showing the DQS pre-amble before the burst and the DQS post-amble after it.]
DDR SDRAM Protocol (r/r)

Here we see that two consecutive column read commands to different chips on the DDR memory channel cannot be placed back to back on the data bus, due to the DQS hand-off issue. They may be pipelined with one idle cycle in between bursts. This is true for all consecutive accesses to different chips -- r/r, r/w, w/r (except w/w, when the controller keeps control of the DQS signal and just changes target chips).

Because of this overhead, short bursts are inefficient on DDR; longer bursts are more efficient. (32-byte cache line = burst of 4; 64-byte line = burst of 8.)

[Figure: memory controller with chips #0 and #1 -- read r0 to chip #0 returns d0, read r1 to chip #1 returns d1. Timing diagram (CASL = 2): Clk, Cmd (r0, r1), DQS, Data, with the DQS pre-amble and post-amble forcing an idle cycle between the two bursts.]

Back-to-back memory read accesses to different chips in DDR SDRAM.
RDRAM System

Very different from SDRAM: everything is sent around in 8-(half-cycle) packets. Most systems now run at 400 MHz, but since everything is DDR, it's called "800 MHz". The only difference is that packets can only be initiated at the rising edge of the clock; other than that, there's no difference between 400 DDR and 800.

Very clean topology, very clever clocking scheme: no clock hand-off issue, high efficiency. The write delay improves matching with the read latency (not perfectly, as shown). Since the data bus is 16 bits wide, each read command gets 16*8 = 128 bits back, so each cache-line fetch = multiple packets. Up to 32 devices.

[Figure: an RDRAM controller on a snaking channel. Timing diagram: bus clock, column commands w0, w1, r2, and data packets d0, d1, d2, annotated with tCWD (write delay), tCAC (CAS access delay), and the gap tCAC - tCWD -- two write commands followed by a read command.]

Packet protocol: everything in 8 (half-cycle) packets.
Direct RDRAM Chip

RDRAM packets do not re-order the data inside the packet. To compute RDRAM latency, we must add in the command-packet transmission time as well as the data-packet transmission time. RDRAM relies on its multitude of banks to try to make sure that a high percentage of requests hit open pages and only incur the cost of a CAS, instead of a RAS + CAS.

256 Mbit, 400 MHz (2.5 ns cycle time), 86-pin FBGA
• Separate row and column command busses
• Burst length = 8*
• 4/16/32 banks internally*
• Supply voltage of 2.5V*
• Low latency, CAS = 4 to 6 full cycles*
• RSL signaling (Vref +/- 0.2V; 800 mV rail to rail)

Pin budget: 49 Pwr/Gnd*, 16 Data, 8 Addr/Cmd, 4 Clk*, 6 CTL*, 2 NC, 1 Vref*

[Timing diagram: Activate, Activate, precharge on the row bus; four read packets on the column bus; four data packets on the data bus.]

All packets are 8 (half) cycles in length; the protocol allows near-100% bandwidth utilization on all channels (Addr/Cmd/Data).
RDRAM Drawbacks

RDRAM provides high bandwidth, but what are the costs? RAMBUS pushed in many different areas simultaneously. The drawback was that, with a new set of infrastructure, the costs for first-generation products were exorbitant.

[Figure: sources of first-generation cost -- high-frequency I/O test and package cost; RSL requiring a separate power plane; roughly 30% die cost for logic at the 64 Mbit node; control logic, active decode logic, and open row buffers (high power for the "quiet" state); and a single chip providing all data bits for each packet (power).]

Significant cost delta for the first generation.
System Comparison

Low pin count, higher latency: in general terms, the system comparison simply points out the areas where RDRAM excels, i.e. high bandwidth and low pin count. But it also has longer latency, since it takes 10 ns just to move the command from the controller onto the DRAM chip, and another 10 ns to get the data from the DRAM chips back onto the controller interface.

                                     SDRAM     DDR       RDRAM
Frequency (MHz)                      133       133*2     400*2
Pin Count (Data Bus)                 64        64        16
Pin Count (Controller)               102       101       33
Theoretical Bandwidth (MB/s)         1064      2128      1600
Theoretical Efficiency
  (data bits/cycle/pin)              0.63      0.63      0.48
Sustained BW (MB/s)*                 655       986       1072
Sustained Efficiency*
  (data bits/cycle/pin)              0.39      0.29      0.32
RAS + CAS (tRAC) (ns)                45 ~ 50   45 ~ 50   57 ~ 67
CAS Latency (ns)**                   22 ~ 30   22 ~ 30   40 ~ 50

133 MHz P6 Chipset + SDRAM CAS Latency ~ 80 ns
*StreamAdd    **Load-to-use latency
Differences of Philosophy

RDRAM moves complexity from the interface into the DRAM chips. Is this a good trade-off? What does the future look like?

[Figure: SDRAM variants -- controller, complex interconnect, inexpensive interface, simple logic on the DRAM chips. RDRAM variants -- controller, simplified interconnect, expensive interface, complex logic on the DRAM chips.]

Complexity moved to the DRAM.
Technology Roadmap (ITRS)

To begin with, we look in a crystal ball for trends that will cause changes or limit scalability in areas we are interested in. ITRS = International Technology Roadmap for Semiconductors. Transistor frequencies are supposed to nearly double every generation, and the transistor budget (as indicated by million logic transistors per cm^2) is projected to double. Interconnects between chips are a different story: measured in cents/pin, pin cost decreases only slowly, and the pin budget grows slowly each generation.

Punchline: in the future, free transistors and costly interconnects.

                                  2004        2007        2010        2013        2016
Semi Generation (nm)              90          65          45          32          22
CPU MHz                           3990        6740        12000       19000       29000
MLogicTransistors/cm^2            77.2        154.3       309         617         1235
High Perf chip pin count          2263        3012        4009        5335        7100
High Perf chip cost (cents/pin)   1.88        1.61        1.68        1.44        1.22
Memory pin cost (cents/pin)       0.34-1.39   0.27-0.84   0.22-0.34   0.19-0.39   0.19-0.33
Memory pin count                  48-160      48-160      62-208      81-270      105-351

Trend: Free Transistors & Costly Interconnects
Choices for Future

So we have some choices to make. Integration of the memory controller will move the controller on-die, and frequency will be much higher; the command-data path will cross chip boundaries only twice instead of four times. But interfacing with memory chips directly means that you are limited by the lowest common denominator. To get the highest bandwidth (for a given number of pins) AND the lowest latency, we'd need custom RAM -- it might as well be SRAM, but it would be prohibitively expensive.

[Figure: four options --
• Direct connect, custom DRAM: highest bandwidth + low latency.
• Direct connect, semi-commodity DRAM: high bandwidth + low/moderate latency.
• Direct connect, commodity DRAM: low bandwidth + low latency.
• Indirect connection through an external memory controller to commodity DRAM: highest bandwidth, inexpensive DRAM, highest latency.]
EV7 + RDRAM (Compaq/HP)

Two RDRAM controllers means two independent channels. Only one packet has to be generated for each 64-byte cache-line transaction request. (An extra channel stores cache-coherence data, i.e. "I belong to CPU #2, exclusively.") Very aggressive use of the available pages in RDRAM memory.

• RDRAM memory (2 controllers)
• Direct connection to processor
• 75 ns load-to-use latency
• 12.8 GB/s peak bandwidth
• 6 GB/s read or write bandwidth
• 2048 open pages (2 * 32 * 32)

[Figure: two memory controllers (MC), each 64 bits wide, fanning out to four 16-bit RDRAM channels apiece; each column read fetches 128 * 4 = 512 bits of data.]
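
The peak-bandwidth figure can be cross-checked from the per-channel numbers quoted earlier in the tutorial (16-bit Direct Rambus channels at 400 MHz DDR, i.e. 800 Mtransfers/s and 1.6 GB/s per channel):

\[
2\ \text{controllers} \times 4\ \text{channels} \times 16\ \tfrac{\text{bits}}{\text{transfer}} \times 800\ \tfrac{\text{Mtransfers}}{\text{s}}
  = 8 \times 1.6\ \text{GB/s} = 12.8\ \text{GB/s}
\]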
What if EV7 Used DDR?

The EV7 cache line is 64 bytes, so each 4-channel ganged RDRAM controller can fetch 64 bytes with a single packet. Each DDR SDRAM channel can fetch 64 bytes by itself, so we need 6 controllers; if we gang two DDR SDRAM channels together into one, we have to reduce the burst length from 8 to 4. Shorter bursts are less efficient, so sustainable bandwidth drops.

• Peak bandwidth 12.8 GB/s
• 6 channels of 133*2 MHz DDR SDRAM ==
• 6 controllers of 6 64-bit-wide channels, or
• 3 controllers of 3 128-bit-wide channels

System             EV7 + RDRAM        EV7 + 6-controller DDR SDRAM   EV7 + 3-controller DDR SDRAM
Latency            75 ns              ~50 ns*                        ~50 ns*
Pin count          ~265** + Pwr/Gnd   ~600** + Pwr/Gnd               ~600** + Pwr/Gnd
Controller count   2                  6***                           3***
Open pages         2048               144                            72

* page-hit CAS + memory controller latency.
** including all signals (address, command, data, clock), not including ECC or parity.
*** the 3-controller design is less bandwidth efficient.
What's Next?

DDR SDRAM was an advancement over SDRAM, with lowered Vdd, a new electrical signaling interface (SSTL), and a new protocol, but fundamentally the same tRC of ~60 ns; RDRAM has a tRC of ~70 ns. All are comparable in row recovery time. So what's next? What's on the horizon? DDR II / FCRAM / RLDRAM / RDRAM-nextGen / Kentron? What are they, and what do they bring to the table?

• DDR II
• FCRAM
• RLDRAM
• RDRAM (Yellowstone etc.)
• Kentron QBM
DDR II - DDR Next Gen

DDR II is a follow-on to DDR; the DDR II command set is a superset of the DDR SDRAM commands. Lower I/O voltage means lower power for I/O and possibly faster signal switching due to the lower rail-to-rail voltage. The DRAM core now operates at 1:4 of the data bus frequency; a valid command may be latched on any given rising edge of the clock, but may be delayed a cycle since the command bus now runs at 1:2 frequency relative to the core. In a memory system it can run at 400 Mbps per pin, while it can be cranked up to 800 Mbps per pin in an embedded system without connectors. DDR II eliminates the transfer-until-interrupted commands and limits the burst length to 4 only (simpler to test).

• Lower I/O voltage (1.8V)
• DRAM core operates at 1:4 of data bus frequency (SDRAM 1:1, DDR 1:2)
• Backward compatible with DDR (common multidrop modules possible): 400 Mbps
• Point to point: 800 Mbps
• No more page-transfer-until-interrupted commands
• FBGA package (removes speedpath)
• Burst length == 4 only
• 4 banks internally (same as SDRAM and DDR)
• Write latency = CAS - 1 (increased bus utilization)
DDR II - Continued

Posted commands: instead of a controller that keeps track of cycles, we can now have a "dumber" controller. Control is now simple, kind of like SRAM: part I of the address one cycle, part II the next cycle.

[Timing diagram 1: SDRAM & DDR -- Active (RAS), then the controller waits tRCD before issuing Read (CAS), then data. SDRAM & DDR rely on the memory controller to know tRCD and issue the CAS after tRCD for lowest latency.]

[Timing diagram 2: DDR II posted CAS -- Active (RAS) and Read (CAS) are issued back to back; an internal counter delays the CAS command, and the DRAM chip issues the "real" command after tRCD for lowest latency.]
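
A minimal sketch of the scheduling difference (the cycle counts are illustrative, and "additive latency" is used here as an assumed name for the internal CAS delay rather than a term quoted from the tutorial): both schemes deliver data at the same time, but posted CAS lets the controller issue the two commands back to back.

    T_RCD = 3   # ACT-to-READ delay, in command-bus cycles (hypothetical)
    T_CL  = 3   # CAS latency, in cycles (hypothetical)

    def data_cycle_controller_timed(act_cycle):
        """SDRAM/DDR: controller counts tRCD itself, then issues READ."""
        read_cycle = act_cycle + T_RCD
        return read_cycle + T_CL

    def data_cycle_posted_cas(act_cycle):
        """DDR II: READ issued right after ACT; the chip delays it internally."""
        read_issue = act_cycle + 1              # back-to-back on the bus
        additive_latency = T_RCD - 1            # internal counter
        return read_issue + additive_latency + T_CL

    print(data_cycle_controller_timed(0))  # 6
    print(data_cycle_posted_cas(0))        # 6 -- same data time, simpler controller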
FCRAM

Fast Cycle RAM (aka Network-DRAM). FCRAM is a trademark of Fujitsu; Toshiba manufactures under this trademark, and Samsung sells "Network DRAM" -- the same thing. Extra die area is devoted to circuits that lower the row cycle time to half that of DDR, and the random-access (tRAC) latency down to 22 to 26 ns. Writes are delay-matched with CASL, for better bus utilization.

Features                DDR SDRAM       FCRAM/Network-DRAM
Vdd, Vddq               2.5 +/- 0.2V    2.5 +/- 0.15V
Electrical Interface    SSTL-2          SSTL-2
Clock Frequency         100~167 MHz     154~200 MHz
tRAC                    ~40ns           22~26ns
tRC                     ~60ns           25~30ns
# Banks                 4               4
Burst Length            2,4,8           2,4
Write Latency           1 Clock         CASL - 1

FCRAM/Network-DRAM looks like DDR+
FCRAM Continued

With the faster DRAM turn-around time on tRC, a random access stream that hits the same DRAM array over and over again achieves higher bus utilization (even with random R/W accesses). This is also why peak BW != sustained BW: deviations from peak bandwidth can be due to architecture-related issues such as tRC (you cannot cycle a DRAM array fast enough to grab data out of the same array and re-use the sense amps).

Faster tRC allows Samsung to claim higher bus efficiency.

*Samsung Electronics, Denali MemCon 2002
RLDRAM

Another variant, but RLDRAM is targeted toward embedded systems. There are no connector specifications, so it can target a higher frequency off the bat.

DRAM Type      Frequency   Bus Width (per chip)   Peak Bandwidth (per chip)   Random Access Time (tRAC)   Row Cycle Time (tRC)
PC133 SDRAM    133         16                     200 MB/s                    45 ns                       60 ns
DDR 266        133 * 2     16                     532 MB/s                    45 ns                       60 ns
PC800 RDRAM    400 * 2     16                     1.6 GB/s                    60 ns                       70 ns
FCRAM          200 * 2     16                     0.8 GB/s                    25 ns                       25 ns
RLDRAM         300 * 2     32                     2.4 GB/s                    25 ns                       25 ns

Comparable to FCRAM in latency. Higher frequency (no connectors). Non-multiplexed address (SRAM-like).
RLDRAM Continued

RLDRAM Applications -- L3 Cache (high-end PC and server): Infineon proposes that RLDRAM could be integrated onto the motherboard as an L3 cache -- 64 MB of L3. This shaves 25 ns off tRAC compared with going to SDRAM or DDR SDRAM. It is not to be used as main memory, due to capacity constraints.

[Figure: a Northbridge containing the processor and memory controller connected to two 256Mb x32 RLDRAM chips at 2.4 GB/s each.]

"RLDRAM is a great replacement for SRAM in L3 cache applications because of its high density, low power and low cost."

*Infineon Presentation, Denali MemCon 2002
RAMBUS Yellowstone

Unlike the other DRAMs, Yellowstone is only a voltage and I/O specification -- no DRAM, AFAIK. RAMBUS has learned their lesson: they used expensive packaging, 8-layer motherboards, and added cost everywhere. Now the new pitch is "higher performance with the same infrastructure".

• Bi-directional differential signals
• Ultra-low 200mV peak-to-peak signal swings
• 8 data bits transferred per clock
• 400 MHz system clock
• 3.2 GHz effective data frequency
• Cheap 4-layer PCB
• Commodity packaging

[Figure: Octal Data Rate (ODR) signaling -- 8 data bits per system clock cycle, with the data signal swinging between 1.0 V and 1.2 V.]
Kentron QBM (Quad Band Memory)

QBM uses FET switches to control which DIMM drives the output; two DDR memory chips are interleaved to get quad-rate memory. Advantages: it uses standard DDR chips, and the extra cost is low -- only the wrapper electronics. A modification to the memory controller is required, but it is minimal: the controller has to understand that data is being burst back at 4X the clock frequency. It does not improve efficiency, but it is cheap bandwidth. It supports more loads than "ordinary DDR", so more capacity.

[Figure: pairs of DDR chips (DDR A and DDR B) behind FET switches; the switches alternate between the two chips so the output interleaves their bursts -- DDR A: d1 d1 d1 d1, DDR B: d0 d0 d0 d0, Output: d0 d1 d0 d1 d0 d1 d0 d1.]

"Wrapper electronics around DDR memory": generates 4 data bits per cycle instead of 2.
A Different Perspective

Instead of thinking about things from a strict latency-bandwidth perspective, it might be more helpful to think in terms of latency versus pin-transition efficiency.

[Figure: everything is bandwidth -- clock, row cmd/addr bandwidth, column cmd/addr bandwidth, write data bandwidth, read data bandwidth.]

Latency and bandwidth → pin-bandwidth and pin-transition efficiency (bits/cycle/sec).
Research Areas: Topology

A DRAM system is basically a networking system with a smart master controller and a large number of "dumb" slave devices. If we are concerned about "efficiency" at a bits/pin/sec level, it might behoove us to draw inspiration from network interfaces and design something like this: unidirectional command and write packets from the controller to the DRAM chips, and a unidirectional bus from the DRAM chips back to the controller. Then it looks like a network system with a slotted-ring interface, and there is no need to deal with bus turn-around issues.

Unidirectional topology:
• Write packets sent on command bus
• Pins used for command/address/data
• Further increase of logic on DRAM chips
Memory Commands?

Certain things simply do not make sense to do -- such as the various STREAM components: moving multi-megabyte arrays from DRAM to the CPU just to perform a simple "add" function, then moving those multi-megabyte arrays right back. In such extremely bandwidth-constrained applications, it would be beneficial to have some logic or hardware on the DRAM chips that can perform simple computation. This is tricky, since we do not want to add so much logic as to make the DRAM chips prohibitively expensive to manufacture (logic overhead decreases with each generation, so adding logic is not an impossible dream). Also, we do not want to add logic into the critical path of a DRAM access -- that would slow down a general access in terms of "real latency" in ns.

Instead of A[ ] = 0, do "write 0" (Act, Write 0) rather than Act, Write 000000...
Why do A[ ] = B[ ] in the CPU? Move data inside of a DRAM or between DRAMs.
Why do STREAMadd in the CPU?  A[ ] = B[ ] + C[ ]

Active Pages (Chong et al., ISCA '98)
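
A rough way to quantify the motivation (the array size, bus width, and command size below are arbitrary placeholders): a CPU-side A[ ] = B[ ] copy crosses the memory bus twice per element, while a hypothetical in-DRAM copy command would cross it only once, for the command itself.

    ARRAY_BYTES = 8 * 1024 * 1024     # hypothetical multi-megabyte array
    BUS_BYTES_PER_TRANSFER = 8        # 64-bit data bus

    def cpu_copy_transfers():
        """A[] = B[] done by the CPU: read every byte, then write it back."""
        return 2 * ARRAY_BYTES // BUS_BYTES_PER_TRANSFER

    def in_dram_copy_transfers(command_bytes=16):
        """Hypothetical 'copy' memory command: only the command crosses the bus."""
        return command_bytes // BUS_BYTES_PER_TRANSFER

    print(cpu_copy_transfers())       # 2,097,152 bus transfers
    print(in_dram_copy_transfers())   # 2 bus transfers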


Address Mapping

For a given physical address, there are a number of ways to map the bits of the physical address to generate the "memory address" in terms of device ID, row/column address, and bank ID. The mapping policy can impact performance, since a badly mapped system can cause bank conflicts on consecutive accesses.

Mapping policies must now also take temperature control into account, as consecutive accesses that hit the same DRAM chip can potentially create undesirable hot spots. One reason for the additional cost of RDRAM initially was the use of heat spreaders on the memory modules to prevent hotspots from building up.

[Figure: a physical address split into fields -- Device Id, Row Addr, Col Addr, Bank Id.]

Access distribution for temperature control. Avoid bank conflicts. Access reordering for performance.
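
Here is one illustrative way to carve up a physical address (the field widths and their ordering are assumptions made for this sketch, not the mapping used by any particular controller -- real policies choose them to spread consecutive lines across banks and devices):

    # Hypothetical field layout, low bits to high bits:
    #   [ column : 10 ][ bank : 2 ][ device : 3 ][ row : rest ]
    COL_BITS, BANK_BITS, DEV_BITS = 10, 2, 3

    def map_address(paddr):
        col    = paddr & ((1 << COL_BITS) - 1)
        paddr >>= COL_BITS
        bank   = paddr & ((1 << BANK_BITS) - 1)
        paddr >>= BANK_BITS
        device = paddr & ((1 << DEV_BITS) - 1)
        row    = paddr >> DEV_BITS
        return device, row, bank, col

    print(map_address(0x05AE5700))   # one of the example addresses on the next slide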
Example: Bank Conflicts

Each memory system consists of one or more memory chips, and most of the time accesses to these chips can be pipelined. Each chip also has a multitude of banks, and most of the time accesses to these banks can also be pipelined. (The key to efficiency is to pipeline commands.)

[Figure: several DRAM chips, each with multiple banks (row decoder, memory array, sense amps, column decoder) -- multiple banks to reduce access conflicts.]

Read 05AE5700 → Device id 3, Row id 266, Bank id 0
Read 023BB880 → Device id 3, Row id 1BA, Bank id 0
Read 05AE5780 → Device id 3, Row id 266, Bank id 0
Read 00CBA2C0 → Device id 3, Row id 052, Bank id 1

More banks per chip == performance == logic overhead.
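
A tiny conflict checker over decoded requests like the ones above (the device/row/bank decode is taken from the slide; only the checking logic is added here, and it models nothing beyond "same bank, different row means precharge + activate"):

    # (device, row, bank) as decoded on the slide
    requests = [
        (3, 0x266, 0),   # Read 05AE5700
        (3, 0x1BA, 0),   # Read 023BB880
        (3, 0x266, 0),   # Read 05AE5780
        (3, 0x052, 1),   # Read 00CBA2C0
    ]

    open_rows = {}   # (device, bank) -> currently open row
    for dev, row, bank in requests:
        key = (dev, bank)
        if key not in open_rows:
            print("row miss (bank idle):", key, hex(row))
        elif open_rows[key] == row:
            print("row hit:             ", key, hex(row))
        else:
            print("bank conflict:       ", key, hex(row))   # PRE + ACT needed
        open_rows[key] = row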


Example: Access Reordering

Each load command is translated into a row command and a column command. If two commands are mapped to the same bank, one must be completed before the other can start. Or, if we can re-order the sequence, the entire sequence can be completed faster. By allowing Read 3 to bypass Read 2, we do not need to generate another row-activation command. Read 4 may also bypass Read 2, since it operates on a different device/bank entirely.

DRAM can now do auto-precharge, but the precharge is shown explicitly here to show that two rows cannot be active in the same bank within the tRC (DRAM architecture) constraint.

1 Read 05AE5700 → Device id 3, Row id 266, Bank id 0
2 Read 023BB880 → Device id 3, Row id 1BA, Bank id 0
3 Read 05AE5780 → Device id 3, Row id 266, Bank id 0
4 Read 00CBA2C0 → Device id 1, Row id 052, Bank id 1

[Timing diagrams: strict ordering -- Act 1, Read, Data, Prec, Act 2, Read, Data, Prec, Act 3, with each row activation to the same bank separated by tRC. Memory access re-ordered -- Act 1 and Act 4 open their rows, Reads 1, 3, and 4 complete, then Prec and Act 2 follow; the whole sequence finishes sooner.]

Act = Activate page (data moved from DRAM cells to row buffer)
Read = Read data (data moved from row buffer to memory controller)
Prec = Precharge (close page / evict data in row buffer / sense amp)
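
A much-simplified sketch of the reordering idea (this models only the bypass rule described above -- a younger request may move ahead of an older one if it hits an already-open row or targets an untouched device/bank -- and ignores real scheduling constraints such as tRC and data-bus contention):

    def reorder(requests):
        """requests: list of (name, device, row, bank), oldest first."""
        open_rows, scheduled, pending = {}, [], list(requests)
        while pending:
            # Prefer the oldest request that hits an open row or a fresh bank.
            pick = next((r for r in pending
                         if open_rows.get((r[1], r[3])) in (None, r[2])),
                        pending[0])
            pending.remove(pick)
            scheduled.append(pick[0])
            open_rows[(pick[1], pick[3])] = pick[2]
        return scheduled

    reqs = [("R1", 3, 0x266, 0), ("R2", 3, 0x1BA, 0),
            ("R3", 3, 0x266, 0), ("R4", 1, 0x052, 1)]
    print(reorder(reqs))   # ['R1', 'R3', 'R4', 'R2'] -- R3 and R4 bypass R2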
Outline

Now -- talk about performance issues.

• Basics
• DRAM Evolution: Structural Path
• Advanced Basics
• DRAM Evolution: Interface Path
• Future Interface Trends & Research Areas
• Performance Modeling: Architectures, Systems, Embedded
DRAM TUTORIAL

ISCA 2002
Simulator Overview
Bruce Jacob
David Wang

University of
CPU: SimpleScalar v3.0a
Maryland
• 8-way out-of-order
NOTE

• L1 cache: split 64K/64K, lockup free x32


• L2 cache: unified 1MB, lockup free x1
• L2 blocksize: 128 bytes

Main Memory: 8 64Mb DRAMs


• 100MHz/128-bit memory bus
• Optimistic open-page policy

Benchmarks: SPEC ’95


DRAM TUTORIAL

ISCA 2002
DRAM Configurations
Bruce Jacob
David Wang
University of
Maryland
NOTE

FPM, EDO, SDRAM, ESDRAM, DDR:
[Figure: CPU and caches connected by a 128-bit 100MHz bus to a Memory Controller driving a DIMM of eight x16 DRAMs]

Rambus, Direct Rambus, SLDRAM:
[Figure: CPU and caches connected by a 128-bit 100MHz bus to a Memory Controller driving eight DRAMs on a fast, narrow channel]

Note: TRANSFER WIDTH of Direct Rambus Channel


• equals that of ganged FPM, EDO, etc.
• is 2x that of Rambus & SLDRAM
DRAM TUTORIAL

ISCA 2002
DRAM Configurations
Bruce Jacob
David Wang
University of
Maryland
NOTE

Rambus & SLDRAM dual-channel:
[Figure: CPU and caches connected by a 128-bit 100MHz bus to a Memory Controller driving two fast, narrow channels of eight DRAMs each]

Strawman: Rambus, etc.:
[Figure: CPU and caches connected by a 128-bit 100MHz bus to a Memory Controller driving multiple parallel channels of DRAMs]
DRAM TUTORIAL

ISCA 2002
First … Refresh Matters
Bruce Jacob
David Wang

University of
Maryland
NOTE

[Bar chart: Time per Access (ns), 0 to 1200, for the compress benchmark across DRAM configurations (FPM1, FPM2, FPM3, EDO1, EDO2, SDRAM1, ESDRAM, SLDRAM, RDRAM, DRDRAM), broken into Bus Wait Time, Refresh Time, Data Transfer Time, Data Transfer Time Overlap, Column Access Time, Row Access Time, and Bus Transmission Time]

Assumes refresh of each bank every 64ms


DRAM TUTORIAL

ISCA 2002
Overhead: Memory vs. CPU
Bruce Jacob
David Wang
University of
Maryland
NOTE

Total Execution Time in CPI — SDRAM

[Bar chart: Clocks Per Instruction (CPI) for Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, and Vortex, with bars for Yesterday's, Today's, and Tomorrow's CPU, broken into Stalls due to Memory Access Time, Overlap between Execution & Memory, and Processor Execution (includes caches)]

Variable: speed of processor & caches


DRAM TUTORIAL

ISCA 2002
Definitions (var. on Burger, et al)
Bruce Jacob
David Wang
University of
Maryland
NOTE

• tPROC — processor with perfect memory
• tREAL — realistic configuration
• tBW — CPU with wide memory paths
• tDRAM — time seen by DRAM system

[Execution-time breakdown (components of tREAL):
  Stalls Due to BANDWIDTH = tREAL - tBW
  Stalls Due to LATENCY = tBW - tPROC
  CPU-Memory OVERLAP = tPROC - (tREAL - tDRAM)
  CPU+L1+L2 Execution = tREAL - tDRAM]
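As a quick sanity check (my arithmetic, not part of the tutorial), the four components above partition the realistic execution time exactly:

(tREAL - tBW) + (tBW - tPROC) + (tPROC - (tREAL - tDRAM)) + (tREAL - tDRAM) = tREAL

i.e. bandwidth stalls, latency stalls, CPU-memory overlap, and CPU+L1+L2 execution sum to tREAL.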
DRAM TUTORIAL

ISCA 2002
Memory & CPU — PERL
Bruce Jacob
David Wang

University of
Maryland
NOTE

Bandwidth-Enhancing Techniques I: Newer DRAMs

[Bar chart: Cycles Per Instruction (CPI) for PERL across DRAM configurations (FPM, EDO, SLDRAM, RDRAM, SDRAM, DRDRAM, ESDRAM, DDR), with bars for Yesterday's, Today's, and Tomorrow's CPU, broken into Stalls due to Memory Bandwidth, Stalls due to Memory Latency, Overlap between Execution & Memory, and Processor Execution]
DRAM TUTORIAL

ISCA 2002
Memory & CPU — PERL
Bruce Jacob
David Wang

University of
Maryland
NOTE

Bandwidth-Enhancing Techniques II: Execution Time in CPI — PERL

[Bar chart: Cycles Per Instruction (CPI) for PERL with 10GHz CPUs across DRAM configurations (FPM/interleaved, EDO/interleaved, SDRAM & DDR, SLDRAM x1/x2, RDRAM x1/x2), broken into Stalls due to Memory Bandwidth, Stalls due to Memory Latency, Overlap between Execution & Memory, and Processor Execution]


DRAM TUTORIAL

ISCA 2002
Average Latency of DRAMs
Bruce Jacob
David Wang
University of
Maryland
NOTE

[Bar chart: Avg Time per Access (ns), 0 to 500, across DRAM configurations (FPM, EDO, SLDRAM, RDRAM, SDRAM, DRDRAM, ESDRAM, DDR), broken into Bus Wait Time, Refresh Time, Data Transfer Time, Data Transfer Time Overlap, Column Access Time, Row Access Time, and Bus Transmission Time]

note: SLDRAM & RDRAM 2x data transfers


DRAM TUTORIAL

ISCA 2002
DDR2 Study Results
Bruce Jacob
David Wang

University of
Maryland
NOTE

Architectural Comparison

[Bar chart: Normalized Execution Time (DDR2 study) comparing pc100, ddr133, drd, ddr2, ddr2ems, and ddr2vc across benchmarks (cc1, compress, go, ijpeg, li, linear_walk, mpeg2dec, mpeg2enc, pegwit, perl, random_walk, stream, stream_no_unroll)]
DRAM TUTORIAL

ISCA 2002
DDR2 Study Results
Bruce Jacob
David Wang

University of
Maryland
NOTE

Perl Runtime

[Bar chart: Execution Time (sec.) for Perl at 1 GHz, 5 GHz, and 10 GHz Processor Frequency, comparing pc100, ddr133, drd, ddr2, ddr2ems, and ddr2vc]
DRAM TUTORIAL

ISCA 2002
Row-Buffer Hit Rates
Bruce Jacob
David Wang
University of
Maryland
NOTE

[Six-panel chart: Hit rate in row buffers (0 to 100%) for FPMDRAM, EDODRAM, SDRAM, ESDRAM, DDRSDRAM, SLDRAM, RDRAM, and DRDRAM. Left column: SPEC INT 95 Benchmarks (Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, Vortex); right column: ETCH Traces (Acroread, Compress, Gcc, Go, Netscape, Perl, Photoshop, Powerpoint, Winword). Rows: No L2 Cache, 1MB L2 Cache, 4MB L2 Cache]
DRAM TUTORIAL

ISCA 2002
Row-Buffer Hit Rates
Bruce Jacob
David Wang
University of
Maryland
NOTE

Hits vs. Depth in Victim-Row FIFO Buffer

[Plots: number of hits vs. depth (0 to 10) in the victim-row FIFO buffer for Go, Li, Vortex, Compress, Ijpeg, and Perl, plus hit counts vs. inter-arrival time (CPU Clocks, 2000 to 10000) for Compress and Vortex]
DRAM TUTORIAL

ISCA 2002
Row Buffers as L2 Cache
Bruce Jacob
David Wang

University of
Maryland
NOTE

[Bar chart: Clocks Per Instruction (CPI) for Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, and Vortex, broken into Stalls due to Memory Bandwidth, Stalls due to Memory Latency, Overlap between Execution & Memory, and Processor Execution]
DRAM TUTORIAL

ISCA 2002
Row Buffer Management
Bruce Jacob
David Wang
University of
Maryland

ROW ACCESS (RAS) / COLUMN ACCESS (CAS)
[Figure: two copies of the DRAM block diagram. Row access: RAS drives the row decoder, moving a page from the memory array into the sense amps. Column access: CAS drives the column decoder, moving data from the sense amps to the data in/out buffers.]

Each memory transaction has to break down into a two-part access, a row access and a column access. In essence the row buffer/sense amp is acting as a cache: a page is brought in from the memory array and stored in the buffer, and the second step moves that data from the row buffers back into the memory controller. From a certain perspective, it makes sense to speculatively move pages from memory arrays into the row buffers to maximize the page-hit rate of a column access and reduce latency. The cost of a speculative row-activation command is the ~20 bits of bandwidth sent on the command channel from controller to DRAM. Instead of prefetching into DRAM, we're just prefetching inside of DRAM. Row-buffer hit rates are 40~90%, depending on application, and *could* be near 100% if the memory system gets speculative row-buffer management commands. (This only makes sense if the memory controller is integrated.)

RAS is like Cache Access

Why not Speculate?
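A minimal sketch, in C, of the row-buffer-as-cache behavior described in the note; the open-page policy and the tCAS/tRCD/tRP values are illustrative assumptions:

enum { ROW_CLOSED = -1 };

typedef struct { int open_row; } bank_state_t;

/* Latency (ns) of a column access under an open-page policy.  A hit in the row
 * buffer pays only the column access; a miss pays precharge (if a different row
 * is latched) plus a new row activation before the column access. */
static int access_latency(bank_state_t *b, int row)
{
    const int tCAS = 20, tRCD = 30, tRP = 30;          /* assumed timings */

    if (b->open_row == row)
        return tCAS;                                    /* row-buffer hit */

    int lat = (b->open_row == ROW_CLOSED)
                  ? tRCD + tCAS                         /* bank idle: ACT + CAS */
                  : tRP + tRCD + tCAS;                  /* conflict: PRE + ACT + CAS */
    b->open_row = row;                                  /* new row now in sense amps */
    return lat;
}

A speculative "open this row" command would simply set open_row ahead of time, turning the later column access into the cheap first case.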
DRAM TUTORIAL

ISCA 2002
Cost-Performance
Bruce Jacob
David Wang

University of
FPM, EDO, SDRAM, ESDRAM:
Maryland
• Lower Latency => Wide/Fast Bus
NOTE

• Increase Capacity => Decrease Latency


• Low System Cost

Rambus, Direct Rambus, SLDRAM:


• Lower Latency => Multiple Channels
• Increase Capacity => Increase Capacity
• High System Cost

However, 1 DRDRAM = Multiple SDRAM


DRAM TUTORIAL

ISCA 2002
Conclusions
Bruce Jacob
David Wang

University of
100MHz/128-bit Bus is Current Bottleneck
Maryland
• Solution: Fast Bus/es & MC on CPU
NOTE
(e.g. Alpha 21364, Emotion Engine, …)

Current DRAMs Solving Bandwidth Problem


(but not Latency Problem)
• Solution: New cores with on-chip SRAM
(e.g. ESDRAM, VCDRAM, …)
• Solution: New cores with smaller banks
(e.g. MoSys “SRAM”, FCRAM, …)

Direct Rambus seems to scale best for future


high-speed CPUs
DRAM TUTORIAL

ISCA 2002
Outline
Bruce Jacob
David Wang

University of
• Basics
Maryland
• DRAM Evolution: Structural Path
now -- let’s talk about DRAM
performance at the SYSTEM
level. • Advanced Basics
previous studies show
MEMORY BUS is significant
bottleneck in today’s high-
• DRAM Evolution: Interface Path
performance systems

- Schumann reports that in


• Future Interface Trends & Research Areas
Alpha workstations, 30-60% of
PRIMARY MEMORY LATENCY
is due to SYSTEM OVERHEAD
• Performance Modeling:
other than DRAM latency
Architectures, Systems, Embedded
- Harvard study cites BUS
TURNAROUND as responsible
for factor-of-two difference
between PREDICTED and
MEASURED performance in P6
systems

- our previous work shows


today’s busses (1999’s busses)
are bottlenecks for tomorrow’s
DRAMs

so -- look at bus, model system


overhead
DRAM TUTORIAL

ISCA 2002
Motivation
Bruce Jacob
David Wang

University of
Even when we restrict our focus …
Maryland

at the SYSTEM LEVEL -- i.e.


outside the CPU -- we find a
large number of parameters SYSTEM-LEVEL PARAMETERS
this study only VARIES a
handful, but it still yields a fairly
large space
• Number of channels Width of channels
the parameters we VARY are in • Channel latency Channel bandwidth
blue & green
• Banks per channel Turnaround time
by “partially” independent, i
mean that we looked at a small • Request-queue size Request reordering
number of possibilities:
• Row-access Column-access
- turnaround is 0/1 cycle on
800MHz bus • DRAM precharge CAS-to-CAS latency
- request ordering is
INTERLEAVED or NOT
• DRAM buffering L2 cache blocksize
• Number of MSHRs Bus protocol

Fully | partially | not independent (this study)


DRAM TUTORIAL

ISCA 2002
Motivation
Bruce Jacob
David Wang

University of
... the design space is highly non-linear …
Maryland
and yet, even in this restricted design space, we find EXTREMELY COMPLEX results: the SYSTEM is very SENSITIVE to CHANGES in these parameters. [discuss graph] if you hold all else constant and vary one parameter, you can see extremely large changes in end performance ... up to 40% difference by changing ONE PARAMETER by a FACTOR OF TWO (e.g. doubling the number of banks, doubling the size of the burst, doubling the number of channels, etc.)

[Bar chart: Cycles per Instruction (CPI) for GCC with 32-, 64-, and 128-Byte Bursts vs. System Bandwidth from 0.8 to 25.6 GB/s (GB/s = Channels * Width * 800MHz); configurations range from 1 channel x 1 byte to 4 channels x 8 bytes]
DRAM TUTORIAL

ISCA 2002
Motivation
Bruce Jacob
David Wang

University of
... and the cost of poor judgment is high.
Maryland
so -- we have the worst possible scenario: a design space that is very sensitive to changes in parameters and execution times that can vary by a FACTOR OF THREE from worst-case to best. clearly, we would be well-served to understand this design space

[Bar chart: Cycles per Instruction (CPI), up to ~10, for the Worst, Average, and Best Organization on the SPEC 2000 Benchmarks (bzip, gcc, mcf, parser, perl, vpr, average)]
DRAM TUTORIAL

ISCA 2002
System-Level Model
Bruce Jacob
David Wang

University of
SDRAM Timing
Maryland

so by now we’re very familiar with this picture ... we cannot use it in this study, because this represents the interface between the DRAM and the MEMORY CONTROLLER. typically, the CPU’s interface is much simpler: the CPU sends all of the address bits at once with CONTROL INFO (r/w), and the memory controller handles the bit addressing and the RAS/CAS timing

[Timing diagram: Clock; Address bus carries Row Addr (ACT) then Col Addr (READ); DQ returns four Valid Data beats; phases labeled Row Access, Column Access, Command, Transfer Overlap, Data Transfer]
DRAM TUTORIAL

ISCA 2002
System-Level Model
Bruce Jacob
David Wang

University of
Timing diagrams are at the DRAM level
Maryland
… not the system level
this gives the picture of what is happening at the SYSTEM LEVEL: the CPU-to-memory-controller activity is shown as “ABUS Active”, the memory-controller-to-DRAM activity is shown as “DRAM Bank Active”, and the data read-out is shown as “DBUS Active”

[Timing diagram: the same DRAM-level timing as before (ACT/READ on the address bus, four Valid Data beats on DQ), annotated with the system-level phases ABUS Active, DRAM Bank Active, and DBUS Active]


DRAM TUTORIAL

ISCA 2002
System-Level Model
Bruce Jacob
David Wang

University of
Timing diagrams are at the DRAM level
Maryland
… not the system level
if the DRAM’s pins do not connect directly to the CPU (e.g. in a hierarchical bus organization, or if the data is funnelled through the memory controller like the northbridge chipset), then there is yet another DBUS ACTIVE timing slot that follows below and to the right ... this can continue to extend to any number of hierarchical levels, as seen in huge server systems with hundreds of GB of DRAM

[Timing diagram: as before, but the data transfer now occupies ABUS Active, DRAM Bank Active, DBUS1 Active, and then DBUS2 Active in sequence]
DRAM TUTORIAL

ISCA 2002
Request Timing
Bruce Jacob
David Wang

University of
Maryland

so let’s formalize this system-level interface. here’s the request timing in a slightly different way, as well as an example system model taken from Schumann’s paper describing the 21174 memory controller. the DRAM’s data pins are connected directly to the CPU (simplest possible model), the memory controller handles the RAS/CAS timing, and the CPU and memory controller only talk in terms of addresses and control information

[System diagram: CPU with a backside bus to the Cache; the frontside bus carries Address and an 800 MHz Data bus; the memory controller (MC) drives Row/Column Addresses & Control (800 MHz) to four DRAMs, whose data bus connects back to the CPU]

READ REQUEST TIMING:
t0
ADDRESS BUS
DRAM BANK <ROW> <COL> <PRE>
DATA BUS <DB0><DB1><DB2><DB3>
DRAM TUTORIAL

ISCA 2002
Read/Write Request Shapes
Bruce Jacob
David Wang
University of
Maryland

such a model gives us these types of request shapes for reads and writes. this shows a few example bus/burst configurations, in particular: a 4-byte bus with burst sizes of 32, 64, and 128 bytes per burst

READ REQUESTS:
t0
ADDRESS BUS 10ns
DRAM BANK 90ns
DATA BUS 70ns 10ns

ADDRESS BUS 10ns
DRAM BANK 90ns
DATA BUS 70ns 20ns

ADDRESS BUS 10ns
DRAM BANK 100ns
DATA BUS 70ns 40ns

WRITE REQUESTS:
t0
ADDRESS BUS 10ns
DRAM BANK 90ns
DATA BUS 40ns 10ns

ADDRESS BUS 10ns
DRAM BANK 90ns
DATA BUS 40ns 20ns

ADDRESS BUS 10ns
DRAM BANK 90ns
DATA BUS 40ns 40ns
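The data-bus portions of those shapes follow from simple arithmetic; here is a minimal sketch in C (the 800 MHz transfer rate and the 4-byte bus come from the slide, the function itself is mine):

#include <stdio.h>

/* Time the data bus is busy: one transfer of width_bytes per 800 MHz clock. */
static double burst_ns(int burst_bytes, int width_bytes)
{
    int beats = burst_bytes / width_bytes;   /* number of bus transfers */
    return beats * (1000.0 / 800.0);         /* 1.25 ns per transfer */
}

int main(void)
{
    /* 4-byte bus, as in the shapes above: 32/64/128-byte bursts -> 10/20/40 ns. */
    for (int burst = 32; burst <= 128; burst *= 2)
        printf("%3d-byte burst on a 4-byte bus: data bus busy %.0f ns\n",
               burst, burst_ns(burst, 4));
    return 0;
}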
DRAM TUTORIAL

ISCA 2002
Pipelined/Split Transactions
Bruce Jacob
David Wang
University of
Maryland

the bus is PIPELINED and supports SPLIT TRANSACTIONS where one request can be NESTLED inside of another if the timing is right. [explain examples] what we’re trying to do is to fit these 2D puzzle pieces together in TIME

(a) Legal if R/R to different banks:
10ns / Read: 90ns / 70ns 20ns
20ns 10ns / Read: 90ns / 70ns 20ns

(b) Nestling of writes inside reads is legal if R/W to different banks:
Legal if turnaround <= 10ns:
10ns / Read: 90ns / 70ns 10ns
10 10ns / Write: 90ns / 40ns 10ns
Legal if no turnaround:
10ns / Read: 90ns / 70ns 20ns
10 10ns / Write: 90ns / 40ns 20ns

(c) Back-to-back R/W pair that cannot be nestled:
10ns / Read: 100ns / 70ns 40ns
10 10ns / Write: 90ns / 40ns 40ns
DRAM TUTORIAL

ISCA 2002
Channels & Banks
Bruce Jacob
David Wang
University of
Maryland

as for physical connections, here are the ways we modeled independent DRAM channels and independent BANKS per CHANNEL. the figure shows a few of the parameters that we study. in addition, we look at:
- turnaround time (0, 1 cycle)
- queue size (0, 1, 2, 4, 8, 16, 32, infinite requests per channel)

[Figure: memory controller (C) and DRAMs (D) arranged as one, two, or four independent channels, each with banking degrees of 1, 2, 4, ...]

1, 2, 4 800 MHz Channels


8, 16, 32, 64 Data Bits per Channel
1, 2, 4, 8 Banks per Channel (Indep.)
32, 64, 128 Bytes per Burst
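For scale, here is a minimal sketch in C that enumerates the organizations implied by those four lists (the loop bounds mirror the slide; the GB/s formula is the one used throughout, Channels * Width * 800 MHz):

#include <stdio.h>

int main(void)
{
    int count = 0;
    for (int chan = 1; chan <= 4; chan *= 2)            /* 1, 2, 4 channels    */
        for (int bits = 8; bits <= 64; bits *= 2)       /* 8..64 data bits     */
            for (int banks = 1; banks <= 8; banks *= 2) /* 1..8 banks/channel  */
                for (int burst = 32; burst <= 128; burst *= 2) {  /* bytes/burst */
                    double gbps = chan * (bits / 8) * 0.8;        /* 800 MHz   */
                    printf("%d chan x %2d bits, %d banks, %3dB burst -> %4.1f GB/s\n",
                           chan, bits, banks, burst, gbps);
                    count++;
                }
    printf("%d organizations (before varying queue size and turnaround)\n", count);
    return 0;
}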
DRAM TUTORIAL

ISCA 2002
Burst Scheduling
Bruce Jacob
David Wang
University of
Maryland

(Back-to-Back Read Requests)
[Figure: back-to-back read requests chunked as 128-Byte Bursts, 64-Byte Bursts, and 32-Byte Bursts]

how do you chunk up a cache block? (L2 caches use 128-byte blocks) [read the bullets] LONGER BURSTS amortize the cost of activating and precharging the row over more data transferred. SHORTER BURSTS allow the critical word of a FOLLOWING REQUEST to be serviced sooner. so -- this is not novel, but it is fairly aggressive. NOTE: we use a close-page, autoprecharge policy with ESDRAM-style buffering of the ROW in SRAM. result: we get the best possible precharge overlap AND multiple burst requests to the same row will not re-invoke a RAS cycle unless an intervening READ request goes to a different ROW in the same BANK.

• Critical-burst-first
• Non-critical bursts are promoted
• Writes have lowest priority
  (tend to back up in request queue …)
• Tension between large & small bursts:
  amortization vs. faster time to data
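A minimal sketch in C of the priority order in those bullets (the enum encoding and the FIFO tie-break are my assumptions about one reasonable implementation):

#include <stdbool.h>

/* Smaller value = higher scheduling priority. */
typedef enum { CRITICAL_BURST, PROMOTED_BURST, WRITE_BURST } burst_class_t;

typedef struct {
    burst_class_t cls;
    unsigned long arrival;        /* used to keep FIFO order within a class */
} burst_t;

/* True if burst a should go on the bus before burst b: critical bursts first,
 * then promoted non-critical bursts, writes last (they back up in the queue). */
static bool schedule_before(const burst_t *a, const burst_t *b)
{
    if (a->cls != b->cls)
        return a->cls < b->cls;
    return a->arrival < b->arrival;
}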
DRAM TUTORIAL

ISCA 2002
New Bar-Chart Definition
Bruce Jacob
David Wang
University of
Maryland

we run a series of different simulations to get break-downs:
- CPU activity
- memory activity overlapped with CPU
- non-overlapped - SYSTEM
- non-overlapped - due to DRAM
so the top two are MEMORY STALL CYCLES; the bottom two are PERFECT-MEMORY execution. Note: MEMORY LATENCY is not further divided into latency/bandwidth/etc.

[Execution-time breakdown (components of tREAL):
  Stalls Due to DRAM Latency = tREAL - tSYS
  Stalls Due to SYSTEM (Queue, Bus, ...) = tSYS - tPROC
  CPU-Memory OVERLAP = tPROC - (tREAL - tDRAM)
  CPU+L1+L2 Execution = tREAL - tDRAM]

• tPROC — CPU with 1-cycle L2 miss
• tREAL — realistic CPU/DRAM config
• tSYS — CPU with 1-cycle DRAM latency
• tDRAM — time seen by DRAM system
DRAM TUTORIAL

ISCA 2002
System Overhead
Bruce Jacob
David Wang

University of
Maryland

so -- we’re modeling a memory system that is fairly aggressive in terms of scheduling policies and support for concurrency, and we’re trying to find which of the following is to blame for the most overhead: concurrency, latency, or system (queueing, precharge, chunks, etc.)

[Bar chart: Cycles per Instruction (CPI) for the Regular Bus Organization vs. 0-Cycle Bus Turnaround, with the Perfect Memory level marked, for 1, 2, 4, and 8 banks/channel at 1.6, 3.2, and 6.4 GB/s System Bandwidth (GB/s = Channels * Width * Speed)]
Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus
DRAM TUTORIAL

ISCA 2002
System Overhead
Bruce Jacob
David Wang

University of
Maryland

the figure shows:
- system overhead is significant (usually 20-40% of the total memory overhead)
- the most significant overhead tends to be the DRAM latency
- turnaround is relatively insignificant (however, remember that this is an 800MHz bus system ...)

System overhead 10–100% over perfect memory

[Bar chart: same as the previous slide: CPI for Regular Bus Organization vs. 0-Cycle Bus Turnaround for 1, 2, 4, and 8 banks/channel at 1.6, 3.2, and 6.4 GB/s System Bandwidth (GB/s = Channels * Width * Speed)]
Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus
DRAM TUTORIAL

ISCA 2002
Concurrency Effects
Bruce Jacob
David Wang
University of
Maryland

the figure also shows that increasing BANKS per CHANNEL gives you almost as much benefit as increasing CHANNEL BANDWIDTH, which is much more costly to implement => clearly, there are some concurrency effects going on, and we’d like to quantify and better understand them

[Bar chart: Cycles per Instruction (CPI) vs. System Bandwidth (1.6, 3.2, and 6.4 GB/s; GB/s = Channels * Width * Speed) for 1, 2, 4, and 8 Banks per Channel]
Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus

Banks/channel as significant as channel BW


DRAM TUTORIAL

ISCA 2002
Bandwidth vs. Burst Width
Bruce Jacob
David Wang

University of
Maryland

take a look at a larger portion of the design space. x-axis: SYSTEM BANDWIDTH, which == channels x channel width x 800MHz. y-axis: execution time of the application on the given configuration. different colored bars represent different burst widths. MEMORY OVERHEAD is substantial. obvious trend: more bandwidth is better. another obvious trend: more bandwidth is NOT NECESSARILY better ...

[Bar chart: Cycles per Instruction (CPI) for 32-, 64-, and 128-Byte Bursts vs. System Bandwidth from 0.8 to 25.6 GB/s (GB/s = Channels * Width * 800MHz); configurations range from 1 channel x 1 byte to 4 channels x 8 bytes]

Benchmark = GCC (SPEC 2000), 2 banks/channel


DRAM TUTORIAL

ISCA 2002
Bandwidth vs. Burst Width
Bruce Jacob
David Wang

University of
Maryland

so -- there are some obvious trade-offs related to BURST SIZE, which can affect the TOTAL EXECUTION TIME by 30% or more, keeping all else constant.

[Bar chart: same axes as the previous slide: CPI for 32-, 64-, and 128-Byte Bursts vs. System Bandwidth from 0.8 to 25.6 GB/s]

Benchmark = GCC (SPEC 2000), 2 banks/channel


DRAM TUTORIAL

ISCA 2002
Bandwidth vs. Burst Width
Bruce Jacob
David Wang

University of
Maryland

if we look more closely at individual system organizations, there are some clear RULES of THUMB that appear ...

[Bar chart: same as previous slide]

Benchmark = GCC (SPEC 2000), 2 banks/channel


DRAM TUTORIAL

ISCA 2002
Bandwidth vs. Burst Width
Bruce Jacob
David Wang

University of
Maryland

for one, LARGE BURSTS are optimal for WIDER CHANNELS

Wide channels (32/64-bit) want large bursts

[Bar chart: same as previous slide]

Benchmark = GCC (SPEC 2000), 2 banks/channel


DRAM TUTORIAL

ISCA 2002
Bandwidth vs. Burst Width
Bruce Jacob
David Wang

University of
Maryland

for another, SMALL BURSTS are optimal for NARROW CHANNELS

Narrow channels (8-bit) want small bursts

[Bar chart: same as previous slide]

Benchmark = GCC (SPEC 2000), 2 banks/channel


DRAM TUTORIAL

ISCA 2002
Bandwidth vs. Burst Width
Bruce Jacob
David Wang

University of
Maryland

and MEDIUM BURSTS are optimal for MEDIUM CHANNELS

Medium channels (16-bit) want medium bursts

so -- if CONCURRENCY were all-important, we would expect small bursts to be best, because they would allow a LOWER AVERAGE TIME-TO-CRITICAL-WORD for a larger number of simultaneous requests. what we actually see is that the optimal burst width scales with the bus width, suggesting an optimal number of DATA TRANSFERS per BANK ACTIVATION/PRECHARGE cycle. i’ll illustrate that ...

[Bar chart: same as previous slide]

Benchmark = GCC (SPEC 2000), 2 banks/channel


DRAM TUTORIAL

ISCA 2002
Burst Width Scales with Bus
Bruce Jacob
David Wang

University of
Range of Burst-Widths Modeled
Maryland
this figure shows the entire range of burst widths that we modeled. note that some of the rows represent several different combinations ... for example, the one THIRD DOWN FROM TOP is 2-byte channel + 32-byte burst, 4-byte channel + 64-byte burst, or 8-byte channel + 128-byte burst

64-bit channel x 32-byte burst: ADDRESS BUS 10ns, DRAM BANK 90ns, DATA BUS 70ns + 5ns
64-bit x 64-byte; 32-bit x 32-byte: ADDRESS BUS 10ns, DRAM BANK 90ns, DATA BUS 70ns + 10ns
64-bit x 128-byte; 32-bit x 64-byte; 16-bit x 32-byte: ADDRESS BUS 10ns, DRAM BANK 90ns, DATA BUS 70ns + 20ns
32-bit x 128-byte; 16-bit x 64-byte; 8-bit x 32-byte: ADDRESS BUS 10ns, DRAM BANK 100ns, DATA BUS 70ns + 40ns
16-bit x 128-byte; 8-bit x 64-byte: ADDRESS BUS 10ns, DRAM BANK 140ns, DATA BUS 70ns + 80ns
8-bit x 128-byte: ADDRESS BUS 10ns, DRAM BANK 220ns, DATA BUS 70ns + 160ns
DRAM TUTORIAL

ISCA 2002
Burst Width Scales with Bus
Bruce Jacob
David Wang

University of
Range of Burst-Widths Modeled
Maryland
the optimal configurations are in the middle, suggesting an optimal number of DATA TRANSFERS per BANK ACTIVATION/PRECHARGE cycle. [BOTTOM] -- too many transfers per burst crowds out other requests. [TOP] -- too few transfers per request lets the bank overhead (activation/precharge cycle) dominate. however, though this tells us how to best organize a channel with a given bandwidth, these rules of thumb do not say anything about how the different configurations (wide/narrow/medium channels) compare to each other ... so let’s focus on multiple configurations for ONE BANDWIDTH CLASS ...

[Figure: the same range of burst widths as the previous slide, with the OPTIMAL BURST WIDTHS highlighted in the middle rows]
DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — MCF
Bruce Jacob
David Wang
University of
Maryland

this is like saying: i have 4 800MHz 8-bit Rambus channels ... what should i do? gang them together? keep them independent? something in between? like before, we see that more banks is better, but not always by much

[Bar chart: Cycles per Instruction (CPI) for MCF at 3.2 GB/s System Bandwidth (channels x width x speed); groups for 32-, 64-, and 128-Byte Bursts, each with 1 chan x 4 bytes, 2 chan x 2 bytes, and 4 chan x 1 byte, and bars for 1, 2, 4, and 8 Banks per Channel]


DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — MCF
Bruce Jacob
David Wang
University of
Maryland

with large bursts, there is less interleaving of requests, so extra banks are not needed

#Banks not particularly important given large burst sizes ...

[Bar chart: same as previous slide]


DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — MCF
Bruce Jacob
David Wang
University of
Maryland

with multiple independent channels, you have a degree of concurrency that, to some extent, OBVIATES THE NEED for BANKING

... even less so with multi-channel systems

[Bar chart: same as previous slide]


DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — MCF
Bruce Jacob
David Wang
University of
Maryland

however, trying to reduce execution time via multiple channels is a bit risky: it is very SENSITIVE to BURST SIZE, because multiple channels give the longest possible latency to EVERY REQUEST in the system, and having LONG BURSTS exacerbates that problem, by increasing the length of time that a request can be delayed by waiting for another ahead of it in the queue

Multi-channel systems sometimes (but not always) a good idea

[Bar chart: same as previous slide]


DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — MCF
Bruce Jacob
David Wang
University of
Maryland

another way to look at it: how does the choice of burst size affect the NUMBER or WIDTH of channels?
=> WIDE CHANNELS: an improvement is seen by increasing the burst size
=> NARROW CHANNELS: either no improvement is seen, or a slight degradation is seen by increasing burst size

[Bar chart: same as previous slide, with the three channel organizations labeled 4x 1-byte channels, 2x 2-byte channels, and 1x 4-byte channels]


DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — BZIP
Bruce Jacob
David Wang
University of
Maryland

we see the same trends in all the benchmarks surveyed. the only thing that changes is some of the relations between parameters ...

[Bar chart: Cycles per Instruction (CPI) for BZIP at 3.2 GB/s System Bandwidth (channels x width x speed); groups for 32-, 64-, and 128-Byte Bursts, each with 1 chan x 4 bytes, 2 chan x 2 bytes, and 4 chan x 1 byte, and bars for 1, 2, 4, and 8 Banks per Channel]
DRAM TUTORIAL

ISCA 2002
Focus on 3.2 GB/s — BZIP
Bruce Jacob
David Wang
University of
Maryland

for example, in BZIP, the best configurations are at smaller burst sizes than MCF. however, though THE OPTIMAL CONFIG changes from benchmark to benchmark, there are always several configs that are within 5-10% of the optimal config -- IN ALL BENCHMARKS

BEST CONFIGS are at SMALLER BURST SIZES

[Bar chart: same as previous slide]
DRAM TUTORIAL

ISCA 2002
Queue Size & Reordering
Bruce Jacob
David Wang
BZIP: 1.6 GB/s (1 channel)
University of
Maryland
[Bar chart: Cycles per Instruction (CPI) for BZIP at 1.6 GB/s (1 channel), comparing an Infinite Queue, a 32-Entry Queue, a 1-Entry Queue, and No Queue, for 1 to 8 banks/channel at 32-, 64-, and 128-byte bursts]
DRAM TUTORIAL

ISCA 2002
Conclusions
Bruce Jacob
David Wang

University of
Maryland

we have a complex design space where neighboring designs differ significantly. if you are careful, you can beat the performance of the average organization by 30-40%. supporting memory concurrency improves system performance, as long as it is not done at the expense of memory latency: using MULTIPLE CHANNELS is good, but not the best solution; multiple banks/channel is always a good idea; trying to interleave small bursts is intuitively appealing, but it doesn’t work; MSHRs: always a good idea. In general, bursts should be large enough to amortize the precharge cost. Direct Rambus = 16 bytes, DDR2 = 16/32 bytes; THIS IS NOT ENOUGH.

DESIGN SPACE is NON-LINEAR, COST of MISJUDGING is HIGH

CAREFUL TUNING YIELDS 30–40% GAIN

MORE CONCURRENCY == BETTER (but not at expense of LATENCY)
• Via Channels → NOT w/ LARGE BURSTS
• Via Banks → ALWAYS SAFE
• Via Bursts → DOESN’T PAY OFF
• Via MSHRs → NECESSARY

BURSTS AMORTIZE COST OF PRECHARGE
• Typical Systems: 32 bytes (even DDR2) → THIS IS NOT ENOUGH
DRAM TUTORIAL

ISCA 2002
Outline
Bruce Jacob
David Wang

University of
• Basics
Maryland
• DRAM Evolution: Structural Path
NOTE

• Advanced Basics
• DRAM Evolution: Interface Path
• Future Interface Trends & Research Areas
• Performance Modeling:
Architectures, Systems, Embedded
DRAM TUTORIAL

ISCA 2002
Embedded DRAM Primer
Bruce Jacob
David Wang

University of
Maryland

NOTE
[Figure: CPU Core and DRAM Array on the same die (Embedded) vs. CPU Core and DRAM Array on separate chips (Not Embedded)]
DRAM TUTORIAL

ISCA 2002
Whither Embedded DRAM?
Bruce Jacob
David Wang

University of
Microprocessor Report, August 1996: “[Five]
Maryland
Architects Look to Processors of Future”
NOTE
• Two predict imminent merger
of CPU and DRAM
• Another states we cannot keep cramming
more data over the pins at faster rates
(implication: embedded DRAM)
• A fourth wants gigantic on-chip L3 cache
(perhaps DRAM L3 implementation?)

SO WHAT HAPPENED?
DRAM TUTORIAL

ISCA 2002
Embedded DRAM for DSPs
Bruce Jacob
David Wang

University of
MOTIVATION
Maryland
NOTE

TAGLESS SRAM (software-managed):
SOFTWARE manages this movement of data. A move from memory space to “cache” space creates a new, equivalent data object, not a mere copy of the original. The address space includes both “cache” and primary memory (and memory-mapped I/O).
NON-TRANSPARENT addressing; EXPLICITLY MANAGED contents.

TRADITIONAL CACHE (hardware-managed):
HARDWARE manages this movement of data. The cache “covers” the entire address space: any datum in the space may be cached. The address space includes only primary memory (and memory-mapped I/O). Copying from memory to cache creates a subordinate copy of the datum that is kept consistent with the datum still in memory. Hardware ensures consistency.
TRANSPARENT addressing; TRANSPARENTLY MANAGED contents.

DSP Compilers => Transparent Cache Model
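A minimal sketch in C of the contrast (the scratch array stands in for the separately addressed on-chip SRAM, and the copies stand in for DMA transfers; none of this is the C6000's actual API):

#include <string.h>
#include <stdint.h>

#define N 256
static int16_t scratch[2 * N];   /* stands in for the on-chip, separately addressed SRAM */

/* Tagless SRAM: software explicitly creates a new copy of the working set in the
 * "cache" address space and operates on that object. */
static int32_t dot_scratchpad(const int16_t *a, const int16_t *b)
{
    memcpy(&scratch[0], a, N * sizeof *a);
    memcpy(&scratch[N], b, N * sizeof *b);
    int32_t acc = 0;
    for (int i = 0; i < N; i++)
        acc += scratch[i] * scratch[N + i];
    return acc;
}

/* Traditional cache: the same loop touches the original addresses and the
 * hardware transparently decides what is cached. */
static int32_t dot_cached(const int16_t *a, const int16_t *b)
{
    int32_t acc = 0;
    for (int i = 0; i < N; i++)
        acc += a[i] * b[i];
    return acc;
}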


DRAM TUTORIAL

ISCA 2002
DSP Buffer Organization
Bruce Jacob
David Wang
University of
Maryland
NOTE

[Figure: two DSP data-memory organizations. Used for the study: a fully associative 4-block cache made of buffer-0/buffer-1 and victim-0/victim-1 (backed by banks S0 and S1) sitting between the DSP's two load/store ports (LdSt0, LdSt1) and memory; compared with a DSP whose LdSt0/LdSt1 ports go to memory directly]

Bandwidth vs. Die-Area Trade-Off


for DSP Performance
DRAM TUTORIAL

ISCA 2002
E-DRAM Performance
Bruce Jacob
David Wang
Embedded Networking Benchmark - Patricia
University of
Maryland
NOTE

200MHz C6000 DSP : 50, 100, 200 MHz Memory

[Plot: CPI vs. Bandwidth (0.4 to 25.6 GB/s) for Cache Line Sizes of 32, 64, 128, 256, 512, and 1024 bytes]
DRAM TUTORIAL

ISCA 2002
E-DRAM Performance
Bruce Jacob
David Wang
Embedded Networking Benchmark - Patricia
University of
Maryland
NOTE

200MHz C6000 DSP : 50MHz Memory

[Plot: CPI vs. Bandwidth (0.4 to 25.6 GB/s) for Cache Line Sizes of 32 to 1024 bytes; increasing bus width moves to the right along the x-axis]
DRAM TUTORIAL

ISCA 2002
E-DRAM Performance
Bruce Jacob
David Wang
Embedded Networking Benchmark - Patricia
University of
Maryland
NOTE

200MHz C6000 DSP : 100MHz Memory

[Plot: CPI vs. Bandwidth (0.4 to 25.6 GB/s) for Cache Line Sizes of 32 to 1024 bytes; increasing bus width moves to the right along the x-axis]
DRAM TUTORIAL

ISCA 2002
E-DRAM Performance
Bruce Jacob
David Wang
Embedded Networking Benchmark - Patricia
University of
Maryland
NOTE

200MHz C6000 DSP : 200MHz Memory

[Plot: CPI vs. Bandwidth (0.4 to 25.6 GB/s) for Cache Line Sizes of 32 to 1024 bytes; increasing bus width moves to the right along the x-axis]
DRAM TUTORIAL

ISCA 2002
Performance-Data Sources
Bruce Jacob
David Wang
“A Performance Study of Contemporary DRAM Architectures,”
University of Proc. ISCA ’99. V. Cuppu, B. Jacob, B. Davis, and T. Mudge.
Maryland
“Organizational Design Trade-Offs at the DRAM, Memory Bus, and
Memory Controller Level: Initial Results,” University of Maryland
Technical Report UMD-SCA-TR-1999-2. V. Cuppu and B. Jacob.

“DDR2 and Low Latency Variants,” Memory Wall Workshop 2000, in


conjunction w/ ISCA ’00. B. Davis, T. Mudge, V. Cuppu, and B. Jacob.

“Concurrency, Latency, or System Overhead: Which Has the Largest


Impact on DRAM-System Performance?”
Proc. ISCA ’01. V. Cuppu and B. Jacob.

“Transparent Data-Memory Organizations for Digital Signal Processors,”


Proc. CASES ’01. S. Srinivasan, V. Cuppu, and B. Jacob.

“High Performance DRAMs in Workstation Environments,”


IEEE Transactions on Computers, November 2001.
V. Cuppu, B. Jacob, B. Davis, and T. Mudge.

Recent experiments by Sadagopan Srinivasan, Ph.D. student at


University of Maryland.
DRAM TUTORIAL

ISCA 2002
Acknowledgments
Bruce Jacob
David Wang

University of
The preceding work was supported
Maryland
in part by the following sources:
• NSF CAREER Award CCR-9983618
• NSF grant EIA-9806645
• NSF grant EIA-0000439
• DOD award AFOSR-F496200110374
• … and by Compaq and IBM.
DRAM TUTORIAL

ISCA 2002
CONTACT INFO
Bruce Jacob
David Wang

University of
Bruce Jacob
Maryland

Electrical & Computer Engineering


University of Maryland, College Park
http://www.ece.umd.edu/~blj/
blj@eng.umd.edu

Dave Wang

Electrical & Computer Engineering


University of Maryland, College Park
http://www.wam.umd.edu/~davewang/
davewang@wam.umd.edu

UNIVERSITY OF MARYLAND
