
On-Chip Optical Communication for Multicore Processors

Jason Miller
Carbon Research Group
MIT Computer Science and Artificial Intelligence Lab
“Moore’s Gap”

[Figure: performance (GOPS, log scale) vs. time, 1992–2010. Single-processor techniques – pipelining, superscalar, out-of-order execution, then SMT/FGMT/CGMT – flatten out while transistor counts keep climbing, opening “The Gap”; multicore and tiled multicore designs aim to close it.]

• Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
• Wire delays
• Power envelopes

2
Multicore Scaling Trends

Today:
• A few large cores on each chip
• Diminishing returns prevent cores from getting more complex
• Only option for future scaling is to add more cores
• Still some shared global structures: bus, L2 caches

Tomorrow:
• 100’s to 1000’s of simpler cores [S. Borkar, Intel, 2007]
• Simple cores are more power and area efficient
• Global structures do not scale; all resources must be distributed

[Figure: a bus-based multicore (processors and caches sharing a bus and L2 cache) beside a tiled multicore (a grid of tiles, each with a processor, memory, and switch, connected point-to-point).]

3
The Future of Multicore

• Number of cores doubles every 18 months
• Parallelism replaces clock frequency scaling and core complexity

Resulting challenges:
• Scalability
• Programming
• Power

[Chip photos: MIT Raw, Sun UltraSPARC T2, IBM XCell 8i, Tilera TILE64]

4
Multicore Challenges

• Scalability
  – How do we turn additional cores into additional performance?
  – Must accelerate single apps, not just run more apps in parallel
  – Efficient core-to-core communication is crucial
  – Architectures that grow easily with each new technology generation

• Programming
  – Traditional parallel programming techniques are hard
  – Parallel machines were rare and used only by rocket scientists
  – Multicores are ubiquitous and must be programmable by anyone

• Power
  – Already a first-order design constraint
  – More cores and more communication → more power
  – Previous tricks (e.g. lower Vdd) are running out of steam

5
Multicore Communication Today

Bus-based Interconnect
• Single shared resource
• Uniform communication cost
• Communication through memory
• Doesn’t scale to many cores due to contention and long wires
• Scalable up to about 8 cores

[Figure: processors with private caches share a bus, an L2 cache, and a connection to DRAM.]

6
Multicore Communication Tomorrow

Point-to-Point Mesh Network
• Examples: MIT Raw, Tilera TILEPro64, Intel Terascale Prototype
• Neighboring tiles are connected
• Distributed communication resources
• Non-uniform costs: latency depends on distance
• Encourages direct communication
• More energy efficient than bus
• Scalable to hundreds of cores

[Figure: a 4x4 grid of tiles, each with a processor, memory, and switch; neighboring switches are connected, and DRAM is attached at the edges of the mesh.]

7
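To make the “latency depends on distance” point concrete, here is a minimal sketch (not from the slides) that models hop count on a 2D mesh; the one-cycle-per-hop figure is an assumed illustrative parameter, not a measured value.

# Illustrative sketch: latency on a 2D mesh grows with Manhattan distance
# between tiles, so task and data placement matter.

def mesh_hops(src, dst):
    """Manhattan distance between (x, y) tile coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def mesh_latency_cycles(src, dst, cycles_per_hop=1):
    # cycles_per_hop is an assumed parameter for illustration.
    return mesh_hops(src, dst) * cycles_per_hop

if __name__ == "__main__":
    print(mesh_latency_cycles((0, 0), (0, 1)))   # neighbor: 1 hop
    print(mesh_latency_cycles((0, 0), (7, 7)))   # far corner of an 8x8 mesh: 14 hops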
Multicore Programming Trends

• Meshes and small cores solve the physical scaling challenge, but programming remains a barrier
• Parallelizing applications to thousands of cores is hard
  – Task and data partitioning
  – Communication becomes critical as latencies increase
  – Increasing contention for distant communication
    • Degraded performance, higher energy
  – Inefficient broadcast-style communication
    • Major source of contention
    • Expensive to distribute signal electrically

8
Multicore Programming Trends

• For high performance, communication and locality must be managed
• Tasks and data must be both partitioned and placed
  – Analyze communication patterns to minimize latencies
  – Place data near the code that needs it most
  – Place certain code near critical resources (e.g. DRAM, I/O)
• Dynamic, unpredictable communication is impossible to optimize
• Orchestrating communication and locality increases programming difficulty exponentially

9
Improving Programmability

Observations:
• A cheap broadcast communication mechanism can make programming easier
  – Enables convenient programming models (e.g., shared memory)
  – Reduces the need to carefully manage locality
• On-chip optical components enable cheap, energy-efficient broadcast

10
ATAC Architecture

[Figure: a 4x4 tiled electrical mesh interconnect (processor, memory, and switch per tile) augmented with an optical broadcast WDM interconnect that reaches every tile.]


11
Optical Broadcast Network

• Waveguide passes through every core
• Multiple wavelengths (WDM) eliminate contention
• Signal reaches all cores in <2 ns
• Same signal can be received by all cores

[Figure: an optical waveguide threading through the entire array of cores.]
12
Optical Broadcast Network

• Electronic-photonic integration using standard CMOS process
• Cores communicate via an optical WDM broadcast-and-select network
• Each core sends on its own dedicated wavelength using modulators
• Cores can receive from some set of senders using optical filters

[Figure: an optical waveguide connecting N cores.]

13
Optical bit transmission

• Each core sends data using a different wavelength → no contention
• Data is sent once; any or all cores can receive it → efficient broadcast

[Figure: at the sending core, a flip-flop drives a modulator driver and modulator, imprinting data from the multi-wavelength source waveguide onto the data waveguide; at the receiving core, a filter drops the wavelength onto a photodetector, whose output passes through a transimpedance amplifier into a flip-flop.]

14


Core-to-core communication

• 32-bit data words transmitted across several parallel waveguides
• Each core contains receive filters and a FIFO buffer for every sender
• Data is buffered at the receiver until needed by the processing core
• Receiver can screen data by sender (i.e. wavelength) or message type (sketched below)
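A minimal behavioral sketch of the receive path just described, assuming hypothetical class and method names (the real ATAC hardware interface is not specified here): one FIFO per sending core, filled by that sender’s wavelength, with reads screened by sender or message type.

# Behavioral model only; names are hypothetical, not the ATAC hardware API.
from collections import deque

class ReceiveBuffers:
    def __init__(self, num_senders):
        # One FIFO per sending core (i.e. per wavelength).
        self.fifos = [deque() for _ in range(num_senders)]

    def on_optical_word(self, sender_id, msg_type, word):
        # Called when the filter/photodetector for sender_id captures a 32-bit word.
        self.fifos[sender_id].append((msg_type, word))

    def receive(self, sender_id, msg_type=None):
        # Pop the next word from a chosen sender, optionally matching a message type.
        fifo = self.fifos[sender_id]
        if fifo and (msg_type is None or fifo[0][0] == msg_type):
            return fifo.popleft()[1]
        return None

def broadcast(word, sender_id, msg_type, all_receivers):
    # One optical send reaches every core's buffers; no routing, no contention.
    for rx in all_receivers:
        rx.on_optical_word(sender_id, msg_type, word)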

[Figure: two sending cores (A and B) each transmit 32-bit words; the receiving core buffers each sender’s words in its own FIFO before the processor core consumes them.]

15
ATAC Bandwidth

• 64 cores, 32 lines, 1 Gb/s per line
  – Transmit BW: 64 cores × 1 Gb/s × 32 lines = 2 Tb/s
  – Receive-weighted BW: 2 Tb/s × 63 receivers = 126 Tb/s
  – A good metric for broadcast networks – reflects WDM (worked through in the sketch below)

• ATAC allows better utilization of computational resources because less time is spent performing communication
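The bandwidth figures reduce to a few lines of arithmetic; note that the slide rounds 2048 Gb/s down to 2 Tb/s before multiplying by the 63 possible receivers.

# Reproduces the bandwidth arithmetic on this slide.
cores, bit_lines, gbps_per_line = 64, 32, 1

transmit_bw_tbps = cores * bit_lines * gbps_per_line / 1000  # 2.048, rounded to 2 Tb/s
receive_weighted_tbps = 2 * (cores - 1)                      # 2 Tb/s x 63 receivers = 126 Tb/s

print(transmit_bw_tbps, receive_weighted_tbps)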

16
System Capabilities and Performance

Baseline: Raw Multicore Chip (leading-edge tiled multicore)
• 64-core system (65 nm process)
• Peak performance: 64 GOPS
• Chip power: 24 W
• Theoretical power eff.: 2.7 GOPS/W
• Effective performance: 7.3 GOPS
• Effective power eff.: 0.3 GOPS/W
• Total system power: 150 W

ATAC Multicore Chip (future optical-interconnect multicore)
• 64-core system (65 nm process)
• Peak performance: 64 GOPS
• Chip power: 25.5 W
• Theoretical power eff.: 2.5 GOPS/W
• Effective performance: 38.0 GOPS
• Effective power eff.: 1.5 GOPS/W
• Total system power: 153 W

Optical communications require a small amount of additional system power but allow for much better utilization of computational resources.
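The efficiency lines follow directly from dividing peak and effective performance by chip power; a quick cross-check using the numbers above:

# Cross-checks the power-efficiency figures (values taken from the slide).
def efficiencies(peak_gops, effective_gops, chip_power_w):
    return peak_gops / chip_power_w, effective_gops / chip_power_w

raw  = efficiencies(peak_gops=64, effective_gops=7.3,  chip_power_w=24.0)
atac = efficiencies(peak_gops=64, effective_gops=38.0, chip_power_w=25.5)

print("Raw : %.1f GOPS/W peak, %.1f GOPS/W effective" % raw)    # 2.7, 0.3
print("ATAC: %.1f GOPS/W peak, %.1f GOPS/W effective" % atac)   # 2.5, 1.5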

17
Programming ATAC

• Cores can directly communicate with any other core in one hop (<2 ns)
  – Broadcasts require just one send
  – No complicated routing on the network required
• Cheap broadcast enables frequent global communications
  – Broadcast-based cache update / remote store protocol (sketched below)
    • All “subscribers” are notified when a writing core issues a store (“publish”)
• Uniform communication latency simplifies scheduling
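A hedged sketch of the broadcast-based update idea above: one “publish” over the optical network updates every subscriber’s local copy. Class and method names (Core, OpticalBroadcastNet, on_publish) are hypothetical; the actual ATAC coherence protocol is not detailed on this slide.

# Illustrative publish/subscribe remote-store sketch (hypothetical names).
class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.local_cache = {}          # address -> value
        self.subscriptions = set()     # addresses this core cares about

    def on_publish(self, addr, value):
        if addr in self.subscriptions:
            self.local_cache[addr] = value   # updated in place, no invalidate/refetch

class OpticalBroadcastNet:
    def __init__(self, cores):
        self.cores = cores

    def publish(self, writer, addr, value):
        writer.local_cache[addr] = value
        # One optical send; every other core sees it in a single hop.
        for core in self.cores:
            if core is not writer:
                core.on_publish(addr, value)

cores = [Core(i) for i in range(64)]
net = OpticalBroadcastNet(cores)
cores[5].subscriptions.add(0x1000)
net.publish(writer=cores[0], addr=0x1000, value=42)
assert cores[5].local_cache[0x1000] == 42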
18
Communication-centric Computing

• ATAC reduces off-chip memory calls, and hence energy and latency
• A view of extended global memory can be enabled cheaply with on-chip distributed cache memory and the ATAC network

Operation               Energy    Latency
Network transfer        3 pJ      3 cycles
ALU add operation       2 pJ      1 cycle
32 KB cache read        50 pJ     1 cycle
Off-chip memory read    500 pJ    250 cycles

[Figure: in a bus-based multicore, each trip over the bus to the L2 cache or DRAM costs ~500 pJ; in ATAC, core-to-core network transfers cost ~3 pJ.]
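Using the table above, a worked comparison of a remote on-chip cache read (request over the network, cache access, reply) against an off-chip read. The three-step composition of the on-chip case is an illustrative assumption.

# Energy/latency comparison built from the slide's table.
COST = {  # operation: (energy_pJ, latency_cycles)
    "network_transfer": (3, 3),
    "alu_add": (2, 1),
    "cache_read_32KB": (50, 1),
    "offchip_read": (500, 250),
}

def total(ops):
    energy = sum(COST[op][0] for op in ops)
    cycles = sum(COST[op][1] for op in ops)
    return energy, cycles

# Remote on-chip read: request over the network, cache read, reply over the network.
print(total(["network_transfer", "cache_read_32KB", "network_transfer"]))  # (56 pJ, 7 cycles)
# Off-chip DRAM read.
print(total(["offchip_read"]))                                             # (500 pJ, 250 cycles)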
19
Summary

• ATAC uses optical networks to enable multicore programming and performance scaling
• ATAC encourages communication-centric architecture, which helps multicore performance and power scalability
• ATAC simplifies programming with a contention-free all-to-all broadcast network
• ATAC is enabled by recent advances in CMOS integration of optical components

20
Backup Slides
What Does the Future Look Like?

Corollary of Moore’s law: the number of cores will double every 18 months.

            ‘02   ‘05   ‘08   ‘11   ‘14
Research     16    64   256  1024  4096
Industry      4    16    64   256  1024

1K cores by 2014! Are we ready?
(Cores minimally big enough to run a self-respecting OS)
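The table is simply repeated doubling every 18 months from the 2002 starting points:

# Core-count projection: doubling every 18 months (x4 every 3 years).
def cores(start_cores, start_year, year, months_to_double=18):
    return start_cores * 2 ** ((year - start_year) * 12 // months_to_double)

print([cores(16, 2002, y) for y in (2002, 2005, 2008, 2011, 2014)])  # [16, 64, 256, 1024, 4096]
print([cores(4,  2002, y) for y in (2002, 2005, 2008, 2011, 2014)])  # [4, 16, 64, 256, 1024]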


22
Scaling to 1000 Cores

• Purely optical design scales to about 64 cores
• After that, clusters of cores share optical hubs
  – 64 optically-connected clusters; electrical networks connect 16 cores to each optical hub
  – ENet and BNet move data to/from the optical hub
  – Dedicated, special-purpose electrical networks

[Figure: within a cluster, processors (Proc), caches ($), and directory caches (Dir $) connect through the electrical ENet and BNet to a HUB on the chip-wide optical ONet, with memory attached to the cluster.]

23
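A sketch of the clustered organization on the slide above: 64 clusters of 16 cores (1024 cores total), with an electrical ENet hop to the cluster hub, an optical ONet broadcast between hubs, and electrical BNet distribution within each destination cluster. The network names come from the slide; the specific three-stage path is an assumption for illustration.

# Hierarchical broadcast path sketch (ENet/BNet/ONet names from the slide;
# the routing stages are an illustrative assumption).
CORES_PER_CLUSTER = 16
NUM_CLUSTERS = 64          # 64 x 16 = 1024 cores

def cluster_of(core_id):
    return core_id // CORES_PER_CLUSTER

def broadcast_path(src_core):
    """Stages a broadcast passes through in the clustered design."""
    hub = cluster_of(src_core)
    return [
        f"core {src_core} -> hub {hub} (electrical ENet)",
        f"hub {hub} -> all {NUM_CLUSTERS} hubs (optical ONet broadcast)",
        f"each hub -> its {CORES_PER_CLUSTER} cores (electrical BNet)",
    ]

print("\n".join(broadcast_path(src_core=300)))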
ATAC is an Efficient Network

• Modulators are the primary source of power consumption
  – Receive power: requires only ~2 fJ/bit even with -5 dB link loss
  – Modulator power: Ge-Si EA design ~75 fJ/bit (assumes 50 fJ/bit for the modulator driver)

• Example: 64-core communication
  (i.e. N = 64 cores = 64 wavelengths; for a 32-bit word: 2048 drops/core and 32 adds/core)
  – Receive power: 2 fJ/bit × 1 Gbit/s × 32 bits × N² = 262 mW
  – Modulator power: 75 fJ/bit × 1 Gbit/s × 32 bits × N = 153 mW
  – Total energy/bit = 75 fJ/bit + 2 fJ/bit × (N-1) = 201 fJ/bit

• Comparison: electrical broadcast across 64 cores
  – Requires 64 × 150 fJ/bit ≈ 10 pJ/bit (~50× more energy per bit)
    (Assumes 150 fJ/mm/bit, 1-mm spaced tiles)
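The power and energy figures above reduce to straightforward arithmetic (note that the network-wide totals come out in milliwatts):

# Reproduces the power/energy arithmetic from this slide.
N, BITS, RATE = 64, 32, 1e9          # cores (= wavelengths), bits per word, 1 Gb/s per line
RX_FJ, MOD_FJ = 2, 75                # fJ/bit to receive and to modulate (incl. driver)

rx_power_mw  = RX_FJ  * 1e-15 * RATE * BITS * N * N / 1e-3   # ~262 mW
mod_power_mw = MOD_FJ * 1e-15 * RATE * BITS * N     / 1e-3   # ~154 mW
energy_per_bit_fj = MOD_FJ + RX_FJ * (N - 1)                 # 201 fJ/bit

electrical_fj = 64 * 150                                     # ~9.6 pJ/bit electrical broadcast
ratio = electrical_fj / energy_per_bit_fj                    # ~48x, i.e. the slide's ~50x
print(rx_power_mw, mod_power_mw, energy_per_bit_fj, ratio)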

24
