
On-Chip Optical Communication for Multicore Processors

Jason Miller
Carbon Research Group
MIT Computer Science and Artificial Intelligence Lab
“Moore’s Gap”

[Figure: performance (GOPS, log scale) vs. time, 1992–2010. Single-processor techniques – pipelining, superscalar, out-of-order execution, then SMT/FGMT/CGMT – flatten out while transistor counts keep climbing, opening “The Gap”; multicore and tiled multicore designs aim to close it.]

• Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
• Wire delays
• Power envelopes

2
Multicore Scaling Trends

Today:
• A few large cores on each chip
• Diminishing returns prevent cores from getting more complex
• Only option for future scaling is to add more cores
• Still some shared global structures: bus, L2 caches

Tomorrow:
• 100’s to 1000’s of simpler cores [S. Borkar, Intel, 2007]
• Simple cores are more power and area efficient
• Global structures do not scale; all resources must be distributed

[Figure: a bus-based multicore (processors and caches sharing a bus and L2 cache) beside a tiled multicore (a grid of tiles, each with a processor, memory, and switch, connected point-to-point).]

3
The Future of Multicore

• Number of cores doubles every 18 months
• Parallelism replaces clock frequency scaling and core complexity

Resulting challenges:
• Scalability
• Programming
• Power

[Chip photos: MIT Raw, Sun UltraSPARC T2, IBM XCell 8i, Tilera TILE64]

4
Multicore Challenges

• Scalability
  – How do we turn additional cores into additional performance?
  – Must accelerate single apps, not just run more apps in parallel
  – Efficient core-to-core communication is crucial
  – Architectures that grow easily with each new technology generation

• Programming
  – Traditional parallel programming techniques are hard
  – Parallel machines were rare and used only by rocket scientists
  – Multicores are ubiquitous and must be programmable by anyone

• Power
  – Already a first-order design constraint
  – More cores and more communication → more power
  – Previous tricks (e.g. lower Vdd) are running out of steam

5
Multicore Communication Today

Bus-based Interconnect
• Single shared resource
• Uniform communication cost
• Communication through memory
• Doesn’t scale to many cores due to contention and long wires
• Scalable up to about 8 cores

[Figure: processors with private caches share a bus, an L2 cache, and a connection to DRAM.]

6
Multicore Communication Tomorrow

Point-to-Point Mesh Network
• Examples: MIT Raw, Tilera TILEPro64, Intel Terascale Prototype
• Neighboring tiles are connected
• Distributed communication resources
• Non-uniform costs: latency depends on distance
• Encourages direct communication
• More energy efficient than bus
• Scalable to hundreds of cores

[Figure: a 4x4 grid of tiles, each with a processor, memory, and switch; neighboring switches are connected, and DRAM is attached at the edges of the mesh.]

7
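To make the “latency depends on distance” point concrete, here is a minimal sketch (not from the slides) that models hop count on a 2D mesh; the one-cycle-per-hop figure is an assumed illustrative parameter, not a measured value.

# Illustrative sketch: latency on a 2D mesh grows with Manhattan distance
# between tiles, so task and data placement matter.

def mesh_hops(src, dst):
    """Manhattan distance between (x, y) tile coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def mesh_latency_cycles(src, dst, cycles_per_hop=1):
    # cycles_per_hop is an assumed parameter for illustration.
    return mesh_hops(src, dst) * cycles_per_hop

if __name__ == "__main__":
    print(mesh_latency_cycles((0, 0), (0, 1)))   # neighbor: 1 hop
    print(mesh_latency_cycles((0, 0), (7, 7)))   # far corner of an 8x8 mesh: 14 hops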
Multicore Programming Trends

• Meshes and small cores solve the physical scaling challenge, but programming remains a barrier
• Parallelizing applications to thousands of cores is hard
  – Task and data partitioning
  – Communication becomes critical as latencies increase
  – Increasing contention for distant communication
    • Degraded performance, higher energy
  – Inefficient broadcast-style communication
    • Major source of contention
    • Expensive to distribute signal electrically

8
Multicore Programming Trends

• For high performance, communication and locality must be managed
• Tasks and data must be both partitioned and placed
  – Analyze communication patterns to minimize latencies
  – Place data near the code that needs it most
  – Place certain code near critical resources (e.g. DRAM, I/O)
• Dynamic, unpredictable communication is impossible to optimize
• Orchestrating communication and locality increases programming difficulty exponentially

9
Improving Programmability

Observations:
• A cheap broadcast communication mechanism can make programming easier
  – Enables convenient programming models (e.g., shared memory)
  – Reduces the need to carefully manage locality
• On-chip optical components enable cheap, energy-efficient broadcast

10
ATAC Architecture

[Figure: a 4x4 tiled electrical mesh interconnect (processor, memory, and switch per tile) augmented with an optical broadcast WDM interconnect that reaches every tile.]


11
Optical Broadcast Network

• Waveguide passes through every core
• Multiple wavelengths (WDM) eliminate contention
• Signal reaches all cores in <2 ns
• Same signal can be received by all cores

[Figure: an optical waveguide threading through the entire array of cores.]
12
Optical Broadcast Network

• Electronic-photonic integration using standard CMOS process
• Cores communicate via an optical WDM broadcast-and-select network
• Each core sends on its own dedicated wavelength using modulators
• Cores can receive from some set of senders using optical filters

[Figure: an optical waveguide connecting N cores.]

13
Optical bit transmission

• Each core sends data using a different wavelength → no contention
• Data is sent once; any or all cores can receive it → efficient broadcast

[Figure: at the sending core, a flip-flop drives a modulator driver and modulator, imprinting data from the multi-wavelength source waveguide onto the data waveguide; at the receiving core, a filter drops the wavelength onto a photodetector, whose output passes through a transimpedance amplifier into a flip-flop.]

14


Core-to-core communication

• 32-bit data words transmitted across several parallel waveguides
• Each core contains receive filters and a FIFO buffer for every sender
• Data is buffered at the receiver until needed by the processing core
• Receiver can screen data by sender (i.e. wavelength) or message type (sketched below)
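A minimal behavioral sketch of the receive path just described, assuming hypothetical class and method names (the real ATAC hardware interface is not specified here): one FIFO per sending core, filled by that sender’s wavelength, with reads screened by sender or message type.

# Behavioral model only; names are hypothetical, not the ATAC hardware API.
from collections import deque

class ReceiveBuffers:
    def __init__(self, num_senders):
        # One FIFO per sending core (i.e. per wavelength).
        self.fifos = [deque() for _ in range(num_senders)]

    def on_optical_word(self, sender_id, msg_type, word):
        # Called when the filter/photodetector for sender_id captures a 32-bit word.
        self.fifos[sender_id].append((msg_type, word))

    def receive(self, sender_id, msg_type=None):
        # Pop the next word from a chosen sender, optionally matching a message type.
        fifo = self.fifos[sender_id]
        if fifo and (msg_type is None or fifo[0][0] == msg_type):
            return fifo.popleft()[1]
        return None

def broadcast(word, sender_id, msg_type, all_receivers):
    # One optical send reaches every core's buffers; no routing, no contention.
    for rx in all_receivers:
        rx.on_optical_word(sender_id, msg_type, word)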

[Figure: two sending cores (A and B) each transmit 32-bit words; the receiving core buffers each sender’s words in its own FIFO before the processor core consumes them.]

15
ATAC Bandwidth

• 64 cores, 32 lines, 1 Gb/s per line
  – Transmit BW: 64 cores × 1 Gb/s × 32 lines = 2 Tb/s
  – Receive-weighted BW: 2 Tb/s × 63 receivers = 126 Tb/s
  – A good metric for broadcast networks – reflects WDM (worked through in the sketch below)

• ATAC allows better utilization of computational resources because less time is spent performing communication
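The bandwidth figures reduce to a few lines of arithmetic; note that the slide rounds 2048 Gb/s down to 2 Tb/s before multiplying by the 63 possible receivers.

# Reproduces the bandwidth arithmetic on this slide.
cores, bit_lines, gbps_per_line = 64, 32, 1

transmit_bw_tbps = cores * bit_lines * gbps_per_line / 1000  # 2.048, rounded to 2 Tb/s
receive_weighted_tbps = 2 * (cores - 1)                      # 2 Tb/s x 63 receivers = 126 Tb/s

print(transmit_bw_tbps, receive_weighted_tbps)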

16
System Capabilities and Performance

Baseline: Raw Multicore Chip (leading-edge tiled multicore)
• 64-core system (65 nm process)
• Peak performance: 64 GOPS
• Chip power: 24 W
• Theoretical power eff.: 2.7 GOPS/W
• Effective performance: 7.3 GOPS
• Effective power eff.: 0.3 GOPS/W
• Total system power: 150 W

ATAC Multicore Chip (future optical-interconnect multicore)
• 64-core system (65 nm process)
• Peak performance: 64 GOPS
• Chip power: 25.5 W
• Theoretical power eff.: 2.5 GOPS/W
• Effective performance: 38.0 GOPS
• Effective power eff.: 1.5 GOPS/W
• Total system power: 153 W

Optical communications require a small amount of additional system power but allow for much better utilization of computational resources.
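The efficiency lines follow directly from dividing peak and effective performance by chip power; a quick cross-check using the numbers above:

# Cross-checks the power-efficiency figures (values taken from the slide).
def efficiencies(peak_gops, effective_gops, chip_power_w):
    return peak_gops / chip_power_w, effective_gops / chip_power_w

raw  = efficiencies(peak_gops=64, effective_gops=7.3,  chip_power_w=24.0)
atac = efficiencies(peak_gops=64, effective_gops=38.0, chip_power_w=25.5)

print("Raw : %.1f GOPS/W peak, %.1f GOPS/W effective" % raw)    # 2.7, 0.3
print("ATAC: %.1f GOPS/W peak, %.1f GOPS/W effective" % atac)   # 2.5, 1.5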

17
Programming ATAC

• Cores can directly communicate with any other core in one hop (<2 ns)
  – Broadcasts require just one send
  – No complicated routing on the network required
• Cheap broadcast enables frequent global communications
  – Broadcast-based cache update / remote store protocol (sketched below)
    • All “subscribers” are notified when a writing core issues a store (“publish”)
• Uniform communication latency simplifies scheduling
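A hedged sketch of the broadcast-based update idea above: one “publish” over the optical network updates every subscriber’s local copy. Class and method names (Core, OpticalBroadcastNet, on_publish) are hypothetical; the actual ATAC coherence protocol is not detailed on this slide.

# Illustrative publish/subscribe remote-store sketch (hypothetical names).
class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.local_cache = {}          # address -> value
        self.subscriptions = set()     # addresses this core cares about

    def on_publish(self, addr, value):
        if addr in self.subscriptions:
            self.local_cache[addr] = value   # updated in place, no invalidate/refetch

class OpticalBroadcastNet:
    def __init__(self, cores):
        self.cores = cores

    def publish(self, writer, addr, value):
        writer.local_cache[addr] = value
        # One optical send; every other core sees it in a single hop.
        for core in self.cores:
            if core is not writer:
                core.on_publish(addr, value)

cores = [Core(i) for i in range(64)]
net = OpticalBroadcastNet(cores)
cores[5].subscriptions.add(0x1000)
net.publish(writer=cores[0], addr=0x1000, value=42)
assert cores[5].local_cache[0x1000] == 42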
18
Communication-centric Computing

• ATAC reduces off-chip memory calls, and hence energy and latency
• A view of extended global memory can be enabled cheaply with on-chip distributed cache memory and the ATAC network

Operation               Energy    Latency
Network transfer        3 pJ      3 cycles
ALU add operation       2 pJ      1 cycle
32 KB cache read        50 pJ     1 cycle
Off-chip memory read    500 pJ    250 cycles

[Figure: in a bus-based multicore, each trip over the bus to the L2 cache or DRAM costs ~500 pJ; in ATAC, core-to-core network transfers cost ~3 pJ.]
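Using the table above, a worked comparison of a remote on-chip cache read (request over the network, cache access, reply) against an off-chip read. The three-step composition of the on-chip case is an illustrative assumption.

# Energy/latency comparison built from the slide's table.
COST = {  # operation: (energy_pJ, latency_cycles)
    "network_transfer": (3, 3),
    "alu_add": (2, 1),
    "cache_read_32KB": (50, 1),
    "offchip_read": (500, 250),
}

def total(ops):
    energy = sum(COST[op][0] for op in ops)
    cycles = sum(COST[op][1] for op in ops)
    return energy, cycles

# Remote on-chip read: request over the network, cache read, reply over the network.
print(total(["network_transfer", "cache_read_32KB", "network_transfer"]))  # (56 pJ, 7 cycles)
# Off-chip DRAM read.
print(total(["offchip_read"]))                                             # (500 pJ, 250 cycles)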
19
Summary

• ATAC uses optical networks to enable multicore programming and performance scaling
• ATAC encourages communication-centric architecture, which helps multicore performance and power scalability
• ATAC simplifies programming with a contention-free all-to-all broadcast network
• ATAC is enabled by recent advances in CMOS integration of optical components

20
Backup Slides
What Does the Future Look Like?

Corollary of Moore’s law: the number of cores will double every 18 months.

            ‘02   ‘05   ‘08   ‘11   ‘14
Research     16    64   256  1024  4096
Industry      4    16    64   256  1024

1K cores by 2014! Are we ready?
(Cores minimally big enough to run a self-respecting OS)
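The table is simply repeated doubling every 18 months from the 2002 starting points:

# Core-count projection: doubling every 18 months (x4 every 3 years).
def cores(start_cores, start_year, year, months_to_double=18):
    return start_cores * 2 ** ((year - start_year) * 12 // months_to_double)

print([cores(16, 2002, y) for y in (2002, 2005, 2008, 2011, 2014)])  # [16, 64, 256, 1024, 4096]
print([cores(4,  2002, y) for y in (2002, 2005, 2008, 2011, 2014)])  # [4, 16, 64, 256, 1024]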


22
Scaling to 1000 Cores

• Purely optical design scales to about 64 cores
• After that, clusters of cores share optical hubs
  – 64 optically-connected clusters; electrical networks connect 16 cores to each optical hub
  – ENet and BNet move data to/from the optical hub
  – Dedicated, special-purpose electrical networks

[Figure: within a cluster, processors (Proc), caches ($), and directory caches (Dir $) connect through the electrical ENet and BNet to a HUB on the chip-wide optical ONet, with memory attached to the cluster.]

23
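A sketch of the clustered organization on the slide above: 64 clusters of 16 cores (1024 cores total), with an electrical ENet hop to the cluster hub, an optical ONet broadcast between hubs, and electrical BNet distribution within each destination cluster. The network names come from the slide; the specific three-stage path is an assumption for illustration.

# Hierarchical broadcast path sketch (ENet/BNet/ONet names from the slide;
# the routing stages are an illustrative assumption).
CORES_PER_CLUSTER = 16
NUM_CLUSTERS = 64          # 64 x 16 = 1024 cores

def cluster_of(core_id):
    return core_id // CORES_PER_CLUSTER

def broadcast_path(src_core):
    """Stages a broadcast passes through in the clustered design."""
    hub = cluster_of(src_core)
    return [
        f"core {src_core} -> hub {hub} (electrical ENet)",
        f"hub {hub} -> all {NUM_CLUSTERS} hubs (optical ONet broadcast)",
        f"each hub -> its {CORES_PER_CLUSTER} cores (electrical BNet)",
    ]

print("\n".join(broadcast_path(src_core=300)))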
ATAC is an Efficient Network

• Modulators are the primary source of power consumption
  – Receive power: requires only ~2 fJ/bit even with -5 dB link loss
  – Modulator power: Ge-Si EA design ~75 fJ/bit (assumes 50 fJ/bit for the modulator driver)

• Example: 64-core communication
  (i.e. N = 64 cores = 64 wavelengths; for a 32-bit word: 2048 drops/core and 32 adds/core)
  – Receive power: 2 fJ/bit × 1 Gbit/s × 32 bits × N² = 262 mW
  – Modulator power: 75 fJ/bit × 1 Gbit/s × 32 bits × N = 153 mW
  – Total energy/bit = 75 fJ/bit + 2 fJ/bit × (N-1) = 201 fJ/bit

• Comparison: electrical broadcast across 64 cores
  – Requires 64 × 150 fJ/bit ≈ 10 pJ/bit (~50× more energy per bit)
    (Assumes 150 fJ/mm/bit, 1-mm spaced tiles)
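The power and energy figures above reduce to straightforward arithmetic (note that the network-wide totals come out in milliwatts):

# Reproduces the power/energy arithmetic from this slide.
N, BITS, RATE = 64, 32, 1e9          # cores (= wavelengths), bits per word, 1 Gb/s per line
RX_FJ, MOD_FJ = 2, 75                # fJ/bit to receive and to modulate (incl. driver)

rx_power_mw  = RX_FJ  * 1e-15 * RATE * BITS * N * N / 1e-3   # ~262 mW
mod_power_mw = MOD_FJ * 1e-15 * RATE * BITS * N     / 1e-3   # ~154 mW
energy_per_bit_fj = MOD_FJ + RX_FJ * (N - 1)                 # 201 fJ/bit

electrical_fj = 64 * 150                                     # ~9.6 pJ/bit electrical broadcast
ratio = electrical_fj / energy_per_bit_fj                    # ~48x, i.e. the slide's ~50x
print(rx_power_mw, mod_power_mw, energy_per_bit_fj, ratio)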

24
