Tutorial
Agenda
[Figure: Flops (SP), log scale from 10 to 1,000,000, versus year (1993 to 2004), for Game Processors and PC Processors]
[Figure: system block diagram. Labels: Processor, Memory, Northbridge, Southbridge, IO, Next-Gen, Accel]
Memory Wall
• Latency induced bandwidth limitations
Power Wall
• Must improve efficiency and performance equally
Frequency Wall
• Diminishing returns from deeper pipelines
(can be negative if power is taken into account)
Cell
Cell History
IBM, SCEI/Sony, Toshiba Alliance formed in 2000
Design Center opened in March 2001
Based in Austin, Texas
February 7, 2005: First external technical disclosures
• Cell Broadband Engine Architecture documentation can be found at:
http://www.ibm.com/developerworks/power/cell
• Additional publications on Cell can be downloaded from:
http://www.ibm.com/chips/techlib/techlib.nsf/products/Cell
• A paper on Cell in the upcoming issue of the IBM Journal of Research and Development can be found at:
http://www.research.ibm.com/journal/rd/494/kahle.html
Cell Highlights
Supercomputer on a chip
Multi-core microprocessor (9 cores)
3.2 GHz clock frequency
10x performance for many applications
Digital home to distributed computing
Introducing Cell
Sets a new performance standard
• Exploits parallelism while achieving high frequency
• Supercomputer attributes with extreme floating point capabilities
• Sustains high memory bandwidth with smart DMA controllers
Designed for natural human interaction
• Photo-realistic effects
• Predictable real-time response
• Virtualized resources for concurrent activities
Designed for flexibility
• Wide variety of application domains
• Highly abstracted to highly exploitable programming models
• Reconfigurable I/O interfaces
• Autonomic power management
[Figure: system diagram. Labels: Network Processor, NIC, CPU, Security Processor, Media Processor, Hardwired, Programmable, Synergistic Processor, Config. IO]
Cell Architecture is … 64b Power Architecture™
[Figure: Power ISA cores, each with MMU/BIU, on a COHERENT BUS with Memory and IO. Other labels: transl.]
Incl. coherence/memory
compatible with 32/64b Power Arch. Applications and OS’s
14 STI Technology © 2005 IBM Corporation
IBM Systems and Technology Group
[Figure: the same diagram extended in two steps. Labels: COHERENT BUS (+RAG), Memory, IO, transl., MMU/DMA +RMT, Local Store, LS Alias, Syn. Proc. ISA]
[Figure: relative Power and Frequency (0 to 3.5) versus Voltage (0.9 to 1.3)]
Custom Designed
– for high frequency, space and power efficiency
SPE provides computational performance
• Dual issue, up to 8-way 32-bit SIMD
• Dedicated resources: 128 128-bit RF, 256KB Local Store
• Each DMA MMU can be dynamically configured to protect resources
• Dedicated DMA engine: Up to 16 outstanding requests
[Figure: Cell chip floorplan. Labels: Power Core (PPE), L2 Cache, NCU, MIC, SPU, Local Store, AUC, MFC; interconnect labeled 96 Byte/Cycle, 300+ GB/sec @ 3.2 GHz]
Atomic Update Cache (AUC)
• 4 line cache for shared memory atomic update primitives
• High performance DMA unit
• Local Store aliased into PPE system memory
• 16 element SPE side DMA Command Queue
• 8 element PPE side DMA Command Queue
• DMA List supports 2K entry scatter/gather
• MFC-MMU controls SPE DMA accesses
• Full memory protection for MFC DMA
• S/W controllable from PPE MMIO
• DMA 1,2,4,8,16,128 -> 16Kbyte transfers for I/O access
[Figure: Cell chip floorplan. Labels: Power Core (PPE), NCU, MIC, SPU, Local Store, AUC, MFC; interconnect labeled 96 Byte/Cycle, 300+ GB/sec @ 3.2GHz]
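The DMA limits on this slide (single transfers up to 16 KB, DMA lists of up to 2K entries) imply that a larger buffer has to be split into list elements. The following is a minimal sketch of that bookkeeping only. The function name `build_dma_list` and the `(address, size)` tuple layout are hypothetical and do not correspond to any Cell API, and alignment and size-granularity rules are ignored.

```python
MAX_DMA_BYTES = 16 * 1024   # largest single transfer quoted on the slide
MAX_LIST_ENTRIES = 2048     # "2K entry scatter/gather"

def build_dma_list(base_addr, total_bytes):
    """Split [base_addr, base_addr + total_bytes) into (address, size) entries."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    entries = []
    offset = 0
    while offset < total_bytes:
        # Each entry moves at most MAX_DMA_BYTES; the last one may be shorter.
        size = min(MAX_DMA_BYTES, total_bytes - offset)
        entries.append((base_addr + offset, size))
        offset += size
    if len(entries) > MAX_LIST_ENTRIES:
        raise ValueError("transfer does not fit in a single DMA list")
    return entries

entries = build_dma_list(0x10000, 40 * 1024)
print(len(entries), [size for _, size in entries])
```

Under these two limits alone, one list could describe up to 2048 x 16 KB = 32 MB of data.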
Broadband Interface Controller (BIC):
Provides a wide connection to external devices
Two configurable interfaces (50+GB/s @ 5Gbps)
– Configurable number of bytes
– Coherent (BIF) and / or I/O (IOIF)
[Figure: Cell chip floorplan with interfaces IOIF0 and IOIF1. Labels: 96 Byte/Cycle, 20 GB/sec, BIF or IOIF0, I/O]
Internal Interrupt Controller (IIC)
Handles SPE Interrupts
Handles External Interrupts
– From Coherent Interconnect
– From IOIF0 or IOIF1
[Figure: Cell chip floorplan with interfaces IOIF0 and IOIF1. Labels: 96 Byte/Cycle, 20 GB/sec, BIF or IOIF0, I/O]
– I/O Segments (256 MB)
– I/O Pages (4K, 64K, 1M, 16M byte)
I/O Device Identifier per page for LPAR
IOST and IOPT Cache – hardware / software managed
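To make the segment and page sizes above concrete, here is a small sketch that splits an I/O address into a segment number, a page number within the segment, and a page offset. The field layout is an assumption for illustration only; the actual IOST and IOPT entry formats are not given on this slide, and the function names are hypothetical.

```python
SEGMENT_BYTES = 256 * 1024 * 1024  # 256 MB I/O segment
PAGE_SIZES = {"4K": 4 << 10, "64K": 64 << 10, "1M": 1 << 20, "16M": 16 << 20}

def split_io_address(addr, page="4K"):
    """Return (segment, page index within segment, offset within page)."""
    page_bytes = PAGE_SIZES[page]
    segment = addr // SEGMENT_BYTES
    within_segment = addr % SEGMENT_BYTES
    return segment, within_segment // page_bytes, within_segment % page_bytes

def pages_per_segment(page):
    """Number of pages of the given size that cover one 256 MB segment."""
    return SEGMENT_BYTES // PAGE_SIZES[page]

print(split_io_address(0x12345678, "4K"))
print({name: pages_per_segment(name) for name in PAGE_SIZES})
```

With 4K pages a segment spans 65,536 pages; with 16M pages it spans only 16, which is the usual trade-off between translation-table size and mapping granularity.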
Token Manager – Bandwidth Reservation for shared resources
• Optionally Enabled for RT Tasks or LPAR
• Multiple Resource Allocation Groups
• Generates access tokens at configurable rate for each allocation group
– 1 per each memory bank (16 total)
– 2 for each IOIF (4 total)
[Figure: Cell chip floorplan. Labels: MIC, 96 Byte/Cycle, 256GB/sec @ 4GHz, IOIF0, IOIF1, 5 GB/sec, 20 GB/sec, Southbridge, BIF or IOIF1, I/O]
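The idea of generating access tokens at a configurable rate per allocation group can be illustrated with a toy model. This is only a sketch of rate-based token allocation in general, not the hardware's actual algorithm; the class, the group names, and the rates are all hypothetical.

```python
class TokenManager:
    """Toy model: each allocation group earns tokens at its own rate."""

    def __init__(self, rates, capacity=1.0):
        # rates: tokens generated per cycle for each group (made-up values)
        self.rates = dict(rates)
        self.capacity = capacity
        self.tokens = {group: 0.0 for group in rates}

    def tick(self):
        """Advance one cycle, accumulating tokens up to the capacity."""
        for group, rate in self.rates.items():
            self.tokens[group] = min(self.capacity, self.tokens[group] + rate)

    def request(self, group):
        """Grant access if the group holds a whole token, consuming it."""
        if self.tokens[group] >= 1.0:
            self.tokens[group] -= 1.0
            return True
        return False

def granted_accesses(rates, cycles):
    """Count grants per group when every group requests on every cycle."""
    manager = TokenManager(rates)
    grants = {group: 0 for group in rates}
    for _ in range(cycles):
        manager.tick()
        for group in rates:
            if manager.request(group):
                grants[group] += 1
    return grants

print(granted_accesses({"realtime": 0.5, "batch": 0.25}, cycles=8))
```

Even when both groups request on every cycle, the group with the higher token rate receives proportionally more accesses, which is the property a bandwidth reservation scheme needs for real-time tasks.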
Power Processor Element (PPE):
• General Purpose, 64-bit RISC Processor (PowerPC AS 2.0)
• 2-Way Hardware Multithreaded
• L1 : 32KB I ; 32KB D
• L2 : 512KB
• Coherent load/store
• VMX
• 3.2 GHz
Synergistic Processor Elements (SPE):
• Integer and Floating Point capable
• 256KB Local Store
• Up to 25.6 GF/s per SPE --- 200GF/s total *
* At clock speed of 3.2GHz
Internal Interconnect:
• Coherent ring structure
• 300+ GB/s total internal interconnect bandwidth
• DMA control to/from SPEs supports >100 outstanding memory requests
[Figure: Cell chip floorplan. Labels: Power Core (PPE), L2 Cache, NCU, Element Interconnect Bus, SPU, Local Store, AUC, MFC, 20 GB/sec Coherent Interconnect, 5 GB/sec I/O Bus]
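The "300+ GB/s" interconnect figure is consistent with the "96 Byte/Cycle" label used on the chip diagrams at a 3.2 GHz clock, assuming peak bandwidth is simply bytes per cycle times clock rate. A quick check, with a hypothetical helper name:

```python
def bandwidth_gb_per_s(bytes_per_cycle, clock_ghz):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) for a given width and clock."""
    return bytes_per_cycle * clock_ghz

eib = bandwidth_gb_per_s(96, 3.2)
print(f"{eib:.1f} GB/s")
```

This is a peak figure; sustained bandwidth depends on traffic patterns and is not something this arithmetic can predict.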
SPE Highlights
RISC like organization
• 32 bit fixed instructions
• Clean design – unified Register file
User-mode architecture
• No translation/protection within SPU
• DMA is full Power Arch protect/x-late
VMX-like SIMD dataflow
• Broad set of operations (8 / 16 / 32 Byte)
• Graphics SP-Float
• IEEE DP-Float
Unified register file
Local Store “is” large 2nd level register file / private instruction store instead of cache
• Asynchronous transfer (DMA) to shared memory
• Frontal attack on the Memory Wall
Media Unit turned into a Processor
• Unified (large) Register File
• One context
• SIMD architecture
[Figure: SPE block diagram. Labels: LS, DP, SFP, FXU EVN, FXU ODD, FWD, GPR, CONTROL, CHANNEL, DMA, SMM, ATO, SMF, RTB, SBI]
SPE Detail
Synergistic Processor Element (SPE)
User-mode architecture
• No translation/protection within SPE
• DMA is full PowerPC protect/xlate
Direct programmer control
• DMA/DMA-list
• Branch hint
VMX-like SIMD dataflow
• Graphics SP-Float
• No saturate arith, some byte
• IEEE DP-Float (BlueGene-like)
Unified register file
• 128 entry x 128 bit
256KB Local Store
• Combined I & D
• 16B/cycle L/S bandwidth
• 128B/cycle DMA bandwidth
SPU Units:
• Simple (FXU even): Add/Compare, Rotate, Logical, Count Leading Zero
• Permute (FXU odd): Permute, Table-lookup
• FPU (Single / Double Precision)
• Control (SCN): Dual Issue, Load/Store, ECC Handling
• Channel (SSC) – Interface to MFC
• Register File (GPR/FWD)
SPE Latencies
• Simple fixed point - 2 cycles*
• Complex fixed point - 4 cycles*
• Load - 6 cycles*
• Single-precision (ER) float - 6 cycles*
• Integer multiply - 7 cycles*
• Branch miss - 20 cycles (no penalty if correctly hinted)
• DP (IEEE) float - 13 cycles* (partially pipelined)
• Enqueue DMA Command - 20 cycles*
[Figure: SPE block diagram. Labels: LS, DP, SFP, FXU EVN, FXU ODD, FWD, GPR, CONTROL, CHANNEL, DMA, SMM, ATO, RTB, SBI]
MFC Detail
Memory Flow Control System
• DMA Unit
• LS <-> LS, LS <-> Sys Memory, LS <-> I/O Transfers
• 8 PPE-side Command Queue entries
• 16 SPU-side Command Queue entries
[Figure: MFC block diagram within the SPE. Labels: SPU, Local Store, DMA Engine, DMA Queue, Xlate, Ld/St, MMIO. Legend: Data Bus, Snoop Bus, Control Bus]
[Figure: PPE pipeline diagram. Labels: Pre-Decode, Decode, Dependency, Issue (Thread A, Thread B), L1 Data Cache, Load/Store Unit, Fixed-Point Unit, Branch Execution Unit, VMX/FPU Issue (Queue), VMX Arith./Logic Unit, VMX Load/Store/Permute, FPU Arith/Logic Unit, FPU Load/Store, Completion/Flush, VMX Completion, FPU Completion]
[Figure: pipeline stages. Instruction Cache and Buffer: IC1 IC2 IC3 IC4 IB1 IB2. Instruction Decode and Issue: ID1 ID2 ID3 IS1 IS2 IS3. Branch Prediction: BP1 BP2 BP3 BP4]
[Figure: Instruction Issue Unit / Instruction Line Buffer, 128B Read, 128B Write. Stages: IF1 IF2 IF3 IF4 IF5 IB1 IB2 ID1 ID2 ID3 IS1 IS2]
CellBE Processor
~250M transistors
~235 mm²
Top frequency >3GHz
9 cores, 10 threads
> 200+ GFlops (SP) @3.2 GHz
> 20+ GFlops (DP) @3.2 GHz
Up to 25.6GB/s memory B/W
Up to 50+ GB/s I/O B/W
~400M$(US) design investment
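The "200+ GFlops (SP)" figure can be reproduced from the per-SPE number quoted elsewhere in this deck (up to 25.6 GF/s per SPE at 3.2 GHz). The sketch below assumes four 32-bit lanes per 128-bit register, two flops per lane per cycle (a multiply-add counted as two operations), and eight SPEs (nine cores minus the PPE). Those three factors are assumptions used to explain the slide's numbers, and the PPE's own contribution is left out.

```python
def peak_sp_gflops(clock_ghz, simd_lanes=4, flops_per_lane=2, spes=8):
    """Return (per-SPE, all-SPE) peak single-precision GFlops."""
    per_spe = clock_ghz * simd_lanes * flops_per_lane
    return per_spe, per_spe * spes

per_spe, total = peak_sp_gflops(3.2)
print(f"{per_spe:.1f} GFlops per SPE, {total:.1f} GFlops across 8 SPEs")
```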
[Figure: Cell Chip floorplan. Labels: Power Processor (PPE), L2 Cache, NCU, SPU, Local Store, AUC, MFC, 25 GB/sec Memory Interface, 20 GB/sec Coherent Interconnect, 5 GB/sec I/O Bus, Existing Application Accelerator]
Programmer
Productivity
SPE <-> PPE Mailboxes
Resource Reservation and Allocation
PPE can be shared across logical partitions
SPEs can be assigned to logical partitions
SPEs independently or Group Allocated
Local Store to Local Store DMA
Cache Replacement Management
TLB Replacement Management
Bandwidth Reservation
[Figure: row of SPEs (SPU, Local Store, MFC) with BIF/IOIF]
• Leverage ELF
• Vector.org SIMD intrinsics
• Data/Code partitioning
• Streaming / pre-specifying code/data use
IBM Research Prototype Single Source Compiler
• C Frontend
• XLC SPE and XLC PPE back-end
IBM Research Prototype Parallelizing Compiler
• UPC front-end
• Fortran front-end
• XLC SPE backend
[Figure: Cell chip diagram. Labels: Power Processor (PPE), L2 Cache, NCU, SPU, Local Store, AU, MFC, 256 GB/sec Coherent Ring, XDR™, I/O]
[Figure: Linux on Cell software stack, top to bottom]
SPUFS Filesystem | Misc format bin SPU Object Loader Extension | Privileged Kernel Extensions | SPU Allocation, Scheduling & Dispatch Extension
64-bit Linux Kernel
Architecture Specific Code
ptrace, large page, madvise, SPE error, fault handling, IIC support, IOS
Firmware / Hypervisor
Cell Reference System Hardware
End-User
Experience
Verification Hypervisor
Miscellaneous Tools
System Level Simulator
Language extensions
Standards:
ABI
Cell Standards
Standards
Other simulators
• spusim – standalone SPU simulator
•SPE management
•Separate policy manager
•Pre-emptive partition switching on
high priority interrupts
[Figure: partitions hosted on the hardware. Labels: Conventional OS (Linux), Real time OS, Network, I/O Hosting, Open Source / Proprietary, Hardware]
Linux on Cell
All software in STIDC written on Linux OS
Execution Environment
• Started with Linux 2.4 PPC64 on Cell Simulator
SPEs exposed as I/O Devices (function offload model)
SPE DMA required pre-pinned memory
Inflexible programming model
Moved to 2.6.3
• Added heterogeneous lwp/thread model – via system call – moved to SPUFS in 2.6.13
SPE thread API created (similar to pthreads library)
User mode direct and indirect SPE access models
Full pre-emptive SPE context management
spe_ptrace() added for gdb support
spe_schedule() for thread to physical SPE assignment
currently FIFO – run to completion
• SPE threads share address space with parent PPE process (through DMA)
Demand paging for SPE accesses
Shared hardware page table with PPE
• SPE Error, Event and Signal handling directed to parent PPE thread
• SPE ELF objects wrapped into PPE shared objects with extended gld
SPE-side mini-loader
• madvise() extended for L2 cache and TLB locking/preloading (realtime feature)
• All patches for Cell in architecture dependent layer (subtree of PPC64)
Publishing Initial CellBE Patches for 2.6.13 (Fall 2005 target)
Debug Tools
Development Environment
GNU gdb
• ptrace and spe_ptrace enabled
• Multi-core Application source level debugger supporting PPE multithreading, SPE
multithreading, interacting PPE and SPE threads
• Three modes of debugging SPU threads
Attach to SPE thread
Launch mode – launch a new debug session for each SPE thread
Pass-thru mode – follow execution into SPE thread
RISCwatch
• Low level hardware (JTAG) debugger
pmcount
• Tool to access HW performance counters
Performance inspector
• Suite of GPL based performance analysis tools extended to support SPE threads
tprof – timer based analysis tool
ptt – per thread time
ai – above idle
post – report generator
a2n – address to name
ctrace
• Branch tracing performance monitor (under development)
Physics Simulation
Subdivision Surfaces
[Figure: PPE/SPE data-flow diagram. Labels: PPE User Application, System Memory, Queues, Shader, Viewer]
[Figure: two charts with axis ticks from 1000 to 5000]
Ray Casting
[Figure: grids of H C element pairs. Labels: PPE, Main Memory, L2, Shuffle Byte, SIMD Register (128 bits)]
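The figure labels mention a Shuffle Byte step operating on a 128-bit SIMD register. As a rough illustration of what a byte shuffle does, the sketch below picks 16 bytes out of the 32-byte concatenation of two 16-byte registers by index. It is a simplified model: any special control encodings of the real SPU instruction are omitted, the function name is hypothetical, and the interleaved H/C layout in the example is an assumption about the figure.

```python
def shuffle_bytes(a, b, control):
    """Select 16 bytes from the 32-byte concatenation a || b by index."""
    if not (len(a) == len(b) == len(control) == 16):
        raise ValueError("all operands must be 16 bytes (128 bits)")
    source = bytes(a) + bytes(b)
    # Only the low 5 bits of each control byte are used as an index (0..31).
    return bytes(source[c & 0x1F] for c in control)

# Two registers holding interleaved H, C bytes: H0 C0 H1 C1 ...
reg_a = bytes(range(0, 16))
reg_b = bytes(range(16, 32))
# Gather every even byte (the "H" elements) into a single register.
gather_h = bytes(range(0, 32, 2))
print(list(shuffle_bytes(reg_a, reg_b, gather_h)))
```

One shuffle with a precomputed control vector replaces sixteen scalar byte moves, which is why this kind of operation is useful for rearranging interleaved data before SIMD arithmetic.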
[Figure: frame pipeline over Time. PPE: Prep Frame 0, Prep Frame 1, Prep Frame 2, Prep Frame 3. SPE 7: Encode Frame 0, Encode Frame 1]
[Figure: SPE activity over Time with alternating buffers 0 and 1. Row groups labeled MFC and SPU. Rows: Store; Load HC/Accm 0, 1, 0; Accm 0, 1, 0; Load; RC 0, 1, 0; Gen Lists 0, Gen Samples 1, Gen Lists 1, Gen Samples 0, Gen Lists 0, Gen Samples 1]
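The alternating 0/1 indices in the timeline suggest double buffering: while the SPU works on one buffer, the MFC loads the next block into the other. The sketch below models that overlap with made-up load and compute times. It ignores the store phase and everything else specific to this application, and the function name is hypothetical.

```python
def double_buffered_schedule(blocks, load_time, compute_time):
    """Return (buffer, load_start, compute_start, compute_end) per block."""
    schedule = []
    dma_free = 0            # time at which the DMA engine is idle again
    spu_free = 0            # time at which the SPU is idle again
    buffer_free = [0, 0]    # time at which each buffer may be overwritten
    for i in range(blocks):
        buf = i % 2
        load_start = max(dma_free, buffer_free[buf])
        load_end = load_start + load_time
        compute_start = max(load_end, spu_free)
        compute_end = compute_start + compute_time
        dma_free = load_end
        spu_free = compute_end
        buffer_free[buf] = compute_end
        schedule.append((buf, load_start, compute_start, compute_end))
    return schedule

schedule = double_buffered_schedule(blocks=4, load_time=2, compute_time=3)
overlapped = schedule[-1][3]
serial = 4 * (2 + 3)
print(overlapped, serial)
```

With these illustrative numbers the overlapped schedule finishes in 14 time units instead of 20, because every load after the first is hidden behind the previous block's computation.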
Shader
• Texture Filtering
• Normal Generation
• Bump Mapping
• Lighting
• Atmosphere
• Sample Accumulation
• Multi-Sample Adjustment
[Figure: Local Store allocation]
• HC DMA List (13 KB), two copies
• Accumulation Data (30 KB), two copies
• Accum DMA List (15 KB), two copies
• RC / AC
• Stack 4 KB
• Code (29 KB)
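As a sanity check, the region sizes in this figure can be added up against the 256 KB Local Store. The sketch assumes the labels read as two copies each of the HC DMA list, the accumulation data, and the accumulation DMA list, plus a 4 KB stack and 29 KB of code. The copy counts and the stack size are a reading of the garbled labels, not a confirmed layout.

```python
LOCAL_STORE_KB = 256

# (region, size in KB, copies) as read from the figure labels;
# the copy counts are an assumption about the original layout.
regions = [
    ("HC DMA List", 13, 2),
    ("Accumulation Data", 30, 2),
    ("Accum DMA List", 15, 2),
    ("Stack", 4, 1),
    ("Code", 29, 1),
]

def local_store_usage(regions):
    """Total KB occupied by all listed regions."""
    return sum(size_kb * copies for _, size_kb, copies in regions)

used = local_store_usage(regions)
print(used, LOCAL_STORE_KB - used)
```

Under that reading the listed regions total 149 KB, leaving room in the Local Store for regions the figure does not size.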
Chip
The number of used registers are 128, the used ratio is 100.00
Austin
Summary
Cell ushers in a new era of leading edge processors optimized for digital
media and entertainment
Desire for realism is driving a convergence between supercomputing and
entertainment
New levels of performance and power efficiency beyond what is achieved by
PC processors
Responsiveness to the human user and the network are key drivers for Cell
Cell will enable entirely new classes of applications, even beyond those we
contemplate today
The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both.
IBM IBM Logo Power Architecture
Other company, product and service names may be trademarks or service marks of others.
All information contained in this document is subject to change without notice. The products described in this document are
NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result
in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change
IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity
under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific
environments, and is presented as an illustration. The results obtained in other operating environments may vary.
While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied
upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable
for damages arising directly or indirectly from any use of the information contained in this document.