
Copperhead: A Python-like Data Parallel Language & Compiler


Bryan Catanzaro, UC Berkeley
Michael Garland, NVIDIA Research
Kurt Keutzer, UC Berkeley

Universal Parallel Computing Research Center


University of California, Berkeley
Intro to CUDA

 Overview
 Multicore/Manycore
 SIMD
 Programming with millions of threads

2/28
The CUDA Programming Model

 CUDA is a recent programming model, designed for:
 Manycore architectures
 Wide SIMD parallelism
 Scalability
 CUDA provides:
 A thread abstraction to deal with SIMD
 Synchronization & data sharing between small groups of
threads
 CUDA programs are written in C + extensions
 OpenCL is inspired by CUDA, but HW & SW vendor neutral
 Programming model essentially identical

3/28
Multicore and Manycore

[Figure: a multicore processor and a manycore processor]
 Multicore: yoke of oxen
 Each core optimized for executing a single thread
 Manycore: flock of chickens
 Cores optimized for aggregate throughput, deemphasizing
individual performance
4/28
Multicore & Manycore, cont.
Specifications           Core i7 960                      GTX285
Processing Elements      4 cores, 4-way SIMD @ 3.2 GHz    30 cores, 8-way SIMD @ 1.5 GHz
Resident Threads (max)   4 cores, 2 threads,              30 cores, 32 SIMD vectors,
                         4-wide SIMD: 32 strands          32-wide SIMD: 30720 strands
SP GFLOP/s               102                              1080
Memory Bandwidth         25.6 GB/s                        159 GB/s
Register File            -                                1.875 MB
Local Store              -                                480 kB

[Photos: Core i7 and GTX285 dies]
5/28
SIMD: Neglected Parallelism

 It is difficult for a compiler to exploit SIMD
 How do you deal with sparse data & branches?
 Many languages (like C) are difficult to vectorize
 Fortran is somewhat better
 Most common solution:
 Either forget about SIMD
▪ Pray the autovectorizer likes you
 Or instantiate intrinsics (assembly language)
 Requires a new code version for every SIMD extension

6/28
What to do with SIMD?

[Figure: 4-way SIMD vs. 16-way SIMD]

 Neglecting SIMD in the future will be more expensive
 AVX: 8-way SIMD, Larrabee: 16-way SIMD
 This problem composes with thread-level parallelism

7/28
CUDA

 CUDA addresses this problem by abstracting both SIMD and task parallelism into threads
 The programmer writes a serial, scalar thread with the
intention of launching thousands of threads
 Being able to launch 1 Million threads changes the
parallelism problem
 It’s often easier to find 1 Million threads than 32: just look
at your data & launch a thread per element
 CUDA is designed for Data Parallelism
 Not coincidentally, data parallelism is the only way for
most applications to scale to 1000(+) way parallelism
8/28
Hello World

9/28
CUDA Summary

 CUDA is a programming model for manycore processors
 It abstracts SIMD, making it easy to use wide SIMD
vectors
 It provides good performance on today’s GPUs
 In the near future, CUDA-like approaches will map well
to many processors & GPUs
 CUDA encourages SIMD friendly, highly scalable
algorithm design and implementation

10/28
A Parallel Scripting Language
 What is a scripting language?
 Lots of opinions on this
 I’m using an informal definition:
▪ A language where performance is happily traded for productivity
 Only a weak performance requirement: scalability
▪ “My code should run faster tomorrow”
 What is the analog of today’s scripting languages for manycore?

11/28
Data Parallelism

 Assertion: Scaling to 1000 cores requires data parallelism
 Accordingly, manycore scripting languages will be data
parallel
 They should allow the programmer to express data
parallelism naturally
 They should compose and transform the parallelism to
fit target platforms

12/28
Warning: Evolving Project

 Copperhead is still in embryo
 We can compile a few small programs
 Lots more work to be done in both language definition
and code generation
 Feedback is encouraged

13/28
Copperhead = Cu + python

 Copperhead is a subset of Python, designed for data parallelism
 Why Python?
 An extant, widely accepted, high-level scripting language
▪ Free simulator(!!)
 Already understands things like map and reduce
 Comes with a parser & lexer
 The current Copperhead compiler takes a subset of
Python and produces CUDA code
 Copperhead is not CUDA-specific, but the current compiler is

14/28
Copperhead is not Pure Python

 Copperhead is not for arbitrary Python code
 Most features of Python are unsupported
 Copperhead is compiled, not interpreted
 Connecting Python code & Copperhead code will
require binding the programs together, similar to
Python-C interaction
 Copperhead is statically typed

[Figure: Copperhead is a subset of Python]
15/28
Saxpy: Hello world
def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

 Some things to notice:
 Types are implicit
▪ The Copperhead compiler uses a Hindley-Milner type system
with typeclasses similar to Haskell
▪ Typeclasses are fully resolved in CUDA via C++ templates
 Functional programming:
▪ map, lambda (or equivalent in list comprehensions)
▪ you can pass functions around to other functions
▪ Closure: the variable ‘a’ is free in the lambda function, but
bound to the ‘a’ in its enclosing scope
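
To see the map and closure semantics concretely, the same code can be exercised under plain Python; a minimal sketch (an illustrative check only; the small print/list harness below is not part of the Copperhead toolchain):

# Plain-Python check of the saxpy example: map applies the lambda
# elementwise, and the free variable 'a' is captured from the
# enclosing scope by the closure.
def saxpy(a, x, y):
    return map(lambda xi, yi: a * xi + yi, x, y)

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
print(list(saxpy(2.0, x, y)))   # [12.0, 24.0, 36.0]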
16/28
Type Inference, cont.
c = a + b

+ : (Num0, Num0) → Num0              (template for the add intrinsic)

c :: α145   a :: α207   b :: α52     (each variable starts with a fresh generic type)

c :: Num52  a :: Num52  b :: Num52   (after unification against the template)
 Copperhead includes function templates for intrinsics like
add, subtract, map, scan, gather
 Expressions are mapped against templates
 Every variable starts out with a unique generic type, then
types are resolved by union find on the abstract syntax tree
 Tuple and function types are also inferred
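
A toy sketch of the union-find unification described above, in plain Python (illustrative only; names like fresh and unify are invented here and this is not the Copperhead compiler's code):

# Each variable starts as its own equivalence class (a fresh type variable);
# unifying it against an intrinsic's template merges classes and records
# the concrete type for the merged class.
parent, binding = {}, {}

def fresh(name):
    parent[name] = name
    return name

def find(t):
    while parent[t] != t:
        parent[t] = parent[parent[t]]   # path compression
        t = parent[t]
    return t

def unify(t1, t2):
    r1, r2 = find(t1), find(t2)
    if r1 != r2:
        parent[r1] = r2
        if r1 in binding:               # carry any concrete binding along
            binding[r2] = binding[r1]

# c = a + b, unified against the template + : (Num, Num) -> Num
for v in ('a', 'b', 'c'):
    fresh(v)
num = fresh('Num52')
binding['Num52'] = 'Num'
for v in ('a', 'b', 'c'):
    unify(v, num)

print({v: binding[find(v)] for v in ('a', 'b', 'c')})
# {'a': 'Num', 'b': 'Num', 'c': 'Num'}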
17/28
Data parallelism

 Copperhead computations are organized around data parallel arrays
 map performs a “forall” for each element in an array
 Accesses must be local
 Accessing non-local elements is done explicitly
 shift, rotate, or gather (see the sketch below)
 No side effects allowed
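
As a sketch of these rules, here is a 3-point smoothing stencil written so that the only non-local accesses go through shift; the shift signature and boundary handling are assumed for illustration and may not match the real primitive:

# Assumed reference semantics: shift(x, offset, default) returns a copy of x
# moved by 'offset' positions, padding vacated slots with 'default'.
def shift(x, offset, default):
    n = len(x)
    return [x[i - offset] if 0 <= i - offset < n else default for i in range(n)]

def smooth(x):
    left  = shift(x,  1, 0.0)   # left[i]  = x[i-1]
    right = shift(x, -1, 0.0)   # right[i] = x[i+1]
    # the map body only touches element i of each input: purely local
    return list(map(lambda l, c, r: (l + c + r) / 3.0, left, x, right))

print(smooth([1.0, 2.0, 3.0, 4.0]))   # [1.0, 2.0, 3.0, 2.333...]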

18/28
Copperhead primitives

 map
 reduce
 Scans:
 scan, rscan, segscan, rsegscan
 exscan, exrscan, exsegscan, exrsegscan
 Shuffles:
 shift, rotate, gather, scatter
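
To pin down what a few of these primitives mean, a plain-Python sketch of plausible reference semantics (illustrative only; the real definitions may differ in argument order and identity handling):

# scan: inclusive prefix combination seeded with an identity value
def scan(f, identity, x):
    out, acc = [], identity
    for v in x:
        acc = f(acc, v)
        out.append(acc)
    return out

# gather: pull elements from the positions named by an index array
def gather(x, indices):
    return [x[i] for i in indices]

# scatter: push element i of x to position indices[i]
def scatter(x, indices):
    out = [None] * len(x)
    for v, i in zip(x, indices):
        out[i] = v
    return out

print(scan(lambda a, b: a + b, 0, [1, 2, 3, 4]))   # [1, 3, 6, 10]
print(gather([10, 20, 30], [2, 0, 2]))             # [30, 10, 30]
print(scatter([10, 20, 30], [2, 0, 1]))            # [20, 30, 10]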

19/28
Implementing Copperhead
 The Copperhead compiler is written in Python
 Python provides its own Abstract Syntax Tree
 Type inference, code generation, etc. is done by walking the AST

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

Module(
  None,
  Stmt(
    Function(
      None,
      'saxpy',
      ['a', 'x', 'y'],
      0,
      None,
      Stmt(
        Return(
          CallFunc(
            Name('map'),
            Lambda(
              ['xi', 'yi'],
              0,
              Add(
                Mul(
                  Name('a'),
                  Name('xi')
                ),
                Name('yi')
              )
            ),
            Name('x'),
            Name('y'),
            None,
            None
          )
        )
      )
    )
  )
)
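
The dump above comes from the Python 2 compiler module of the era; just to illustrate the idea of walking the tree, here is a small sketch with today's ast module (not the Copperhead compiler's actual code):

import ast

src = """
def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)
"""

tree = ast.parse(src)
print(ast.dump(tree))   # the raw nested AST, analogous to the dump above

# A compiler pass is a walk over this tree; e.g. collect every name
# referenced anywhere in the function:
names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
print(sorted(names))    # ['a', 'map', 'x', 'xi', 'y', 'yi']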
20/28
Compiling Copperhead to CUDA

 Every Copperhead function creates at least one CUDA device function
 Top level Copperhead functions create a CUDA global
function, which orchestrates the device function calls
 The global function takes care of allocating shared
memory and returning data (storing it to DRAM)
 Global synchronizations are implemented through
multiple phases
 All intermediate arrays & plumbing handled by Copperhead
compiler

21/28
Saxpy Revisited
def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

template<typename Num> __device__ Num lambda0(Num xi, Num yi, Num a) {
  return ((a * xi) + yi);
}

template<typename Num> __device__ void saxpy0Dev(Array<Num> x, Array<Num> y, Num a,
    uint _globalIndex, Num& _returnValueReg) {
  Num _xReg, _yReg;
  if (_globalIndex < x.length) _xReg = x[_globalIndex];
  if (_globalIndex < y.length) _yReg = y[_globalIndex];
  if (_globalIndex < x.length) _returnValueReg = lambda0<Num>(_xReg, _yReg, a);
}

template<typename Num> __global__ void saxpy0(Array<Num> x, Array<Num> y, Num a,
    Array<Num> _returnValue) {
  uint _blockMin = IMUL(blockDim.x, blockIdx.x);
  uint _blockMax = _blockMin + blockDim.x;
  uint _globalIndex = _blockMin + threadIdx.x;
  Num _returnValueReg;
  saxpy0Dev(x, y, a, _globalIndex, _returnValueReg);
  if (_globalIndex < _returnValue.length) _returnValue[_globalIndex] = _returnValueReg;
}
22/28
Phases

 Reduction: executes in two phases (phase 0, phase 1)
 Scan: executes in three phases (phase 0, phase 1, phase 2)

[Figure: phase structure of reduction and scan]
23/28
Copperhead to CUDA, cont.

B = reduce(map(A))
D = reduce(map(C))

[Figure: both computations scheduled together across phase 0 and phase 1]
 Compiler schedules computations into phases


 Right now, this composition is done greedily
 Compiler tracks global and local availability of all variables
and creates a phase boundary when necessary
 Fusing work into phases is important for performance
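
One plausible way to model this greedy scheduling (entirely illustrative; the op representation below is invented, not the compiler's): assign each operation to the earliest phase in which all of its inputs are usable, where a value whose readers need a global synchronization (such as a reduction's partial results) only becomes usable in the following phase.

# name, values read, value written, do readers need a global sync first?
ops = [
    ("mapA",          ["A"],        "tA",       False),
    ("reduceA_local", ["tA"],       "partialA", True),
    ("reduceA_final", ["partialA"], "B",        True),
    ("mapC",          ["C"],        "tC",       False),
    ("reduceC_local", ["tC"],       "partialC", True),
    ("reduceC_final", ["partialC"], "D",        True),
]

def assign_phases(ops, program_inputs):
    ready = {v: (0, False) for v in program_inputs}   # value -> (phase, needs sync)
    phases = {}
    for name, reads, writes, needs_sync in ops:
        # earliest phase where every input is usable; a sync pushes readers one phase later
        phases[name] = max(p + (1 if sync else 0)
                           for p, sync in (ready[r] for r in reads))
        ready[writes] = (phases[name], needs_sync)
    return phases

print(assign_phases(ops, ["A", "C"]))
# {'mapA': 0, 'reduceA_local': 0, 'reduceA_final': 1,
#  'mapC': 0, 'reduceC_local': 0, 'reduceC_final': 1}

Under this toy model both map+reduce computations land in phase 0 (maps fused with the per-block reduction stages) and phase 1 (the final reduction stages), which is the kind of two-phase fusion the figure above depicts.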
24/28
Copperhead to CUDA, cont.

 Shared memory used only for communicating between threads
 Caching unpredictable accesses (gather)
 Accessing elements with a uniform stride (shift & rotate)
 Each device function returns its intermediate results
through registers

25/28
Split
def split(input, value):
    flags = map(lambda a: 1 if a <= value else 0, input)        # phase 0
    notFlags = map(lambda a: not a, flags)                      # phase 0
    leftPositions = exscan(lambda a, b: a + b, 0, flags)        # phases 0-2
    rightPositions = exrscan(lambda a, b: a + b, 0, notFlags)   # phases 0-2
    positions = map(lambda a, b, flag: a if flag else
                    len(input) - b - 1,
                    leftPositions, rightPositions, flags)       # phase 2
    return scatter(input, positions)                            # phase 2

 This code is decomposed into 3 phases
 Copperhead compiler takes care of intermediate
variables
 Copperhead compiler uses shared memory for
temporaries used in scans here
 Everything else is in registers
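
A plain-Python trace of split on a small input, with assumed reference semantics for the primitives and list comprehensions standing in for map (a sketch for intuition, not the generated CUDA):

def exscan(f, identity, x):            # out[i] combines all elements before i
    out, acc = [], identity
    for v in x:
        out.append(acc)
        acc = f(acc, v)
    return out

def exrscan(f, identity, x):           # the same, scanning from the right
    return exscan(f, identity, x[::-1])[::-1]

def scatter(x, positions):
    out = [None] * len(x)
    for v, p in zip(x, positions):
        out[p] = v
    return out

def split(input, value):
    flags = [1 if a <= value else 0 for a in input]
    notFlags = [1 - f for f in flags]
    leftPositions = exscan(lambda a, b: a + b, 0, flags)
    rightPositions = exrscan(lambda a, b: a + b, 0, notFlags)
    positions = [l if f else len(input) - r - 1
                 for l, r, f in zip(leftPositions, rightPositions, flags)]
    return scatter(input, positions)

print(split([3, 7, 1, 9, 4], 4))       # [3, 1, 4, 7, 9]: keys <= 4 left, others right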

26/28
Interpreting to Copperhead

 If the interpreter harvested dynamic type information, it could use the Copperhead compiler as a backend
 Fun project – see what kinds of information could be
gleaned from the Python interpreter at runtime to
figure out what should be compiled via Copperhead to a
manycore chip

27/28
Future Work

 Finish support for the basics
 Compiler transformations
 Nested data parallelism flattening
▪ segmented scans
 Retargetability
 Thread Building Blocks/OpenMP/OpenCL
 Bridge Python and Copperhead
 Implement real algorithms with Copperhead
 Vision/Machine Learning, etc.

28/28
