
May 8-11, 2017 | Silicon Valley

CUDA 9 AND BEYOND


Mark Harris, May 10, 2017
INTRODUCING CUDA 9
BUILT FOR VOLTA
  Tesla V100: New GPU Architecture
  Tensor Cores
  NVLink
  Independent Thread Scheduling

FASTER LIBRARIES
  cuBLAS for Deep Learning
  NPP for Image Processing
  cuFFT for Signal Processing

COOPERATIVE THREAD GROUPS
  Flexible Thread Groups
  Efficient Parallel Algorithms
  Synchronize Across Thread Blocks in a Single GPU or Multi-GPUs

DEVELOPER TOOLS & PLATFORM UPDATES
  Faster Compile Times
  Unified Memory Profiling
  NVLink Visualization
  New OS and Compiler Support
INTRODUCING TESLA V100

Volta Architecture: Most Productive GPU
Improved NVLink & HBM2: Efficient Bandwidth
Volta MPS: Inference Utilization
Improved SIMT Model: New Algorithms
Tensor Core: 120 Programmable TFLOPS for Deep Learning

The Fastest and Most Productive GPU for Deep Learning and HPC
ROAD TO EXASCALE
Volta to Fuel the Most Powerful US Supercomputers

[Chart: Volta HPC application performance relative to Tesla P100. System config: 2x Xeon E5-2690 v4, 2.6 GHz, with 1x Tesla P100 or V100; V100 measured on pre-production hardware.]

Summit Supercomputer
  200+ PetaFlops
  ~3,400 Nodes
  10 Megawatts
FASTER LIBRARIES

CUDA 9: WHAT’S NEW IN LIBRARIES
VOLTA PLATFORM SUPPORT
  Utilize Volta Tensor Cores (cuBLAS)
  Volta-optimized GEMMs (cuBLAS)
  Out-of-box performance on Volta (all libraries)

PERFORMANCE
  Deep Learning: GEMM optimizations for RNNs (cuBLAS); faster image processing (NPP)
  Scientific Computing: FFT optimizations across various sizes (cuFFT)

NEW ALGORITHMS
  Multi-GPU dense & sparse solvers, dense eigenvalue & SVD (cuSOLVER)
  Breadth-first search, clustering, triangle counting, extraction & contraction (nvGRAPH)

IMPROVED USER EXPERIENCE
  New install package for CUDA libraries (library-only meta package)
  Modular NPP with small footprint, support for image batching
cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply
[Charts: matrix-multiply performance relative to Tesla P100 (CUDA 8) for matrix sizes M=N=K of 512, 1024, 2048, and 4096.
  cuBLAS Single Precision (FP32): V100 (CUDA 9) up to 1.8x faster than P100 (CUDA 8).
  cuBLAS Mixed Precision (FP16 input, FP32 compute): V100 Tensor Cores (CUDA 9) up to 9.3x faster than P100 (CUDA 8).]

Note: pre-production Tesla V100 and pre-release CUDA 9; CUDA 8 GA release.
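The mixed-precision numbers come from the cuBLAS GEMM path. As a hedged illustration (not from the slides), the host-side sketch below shows how an application might opt in to Tensor Core math with cuBLAS in CUDA 9; the wrapper name gemm_fp16_fp32 and the column-major leading dimensions are assumptions, and error handling is omitted.

    // Minimal sketch: FP16-input, FP32-compute GEMM with Tensor Core math enabled.
    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    cublasStatus_t gemm_fp16_fp32(cublasHandle_t handle,
                                  int m, int n, int k,
                                  const half *dA, const half *dB, float *dC)
    {
        // Allow cuBLAS to use Tensor Cores where the problem shape permits (CUDA 9+).
        cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

        const float alpha = 1.0f, beta = 0.0f;

        // FP16 inputs, FP32 accumulation/compute.
        return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                            m, n, k,
                            &alpha,
                            dA, CUDA_R_16F, m,
                            dB, CUDA_R_16F, k,
                            &beta,
                            dC, CUDA_R_32F, m,
                            CUDA_R_32F,          // compute type
                            CUBLAS_GEMM_DFALT);  // let cuBLAS choose the algorithm
    }

Setting CUBLAS_TENSOR_OP_MATH on the handle lets cuBLAS select Tensor Core kernels when the data types and dimensions allow; otherwise it falls back to regular FP32 math.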
Learn More

Connect with the Experts
  H7129: Accelerated Libraries: cuFFT, cuSPARSE, cuSOLVER, nvGRAPH
  Wednesday, 4 PM, Lower Level Pod B

S7121: Jacobi-Based Eigenvalue Solver on GPU (cuSOLVER)
  Lung Sheng Chien
  Tuesday, May 9, 11:00 AM - 11:25 AM, Marriott Salon 3
COOPERATIVE GROUPS

COOPERATIVE GROUPS
Flexible and Scalable Thread Synchronization and Communication

Define, synchronize, and partition groups of cooperating threads

Clean composition across software boundaries

Optimize for the hardware fast path

Scalable from a few threads to all running threads

Deploy everywhere: Kepler and newer GPUs*

Supported by CUDA developer tools

[Diagram: a thread block group partitioned into smaller thread groups.]

* Note: Multi-block and multi-device cooperative groups are only supported on Pascal and above GPUs
SYNCHRONIZE AT ANY SCALE
Three Key Capabilities

FLEXIBLE GROUPS
  Define and synchronize arbitrary groups of threads

WHOLE-GRID SYNCHRONIZATION
  Synchronize multiple thread blocks within a single GPU

MULTI-GPU SYNCHRONIZATION
  Synchronize thread blocks across multiple GPUs
COOPERATIVE GROUPS BASICS
Flexible, Explicit Synchronization
Thread groups are explicit objects in your program:

    thread_group block = this_thread_block();

You can synchronize the threads in a group:

    block.sync();

Create new groups by partitioning existing groups:

    thread_group tile32 = tiled_partition(block, 32);
    thread_group tile4  = tiled_partition(tile32, 4);

Partitioned groups can also synchronize:

    tile4.sync();

Note: these calls are part of the cooperative_groups:: namespace
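Putting the snippets together, here is a minimal sketch (not from the slides) of a kernel that synchronizes first at block scope and then at 32-thread tile scope; the kernel name, the 256-thread block size, and the shared-memory staging are illustrative assumptions.

    #include <cooperative_groups.h>
    using namespace cooperative_groups;

    // Illustrative kernel: stage data in shared memory, sync the whole block,
    // then let each 32-thread tile work and sync independently.
    __global__ void scale_by_block_first(float *data, float factor)
    {
        __shared__ float staged[256];                // assumes blockDim.x == 256

        thread_group block = this_thread_block();
        int t = block.thread_rank();

        staged[t] = data[blockIdx.x * blockDim.x + t];
        block.sync();                                // all threads see staged[]

        thread_group tile32 = tiled_partition(block, 32);
        staged[t] *= factor;
        tile32.sync();                               // only the 32-thread tile syncs

        data[blockIdx.x * blockDim.x + t] = staged[t];
    }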
EXAMPLE: PARALLEL REDUCTION
Composable, Robust and Efficient
Per-Block:
    g = this_thread_block();
    reduce(g, ptr, myVal);

Per-Warp:
    g = tiled_partition<32>(this_thread_block());
    reduce(g, ptr, myVal);

    __device__ int reduce(thread_group g, int *x, int val) {
        int lane = g.thread_rank();
        for (int i = g.size() / 2; i > 0; i /= 2) {
            x[lane] = val;
            g.sync();
            if (lane < i) val += x[lane + i];   // guard keeps reads in bounds
            g.sync();
        }
        return val;   // thread 0 returns the full sum
    }
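A minimal driver sketch (not from the slides) for the per-block case: the kernel name block_sum, the dynamic shared-memory scratch, and the atomicAdd accumulation into a global total are illustrative assumptions, and reduce() is the device function above, assumed to be in the same translation unit.

    #include <cooperative_groups.h>
    using namespace cooperative_groups;

    // Each block reduces its threads' values in shared memory, then thread 0
    // accumulates the block's partial sum into a global total.
    __global__ void block_sum(const int *in, int *total, int n)
    {
        extern __shared__ int scratch[];          // one int per thread in the block

        thread_group block = this_thread_block();
        int i = blockIdx.x * blockDim.x + block.thread_rank();
        int val = (i < n) ? in[i] : 0;

        int sum = reduce(block, scratch, val);    // only thread 0 holds the full sum

        if (block.thread_rank() == 0)
            atomicAdd(total, sum);
    }

It would be launched with one int of dynamic shared memory per thread, e.g. block_sum<<<blocks, threads, threads * sizeof(int)>>>(in, total, n).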
LAUNCHING COOPERATIVE KERNELS
Three Synchronization Scales

Block or Sub-Block Sync:  launch with <<<>>> or cudaLaunchKernel()

Multi-Block Sync:         launch with cudaLaunchCooperativeKernel()

Multi-Device Sync:        launch with cudaLaunchCooperativeKernelMultiDevice()
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups

// threads update particles in parallel
integrate<<<blocks, threads, 0, stream>>>(particles);

[Diagram: threads 0-7 each update one particle.]
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups

// threads update particles in parallel
integrate<<<blocks, threads, 0, s>>>(particles);

// Collide each particle with others in its neighborhood
collide<<<blocks, threads, 0, s>>>(particles);

Note the change in how threads map to particles in the acceleration data structure.
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups

// threads update particles in parallel
integrate<<<blocks, threads, 0, s>>>(particles);

// Note: implicit sync between kernel launches

// Collide each particle with others in its neighborhood
collide<<<blocks, threads, 0, s>>>(particles);

Note the change in how threads map to particles in the acceleration data structure.
WHOLE-GRID COOPERATION
Particle Simulation Update in a Single Kernel

__global__ void particleSim(Particle *p, int N) {

    grid_group g = this_grid();

    for (int i = g.thread_rank(); i < N; i += g.size())
        integrate(p[i]);

    g.sync();   // Sync whole grid!

    for (int i = g.thread_rank(); i < N; i += g.size())
        collide(p[i], p, N);
}

Launch using cudaLaunchCooperativeKernel(…)
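A minimal host-side sketch of that launch (not from the slides): the helper name launch_particle_sim, the 256-thread blocks, and the occupancy-based grid sizing are illustrative assumptions, and error checking is omitted. All blocks of a cooperative launch must be co-resident on the GPU, which is why the grid is sized from the occupancy query.

    #include <cuda_runtime.h>

    // Assumes the particleSim kernel and Particle type shown above are in scope.
    void launch_particle_sim(Particle *d_particles, int n)
    {
        int device = 0, numSMs = 0, blocksPerSM = 0;
        const int threads = 256;

        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

        // All blocks of a cooperative launch must fit on the GPU at once.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, particleSim, threads, 0 /* dynamic smem */);

        dim3 grid(blocksPerSM * numSMs), block(threads);
        void *args[] = { &d_particles, &n };

        cudaLaunchCooperativeKernel((void *)particleSim, grid, block, args, 0, 0);
    }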
MULTI-GPU COOPERATION
Large-scale Multi-GPU Simulation in a Single Kernel

__global__ void particleSim(Particle *p, int N) {

    multi_grid_group g = this_multi_grid();

    for (int i = g.thread_rank(); i < N; i += g.size())
        integrate(p[i]);

    g.sync();   // Sync all GPUs!

    for (int i = g.thread_rank(); i < N; i += g.size())
        collide(p[i], p, N);
}

Launch using cudaLaunchCooperativeKernelMultiDevice(…)
ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model

Volta Independent Thread Scheduling:
  Program familiar algorithms and data structures in a natural way
  Flexible thread grouping and synchronization
  Use explicit synchronization; don't rely on implicit convergence

CUDA 9 provides a fully explicit synchronization model
ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model

Eliminate implicit warp-synchronous programming on all architectures:
  Use explicit synchronization
  Focus synchronization granularity with Cooperative Groups

Transition to the new *_sync() primitives (see the warp-reduction sketch below):
  __shfl_sync(), __ballot_sync(), __any_sync(), __all_sync(), __activemask()
  CUDA 9 deprecates the non-synchronizing __shfl(), __ballot(), __any(), __all()
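To illustrate the transition (not from the slides), here is a hedged sketch of a warp sum reduction using __shfl_down_sync(), a sibling of the listed __shfl_sync(); the full-warp mask assumes all 32 lanes participate, which is the simple common case.

    // Warp-level sum using the explicit *_sync shuffle primitives (CUDA 9+).
    // FULL_MASK assumes all 32 lanes of the warp are participating; for
    // divergent callers, build the mask with __activemask() or __ballot_sync().
    #define FULL_MASK 0xffffffffu

    __device__ int warpReduceSum(int val)
    {
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(FULL_MASK, val, offset);
        return val;   // lane 0 holds the warp's sum
    }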
Learn More

Cooperative Groups
Session S7622

Kyrylo Perelygin and Yuan Lin

Wednesday, 4pm Marriott Ballroom 3

DEVELOPER TOOLS

UNIFIED MEMORY PROFILING
Correlate CPU Page Faults with Source
[Screenshot: profiler timeline correlating CPU page faults with source lines.]
NEW UNIFIED MEMORY EVENTS
Visualize Virtual Memory Activity

[Screenshots: timeline visualizations of memory thrashing, page throttling, and remote map events.]
Learn More

S7495: Optimizing Application Performance with CUDA Profiling Tools
  Rahul Dhoot, Sanjiv Satoor, Mayank Jain
  Thursday, 10 AM, Marriott Ballroom 3

S7824: Developer Tools Update in CUDA 9.0
  Rafael Campana
  Wednesday, 4 PM, Room 212A
THE BEYOND SECTION

FUTURE: UNIFIED SYSTEM ALLOCATOR
Allocate unified memory using standard malloc

Code with the system allocator:

    void sortfile(FILE *fp, int N) {
        char *data;

        // Allocate memory using any standard allocator
        data = (char *) malloc(N * sizeof(char));

        fread(data, 1, N, fp);

        sort<<<...>>>(data, N, 1, compare);

        use_data(data);

        // Free the allocated memory
        free(data);
    }

Removes CUDA-specific allocator restrictions
Data movement is handled transparently
Requires operating system support: HMM Linux kernel module

Learn More: HMM, Session S7764, John Hubbard, Wednesday, 4 PM (Room 211B)
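For contrast (not on the slide), a minimal sketch of the same routine with today's CUDA unified memory allocator; sort, compare, and use_data are the same placeholders used above, and the <<<...>>> launch configuration is left elided as on the slide.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Today (CUDA 8/9): unified memory must come from a CUDA allocator.
    void sortfile_managed(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);     // CUDA-specific allocation

        fread(data, 1, N, fp);

        sort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();         // GPU must finish before the CPU touches data

        use_data(data);
        cudaFree(data);
    }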
USING TENSOR CORES
CUDA C++ Warp-Level Matrix Operations:

    __device__ void tensor_op_16_16_16(
        float *d, half *a, half *b, float *c)
    {
        wmma::fragment<matrix_a, …> Amat;
        wmma::fragment<matrix_b, …> Bmat;
        wmma::fragment<matrix_c, …> Cmat;

        wmma::load_matrix_sync(Amat, a, 16);
        wmma::load_matrix_sync(Bmat, b, 16);
        wmma::fill_fragment(Cmat, 0.0f);

        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

        wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
    }

Volta Optimized Frameworks and Libraries: NVIDIA cuDNN, cuBLAS, TensorRT
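The fragment parameters are elided ("…") on the slide. As a hedged sketch of how a fully specified version might look under the CUDA 9 preview API (nvcuda::wmma namespace, 16x16x16 tiles, FP16 inputs with FP32 accumulation); the kernel name and the leading dimension of 16 are illustrative assumptions.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes D = A * B for a single 16x16x16 tile.
    __global__ void wmma_16x16x16(const half *a, const half *b, float *d)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

        wmma::load_matrix_sync(Amat, a, 16);   // 16 = leading dimension
        wmma::load_matrix_sync(Bmat, b, 16);
        wmma::fill_fragment(Cmat, 0.0f);

        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

        wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
    }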
TENSOR CORE
Mixed Precision Matrix Math
Operates on 4x4 matrices:

D = A * B + C

A, B: FP16
C, D: FP16 or FP32
TENSOR CORE COORDINATION
Full Warp 16x16 Matrix Math

Warp-synchronizing operation for cooperative matrix math

Aggregate Matrix Multiply and Accumulate for 16x16 matrices

Result distributed across the warp
CUDA TENSOR CORE PROGRAMMING
16x16x16 Warp Matrix Multiply and Accumulate (WMMA)

D = A * B + C, where A and B are FP16 and C, D are FP16 or FP32
CUDA TENSOR CORE PROGRAMMING
New WMMA datatypes

Per-thread fragments hold components of matrices for use with Tensor Cores:

    wmma::fragment<matrix_a, …> Amat;
CUDA TENSOR CORE PROGRAMMING
New WMMA load and store operations

Warp-level operation to fetch components of matrices into fragments:

    wmma::load_matrix_sync(Amat, a, stride);
CUDA TENSOR CORE PROGRAMMING
New WMMA Matrix Multiply and Accumulate Operation

Warp-level operation to perform matrix multiply and accumulate:

    wmma::mma_sync(Dmat, Amat, Bmat, Cmat);
CUDA TENSOR CORE PROGRAMMING
New WMMA store operation

Warp-level operation to store the result fragments back to memory:

    wmma::store_matrix_sync(d, Dmat, stride);
FUTURE COOPERATIVE GROUPS
Volta Enables Greater Flexibility

Partition using an arbitrary label:

    // Four groups of threads with same computed value
    int label = foo() % 4;
    thread_group block = partition(this_thread_block(), label);

Use with care: random groups can lead to SIMT execution inefficiency
FUTURE COOPERATIVE GROUPS
Library of Collective Algorithms

Reductions, sorting, prefix sum (scan), etc.

    // collective key-value sort using all threads in the block
    cooperative_groups::sort(this_thread_block(), myValues, myKeys);

    // collective scan-based allocate across block
    int sz = myAllocationSize(); // amount each thread wants
    int offset = cooperative_groups::exclusive_scan(this_thread_block(), sz);

Note: preliminary API sketch
May 8-11, 2017 | Silicon Valley

CUDA 9 AND BEYOND


#GTC17

http://developer.nvidia.com/cuda-toolkit
http://parallelforall.com
mharris@nvidia.com
@harrism
BACKUP

THREAD GROUPS INTERFACE

A thread can access the size of its group and its index (rank) in the group:

    thread_group group = this_thread_block();   // intrinsic group

    int index = group.thread_rank();
    int size  = group.size();

Thread block groups are a special type with more functions:

    thread_block block = this_thread_block();

    int index = block.thread_rank();    // linear index
    dim3 tid  = block.thread_index();   // equivalent to threadIdx (3D)
    dim3 bid  = block.group_index();    // equivalent to blockIdx (3D)
DISCOVERED CONCURRENCY
Simple, Robust Cooperation Within Warps
CUDA 8:

    __device__ int atomicAggInc(int *p)
    {
        unsigned mask = __ballot(1);
        unsigned total = __popc(mask);
        unsigned prefix = __popc(mask & __lanemask_lt());
        int lane = __ffs(mask) - 1;
        int offset = 0;
        if (prefix == 0)
            offset = atomicAdd(p, total);
        return prefix + __shfl_sync(mask, offset, lane);
    }

CUDA 9 Cooperative Groups:

    __device__ int atomicAggInc(int *p)
    {
        coalesced_group g = coalesced_threads();
        int prev;
        if (g.thread_rank() == 0)
            prev = atomicAdd(p, g.size());
        return g.thread_rank() + g.shfl(prev, 0);
    }

coalesced_threads() returns the group of threads that called it together (often a warp)

coalesced_group supports warp shfl()
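As a hedged usage sketch (not on the slide), atomicAggInc is the kind of primitive used for stream compaction, where only some threads of a warp append an element; the filter_positive kernel below and its argument names are illustrative assumptions.

    // Illustrative use of atomicAggInc: threads that pass a predicate reserve
    // output slots with a single atomicAdd per coalesced group.
    __global__ void filter_positive(int *dst, int *nres, const int *src, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && src[i] > 0)
            dst[atomicAggInc(nres)] = src[i];
    }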
FUTURE COOPERATIVE GROUPS
Volta Enables Greater Flexibility
Partition using an arbitrary label:

    // Group of first four threads of all warps
    auto tile = tiled_partition<32>(this_thread_block());
    thread_group block = partition(this_thread_block(), tile.thread_rank() < 4);

    // Four groups of threads with same computed value
    int label = foo() % 4;
    thread_group block = labeled_partition(this_thread_block(), label);

Use with care: random groups can lead to SIMT execution inefficiency
Need updated results

NPP IMAGE PROCESSING PRIMITIVES


Redesigned NPP boosts performance with a smaller footprint

Over 2500 accelerated image, video & computer vision primitives

CUDA 9 streamlines the NPP library:
  Small memory footprint
  Image batching support

[Chart: NPP on Tesla V100 speedup vs. IPP on Xeon E5-2690 (Broadwell), roughly 20-100x across morphological operations, JPEG, geometry transforms, filters, and color processing.]
EXAMPLE: PARTICLE SIMULATION

[Diagram: thread-to-particle mapping during Phase 1 (Integration) and Phase 2 (Collision Detection).]
