Tesla V100
  New GPU Architecture
  Tensor Cores
  NVLink
  Independent Thread Scheduling

Faster Libraries
  cuBLAS for Deep Learning
  NPP for Image Processing
  cuFFT for Signal Processing
INTRODUCING TESLA V100
Volta Architecture: Most Productive GPU
Improved NVLink & HBM2: Efficient Bandwidth
Volta MPS: Inference Utilization
Improved SIMT Model: New Algorithms
Tensor Core: 120 Programmable TFLOPS Deep Learning

The Fastest and Most Productive GPU for Deep Learning and HPC
ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers

Summit Supercomputer: 200+ PetaFlops, ~3,400 Nodes, 10 Megawatts

System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.
FASTER LIBRARIES
CUDA 9: WHAT’S NEW IN LIBRARIES
VOLTA PLATFORM SUPPORT
  Out-of-box performance on Volta
  Utilize Volta Tensor Cores (cuBLAS)
  Volta optimized GEMMs (cuBLAS)

PERFORMANCE
  Deep Learning: GEMM optimizations for RNNs (cuBLAS)
  Faster image processing (NPP)

SCIENTIFIC COMPUTING
  Multi-GPU dense & sparse solvers, dense eigenvalue & SVD (cuSOLVER)
  Breadth-first search, clustering, triangle counting, extraction & contraction (nvGRAPH)

New install package for CUDA Libraries (library-only meta package)
Modular NPP with small footprint, support for image batching
cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply
[Chart] Relative performance vs. matrix size (M=N=K = 512, 1024, 2048, 4096):

cuBLAS Single Precision (FP32): V100 (CUDA 9) up to 1.8x faster than P100 (CUDA 8)
cuBLAS Mixed Precision (FP16 input, FP32 compute): V100 Tensor Cores (CUDA 9) up to 9.3x faster than P100 (CUDA 8)
Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.
Learn More
COOPERATIVE GROUPS
COOPERATIVE GROUPS
Flexible and Scalable Thread Synchronization and Communication
* Note: Multi-Block and Multi-Device Cooperative Groups are only supported on Pascal and newer GPUs
SYNCHRONIZE AT ANY SCALE
Three Key Capabilities:

FLEXIBLE GROUPS: Define and Synchronize Arbitrary Groups of Threads (partition, sync)
WHOLE-GRID SYNCHRONIZATION: Synchronize Multiple Thread Blocks
MULTI-GPU SYNCHRONIZATION
COOPERATIVE GROUPS BASICS
Flexible, Explicit Synchronization
Thread groups are explicit objects in your program.

Thread block group:
    g = this_thread_block();
    reduce(g, ptr, myVal);

Tiled partition (e.g., warp-sized tiles):
    g = tiled_partition<32>(this_thread_block());
    reduce(g, ptr, myVal);

Multi-block sync: launch with cudaLaunchCooperativeKernel()
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups
Note change in how threads map to particles in acceleration data structure
WHOLE-GRID COOPERATION
Particle Simulation Update in a Single Kernel
Single GPU:     grid_group g = this_grid();
Multiple GPUs:  multi_grid_group g = this_multi_grid();
ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model
Learn More: Cooperative Groups, Session S7622
DEVELOPER TOOLS
UNIFIED MEMORY PROFILING
Correlate CPU Page Faults with Source
Page Fault Correlation
NEW UNIFIED MEMORY EVENTS
Visualize Virtual Memory Activity
Learn More
THE BEYOND SECTION
FUTURE: UNIFIED SYSTEM ALLOCATOR
Allocate unified memory using standard malloc
CUDA C++ Warp-Level Matrix Operations, Volta Optimized Frameworks and Libraries:

    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);
TENSOR CORE
Mixed Precision Matrix Math
4x4 matrices: D = A × B + C

    D[i][j] = A[i][0]*B[0][j] + A[i][1]*B[1][j] + A[i][2]*B[2][j] + A[i][3]*B[3][j] + C[i][j]
TENSOR CORE COORDINATION
Full Warp 16x16 Matrix Math
CUDA TENSOR CORE PROGRAMMING
16x16x16 Warp Matrix Multiply and Accumulate (WMMA)
D = A × B + C, where A and B are FP16 matrices, and C and D may be FP16 or FP32.
CUDA TENSOR CORE PROGRAMMING
New WMMA datatypes
CUDA TENSOR CORE PROGRAMMING
New WMMA load and store operations
wmma::load_matrix_sync(Amat, a, stride);
CUDA TENSOR CORE PROGRAMMING
New WMMA Matrix Multiply and Accumulate Operation
    wmma::mma_sync(Dmat, Amat, Bmat, Cmat);   // D = A × B + C
CUDA TENSOR CORE PROGRAMMING
New WMMA store operation: the full warp writes the result fragment back to memory

    wmma::store_matrix_sync(d, Dmat, stride, wmma::mem_row_major);
FUTURE COOPERATIVE GROUPS
Volta Enables Greater Flexibility
Use with care: random groups can lead to SIMT execution inefficiency
FUTURE COOPERATIVE GROUPS
Library of Collective Algorithms
http://developer.nvidia.com/cuda-toolkit
http://parallelforall.com
mharris@nvidia.com
@harrism
BACKUP
THREAD GROUPS INTERFACE
A thread can access the size of its group and its index (rank) in the group:
    thread_group group = this_thread_block();   // intrinsic group
    unsigned size = group.size();               // number of threads in the group
    unsigned rank = group.thread_rank();        // this thread's index, 0..size-1
FUTURE COOPERATIVE GROUPS
Volta Enables Greater Flexibility
Partition using an arbitrary label:
Use with care: random groups can lead to SIMT execution inefficiency
Over 2500 accelerated image, video & computer vision primitives
Small memory footprint
Image batching support

[Chart] NPP Image Processing: 20-100x vs. CPU. Speedup of NPP on Tesla V100 vs. IPP on Xeon E5-2690 (Broadwell) for Morphological Operations, JPEG, Filters, and Color Processing.
EXAMPLE: PARTICLE SIMULATION