
CUDA GPU Occupancy Calculator

Just follow steps 1, 2, and 3 below! (or click here for help)

1.) Select Compute Capability (click): 2.0

2.) Enter your resource usage:
    Threads Per Block: 256
    Registers Per Thread: 8
    Shared Memory Per Block (bytes): 1024

(Don't edit anything below this line)

3.) GPU Occupancy Data is displayed here and in the graphs:
    Active Threads per Multiprocessor: 1536
    Active Warps per Multiprocessor: 48
    Active Thread Blocks per Multiprocessor: 6
    Occupancy of each Multiprocessor: 100%

[Chart: Multiprocessor Warp Occupancy]

Physical Limits for GPU Compute Capability: 2.0
    Threads per Warp: 32
    Warps per Multiprocessor: 48
    Threads per Multiprocessor: 1536
    Thread Blocks per Multiprocessor: 8
    Total # of 32-bit registers per Multiprocessor: 32768
    Register allocation unit size: 64
    Register allocation granularity: warp
    Shared Memory per Multiprocessor (bytes): 49152
    Shared Memory Allocation unit size: 128
    Warp allocation granularity (for register allocation): 2

Allocation Per Thread Block (these data are used in computing the occupancy data in blue):
    Warps: 8
    Registers: 2048
    Shared Memory: 1024

Maximum Thread Blocks Per Multiprocessor (the limiting factor is highlighted):
    Limited by Max Warps / Blocks per Multiprocessor: 6
    Limited by Registers per Multiprocessor: 16
    Limited by Shared Memory per Multiprocessor: 48
    Thread Block Limit Per Multiprocessor: 6


Click Here for detailed instructions on how to use this occupancy calculator.
For more information on NVIDIA CUDA, visit http://developer.nvidia.com/cuda

Your chosen resource usage is indicated by the red triangle on the graphs. The other data points represent the range of possible block sizes, register counts, and shared memory allocations.

[Chart: Varying Block Size. Multiprocessor Warp Occupancy vs. Threads Per Block]

[Chart: Varying Register Count. Multiprocessor Warp Occupancy vs. Registers Per Thread]

[Chart: Varying Shared Memory Usage. Multiprocessor Warp Occupancy vs. Shared Memory Per Block]

Overview

The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU.

Each multiprocessor on the device has a set of N registers available for use by CUDA program threads. These registers are a shared resource that is allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail. The size of N on GPUs with compute capability 1.0-1.1 is 8192 32-bit registers per multiprocessor. On GPUs with compute capability 1.2-1.3, N = 16384. On GPUs with compute capability 2.0, N = 32768.

Maximizing the occupancy can help to cover latency during global memory loads that are followed by a __syncthreads(). The occupancy is determined by the amount of shared memory and registers used by each thread block. Because of this, programmers need to choose the size of thread blocks with care in order to maximize occupancy. This GPU Occupancy Calculator can assist in choosing thread block size based on shared memory and register requirements.
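The two rules above can be restated in a few lines of plain Python. The function and variable names below are invented for illustration; this is not a CUDA API.

```python
# Register file sizes N (32-bit registers per multiprocessor), from the text.
N = {"1.0": 8192, "1.1": 8192, "1.2": 16384, "1.3": 16384, "2.0": 32768}

def occupancy(active_warps, max_warps_per_mp):
    """Occupancy is the ratio of active warps to the hardware maximum."""
    return active_warps / max_warps_per_mp

def launch_fits(regs_per_thread, threads_per_block, n_registers):
    """A launch fails if registers-per-thread times block size exceeds N."""
    return regs_per_thread * threads_per_block <= n_registers

print(occupancy(48, 48))                # 48 of 48 warps active -> 1.0 (100%)
print(launch_fits(8, 256, N["2.0"]))    # 8 * 256 = 2048 <= 32768 -> True
print(launch_fits(40, 1024, N["2.0"]))  # 40 * 1024 = 40960 > 32768 -> False
```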

Instructions
Using the CUDA Occupancy Calculator is as easy as 1-2-3. Change to the calculator sheet and follow these three steps.

1.) First select your device's compute capability in the green box. (Click to go there)

2.) For the kernel you are profiling, enter the number of threads per thread block, the registers used per thread, and the total shared memory used per thread block in bytes in the orange block. See below for how to find the registers used per thread. (Click to go there)

3.) Examine the blue box and the graph to the right. This will tell you the occupancy, as well as the number of active threads, warps, and thread blocks per multiprocessor, and the maximum number of active blocks on the GPU. The graph will show you occupancy for your chosen block size as a red triangle, and for all other possible block sizes as a line graph.

(Click to go there)

You can now experiment with how different thread block sizes, register counts, and shared memory usages affect your GPU occupancy.

Determining Registers Per Thread and Shared Memory Per Thread Block

To determine the number of registers used per thread in your kernel, simply compile the kernel code using the option --ptxas-options=-v to nvcc. This will output information about register, local memory, shared memory, and constant memory usage for each kernel in the .cu file. However, if your kernel declares any external shared memory that is allocated dynamically, you need to add the (statically allocated) shared memory reported by ptxas to the amount you dynamically allocate at run time to get the correct shared memory usage. An example of the verbose ptxas output is as follows:

    ptxas info : Compiling entry function '_Z8my_kernelPf' for 'sm_10'
    ptxas info : Used 5 registers, 8+16 bytes smem

Let's say "my_kernel" contains an external shared memory array which is allocated to be 2048 bytes at run time. Then our total shared memory usage per block is 2048+8+16 = 2072 bytes. We enter this into the box labeled "shared memory per block (bytes)" in this occupancy calculator, and we also enter the number of registers used by my_kernel, 5, in the box labeled registers per thread. We then enter our thread block size and the calculator will display the occupancy.

Notes about Occupancy
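The arithmetic above can be checked with a couple of lines of plain Python (the variable names are mine, for illustration only):

```python
# Total shared memory per block = dynamically allocated extern array
# plus the statically allocated smem reported by ptxas ("8+16 bytes smem").
dynamic_smem = 2048   # extern shared array, sized at kernel launch
static_smem = 8 + 16  # the two static amounts from the ptxas -v output
total_smem = dynamic_smem + static_smem
print(total_smem)     # 2072 -> enter as "shared memory per block (bytes)"
```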

Higher occupancy does not necessarily mean higher performance. If a kernel is not bandwidth bound, then increasing occupancy will not necessarily increase performance. If a kernel invocation is already running at least one thread block per multiprocessor on the GPU, and it is bottlenecked by computation and not by global memory accesses, then increasing occupancy may have no effect. In fact, making changes just to increase occupancy can have other effects, such as additional instructions, spills to local memory (which is off chip), divergent branches, etc. As with any optimization, you should experiment to see how changes affect the *wall clock time* of the kernel execution. For bandwidth-bound applications, on the other hand, increasing occupancy can better hide the latency of memory accesses, and therefore improve performance.

For more information on NVIDIA CUDA, visit http://developer.nvidia.com/cuda

Compute Capability                              1.0    1.1    1.2    1.3    2.0
Threads / Warp                                   32     32     32     32     32
Warps / Multiprocessor                           24     24     32     32     48
Threads / Multiprocessor                        768    768   1024   1024   1536
Thread Blocks / Multiprocessor                    8      8      8      8      8
Shared Memory / Multiprocessor (bytes)        16384  16384  16384  16384  49152
Register File Size                             8192   8192  16384  16384  32768
Register Allocation Unit Size                   256    256    512    512     64
Allocation Granularity                        block  block  block  block   warp
Shared Memory Allocation Unit Size              512    512    512    512    128
Warp allocation granularity (for registers)       2      2      2      2      2
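Putting the compute capability 2.0 limits together with the allocation-unit rules, the spreadsheet's occupancy arithmetic can be sketched roughly as follows. This is plain Python, not part of any CUDA toolkit; the rounding rules are my reading of the allocation rows above, and the helper names are invented.

```python
import math

# Physical limits for compute capability 2.0, from the table above.
LIMITS = {
    "threads_per_warp": 32,
    "warps_per_mp": 48,
    "blocks_per_mp": 8,
    "regs_per_mp": 32768,
    "reg_alloc_unit": 64,         # registers allocated per warp in units of 64
    "warp_alloc_granularity": 2,  # warp count rounded up for register allocation
    "smem_per_mp": 49152,
    "smem_alloc_unit": 128,
}

def round_up(x, unit):
    """Round x up to the next multiple of unit."""
    return math.ceil(x / unit) * unit

def occupancy_cc20(threads_per_block, regs_per_thread, smem_per_block):
    """Return (active blocks, active warps, occupancy) per multiprocessor."""
    warps_per_block = math.ceil(threads_per_block / LIMITS["threads_per_warp"])

    # Limit 1: warp count and the hardware cap on resident blocks.
    by_warps = min(LIMITS["blocks_per_mp"],
                   LIMITS["warps_per_mp"] // warps_per_block)

    # Limit 2: register file. On CC 2.x registers are allocated per warp,
    # rounded up to the allocation unit and warp granularity.
    regs_per_warp = round_up(regs_per_thread * LIMITS["threads_per_warp"],
                             LIMITS["reg_alloc_unit"])
    alloc_warps = round_up(warps_per_block, LIMITS["warp_alloc_granularity"])
    by_regs = LIMITS["regs_per_mp"] // (regs_per_warp * alloc_warps)

    # Limit 3: shared memory, rounded up to its allocation unit.
    smem = round_up(max(smem_per_block, 1), LIMITS["smem_alloc_unit"])
    by_smem = LIMITS["smem_per_mp"] // smem

    blocks = min(by_warps, by_regs, by_smem)
    warps = blocks * warps_per_block
    return blocks, warps, warps / LIMITS["warps_per_mp"]

# The example from the calculator sheet: 256 threads, 8 registers, 1024 bytes.
print(occupancy_cc20(256, 8, 1024))   # (6, 48, 1.0) -> 6 blocks, 100%
```

For these inputs the three per-resource limits come out to 6, 16, and 48 blocks, matching the "Maximum Thread Blocks Per Multiprocessor" figures on the calculator sheet, with the warp limit (6) as the binding constraint.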

Copyright 1993-2010 NVIDIA Corporation. All rights reserved.

NOTICE TO USER: This spreadsheet and data are subject to NVIDIA ownership rights under U.S. and international Copyright laws. Users and possessors of this spreadsheet and data are hereby granted a nonexclusive, royalty-free license to use it in individual and commercial software.

NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SPREADSHEET AND DATA FOR ANY PURPOSE. IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR IMPLIED WARRANTY OF ANY KIND. NVIDIA DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SPREADSHEET AND DATA, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SPREADSHEET AND DATA.

U.S. Government End Users. This spreadsheet and data are a "commercial item" as that term is defined at 48 C.F.R. 2.101 (OCT 1995), consisting of "commercial computer software" and "commercial computer software documentation" as such terms are used in 48 C.F.R. 12.212 (SEPT 1995) and is provided to the U.S. Government only as a commercial end item. Consistent with 48 C.F.R. 12.212 and 48 C.F.R. 227.7202-1 through 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the spreadsheet and data with only those rights set forth herein. Any use of this spreadsheet and data in individual and commercial software must include, in the user documentation and internal comments to the code, the above Disclaimer and U.S. Government End Users Notice.

For more information on NVIDIA CUDA, visit http://www.nvidia.com/cuda
