Академический Документы
Профессиональный Документы
Культура Документы
A Dissertation Presented
by
Nicholas John Moore
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the field of
Computer Engineering
Northeastern University
Boston, Massachusetts
June 2012
Abstract
Graphics processing units (GPUs) offer significant speedups over CPUs for certain
classes of applications. However, maximizing GPU performance can be a difficult
task due to the relatively high programming complexity as well as frequent hardware
changes. Important performance optimizations are applied by the GPU compiler
ahead of time and require fixed parameter values at compile time. As a result, many
GPU codes offer minimum levels of adaptability to variations among problem instances and hardware configurations. These factors limit code reuse and the applicability of GPU computing to a wider variety of problems. This dissertation introduces
GPGPU kernel specialization, a technique that can be used to describe highly adaptable kernels that work across different generations of GPUs with high performance.
With kernel specialization, customized GPU kernels incorporating both problemand implementation-specific parameters are compiled for each problem and hardware instance combination. This dissertation explores the implementation and parameterization of three real world applications targeting two generations of NVIDIA
CUDA-enabled GPUs and utilizing kernel specialization: large template matching,
particle image velocimetry, and cone-beam image reconstruction via backprojection.
iii
Starting with high performance adaptable GPU kernels that compare favorably to
multi-threaded and FPGA-based reference implementations, kernel specialization is
shown to maintain adaptability while providing performance improvements in terms
of speedups and reduction in per-thread register usage. The proposed technique offers productivity benefits, the ability to adjust parameters that otherwise must be
static, and a means to increase the complexity and parameterizability of GPGPU
implementations beyond what would otherwise be feasible on current GPUs.
iv
Acknowledgements
I would like to thank Professor Leeser for her guidance and patience over the past several years. MathWorks generously supported my studies. Together with supervision
by James Lebak, this assistance significantly enhanced the quality of my education.
Completing this work would not have been possible without the support of Catherine,
my family, and my friends.
Contents
1 Introduction
2 Background
2.1
NVIDIA CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2
In-Block Reductions . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3
11
2.4
13
2.5
17
2.6
OpenCV Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
3 Related Work
22
3.1
23
3.2
Autotuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
27
3.3
29
3.4
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
4 Kernel Specialization
33
vi
4.1
Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
39
4.2
OpenCV Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41
4.3
Trade-Offs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
43
4.4
43
4.4.1
GPU-PF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
45
4.4.2
50
5 Applications
5.1
53
5.1.1
54
5.1.2
57
5.1.3
CUDA Implementation . . . . . . . . . . . . . . . . . . . . . .
58
5.1.3.1
Numerator Stage . . . . . . . . . . . . . . . . . . . .
59
5.1.3.2
62
5.1.3.3
Other Stages . . . . . . . . . . . . . . . . . . . . . .
64
5.1.3.4
Runtime Operation . . . . . . . . . . . . . . . . . . .
65
CPU Implementations . . . . . . . . . . . . . . . . . . . . . .
66
67
5.2.1
CUDA Implementation . . . . . . . . . . . . . . . . . . . . . .
73
5.2.2
Kernel Specialization . . . . . . . . . . . . . . . . . . . . . . .
78
80
5.3.1
82
5.1.4
5.2
5.3
53
Kernel Specialization . . . . . . . . . . . . . . . . . . . . . . .
vii
5.4
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.2
6.3
82
83
Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . .
83
6.1.1
85
6.1.2
85
6.1.2.1
Template Matching . . . . . . . . . . . . . . . . . . .
86
6.1.2.2
PIV . . . . . . . . . . . . . . . . . . . . . . . . . . .
88
6.1.2.3
93
Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
94
6.2.1
Comparative Performance . . . . . . . . . . . . . . . . . . . .
94
6.2.2
97
6.2.2.1
Template Matching . . . . . . . . . . . . . . . . . . .
97
6.2.2.2
PIV . . . . . . . . . . . . . . . . . . . . . . . . . . .
98
6.2.2.3
Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
117
7.1
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.2
7.2.2
GPU-PF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
viii
7.2.3
A Glossary
122
123
125
127
129
136
140
ix
List of Figures
2.1
A graphical representation of the NVIDIA CUDA-capable GPU architecture realization. This figure appears in [36]. . . . . . . . . . . . . .
2.2
10
4.1
48
5.1
55
5.2
5.3
57
is
The actual functionality implemented by the numerator stage. B
the matrix average of B, and AC is the template with its average value
subtracted.
5.4
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
59
60
5.5
Graphical data layout of the shift area for a single tile. Regardless of
dimensions, all tiles are applied to the same shift area. A single block
of the summation kernel may only combine a subset, shown in gray,
of the template locations. . . . . . . . . . . . . . . . . . . . . . . . . .
5.6
Each thread accumulates the tile contributions for a singe shift offset,
an example of which is shown in gray. . . . . . . . . . . . . . . . . . .
5.7
61
62
67
5.8
68
5.9
70
5.10 The per-mask offset sum of squared differences similarity score, defined
in terms of the original PIV problem specification as shown in Figure 5.9. 73
5.11 Example of a set of threads striped across a masks area. . . . . . . .
75
5.12 A depiction of the warp specialization used in the PIV kernel to remove
the reduction as a bottleneck. . . . . . . . . . . . . . . . . . . . . . .
77
80
6.1
Contour plots of performance relative to the peak for each of the data
sets in Table 6.4 on the Tesla C1060. The location of peak performance
is marked with a white square. . . . . . . . . . . . . . . . . . . . . . . 115
xi
6.2
Contour plots of performance relative to the peak for each of the data
sets in Table 6.4 on the Tesla C2070. The location of peak performance
is marked with a white square. . . . . . . . . . . . . . . . . . . . . . . 116
xii
List of Tables
2.1
2.2
11
4.1
46
4.2
47
4.3
47
4.4
47
5.1
Per patient, the number of image frames, template number and size,
vertical/horizontal shift within ROI, and number of corr2() calls. . .
5.2
6.1
56
Template tiling examples for the template size associated with Patient
4 (156 116 pixels). . . . . . . . . . . . . . . . . . . . . . . . . . . .
63
87
xiii
6.2
6.3
. . . . . . . . . . . . . . . . . . . . . . .
89
The PIV problem set parameters, in terms of mask and offset counts,
used for comparing performance of the FPGA and GPU implementations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.4
PIV problem set parameters used to test the impact of mask size on
the performance of the GPU implementation. . . . . . . . . . . . . .
6.5
90
PIV problem set parameters used to test the impact of the number of
search offsets on the performance of the GPU implementation. . . . .
6.6
89
91
92
6.7
93
6.8
93
6.9
94
95
96
6.12 Cone beam backprojection results comparing the OpenMP CPU implementation with four threads to the best performing configuration
on both GPUs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
96
6.13 Template matching partial sums: performance and optimal configurations characteristics for the tiled summation kernel. RE stands for
runtime evaluated, and SK stands for specialized kernel. . . . . . . .
99
6.14 PIV GPU performance comparisons for several kernel variants across
the FPGA benchmark set. . . . . . . . . . . . . . . . . . . . . . . . . 101
6.15 PIV GPU performance data for the FPGA benchmark set, including
optimal register blocking and thread counts. . . . . . . . . . . . . . . 102
6.16 PIV GPU performance data for the varying mask size benchmark set,
including optimal register blocking and thread counts. . . . . . . . . . 104
6.17 PIV GPU performance data for the varying search benchmark set,
including optimal register blocking and thread counts. . . . . . . . . . 107
6.18 PIV GPU performance data for the varying overlap benchmark set,
including optimal register blocking and thread counts. . . . . . . . . . 108
6.19 Performance comparisons for the backprojection kernels. . . . . . . . 109
6.20 Occupancy and execution data for the C1060 on the V2 data set. . . 110
6.21 Percentage of the peak performance for the template matching application with various fixed main tile sizes and thread counts. . . . . . . 113
xv
6.22 Percentage of the peak performance for the PIV application with various fixed data register counts and thread counts. . . . . . . . . . . . 114
xvi
Listings
4.1
4.2
34
5.1
34
55
B.1 A CUDA C GPU kernel designed to demonstrate flexible kernel specialization. The kernel can be compiled both with and without specialization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
C.1 The nvcc command line used to generate the PTX in C.2. The mathTest2.cu
file contained the source of Listing B.1. . . . . . . . . . . . . . . . . . 125
C.2 The run-time adaptable PTX produced by calling nvcc on the CUDA
C source in Appendix B without any fixed parameters. . . . . . . . . 125
xvii
D.1 The nvcc command line used to generate the PTX in D.2. The mathTest2.cu
file contained the source of Listing B.1 . . . . . . . . . . . . . . . . . 127
D.2 Specialized PTX produced by calling nvcc on the CUDA C source in
Appendix B and specifying all parameters on the command line. . . . 127
E.1 Unmodified OpenCV CUDA example . . . . . . . . . . . . . . . . . . 129
F.1 Modified OpenCV CUDA example: this portion is specialized . . . . 136
F.2 Modified OpenCV CUDA example: this portion is compiled into the
host program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
G.1 Initial application refresh, Part 1 . . . . . . . . . . . . . . . . . . . . 140
G.2 Initial application refresh, Part 2 . . . . . . . . . . . . . . . . . . . . 142
G.3 Pipeline iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
G.4 Per-operation timing . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
G.5 High-level timing, Example A . . . . . . . . . . . . . . . . . . . . . . 147
G.6 High-level timing, Example B . . . . . . . . . . . . . . . . . . . . . . 147
G.7 High-level timing, Example C . . . . . . . . . . . . . . . . . . . . . . 148
xviii
Chapter 1
Introduction
General purpose computing on graphics processing units (GPGPU) has seen wide
adoption in the past few years. The promise of significant performance gains over
traditional CPUs has resulted in an ever increasing variety of problem types to be
accelerated on GPUs.
However, developing GPGPU applications is often difficult, and peak performance
is obtained by carefully accommodating particular GPU hardware characteristics.
This leads to programming practices that limit the adaptability of kernel implementations to specific problems. Compounding this issue is the rapid evolution of
GPU hardware over time. Constructing a GPGPU application around the specific
properties of one GPU model can limit the performance of an implementation on
other hardware. Together, these practices hinder code reuse and the applicability of
GPGPU to a wider range of problem instances.
This dissertation introduces kernel specialization of GPGPU kernels, specifically
CUDA-enabled GPUs from NVIDIA. Kernel specialization refers to the generation of
CHAPTER 1. INTRODUCTION
GPU binaries that are customized for the current problem and/or hardware parameters. Specifically, the approach uses developer-friendly CUDA C language kernels
and run-time calls to the CUDA compiler. Kernel specialization allows GPU kernel
implementations to achieve greater levels of adaptability to variations among both
problems and hardware while preserving the performance associated with hard-coded
approaches. It may also enable more complicated kernels to fit within the constraints
imposed by the CUDA environment.
Many important GPU optimizations, such as loop unrolling, strength reduction,
and register blocking, require static values at compile time. With GPGPU kernel
compilation typically done ahead of time, it is common practice to hard code both
problem and hardware parameters. This limits the ability of many GPGPU kernels
to adapt to a wide range of problems and GPU targets. When adaptability is required
and parameter values are not specified, performance often suffers since many crucial
performance optimizations cannot be applied by the compiler. Generation of specialized GPU binaries once parameter values are known results in higher performance
code based on the now-fixed parameters. Parameters selected for specialization can
only be changed through recompilation.
The fixed parameters used to generate customized kernels are derived in part from
the particular problem instance. However, many CUDA kernel implementations also
provide a set of implementation parameters that can be adjusted independent of
the problem parameters. These often provide an opportunity to tune kernel imple-
CHAPTER 1. INTRODUCTION
CHAPTER 1. INTRODUCTION
kernels are often required to augment libraries problem space coverage. Auto-tuning
approaches are important for determining optimal parameterizations for kernels developed with kernel specialization in mind, but may not explore the large variety of
fundamental implementation approaches that a human designer or domain-specific
tool may employ.
Contributions of this research include a methodology to write kernels once and
recompile for different parameters, along with an application framework that automates the process. Using the methodology, this research:
Demonstrates improved adaptability of GPU kernels to different problems and
architectures while offering good performance using kernel specialization
Employs register blocking and loop unrolling in a run-time adjustable fashion
with kernel specialization
Demonstrates reduced register usage with kernel specialization
Chapter 2 provides background information, and Chapter 3 discusses related research projects. Chapter 4 presents kernel specialization in more detail. The details
of the three application case studies to which kernel specialization has been applied
are described in Chapter 5. The testing scenarios and performance results for each
application are covered in Chapter 6. Chapter 7 concludes the dissertation with a
summary and plans for future work. A glossary is provided in Appendix A.
Chapter 2
Background
This section provides an overview of the challenges this research addresses. First,
Section 2.1 provides a brief description of the CUDA architectural abstraction and
programming environment provided by NVIDIA. While the principles discussed in
this dissertation apply to GPGPU in general, all experiments to date have targeted
NVIDIA CUDA-enabled GPUs. Some aspects of typical CUDA usage make writing
general code that can adapt to arbitrary incoming problems challenging, and these
issues are examined in Section 2.4.
2.1
NVIDIA CUDA
CUDA provides a software environment for writing GPGPU kernels and specifies an
abstract hardware target with large amounts of parallelism. The abstractions provided help shield the application developer from a number of low-level hardware considerations and allow NVIDIA to change the underlying implementation of the hardware abstraction, increasing peak performance over time. However, this paradigm
still requires application programmers to contend with a number of issues not of-
CHAPTER 2. BACKGROUND
ten considered when developing software for general purpose processors, including
a structured and restricted parallelism model, several distinct memory spaces, and
transferring data to and from a device that is external to the host processor. Additionally, developers have not been completely isolated from changes in the CUDA
hardware realization, which NVIDIA has been updating rapidly over time. New
GPUs have changed important hardware parameters and introduced completely new
capabilities.
In CUDA, parallelism is presented in the form of thousands of threads. To organize these threads, CUDA provides a two-level thread hierarchy composed of thread
blocks and the grid. Thread blocks can have one, two, or three logical dimensions.
The grid specifies a one, two, or, on newer GPUs, three dimensional space consisting
of identically shaped thread blocks. Together, the grid and blocks can be used to
define a large thread space defined independently for each kernel invocation but fixed
for the duration of a kernels execution.
Inter-thread communication is possible at thread block scope via a manually managed block-local memory called shared memory, which is available for reading and
writing by all threads. Block-level thread barriers allow for synchronized shared
memory access by all the threads within a block. However, large scale inter-block
communication is generally not efficient. Combined, the thread hierarchy and communication restrictions allow for the scalability of CUDA applications by constraining
kernels to relatively small and independent chunks of computation. CUDA applica-
CHAPTER 2. BACKGROUND
tions scale to larger problems by increasing the number of blocks, and newer CUDA
hardware scales performance by providing more parallel block execution resources.
This requires, however, that the given problem is amenable to the problem partitioning required by CUDA.
The amount of memory and execution resources used at the block level can affect performance and must be considered by kernel developers. NVIDIAs hardware
realization of the abstract CUDA architecture uses streaming multiprocessors (SMs)
to execute blocks, as shown in Figure 2.1. Full thread blocks are assigned to a SM,
and within the SM a thread block is mapped to groups of 32 consecutive threads,
collectively referred to as a warp. Warps are the unit of execution, with the threads
within a warp executed in a single-instruction, multiple-thread (SIMT) manner. Fast
context switching between warps and the ability to execute multiple thread blocks
within the same SM generates a significant amount of thread-level parallelism. However, kernel configurations can have a significant impact on the ability of the GPU
to execute more warps simultaneously. SMs contain a limited number of shared
resources, including registers, shared memory, and warp and block state tracking
hardware. The number of blocks simultaneously executed by a SM is limited by the
block-level resource usage required by a kernel.
The above restrictions require CUDA developers to balance a number of tradeoffs when developing GPGPU applications. However, developers concerned with
targeting a wide variety of CUDA-capable GPUs must also account for changes in
CHAPTER 2. BACKGROUND
Figure 2.1: A graphical representation of the NVIDIA CUDA-capable GPU architecture realization. This figure appears in [36].
Compute Capability 1.0 & 1.1
Date of Toolkit Support [13] June 2007
512
Max. Threads
Shared Mem.
16 KB
16
Shared Mem. Banks
8K
32-bit Registers
Max. Warps/SM
24
CHAPTER 2. BACKGROUND
they are not covered here, the CUDA environment provides a number of different
memory types, each with its own set of size and performance characteristics. Accommodating some memory access requirements (e.g. coalescing for global memory)
are fundamental to achieving good performance on CUDA-enabled devices. Many
other issues, however, are more nuanced, and while they occur less frequently, they
can also significantly degrade performance of the memory hierarchy.
2.2
In-Block Reductions
There are many fundamental parallel programming patterns. Of interest here are
parallel reductions. Reductions have been an important part of GPGPU programming since its inception, and an examination of techniques for implementing highperformance reductions with CUDA has been included with the CUDA SDK for
many releases [24]. In particular this work relies on reductions that include an associative operation, such as addition. Reductions are often applied to large amounts
of data and the usual concerns about generating parallelism across a set of thread
blocks still apply. Generally, each block will perform a moderately-sized reduction
on a subset of the data, usually producing a single output value per block. Multiple
rounds of the reduction kernel call are used, with each successive call requiring fewer
and fewer thread blocks to span the data set.
The reductions discussed in this dissertation are of moderate size, so only the
in-block behavior is relevant. Much like the multiple reduction kernel calls, multiple
CHAPTER 2. BACKGROUND
1
10
2
+
1
+
2
+
4
+
1
+
1
Figure 2.2: Example of a parallel reduction tree starting with eight elements.
reduction rounds within a block are used, as shown in Figure 2.2. Less parallelism
available after each round, with the working set, and therefore number of threads
needed, reduced by half (assuming power-of-two initial element counts). This forms
a tree where the number of levels, or rounds, in the tree is determined by the base-2
logarithm of the initial number of elements.
In NVIDIA GPUs, register memory is private and not accessible by other threads.
This requires reductions to take place through shared memory. For each level of the
tree, data is written to shared memory, a thread-synchronization barrier is executed,
and then the remaining participating threads read out the newly compacted data.
However, with CUDA, the number of synchronizations is not the same as the number
of levels in the tree. Due to the thirty-two-thread SIMT width (warp) in NVIDIA
GPUs, the threads within a warp are guaranteed to be in sync. Once the number
of elements reaches sixty-four, a single warp can finish a reduction without synchronization.
CHAPTER 2. BACKGROUND
Compute Capability
1.0 1.1
1.2 1.3
2.x
11
Register File
per SM
32 KB
64 KB
128 KB
Shared Memory
Size per SM
16 KB
16 KB
16/48 KB
Table 2.2: The amount of register and shared memory available within each streaming
multiprocessor for NVIDIA GPUs of various compute capabilities [38].
Regardless, throughout the duration of a block-level reduction, fewer and fewer
threads participate after each level of the tree. This results in a increasing number
of idle threads.
2.3
Since the introduction of CUDA, many kernel implementation techniques have been
suggested. Most have not broken with the original paradigm suggested by NVIDIA,
which emphasizes context switching among many warps. However, some recent
research has proposed new techniques that take advantage of particular traits of
NVIDIA hardware.
First, NVIDIA GPUs provide a memory hierarchy that is inverted relative to
CPUs, with more on-die memory dedicated to the register file than shared memory [54], as shown in Table 2.2. However, since shared memory does not provide
enough throughput to achieve peak performance; sourcing ALU operands from registers is mandatory to maximize computational throughput.
While this encourages increased register usage, other factors can exert downward pressure on register usage. When sufficient instruction-level parallelism is not
CHAPTER 2. BACKGROUND
12
CHAPTER 2. BACKGROUND
13
2.4
Much like traditional CPU code, typical CUDA C development practices compile
code once ahead of time. However, unlike traditional C-like languages on CPUs, the
target for compilation is often not machine executable. CUDA C is first compiled
to PTX (parallel thread execution), an assembly-level intermediate representation
for NVIDIA GPUs [40], and then, at run time, translated to the final instruction
set architecture (ISA) used by the target GPU. The separation of PTX from a fixed
ISA allows NVIDIA to make hardware changes between generations of GPUs while
preserving portability of compiled programs.
CHAPTER 2. BACKGROUND
14
CHAPTER 2. BACKGROUND
15
at compile time. For example, the compiler must know when scalars are powers of
two to strength reduce division or modulus (two relatively expensive operations on
NVIDIA GPUS) to bit-wise operations. Likewise loop counts must be fixed for the
compiler to unroll loops or implement register blocking.
While program loops are fully supported by GPGPU languages like CUDA C, they
may incur significant overhead [49]. Unrolling loops is a key CUDA performance optimization. Rolled loops need to include several instructions for loop setup, iteration,
termination condition checking, and branching, all of which introduce overhead and
reduce the ability of the GPU to take advantage of ILP.
Register blocking can be complicated by the fact that existing NVIDIA GPU
architectures cannot indirectly address registers. Fixed loop counts are required for
the CUDA C compiler to specify the use of extra registers for data and assign them
unique virtual registers in the PTX representation of a program.
A similar issue occurs with constant memory space declarations. These must have
a fixed size at compile time. The constant memory is reserved when a CUDA module
(the CUDA translation unit) is loaded, and the total amount of CUDA memory that
can be allocated across all loaded kernels is limited to 64 KB.
A fundamental issue with fixing values at compile time is limiting the ability of a
kernel to adapt to new problems without recompilation. It is possible to leave CUDA
C kernels without fixed values, forgoing the aforementioned optimizations, but this
may incur additional performance penalties. Registers may have to be dedicated to
CHAPTER 2. BACKGROUND
16
the storage of intermediate values computed from one or more adjustable parameters.
Independent parameters, either inputs or intrinsic values like thread indexes, have to
be loaded from shared memory or special registers into regular registers before they
can be used. Dynamic code without fixed values at compile time also often requires
extra run-time guards against illegal values and memory accesses.
Adaptability can also interact with preferences for hardware-friendly parameter
values. Common fundamental parallel algorithmic components are simplest in terms
of control flow at powers of two. Reductions, for example, are guaranteed to have
an even number of elements at each tree level if the initial size is a power of two.
Otherwise, extra logic to handle tree levels with odd element counts must be incorporated. All told, adaptability can contribute a significant number of non-compute
instructions to a GPU kernel.
Since CUDA C supports many features of C++ templates, they can be used to
help circumvent these restrictions while offering some level of adaptability. Templates
with template arguments for parameter values that control optimizations can be
explicitly instantiated multiple times for different fixed parameter values. A table
lookup of function or method pointers can be used to select the optimal specialization
for a given problem. This technique is useful, especially for handling multiple data
types, but has drawbacks. First, the adaptability is limited to the pre-compiled range,
and second, it can significantly increase compiled binary size. While the impact of
applying this technique to a single kernel may be limited, the increase in code size
CHAPTER 2. BACKGROUND
17
between NVIDIA GPUs of compute capability 1.3 and 2.0. Another example is the
throughput of shared memory relative to the register file decreases between GPUs
of CUDA capability 1.3 and 2.0, putting additional emphasis on effective use of the
register file in newer GPUs. These factors can result in different optimal implementation approaches between architecture generations, making performance portability
difficult.
2.5
The thread block and grid structured parallelism provided by CUDA, block-level resource limitations, the need to utilize important optimizations, and variations among
CUDA-capable GPUs all combine to create a complicated development environment
CHAPTER 2. BACKGROUND
18
that requires developers to balance many non-obvious trade-offs. One mechanism for
balancing some of these trade-offs is implementation parameters.
Implementation parameters are generally scalar values that control how a CUDA
kernel performs its computation. These are usually distinct from problem parameters
and can be independently adjusted. The shape and number of threads in a threadblock, for example, is a built-in CUDA implementation parameter that every kernel
must select. At the most basic level, the number of threads in a block trades off the
availability of block-level resources with the number of total thread blocks. A second
CUDA built-in parameter is warp size. This, however, has not been a major issue as
NVIDIA has so far kept the warp size at thirty-two threads.
In addition to the required CUDA implementation parameters, many kernel implementations add a number of other parameters. An example of this occurs with
tiling, a common kernel implementation technique that breaks a large parallel computation up into smaller chunks that are distributed among thread blocks. The size
of the tiles can control how much shared memory is required per block, potentially
accommodating increased shared memory sizes in newer GPUs. Increasing tile sizes
often assigns a larger amount of work to each block, which can be used to scale the
block-level work load to newer GPUs with more threads and resources available in
each thread block.
Another example involves adjusting the number of registers used for register blocking, which can be used to adjust the amount of work assigned to each thread. As
CHAPTER 2. BACKGROUND
19
discussed above, register blocking can improve ILP but comes at the cost of increased
register usage.
In addition to performance, changing implementation parameters can also impact
the complexity of CUDA kernel code. Shared memory, for example, can be allocated
either statically at compile time or dynamically at kernel launch time. Dynamically
allocated shared memory, however, is more complicated and error prone to use.
It is common practice, and required for certain parameters, such as register blocking levels, to fix implementation parameters for all launches of the kernel. This allows
the compiler to apply optimizations, but a single value may not be optimal across a
range of problem parameters. CUDA kernel performance is notoriously sensitive to
small variations in both problem and implementation parameters. Several research
projects have focused on investigating techniques for the optimal selection of implementation parameters across problem parameter ranges. Some of these efforts are
described in Chapter 3. There is, however, a fundamental tension between adjusting
implementation parameters for different problem and hardware parameters and the
relative importance of compile-time optimizations for GPGPU performance. This
issue is the focus of the work described in this dissertation.
2.6
OpenCV Example
Taken together, the issues discussed in this chapter produce a complex environment
that negatively affects GPGPU kernel adaptability and code reuse. Ultimately, this
CHAPTER 2. BACKGROUND
20
reduces the application and adoption of GPGPU. As a real world example of these
issues, Appendix E contains a listing from the open-source OpenCV computer vision
librarys CUDA module that implements row filtering, a basic image processing algorithm [42, 41]. The kernel is indicative of the effort required to achieve maximum
performance with CUDA while maintaining adaptability.
A number of the kernel development issues discussed in the last section appear
in the filter kernel. Lines 67 through 77 encode fixed block and grid dimensions
and other parameters in the kernel directly, which are used to control loop unrolling
counts. The preprocessor conditional selects values based on compute capability.
The macro definition on line 71 declares constant memory for storing the filter that
will be applied to the input. The constant size creates an arbitrary ceiling on the size
of filters that can be applied. This may make sense for the application it is unlikely
that a user will have a filter larger than thirty-two pixels but is also required for
meeting the CUDA restriction that the constant memory size is known at compile
time.
Many of the compile-time optimizations discussed above are present in the OpenCV
kernel. The kernel utilizes explicit instantiation of kernel variants for every supported
parameter combination, contained in the array of function pointers starting on line
164. The declaration explicitly instantiates template specializations for filter sizes
from one to thirty-two pixels, which are necessary for loop unrolling, and for each
addressing mode.
CHAPTER 2. BACKGROUND
21
The explicit specializations starting on line 348 are used to include multiple versions of the filter code based on the needed data types. Instead of relying on a
lookup table, C++ overloading is used. Versions of the lookup table are compiled for
each data type pair. All told, 800 variants of the kernel are generated and compiled
into the kernel binary. It should be noted that for each kernel version a dedicated
CPU-invoking function (linearRowFilter caller()) is also compiled, increasing
the penalty for including multiple kernel variants in a binary.
Chapter 3
Related Work
This chapter discusses a number of other projects and research that are related to the
work discussed in this dissertation. The related work can be roughly categorized as:
run-time code generation generation, autotuning, and domain-specific tools. Many
of the domain-specific tools also include an autotuning component.
In addition to these main categories, there have also been examples in the literature of kernel customization for a very specific application. Stone et al. [52] have
demonstrated the potential for improving the performance of GPU kernels through
the use of custom compilation for a specific problem instance, but did not integrate
runtime compilation of GPU kernels into their molecular orbital visualization software. The group manually generated variants of their kernels for specific problem
instances so that loops are unrolled. They report a 40% performance improvement
for the problem-specific kernels.
Linford et al. [31] used CUDA for a GPGPU implementation of large scale atmospheric chemical kinetics simulations. The basis of their GPU implementation
23
3.1
In this dissertation, kernels that are customized for a specific problem and target
GPU at run time are explored. Another approach for executing customized GPU is
generation of GPU code on the fly at run time.
The NVIDIA OptiX ray tracing engine [44] uses run-time manipulation and com-
24
pilation of PTX kernels, but requires them to be compiled offline using the CUDA
nvcc compiler. While operating only on PTX, the OptiX PTX to PTX compilation
provides a number of domain specific optimizations at runtime by analyzing both
the PTX source and the current problem. These include memory selection, inlining
object pointers, register reduction, and efforts to reduce the impact of divergent code.
Garg et al. [21] have created a framework that accepts parallelism and type annotated Python code and automatically generates GPU code targeting the AMD
Compute Abstraction Layer (CAL) [1], a low-level intermediate language and API
for AMD GPUs. An ahead-of-time compilation step converts Python functions to an
intermediate state and a runtime just-in-time (JIT) compiler generates both GPU
kernel and host control code using program information that only becomes known
at runtime. The JIT compiler can perform loop unrolling and fusion and optimizes
some memory access patterns.
Dotzler et al. [18] present a framework (JCudaMP) that takes Java code with
OpenMP parallelization pragmas and automatically generates code to run on NVIDIA
CUDA capable GPUs. A custom Java class loader converts a subset of Java/OpenMP
parallel code sections with additional JCudaMP annotations and dynamically generates C for CUDA source. JCudaMP relies on the CUDA nvcc compiler and invokes it at runtime to compile and link C for CUDA to a shared object that is
loaded by the Java virtual machine. The generated CUDA code will check for and
indicate addressing violations for exception handling and uses Javas array-of-arrays
25
multi-dimensional array organization. While tiling of problems larger than the target
GPUs global memory is supported, important aspects of CUDA performance, like
shared memory, are not. Overhead information is provided for CUDA compilation
but not for CUDA code generation.
Also starting with Java, Leung et al. [29] have modified the JikesRVM Java virtual
machine to connect to the RapidMind framework for the purpose of running Java code
on NVIDIA GPUs. After parallelization, their modifications generate RapidMind
IR to feed directly into the RapidMind back-end, which generates target specific
code. The RapidMind template libraries and front-end stage is not used. Loops are
automatically identified and parallelized, and a cost-benefit analysis is used to decide
which loops should be executed on the GPU.
Mainland and Morrisett [33] create an abstract domain-specific language, Nikola,
and embed it inside of Haskell. While significantly raising the abstraction - Nikola
is a first-order language represented in Haskell using higher-order abstract syntax and allowing the use of NVIDIA GPUs from within a pure Haskell environment, the
approach seems limited in its ability to effectively utilize many CUDA resources and
target all algorithms. Nikola automatically generates and compiles CUDA kernel
code and manages data movement for the user. It appears that CUDA functions are
compiled once at program compile time or at runtime at each invocation.
Ocelot, created by Diamos et al. [17], dynamically translates NVIDIA PTX code
to Low Level Virtual Machine (LLVM) IR. From there, a just-in-time compiler can
26
generate code for either a multi-core CPU system or NVIDIA GPUs at runtime. The
focus of the project is on restructuring CUDA-oriented threads and memory use to
match multi-core CPU platforms.
Bergen et al. [7] discuss their experiences using OpenCL in an HPC environment
creating a compressible gas dynamics simulator that runs on GPUs from multiple
vendors. They cover several middle-level abstractions they developed to increase
programmer productivity. Of particular interest, the authors use C-based macros
to compile in parameter constants to reduce memory usage and specialize kernels.
Additionally, one of two variations of the kernel is selected at run time based on
the hardware target. One variation consists of separate stages that the increases the
memory accesses to computation ratio but use fewer registers. The second variation
fuses the two stages but uses much register resources.
PyCUDA [27] explicitly attempts to address many of the same issues discussed
in this dissertation, including parameterization and GPGPU kernel flexibility. PyCUDA includes a number of Python object oriented abstractions around CUDA API
entities and provides many of the same features as the GPU Prototyping Framework, discussed in the next chapter, such as GPU memory abstractions, precision
timing, and caching of compiled GPU binaries. PyCUDA also provides higher-level
abstractions such as GPU-based numerical arrays and support a number of elementwise and reduction operations and fast Fourier transforms. In addition to the builtin operations, PyCUDA provides element-wise and reduction code generators for
27
3.2
Autotuning
Autotuning is becoming an increasingly popular approach to improving kernel performance. There are many examples in the literature regarding autotuning specific
classes of problems, but the work highlighted here focuses on more general autotuning tools. It has been employed to both improve performance and help adapt
to different problems. Choi et al. [12] have worked on sparse matrix-vector multiplication on NVIDIA GPUs. Starting with two high-performance hand-optimized
kernels, they use offline autotuning with a performance prediction model to select
28
29
ing data to registers or shared memory. Combined with autotuning to search the
transformation and parameter space, highly optimized kernels can be generated.
Using OpenMP code as a starting point, Lee and Eigenmann [28] document their
OpenMP to CUDA source to source compiler called OpenMPC. Directives and environment variables are used to guide the CUDA code generation and include a large
subset of CUDA features, but translation between OpenMP and CUDA concepts is
automatic as is selecting portions of the application code to convert to GPU kernels.
Parallel regions are selected for GPU acceleration based on the level of inter-thread
data sharing and communication, and data transfers between the host and GPU are
minimized. The authors provide tools for searching the configuration space, including tools to prune unnecessary variables from the search space and generate and test
variants. Results using the tool set on a few OpenMP benchmark applications and
algorithm kernels show that they achieve 88 percent, on average of the performance
of hand-tuned kernels.
3.3
30
31
input representation. Zhang and Mueller [58] have developed a framework for generating highly optimized 3D stencils, from a concise user-provided representation
consisting of a single equation and a number of general stencil parameters. After
code generation, an autotuning facility search for optimal kernel design and configuration and supports targeting one or more GPUs.
Also working with stencils Catanzaro et al. [10] are able to use the JIT compilation
features of Ruby and Python (based on PyCUDA) to transform concise embedded
stencil representations into GPU code. The performance, however, is slower than
hand-coded kernels, partially due to JIT compilation overhead, but some benefit is
seen from specialized kernels with fixed loop bounds. Building on this work is the
Copperhead [9] domain-specific language embedded in Python that can be used to
represent data parallel operations in the form of map, reduce, scan, gather, etc. These
are then compiled into CUDA code at run time.
3.4
Summary
The literature surveyed here relates to this research in a number of different ways.
While able to specialize GPU kernels for specific problems at run time once the
parameters are known, run time code generation tools are limited by the range of
problem domains that they understand. Likewise, by limiting a tool to a specific
problem area, such as many of the domain-specific tools, the automatic generation
of high performance GPU code becomes tractable. However, users are restricted
32
to a language subset that may not be able to represent the problem at hand. The
approach to kernel specialization presented here relies on standard CUDA C kernel
input, which can be designed to target any problem domain. While a much more
manual approach, this can be used to fill in gaps between the capabilities offered by
more restricted tools.
General autotuning tools, while highly effective, can take a long time to search the
parameter space. In addition, these tools may be limited by the variety of supported
code transformations and may never radically restructure implementations the way a
skilled human designer may. The research presented here is complementary to these
types of advanced parameter space mapping and performance estimation tools. By
using highly parameterized CUDA kernels that are specialized quickly at run time,
autotuning tools can be used to characterize the performance of a given implementation so that effective parameters can be selected quickly and used to compile a
specialized kernel.
Chapter 4
Kernel Specialization
Chapter 2 described the current state of CUDA development, with particular emphasis on the trade-offs between performance and adaptability. In this chapter, a
technique for addressing these trade-offs, kernel specialization, is introduced.
Kernel specialization is as a technique to enable greater GPGPU kernel adaptability while mitigating the associated performance overheads. Kernel specialization
involves delaying the generation of GPU binaries until problem and hardware parameters are known. This enables a number of static-value optimizations that are
important for achieving high-performance on GPUs.
From a kernel development perspective, all that is required to use kernel specialization is to write a kernel in terms of undefined constants. These constants are
provided values when the kernel is compiled. The CUDA kernel in Listing 4.1 is
designed to highlight several of the most common types of optimizations that are applied at compilation time. In the example kernel, each thread determines its global
offset in the thread space to use as a base offset to the in and out pointers. Then,
34
using a stride of argA * argB, count values from in are accumulated in acc. The
kernel assumes that the memory regions pointed to by in and out are large enough
for any accesses generated by the other input values.
global
void mathTest ( int in , int out , int argA , int argB , int
loopCount ) {
int a c c = 0 ;
1
2
3
4
5
6
7
8
9
10
11
12
13
Listing 4.1: A CUDA C GPU kernel designed to demonstrate the common ways
kernel specialization can improve the performance of kernels. This kernel is a regular
fully run-time evaluated kernel and does not rely on specialization in any way.
global
void mathTest ( int in , int out , int argA , int argB , int
loopCount ) {
int a c c = 0 ;
1
2
3
4
5
6
7
8
9
10
11
12
The kernel in Listing 4.1 is fully run-time evaluated (RE). That is, no parameters
35
are assumed to have fixed values a compile time. As discussed in Chapter 2, this
allows the kernel to adapt to any set of inputs but comes at the expense of lower
performance.
The kernel can be modified for use with kernel specialization by replacing only
a few identifiers with constants that will be specified at compile time, as shown in
Listing 4.2. Unique names are required, assuming that the original kernel prototype is not modified, and as a convention, macro values that will be specified on
the command line are in all capital letters. In the specialized kernel (SK), a fixed
value for LOOP COUNT allows the for() loop to be unrolled. Fixed values for ARG A,
ARG B, and PTR IN result in constant folding and propagation. Lastly, fixed values for
BLOCK DIM X and PTR OUT allow the use of immediate values in the generated PTX,
removing the need to load special registers or kernel arguments from shared memory.
Expanding on the specialized kernel of Listing 4.2, the sample kernel code in
Appendix B demonstrates how it is possible to create a single CUDA C kernel that can
be used in both specialized and non-specialized situations. The kernel source listing
implements the same functionality as the first two examples, but allows for individual
control over which parameters are evaluated at run time or specialized when the
kernel is compiled. It also demonstrates two basic routes for taking advantage of the
parameter values specified on the compiler command line: preprocessing directives
and constant-value macro expansion, as previously shown, or as arguments to C++
templates.
36
function as
37
forceinline
CUDA keyword function attribute guarantees inlining, even with the method call
syntax, and allows the use of the same kernel code to access values regardless of
specialization status. There is no need to access the static const struct members
directly, as is typical with C++ template compile time processing. For the same
computedStride::op() example, the specialized version returns the multiplication
of the two template arguments specified on line 52: ARG A and ARG B. Since, as template arguments, these are static, the compiler performs the multiplication during
compilation. As a constant variable with a known value, the compiler can also propagate the result of this multiplication when it is used later on in the kernel.
Using preprocessor directives to toggle specialization status is shown by the two
if/else conditionals starting on lines 60 and 67. The conditionals control inclusion of
non-specialized code dependent on run-time evaluated kernel arguments or specialized
code based on fixed values. In the later case, pointer values are expanded directly
into the code1 .
To demonstrate the significant differences produced by kernel specialization, Appendices C and D contain PTX generated from the single CUDA C kernel in Appendix B. For the listing in Appendix C, none of the parameters values were special1
Statically compiled pointer values are provided as unsigned long hexadecimal values, so casting
is needed. Single- and double-precision floating-point values can be specified on the command line
in a similar manner. See Section 4.4
38
ized, making the PTX fully run-time evaluated. In contrast, Appendix D contains
PTX for the case where every parameter was specialized and fixed at compile time.
In this case, a loop iteration count of five, and a one-dimensional block of 128 threads,
was used. The argA and argB inputs were fixed at 3 and 7, respectively. The input
pointer was set to 0x200ca0200, and the output pointer to 0x200b80000.
Comparing the two PTX samples, a number of observations are immediately
apparent. First, the specialized PTX in Appendix D has no control flow. The
for() loop is completely unrolled. The run-time adaptable kernel includes several
instructions for loop setup, iteration, termination condition checking, and branching.
As discussed in Chapter 2, loop unrolling is important on NVIDIA GPUs.
Other optimizations are also visible in the PTX samples. These include strength
reduction and constant folding and propagation. While many computed values depend on thread index values that are unique to each thread and must be generated at
run time, such as offsets from a base memory pointer, the computations often involve
common subsets dependent only on parameters that can be determined at compile
time, such as thread block, grid, or data dimensions.
In the specialized PTX, base plus offset addressing is used and fully unrolled.
The base input pointer (0x200ca0200 is 8,603,173,376 in decimal) plus an stride of
84 bytes (argA * argB * 4) is propagated through the unrolled loads, but the base
register is still dependent on a run-time variable - the thread index. A strength
reduced multiply of 128 (shift left by 7) also appears.
39
In this case, the specialization is complete. The specialized PTX kernel contains
no references to the input arguments. The kernel arguments, however, are kept to
preserve the interchangeability of the kernel in the case that various input parameters
are toggled between run-time evaluated and statically compiled variants.
4.1
Benefits
Instead of locking a kernel into one of two regimes of higher performance without
adaptability or adaptability with lower performance, kernel specialization allows for
both performance and adaptability. The static-value requirements for many important optimizations are effectively removed.
Beyond generating efficient code for a given problem, kernel specialization can
also improve the performance portability of a single GPU kernel implementation between different GPUs. Implementation parameters that are adjustable independently
from problem parameters can be used to adjust the characteristics of a kernel for a
particular device. Of particular importance are implementation parameters that impact the amount of work assigned per-block and per-thread, discussed in Chapter 2,
such as adjusting tile sizes and the number of registers used for register blocking.
Second order performance effects resulting from the interplay between problem
and hardware characteristics are possible and can also be addressed with kernel specialization. For example, problem parameters at low extremes can reduce the amount
of work and/or inherent parallelism available at the block-level to the point where
40
ILP alone is not sufficient to maximize performance. Reducing the amount of work
assigned to each thread can increase thread counts, potentially improving performance.
A less measurable benefit of kernel specialization at run time is better code reuse
and maintainability. While greater adaptability may increase initial development
time due to the need to account for a greater number of corner cases and parameter
interactions, once completed, a single GPU kernel can often be specialized for and
applied to a wide range of problems. Assumed block and grid dimensions and parameters related to differing compute capabilities often appear both in the kernel code
(usually as macro definition constants or hard-coded values) and host code. With
kernel specialization, these assumptions can be removed from kernel code, as they
are provided to the kernel when it is compiled. Similar to loop unrolling and register
blocking, kernel specialization can convert fixed size constant memory declarations
to ones that are dynamically sized. Kernel specialization also allows developers to
use the simpler static shared memory allocation syntax but have it behave like dynamically allocated shared memory.
Kernel specialization can be very powerful when combined with C++ templates
and compile-time techniques. Kernels can be further customized beyond purely numerical parameter values. Template specializations can be used to select among a
number of variants based on the specific problem parameters or hardware characteristics. An example related to the former would be selecting a different data type or
41
per-pixel comparison metric (i.e. sum of absolute differences to sum of squared differences) for a sliding window matching operation. For the later, kernel specialization
could be used to select between the * operator and the
integer multiplies. With kernel specialization, the kernels for the particular problem
and hardware instance are generated as needed instead of compiling all supported
variants ahead of time. This can significantly reduce program binary size.
Put together, a single implementation can generate optimized GPU binaries at
hardware friendly values, but also perform well on other values. Kernel implementations may include many additional implementation parameters for greater problem
and hardware adaptability without many of the associated impacts of more flexible
GPU code.
4.2
OpenCV Example
42
would be compiled with specialized values at run time. The second grouping contains
the host function that OpenCV applications invoke from the host.
The kernel code in the specialized portion is similar to the original version since
many specific template specializations were present in the original. The anchor argument was converted from run-time evaluation to compile time. Similarly, the template
arguments to linearRowFilter caller() function were replaced with tokens that
are replaced directly with the type names. Another difference is the elimination of
the preprocessor conditional that controls certain kernel launch parameters based on
the current GPUs compute capability. This logic is now isolated in the host code.
The second compilation unit contains the host code that is invoked by OpenCV.
The specialize() function accepts a source file name, a target function name, and
then key-value pairs in the form of strings. This specialize() function generates
a specialized version of the CUDA C source and returns a function handle to a
customized version of the linearRowFilter caller() function. The type specializations for the linearRowFilter gpu() host code are kept since they are used for
host-side C++ linking with the rest of the OpenCV library.
In this example, the specialize() function is designed to work with the compilation model of run-time CUDA API, where a host function is usually compiled along
with one or more kernels. The actual mechanism used in this research is described
in Section 4.4.
Notably absent from the example are the explicit template specializations used
43
to pre-compile many kernel variants. These are now generated on the fly and for any
combination of types, parameter values, and target Compute Capabilities.
4.3
Trade-Offs
While compiling kernels at run time offers a number of benefits, the enabling mechanism is itself the downside: compiling kernels incurs overhead. The delay incurred
from compiling kernels can vary significantly, depending on the platform and the
complexity of the kernel.
Despite the overhead, it is not a factor in many situations. For example, with
many long-running streaming applications processing throughput is more important
than the length of the initialization phase. The framework used in this work and
described in the next section also caches generated binaries. If the same set of parameters is encountered, the previously generated kernel can be loaded quickly (with
speed similar to loading a dynamically linked shared object). Additionally, parameters that change often between kernel invocations can be left as run-time evaluated.
Kernel specialization is more relevant for parameters that control algorithmic behavior and control flow than those that can be considered data, such as scaling factors.
4.4
44
45
4.4.1
GPU-PF
While the macro definition mechanism for kernel specialization is relatively straightforward, as is most CUDA driver-level API code, it is often verbose and error prone.
In response to this, the GPU Prototyping Framework (GPU-PF) was developed
with many host-code abstractions, including kernel specialization automation. The
framework is designed for rapidly constructing applications with streaming processing pipelines. It provides a problem and implementation parameter-focused set of
objects where resources and actions are defined in terms of parameters. Once an
application has been specified, only the values of various parameters need to be adjusted. The framework handles propagating the effects of the parameter changes
through the application. The current set of supported parameter types are listed in
Table 4.1.
Parameters are used to control the properties and behavior of both resources
and actions. Resources include data locations (memory or files) and module-based
resources like CUDA kernels. Table 4.2 lists the currently supported resource types.
The single memory reference type may refer to any memory type except for textures,
Pair
Type
Boolean
Step
46
Description
Geometry (up to three dimensions) and element
size of a memory reference.
Subrange of a memory extent with associated
stride between updates.
Period between events and delay before first occurrence.
Scalar integer parameter.
Scalar floating point parameter.
Various properties used for CUDA texture and array memory types.
A pointer value.
Three integer values. General, but commonly
used for grid and block dimensions. Individual
elements can be referenced.
Two integer values. Individual elements can be
referenced.
Data type (int32, uint8, float4, double, etc.)
True or false boolean parameter.
Self-updating parameter that iterates through a
specified range with an associated stride.
Resource Type
Module
Description
CUDA Module
Kernel
Memory reference
CUDA kernel
A generic reference to any type
of memory except texture
CUDA texture reference
Texture
47
Dependencies
Optional:
boolean, integer,
pointer, pitch, and floating point
parameters
Module
Various: see Table 4.3.
Module, memory reference, and
array traits (not required for
binding to CUDA arrays).
Memory
Type
Constant
Description
Dependencies
Array
Global
Host
CUDA Array
Pitched or linear global memory.
Malloced, CUDA pinned, memory
mapped, or user-provided host memory.
Generic reference to a subset of any
memory type. Can move subset through
the full memory reference over time.
Can be used any place a regular reference can be used.
Subset
Action
Memory copy
Kernel Execution
Description
Single function transfers data properly
according to underlying memory types
at each end point.
Kernel launch arguments and configuration.
User function
File I/O
Dependencies
2memory reference, schedule.
DataType
Host
Memory
Global
Memory
48
Integer
Module
Kernel
Transfer
Execution
Transfer
49
pipeline execution whenever parameters are updated. A separate refresh phase allows for comprehensive error checking and convenient abstractions without sacrificing performance during repeated execution. Only the subset of application resources
affected by parameter changes are updated for a given refresh, and all resource allocation takes place during the refresh phase instead of during processing. Resource
allocation includes generation of CUDA module-based resources, allocating memory,
and preparing many CUDA API arguments. The framework automatically manages
constructing the nvcc command line, building and loading the module, and extracting
any kernel, texture, or constant memory references from the kernel. Compiled GPU
binaries are cached so that the next time the same kernel configuration is required,
it does not not have to be recompiled.
During the execution phase, events are triggered based on an associated schedule
parameter that specifies a period and delay in terms of pipeline iterations. The
period delay values associated with a GPU-PF schedule parameter allow for more
complex program behaviors than those that would otherwise be possible without
a real application task graph. The C-language API provided by GPU-PF returns
opaque handles, and cleanup of all resources and memory is handled automatically
at program termination.
To aid debugging applications, the GPU-PF framework can provide a detailed
log of the actions performed and parameters used. Listings G.1, G.2, and G.3 in
Appendix G provide excerpts of the output for the template matching application
50
discussed in Section 5.1. The framework also optionally times all events using either
CUDA GPU events or host timers, depending on the operation type. This timing
information can be reported as application totals or per-operation. Listings G.4
through G.7 provide examples of both kinds of output. The GPU and application
timing results reported here use the timing facilities provided by the framework.
4.4.2
51
GPU kernels performance space. These ranges may include valid thread counts and
the number of registers assigned for register blocking, for example.
For each set of parameters, GPU performance is measured, register usage is obtained, and the output is optionally compared to a reference. The applications considered in this paper all relied on MATLAB implementations as references, simplifying
integration into the testing facility. Built around the idea that the CPU will be slow
relative to the GPU, CPU output is cached for later use, and all GPU variants and
parameter sets are tested before iterating to the next set of problem parameters.
Once benchmarking has been performed, several data reduction functions are
provided to help analyze the collected data. Beyond reporting information such as
any execution failures or output not matching the reference, a few functions provide
textual summations or graphical depictions of slices of the multidimensional performance data that can enable humans to quickly identify patterns in the data. GPU
implementation variants can also be compared head-to-head. While advanced autotuning techniques can be used to help determine parameters close to optimal given
the often highly non-linear relationship between various parameters and resulting
GPU performance, often simpler parameterization solutions will suffice.
To this end, the benchmarking facility allows users to experiment with various
configuration policies. A policy function provides a GPU and implementation parameter configuration based on the current problem and hardware parameters. Policies
can be applied to a collected performance data set, and for each problem instance the
52
Chapter 5
Applications
To explore and demonstrate the benefits of kernel specialization, three MATLABbased applications were implemented using the prototyping framework and kernel
specialization. The applications are arbitrary and large template matching, particle
image velocimetry, and cone beam backprojection.
The first two applications did not have existing GPU implementations, and for
each of these, specialization-friendly implementations were created. The cone beam
backprojection application already provided a GPU kernel implementation, to which
some specializations were applied.
5.1
The first application considered is a lung tumor tracking application written in MATLAB [14]. The goal of this work is to track lung tumors in fluoroscopic imagery in
real time without the use of visual markers surgically implanted in the body.
CHAPTER 5. APPLICATIONS
5.1.1
54
The algorithm studied for implementation in CUDA generates templates from training data and uses Pearsons correlation to generate similarity scores between templates and the current image data. While the template generation algorithm is specific to this work, 2D Pearsons correlation is a common similarity score. It represents
the bulk of the computation in the application and is the only part of the application targeted for GPU acceleration. The template generation and template-specific
processing can be done once at application setup time and does not constitute a
significant amount of processing. For any given problem instance, template data are
considered static inputs.
Two methods are used to account for the periodic and irregular variation associated with the respiratory cycle: using multiple templates and shifting the templates
over a region of interest (ROI). These provide significant accuracy improvements at
the cost of significant increases in the processing required. For each template, a similarity score must be calculated for each possible position of the template within the
ROI, and this is repeated for each frame. Scanning the resultant scores for the best
match is currently not considered. The reference application uses the corr2() function in the MATLAB Image Processing Toolbox to implement Pearsons correlation.
The definition of corr2() is shown in Figure 5.1.
A simplified MATLAB representation of the tumor tracking computation is shown
in Listing 5.1. As written, the function would be called once for each video frame,
CHAPTER 5. APPLICATIONS
55
P P
corr2(A, B) = r
P P
M
AM N
AM N A BM N B
2
2 P P
B
A
MN
M
N
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
...
Listing 5.1: MATLAB function implementing the sliding window correlation for each
template for each ROI within the current frame.
Table 5.1 shows characteristics of the reference data sets studied [14]. They
display significant variation, and none of the values are powers of two. The per
frame computational requirements vary widely between the patients and are affected
by the template size and the number of times corr2() is called. The template size
CHAPTER 5. APPLICATIONS
Patient
P1
P2
P3
P4
P5
P6
Frames
442
348
259
290
273
210
Templates
12
13
10
11
12
14
56
Template Size
53 54
23 21
76 45
156 116
86 78
141 107
Shift V/H
18/9
11/5
9/4
9/3
11/6
9/2
corr2()
Per Frame
8 436
3 289
1 710
1 463
3 588
1 330
Calls
Total
3 728 712
1 144 572
442 890
424 270
979 524
279 300
Table 5.1: Per patient, the number of image frames, template number and size,
vertical/horizontal shift within ROI, and number of corr2() calls.
increases the iteration counts associated with the double summations used in both the
numerator and the denominator of the corr2() function. The per-frame invocation
count of the corr2() function is dependent on both the number of templates as well
as the shift values, which affect the loop counts in the for loops of Listing 5.1.
In this application, the static nature of the templates allows us to eliminate
significant redundant computation. Calling the corr2() function multiple times with
the same template results in recalculating the template matrix averages for use in
the denominator and numerator. Similarly, the total template contribution to the
denominator is, for a given patient run, constant across correlations and can be
computed once and reused. However, the corresponding frame data values must be
computed for each position within the ROI and for each incoming frame. As the
template-sized sliding window traverses the ROI, the underlying frame data changes,
changing the frames contribution to both the average of the frame data and the
frames contribution to the denominator. All implementations of the sliding window
corr2() implementation, discussed later in detail, take advantage of this redundancy
to improve performance. Assuming that the template data is represented by A and
CHAPTER 5. APPLICATIONS
57
P P
C
A
B
B
M
N
corr2(A, B) = r M N M N
2
P P
B
AD
MN
M
N
is the matrix average of B, AC is the current template with the
Figure 5.2: B
template average subtracted from each value, and AD is the template contribution
to the denominator.
the frame data is represented by B, the actual implemented functionality is reduced
to the equation shown in Figure 5.2.
Several implementations of the computational core of the application have been
created: (1) using CUDA without kernel specialization, (2) using CUDA with kernel specialization, (3) using C with POSIX Threads-based multithreading, and (4)
using MATLAB. All implementations take advantage of the redundant computation
described above.
5.1.2
This particular template matching application is an interesting case study since the
data sets are created by humans and not carefully selected to match GPU architectures. Both the template size and shift area vary significantly and make it difficult
to assume general relationships between the parameters, as would be the case when
building a general purpose library of GPU kernels.
For each patient a single ROI is defined, but it is of relatively small size (between
95 and 703 total template shift positions) compared to typical linear image filtering.
This introduces concerns about generating enough parallelism, especially when each
CHAPTER 5. APPLICATIONS
58
5.1.3
CUDA Implementation
The GPU version was implemented using six CUDA kernels over four stages that
correspond to various components of the corr2() calculation: (1) frame data averages,
(2) frame data contribution to the denominator for each shift offset within the ROI,
CHAPTER 5. APPLICATIONS
59
XX
M
AC
M N BM N B
is the
Figure 5.3: The actual functionality implemented by the numerator stage. B
matrix average of B, and AC is the template with its average value subtracted.
(3) numerator, and (4) final result. The first three stages consist of two distinct
kernel types that are of nearly identical structure: a tiled kernel that computes
partial results and a reduction kernel that combines the partial results into a final
product for that stage. Since these first three stages are similar, the numerator stage
is first presented alone. Then, the differences between the numerator stage and the
two frame data statistics stages are explained, followed by the final stage.
5.1.3.1
Numerator Stage
CHAPTER 5. APPLICATIONS
60
Template
Main Tiles
Bottom Tiles
Right
Tiles
Corner
Tiles
CHAPTER 5. APPLICATIONS
61
Horizontal Shift (strided)
0xXXXXXX00
Vertical Shift
(sequential)
Figure 5.5: Graphical data layout of the shift area for a single tile. Regardless of dimensions, all tiles are applied to the same shift area. A single block of the summation
kernel may only combine a subset, shown in gray, of the template locations.
Each block of the first kernel of the numerator stage writes the output for each
tile over the tiles search space contiguously in memory, with each tiles contribution
starting at the beginning of a pitched memory segment, as shown in Figure 5.5. Each
grid square represents the partial result contributed by a single template tile sliding
over the shift area. The shift area is stored in a column-major organization, with the
gray squares representing the shift area covered by a single kernel launch.
This data layout produces a regular data pattern with the partial contributions
for consecutive tiles separated by a constant stride. Additionally, since the data is
grouped by shift area, the data layout is constant regardless of tile shape, simplifying
the inclusion of the edge case tile contributions. As a result, the second kernel of
the numerator stage, which performs a reduction sum over the contributions of each
tile, is relatively simple: each thread is assigned to a unique shift offset. Since the
CHAPTER 5. APPLICATIONS
62
Intra-thread
Access
Consectuve
Threads/Data
Tiles
Figure 5.6: Each thread accumulates the tile contributions for a singe shift offset, an
example of which is shown in gray.
reduction for each shift offset is independent, no coordination between threads is
required. Each thread simply accumulates the contributions independently, with
consecutive threads accumulating the contribution of consecutive shift offsets. The
data accesses are fully coalesced.
5.1.3.2
For the implementation to adapt to arbitrary template sizes, as is relevant for the
tumor tracking application, it must be able to handle template sizes that are not
integer multiples of any tile size, let alone a tile size that executes efficiently on the
GPU. Data padding is made difficult by the definition of the corr2() function, as
the average of the data under the window is different at each window position. This
scenario, shown in Figure 5.4, results in leftover template pixels not covered by the
CHAPTER 5. APPLICATIONS
Main Tiles
Size
Count
88
19 4
16 10 9 11
44
39 29
Right Tiles
Size
Count
84
19
16 6
9
63
Bottom Tiles
Size
Count
48
14
12 16
11
Corner Tile
Size
44
12 6
Table 5.2: Template tiling examples for the template size associated with Patient 4
(156 116 pixels).
regular set of template tiles. The main tile size chosen will affect the number and
shape of any edge tiles that are present. Table 5.2 shows the dimensions and number
of each type of tile for the 156116 pixel template size of Patient 4 for three different
main tile sizes. The included tile sizes are examples of how tile sizes affect the total
number of tiles and the presence of irregular edge case tiles. Although 4 4 pixel
tiles eliminate edge cases for Patient 4, small tile sizes do not execute efficiently on
NVIDIA GPUs. This is discussed in Section 6.2.
Blocks that are assigned tiles around the right and bottom edges of the templates
may perform the nested summation over a differently shaped area. The core computational functionality is implemented as a
device
dimensions, shift amounts, and all other parameters passed in as run time evaluated arguments. Each block determines the necessary parameters before invoking
the function. As a result, the nested for loops over these areas may have different
loop bounds that have to be evaluated at runtime, which will prevent loop unrolling
optimizations. The non-specialized versions of the CUDA kernels incur significant
performance impacts as a result.
However, through kernel specialization, we are able to maintain adaptability and
CHAPTER 5. APPLICATIONS
64
handle the edge cases while preserving the benefits of compile-time optimization over
fixed values. This application is a natural fit for compiling highly-problem specific
specialized kernels. While parameters can vary widely between patients, they are
fixed for each patient after an initial setup stage. For the kernel specialized implementation, the core computational functionality is contained within a
device
template function with nearly all parameters, including the tile size, converted to template arguments that are determined at compile time. Up to four separate explicit
specializations of the processing code are instantiated within a wrapper
global
function, one for each needed tile size. The appropriate version is selected by each
thread block, resulting in different blocks executing different kernel code.
Here, kernel specialization allows the compiler to unroll loops for both the main
and edge-case tile sizes. In addition to loop unrolling, strength reduction and compiledin constants are used to calculate a number of offsets and strides, eliminating repeated
runtime evaluation of non-compute code by each thread or block.
The second reduction kernel also has non-specialized and specialized variants.
Each thread loops over the tiles accumulating each partial result for a given shift
offset. The total number of tiles, including edge cases, is provided as a compile time
parameter so the loop may be unrolled in the specialized kernel.
5.1.3.3
Other Stages
The remaining stages share a number of similarities with the numerator stage. While
the working set requirements are not as high for the frame data average and denom-
CHAPTER 5. APPLICATIONS
65
inator calculations as the numerator kernel since no template data is needed, the
non-tiled working set for most of the patients would still exceed the available shared
memory. As a result, the frame data averages and contribution to the denominator
are each implemented as a two kernel solution similar to that of the numerator. The
frame denominator and numerator share the same reduction sum kernel, but the
frame averages reduction relies on a different kernel that also performs a normalization to produce average values.
The final kernel from the current corr2() implementation computes the fraction
and square root operation from the denominator. The reciprocal square root function
is used on the denominator data to perform the square root and convert the division
implied by the fraction into a multiplication and combine two expensive operations
into one, leaving only an additional multiplication to produce the final value for each
shift position. As is the case with the numerator stage, the final kernel is implemented
assuming the template data is precomputed and static. Since the sliding window
shift operation is not applied to the template data, the template portion of the
denominator exists as a single scalar value for each template.
5.1.3.4
Runtime Operation
The host application responsible for coordinating I/O, data transfers, and kernel execution was built using GPU-PF. Each frame is streamed onto the GPU and processed
independently during execution. Currently, only the necessary ROI data from each
frame is pushed to the GPU, which limits the size of the data transfers. To reduce
CHAPTER 5. APPLICATIONS
66
5.1.4
CPU Implementations
CHAPTER 5. APPLICATIONS
67
5.2
Kernel specialization was also applied to improve the performance of a particle image
velocimetry (PIV) application. PIV is an optical flow technique that attempts to
CHAPTER 5. APPLICATIONS
68
CHAPTER 5. APPLICATIONS
69
the velocity of the fluid. The details of the constraints depend on a number of physical
factors that change between physical setups, but in general this results in a moderate
search area for each mask. This requirement makes the scale of the search similar
to the template matching application, but the size of the template is not expected
to be nearly as large. In addition, many individual template matching operations
are performed, and the mask data, analogous to the template data in the template
matching application, is location dependent. Each individual velocity estimate in the
field will use a unique patch.
The mask associated with any velocity estimate is referred to as a sub-area, as
shown in Figure 5.9. The space of mask offsets searched is defined by the dimensions
of the ROI, called the interrogation window, which is rectangular. The search space
is just the difference in dimensions between the sub-area and interrogation window
and is always dense, meaning that a similarity score is generated for every possible
offset of the sub-area within the interrogation window. As an example, consider an
interrogation window of forty pixels square and a sub-area of thirty-two pixels square.
In this case, with the sub-area at the center of the interrogation window, the sub-area
can move up to four pixels in each direction up, right, down, and left, producing a
nine-by-nine square offset space.
In the FPGA implementation, the distribution of local searches throughout an
image pair for which velocity estimates need to be generated was assumed to be a
regular grid and was defined in terms of interrogation window parameters. Start-
CHAPTER 5. APPLICATIONS
(0,0)
70
Image
x
(x0,y0)
Interrogation Window
(u0,v0)
Sub-area
Figure 5.9: A graphical depiction of the terminology originally used for the FPGA
PIV implementation. This image is from Benniss dissertation [6].
ing from the top left corner, the distance between adjacent velocity estimates was
controlled by a parameter called overlap. Overlap, specified in pixels for each dimension, is the number of pixels that adjacent interrogation windows overlap. Continuing
with the forty-by-forty interrogation window example, an overlap of twenty in each
direction results in a grid of velocity estimates that are twenty pixels apart. In each
dimension, adjacent interrogation windows will share half of their pixels. For a given
interrogation window, the patch and location of the estimate are based on the center
of the interrogation window.
The FPGA implementation was designed with the goal of integration with an
existing PIV test bed used by the Robot Locomotion Group (RLG) at the Massachusetts Institute of Technlology, headed by Professor Russ Tedrake [47]. Based
on updated requirements from the RLG, the CUDA version, discussed in detail below,
uses a significantly different problem definition.
The new problem specification is more flexible than the FPGA implementation
CHAPTER 5. APPLICATIONS
71
and is defined in terms of mask corners and offsets. The mask corners are provided
as a list of arbitrary x- and y-coordinates representing the locations of the top left
corner of each desired mask. The offsets at which to calculate similarity scores
are also specified as a list of arbitrary x- and y-coordinates. The set of offsets are
applied to each mask. The new problem specification allows the same implementation
to perform both global and local searches under different conditions. For example,
determining a flow estimate at image-wide scales can be used to determine the overall
fluid motion. When the flow is non-turbulent and a direction can be determined
ahead of time, the search space defined by the offsets can be biased in the forward
direction. On the other hand, the same problem specification can be used to define
a PIV problem instance that includes a closely clustered set of masks and offsets
around an area of turbulent flow. This can produce better estimates in turbulent
flows.
While the new problem specification provides more flexibility, it presents a more
challenging problem for implementation with CUDA. The regular spacing of interrogation windows and dense rectangular search space create a very regular data access
pattern across interrogation windows and between consecutive mask offsets. The
regularity provides rigidly structured opportunities to minimize data traffic through
the memory hierarchy, often a key performance consideration for GPUs. Depending
on the value of the overlap parameter, a significant amount of data may be shared
between adjacent interrogation windows. Adjacent mask offsets will also share all
CHAPTER 5. APPLICATIONS
72
but one row or column of the data from the interrogation window. The FPGA
implementation takes advantage of these opportunities.
With the new problem specification, it is still be possible to examine the values
within the list of mask corners or offsets to determine whether it is possible to reuse
any data at the top of the memory hierarchy probably shared or register memory
within a thread block. However, the limitations of fast block-local memory, specifically the fact that different thread blocks cannot share data, may limit opportunities
for data sharing.
In addition to the specification of the mask and offset distributions, the FPGA
and CUDA versions differ in two other ways: similarity score and data type. The
FPGA application uses cross-correlation in the time domain. Based on guidance
from the RLG and literature [23], the similarity score for the GPU implementation
was switched to the sum of squared differences, shown in Equation 5.10. Timedomain processing was selected since Yu [57] et al. has shown that the time domain is
more computationally efficient for the problem instances of interest, due to padding
requirements. The FPGA implementation was highly parameterized, including by
bit-widths of the data type. The processing was changed from integer to singleprecision floating-point, as per newer requirements from the RLG.
Finally, not yet considered is the selection process for a given masks velocity
based on computed similarity scores. This is a simple reduction scan for each mask
corner. In the case of cross-correlation, a maximum value is desired, while for the
CHAPTER 5. APPLICATIONS
ssqd(x0 , y0 , u0 , v0 ) =
73
i=m1
X j=n1
X
i=0
j=0
Figure 5.10: The per-mask offset sum of squared differences similarity score, defined
in terms of the original PIV problem specification as shown in Figure 5.9.
sum of squared differences metric, a minimum value is optimal.
5.2.1
CUDA Implementation
The high-level design of the PIV CUDA implementation was based on several observations and guidance from the RLG. The target range for the number of mask
corners is between 200 and 2000 and up to 2000 offsets. At the same time, mask
dimensions will be between twenty-five and sixty-four pixels and always square. Note
that the computation associated with each mask is entirely independent, and while
significant, the target parameter ranges reduce the computation associated with a
single mask to a relatively small portion of the total computation.
Based on this, a mapping of one mask per block was chosen. At the low end of
the range, this produces more than enough blocks to fully utilize current NVIDIA
hardware, while at the same time generates significant potential per-thread block
parallelism except at extremely low counts of shift offsets and mask sizes. As with the
template matching application, each iteration of the summation loops is independent.
Another benefit of this mapping scheme is that the mask data becomes static for each
block. It should be possible to load the mask data to block-local memory only once.
A possible downside of this approach is discarding the ability to optimize for data
CHAPTER 5. APPLICATIONS
74
accessed by multiple masks, as described above. However, the scale of the data reuse
is highly problem instance dependent and based on the arbitrary distribution of both
the distribution of masks and corners. Going without data reuse simplifies the GPU
implementation. Reading interrogation window data through texture memory and
the L1-style caching on Fermi-architecture and later GPUs can potentially mitigate
the performance impact.
Another observation about the PIV problem as posed by the RLG is that the
reduction operation is per-mask and therefore moderate in scale no more than
2000 values. Performing this reduction at the block level is efficient, removes the need
to incur overheads associated with a second reduction kernel launch, and decreases
global memory traffic by reducing the output of the kernel to a single value per block.
Beyond a per-block reduction to find the optimal similarity score, the accumulation associated with computing the similarity score at each offset must be mapped
to a blocks threads. Based on the increasing importance of using register file over
shared memory, as discussed in Section 2.3, and that for each block the mask data is
static, register blocking the mask data was chosen, as shown in Figure 5.11. Consecutive threads load consecutive column-major mask data values into registers. While
this generates a stride for intra-thread accesses, it produces coalesced inter-thread
data access patterns for both the initial mask data load and reading ROI data from
the other video frame.
This block-level mapping decision breaks apart the mathematically associative
CHAPTER 5. APPLICATIONS
75
Thread
Set
Sample Mask
0
1
2
3
Figure 5.11: Example of a set of threads striped across a masks area.
accumulation of squared differences across the threads and keeps the threads in lockstep over the offset space. This inverts the thread mapping chosen for the large
template matching application, where threads operated in lock-step over the same
template tile data but for different shift offsets.
The PIV thread mapping decision provides sufficient per-thread work for each
offset and generates ILP. At the low end of the expected template size range of
twenty-five pixels per side, this still maps more than four pixels of mask data to each
thread for a thread block of 128 threads. This results in parallelism availability that
is less sensitive to the size of the offset space. Even down to a sixteen pixel mask,
128 threads-per-block results in two pixels per thread or four for a 64 thread block.
Given a value for the number of pixels of mask data assigned to each thread and
a number of threads per block, a thread block will cover a certain natural area.
For example, 128 threads per block at four pixels per thread results in a natural area
of 512 pixels. For larger masks, the mask is tiled and the block loops over each tile.
This tile looping occurs outside of the loop traversing the offset space. This allows
each mask tile to be accessed only once for the entire offset space.
However, looping over the offset space inside the mask tiling requires tracking
CHAPTER 5. APPLICATIONS
76
and accumulating mask tile contributions for each offset. A shared memory area
with the same number of elements as elements in the offset space is reserved for
this purpose. However, before a given mask tiles contribution for a given offset can
be accumulated, the portion calculated by each thread must be reduced to a single
value. Block-level reductions, as discussed in Section 2.2, represent a potential point
of inefficiency as fewer and fewer threads participate in each round of the reduction.
The idle threads must wait for the reduction to complete before continuing on to the
next mask offset. In the case of the PIV kernel, a reduction must be performed for
each offset for each mask tile, potentially incurring this performance impact several
thousand times during the lifetime of the kernel.
As a way to address this, the PIV kernel applies warp specialization, as described
in Section 2.3. Several different styles of warp specialization were applied to the PIV
kernel, with the three main tasks, loading data, accumulating squared differences,
and reducing accumulated data to a single value, distributed in different ways among
two- and three-stage pipelines. Each stage of the pipeline was assigned to a varying
number of warps. However, with register blocking and loop unrolling, the compute
threads in the PIV kernel produce significant instruction- and memory-level parallelism, limiting the benefit of double-buffering data loading through shared memory.
The best performance was observed with a two-stage double-buffered approach, with
one set of warps doing both the loading of new ROI data and computing the sum
square differences assigned to each thread, as shown in Figure 5.12. In this group of
CHAPTER 5. APPLICATIONS
77
Compute
Warps
Reduction
Warp
A
Time
Barrier
C
N-1
N+31
Figure 5.12: A depiction of the warp specialization used in the PIV kernel to remove
the reduction as a bottleneck.
warps, data is read directly by the threads consuming the values without a buffering
stage in shared memory. Once a mask tile at a given offset has been processed, each
thread writes its contribution for the current shift offset into shared memory.
The second warp group, consisting of a single warp, performs the reduction. Combined with double buffering, this allows the first group to immediately begin processing the next shift offset. Using a single warp for the reduction prevents the need for
any synchronization related to the reduction. When the number of threads producing intermediate results is greater than sixty-four, the warp accumulates extra data
within each thread until a standard reduction tree can be performed. A maximum
of five levels of reduction are needed.
With warp specialization, assigning multiple warps to the reduction task requires
the reduction warps to perform a sub-block synchronization. Sub-block synchronization is not well supported in the CUDA environment, requiring inline PTX, similar
to inline assembly in C/C++. It also results in more difficult debugging, but was
CHAPTER 5. APPLICATIONS
78
nonetheless successfully implemented. With relatively small reductions over an array whose element count matches the number of threads per block, the multi-warp
sub-block synchronization approach was observed to execute more slowly than the
single-warp synchronization-free alternative. With typcial block sizes, the transition
to the single warp synchronization-free stage occurs quickly, resulting in idle warps
when more than one warp participates in the reduction.
The single warp performing the reduction only accesses shared memory, and does
not compete with the other loading and computing warps, as input data for the
loading and computing warps comes from registers and global or texture memory.
The last remaining thread in the reduction tree accumulates the value into the correct
offset element in shared memory.
Once all mask tiles have been accumulated for all offsets, the final reduction scan
for the minimum value is performed. This single output is the index of the offset
coordinates that produces the optimal value.
5.2.2
Kernel Specialization
For PIV, kernel specialization is used in a number of ways to improve the performance of the run-time evaluated equivalent kernel. Key benefits offered by kernel
specialization for the PIV kernel is a combination of loop unrolling and enabling dynamic register blocking. NVIDIA GPU registers cannot be dynamically addressed by
the thread that owns them. Any loops operating over register-resident data must be
static at compile time so the specific registers can be encoded into the GPU binary.
CHAPTER 5. APPLICATIONS
79
Here, kernel specialization converts the number of registers used for register blocking
into a dynamic variable. This allows the kernel to adjust its register usage based
on the number of registers available on a target GPU and the current problem. In
addition, kernel specialization allows many other offsets and strides to be statically
compiled into the PIV kernel. These include thread identification thresholds used to
control inter-warp divergence and how many threads are in each specialized group.
The specialized kernel was derived much like the large template matching kernel,
described in Section 5.1.3.2. The run time kernel was developed first, and then the
specialized version was created by replacing kernel arguments and CUDA built-in
references with new macro names provided at compile time. Little other optimization
was applied to the kernel, including forgoing an opportunities to remove run-time
guards for cases where the natural thread block area is an integer multiple of the
mask area and to eliminate tile looping when the natural thread block area exactly
matches the full mask area.
The kernel that evaluates parameters at run time is not, for the purposes of
comparison, fully run-time evaluated. As mentioned, register blocking values must
be fixed once ahead of time. Combined with a desire to determine the optimal value
for register blocking in the run-time evaluated kernel variant, as well as to study the
impact of incorrectly choosing this value, the register blocking level is specialized for
each value, even for the run-time evaluated kernel.
CHAPTER 5. APPLICATIONS
80
Axis of
Rotation
X-ray Source
Detector
Object
Cone Beam
Orbit of Source
and Detector
5.3
The third application studied for kernel specialization is a CUDA cone beam backprojection implementation. In this case, backprojection is used to reconstruct threedimensional models of objects scanned using a series of two-dimensional imaging
projections. The projections determine the density of various parts of the interior
of an object based on the intensity of X-ray beams that pass through the object to
reach a detector on the other side.
In the case of X-ray cone beam computed tomography (CBCT), considered here
and shown in Figure 5.13, the X-ray beam is conical in shape, as opposed to the fan
beam used in standard CT imaging, where the X-ray beam is assumed to be two
dimensional. In a fan beam setup, a single row of detectors is used, while the cone
beam uses a two-dimensional arrays of detectors for each projection collected. A
greater amount of data is collected for each projection, enabling a higher-resolution
CHAPTER 5. APPLICATIONS
81
CHAPTER 5. APPLICATIONS
5.3.1
82
Kernel Specialization
The cone beam backprojection kernel provides an interesting case study for kernel
specialization. Like the OpenCV examples, the kernel was not developed with kernel
specialization in mind. The inner computation loop is too complex to be unrolled by
the GPU compiler, and while many scalar intermediate parameters are calculated,
most are data dependent and cannot be optimized to static values. Wherever possible, constant parameters are propagated and kernel-wide data-dependent optimizations are utilized. An example of the latter are scalar problem-specified parameters
that determine control flow decisions within the kernel. Based on the value of these
parameters for the current problem, only one code path is compiled into the kernel.
5.4
Summary
In this chapter, the three applications to which kernel specialization has been applied
were introduced. In the case of the first two applications, large template matching
and particle image velocimetry, the CUDA implementations were created as a part of
this research kernel specialization was included in their design. The third application,
cone beam back projection is included as an example of applying kernel specialization
to an existing GPU application. The next chapter covers a number of experiments
performed with these applications on two GPUs and includes information about the
specific tests as well as performance results.
Chapter 6
Experiments and Results
This chapter covers a number of experiments performed with the applications described in the last chapter. The details of the specific parameter ranges tested are
provided, as well as details of the systems used to perform benchmarking. Then,
results are provided, followed by a discussion and observations.
6.1
Experimental Setup
84
85
6.1.1
All results presented were generated with two NVIDIA GPUs: a Tesla C1060 and
a Tesla C2070. The C1060 is a Compute Capable 1.3 device and contains 4 GB of
RAM. It was installed on a workstation with an Intel Core2 Duo E8400 (3 GHz clock
with 6 MB of L3 cache) and 2 GB of RAM. The C2070 is a Compute Capable 2.0
device and provides 6 GB of RAM. The host machine contains an Intel Xeon W3580
(4 Nehalem cores at 3.33 GHz with 8 MB L3 cache). Both machines run 64-bit Linux
3.2 with CUDA 4.1, GCC 4.6.3, MATLAB 2012a, and CUDA driver version 295.20.
All CPU results presented here are from the Xeon-based workstation, as it has the
more powerful processor.
6.1.2
Each application studied was benchmarked on a number of different problem sets and
range of implementation parameters. The total number of different benchmarking
trials for any one of the applications can be determined by multiplying the number of
problem instances by the number of implementation parameter sets. The first two applications, large template matching and PIV, both contain a number of independent
implementation parameters, while the cone beam backprojection implementation is
only parameterized by the number of threads per block. As a result, the number
of discrete kernel configurations tested by the first two applications is significantly
86
Template Matching
For the large template matching application, the problem instances evaluated are
listed in Table 5.1 in Section 5.1.1. The data sets are real-world clinical data used by
the researchers and represent a wide range of template sizes and total computation.
Additionally, few of the parameters are those that are considered to be GPU-friendly.
The template matching numerator computation, the focus for analyzing the benefits of specialization, offers a number of implementation parameters that can be used
to further tune the processing. The first kernel, which performs the tiled corr2(),
can be run with a varying number of threads per block and main tile sizes. The
maximum number of threads per block may affect how many times the kernel must
be called to cover the search area, as the current implementation only handles one
search offset per thread. However, increasing the number of threads per block increases shared memory usage per thread, as the entire region of interest is loaded
into shared memory and more threads will cover a greater number of shift offsets.
As discussed in Section 5.1.3.2, changing the main tile size affects the number
of main tiles and can determine whether or not edge-case tiles exist in some cases.
Increasing the main tile size will use more shared memory, as the entire tile and
corresponding region of interest are loaded into shared memory. It also represents a
possible mechanism for balancing per-thread and device-wide workloads. A smaller
tile size will generate more independent blocks but reduce the amount of work each
87
Value Range
2 4 8 10 12 16
2 4 8 10 12 16
64 96 128 160 192 224 256 288 320
32 64 96 128 160
88
PIV
The PIV application was also tested over a wide variety of problem and implementation parameters. As the reference for performance of the PIV application was the
existing FPGA implementation, the same group of problem instances previously examined were also used to benchmark the GPU implementation. The configurations
were in terms of the original problem specification, and are listed in Table 6.2. Table 6.3 contains the same set of configurations in the new problem representation.
There are two groups of configurations among the ten. The first five configurations,
labeled A1 through A5, increase the image size while keeping other parameters constant. With a larger image, a regular grid with a given stride between interrogation
windows will contain more individual flow estimates to cover the entire image. The
second group, labeled B1 through B5, increases the size of the interrogation window
and overlap pixels while keeping the image and mask size constant. The overlap pixel
count grows in proportion with interrogation window size, which maintains a consis-
Image
Dimensions
320 256
512 512
1024 1024
1200 1200
1600 1200
1024 1024
1024 1024
1024 1024
1024 1024
1024 1024
Interrogation
Window Dimensions
40 40
40 40
40 40
40 40
40 40
24 24
32 32
40 40
48 48
56 56
89
Mask
Dimensions
32 32
32 32
32 32
32 32
32 32
16 16
16 16
16 16
16 16
16 16
Overlap
Counts
(20, 20)
(20, 20)
(20, 20)
(20, 20)
(20, 20)
(12, 12)
(16, 16)
(20, 20)
(24, 24)
(28, 28)
Table 6.2: The PIV problem set parameters, in terms of interrogation window and
image dimensions, used for comparing performance of the FPGA and GPU implementations.
Parameter
Set
A1
A2
A3
A4
A5
B1
B2
B3
B4
B5
Mask
Count
165
576
2500
3481
4661
7056
3969
2500
1681
1225
Mask
Offsets
81
81
81
81
81
81
289
625
1089
1681
Total
Offsets
13 365
46 656
202 500
281 961
377 541
571 536
1 147 041
1 562 500
1 830 609
2 059 225
Mask
Dimensions
32 32
32 32
32 32
32 32
32 32
16 16
16 16
16 16
16 16
16 16
Table 6.3: The PIV problem set parameters, in terms of mask and offset counts, used
for comparing performance of the FPGA and GPU implementations.
tent scenario where one-half of the area of an interrogation window overlaps with an
adjacent interrogation window in either direction. With an increasing interrogation
window size, fewer interrogation windows can fit within a constant image size. The
size of the search area also increases. In all cases, the FPGA implementation used
8-bit integers for input, while all GPU results use single-precision floating-point.
To more fully characterize the GPU implementation across the PIV problem
space, three additional problem sets were benchmarked. The first, shown in Ta-
Mask
Count
676
676
676
676
676
676
676
676
676
676
676
676
Mask
Offsets
81
81
81
81
81
81
81
81
81
81
81
81
90
Mask
Dimensions
88
11 11
16 16
25 25
32 32
43 43
48 48
57 57
64 64
75 75
88 88
96 96
Table 6.4: PIV problem set parameters used to test the impact of mask size on the
performance of the GPU implementation.
ble 6.4, varies the mask size while keeping the other parameters constant. The second
problem set, shown in Table 6.5, varies the number of offsets each mask is moved
through while keeping other values constant. In both cases, the search is dense, and
the interrogation windows are distributed so there is a fifty percent overlap in each
direction. For simplicity, the interrogation windows are organized into a regular grid,
as in the original problem specification.
Since each block processes a single mask, variations in the number of offsets and
the mask size affect how long a single block will execute. In all cases present in
the problem sets of Tables 6.4 and 6.5, 676 interrogation windows are used. This
provides more than enough independent blocks to fully saturate the 14 streaming
multiprocessors (32 CUDA Cores each) on the Tesla C2070 and the 30 multiprocessors
(8 CUDA Cores each) on the Tesla C1060. The run time should scale linearly with
the number of masks, as the A series problems in Table 6.3 is designed to test.
Parameter
Set
S1
S2
S3
S4
S5
S6
S7
S8
S9
S10
S11
Mask
Count
676
676
676
676
676
676
676
676
676
676
676
Mask
Offsets
81
169
289
441
625
841
1089
1369
1681
2025
2401
Mask
Dimensions
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
91
Interrogation
Window
Dimensions
40 40
44 44
48 48
52 52
56 56
60 60
64 64
68 68
72 72
76 76
80 80
Table 6.5: PIV problem set parameters used to test the impact of the number of
search offsets on the performance of the GPU implementation.
One additional benchmark set, shown in Table 6.6, was used to investigate interblock level performance impacts. It keeps the mask size and offset search space identical, but spaces out the interrogation windows further and further, which reduces the
common interrogation window data needed between thread blocks. Decreasing levels
of data in common between blocks puts more pressure on the block-level memories
when multiple blocks execute within the same multiprocessor, as well as the rest of
the GPU memory hierarchy not-exposed by CUDA.
Like the template matching application, the PIV implementation exposes a number of implementation-only parameters that are freely tunable for any given problem
instance. The tested parameter ranges are shown in Table 6.7. The register blocking
values adjust the amount of work each thread performs at the expense of per-thread
register usage. Increasing the register blocking value can also decrease the available
parallelism when the mask size is small. For both of these reasons, register blocking
Parameter
Set
O1
O2
O3
O4
O5
O6
O7
O8
O9
O10
Mask
Count
676
676
676
676
676
676
676
676
676
676
Mask
Offsets
81
81
81
81
81
81
81
81
81
81
Mask
Dimensions
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
Interrogation
Window
Dimensions
40 40
40 40
40 40
40 40
40 40
40 40
40 40
40 40
40 40
40 40
92
Overlap
Counts
(36, 36)
(32, 32)
(28, 28)
(25, 25)
(20, 20)
(16, 16)
(12, 12)
(8, 8)
(4, 4)
(0, 0)
Overlap
Ratio
0.9
0.8
0.7
0.625
0.5
0.4
0.3
0.2
0.1
0
Table 6.6: PIV problem set parameters used to test the impact of interrogation
window overlaps on the performance of the GPU implementation.
may be in contention with the number of threads per block.
The Main threads parameter refers to the number of threads dedicated to loading data and computing the sum of squared differences. As it was discovered that
a synchronization-free reduction with a single warp always performed faster than a
traditional wide reduction tree, the reduction that occurs for each mask offset always takes place using a single warp. When warp specialization is not used (warp
specialization itself being a Boolean implementation parameter), one warp from the
Main threads group performs the reduction while the rest wait. When warp specialization is used, an extra thirty-two threads are allocated for the reduction. Finally,
the PIV kernel variants supported reading data from the interrogation window directly through global memory or through textures. Both scenarios were tested. With
the PIV kernel, all the parameters are orthogonal, resulting in 160 implementation
parameter configurations per problem instance. With the four series of problem
configurations and two GPUs, the total number of PIV instances benchmarked was
93
Value Range
1248
32 64 96 128 160 192 224 256 288 320
32
global or textured global
enabled or disabled
Projection
Dimensions
64 60
512 768
Projection
Count
72
361
Output
Volume
64 60 50
512 512 768
The cone beam backprojection kernel was tested with two data sets, shown in Table 6.8. The data sets contained synthetically generated objects, referred to as phantoms, but designed to match data sets generated from a real world hardware scanner,
a Siemens Inveon multimodal scanner in CT mode [50]. Data set V1 was used for
testing and the dimensions of V2 reflect real-world dimensions. For these data sets,
the projections were spaced one degree apart along the orbit. The phantom data was
generated using the Image Reconstruction Toolbox that also contains the reference
MATLAB implementation. The processing performed for cone beam backprojection
is not data dependent, so artificial phantoms provide a good performance estimate.
Also in contrast to the template matching and PIV kernels, the cone beam backprojection kernel had less inherent parameterizability. In this case, the single parameterizable value was the number of threads per block, as shown in Table 6.9. Thread
94
Value Range
32 64 96 128 160 192 224 256 288 320 352 384
6.2
Results
Across the range of problem parameters tested for each application, kernels utilizing
kernel specialization (KS) produced notable improvements over the corresponding
fully run-time evaluated (RE) kernels, both in performance and register usage per
thread. Results compare the kernel variants to each other and list the best performance across the wide range of implementation parameters for each GPU and
problem pair.
6.2.1
Comparative Performance
To establish the soundness of the basic GPU implementations, they are first compared
to the performance baseline implementations. For template matching, the results
shown in Table 6.10, the reference is multi-threaded C. The GPU times provided are
those corresponding to the best performing set of implementation parameters, across
all tested for the kernel specialized implementations. The results include not only
Data Set
P1
P2
P3
P4
P5
P6
Execution Time
CPU
C1060
2456.753 900.918
181.105 239.989
568.893 179.304
2295.653 493.490
2199.486 423.610
1700.251 305.997
(ms)
C2070
393.164
73.394
73.006
212.082
170.186
126.311
95
Speedup vs. CPU
C1060
C2070
2.83
6.48
0.75
2.47
3.17
7.79
4.65
10.82
5.19
12.92
5.56
13.46
Data Set
A1
A2
A3
A4
A5
B1
B2
B3
B4
B5
96
Speedup vs. FPGA
C1060
C2070
2.04
4.62
2.22
4.70
2.12
4.51
2.65
5.68
2.57
5.37
4.34
7.30
4.34
7.38
3.66
7.36
2.97
7.29
2.17
6.44
Table 6.11: PIV performance results comparing the FPGA implementation to the
best performing CUDA implementation on two GPUs.
Data Set
V1
V2
Execution Time
CPU
C1060
320
85.869
1 929 900 209 805
(ms)
C2070
69.721
48 208
Table 6.12: Cone beam backprojection results comparing the OpenMP CPU implementation with four threads to the best performing configuration on both GPUs.
fers, so the GPU numbers also include only the kernel execution time. Both GPUs
show significant and consistent speedup over the optimized FPGA implementation.
It should be noted, however, that the C1060 results do not use warp specialization
while the C2070 results do. The impact of warp specialization is examined below.
Table 6.12 shows the cone beam backprojection results comparing the OpenMP
CPU run time with four threads to each of the two GPUs. Again, both GPU implementations use kernel specialization. While the times provided include the two
preprocessing stages, the backprojection calculations dominate the total run time. As
expected, the GPUs show significant speedup on the highly parallel backprojection
step.
6.2.2
97
With the basic soundness of the GPU implementations established, this section explores the impact of kernel specialization on the performance of the various CUDA
kernels. Here, results compare the performance of single kernels as kernel specialization is a kernel-level technique.
6.2.2.1
Template Matching
The template matching application is unique among the applications tested due to
its multi-kernel nature. However, for simplicity, only the tiled accumulation kernel
from the numerator will be examined in detail. It represents between 60 and 80
percent of the total streaming run time, depending on the data set. It is the most
complicated kernel in the application and, with the associated reduction kernel, is
called most frequently.
Table 6.13 includes average per-kernel execution time (the kernel is called for
each template for each frame) and register usage counts for the specialized and nonspecialized versions. For execution times a speedup is listed, and for register usage,
a ratio between the non-specialized and specialized per-thread register count is provided. Across the data sets a clear advantage for the kernel specialized variants is
observable.
These results are for the best performing main tile size, which is also listed. It
is interesting to note that the optimal tile dimensions seem arbitrary. With the
specialized kernels, preferences for the hardware friendly values of sixteen elements
98
PIV
While the PIV kernel consists of a single kernel, it is unique among the applications in
that it has a wide variety of implementation parameters, including two non-numerical
parameters that control design decisions: selecting data source memory and whether
or not to use warp specialization.
Table 6.14 explores the impact of these design decisions for the same ten data sets
used to compare the GPU and FPGA implementations. In each case, the execution
time corresponding to the best performing set of numerical parameters is displayed.
For each GPU, the RE Baseline row shows the kernel execution time for the kernel
using global memory and no warp specialization. Successive lines show new run
times for a modified configuration, along with speedup values. The first speedup
value is relative to the previous configuration, and the second is cumulative relative
the baseline configuration.
The table shows an interesting performance divergence between the C1060 and
C2070. In both cases, texturing improves performance when added to the run-time
evaluated kernel. This is true across the PIV benchmark results, and texturing is
P1
P2
P3
P4
P5
P6
P1
P2
P3
P4
P5
P6
Patient
Registers
SK Ratio
18
1.11
14
1.43
17
1.18
18
1.11
17
1.18
18
1.11
22
1.32
22
1.32
21
1.38
21
1.38
21
1.38
21
1.38
Tile
RE
16 4
12 2
16 4
16 16
16 12
16 12
12 10
12 2
16 4
16 10
16 16
16 12
Size
SK
88
12 2
4 12
12 16
8 16
10 16
12 10
12 2
4 16
16 12
16 8
16 8
Table 6.13: Template matching partial sums: performance and optimal configurations characteristics for the tiled
summation kernel. RE stands for runtime evaluated, and SK stands for specialized kernel.
C1060
C2070
GPU
100
used for the remainder of the results. However, adding warp specialization helps
with C2070, but decreases performance for most cases with the C1060. This effect is
examined in more detail in the discussion regarding Table 6.17.
Finally, kernel specialization helps both GPUs, producing speedups of around 2
across the tests, although it is greater for series A than series B, where the search
space for each mask is smaller.
Table 6.14 omitted information regarding the numerical implementation parameters. Table 6.15 provides the parameters for the best performing kernel configuration
for each problem. Texturing is used throughout, and warp specialization is used with
the C2070, but not the C1060.
It is worth noting that the results confirm the importance of increased register
file use on the newer C2070. While the number of masks and offsets change, the
mask size remains constant within each set of problem configurations. The mask size
appears to be the dominant factor in determining the optimal kernel configuration,
with the number of threads and data registers remaining constant within each set.
Between the A series and B series, there is a change in mask size from 32 32 (1024)
pixels to 1616 (256) pixels. In both cases, the mask size makes it easy to generate a
natural block area that covers the mask in one iteration. Transitioning from series
A to B, the optimal configuration occurs with the same number of data registers but
lower thread counts. This seems to confirm the notion that it is better to generate
high levels of ILP and register file usage than it is to add more threads.
A2
11.55
10.29
1.12
10.85
0.95
1.06
4.99
2.17
2.31
6.50
5.65
1.15
5.16
1.10
1.26
2.13
3.06
A3
48.37
43.09
1.12
45.42
0.95
1.06
20.69
2.20
2.34
26.70
23.01
1.16
21.73
1.06
1.23
8.87
3.01
A4
67.96
60.51
1.12
63.33
0.96
1.07
28.67
2.21
2.37
37.04
31.91
1.16
30.10
1.06
1.23
12.32
3.0
Data Set
A5
B1
89.77 35.88
79.99 32.19
1.12
1.11
84.28 35.03
0.95
0.92
1.07
1.02
38.31 17.58
2.20
1.99
2.04
2.34
49.13 25.21
42.37 23.66
1.16
1.07
39.98 20.11
1.06
1.18
1.23
1.25
16.78 9.46
2.93
2.67
B2
71.82
64.39
1.12
70.56
0.91
1.02
35.16
2.01
2.04
51.11
47.77
1.07
39.50
1.21
1.29
18.69
2.74
B3
106.29
95.40
1.11
95.16
1.00
1.12
49.58
1.92
2.14
69.82
65.27
1.07
53.71
1.22
1.30
25.53
2.73
B4
140.17
123.40
1.14
119.41
1.03
1.17
64.94
1.84
2.16
82.58
77.01
1.07
63.50
1.21
1.30
30.05
2.75
B5
202.30
184.03
1.10
133.10
1.38
1.52
89.09
1.49
2.27
101.14
95.37
1.06
72.64
1.31
1.39
38.32
2.64
Table 6.14: PIV GPU performance comparisons for several kernel variants across the FPGA benchmark set.
C2070
C1060
RE Baseline (ms)
+ Texturing (ms)
Speedup
+ Warp Specialization (ms)
Relative Speedup
Cumulative Speedup
+ Kernel Specialization (ms)
Relative Speedup
Cumulative Speedup
RE Baseline (ms)
+ Texturing (ms)
Speedup
+ Warp Specialization (ms)
Relative Speedup
Cumulative Speedup
+ Kernel Specialization (ms)
Cumulative Speedup
A1
3.54
3.13
1.13
3.29
0.95
1.08
1.53
2.15
2.32
1.92
1.71
1.12
1.51
1.13
1.27
0.65
2.95
A1
A2
A3
A4
A5
B1
B2
B3
B4
B5
A1
A2
A3
A4
A5
B1
B2
B3
B4
B5
Config.
Registers
SK Ratio
23
1
23
1
23
1.26
23
1.26
23
1.26
23
0.87
23
0.87
23
0.87
23
0.87
23
0.87
16
2
28
1.41
16
2
16
2
16
2
16
2
16
2
16
1.63
12
2.17
12
1.83
Register Blocking
RE
K.S
4
4
4
4
8
4
8
4
8
4
2
4
2
4
2
4
2
4
2
4
4
4
4
8
4
4
4
4
4
4
4
4
4
4
2
4
2
2
1
2
Threads
RE SK
256 256
256 256
128 256
128 256
128 256
128 64
128 64
128 64
128 64
128 64
256 288
128 128
128 256
256 256
128 256
64
64
64
64
128 96
128 128
256 128
Table 6.15: PIV GPU performance data for the FPGA benchmark set, including optimal register blocking and thread
counts.
C1060
C2070
GPU
103
Table 6.16 shows similar performance information for the M series tests, which
vary only mask size while keeping the number of masks and the search constant. The
SK Normalized column divides the specialized kernel execution time by the mask
area. Between the actual execution time and the normalized time, the PIV kernel
behaves as expected. The run time increases with mask area, but remains relatively
constant on a per-area basis. For both GPUs, kernel specialization offers performance benefits, with noticeable jumps at mask sizes that are multiples of sixteen
for the C2070 and powers of two (thirty-two and sixty-four) for the C1060. These
correspond with drops in the normalized time, which is based on the specialized kernel time. At these sizes, loop unrolling is combined with memory-hierarchy friendly
values that provide additional performance. These factors can also affect the optimal
parameterization at these mask sizes, where more register hungry variants perform
better. This is expected, as greater register blocking counts can be used to balance
the availability of mask data with higher memory hierarchy performance.
Table 6.17 shows similar data as the previous tables, but for the S series of tests,
which vary the number of search offsets for each mask, while keeping the number
of masks and mask size constant. Here, the SK Normalized column is the kernel
execution time divided by the number of search offsets. Two sets of results are
provided for the C1060, both with and without warp specialization.
For the C2070, after an amortization period, the normalized performance and
speedup are constant, as expected, showing a purely linear relationship with the
M1
M2
M3
M4
M5
M6
M7
M8
M9
M10
M11
M12
M1
M2
M3
M4
M5
M6
M7
M8
M9
M10
M11
M12
Config.
RE
0.989
1.354
2.014
4.293
6.013
12.164
13.296
19.269
23.860
32.722
45.002
53.014
1.274
2.284
3.171
9.370
11.951
24.543
27.410
38.837
47.549
65.380
91.185
106.913
Registers
SK Ratio
13
1.54
23
0.87
23
0.87
22
1
23
1
26
1.12
45
0.51
25
0.92
45
0.51
26
0.92
26
0.92
45
0.51
8
2.75
12
1.83
16
2
17
1.88
16
2
29
1.10
20
1.6
32
1
18
1.78
32
1
32
1
32
1
Register Blocking
RE
K.S
2
1
2
4
2
4
4
4
4
4
8
4
4
8
4
4
4
8
4
4
4
4
4
8
1
1
2
2
4
4
4
4
4
4
4
8
4
4
4
8
4
4
4
8
4
8
4
8
Threads
RE SK
32
64
64
32
128 64
160 160
256 256
128 256
288 96
288 288
256 128
288 256
288 256
288 288
64
64
128 64
64
64
224 160
128 256
256 256
128 192
128 64
128 256
128 64
128 128
128 128
Table 6.16: PIV GPU performance data for the varying mask size benchmark set, including optimal register blocking
and thread counts.
C1060
C2070
GPU
105
number of offsets. The contents of the for() loop iterating over the mask offset
set contains the main computation (for the compute warps) and contains the main
synchronization point between exchanging shared memory buffers with the reduction
warp.
With the C1060, however, the picture is more complicated. Both with and without warp specialization, the normalized performance and speedup decrease with an
increasing number of offsets, although it is not immediately clear why. The C1060
appears to incur overhead from warp specialization at small sizes, but the added task
parallelism soon overcomes the overhead once several hundred offsets are involved.
The warp specialized version, while not purely linear like the C2070, has a more linear relationship with the number offsets than the non-warp-specialized variant. This
implies that the decreasing performance as the number of offsets increases is due
to block-wide synchronization, as the width of per-offset reduction does not change
throughout the data set and is assumed to take constant time.
The discontinuities in the normalized performance appear to be related to GPU
occupancy values. The amount of shared memory used by the kernel grows linearly with the number of offsets, and the number of resident thread blocks per SM
drops between S6 and S7 for the non-warp-specialized implementation. With warpspecialization, the occupancy threshold changes between S8 and S9. The shared
memory requirements start out higher for the warp specialized variant as it is double
buffered and reaches an occupancy threshold at a different point. With more shared
106
memory per SM, the C2070 does not experience these problems in the tested range.
The final benchmark set tested with the PIV application is one that varies the
overlap between adjacent interrogation windows. The results of these tests are shown
in Table 6.18. Here, only the non-warp specialized variant is shown for the C1060,
while the C2070 uses warp specialization. The SK Normalized column shows the
execution time relative to the O1 execution time.
As can be seen from the data, there is little impact on kernel performance for
either the C1060 or C2070. With a dense search space there is high data locality
between consecutive search offsets, which is likely significantly more important than
sharing data between interrogation windows, which are all assigned to different thread
blocks.
6.2.2.3
The backprojection kernel results are enumerated in Table 6.19. Even though there
are fewer implementation and data set parameters, there is an interesting result: for
the larger data set, the older C1060 GPU reports slower results when using kernel
specialization. This appears to be a result of higher occupancy decreasing caching
performance. The backprojection kernel uses only global memory for reads, despite
the high data locality. As a newer GPU of Compute Capability 2.0, the C2070 GPU
has an L1 cache in addition to shared memory, which appears to effectively cache the
needed data.
On the other hand, the C1060 does not have an automatically managed L1 cache,
S1
S2
S3
S4
S5
S6
S7
S8
S9
S10
S11
S1
S2
S3
S4
S5
S6
S7
S8
S9
S10
S11
S1
S2
S3
S4
S5
S6
S7
S8
S9
S10
S11
Config.
RE
6.009
12.434
21.187
32.268
45.699
61.470
79.575
99.999
122.816
147.998
175.444
11.955
24.816
42.589
64.500
91.499
124.524
166.199
209.058
260.200
417.971
499.351
12.678
26.273
44.776
68.247
96.702
131.942
173.047
217.170
266.570
321.537
383.775
RE
23
23
23
23
23
23
23
23
23
23
23
32
32
32
32
32
32
32
32
32
32
32
39
39
39
39
39
39
27
39
27
27
27
Registers
SK Ratio
23
1
23
1
23
1
23
1
23
1
23
1
23
1
23
1
23
1
23
1
23
1
16
2
16
2
16
2
16
2
16
2
28
1.14
28
1.14
16
2
16
2
16
2
16
2
16
2.44
16
1.83
16
2
16
1.88
16
2
16
1.10
16
1.6
16
1
16
1.78
16
1
16
1
Register Blocking
RE
K.S
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
8
4
8
4
4
4
4
4
4
4
4
8
4
8
4
8
4
8
4
8
4
8
4
4
4
8
4
4
4
4
4
4
4
Threads
RE SK
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
128 256
128 256
128 256
128 256
128 256
128 128
256 128
256 256
256 256
256 256
256 256
160 256
160 256
160 256
160 256
160 256
160 256
160 256
160 256
256 256
256 256
256 256
Table 6.17: PIV GPU performance data for the varying search benchmark set, including optimal register blocking and
thread counts.
specialization
warp
with
C1060
specialization
warp
without
C1060
specialization
warp
with
C2070
GPU
O1
O2
O3
O4
O5
O6
O7
O8
O9
O10
O1
O2
O3
O4
O5
O6
O7
O8
O9
O10
Config.
RE
5.973
5.994
6.003
6.011
6.010
6.012
6.016
6.016
6.0261
6.048
11.808
11.953
11.954
11.945
11.955
11.955
11.971
12.161
12.116
12.166
Registers
SK Ratio
13
1.54
23
0.87
23
0.87
22
1
23
1
26
1.12
45
0.51
25
0.92
45
0.51
26
0.92
16
2.75
16
1.83
16
2
16
1.88
16
2
16
1.10
16
1.6
16
1
16
1.78
16
1
Register Blocking
RE
K.S
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
Threads
RE SK
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
128 256
128 256
128 256
128 256
128 256
128 256
128 256
128 256
128 256
128 256
Table 6.18: PIV GPU performance data for the varying overlap benchmark set, including optimal register blocking and
thread counts.
C1060
C2070
GPU
GPU
C2070
C1060
Config
V1
V2
V1
V2
RE
32
32
34
34
109
Registers
KS Ratio
26
1.23
24
1.33
25
1.36
21
1.62
Thread
RE
320
160
128
224
Count
KS
160
192
320
32
Execution
RE
187 434
209 185
197 427
206 569
200 387
210 942
175 789
185 377
210 850
201 404
203 691
205 907
Time (ms)
KS
188 807
209 923
210 678
209 529
219 732
217 779
208 039
196 602
213 612
217 476
212 521
218 412
Blocks/SM
RE
KS
6
8
6
8
3
5
3
5
2
4
2
4
1
2
1
2
1
2
1
2
1
2
1
2
110
Warps/SM
RE
KS
6
8
12
16
9
15
12
20
10
20
12
24
7
14
8
16
9
18
10
20
11
22
12
24
Table 6.20: Occupancy and execution data for the C1060 on the V2 data set.
first data set, kernel specialization does provide a benefit.
6.3
Analysis
There are a number of overall conclusions that can be made across the set of applications and both GPUs. First, kernel specialization provides significant performance
improvements. Execution times are noticeably lower. These benefits are derived
on top of good baseline implementations of real-world scientific applications of nontrivial complexity. Simpler kernels may provide even greater performance improvements.
Both the PIV and cone beam backprojection application results show that kernel specialization provides significant benefits beyond those associated with loop unrolling. In the case of the PIV application, the run-time evaluated kernel was allowed
to cheat so that it could take advantage of variable register blocking. This results
in static unrolled loops over the core data loading and computation loop within the
111
PIV kernel. Despite this advantage, the fully specialized kernel still provides significant performance advantages over a wide range of parameters. Constant folding
and propagation, as well as some additional strength reduction add significantly to
performance.
At the same time, kernel specialization generally provides significant reductions
in the number of registers per thread used by a given kernel, assuming all other configuration values are equal. This has two important benefits: 1) When not otherwise
constrained, fewer registers per thread may result in more active warps per streaming
multiprocessor. This may allow the hardware to achieve better performance despite
high latencies. 2) It enables more sophisticated and resource intensive kernels to fit
within the same budget.
That second factor is important for another of the key advantages offered by
kernel specialization: increased adaptability without a performance penalty. The 51
data sets chosen cover a wide range of irregular parameters. Kernel specialization
provides significant performance improvements over the high-performing baseline implementations by removing overheads associated with flexibility.
Related to this is the ability to include additional implementation parameters for
further specializing the behavior of a given kernel. From the data in the various
tables in this section, a wide range of implementation parameters, across different
problems and the two GPU devices, are required to achieve optimal performance.
In contrast to the tables previously discussed, which reflect the best performing
112
set of specialized and run-time evaluated kernel parameters, Tables 6.21 and 6.22
show the performance of a single fixed and specialized configuration against the
highest performing configuration on a given data set for the template matching and
PIV applications, respectively. These scenarios represent typical CUDA development
practices where a single set of compile-time (specialized) parameters are used regardless of the incoming problem parameters or target hardware characteristics, other
than compute capability.
Table 6.21 shows the relative performance of various fixed main tile sizes and
thread counts compared to the optimal values for each tested data set. Similarly,
Table 6.22 shows the performance of several sets of fixed PIV implementation parameters against the best performing PIV configuration for the M series of problem
configurations. Between requirements for register blocking and the need to leverage important performance optimizations, these implementation parameters would
likely be fixed when the kernel was compiled ahead of time in a scenario without
kernel specialization. Both tables use values that are typically seen in GPGPU applications. With kernel specialization, there is no longer a need for fixed parameter
values, allowing unique values to be used for each problem and GPU.
In both cases, it is possible to select a configuration with reasonable performance.
For the C2070, a fixed configuration can achieve about 85 and 80 percent of peak kernel performance for the template matching and PIV applications, respectively. For
the C1060, the same values are about 80 and 90 percent, respectively. While accept-
Data Set
P1
P2
P3
P4
P5
P6
Average
Minimum
C2070
128 Threads
256 Threads
8 8 16 16 8 8 16 16
0.60
0.19
0.84
0.53
0.34
0.17
0.85
0.55
0.69
0.40
0.94
0.73
0.59
0.62
0.88
0.96
0.69
0.49
0.73
0.57
0.87
0.91
0.87
0.91
0.63
0.46
0.86
0.71
0.34
0.17
0.73
0.53
113
C1060
128 Threads
256 Threads
8 8 16 16 8 8 16 16
0.37
0.21
0.42
0.35
0.41
0.24
0.85
0.55
0.70
0.48
0.96
0.75
0.68
0.67
0.86
0.95
0.68
0.59
0.77
0.70
0.90
0.93
0.90
0.93
0.62
0.52
0.79
0.71
0.37
0.21
0.42
0.35
Table 6.21: Percentage of the peak performance for the template matching application with various fixed main tile sizes and thread counts.
able on average, in each case the minimum relative performance is significantly lower,
demonstrating the benefits of making these implementation parameters adjustable.
These results, as well as others in this chapter, further reinforce the complexity
of development on NVIDIA CUDA hardware. Performance can be unpredictable
and highly sensitive to variation. The fixed configurations in Tables 6.21 and 6.22
represent several possible intuitive kernel configurations that only sometimes correlate with the optimal configuration. To further demonstrate the non-predictability
of CUDA kernel parameterization, Figures 6.1 and 6.2 show the relative performance
of the PIV kernel on C1060 and C2070, respectively, across the M series tests, where
only the mask size is changing. Each plot shows performance and is scaled independently between zero and one, with the best combination of register blocking and
thread count marked with a white square.
As is visible from both the figures and the preceding tables, kernel performance is
highly non-linear and dependent on the correct selection of implementation parameter
values. As more free implementation parameters are available, manual selection of
64 Threads
0.89
0.93
1.00
0.69
0.65
0.67
0.63
0.78
0.58
0.72
0.68
0.58
0.73
0.58
0.89
0.94
1.00
0.78
0.82
0.80
0.89
0.91
0.87
0.90
0.90
0.88
0.88
0.78
4 Data Registers
128 Threads 256
0.62
0.65
0.69
0.70
0.85
0.90
0.78
0.99
0.76
0.97
0.89
0.76
0.80
0.62
0.68
0.83
0.89
0.85
0.92
0.93
0.96
0.98
0.98
0.98
0.97
0.97
0.91
0.68
Threads
0.35
0.36
0.39
0.77
1.00
1.00
0.73
0.97
0.77
1.00
1.00
0.77
0.76
0.35
0.35
0.48
0.57
0.90
1.00
0.99
0.91
0.96
1.00
0.98
0.99
0.99
0.84
0.35
64 Threads
0.94
0.93
0.72
0.56
0.93
0.54
0.46
0.57
0.84
0.52
0.49
0.83
0.69
0.46
0.76
0.85
0.91
0.83
0.93
0.90
0.98
1.00
0.99
1.00
0.99
0.99
0.93
0.76
8 Data Registers
128 Threads 256
0.67
0.67
0.68
0.82
0.71
0.77
0.64
0.81
1.00
0.75
0.72
1.00
0.77
0.64
0.41
0.60
0.63
0.92
0.98
0.98
0.94
0.97
1.00
0.99
1.00
1.00
0.87
0.41
Threads
0.34
0.35
0.37
0.66
0.62
0.74
0.56
0.82
0.98
0.77
0.74
0.63
0.63
0.34
0.19
0.33
0.36
0.68
0.71
1.00
0.83
0.95
0.93
0.96
0.95
0.94
0.74
0.19
Table 6.22: Percentage of the peak performance for the PIV application with various fixed data register counts and
thread counts.
C1060
C2070
M1
M2
M3
M4
M5
M6
M7
M8
M9
M10
M11
M12
Average
Minimum
M1
M2
M3
M4
M5
M6
M7
M8
M9
M10
M11
M12
Average
Minimum
Data Set
115
Figure 6.1: Contour plots of performance relative to the peak for each of the data
sets in Table 6.4 on the Tesla C1060. The location of peak performance is marked
with a white square.
optimal values becomes speculative and error prone. Autotuning techniques, a main
focus of many other related research projects, are therefore highly complimentary to
kernel specialization.
In this chapter, a number of experiments and their results were presented. Kernel
specialization was shown to have a number of important advantages for GPGPU
kernels. In the next chapter, plans for future work are discussed.
116
Figure 6.2: Contour plots of performance relative to the peak for each of the data
sets in Table 6.4 on the Tesla C2070. The location of peak performance is marked
with a white square.
Chapter 7
Conclusions and Future Work
In this chapter, ideas for future work and conclusions are presented.
7.1
Conclusions
While GPGPU computing can provide significant performance advantages over general purpose processors, it often comes at the expense of either adaptability or maximum performance. Static-value optimizations applied at compile time require choosing between these objectives. This dissertation has explored kernel specialization, a
technique where compilation is delayed until fixed parameter values are known. By
lowering the penalty associated with increased parameterization, kernel specialization allows for a single GPU implementation to offer greater adaptability without
sacrificing performance.
Using several real-world applications, kernel specialization was examined from
two angles: improved performance and reduction in per-thread register usage for
a given level of adaptability as well as the importance of adjusting normally static
implementation parameters in optimizing performance. For the first, kernels belong-
118
ing to non-trivial case studies, aided by kernel specialization, were shown to exhibit
adaptability not only to a wide range of problem parameters but also to two different NVIDIA GPU generations. The maximum performance observed for each data
set occurred with a different set of implementation parameters. Maximum performance can only be achieved with the ability to dynamically adjust implementation
parameters that are usually considered static.
Combined with autotuning, which can be used to help select optimal values from
an implementation parameter space that may be extremely large, kernel specialization can be used to create libraries of highly-parameterized GPGPU kernel implementations that can be effectively applied across different hardware and problem
configurations. This is of particular use in problem areas not covered by emerging
domain-specific tools.
7.2
Future Work
There are three main areas in which future research efforts will be applied: improving
the existing GPU applications covered in this dissertation, adding new capabilities
to GPU-PF, and further exploring the characteristics of kernel specialization.
7.2.1
Existing Applications
First, there are possibilities for improving the performance of the existing GPU applications. In the template matching application, the non-numerator stages, which
represent approximately 20 to 40 percent of the GPU run time could benefit from the
119
same kernel specialization optimizations that were applied to the numerator. The
application may also benefit from utilizing more of the register file instead of shared
memory for data storage. For the PIV kernel, implementing data reuse through
shared memory could improve performance. Additionally, more time could be spent
studying warp specialization, as the current CUDA tools provide little insight into
the relative performance of different warp groups. Finally, the cone beam backprojection kernel could be augmented to use texture memory. It is likely that there are
additional opportunities for using kernel specialization to improve the performance
of the kernel.
7.2.2
GPU-PF
Beyond the CUDA GPU implementations, there are a number of possible enhancements to the GPU Prototyping Framework. Of primary interest is creating an
OpenCL back end. Currently, the framework uses the driver-level CUDA API, which
should facilitate a port as it is more similar to OpenCL than the higher-level run-time
API. OpenCL presents a possible advantage for kernel specialization: the OpenCL
specification includes an API for compiling the C-like syntax of OpenCL kernels at
runtime. OpenCL support would enable targeting a wider variety of GPU vendors
and platforms. This may soon be possible with the CUDA language, as NVIDIA
has opened up an increasing portion of the CUDA development tools to third party
modification. Beyond additional platforms, CUDA evolution may also include better support for generating specialized binaries at run time without having to call
120
7.2.3
Kernel Specialization
While kernel specialization has been shown to be beneficial for both performance
and register usage reduction, additional work in this area could be done. There may
be major advantages between the benefits of fixing some parameters over others.
Additional experiments with subsets of optimizations enabled and disabled would
help elucidate the contours of this space. As mentioned, some key performance
121
optimizations require fixed values, like loop unrolling and register blocking. However,
others, such as pointer value inlining, may be less important. There is a natural tradeoff between kernel specializations and adaptability without recompilation. If pointer
or other values change frequently, avoiding recompilation may outweigh the benefits
of kernel specialization. Fixing as many parameters as possible for a given usage
scenario is a key objective.
Finally, more advanced GPU binary generations techniques, such as fusing multiple kernels into one, could be examined. With ever increasing support for a C++
class and template features, more advanced compile time specialization than what
was studied in this dissertation could be investigated. There is room for increasing
the adaptability of a single kernel compilation unit to different algorithms, problem
instances, and hardware configurations. In addition, developing a library of specialization abstractions that allow toggling the specialization state of a particular
variable would improve the usability of kernel specialization.
This research would be driven by applying kernel specialization to a wider variety
of applications and platforms. A larger corpus of knowledge would help inform the
benefits, limitations, and trade-offs of kernel specialization.
Appendix A
Glossary
Abbreviation
CT
FPGA
GPU
GPU-PF
GPGPU
ILP
IR
IRT
JIT
PIV
PTX
RCL
RE
RLG
ROI
SK
SM
TLP
Term
computed tomography
field programmable gate array
graphics processing unit
GPU Prototyping Framework
general purpose GPU computing
instruction level parallelism
intermediate representation
Image Reconstruction Toolbox
just-in-time
particle image velocimetry
Parallel Thread Execution; NVIDIA CUDAs IR
Reconfigurable Computing Laboratory
run-time evaluated
Robot Locomotion Group
region of interest
specialized kernel
streaming multiprocessor
thread level parallelism
122
Appendix B
Flexibly Specializable Kernel
Listing B.1: A CUDA C GPU kernel designed to demonstrate flexible kernel specialization. The kernel can be compiled both with and without specialization.
1
#include g p u F u n c t i o n s . cuh
2
3
4
5
#i f n d e f CT LOOPS
#define CT LOOPS 0
#endif
6
7
8
9
#i f n d e f LOOPS COUNT
#define LOOPS COUNT 0
#endif
10
11
12
13
#i f n d e f CT ARGS
#define CT ARGS 0
#endif
14
15
16
17
#i f n d e f ARG A
#define ARG A 0
#endif
18
19
20
21
#i f n d e f ARG B
#define ARG B 0
#endif
22
23
24
25
#i f n d e f CT BLOCK DIMS
#define CT BLOCK DIMS 0
#endif
26
27
28
29
#i f n d e f BLOCK DIM X
#define BLOCK DIM X 1
#endif
30
31
32
33
#i f n d e f BLOCK DIM Y
#define BLOCK DIM Y 1
#endif
34
35
36
37
38
#i f n d e f BLOCK DIM Z
#define BLOCK DIM Z 1
#endif
123
39
40
41
extern C {
global
42
43
44
45
46
void mathTest ( i n t
int
int
int
int
in ,
out ,
argA ,
argB ,
loopCount ) ;
47
48
49
global
void mathTest ( i n t in , i n t out , i n t argA , i n t argB , i n t loopCount ) {
int acc = 0 ;
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
#i f d e f CT PTR OUT
( ( i n t )PTR OUT + o f f s e t ) = a c c ;
#e l s e
( out + o f f s e t ) = a c c ;
#endif
return ;
}
124
Appendix C
Sample Run Time Evaluated PTX
The PTX in Listing C.2 was generated from the kernel source in Appendix B with all
parameters evaluated at run time. No compile-time specialization is used by leaving
all of the macro definitions in Listing B.1 undefined. The nvcc command line used
to generate the PTX is provided in Listing C.1.
Listing C.1: The nvcc command line used to generate the PTX in C.2.
mathTest2.cu file contained the source of Listing B.1.
The
Listing C.2: The run-time adaptable PTX produced by calling nvcc on the CUDA C
source in Appendix B without any fixed parameters.
//
// Generated by NVIDIA NVVM Compiler
// Compiler b u i l t on Thu Jan 12 1 7 : 4 6 : 0 1 2012 ( 1 3 2 6 4 0 8 3 6 1 )
// Cuda c o m p i l a t i o n t o o l s , r e l e a s e 4 . 1 , V0 . 2 . 1 2 2 1
//
. version 3 . 0
. t a r g e t sm 20
. a d d r e s s s i z e 64
.
.
.
.
.
file
file
file
file
file
1
2
3
4
5
mathTest2 . cpp3 . i
mathTest2 . cu
. . / gpui n c l u d e / g p u A t t r i b u t e s . cuh
. . / gpui n c l u d e /gpuMath . cuh
. . / gpui n c l u d e /gpuNum . cuh
125
. e n t r y mathTest (
. param . u64 mathTest
. param . u64 mathTest
. param . u32 mathTest
. param . u32 mathTest
. param . u32 mathTest
)
{
. r e g . pred %p<3>;
. reg . s32
%r <29>;
. reg . s64
%r l <12>;
l d . param . u64
l d . param . u64
l d . param . u32
c v t a . t o . global . u64
c v t a . t o . global . u64
mov . u32
mov . u32
mov . u32
mad . l o . s 3 2
c v t . u64 . u32
s e t p . gt . s 3 2
@%p1 b r a
mov . u32
bra . u n i
BB0 2 :
l d . param . u32
l d . param . u32
mul . l o . s 3 2
mov . u32
mov . u32
mov . u32
BB0 3 :
c v t . u64 . u32
add . s 6 4
s h l . b64
add . s 6 4
l d . global . u32
add . s 3 2
add . s 3 2
add . s 3 2
l d . param . u32
setp . l t . s32
@%p2 b r a
BB0 4 :
s h l . b64
add . s 6 4
s t . global . u32
ret ;
}
param
param
param
param
param
0,
1,
2,
3,
4
%r l 4 , [ mathTest param 0 ] ;
%r l 5 , [ mathTest param 1 ] ;
%r3 , [ mathTest param 4 ] ;
%r l 1 , %r l 5 ;
%r l 2 , %r l 4 ;
%r12 , %n t i d . x ;
%r13 , %c t a i d . x ;
%r14 , %t i d . x ;
%r15 , %r12 , %r13 , %r14 ;
%r l 3 , %r15 ;
%p1 , %r3 , 0 ;
BB0 2 ;
%r28 , 0 ;
BB0 4 ;
%r l 6 , %r26 ;
%r l 7 , %r l 6 , %r l 3 ;
%r l 8 , %r l 7 , 2 ;
%r l 9 , %r l 2 , %r l 8 ;
%r20 , [% r l 9 ] ;
%r28 , %r20 , %r28 ;
%r26 , %r26 , %r4 ;
%r27 , %r27 , 1 ;
%r25 , [ mathTest param 4 ] ;
%p2 , %r27 , %r25 ;
BB0 3 ;
%r l 1 0 , %r l 3 , 2 ;
%r l 1 1 , %r l 1 , %r l 1 0 ;
[%r l 1 1 ] , %r28 ;
126
Appendix D
Sample Kernel Specialized PTX
The PTX in Listing D.2 was generated from the kernel source in Appendix B for the
case where every parameter was fixed at compile time. The kernel is fully specialized.
In this example, a loop iteration count of five, and a one-dimensional block of 128
threads, was used. The argA and argB inputs were fixed at 3 and 7, respectively.
The input pointer was set to 0x200ca0200, and the output pointer to 0x200b80000.
(Kernels are compiled after memory is allocated so pointer values are known.) The
nvcc command line used to generate the PTX is provided in Listing D.1.
Listing D.1: The nvcc command line used to generate the PTX in D.2.
mathTest2.cu file contained the source of Listing B.1
The
Listing D.2: Specialized PTX produced by calling nvcc on the CUDA C source in
Appendix B and specifying all parameters on the command line.
//
// Generated by NVIDIA NVVM Compiler
// Compiler b u i l t on Thu Jan 12 1 7 : 4 6 : 0 1 2012 ( 1 3 2 6 4 0 8 3 6 1 )
// Cuda c o m p i l a t i o n t o o l s , r e l e a s e 4 . 1 , V0 . 2 . 1 2 2 1
//
127
. version 3 . 0
. t a r g e t sm 20
. a d d r e s s s i z e 64
.
.
.
.
.
file
file
file
file
file
1
2
3
4
5
mathTest2 . cpp3 . i
mathTest2 . cu
. . / gpui n c l u d e /gpuNum . cuh
. . / gpui n c l u d e /gpuMath . cuh
. . / gpui n c l u d e / g p u A t t r i b u t e s . cuh
. e n t r y mathTest (
. param . u64 mathTest param
. param . u64 mathTest param
. param . u32 mathTest param
. param . u32 mathTest param
. param . u32 mathTest param
)
{
. reg . s32
%r <20>;
. reg . s64
%r l <2>;
mov . u32
s h l . b32
mov . u32
add . s 3 2
mul . wide . u32
l d . u32
l d . u32
add . s 3 2
l d . u32
add . s 3 2
l d . u32
add . s 3 2
l d . u32
add . s 3 2
s t . u32
ret ;
}
0,
1,
2,
3,
4
%r1 , %c t a i d . x ;
%r2 , %r1 , 7 ;
%r3 , %t i d . x ;
%r4 , %r2 , %r3 ;
%r l 1 , %r4 , 4 ;
%r5 , [% r l 1 +8603173460];
%r7 , [% r l 1 +8603173376];
%r9 , %r5 , %r7 ;
%r10 , [% r l 1 +8603173544];
%r12 , %r10 , %r9 ;
%r13 , [% r l 1 +8603173628];
%r15 , %r13 , %r12 ;
%r16 , [% r l 1 +8603173712];
%r18 , %r16 , %r15 ;
[%r l 1 +8601993216] , %r18 ;
128
Appendix E
OpenCV Kernel Source
The following sample is provided as a real-world example of the coding techniques
required to achieve best performance on GPUs. The listing is from the OpenCV
computer vision librarys CUDA module and implements row filtering as it is applied
in image processing [42, 41].
/M//////////////////////////////////////////////////////////////////////////////
//
// IMPORTANT: READ BEFORE DOWNLOADING, COPYING, INSTALLING OR USING .
//
// By downloading , c o p y i n g , i n s t a l l i n g or u s i n g t h e s o f t w a r e you a g r e e t o t h i s
license .
//
I f you do n o t a g r e e t o t h i s l i c e n s e , do n o t download , i n s t a l l ,
// copy or u s e t h e s o f t w a r e .
//
//
//
L i c e n s e Agreement
//
For Open Source Computer V i s i o n L i b r a r y
//
// C o p y r i g h t (C) 2000 2008 , I n t e l C o r p o r a t i o n , a l l r i g h t s r e s e r v e d .
// C o p y r i g h t (C) 2009 , Willow Garage I n c . , a l l r i g h t s r e s e r v e d .
// C o p y r i g h t (C) 1993 2011 , NVIDIA C o r p o r a t i o n , a l l r i g h t s r e s e r v e d .
// Third p a r t y c o p y r i g h t s a r e p r o p e r t y o f t h e i r r e s p e c t i v e owners .
//
// R e d i s t r i b u t i o n and u s e i n s o u r c e and b i n a r y forms , w i t h or w i t h o u t m o d i f i c a t i o n ,
// a r e p e r m i t t e d p r o v i d e d t h a t t h e f o l l o w i n g c o n d i t i o n s a r e met :
//
//
R e d i s t r i b u t i o n s o f s o u r c e code must r e t a i n t h e a b o v e c o p y r i g h t n o t i c e ,
//
t h i s l i s t o f c o n d i t i o n s and t h e f o l l o w i n g d i s c l a i m e r .
//
//
R e d i s t r i b u t i o n s i n b i n a r y form must r e p r o d u c e t h e a b o v e c o p y r i g h t n o t i c e ,
//
t h i s l i s t o f c o n d i t i o n s and t h e f o l l o w i n g d i s c l a i m e r i n t h e d o c u m e n t a t i o n
//
and/ or o t h e r m a t e r i a l s p r o v i d e d w i t h t h e d i s t r i b u t i o n .
129
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
//
//
130
//
//
// This s o f t w a r e i s p r o v i d e d by t h e c o p y r i g h t h o l d e r s and c o n t r i b u t o r s as i s and
// any e x p r e s s or i m p l i e d w a r r a n t i e s , i n c l u d i n g , b u t n o t l i m i t e d to , t h e i m p l i e d
// w a r r a n t i e s o f m e r c h a n t a b i l i t y and f i t n e s s f o r a p a r t i c u l a r p u r p o s e a r e d i s c l a i m e d
.
// In no e v e n t s h a l l t h e I n t e l C o r p o r a t i o n or c o n t r i b u t o r s be l i a b l e f o r any d i r e c t ,
// i n d i r e c t , i n c i d e n t a l , s p e c i a l , exemplary , or c o n s e q u e n t i a l damages
// ( i n c l u d i n g , b u t n o t l i m i t e d to , procurement o f s u b s t i t u t e g o o d s or s e r v i c e s ;
// l o s s o f use , data , or p r o f i t s ; or b u s i n e s s i n t e r r u p t i o n ) however c a u s e d
// and on any t h e o r y o f l i a b i l i t y , w h e t h e r i n c o n t r a c t , s t r i c t l i a b i l i t y ,
// or t o r t ( i n c l u d i n g n e g l i g e n c e or o t h e r w i s e ) a r i s i n g i n any way o u t o f
// t h e u s e o f t h i s s o f t w a r e , even i f a d v i s e d o f t h e p o s s i b i l i t y o f such damage .
//
//M/
43
44
45
46
47
48
49
#include
#include
#include
#include
#include
#include
i n t e r n a l s h a r e d . hpp
opencv2 / gpu / d e v i c e / s a t u r a t e c a s t . hpp
opencv2 / gpu / d e v i c e / vec math . hpp
opencv2 / gpu / d e v i c e / l i m i t s . hpp
opencv2 / gpu / d e v i c e / b o r d e r i n t e r p o l a t e . hpp
opencv2 / gpu / d e v i c e / s t a t i c c h e c k . hpp
50
51
52
53
54
55
56
57
constant
58
59
60
61
62
void l o a d K e r n e l ( const f l o a t k e r n e l [ ] , i n t k s i z e )
{
c u d a S a f e C a l l ( cudaMemcpyToSymbol ( c k e r n e l , k e r n e l , k s i z e s i z e o f ( f l o a t )
) );
}
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
shared
sum t smem [ BLOCK DIM Y ] [ ( PATCH PER BLOCK + 2 HALO SIZE )
BLOCK DIM X ] ;
82
83
84
131
i f ( y >= s r c . rows )
return ;
85
86
87
const T s r c r o w = s r c . p t r ( y ) ;
88
89
90
91
// Load l e f t h a l o
#pragma u n r o l l
f o r ( i n t j = 0 ; j < HALO SIZE ; ++j )
smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + j BLOCK DIM X ] = s a t u r a t e c a s t <
sum t >(brd . a t l o w ( x S t a r t ( HALO SIZE j ) BLOCK DIM X,
src row ) ) ;
92
93
94
95
96
// Load main d a t a
#pragma u n r o l l
f o r ( i n t j = 0 ; j < PATCH PER BLOCK ; ++j )
smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + HALO SIZE BLOCK DIM X + j
BLOCK DIM X ] = s a t u r a t e c a s t <sum t >(brd . a t h i g h ( x S t a r t + j
BLOCK DIM X, s r c r o w ) ) ;
97
98
99
100
101
// Load r i g h t h a l o
#pragma u n r o l l
f o r ( i n t j = 0 ; j < HALO SIZE ; ++j )
smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + (PATCH PER BLOCK + HALO SIZE )
BLOCK DIM X + j BLOCK DIM X ] = s a t u r a t e c a s t <sum t >(brd .
a t h i g h ( x S t a r t + (PATCH PER BLOCK + j ) BLOCK DIM X, s r c r o w ) ) ;
102
103
104
105
106
syncthreads () ;
107
108
#pragma u n r o l l
f o r ( i n t j = 0 ; j < PATCH PER BLOCK ; ++j )
{
const i n t x = x S t a r t + j BLOCK DIM X ;
109
110
111
112
113
114
115
116
117
#pragma u n r o l l
f o r ( i n t k = 0 ; k < KSIZE ; ++k )
sum = sum + smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + HALO SIZE
BLOCK DIM X + j BLOCK DIM X an cho r + k ] c k e r n e l [ k
];
118
119
120
121
d s t ( y , x ) = s a t u r a t e c a s t <D>(sum ) ;
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
i f ( c c >= 2 0 )
{
BLOCK DIM X = 3 2 ;
BLOCK DIM Y = 8 ;
132
138
}
else
{
139
140
141
BLOCK DIM X = 3 2 ;
BLOCK DIM Y = 4 ;
PATCH PER BLOCK = 4 ;
142
143
144
145
146
147
148
149
B<T> brd ( s r c . c o l s ) ;
150
151
152
153
154
i f ( stream == 0 )
cudaSafeCall ( cudaDeviceSynchronize ( ) ) ;
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
s t a t i c const c a l l e r t c a l l e r s [ 5 ] [ 3 3 ] =
{
{
0,
l i n e a r R o w F i l t e r c a l l e r < 1 , T, D,
l i n e a r R o w F i l t e r c a l l e r < 2 , T, D,
l i n e a r R o w F i l t e r c a l l e r < 3 , T, D,
l i n e a r R o w F i l t e r c a l l e r < 4 , T, D,
l i n e a r R o w F i l t e r c a l l e r < 5 , T, D,
l i n e a r R o w F i l t e r c a l l e r < 6 , T, D,
l i n e a r R o w F i l t e r c a l l e r < 7 , T, D,
l i n e a r R o w F i l t e r c a l l e r < 8 , T, D,
l i n e a r R o w F i l t e r c a l l e r < 9 , T, D,
l i n e a r R o w F i l t e r c a l l e r <10 , T, D,
l i n e a r R o w F i l t e r c a l l e r <11 , T, D,
l i n e a r R o w F i l t e r c a l l e r <12 , T, D,
l i n e a r R o w F i l t e r c a l l e r <13 , T, D,
l i n e a r R o w F i l t e r c a l l e r <14 , T, D,
l i n e a r R o w F i l t e r c a l l e r <15 , T, D,
l i n e a r R o w F i l t e r c a l l e r <16 , T, D,
l i n e a r R o w F i l t e r c a l l e r <17 , T, D,
l i n e a r R o w F i l t e r c a l l e r <18 , T, D,
l i n e a r R o w F i l t e r c a l l e r <19 , T, D,
l i n e a r R o w F i l t e r c a l l e r <20 , T, D,
l i n e a r R o w F i l t e r c a l l e r <21 , T, D,
l i n e a r R o w F i l t e r c a l l e r <22 , T, D,
l i n e a r R o w F i l t e r c a l l e r <23 , T, D,
l i n e a r R o w F i l t e r c a l l e r <24 , T, D,
l i n e a r R o w F i l t e r c a l l e r <25 , T, D,
l i n e a r R o w F i l t e r c a l l e r <26 , T, D,
l i n e a r R o w F i l t e r c a l l e r <27 , T, D,
l i n e a r R o w F i l t e r c a l l e r <28 , T, D,
l i n e a r R o w F i l t e r c a l l e r <29 , T, D,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
197
198
199
200
201
},
{
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
133
0,
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
caller < 1 ,
caller < 2 ,
caller < 3 ,
caller < 4 ,
caller < 5 ,
caller < 6 ,
caller < 7 ,
caller < 8 ,
caller < 9 ,
c a l l e r <10 ,
c a l l e r <11 ,
c a l l e r <12 ,
c a l l e r <13 ,
c a l l e r <14 ,
c a l l e r <15 ,
c a l l e r <16 ,
c a l l e r <17 ,
c a l l e r <18 ,
c a l l e r <19 ,
c a l l e r <20 ,
c a l l e r <21 ,
c a l l e r <22 ,
c a l l e r <23 ,
c a l l e r <24 ,
c a l l e r <25 ,
c a l l e r <26 ,
c a l l e r <27 ,
c a l l e r <28 ,
c a l l e r <29 ,
c a l l e r <30 ,
c a l l e r <31 ,
c a l l e r <32 ,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate>
0,
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
caller < 1 ,
caller < 2 ,
caller < 3 ,
caller < 4 ,
caller < 5 ,
caller < 6 ,
caller < 7 ,
caller < 8 ,
caller < 9 ,
c a l l e r <10 ,
c a l l e r <11 ,
c a l l e r <12 ,
c a l l e r <13 ,
c a l l e r <14 ,
c a l l e r <15 ,
c a l l e r <16 ,
c a l l e r <17 ,
c a l l e r <18 ,
c a l l e r <19 ,
c a l l e r <20 ,
c a l l e r <21 ,
c a l l e r <22 ,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
},
{
260
261
262
263
264
265
266
267
268
269
270
271
272
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
<23 ,
<24 ,
<25 ,
<26 ,
<27 ,
<28 ,
<29 ,
<30 ,
<31 ,
<32 ,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant>
0,
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
caller < 1 ,
caller < 2 ,
caller < 3 ,
caller < 4 ,
caller < 5 ,
caller < 6 ,
caller < 7 ,
caller < 8 ,
caller < 9 ,
c a l l e r <10 ,
c a l l e r <11 ,
c a l l e r <12 ,
c a l l e r <13 ,
c a l l e r <14 ,
c a l l e r <15 ,
c a l l e r <16 ,
c a l l e r <17 ,
c a l l e r <18 ,
c a l l e r <19 ,
c a l l e r <20 ,
c a l l e r <21 ,
c a l l e r <22 ,
c a l l e r <23 ,
c a l l e r <24 ,
c a l l e r <25 ,
c a l l e r <26 ,
c a l l e r <27 ,
c a l l e r <28 ,
c a l l e r <29 ,
c a l l e r <30 ,
c a l l e r <31 ,
c a l l e r <32 ,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect>
0,
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
caller < 1 ,
caller < 2 ,
caller < 3 ,
caller < 4 ,
caller < 5 ,
caller < 6 ,
caller < 7 ,
caller < 8 ,
caller < 9 ,
c a l l e r <10 ,
c a l l e r <11 ,
c a l l e r <12 ,
c a l l e r <13 ,
c a l l e r <14 ,
c a l l e r <15 ,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
},
{
273
306
134
},
{
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
<16 ,
<17 ,
<18 ,
<19 ,
<20 ,
<21 ,
<22 ,
<23 ,
<24 ,
<25 ,
<26 ,
<27 ,
<28 ,
<29 ,
<30 ,
<31 ,
<32 ,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
135
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>
340
};
341
342
loadKernel ( kernel , k s i z e ) ;
343
344
345
346
347
348
349
350
351
352
353
354
Appendix F
OpenCV Specialized Kernel Source
The following source listing is a hypothetical version of the same OpenCV GPU
kernel in Appendix E. A detailed explanation is provided in Section 4.2.
#include
#include
#include
#include
#include
#include
i n t e r n a l s h a r e d . hpp
opencv2 / gpu / d e v i c e / s a t u r a t e c a s t . hpp
opencv2 / gpu / d e v i c e / vec math . hpp
opencv2 / gpu / d e v i c e / l i m i t s . hpp
opencv2 / gpu / d e v i c e / b o r d e r i n t e r p o l a t e . hpp
opencv2 / gpu / d e v i c e / s t a t i c c h e c k . hpp
7
8
9
10
11
12
13
14
15
16
17
void l o a d K e r n e l ( const f l o a t k e r n e l [ ] )
{
c u d a S a f e C a l l ( cudaMemcpyToSymbol ( c k e r n e l , k e r n e l , KSIZE s i z e o f ( f l o a t )
) );
}
18
19
20
21
22
23
24
shared
sum t smem [ BLOCK DIM Y ] [ ( PATCH PER BLOCK + 2 HALO SIZE )
BLOCK DIM X ] ;
25
26
27
28
29
i f ( y >= s r c . rows )
return ;
30
136
137
const T s r c r o w = s r c . p t r ( y ) ;
31
32
33
34
// Load l e f t h a l o
#pragma u n r o l l
f o r ( i n t j = 0 ; j < HALO SIZE ; ++j )
smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + j BLOCK DIM X ] = s a t u r a t e c a s t <
sum t >(brd . a t l o w ( x S t a r t ( HALO SIZE j ) BLOCK DIM X,
src row ) ) ;
35
36
37
38
39
// Load main d a t a
#pragma u n r o l l
f o r ( i n t j = 0 ; j < PATCH PER BLOCK ; ++j )
smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + HALO SIZE BLOCK DIM X + j
BLOCK DIM X ] = s a t u r a t e c a s t <sum t >(brd . a t h i g h ( x S t a r t + j
BLOCK DIM X, s r c r o w ) ) ;
40
41
42
43
44
// Load r i g h t h a l o
#pragma u n r o l l
f o r ( i n t j = 0 ; j < HALO SIZE ; ++j )
smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + (PATCH PER BLOCK + HALO SIZE )
BLOCK DIM X + j BLOCK DIM X ] = s a t u r a t e c a s t <sum t >(brd .
a t h i g h ( x S t a r t + (PATCH PER BLOCK + j ) BLOCK DIM X, s r c r o w ) ) ;
45
46
47
48
49
syncthreads () ;
50
51
#pragma u n r o l l
f o r ( i n t j = 0 ; j < PATCH PER BLOCK ; ++j )
{
const i n t x = x S t a r t + j BLOCK DIM X ;
52
53
54
55
56
57
58
59
60
#pragma u n r o l l
f o r ( i n t k = 0 ; k < KSIZE ; ++k )
sum = sum + smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + HALO SIZE
BLOCK DIM X + j BLOCK DIM X ANCHOR + k ] c k e r n e l [ k
];
61
62
63
64
d s t ( y , x ) = s a t u r a t e c a s t <D>(sum ) ;
65
66
67
68
69
70
}
} // namespace r o w f i l t e r
}}} // namespace cv { namespace gpu { namespace d e v i c e
71
72
73
74
75
extern C {
void l i n e a r R o w F i l t e r c a l l e r (DevMem2D <T TYPENAME> s r c , DevMem2D <D TYPENAME> dst
, const f l o a t k e r n e l [ ] , c u d a S t r e a m t stream )
{
using namespace cv : : gpu : : d e v i c e : : r o w f i l t e r ;
76
77
loadKernel ( kernel ) ;
78
79
80
81
82
138
83
84
85
86
i f ( stream == 0 )
cudaSafeCall ( cudaDeviceSynchronize ( ) ) ;
87
88
89
90
}
} // e x t e r n C
Listing F.2: Modified OpenCV CUDA example: this portion is compiled into the
host program
1
2
3
4
5
6
#include
#include
#include
#include
#include
#include
i n t e r n a l s h a r e d . hpp
opencv2 / gpu / d e v i c e / s a t u r a t e c a s t . hpp
opencv2 / gpu / d e v i c e / vec math . hpp
opencv2 / gpu / d e v i c e / l i m i t s . hpp
opencv2 / gpu / d e v i c e / b o r d e r i n t e r p o l a t e . hpp
opencv2 / gpu / d e v i c e / s t a t i c c h e c k . hpp
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
i n t BLOCK DIM X ;
i n t BLOCK DIM Y ;
i n t PATCH PER BLOCK ;
28
29
30
31
32
33
34
35
36
37
38
39
40
i f ( c c >= 2 0 )
{
BLOCK DIM X = 3 2 ;
BLOCK DIM Y = 8 ;
PATCH PER BLOCK = 4 ;
}
else
{
BLOCK DIM X = 3 2 ;
BLOCK DIM Y = 4 ;
PATCH PER BLOCK = 4 ;
}
41
42
43
44
45
139
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
template void l i n e a r R o w F i l t e r
dst , const f l o a t k e r n e l [ ]
c u d a S t r e a m t stream ) ;
template void l i n e a r R o w F i l t e r
dst , const f l o a t k e r n e l [ ]
c u d a S t r e a m t stream ) ;
template void l i n e a r R o w F i l t e r
dst , const f l o a t k e r n e l [ ]
c u d a S t r e a m t stream ) ;
template void l i n e a r R o w F i l t e r
dst , const f l o a t k e r n e l [ ]
c u d a S t r e a m t stream ) ;
template void l i n e a r R o w F i l t e r
dst , const f l o a t k e r n e l [ ]
c u d a S t r e a m t stream ) ;
64
65
66
} // namespace r o w f i l t e r
}}} // namespace cv { namespace gpu { namespace d e v i c e
Appendix G
Sample GPU-PF Log Output
The following output is a highly abridged example of the output provided by the GPU
Prototyping Framework. The particular output show below is from the template
matching application (see Section 5.1) run on the data set for Patient 1.
Before execution of the first pipeline iteration new parameter values are registered
and propagated to resources and actions, as shown in Listing G.1. Some one-time
events, such as GPU memory allocation or transferring static data to the GPU also
occur.
0 ] Updates :
0]
Zero Param Update
0]
0
0]
Data Type Update
0]
single
0]
Per Frame S c h e d u l e Update
0]
P e r i o d 1 1 ; Delay 0
0]
Per Template S c h e d u l e Update
0]
P e r i o d 0 ; Delay 0
0]
Per Frame End S c h e d u l e Update
0]
P e r i o d 1 1 ; Delay 11
0]
S h i f t S i z e X Update
0]
37
0]
S h i f t Area Update
0]
703
0]
Template Area Update
0]
2862.000000
0]
Template Dimensions Update
140
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
0]
[ 5 3 54 1 2 ]
0]
Template Dimensions Z Update
0]
One Time S c h e d u l e Update
0]
P e r i o d 1; Delay 0
0]
Frame Data Host Update
0]
I n i t i a l [ 0 , 0 , 0 ] to [ 8 8 , 71 , 0 ]
0]
Stride [0 , 0 , 1]
0]
Frame Data D e v i c e Extent Update
0]
[ 8 9 , 72 , 1 ] 4 (25632 bytes )
0]
Frame Data Host Extent Update
0]
[ 8 9 , 72 , 442] 4 (11329344 bytes )
0]
Template Data Extent Update
0]
[ 5 3 , 54 , 12] 4 (137376 bytes )
0]
Template Denominator Extent Update
0]
[ 1 2 , 1 , 1 ] 4 (48 bytes )
0]
Frame A v e r a g e s Extent Update
0]
[ 3 7 , 19 , 1 ] 4 (2812 bytes )
0]
Frame Denominator Extent Update
0]
[ 3 7 , 19 , 1 ] 4 (2812 bytes )
0]
Numerator Extent Update
0]
[ 7 0 3 , 12 , 1 ] 4 (33744 bytes )
0]
F i n a l M u l t i p l i c a t i o n Extent Update
0]
[ 7 0 3 , 12 , 1 ] 4 (33744 bytes )
0]
G l o b a l P i t c h e d Frame Data Update
0]
G l o b a l P i t c h e d Frame Data A l l o c a t i o n
0]
A l l o c a t e d [ 8 9 , 7 2 , 1 ] 4 , ( p512 )
0]
Memory Mappped Host Frame Data Update
0]
Memory Mappped Host Frame Data A l l o c a t i o n
0]
[ 8 9 , 72 , 442] 4
0]
Update Frame Data S u b s e t Update
0]
[ 8 9 , 72 , 1 ] 4
0]
S t a r t i n g at [ 0 , 0 , 0 ]
0]
25632 b y t e s s t r i d e
0]
Per Frame S c h e d u l e Exe Group Update
0]
Frame Data S u b s e t Extent Update
0]
[ 8 9 , 72 , 1 ] 4 (25632 bytes )
0]
Memory Copy from Host Frame Data S u b s e t t o G l o b a l P i t c h e d Frame Data
Update
0]
Per Frame S c h e d u l e Exe Group Update
0]
G l o b a l P i t c h e d Template Data Update
0]
G l o b a l P i t c h e d Template Data A l l o c a t i o n
0]
A l l o c a t e d [ 5 3 , 5 4 , 1 2 ] 4 , ( p512 )
0]
Memory Mappped Host Template Data Update
0]
Memory Mappped Host Template Data A l l o c a t i o n
0]
[ 5 3 , 54 , 12] 4
0]
Memory Copy from Host Template Data t o G l o b a l P i t c h e d Template Data
Update
0]
Copied [ 5 3 , 5 4 , 1 2 ] 4 ( 1 3 7 3 7 6 b y t e s )
0]
from Template Data <0x 7 f 4 6 4 9 4 d e 0 0 0 >
0]
t o Template Data <0x200300000>
0]
G l o b a l L i n e a r Template Denominator Update
0]
G l o b a l L i n e a r Template Denominator A l l o c a t i o n
0]
Allocated [12 , 1 , 1] 4
0]
Memory Mappped Host Template Denominator Update
0]
Memory Mappped Host Template Denominator A l l o c a t i o n
0]
[12 , 1 , 1] 4
0]
Memory Copy from Host Template Denominator t o G l o b a l L i n e a r Template
Denominator Update
0]
Copied [ 1 2 , 1 , 1 ] 4 ( 4 8 b y t e s )
0]
from Template Denominator <0x7f46494dd000>
0]
t o Template Denominator <0x200400000>
0]
G l o b a l P i t c h e d Frame A v e r a g e s Update
0]
G l o b a l P i t c h e d Frame A v e r a g e s A l l o c a t i o n
141
[
[
[
[
[
[
[
[
[
[
[
[
[
0]
0]
0]
0]
0]
0]
0]
0]
0]
0]
0]
0]
0]
142
A l l o c a t e d [ 3 7 , 1 9 , 1 ] 4 , ( p512 )
G l o b a l L i n e a r Frame Denominator Update
G l o b a l L i n e a r Frame Denominator A l l o c a t i o n
Allocated [ 3 7 , 19 , 1 ] 4
G l o b a l P i t c h e d Numerator Update
G l o b a l P i t c h e d Numerator A l l o c a t i o n
A l l o c a t e d [ 7 0 3 , 1 2 , 1 ] 4 , ( p3072 )
G l o b a l P i t c h e d F i n a l M u l t i p l i c a t i o n Update
Global Pitched Final M u l t i p l i c a t i o n A l l o c a t i o n
A l l o c a t e d [ 7 0 3 , 1 2 , 1 ] 4 , ( p3072 )
PageLocked Host F i n a l M u l t i p l i c a t i o n Update
PageLocked Host F i n a l M u l t i p l i c a t i o n A l l o c a t i o n
[ 7 0 3 , 12 , 1 ] 4
Still within the initial application refresh, the log segment in Listing G.2 shows
the compilation and loading of the numerator stage kernels.
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
0]
Numerator P a r t s Update
0]
Numerator P a r t s A l l o c a t i o n
0]
Using system ( ) t o c a l l : nvcc c u b i n keep keepdir /home/ nmoore /
gpup f / b i n d e v i c e debug 0 o /home/ nmoore /gpup f / b i n /
c o r r 2 T i l e d N u m e r a t o r P a r t s C o m b i n e d 0 1 6 1 2 2 5 3 0 4 4 5 2 7 6 3 7 . debug . c u b i n I /home/ nmoore /gpu
p f /gpui n c l u d e DGPU DEBUG gpua r c h i t e c t u r e compute 20 gpucode sm 20
DREGULAR TILE SIZE X=16 DREGULAR TILE SIZE Y=2 DBOTTOM TILE SIZE X=5
DRIGHT TILE SIZE Y=0 DGRID DIM X=4 DGRID DIM Y=27 DSHIFT X=37 /home/ nmoore /
gpup f / apps /tm/ k e r n e l s / c o r r 2 T i l e d N u m e r a t o r P a r t s C o m b i n e d . cu
0]
Numerator R e d u c t i o n Update
0]
Numerator R e d u c t i o n A l l o c a t i o n
0]
Using system ( ) t o c a l l : nvcc c u b i n keep keepdir /home/ nmoore /
gpup f / b i n d e v i c e debug 0 o /home/ nmoore /gpup f / b i n / c o r r 2 T i l e d S u m s P i t c h e d 0 1 0 8
. debug . c u b i n I /home/ nmoore /gpup f /gpui n c l u d e DGPU DEBUG gpua r c h i t e c t u r e
compute 20 gpucode sm 20 DSUMS UNROLL COUNT=108 /home/ nmoore /gpup f / apps /tm/
k e r n e l s / c o r r 2 T i l e d S u m s P i t c h e d . cu
0]
G l o b a l P i t c h e d Numerator P a r t s Update
0]
G l o b a l P i t c h e d Numerator P a r t s A l l o c a t i o n
0]
A l l o c a t e d [ 7 0 3 , 1 0 8 , 1 2 ] 4 , ( p3072 )
0]
Numerator P a r t s S u b s e t I n f o Update
0]
I n i t i a l [ 0 , 0 , 0 ] to [ 7 0 2 , 107 , 0 ]
0]
Stride [0 , 0 , 1]
0]
Update Numerator P a r t s S u b s e t Update
0]
[ 7 0 3 , 108 , 1 ] 4
0]
S t a r t i n g at [ 0 , 0 , 0 ]
0]
331776 b y t e s s t r i d e
0]
Per Template S c h e d u l e Exe Group Update
0]
Numerator P a r t s S u b s e t Extent Update
0]
[ 7 0 3 , 108 , 1 ] 4 (303696 bytes )
0]
Numerator P a r t s Update
0]
27 r e g i s t e r s p e r t h r e a d
0]
Numerator R e d u c t i o n Update
0]
23 r e g i s t e r s p e r t h r e a d
0]
Bind t e x t u r e frameDataTex Update
0]
from module Numerator P a r t s
0]
Bound t o global Frame Data
143
The log segment in Listing G.3 shows a single pipeline iteration. The iteration
shown is a the last iteration for the current frame, resulting in a final reduction and
file I/O, as described in Section 5.1.3. The frame averages and denominator are not
executed as they run only once per frame while the numerator executes for each
template. This log segment is itself abridged, with the missing portion marked by a
string of periods.
[
[
[
[
[
[
[
[
[
[
[
[
47]
47]
47]
47]
47]
47]
47]
47]
47]
47]
47]
47]
( 7 4 1 1 ) Block
1048 b y t e s SM
9 K e r n e l Arguments :
P o i n t e r : Numerator P a r t s S u b s e t ( 0 x200b7b000 )
P i t c h : Numerator P a r t s S u b s e t ( 3 0 7 2 )
Integer : 2
Integer : 2
I n t e g e r : 16
Integer : 2
Integer : 5
Integer : 0
I n t e g e r : 37
............................................................
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
144
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
145
47]
6 K e r n e l Arguments :
47]
P o i n t e r : Numerator ( 0 x200209000 )
47]
P o i n t e r : Frame Denominator ( 0 x200600000 )
47]
P o i n t e r : F i n a l M u l t i p l i c a t i o n ( 0 x200212000 )
47]
P i t c h : Numerator ( 3 0 7 2 )
47]
Pitch : Final M u l t i p l i c a t i o n (3072)
47]
I n t e g e r : 703
4 7 ] Execute : Data P u l l S t a g e
47]
+Per Frame End S c h e d u l e Exe Group
47]
Memory Copy from G l o b a l P i t c h e d F i n a l M u l t i p l i c a t i o n t o Host Pinned
Final Multiplication
47]
Copied [ 7 0 3 , 1 2 , 1 ] 4 ( 3 3 7 4 4 b y t e s )
47]
from F i n a l M u l t i p l i c a t i o n <0x200212000>
47]
t o F i n a l M u l t i p l i c a t i o n <0x200700000>
4 7 ] Execute : Post E x e c u t i o n S t a g e
47]
+Per Frame End S c h e d u l e Exe Group
47]
Write f i l e /home/ nmoore /gpup f / b i n /gpuOut . b i n
Listing G.4 shows a log segment containing GPU-PF timing output, which includes the number of times and average time for each operation execution. This
segment is illustrative of the typical level of overhead resulting in granular timing.
The item titled Frame Averages Full Shift Subqueue is a GPU-PF abstraction that
wraps the four Frame Averages kernels that follow in a single object. In this run,
only main and bottom tiles existed. This produces fine-grained results for the two
kernels, as each launch is timed individually, but the cumulative overhead can be
seen in the amount of time it takes to execute the whole wrapper object. The same
effect can be seen in the stage summary, which lists a cumulative time obtained by
summing the averages, and a total time, which measures the real end-to-end stage
time.
146
Cumulative O p e r a t i o n Time : 0 . 0 0 1 ms
Avg T o t a l Queue Time : 0 . 0 0 1 ms
Timing f o r queue E x e c u t i o n S t a g e ( 5 3 0 4 e x e c u t i o n s ) :
Frame A v e r a g e s F u l l S h i f t Subqueue : 442 e v e n t s 0 . 2 5 2 ms
Frame A v e r a g e s Main P a r t s k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 0 8 ms
Frame A v e r a g e s Bottom P a r t s k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 0 7 ms
Frame A v e r a g e s Right P a r t s k e r n e l e x e c u t i o n : 0 e v e n t s 0 . 0 0 0 ms
Frame A v e r a g e s Corner P a r t s k e r n e l e x e c u t i o n : 0 e v e n t s 0 . 0 0 0 ms
Frame A v e r a g e s R e d u c t i o n k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 1 5 ms
Frame Denominator F u l l S h i f t Subqueue : 442 e v e n t s 0 . 2 5 0 ms
Frame Denominator Main P a r t s k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 0 8 ms
Frame Denominator Bottom P a r t s k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 0 7 ms
Frame Denominator Right P a r t s k e r n e l e x e c u t i o n : 0 e v e n t s 0 . 0 0 0 ms
Frame Denominator Corner P a r t s k e r n e l e x e c u t i o n : 0 e v e n t s 0 . 0 0 0 ms
Frame Denominator R e d u c t i o n k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 1 4 ms
Numerator F u l l S h i f t Subqueue : 5304 e v e n t s 0 . 1 6 2 ms
Numerator P a r t s k e r n e l e x e c u t i o n : 5304 e v e n t s 0 . 0 1 0 ms
Numerator R e d u c t i o n k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 4 0 ms
F i n a l M u l t i p l y k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 0 6 ms
Cumulative O p e r a t i o n Time : 0 . 2 2 2 ms
Avg T o t a l Queue Time : 0 . 4 2 1 ms
147
After, the Component Times provides some more granular information about
application overheads not associated with executing the application pipeline. The
Application Initialization Time includes CUDA library initialization overhead,
which can be quite large. Options Processing refers to parsing the configuration file for the application, Implementation Build refers to the application using
the GPU-PF API to construct the representation of the application, and Implementation Update refers to the argument setup and memory and other resource
allocation phase that takes place before iterative execution. The Implementation
Update Time includes kernel compilation. For this run, individual operation timing
was disabled.
Listing G.6, shows GPU-PF log output for the same application as Examples A in
Listings G.5. The application was rerun without clearing the compiled GPU binary
cache so that the CUDA kernels did not have to be recompiled. The difference in
Implementation Update Time between Example A and B indicates the overhead
incurred by kernel compilation.
148
Listing G.7, shows GPU-PF log output for the same application as Examples
A and B in Listings G.5 and G.6. The application was rerun without individual
operation timing and without clearing the compiled GPU binary cache so that the
CUDA kernels did not have to be recompiled. Disable individual operation timing
removes nearly all of the timing overhead, significantly reducing overall application
run times.
Bibliography
[1] AMD Accelerated Parllel Processing (APP) SDK Website. http://developer.
amd.com/gpu/AMDAPPSDK/Pages/default.aspx, last accessed 1 April 2011.
[2] M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. A compiler framework for optimization of affine loop
nests for GPGPUs. In ICS 08: Proceedings of the 22nd annual international
conference on Supercomputing, pages 225234, New York, NY, USA, 2008. ACM.
[3] M. Bauer, H. Cook, and B. Khailany. CudaDMA: optimizing GPU memory
bandwidth via warp specialization. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis,
SC 11, pages 12:112:11, New York, NY, USA, 2011. ACM.
[4] A. Bennis. Implementing a Highly Parameterized Digital PIV System On Reconfigurable Hardware. PhD thesis, Northeastern University, 2010.
[5] A. Bennis, M. Leeser, and G. Tadmor. The effect of parameterization on a
reconfigurable implementation of PIV. In ERSA 09: Proceedings of the 2009
149
BIBLIOGRAPHY
150
BIBLIOGRAPHY
151
http://developer.nvidia.com/object/cuda
BIBLIOGRAPHY
152
BIBLIOGRAPHY
153
Computation on Graphics Processing Units, pages 1930, New York, NY, USA,
2010. ACM.
[22] D. Grewe and A. Lokhmotov. Automatically generating and tuning GPU code
for sparse matrix-vector multiplication from a high-level representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics
Processing Units, GPGPU-4, pages 12:112:8, New York, NY, USA, 2011. ACM.
[23] L. Gui and W. Merzkirch. A comparative study of the MQD method and several
correlation-based PIV evaluation algorithms. Experiments in Fluids, 28:3644,
2000. 10.1007/s003480050005.
[24] M.
Harris.
Corporation,
Santa Clara,
NVIDIA
CA 95050.
BIBLIOGRAPHY
154
[27] A. Klockner, N. Pinto, Y. Lee, B. C. Catanzaro, P. Ivanov, and A. Fasih. PyCUDA: GPU run-time code generation for high-performance computing. CoRR,
abs/0911.3456, 2009. http://arxiv.org/abs/0911.3456, last accessed 24 May
2012.
[28] S. Lee and R. Eigenmann. OpenMPC: Extended OpenMP programming and
tuning for GPUs. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC
10, pages 111, Washington, DC, USA, 2010. IEEE Computer Society.
[29] A. Leung, O. Lhotak, and G. Lashari. Automatic parallelization for graphics
processing units. In PPPJ 09: Proceedings of the 7th International Conference
on Principles and Practice of Programming in Java, pages 91100, New York,
NY, USA, 2009. ACM.
[30] B. G. Levine, J. E. Stone, and A. Kohlmeyer. Fast analysis of molecular dynamics trajectories with graphics processing unitsradial distribution function
histogramming. Journal of Computational Physics, 230(9), 2011.
[31] J. C. Linford, J. Michalakes, M. Vachharajani, and A. Sandu. Multi-core acceleration of chemical kinetics for simulation and prediction. In SC 09: Proceedings
of the Conference on High Performance Computing Networking, Storage and
Analysis, pages 111, New York, NY, USA, 2009. ACM.
BIBLIOGRAPHY
155
[32] Y. Liu, E. Z. Zhang, and X. Shen. A cross-input adaptive framework for GPU
program optimizations. In IPDPS 2009: IEEE International Symposium on
Parallel & Distributed Processing, 2009, pages 110, May 2009.
[33] G. Mainland and G. Morrisett. Nikola: embedding compiled GPU functions in
Haskell. In Proceedings of the third ACM Haskell symposium on Haskell, Haskell
10, pages 6778, New York, NY, USA, 2010. ACM.
[34] MathWorks Parallel Computing Toolbox.
http://www.mathworks.com/
http://
BIBLIOGRAPHY
156
Available athttp://code.opencv.org/projects/opencv/
repository/changes/trunk/opencv/modules/gpu/src/cuda/row filter.cu,
revi-
BIBLIOGRAPHY
157
http://www.coe.neu.edu/
BIBLIOGRAPHY
[50] Siemens
Inveon
158
Platform
Website.
http://www.medical.siemens.com/
webapp/wcs/stores/servlet/CategoryDisplay?catalogId=-1&categoryId=
1029721&catTree=100010,1007660,1011525,1029715,1029721&langId=
-1&storeId=10001, last accessed 21 March 2012.
[51] A. Stivala, P. Stuckey, and A. Wirth. Fast and accurate protein substructure
searching with simulated annealing and GPUs. BMC Bioinformatics, 11(1):446,
2010.
[52] J. E. Stone, J. Saam, D. J. Hardy, K. L. Vandivort, W.-m. W. Hwu, and K. Schulten. High performance computation and interactive display of molecular orbitals
on GPUs and multi-core CPUs. In GPGPU-2: Proceedings of 2nd Workshop
on General Purpose Processing on Graphics Processing Units, pages 918, New
York, NY, USA, 2009. ACM.
[53] G. Tan, L. Li, S. Triechle, E. Phillips, Y. Bao, and N. Sun. Fast implementation
of DGEMM on Fermi GPU. In Proceedings of 2011 International Conference for
High Performance Computing, Networking, Storage and Analysis, SC 11, pages
35:135:11, New York, NY, USA, 2011. ACM.
[54] V. Volkov. Better performance at lower occupancy. In GPU Technology Conference (GTC), 2010.
BIBLIOGRAPHY
159
[55] V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra.
In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC 08,
pages 31:131:11, Piscataway, NJ, USA, 2008. IEEE Press.
[56] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying gpu microarchitecture through microbenchmarking. In Performance
Analysis of Systems Software (ISPASS), 2010 IEEE International Symposium
on, pages 235 246, march 2010.
[57] H. Yu, M. Leeser, G. Tadmor, and S. Siegel. Real-time particle image velocimetry
for feedback loops using fpga implementation. In AIAA 06: Proceedings of the
43rd American Institute of Aeronautics and Astronautics Aerospace Sciences
Meeting and Exhibit, 2005.
[58] Y. Zhang and F. Mueller. Auto-generation and auto-tuning of 3D stencil codes
on GPU clusters. In International Symposium on Code Generation and Optimization (CGO), 2012.