GPUs

Kernel Specialization for Improved Adaptability and Performance on
Graphics Processing Units (GPUs)
A Dissertation Presented
by
Nicholas John Moore
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the field of
Computer Engineering
Northeastern University
Boston, Massachusetts
June 2012
c Copyright 2012 by Nicholas Moore

All Rights Reserved
Abstract
Graphics processing units (GPUs) offer significant speedups over CPUs for certain
classes of applications. However, maximizing GPU performance can be a difficult
task due to the relatively high programming complexity as well as frequent hardware
changes. Important performance optimizations are applied by the GPU compiler
ahead of time and require fixed parameter values at compile time. As a result, many
GPU codes offer minimum levels of adaptability to variations among problem instances and hardware configurations. These factors limit code reuse and the applicability of GPU computing to a wider variety of problems. This dissertation introduces
GPGPU kernel specialization, a technique that can be used to describe highly adaptable kernels that work across different generations of GPUs with high performance.
With kernel specialization, customized GPU kernels incorporating both problemand implementation-specific parameters are compiled for each problem and hardware instance combination. This dissertation explores the implementation and parameterization of three real world applications targeting two generations of NVIDIA
CUDA-enabled GPUs and utilizing kernel specialization: large template matching,
particle image velocimetry, and cone-beam image reconstruction via backprojection.
iii
Starting with high performance adaptable GPU kernels that compare favorably to
multi-threaded and FPGA-based reference implementations, kernel specialization is
shown to maintain adaptability while providing performance improvements in terms
of speedups and reduction in per-thread register usage. The proposed technique offers productivity benefits, the ability to adjust parameters that otherwise must be
static, and a means to increase the complexity and parameterizability of GPGPU
implementations beyond what would otherwise be feasible on current GPUs.
iv
Acknowledgements
I would like to thank Professor Leeser for her guidance and patience over the past several years. MathWorks generously supported my studies. Together with supervision
by James Lebak, this assistance significantly enhanced the quality of my education.
Completing this work would not have been possible without the support of Catherine,
my family, and my friends.
Contents
1 Introduction
2 Background
2.1
NVIDIA CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2
In-Block Reductions . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3
Recent CUDA Techniques . . . . . . . . . . . . . . . . . . . . . . . .
11
2.4
CUDA Development and Adaptability . . . . . . . . . . . . . . . . .
13
2.5
Implementation Parameters and Parameterization . . . . . . . . . . .
17
2.6
OpenCV Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
3 Related Work
22
3.1
Run Time Code Generation . . . . . . . . . . . . . . . . . . . . . . .
23
3.2
Autotuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
27
3.3
Domain Specific Tools . . . . . . . . . . . . . . . . . . . . . . . . . .
29
3.4
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
4 Kernel Specialization
33
vi
4.1
Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
39
4.2
OpenCV Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41
4.3
Trade-Offs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
43
4.4
Implementation and Tools . . . . . . . . . . . . . . . . . . . . . . . .
43
4.4.1
GPU-PF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
45
4.4.2
Validation and Parameterization . . . . . . . . . . . . . . . . .
50
5 Applications
5.1
Large Template Matching . . . . . . . . . . . . . . . . . . . . . . . .
53
5.1.1
Algorithm and Problem Space . . . . . . . . . . . . . . . . . .
54
5.1.2
CUDA Implementation Challenges . . . . . . . . . . . . . . .
57
5.1.3
CUDA Implementation . . . . . . . . . . . . . . . . . . . . . .
58
5.1.3.1
Numerator Stage . . . . . . . . . . . . . . . . . . . .
59
5.1.3.2
Variable Tile Sizes and Kernel Specialization . . . . .
62
5.1.3.3
Other Stages . . . . . . . . . . . . . . . . . . . . . .
64
5.1.3.4
Runtime Operation . . . . . . . . . . . . . . . . . . .
65
CPU Implementations . . . . . . . . . . . . . . . . . . . . . .
66
Particle Image Velocimetry . . . . . . . . . . . . . . . . . . . . . . . .
67
5.2.1
CUDA Implementation . . . . . . . . . . . . . . . . . . . . . .
73
5.2.2
Kernel Specialization . . . . . . . . . . . . . . . . . . . . . . .
78
Cone Beam Backprojection . . . . . . . . . . . . . . . . . . . . . . . .
80
5.3.1
82
5.1.4
5.2
5.3
53
Kernel Specialization . . . . . . . . . . . . . . . . . . . . . . .
vii
5.4
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6 Experiments and Results

6.1
6.2
6.3
82
83
Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . .
83
6.1.1
Hardware and Software Configurations . . . . . . . . . . . . .
85
6.1.2
Problem and Implementation Parameterization . . . . . . . .
85
6.1.2.1
Template Matching . . . . . . . . . . . . . . . . . . .
86
6.1.2.2
PIV . . . . . . . . . . . . . . . . . . . . . . . . . . .
88
6.1.2.3
Cone Beam Back Projection . . . . . . . . . . . . . .
93
Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
94
6.2.1
Comparative Performance . . . . . . . . . . . . . . . . . . . .
94
6.2.2
Kernel Specialization Performance . . . . . . . . . . . . . . . .
97
6.2.2.1
Template Matching . . . . . . . . . . . . . . . . . . .
97
6.2.2.2
PIV . . . . . . . . . . . . . . . . . . . . . . . . . . .
98
6.2.2.3
Cone Beam Backprojection . . . . . . . . . . . . . . 106
Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7 Conclusions and Future Work
117
7.1
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.2
Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

7.2.1
Existing Applications . . . . . . . . . . . . . . . . . . . . . . . 118
7.2.2
GPU-PF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
viii
7.2.3
Kernel Specialization . . . . . . . . . . . . . . . . . . . . . . . 120
A Glossary
122
B Flexibly Specializable Kernel
123
C Sample Run Time Evaluated PTX
125
D Sample Kernel Specialized PTX
127
E OpenCV Kernel Source
129
F OpenCV Specialized Kernel Source
136
G Sample GPU-PF Log Output
140
ix
List of Figures
2.1
A graphical representation of the NVIDIA CUDA-capable GPU architecture realization. This figure appears in [36]. . . . . . . . . . . . . .
2.2
Example of a parallel reduction tree starting with eight elements. . .
10
4.1
Different template tile regions . . . . . . . . . . . . . . . . . . . . . .
48
5.1
are the matrix averages of A and B . . . . . . . . . . . . . .

A and B
55
5.2
is the matrix average of B, AC is the current template with the

B
template average subtracted from each value, and AD is the template
contribution to the denominator. . . . . . . . . . . . . . . . . . . . .
5.3
57
is
The actual functionality implemented by the numerator stage. B
the matrix average of B, and AC is the template with its average value
subtracted.
5.4
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
59
Different template tile regions. . . . . . . . . . . . . . . . . . . . . . .
60
5.5
Graphical data layout of the shift area for a single tile. Regardless of
dimensions, all tiles are applied to the same shift area. A single block
of the summation kernel may only combine a subset, shown in gray,
of the template locations. . . . . . . . . . . . . . . . . . . . . . . . . .
5.6
Each thread accumulates the tile contributions for a singe shift offset,
an example of which is shown in gray. . . . . . . . . . . . . . . . . . .
5.7
61
62
A pseudocode representation of the computation performed by each

CPU thread. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
67
5.8
A graphical depiction of the physical PIV configuraiton. . . . . . . . .
68
5.9
A graphical depiction of the terminology originally used for the FPGA

PIV implementation. This image is from Benniss dissertation [6]. . .
70
5.10 The per-mask offset sum of squared differences similarity score, defined
in terms of the original PIV problem specification as shown in Figure 5.9. 73
5.11 Example of a set of threads striped across a masks area. . . . . . . .
75
5.12 A depiction of the warp specialization used in the PIV kernel to remove
the reduction as a bottleneck. . . . . . . . . . . . . . . . . . . . . . .
77
5.13 A graphical depiction of cone beam scanning setup. . . . . . . . . . .
80
6.1
Contour plots of performance relative to the peak for each of the data
sets in Table 6.4 on the Tesla C1060. The location of peak performance
is marked with a white square. . . . . . . . . . . . . . . . . . . . . . . 115
xi
6.2
Contour plots of performance relative to the peak for each of the data
sets in Table 6.4 on the Tesla C2070. The location of peak performance
is marked with a white square. . . . . . . . . . . . . . . . . . . . . . . 116
xii
List of Tables
2.1
CUDA thread block characteristics by compute capability . . . . . . .
2.2
The amount of register and shared memory available within each
streaming multiprocessor for NVIDIA GPUs of various compute capabilities [38]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11
4.1
Various parameter types provided by GPU-PF. . . . . . . . . . . . .
46
4.2
Various resource types provided by GPU-PF. . . . . . . . . . . . . . .
47
4.3
Various memory types provided by GPU-PF. . . . . . . . . . . . . . .
47
4.4
Various actions provided by GPU-PF. . . . . . . . . . . . . . . . . . .
47
5.1
Per patient, the number of image frames, template number and size,
vertical/horizontal shift within ROI, and number of corr2() calls. . .
5.2
6.1
56
Template tiling examples for the template size associated with Patient
4 (156 116 pixels). . . . . . . . . . . . . . . . . . . . . . . . . . . .
63
Template matching GPU implementation parameters benchmarked. .
87
xiii
6.2
The PIV problem set parameters, in terms of interrogation window

and image dimensions, used for comparing performance of the FPGA
and GPU implementations.
6.3
. . . . . . . . . . . . . . . . . . . . . . .
89
The PIV problem set parameters, in terms of mask and offset counts,
used for comparing performance of the FPGA and GPU implementations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.4
PIV problem set parameters used to test the impact of mask size on
the performance of the GPU implementation. . . . . . . . . . . . . .
6.5
90
PIV problem set parameters used to test the impact of the number of
search offsets on the performance of the GPU implementation. . . . .
6.6
89
91
PIV problem set parameters used to test the impact of interrogation

window overlaps on the performance of the GPU implementation. . .
92
6.7
PIV GPU implementation parameters benchmarked. . . . . . . . . .
93
6.8
Cone beam backprojection GPU problem parameters benchmarked. .
93
6.9
Cone beam backprojection GPU implementation parameters benchmarked. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
94
6.10 Template matching performance results comparing the multi-threaded

C CPU implementation to the best performing CUDA implementation
on two GPUs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
95
6.11 PIV performance results comparing the FPGA implementation to the

best performing CUDA implementation on two GPUs. . . . . . . . .
xiv
96
6.12 Cone beam backprojection results comparing the OpenMP CPU implementation with four threads to the best performing configuration
on both GPUs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
96
6.13 Template matching partial sums: performance and optimal configurations characteristics for the tiled summation kernel. RE stands for
runtime evaluated, and SK stands for specialized kernel. . . . . . . .
99
6.14 PIV GPU performance comparisons for several kernel variants across
the FPGA benchmark set. . . . . . . . . . . . . . . . . . . . . . . . . 101
6.15 PIV GPU performance data for the FPGA benchmark set, including
optimal register blocking and thread counts. . . . . . . . . . . . . . . 102
6.16 PIV GPU performance data for the varying mask size benchmark set,
including optimal register blocking and thread counts. . . . . . . . . . 104
6.17 PIV GPU performance data for the varying search benchmark set,
6.18 PIV GPU performance data for the varying overlap benchmark set,
6.19 Performance comparisons for the backprojection kernels. . . . . . . . 109
6.20 Occupancy and execution data for the C1060 on the V2 data set. . . 110
6.21 Percentage of the peak performance for the template matching application with various fixed main tile sizes and thread counts. . . . . . . 113
xv
6.22 Percentage of the peak performance for the PIV application with various fixed data register counts and thread counts. . . . . . . . . . . . 114
xvi
Listings
4.1
A CUDA C GPU kernel designed to demonstrate the common ways

kernel specialization can improve the performance of kernels. This
kernel is a regular fully run-time evaluated kernel and does not rely
on specialization in any way. . . . . . . . . . . . . . . . . . . . . . . .
4.2
34
A CUDA C GPU kernel designed to demonstrate kernel specialization.

The constants specified by identifiers in all capital letters must be
provided at compile time. . . . . . . . . . . . . . . . . . . . . . . . .
5.1
34
MATLAB function implementing the sliding window correlation for

each template for each ROI within the current frame. . . . . . . . . .
55
B.1 A CUDA C GPU kernel designed to demonstrate flexible kernel specialization. The kernel can be compiled both with and without specialization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
C.1 The nvcc command line used to generate the PTX in C.2. The mathTest2.cu
file contained the source of Listing B.1. . . . . . . . . . . . . . . . . . 125
C.2 The run-time adaptable PTX produced by calling nvcc on the CUDA
C source in Appendix B without any fixed parameters. . . . . . . . . 125
xvii
D.1 The nvcc command line used to generate the PTX in D.2. The mathTest2.cu
file contained the source of Listing B.1 . . . . . . . . . . . . . . . . . 127
D.2 Specialized PTX produced by calling nvcc on the CUDA C source in
Appendix B and specifying all parameters on the command line. . . . 127
E.1 Unmodified OpenCV CUDA example . . . . . . . . . . . . . . . . . . 129
F.1 Modified OpenCV CUDA example: this portion is specialized . . . . 136
F.2 Modified OpenCV CUDA example: this portion is compiled into the
host program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
G.1 Initial application refresh, Part 1 . . . . . . . . . . . . . . . . . . . . 140
G.2 Initial application refresh, Part 2 . . . . . . . . . . . . . . . . . . . . 142
G.3 Pipeline iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
G.4 Per-operation timing . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
G.5 High-level timing, Example A . . . . . . . . . . . . . . . . . . . . . . 147
G.6 High-level timing, Example B . . . . . . . . . . . . . . . . . . . . . . 147
G.7 High-level timing, Example C . . . . . . . . . . . . . . . . . . . . . . 148
xviii
Chapter 1
Introduction
General purpose computing on graphics processing units (GPGPU) has seen wide
adoption in the past few years. The promise of significant performance gains over
traditional CPUs has resulted in an ever increasing variety of problem types to be
accelerated on GPUs.
However, developing GPGPU applications is often difficult, and peak performance
is obtained by carefully accommodating particular GPU hardware characteristics.
This leads to programming practices that limit the adaptability of kernel implementations to specific problems. Compounding this issue is the rapid evolution of
GPU hardware over time. Constructing a GPGPU application around the specific
properties of one GPU model can limit the performance of an implementation on
other hardware. Together, these practices hinder code reuse and the applicability of
GPGPU to a wider range of problem instances.
This dissertation introduces kernel specialization of GPGPU kernels, specifically
CUDA-enabled GPUs from NVIDIA. Kernel specialization refers to the generation of
CHAPTER 1. INTRODUCTION
GPU binaries that are customized for the current problem and/or hardware parameters. Specifically, the approach uses developer-friendly CUDA C language kernels
and run-time calls to the CUDA compiler. Kernel specialization allows GPU kernel
implementations to achieve greater levels of adaptability to variations among both
problems and hardware while preserving the performance associated with hard-coded
approaches. It may also enable more complicated kernels to fit within the constraints
imposed by the CUDA environment.
Many important GPU optimizations, such as loop unrolling, strength reduction,
and register blocking, require static values at compile time. With GPGPU kernel
compilation typically done ahead of time, it is common practice to hard code both
problem and hardware parameters. This limits the ability of many GPGPU kernels
to adapt to a wide range of problems and GPU targets. When adaptability is required
and parameter values are not specified, performance often suffers since many crucial
performance optimizations cannot be applied by the compiler. Generation of specialized GPU binaries once parameter values are known results in higher performance
code based on the now-fixed parameters. Parameters selected for specialization can
only be changed through recompilation.
The fixed parameters used to generate customized kernels are derived in part from
the particular problem instance. However, many CUDA kernel implementations also
provide a set of implementation parameters that can be adjusted independent of
the problem parameters. These often provide an opportunity to tune kernel imple-
mentations to the particular characteristics of the current GPU. Examples of these

parameters include the number of threads per block and the amount of work assigned
to each thread.
The kernel specialization method presented here, while helping improve the performance of GPU kernels, can also improve the CUDA development experience.
Common techniques that select implementation parameters at compile time using
preprocessor macros or instantiate many versions of the same kernel for different
parameter and data type combinations can be avoided. This improves the maintainability and readability of GPU kernels. The size of program binaries are also
minimized, as they do not contain compiled versions of all possible versions of the
kernel. C++ template techniques can be used to extend this benefit to higher levels
of algorithmic specialization.
This dissertation covers the application of kernel specialization to CUDA versions
of three real-world applications: large template matching, particle image velocimetry,
and cone-beam image reconstruction via backprojection. Each application utilizes
specialization differently to achieve better performance than is otherwise possible.
The applications utilize a custom framework developed as a part of this research
that automates the compilation of customized CUDA C kernels at run time.
The technique presented here complements other related research projects that
focus on greater parameterization of GPU kernels. It fills in gaps between automated domain-limited frameworks and auto-tuning tools. Custom human-developed
kernels are often required to augment libraries problem space coverage. Auto-tuning
approaches are important for determining optimal parameterizations for kernels developed with kernel specialization in mind, but may not explore the large variety of
fundamental implementation approaches that a human designer or domain-specific
tool may employ.
Contributions of this research include a methodology to write kernels once and
recompile for different parameters, along with an application framework that automates the process. Using the methodology, this research:
Demonstrates improved adaptability of GPU kernels to different problems and
architectures while offering good performance using kernel specialization
Employs register blocking and loop unrolling in a run-time adjustable fashion
with kernel specialization
Demonstrates reduced register usage with kernel specialization
Chapter 2 provides background information, and Chapter 3 discusses related research projects. Chapter 4 presents kernel specialization in more detail. The details
of the three application case studies to which kernel specialization has been applied
are described in Chapter 5. The testing scenarios and performance results for each
application are covered in Chapter 6. Chapter 7 concludes the dissertation with a
summary and plans for future work. A glossary is provided in Appendix A.
Chapter 2
Background
This section provides an overview of the challenges this research addresses. First,
Section 2.1 provides a brief description of the CUDA architectural abstraction and
programming environment provided by NVIDIA. While the principles discussed in
this dissertation apply to GPGPU in general, all experiments to date have targeted
NVIDIA CUDA-enabled GPUs. Some aspects of typical CUDA usage make writing
general code that can adapt to arbitrary incoming problems challenging, and these
issues are examined in Section 2.4.
2.1
NVIDIA CUDA
CUDA provides a software environment for writing GPGPU kernels and specifies an
abstract hardware target with large amounts of parallelism. The abstractions provided help shield the application developer from a number of low-level hardware considerations and allow NVIDIA to change the underlying implementation of the hardware abstraction, increasing peak performance over time. However, this paradigm
still requires application programmers to contend with a number of issues not of-
CHAPTER 2. BACKGROUND
ten considered when developing software for general purpose processors, including
a structured and restricted parallelism model, several distinct memory spaces, and
transferring data to and from a device that is external to the host processor. Additionally, developers have not been completely isolated from changes in the CUDA
hardware realization, which NVIDIA has been updating rapidly over time. New
GPUs have changed important hardware parameters and introduced completely new
capabilities.
In CUDA, parallelism is presented in the form of thousands of threads. To organize these threads, CUDA provides a two-level thread hierarchy composed of thread
blocks and the grid. Thread blocks can have one, two, or three logical dimensions.
The grid specifies a one, two, or, on newer GPUs, three dimensional space consisting
of identically shaped thread blocks. Together, the grid and blocks can be used to
define a large thread space defined independently for each kernel invocation but fixed
for the duration of a kernels execution.
Inter-thread communication is possible at thread block scope via a manually managed block-local memory called shared memory, which is available for reading and
writing by all threads. Block-level thread barriers allow for synchronized shared
memory access by all the threads within a block. However, large scale inter-block
communication is generally not efficient. Combined, the thread hierarchy and communication restrictions allow for the scalability of CUDA applications by constraining
kernels to relatively small and independent chunks of computation. CUDA applica-
tions scale to larger problems by increasing the number of blocks, and newer CUDA
hardware scales performance by providing more parallel block execution resources.
This requires, however, that the given problem is amenable to the problem partitioning required by CUDA.
The amount of memory and execution resources used at the block level can affect performance and must be considered by kernel developers. NVIDIAs hardware
realization of the abstract CUDA architecture uses streaming multiprocessors (SMs)
to execute blocks, as shown in Figure 2.1. Full thread blocks are assigned to a SM,
and within the SM a thread block is mapped to groups of 32 consecutive threads,
collectively referred to as a warp. Warps are the unit of execution, with the threads
within a warp executed in a single-instruction, multiple-thread (SIMT) manner. Fast
context switching between warps and the ability to execute multiple thread blocks
within the same SM generates a significant amount of thread-level parallelism. However, kernel configurations can have a significant impact on the ability of the GPU
to execute more warps simultaneously. SMs contain a limited number of shared
resources, including registers, shared memory, and warp and block state tracking
hardware. The number of blocks simultaneously executed by a SM is limited by the
block-level resource usage required by a kernel.
The above restrictions require CUDA developers to balance a number of tradeoffs when developing GPGPU applications. However, developers concerned with
targeting a wide variety of CUDA-capable GPUs must also account for changes in
Figure 2.1: A graphical representation of the NVIDIA CUDA-capable GPU architecture realization. This figure appears in [36].
Compute Capability 1.0 & 1.1
Date of Toolkit Support [13] June 2007
512
Max. Threads
Shared Mem.
16 KB
16
Shared Mem. Banks
8K
32-bit Registers
Max. Warps/SM
24
1.2 & 1.3

August 2008
512
16 KB
16
16 K
32
2.0 & 2.1

March 2010
1024
16/48 KB
32
32 K
48
Table 2.1: CUDA thread block characteristics by compute capability

both the CUDA hardware abstraction and the physical hardware. In general, newer
CUDA-capable GPUs have had an increasing number of resources per block, as shown
in Table 2.1, with the biggest changes associated with the introduction of CUDA
devices of compute capability 2.x (the Fermi architecture). While the number of many
block-level resources has increased, the physical realization of the CUDA abstraction
has also changed.
Further complicating CUDA development are memory usage concerns. While
they are not covered here, the CUDA environment provides a number of different
memory types, each with its own set of size and performance characteristics. Accommodating some memory access requirements (e.g. coalescing for global memory)
are fundamental to achieving good performance on CUDA-enabled devices. Many
other issues, however, are more nuanced, and while they occur less frequently, they
can also significantly degrade performance of the memory hierarchy.
2.2
In-Block Reductions
There are many fundamental parallel programming patterns. Of interest here are
parallel reductions. Reductions have been an important part of GPGPU programming since its inception, and an examination of techniques for implementing highperformance reductions with CUDA has been included with the CUDA SDK for
many releases [24]. In particular this work relies on reductions that include an associative operation, such as addition. Reductions are often applied to large amounts
of data and the usual concerns about generating parallelism across a set of thread
blocks still apply. Generally, each block will perform a moderately-sized reduction
on a subset of the data, usually producing a single output value per block. Multiple
rounds of the reduction kernel call are used, with each successive call requiring fewer
and fewer thread blocks to span the data set.
The reductions discussed in this dissertation are of moderate size, so only the
in-block behavior is relevant. Much like the multiple reduction kernel calls, multiple
1
10
2
+
1
+
2
+
4
+
1
+
1
Figure 2.2: Example of a parallel reduction tree starting with eight elements.
reduction rounds within a block are used, as shown in Figure 2.2. Less parallelism
available after each round, with the working set, and therefore number of threads
needed, reduced by half (assuming power-of-two initial element counts). This forms
a tree where the number of levels, or rounds, in the tree is determined by the base-2
logarithm of the initial number of elements.
In NVIDIA GPUs, register memory is private and not accessible by other threads.
This requires reductions to take place through shared memory. For each level of the
tree, data is written to shared memory, a thread-synchronization barrier is executed,
and then the remaining participating threads read out the newly compacted data.
However, with CUDA, the number of synchronizations is not the same as the number
of levels in the tree. Due to the thirty-two-thread SIMT width (warp) in NVIDIA
GPUs, the threads within a warp are guaranteed to be in sync. Once the number
of elements reaches sixty-four, a single warp can finish a reduction without synchronization.
Compute Capability
1.0 1.1
1.2 1.3
2.x
11
Register File
per SM
32 KB
64 KB
128 KB
Shared Memory
Size per SM
16 KB
16 KB
16/48 KB
Table 2.2: The amount of register and shared memory available within each streaming
multiprocessor for NVIDIA GPUs of various compute capabilities [38].
Regardless, throughout the duration of a block-level reduction, fewer and fewer
threads participate after each level of the tree. This results in a increasing number
of idle threads.
2.3
Recent CUDA Techniques
Since the introduction of CUDA, many kernel implementation techniques have been
suggested. Most have not broken with the original paradigm suggested by NVIDIA,
which emphasizes context switching among many warps. However, some recent
research has proposed new techniques that take advantage of particular traits of
NVIDIA hardware.
First, NVIDIA GPUs provide a memory hierarchy that is inverted relative to
CPUs, with more on-die memory dedicated to the register file than shared memory [54], as shown in Table 2.2. However, since shared memory does not provide
enough throughput to achieve peak performance; sourcing ALU operands from registers is mandatory to maximize computational throughput.
While this encourages increased register usage, other factors can exert downward pressure on register usage. When sufficient instruction-level parallelism is not
12
available, CUDA-enabled GPUs rely on rapid low-latency context switching between

warps, possibly from multiple thread blocks, to hide latencies. However, a streaming
multiprocessor can only execute warps from multiple blocks if it contains enough
hardware resources, including registers, to execute multiple full blocks simultaneously. As each thread uses more registers, fewer thread blocks may be processed by
a single SM. Even when a SM is only executing a single thread block, complex or
highly register-blocked kernels often require a reduction in the number of threads per
block to fit within resource limits.
Increasing the number of warps that can be run simultaneously within a streaming
multiprocessor is one of NVIDIAs main performance recommendations [37]. However, NVIDIA GPUs have been shown to be able to take advantage of significant
memory- and instruction-level parallelism, even to the extent where a relatively small
number of warps fully saturate the available memory or computational hardware resources [54]. While occupancy may be very low, additional active warps will not
improve performance. This creates a scenario where maximum performance can be
achieved using resource intensive thread blocks with low numbers of threads per
block.
One technique to increase the number of operands coming from register memory
is register blocking. Register blocking has been shown to be an effective performance
technique for GPUs [55, 53]. Higher levels of register blocking on the GPU generally
assign more work to each thread, shifting the balance from thread-level parallelism
13
(TLP) to instruction-level parallelism (ILP).

A second relatively new CUDA kernel programming technique is called warp
specialization. While intra-warp thread divergence is undesirable, there is no intrinsic
penalty associated with inter-warp divergence. Introduced by Bauer et al. [3] in their
CudaDMA library, warp specialization was used to dedicate several warps to loading
data and the rest to compute. Data was double buffered through shared memory, and
a block-wide synchronization point was used to control toggling buffers between the
two groups of threads. By manually assigning and limiting warps to the distinct tasks
of data movement and processing, CudaDMA is able to maximize the parallelism used
by ensuring that both the memory hierarchy and compute resources are constantly
engaged.
2.4
CUDA Development and Adaptability
Much like traditional CPU code, typical CUDA C development practices compile
code once ahead of time. However, unlike traditional C-like languages on CPUs, the
target for compilation is often not machine executable. CUDA C is first compiled
to PTX (parallel thread execution), an assembly-level intermediate representation
for NVIDIA GPUs [40], and then, at run time, translated to the final instruction
set architecture (ISA) used by the target GPU. The separation of PTX from a fixed
ISA allows NVIDIA to make hardware changes between generations of GPUs while
preserving portability of compiled programs.
14
It is possible to instruct the CUDA C compiler front-end to generate the actual

binary ISA version of a given kernel before run time, as is done in this work. PTX and
binary versions of a compiled kernel can reside in the same Executable and Linkable
Format (ELF) formatted executable objects. A compatible binary version will be
selected before translating a compatible PTX version.
The syntax of PTX files is similar to many other assembly language formats,
with some extra CUDA-specific registers and instructions for managing the wide
parallelism available on NVIDIA GPUs. Sample PTX listings are provided in Appendices C and D. Familiar instruction types, such as load, add, multiply, etc., are
suffixed with a data type and/or memory space. PTX uses load-store semantics, with
the destination specified before source operands. Registers, whose names are prefixed
with %, are virtual. Register assignment takes place during the translation from PTX
to a binary ISA. This abstraction helps to accommodate the varying register files on
different CUDA-capable GPUs.
The just-in-time (JIT) translation from PTX to a binary ISA is intended to be
light-weight so as to minimize overhead and leaves little time for applying optimizing
code transformations. Many important optimizations, especially those important to
high performance on NVIDIA GPUs, are applied when CUDA C is compiled to PTX.
Optimizations of particular relevance for this dissertation include strength reduction,
constant folding and propagation, loop unrolling and register blocking.
An important property of these optimizations is that they require fixed values
15
at compile time. For example, the compiler must know when scalars are powers of
two to strength reduce division or modulus (two relatively expensive operations on
NVIDIA GPUS) to bit-wise operations. Likewise loop counts must be fixed for the
compiler to unroll loops or implement register blocking.
While program loops are fully supported by GPGPU languages like CUDA C, they
may incur significant overhead [49]. Unrolling loops is a key CUDA performance optimization. Rolled loops need to include several instructions for loop setup, iteration,
termination condition checking, and branching, all of which introduce overhead and
reduce the ability of the GPU to take advantage of ILP.
Register blocking can be complicated by the fact that existing NVIDIA GPU
architectures cannot indirectly address registers. Fixed loop counts are required for
the CUDA C compiler to specify the use of extra registers for data and assign them
unique virtual registers in the PTX representation of a program.
A similar issue occurs with constant memory space declarations. These must have
a fixed size at compile time. The constant memory is reserved when a CUDA module
(the CUDA translation unit) is loaded, and the total amount of CUDA memory that
can be allocated across all loaded kernels is limited to 64 KB.
A fundamental issue with fixing values at compile time is limiting the ability of a
kernel to adapt to new problems without recompilation. It is possible to leave CUDA
C kernels without fixed values, forgoing the aforementioned optimizations, but this
may incur additional performance penalties. Registers may have to be dedicated to
16
the storage of intermediate values computed from one or more adjustable parameters.
Independent parameters, either inputs or intrinsic values like thread indexes, have to
be loaded from shared memory or special registers into regular registers before they
can be used. Dynamic code without fixed values at compile time also often requires
extra run-time guards against illegal values and memory accesses.
Adaptability can also interact with preferences for hardware-friendly parameter
values. Common fundamental parallel algorithmic components are simplest in terms
of control flow at powers of two. Reductions, for example, are guaranteed to have
an even number of elements at each tree level if the initial size is a power of two.
Otherwise, extra logic to handle tree levels with odd element counts must be incorporated. All told, adaptability can contribute a significant number of non-compute
instructions to a GPU kernel.
Since CUDA C supports many features of C++ templates, they can be used to
help circumvent these restrictions while offering some level of adaptability. Templates
with template arguments for parameter values that control optimizations can be
explicitly instantiated multiple times for different fixed parameter values. A table
lookup of function or method pointers can be used to select the optimal specialization
for a given problem. This technique is useful, especially for handling multiple data
types, but has drawbacks. First, the adaptability is limited to the pre-compiled range,
and second, it can significantly increase compiled binary size. While the impact of
applying this technique to a single kernel may be limited, the increase in code size
17
across a large software project may not be acceptable.

The above issues can make CUDA development difficult, but the same types of issues apply to hardware, adding another layer of complexity. As mentioned, NVIDIA
has frequently updated its GPU architecture. Newer devices have increased the
number of block-level resources that may affect the partitioning of a large problem
among thread blocks. New versions of CUDA have also introduced new features. Examples of these include atomic operations, double-precision floating-point support,
and more advanced block-level synchronization primitives. The relative latencies of
several operations has also changed over time. For example, the relative throughput
of the * operator and the
[u]mul24() intrinsic for integer multiplies was inverted
between NVIDIA GPUs of compute capability 1.3 and 2.0. Another example is the
throughput of shared memory relative to the register file decreases between GPUs
of CUDA capability 1.3 and 2.0, putting additional emphasis on effective use of the
register file in newer GPUs. These factors can result in different optimal implementation approaches between architecture generations, making performance portability
difficult.
2.5
Implementation Parameters and Parameterization
The thread block and grid structured parallelism provided by CUDA, block-level resource limitations, the need to utilize important optimizations, and variations among
CUDA-capable GPUs all combine to create a complicated development environment
18
that requires developers to balance many non-obvious trade-offs. One mechanism for
balancing some of these trade-offs is implementation parameters.
Implementation parameters are generally scalar values that control how a CUDA
kernel performs its computation. These are usually distinct from problem parameters
and can be independently adjusted. The shape and number of threads in a threadblock, for example, is a built-in CUDA implementation parameter that every kernel
must select. At the most basic level, the number of threads in a block trades off the
availability of block-level resources with the number of total thread blocks. A second
CUDA built-in parameter is warp size. This, however, has not been a major issue as
NVIDIA has so far kept the warp size at thirty-two threads.
In addition to the required CUDA implementation parameters, many kernel implementations add a number of other parameters. An example of this occurs with
tiling, a common kernel implementation technique that breaks a large parallel computation up into smaller chunks that are distributed among thread blocks. The size
of the tiles can control how much shared memory is required per block, potentially
accommodating increased shared memory sizes in newer GPUs. Increasing tile sizes
often assigns a larger amount of work to each block, which can be used to scale the
block-level work load to newer GPUs with more threads and resources available in
each thread block.
Another example involves adjusting the number of registers used for register blocking, which can be used to adjust the amount of work assigned to each thread. As
19
discussed above, register blocking can improve ILP but comes at the cost of increased
register usage.
In addition to performance, changing implementation parameters can also impact
the complexity of CUDA kernel code. Shared memory, for example, can be allocated
either statically at compile time or dynamically at kernel launch time. Dynamically
allocated shared memory, however, is more complicated and error prone to use.
It is common practice, and required for certain parameters, such as register blocking levels, to fix implementation parameters for all launches of the kernel. This allows
the compiler to apply optimizations, but a single value may not be optimal across a
range of problem parameters. CUDA kernel performance is notoriously sensitive to
small variations in both problem and implementation parameters. Several research
projects have focused on investigating techniques for the optimal selection of implementation parameters across problem parameter ranges. Some of these efforts are
described in Chapter 3. There is, however, a fundamental tension between adjusting
implementation parameters for different problem and hardware parameters and the
relative importance of compile-time optimizations for GPGPU performance. This
issue is the focus of the work described in this dissertation.
2.6
OpenCV Example
Taken together, the issues discussed in this chapter produce a complex environment
that negatively affects GPGPU kernel adaptability and code reuse. Ultimately, this
20
reduces the application and adoption of GPGPU. As a real world example of these
issues, Appendix E contains a listing from the open-source OpenCV computer vision
librarys CUDA module that implements row filtering, a basic image processing algorithm [42, 41]. The kernel is indicative of the effort required to achieve maximum
performance with CUDA while maintaining adaptability.
A number of the kernel development issues discussed in the last section appear
in the filter kernel. Lines 67 through 77 encode fixed block and grid dimensions
and other parameters in the kernel directly, which are used to control loop unrolling
counts. The preprocessor conditional selects values based on compute capability.
The macro definition on line 71 declares constant memory for storing the filter that
will be applied to the input. The constant size creates an arbitrary ceiling on the size
of filters that can be applied. This may make sense for the application it is unlikely
that a user will have a filter larger than thirty-two pixels but is also required for
meeting the CUDA restriction that the constant memory size is known at compile
time.
Many of the compile-time optimizations discussed above are present in the OpenCV
kernel. The kernel utilizes explicit instantiation of kernel variants for every supported
parameter combination, contained in the array of function pointers starting on line
164. The declaration explicitly instantiates template specializations for filter sizes
from one to thirty-two pixels, which are necessary for loop unrolling, and for each
addressing mode.
21
The explicit specializations starting on line 348 are used to include multiple versions of the filter code based on the needed data types. Instead of relying on a
lookup table, C++ overloading is used. Versions of the lookup table are compiled for
each data type pair. All told, 800 variants of the kernel are generated and compiled
into the kernel binary. It should be noted that for each kernel version a dedicated
CPU-invoking function (linearRowFilter caller()) is also compiled, increasing
the penalty for including multiple kernel variants in a binary.
Chapter 3
Related Work
This chapter discusses a number of other projects and research that are related to the
work discussed in this dissertation. The related work can be roughly categorized as:
run-time code generation generation, autotuning, and domain-specific tools. Many
of the domain-specific tools also include an autotuning component.
In addition to these main categories, there have also been examples in the literature of kernel customization for a very specific application. Stone et al. [52] have
demonstrated the potential for improving the performance of GPU kernels through
the use of custom compilation for a specific problem instance, but did not integrate
runtime compilation of GPU kernels into their molecular orbital visualization software. The group manually generated variants of their kernels for specific problem
instances so that loops are unrolled. They report a 40% performance improvement
for the problem-specific kernels.
Linford et al. [31] used CUDA for a GPGPU implementation of large scale atmospheric chemical kinetics simulations. The basis of their GPU implementation
CHAPTER 3. RELATED WORK
23
is a problem instance specific C kernel generated by the Kinetic PreProcessor [15]

software package. While the serial C code is problem specific and generated automatically, the conversion to CUDA C was a manual process.
Stivala et al. [51] created a CUDA implementation of a bioinformatics protein
structures database search problem. Before performing a search, each thread block
loads information about the query into shared memory. However, for some searches,
the query information is too large for shared memory. To handle this, they compile
a second version of the kernel that only uses global memory. While the kernel that
only uses global memory is much slower, the number of queries in their data sets that
exceed shared memory usage is small so that it is still faster to run the inefficient
kernel on the GPU instead of transferring the data to the CPU for handling the
outliers and then back to the GPU.
Also to address hardware differences, Levine et al. [30] provide different implementations of radial distribution function histogramming kernels for different generations
of NVIDIA GPUs and determined optimal implementation parameters.
3.1
Run Time Code Generation
In this dissertation, kernels that are customized for a specific problem and target
GPU at run time are explored. Another approach for executing customized GPU is
generation of GPU code on the fly at run time.
The NVIDIA OptiX ray tracing engine [44] uses run-time manipulation and com-
24
pilation of PTX kernels, but requires them to be compiled offline using the CUDA
nvcc compiler. While operating only on PTX, the OptiX PTX to PTX compilation
provides a number of domain specific optimizations at runtime by analyzing both
the PTX source and the current problem. These include memory selection, inlining
object pointers, register reduction, and efforts to reduce the impact of divergent code.
Garg et al. [21] have created a framework that accepts parallelism and type annotated Python code and automatically generates GPU code targeting the AMD
Compute Abstraction Layer (CAL) [1], a low-level intermediate language and API
for AMD GPUs. An ahead-of-time compilation step converts Python functions to an
intermediate state and a runtime just-in-time (JIT) compiler generates both GPU
kernel and host control code using program information that only becomes known
at runtime. The JIT compiler can perform loop unrolling and fusion and optimizes
some memory access patterns.
Dotzler et al. [18] present a framework (JCudaMP) that takes Java code with
OpenMP parallelization pragmas and automatically generates code to run on NVIDIA
CUDA capable GPUs. A custom Java class loader converts a subset of Java/OpenMP
parallel code sections with additional JCudaMP annotations and dynamically generates C for CUDA source. JCudaMP relies on the CUDA nvcc compiler and invokes it at runtime to compile and link C for CUDA to a shared object that is
loaded by the Java virtual machine. The generated CUDA code will check for and
indicate addressing violations for exception handling and uses Javas array-of-arrays
25
multi-dimensional array organization. While tiling of problems larger than the target
GPUs global memory is supported, important aspects of CUDA performance, like
shared memory, are not. Overhead information is provided for CUDA compilation
but not for CUDA code generation.
Also starting with Java, Leung et al. [29] have modified the JikesRVM Java virtual
machine to connect to the RapidMind framework for the purpose of running Java code
on NVIDIA GPUs. After parallelization, their modifications generate RapidMind
IR to feed directly into the RapidMind back-end, which generates target specific
code. The RapidMind template libraries and front-end stage is not used. Loops are
automatically identified and parallelized, and a cost-benefit analysis is used to decide
which loops should be executed on the GPU.
Mainland and Morrisett [33] create an abstract domain-specific language, Nikola,
and embed it inside of Haskell. While significantly raising the abstraction - Nikola
is a first-order language represented in Haskell using higher-order abstract syntax and allowing the use of NVIDIA GPUs from within a pure Haskell environment, the
approach seems limited in its ability to effectively utilize many CUDA resources and
target all algorithms. Nikola automatically generates and compiles CUDA kernel
code and manages data movement for the user. It appears that CUDA functions are
compiled once at program compile time or at runtime at each invocation.
Ocelot, created by Diamos et al. [17], dynamically translates NVIDIA PTX code
to Low Level Virtual Machine (LLVM) IR. From there, a just-in-time compiler can
26
generate code for either a multi-core CPU system or NVIDIA GPUs at runtime. The
focus of the project is on restructuring CUDA-oriented threads and memory use to
match multi-core CPU platforms.
Bergen et al. [7] discuss their experiences using OpenCL in an HPC environment
creating a compressible gas dynamics simulator that runs on GPUs from multiple
vendors. They cover several middle-level abstractions they developed to increase
programmer productivity. Of particular interest, the authors use C-based macros
to compile in parameter constants to reduce memory usage and specialize kernels.
Additionally, one of two variations of the kernel is selected at run time based on
the hardware target. One variation consists of separate stages that the increases the
memory accesses to computation ratio but use fewer registers. The second variation
fuses the two stages but uses much register resources.
PyCUDA [27] explicitly attempts to address many of the same issues discussed
in this dissertation, including parameterization and GPGPU kernel flexibility. PyCUDA includes a number of Python object oriented abstractions around CUDA API
entities and provides many of the same features as the GPU Prototyping Framework, discussed in the next chapter, such as GPU memory abstractions, precision
timing, and caching of compiled GPU binaries. PyCUDA also provides higher-level
abstractions such as GPU-based numerical arrays and support a number of elementwise and reduction operations and fast Fourier transforms. In addition to the builtin operations, PyCUDA provides element-wise and reduction code generators for
27
user-provided functions. By combining PyCUDA with other Python-based tools for

modifying code templates, it is possible to implement parameter space exploration
functionality. PyCUDA forms the basis for some even higher-level domain-specific
tools discussed in Section 3.3.
The MATLAB Parallel Computing Toolbox [34] provides a set of CUDA-accelerated
versions of many standard MATLAB functions. Functionality is largely hidden from
the user behind custom MATLAB array types that allow users to program at the
same high abstraction level as CPU-based MATLAB code. Interfacing custom CUDA
C kernels into the built-in GPU functionality is supported, as is automated kernel
generation from user-specified MATLAB functions. Instead of generating CUDA C
and relying on the CUDA C compiler, the Parallel Computing Toolbox generates
code that runs on the GPU.
3.2
Autotuning
Autotuning is becoming an increasingly popular approach to improving kernel performance. There are many examples in the literature regarding autotuning specific
classes of problems, but the work highlighted here focuses on more general autotuning tools. It has been employed to both improve performance and help adapt
to different problems. Choi et al. [12] have worked on sparse matrix-vector multiplication on NVIDIA GPUs. Starting with two high-performance hand-optimized
kernels, they use offline autotuning with a performance prediction model to select
28
implementation parameters quickly at runtime. The performance model is derived

from assumptions drawn from the characteristics of the sparse matrix-vector problem
and incorporates a number of hardware, kernel, and problem instance parameters to
select thread block counts and data tiling sizes. For the problems tested, their model
selects the optimal configuration in 86.5 percent of test cases. When a non-optimal
configuration is selected, the performance was only a few percent lower than that of
the optimal configuration. Nath et al. [35] apply similar techniques, but for dense
linear algebra.
While Baskaran et al. [2] perform affine loop nest transformations similar to other
work, they also use a model-driven empirical search for optimizing GPU kernels.
Their work focuses on memory hierarchy performance and determines optimal tile
sizes and loop unroll values.
The G-ADAPT framework of Liu et al. [32] takes GPU kernels and empirically
evaluates multiple kernel variation and configuration parameters through benchmarking. At runtime G-ADAPT can estimate with high accuracy the best kernel and
configuration for an incoming set of problem parameters, including the size of the
inputs. Kernel configurations are generated automatically from instrumented GPU
code.
Rudy et al. [48] have created the CUDA-CHiLL framework for generating optimized CUDA code. The framework can apply transformation recipes to C-language
loop nests to CUDA C, accounting for memory accesses patterns, tiling, and assign-
29
ing data to registers or shared memory. Combined with autotuning to search the
transformation and parameter space, highly optimized kernels can be generated.
Using OpenMP code as a starting point, Lee and Eigenmann [28] document their
OpenMP to CUDA source to source compiler called OpenMPC. Directives and environment variables are used to guide the CUDA code generation and include a large
subset of CUDA features, but translation between OpenMP and CUDA concepts is
automatic as is selecting portions of the application code to convert to GPU kernels.
Parallel regions are selected for GPU acceleration based on the level of inter-thread
data sharing and communication, and data transfers between the host and GPU are
minimized. The authors provide tools for searching the configuration space, including tools to prune unnecessary variables from the search space and generate and test
variants. Results using the tool set on a few OpenMP benchmark applications and
algorithm kernels show that they achieve 88 percent, on average of the performance
of hand-tuned kernels.
3.3
Domain Specific Tools
A number of domain-specific tools allow for highly automated high performance

GPGPU implementations from still highly automated and high-level representations
of algorithms. Many projects include aspects of run-time code generation and/or
autotuning.
Tools from the FLAME project [25] can generate very efficient dense linear algebra
30
code by applying a number of loop-based transformations. Designed for distributed

memory systems, including GPUs, cost functions for computation and data movement are used to estimate the performance of each. Modeling the performance of the
transformations enables a much faster search of the permutation space than empirical
benchmarking.
Ravi et al. [45] have developed a system for easing the implementation of generalized reduction (MapReduce) applications on systems with CPUs and GPUs. Users
provide serial kernels conforming to the MapReduce paradigm and their compiler
will automatically parallelize and generate both kernel and host code. A runtime
component dynamically adjusts the workload balance between the CPU and GPU.
A number of frameworks relying on embedded domain specific languages (DSL)
have been documented in the literature. The Delite Compiler Framework and Runtime [11, 8] embeds OptiML, a machine learning DSL, in Scala and targets CUDA,
while the Chai [26] framework reimplements the PeakStream [43] language for targeting OpenCL. Both feature code generation and automatic GPU memory management. Liszt is a DSL in Scala by DeVito et al. [16] used to represent partial differential
equations, which are solved with stencil-based computations. Their framework translates the DSL into an intermediate representation, and at run time it is optimized
for the target problem and translated to CUDA C.
Grewe et al. [22] have presented a framework that generates and then autotunes
CUDA or OpenCL kernels for sparse matrix-vector multiplication from a high-level
31
input representation. Zhang and Mueller [58] have developed a framework for generating highly optimized 3D stencils, from a concise user-provided representation
consisting of a single equation and a number of general stencil parameters. After
code generation, an autotuning facility search for optimal kernel design and configuration and supports targeting one or more GPUs.
Also working with stencils Catanzaro et al. [10] are able to use the JIT compilation
features of Ruby and Python (based on PyCUDA) to transform concise embedded
stencil representations into GPU code. The performance, however, is slower than
hand-coded kernels, partially due to JIT compilation overhead, but some benefit is
seen from specialized kernels with fixed loop bounds. Building on this work is the
Copperhead [9] domain-specific language embedded in Python that can be used to
represent data parallel operations in the form of map, reduce, scan, gather, etc. These
are then compiled into CUDA code at run time.
3.4
Summary
The literature surveyed here relates to this research in a number of different ways.
While able to specialize GPU kernels for specific problems at run time once the
parameters are known, run time code generation tools are limited by the range of
problem domains that they understand. Likewise, by limiting a tool to a specific
problem area, such as many of the domain-specific tools, the automatic generation
of high performance GPU code becomes tractable. However, users are restricted
32
to a language subset that may not be able to represent the problem at hand. The
approach to kernel specialization presented here relies on standard CUDA C kernel
input, which can be designed to target any problem domain. While a much more
manual approach, this can be used to fill in gaps between the capabilities offered by
more restricted tools.
General autotuning tools, while highly effective, can take a long time to search the
parameter space. In addition, these tools may be limited by the variety of supported
code transformations and may never radically restructure implementations the way a
skilled human designer may. The research presented here is complementary to these
types of advanced parameter space mapping and performance estimation tools. By
using highly parameterized CUDA kernels that are specialized quickly at run time,
autotuning tools can be used to characterize the performance of a given implementation so that effective parameters can be selected quickly and used to compile a
specialized kernel.
Chapter 4
Kernel Specialization
Chapter 2 described the current state of CUDA development, with particular emphasis on the trade-offs between performance and adaptability. In this chapter, a
technique for addressing these trade-offs, kernel specialization, is introduced.
Kernel specialization is as a technique to enable greater GPGPU kernel adaptability while mitigating the associated performance overheads. Kernel specialization
involves delaying the generation of GPU binaries until problem and hardware parameters are known. This enables a number of static-value optimizations that are
important for achieving high-performance on GPUs.
From a kernel development perspective, all that is required to use kernel specialization is to write a kernel in terms of undefined constants. These constants are
provided values when the kernel is compiled. The CUDA kernel in Listing 4.1 is
designed to highlight several of the most common types of optimizations that are applied at compilation time. In the example kernel, each thread determines its global
offset in the thread space to use as a base offset to the in and out pointers. Then,
CHAPTER 4. KERNEL SPECIALIZATION
34
using a stride of argA * argB, count values from in are accumulated in acc. The
kernel assumes that the memory regions pointed to by in and out are large enough
for any accesses generated by the other input values.
global
void mathTest ( int in , int out , int argA , int argB , int
loopCount ) {
int a c c = 0 ;
1
2
3
4
5
6
7
8
9
10
11
12
13
const unsigned int s t r i d e = argA argB ;

const unsigned int o f f s e t = b l o c k I d x . x blockDim . x + t h r e a d I d x . x ;
f o r ( int i = 0 ; i < loopCount ; i ++){
a c c += ( i n + o f f s e t + i s t r i d e ) ;
}
( out + o f f s e t ) = a c c ;
return ;
}
Listing 4.1: A CUDA C GPU kernel designed to demonstrate the common ways
kernel specialization can improve the performance of kernels. This kernel is a regular
fully run-time evaluated kernel and does not rely on specialization in any way.
global
void mathTest ( int in , int out , int argA , int argB , int
loopCount ) {
int a c c = 0 ;
1
2
3
4
const unsigned int o f f s e t = b l o c k I d x . x BLOCK DIM X + t h r e a d I d x . x ;
5
6
7
8
9
10
11
12
f o r ( int i = 0 ; i < LOOP COUNT; i ++){

a c c += ( ( int ) PTR IN + o f f s e t + i ARG A ARG B) ;
}
( ( int )PTR OUT + o f f s e t ) = a c c ;
return ;
}
Listing 4.2: A CUDA C GPU kernel designed to demonstrate kernel specialization.

The constants specified by identifiers in all capital letters must be provided at compile
time.
The kernel in Listing 4.1 is fully run-time evaluated (RE). That is, no parameters
35
are assumed to have fixed values a compile time. As discussed in Chapter 2, this
allows the kernel to adapt to any set of inputs but comes at the expense of lower
performance.
The kernel can be modified for use with kernel specialization by replacing only
a few identifiers with constants that will be specified at compile time, as shown in
Listing 4.2. Unique names are required, assuming that the original kernel prototype is not modified, and as a convention, macro values that will be specified on
the command line are in all capital letters. In the specialized kernel (SK), a fixed
value for LOOP COUNT allows the for() loop to be unrolled. Fixed values for ARG A,
ARG B, and PTR IN result in constant folding and propagation. Lastly, fixed values for
BLOCK DIM X and PTR OUT allow the use of immediate values in the generated PTX,
removing the need to load special registers or kernel arguments from shared memory.
Expanding on the specialized kernel of Listing 4.2, the sample kernel code in
Appendix B demonstrates how it is possible to create a single CUDA C kernel that can
be used in both specialized and non-specialized situations. The kernel source listing
implements the same functionality as the first two examples, but allows for individual
control over which parameters are evaluated at run time or specialized when the
kernel is compiled. It also demonstrates two basic routes for taking advantage of the
parameter values specified on the compiler command line: preprocessing directives
and constant-value macro expansion, as previously shown, or as arguments to C++
templates.

In the example, macro definition names that are prefixed with CT
36
function as
Boolean flags to control whether a parameter is evaluated at run time or specialized at

compile time. The series of preprocessor directives before the kernel declaration define
default values for macros when left undefined and allow the kernel to be compiled in
a fully run-time evaluated (non-specialized) state when it is not compiled in a kernel
specialization-aware setting.
Lines 51 through 57 in the kernel demonstrate using C++ templates to toggle
the specialization state of three dependent parameters, count, stride, and offset.
The typedef lines of 51 through 53 use the CT -prefixed macros to select between
template specializations that either return the run-time evaluated argument or the
value specified at compilation time. Here, the loop iteration count, thread block
dimensions, and a purely dependent parameter are optionally specialized. The gpu
and gpu::ctrt namespace utilities are defined in the gpuFunctions.hpp header. The
gpu::num class converts macro values to types for use in additional math template
classes, like gpu::ctrt::mult on line 52. (The ctrt namespace is an acronym for
compile time/run time.)
Lines 55 through 57 invoke static methods belonging to the template classes instantiated on lines 51 through 53. In the run-time evaluated non-specialized parameter template specializations, the argument, which can refer to an input or a computed
value, is returned. This produces a run-time dependency. For example, the run time
evaluated specialization for computedStride::op() is simply a * operator applied
37
to the kernel inpus argA and argB.

The template specializations associated with specialized parameters generate static
values at compile time that the compiler can optimize. Applying the
forceinline
CUDA keyword function attribute guarantees inlining, even with the method call
syntax, and allows the use of the same kernel code to access values regardless of
specialization status. There is no need to access the static const struct members
directly, as is typical with C++ template compile time processing. For the same
computedStride::op() example, the specialized version returns the multiplication
of the two template arguments specified on line 52: ARG A and ARG B. Since, as template arguments, these are static, the compiler performs the multiplication during
compilation. As a constant variable with a known value, the compiler can also propagate the result of this multiplication when it is used later on in the kernel.
Using preprocessor directives to toggle specialization status is shown by the two
if/else conditionals starting on lines 60 and 67. The conditionals control inclusion of
non-specialized code dependent on run-time evaluated kernel arguments or specialized
code based on fixed values. In the later case, pointer values are expanded directly
into the code1 .
To demonstrate the significant differences produced by kernel specialization, Appendices C and D contain PTX generated from the single CUDA C kernel in Appendix B. For the listing in Appendix C, none of the parameters values were special1
Statically compiled pointer values are provided as unsigned long hexadecimal values, so casting
is needed. Single- and double-precision floating-point values can be specified on the command line
in a similar manner. See Section 4.4
38
ized, making the PTX fully run-time evaluated. In contrast, Appendix D contains
PTX for the case where every parameter was specialized and fixed at compile time.
In this case, a loop iteration count of five, and a one-dimensional block of 128 threads,
was used. The argA and argB inputs were fixed at 3 and 7, respectively. The input
pointer was set to 0x200ca0200, and the output pointer to 0x200b80000.
Comparing the two PTX samples, a number of observations are immediately
apparent. First, the specialized PTX in Appendix D has no control flow. The
for() loop is completely unrolled. The run-time adaptable kernel includes several
instructions for loop setup, iteration, termination condition checking, and branching.
As discussed in Chapter 2, loop unrolling is important on NVIDIA GPUs.
Other optimizations are also visible in the PTX samples. These include strength
reduction and constant folding and propagation. While many computed values depend on thread index values that are unique to each thread and must be generated at
run time, such as offsets from a base memory pointer, the computations often involve
common subsets dependent only on parameters that can be determined at compile
time, such as thread block, grid, or data dimensions.
In the specialized PTX, base plus offset addressing is used and fully unrolled.
The base input pointer (0x200ca0200 is 8,603,173,376 in decimal) plus an stride of
84 bytes (argA * argB * 4) is propagated through the unrolled loads, but the base
register is still dependent on a run-time variable - the thread index. A strength
reduced multiply of 128 (shift left by 7) also appears.
39
In this case, the specialization is complete. The specialized PTX kernel contains
no references to the input arguments. The kernel arguments, however, are kept to
preserve the interchangeability of the kernel in the case that various input parameters
are toggled between run-time evaluated and statically compiled variants.
4.1
Benefits
Instead of locking a kernel into one of two regimes of higher performance without
adaptability or adaptability with lower performance, kernel specialization allows for
both performance and adaptability. The static-value requirements for many important optimizations are effectively removed.
Beyond generating efficient code for a given problem, kernel specialization can
also improve the performance portability of a single GPU kernel implementation between different GPUs. Implementation parameters that are adjustable independently
from problem parameters can be used to adjust the characteristics of a kernel for a
particular device. Of particular importance are implementation parameters that impact the amount of work assigned per-block and per-thread, discussed in Chapter 2,
such as adjusting tile sizes and the number of registers used for register blocking.
Second order performance effects resulting from the interplay between problem
and hardware characteristics are possible and can also be addressed with kernel specialization. For example, problem parameters at low extremes can reduce the amount
of work and/or inherent parallelism available at the block-level to the point where
40
ILP alone is not sufficient to maximize performance. Reducing the amount of work
assigned to each thread can increase thread counts, potentially improving performance.
A less measurable benefit of kernel specialization at run time is better code reuse
and maintainability. While greater adaptability may increase initial development
time due to the need to account for a greater number of corner cases and parameter
interactions, once completed, a single GPU kernel can often be specialized for and
applied to a wide range of problems. Assumed block and grid dimensions and parameters related to differing compute capabilities often appear both in the kernel code
(usually as macro definition constants or hard-coded values) and host code. With
kernel specialization, these assumptions can be removed from kernel code, as they
are provided to the kernel when it is compiled. Similar to loop unrolling and register
blocking, kernel specialization can convert fixed size constant memory declarations
to ones that are dynamically sized. Kernel specialization also allows developers to
use the simpler static shared memory allocation syntax but have it behave like dynamically allocated shared memory.
Kernel specialization can be very powerful when combined with C++ templates
and compile-time techniques. Kernels can be further customized beyond purely numerical parameter values. Template specializations can be used to select among a
number of variants based on the specific problem parameters or hardware characteristics. An example related to the former would be selecting a different data type or
41
per-pixel comparison metric (i.e. sum of absolute differences to sum of squared differences) for a sliding window matching operation. For the later, kernel specialization
could be used to select between the * operator and the
[u]mul24() intrinsic for
integer multiplies. With kernel specialization, the kernels for the particular problem
and hardware instance are generated as needed instead of compiling all supported
variants ahead of time. This can significantly reduce program binary size.
Put together, a single implementation can generate optimized GPU binaries at
hardware friendly values, but also perform well on other values. Kernel implementations may include many additional implementation parameters for greater problem
and hardware adaptability without many of the associated impacts of more flexible
GPU code.
4.2
OpenCV Example
As a concrete example of several of the benefits discussed in this section, Appendix F

includes a lightly specialized version of the same OpenCV GPU kernel shown in
Appendix E. It represents a real-world study of the possible benefits provided by
kernel specialization at run time. It also provides an example of incrementally adding
kernel specialization to existing CUDA C code. An implementation designed with
kernel specialization in mind may have produced a significantly different design.
The modified source is split over two compilation units, Listings F.1 and F.2. The
first contains the GPU kernel and the host code that launches the kernel. This file
42
would be compiled with specialized values at run time. The second grouping contains
the host function that OpenCV applications invoke from the host.
The kernel code in the specialized portion is similar to the original version since
many specific template specializations were present in the original. The anchor argument was converted from run-time evaluation to compile time. Similarly, the template
arguments to linearRowFilter caller() function were replaced with tokens that
are replaced directly with the type names. Another difference is the elimination of
the preprocessor conditional that controls certain kernel launch parameters based on
the current GPUs compute capability. This logic is now isolated in the host code.
The second compilation unit contains the host code that is invoked by OpenCV.
The specialize() function accepts a source file name, a target function name, and
then key-value pairs in the form of strings. This specialize() function generates
a specialized version of the CUDA C source and returns a function handle to a
customized version of the linearRowFilter caller() function. The type specializations for the linearRowFilter gpu() host code are kept since they are used for
host-side C++ linking with the rest of the OpenCV library.
In this example, the specialize() function is designed to work with the compilation model of run-time CUDA API, where a host function is usually compiled along
with one or more kernels. The actual mechanism used in this research is described
in Section 4.4.
Notably absent from the example are the explicit template specializations used
43
to pre-compile many kernel variants. These are now generated on the fly and for any
combination of types, parameter values, and target Compute Capabilities.
4.3
Trade-Offs
While compiling kernels at run time offers a number of benefits, the enabling mechanism is itself the downside: compiling kernels incurs overhead. The delay incurred
from compiling kernels can vary significantly, depending on the platform and the
complexity of the kernel.
Despite the overhead, it is not a factor in many situations. For example, with
many long-running streaming applications processing throughput is more important
than the length of the initialization phase. The framework used in this work and
described in the next section also caches generated binaries. If the same set of parameters is encountered, the previously generated kernel can be loaded quickly (with
speed similar to loading a dynamically linked shared object). Additionally, parameters that change often between kernel invocations can be left as run-time evaluated.
Kernel specialization is more relevant for parameters that control algorithmic behavior and control flow than those that can be considered data, such as scaling factors.
4.4
Implementation and Tools
Kernel specialization is a general technique, but a specific approach is used in this

dissertation: the CUDA C compiler driver, nvcc, is invoked at run time to convert
CUDA C language source to GPU binaries. This approach leverages existing com-
44
piler tools to generate highly-optimized binaries for hardware-friendly values while

preserving adaptability to arbitrary values. Working with the high-level CUDA C
kernels also has the benefit of improving the kernel development experience.
This approach to using NVIDIA-provided tools is also currently required, as the
CUDA API supports compiling kernels at run time from PTX input, but not CUDA
C. The undefined constants in kernel code are provided at compile time to the nvcc
compiler by using the macro definition flag, -D. As discussed, many important optimizations are applied during the translation from CUDA C to PTX. Relying on the
NVIDIA provided front-end for this stage leverages vendor-implemented code generation and optimization techniques, minimizes effort spent on reimplementing existing
technologies and keeps application kernel development in the CUDA C language.
For OpenCL, which includes an API call for C-source level kernel compilation, or
a future CUDA API that includes high-level source compilation, an alternative would
be a source-to-source translation where identifiers are replaced with the constant values before compilation. Instead of providing the constant values on the command
line, the source itself would be directly customized. Another option would be generating CUDA C or PTX code directly like some other existing frameworks have, as
discussed in Chapter 3. Version 4.1 and above of the CUDA toolchain has transitioned the CUDA compiler to LLVM, opening up the possibility for more complex
code generation for CUDA C as well as other high-level languages [39].
To automate the process of kernel compilation within the CUDA environment
45
and assist with application prototyping, testing, and parameterization, a pair of

tools were developed. The first is a C-language framework, referred to as the the
GPU Prototyping Framework, that automates kernel specialization and was used to
build the applications considered here. The second is a MATLAB-based tool for
verification against reference code, parameterization, and performance analysis.
4.4.1
GPU-PF
While the macro definition mechanism for kernel specialization is relatively straightforward, as is most CUDA driver-level API code, it is often verbose and error prone.
In response to this, the GPU Prototyping Framework (GPU-PF) was developed
with many host-code abstractions, including kernel specialization automation. The
framework is designed for rapidly constructing applications with streaming processing pipelines. It provides a problem and implementation parameter-focused set of
objects where resources and actions are defined in terms of parameters. Once an
application has been specified, only the values of various parameters need to be adjusted. The framework handles propagating the effects of the parameter changes
through the application. The current set of supported parameter types are listed in
Table 4.1.
Parameters are used to control the properties and behavior of both resources
and actions. Resources include data locations (memory or files) and module-based
resources like CUDA kernels. Table 4.2 lists the currently supported resource types.
The single memory reference type may refer to any memory type except for textures,

Parameter Type
Memory Extent
Memory Subset
Schedule
Integer
Float
Array Traits
Pointer
Triplet
Pair
Type
Boolean
Step
46
Description
Geometry (up to three dimensions) and element
size of a memory reference.
Subrange of a memory extent with associated
stride between updates.
Period between events and delay before first occurrence.
Scalar integer parameter.
Scalar floating point parameter.
Various properties used for CUDA texture and array memory types.
A pointer value.
Three integer values. General, but commonly
used for grid and block dimensions. Individual
elements can be referenced.
Two integer values. Individual elements can be
referenced.
Data type (int32, uint8, float4, double, etc.)
True or false boolean parameter.
Self-updating parameter that iterates through a
specified range with an associated stride.
Table 4.1: Various parameter types provided by GPU-PF.

due to the decoupling of the actual memory allocation in host code as global or CUDA
array memory from the texture reference inside a CUDA module. The supported
memory types are detailed in Table 4.3. They include a memory subview that can
be used any place a full allocation reference can be used. The subviews can update
themselves on each pipeline iteration to refer to a different subset of the full memory
allocation.
Table 4.4 lists supported actions. The number of actual GPU-PF functions representing actions is quite low, as the single memory reference resource type can
represent any type of host or GPU memory. The GPU-PF framework automatically
determines the correct actions based on the underlying memory type.
The three classes of concepts (parameter, resource, and action) form a natural
Resource Type
Module
Description
CUDA Module
Kernel
Memory reference
CUDA kernel
A generic reference to any type
of memory except texture
CUDA texture reference
Texture
47
Dependencies
Optional:
boolean, integer,
pointer, pitch, and floating point
parameters
Module
Various: see Table 4.3.
Module, memory reference, and
array traits (not required for
binding to CUDA arrays).
Table 4.2: Various resource types provided by GPU-PF.
Memory
Type
Constant
Description
Dependencies
Constant memory from a module
Array
Global
Host
CUDA Array
Pitched or linear global memory.
Malloced, CUDA pinned, memory
mapped, or user-provided host memory.
Generic reference to a subset of any
memory type. Can move subset through
the full memory reference over time.
Can be used any place a regular reference can be used.
Module resource and data type parameter.

Extent and array traits parameters.
Extent parameter.
Extent parameter.
Subset
Memory reference, memory subset,

schedule, integer parameter (reset period).
Table 4.3: Various memory types provided by GPU-PF.
Action
Memory copy
Kernel Execution
Description
Single function transfers data properly
according to underlying memory types
at each end point.
Kernel launch arguments and configuration.
User function
File I/O
Arbitrary user function.

Binary data input or output.
Dependencies
2memory reference, schedule.
Required: 2 triplet (grid and block

shape), int (dynamic shared memory),
and schedule parameters; kernel resource.
Optional: integer, pointer,
pitch, and floating point parameters as
arguments; texture reference.
Schedule parameter.
Memory reference, schedule.
Table 4.4: Various actions provided by GPU-PF.

Dimensions
DataType
Host
Memory
Global
Memory
48
Integer
Module
Kernel
Transfer
Execution
Transfer
Figure 4.1: Different template tile regions

hierarchy of dependencies. A resource may depend on one or more parameters, and an
action may depend on one or more parameters and/or resources. Figure 4.1 depicts
an example of the dependencies that form between the three concept types. The
GPU-resident global memory object depends on two parameter objects that specify
the memorys data type and dimensions. These parameters determine the size of the
memory allocation. The kernel execution, which takes the global memory allocation
as a source or destination for data, depends on the memory object to provide a
pointer value. While not directly dependent, the kernel execution can be affected by
the parameters associated with the memory allocation.
The GPU-PF library breaks a programs lifetime into three phases: specification,
refresh, and execution. First, a pipeline is constructed (specified) by instantiating
parameters, resources like memory allocations and kernel modules, and actions like
data transfer and kernel execution. At this point, nothing is allocated and no actions
are performed, other than error checking.
The refresh stage occurs before the first pipeline execution and before the next
49
pipeline execution whenever parameters are updated. A separate refresh phase allows for comprehensive error checking and convenient abstractions without sacrificing performance during repeated execution. Only the subset of application resources
affected by parameter changes are updated for a given refresh, and all resource allocation takes place during the refresh phase instead of during processing. Resource
allocation includes generation of CUDA module-based resources, allocating memory,
and preparing many CUDA API arguments. The framework automatically manages
constructing the nvcc command line, building and loading the module, and extracting
any kernel, texture, or constant memory references from the kernel. Compiled GPU
binaries are cached so that the next time the same kernel configuration is required,
it does not not have to be recompiled.
During the execution phase, events are triggered based on an associated schedule
parameter that specifies a period and delay in terms of pipeline iterations. The
period delay values associated with a GPU-PF schedule parameter allow for more
complex program behaviors than those that would otherwise be possible without
a real application task graph. The C-language API provided by GPU-PF returns
opaque handles, and cleanup of all resources and memory is handled automatically
at program termination.
To aid debugging applications, the GPU-PF framework can provide a detailed
log of the actions performed and parameters used. Listings G.1, G.2, and G.3 in
Appendix G provide excerpts of the output for the template matching application
50
discussed in Section 5.1. The framework also optionally times all events using either
CUDA GPU events or host timers, depending on the operation type. This timing
information can be reported as application totals or per-operation. Listings G.4
through G.7 provide examples of both kinds of output. The GPU and application
timing results reported here use the timing facilities provided by the framework.
4.4.2
Validation and Parameterization
With more highly adaptable kernel implementations that respond to a number of

different parameters, testing, validation, and parameterization of kernels can become
difficult. In many cases, crossing orthogonal sets of problem and implementation parameters results in a large parameter space. It must be ensured that the implementation works across the parameter range. At the same time, the ability to effectively
select high-performance implementation parameters based on the current problem
and hardware combination helps to make kernels applicable to more scenarios.
To that end, a brute force benchmarking and validation facility was constructed,
providing services that are complimentary to the GPU-PF. Built in MATLAB, the
framework tests all configurations within sets of predetermined parameter ranges.
Problem parameter range sets may be designed to only vary one parameter type so
that the performance effects of a given parameter can be isolated. For example, in a
template matching application the search area for the template may be held constant
while the template dimensions are varied. Similarly, for each GPU kernel variant, a
set of implementation parameter ranges can be used to explore the contours of the
51
GPU kernels performance space. These ranges may include valid thread counts and
the number of registers assigned for register blocking, for example.
For each set of parameters, GPU performance is measured, register usage is obtained, and the output is optionally compared to a reference. The applications considered in this paper all relied on MATLAB implementations as references, simplifying
integration into the testing facility. Built around the idea that the CPU will be slow
relative to the GPU, CPU output is cached for later use, and all GPU variants and
parameter sets are tested before iterating to the next set of problem parameters.
Once benchmarking has been performed, several data reduction functions are
provided to help analyze the collected data. Beyond reporting information such as
any execution failures or output not matching the reference, a few functions provide
textual summations or graphical depictions of slices of the multidimensional performance data that can enable humans to quickly identify patterns in the data. GPU
implementation variants can also be compared head-to-head. While advanced autotuning techniques can be used to help determine parameters close to optimal given
the often highly non-linear relationship between various parameters and resulting
GPU performance, often simpler parameterization solutions will suffice.
To this end, the benchmarking facility allows users to experiment with various
configuration policies. A policy function provides a GPU and implementation parameter configuration based on the current problem and hardware parameters. Policies
can be applied to a collected performance data set, and for each problem instance the
52
differences between the selected and optimal configurations, in both implementation

parameter selection and performance, if any, will be reported. Policies found to be
effective can be incorporated into the GPU applications logic.
Chapter 5
Applications
To explore and demonstrate the benefits of kernel specialization, three MATLABbased applications were implemented using the prototyping framework and kernel
specialization. The applications are arbitrary and large template matching, particle
image velocimetry, and cone beam backprojection.
The first two applications did not have existing GPU implementations, and for
each of these, specialization-friendly implementations were created. The cone beam
backprojection application already provided a GPU kernel implementation, to which
some specializations were applied.
5.1
Large Template Matching
The first application considered is a lung tumor tracking application written in MATLAB [14]. The goal of this work is to track lung tumors in fluoroscopic imagery in
real time without the use of visual markers surgically implanted in the body.
CHAPTER 5. APPLICATIONS
5.1.1
54
Algorithm and Problem Space
The algorithm studied for implementation in CUDA generates templates from training data and uses Pearsons correlation to generate similarity scores between templates and the current image data. While the template generation algorithm is specific to this work, 2D Pearsons correlation is a common similarity score. It represents
the bulk of the computation in the application and is the only part of the application targeted for GPU acceleration. The template generation and template-specific
processing can be done once at application setup time and does not constitute a
significant amount of processing. For any given problem instance, template data are
considered static inputs.
Two methods are used to account for the periodic and irregular variation associated with the respiratory cycle: using multiple templates and shifting the templates
over a region of interest (ROI). These provide significant accuracy improvements at
the cost of significant increases in the processing required. For each template, a similarity score must be calculated for each possible position of the template within the
ROI, and this is repeated for each frame. Scanning the resultant scores for the best
match is currently not considered. The reference application uses the corr2() function in the MATLAB Image Processing Toolbox to implement Pearsons correlation.
The definition of corr2() is shown in Figure 5.1.
A simplified MATLAB representation of the tumor tracking computation is shown
in Listing 5.1. As written, the function would be called once for each video frame,
55
P P
corr2(A, B) = r
P P
M
AM N

AM N A BM N B
2
2 P P
B
A
MN
M
N
are the matrix averages of A and B

Figure 5.1: A and B
which is a matrix of frame data, and it iterates through each template. To implement
the 2D sliding window behavior, a pair of nested for loops selects the current ROI
from the incoming frame. The corr2() function is called for each possible sliding
window location within the frame, assuming that only ROI data is passed in for the
frame input and no apron is required.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
function s c o r e s = p r o c e s s F r a m e C o r r 2 ( frame , temp , ROIc , s h i f t H , s h i f t V )

% S l i d i n g window 2D c o r r e l a t i o n f o r a s e t o f f i x e d t e m p l a t e s
% ROIc d e f i n e s t h e c e n t e r o f t h e s e a r c h a r e a : [ row column ]
% s h i f t H d e f i n e s t h e +/ s e a r c h i n t h e h o r i z o n t a l d i r e c t i o n
% s h i f t V d e f i n e s t h e +/ s e a r c h i n t h e v e r t i c a l d i r e c t i o n
vRange = ( 1 : s i z e ( temp , 1 ) )f l o o r ( ( s i z e ( temp , 1 ) +1) / 2 ) ;
hRange = ( 1 : s i z e ( temp , 2 ) )f l o o r ( ( s i z e ( temp , 2 ) +1) / 2 ) ;
s c o r e s = zeros ( 2 s h i f t +1 , c l a s s ( frame ) ) ;
f o r n=1: s i z e ( temp , 3 )
t e m p l a t e=temp ( : , : , n ) ;
f o r l r = s h i f t H : s h i f t H
f o r ud= s h i f t V : s h i f t V
s c o r e s ( ud+s h i f t V +1 , l r+s h i f t H +1 ,n ) = c o r r 2 ( t e m p l a t e ,
frame ( ROIc ( 1 )+ud , ROIc ( 2 )+l r ) ) ;
end
end
end
...
Listing 5.1: MATLAB function implementing the sliding window correlation for each
template for each ROI within the current frame.
Table 5.1 shows characteristics of the reference data sets studied [14]. They
display significant variation, and none of the values are powers of two. The per
frame computational requirements vary widely between the patients and are affected
by the template size and the number of times corr2() is called. The template size
Patient
P1
P2
P3
P4
P5
P6
Frames
442
348
259
290
273
210
Templates
12
13
10
11
12
14
56
Template Size
53 54
23 21
76 45
156 116
86 78
141 107
Shift V/H
18/9
11/5
9/4
9/3
11/6
9/2
corr2()
Per Frame
8 436
3 289
1 710
1 463
3 588
1 330
Calls
Total
3 728 712
1 144 572
442 890
424 270
979 524
279 300
Table 5.1: Per patient, the number of image frames, template number and size,
vertical/horizontal shift within ROI, and number of corr2() calls.
increases the iteration counts associated with the double summations used in both the
numerator and the denominator of the corr2() function. The per-frame invocation
count of the corr2() function is dependent on both the number of templates as well
as the shift values, which affect the loop counts in the for loops of Listing 5.1.
In this application, the static nature of the templates allows us to eliminate
significant redundant computation. Calling the corr2() function multiple times with
the same template results in recalculating the template matrix averages for use in
the denominator and numerator. Similarly, the total template contribution to the
denominator is, for a given patient run, constant across correlations and can be
computed once and reused. However, the corresponding frame data values must be
computed for each position within the ROI and for each incoming frame. As the
template-sized sliding window traverses the ROI, the underlying frame data changes,
changing the frames contribution to both the average of the frame data and the
frames contribution to the denominator. All implementations of the sliding window
corr2() implementation, discussed later in detail, take advantage of this redundancy
to improve performance. Assuming that the template data is represented by A and
57

P P
C
A
B
B
M
N
corr2(A, B) = r M N M N
2
P P
B
AD
MN
M
N
is the matrix average of B, AC is the current template with the
Figure 5.2: B
template average subtracted from each value, and AD is the template contribution
to the denominator.
the frame data is represented by B, the actual implemented functionality is reduced
to the equation shown in Figure 5.2.
Several implementations of the computational core of the application have been
created: (1) using CUDA without kernel specialization, (2) using CUDA with kernel specialization, (3) using C with POSIX Threads-based multithreading, and (4)
using MATLAB. All implementations take advantage of the redundant computation
described above.
5.1.2
CUDA Implementation Challenges
This particular template matching application is an interesting case study since the
data sets are created by humans and not carefully selected to match GPU architectures. Both the template size and shift area vary significantly and make it difficult
to assume general relationships between the parameters, as would be the case when
building a general purpose library of GPU kernels.
For each patient a single ROI is defined, but it is of relatively small size (between
95 and 703 total template shift positions) compared to typical linear image filtering.
This introduces concerns about generating enough parallelism, especially when each
58
frame is processed individually in a streaming manner.

The individual computation for each window position is also relatively complex,
even after the simplifications made by reducing redundant computation, as described
above. The subtraction of the average of the data under the current window position makes an effective frequency domain implementation difficult, and it cannot be
assumed that the template values represent separable filters.
Finally, the data sizes in the tumor tracking application make it more difficult to
leverage constant and shared memories. Most template-based GPU kernels assume
the image is too large for shared memory, but assume a small and square template size
that easily fits into shared or constant memory. As seen in Table 5.1, the template
sizes under consideration are as large as 156 by 116 pixels, which prevents the storage
of a single complete single precision floating-point template in constant or shared
memory. Likewise, the typical approach of tiling the ROI and loading a subset of the
image data into shared memory will not work as the template sizes are too large for
storage in shared memory, let alone any additional space for a corresponding ROI.
These differences put this tracking application in a different part of the sliding
window parameter space than is usually considered for GPGPU acceleration.
5.1.3
CUDA Implementation
The GPU version was implemented using six CUDA kernels over four stages that
correspond to various components of the corr2() calculation: (1) frame data averages,
(2) frame data contribution to the denominator for each shift offset within the ROI,
59
XX
M
AC
M N BM N B
is the
Figure 5.3: The actual functionality implemented by the numerator stage. B
matrix average of B, and AC is the template with its average value subtracted.
(3) numerator, and (4) final result. The first three stages consist of two distinct
kernel types that are of nearly identical structure: a tiled kernel that computes
partial results and a reduction kernel that combines the partial results into a final
product for that stage. Since these first three stages are similar, the numerator stage
is first presented alone. Then, the differences between the numerator stage and the
two frame data statistics stages are explained, followed by the final stage.
5.1.3.1
Numerator Stage
The numerator, shown in Figure 5.3, is similar to non-separable convolution except

for the subtraction of the frame data average from each value.
To address the data set working size and parallelism generation problems discussed in Section 5.1.2, we take advantage of the fact that the computation is formed
from two nested summations, which are associative and can be computed in parallel.
The template is broken down into subregions, or tiles, as shown in Figure 5.4, with
each tiles contribution to the final summation computed independently. By selecting
tile sizes that are amenable to the size restrictions of shared memory, it becomes possible to take advantage of one of the most important techniques for improving CUDA
kernel performance. A reduction sum is then required to combine the contribution
of each tile into a final value.
60
Template
Main Tiles
Bottom Tiles
Right
Tiles
Corner
Tiles
Figure 5.4: Different template tile regions.

The individual tiles are handled like individual sliding window operations with
independent templates, with each tile assigned to a thread block. This allows for the
processing of arbitrary template sizes by growing the grid dimensions. This scales
up to 65,536 tiles in each direction per kernel launch; additional kernel launches can
handle remaining tiles, if necessary.
While the thread blocks of a grid are used to handle the different template tiles,
the various positions of the template within the frames region of interest are mapped
to threads within each thread block. This works well for most of the test cases in our
data sets, where the total number of window positions per template per frame are 95,
133, 171, 253, 299, and 703. However, the general sliding-window problem, including
the patient with the largest search domain, requires searches with more than 512
locations the maximum number of threads allowed per block on NVIDIA GPUs with
a CUDA compute capability that is less than 2.0. The maximum number of threads
per kernel launch is an implementation parameter, and the current implementation
uses additional kernel launches with shift offsets to handle large shift areas.
61
Horizontal Shift (strided)
0xXXXXXX00
Vertical Shift
(sequential)
Figure 5.5: Graphical data layout of the shift area for a single tile. Regardless of dimensions, all tiles are applied to the same shift area. A single block of the summation
kernel may only combine a subset, shown in gray, of the template locations.
Each block of the first kernel of the numerator stage writes the output for each
tile over the tiles search space contiguously in memory, with each tiles contribution
starting at the beginning of a pitched memory segment, as shown in Figure 5.5. Each
grid square represents the partial result contributed by a single template tile sliding
over the shift area. The shift area is stored in a column-major organization, with the
gray squares representing the shift area covered by a single kernel launch.
This data layout produces a regular data pattern with the partial contributions
for consecutive tiles separated by a constant stride. Additionally, since the data is
grouped by shift area, the data layout is constant regardless of tile shape, simplifying
the inclusion of the edge case tile contributions. As a result, the second kernel of
the numerator stage, which performs a reduction sum over the contributions of each
tile, is relatively simple: each thread is assigned to a unique shift offset. Since the
62
Intra-thread
Access
Consectuve
Threads/Data
Tiles
Figure 5.6: Each thread accumulates the tile contributions for a singe shift offset, an
example of which is shown in gray.
reduction for each shift offset is independent, no coordination between threads is
required. Each thread simply accumulates the contributions independently, with
consecutive threads accumulating the contribution of consecutive shift offsets. The
data accesses are fully coalesced.
5.1.3.2
Variable Tile Sizes and Kernel Specialization
For the implementation to adapt to arbitrary template sizes, as is relevant for the
tumor tracking application, it must be able to handle template sizes that are not
integer multiples of any tile size, let alone a tile size that executes efficiently on the
GPU. Data padding is made difficult by the definition of the corr2() function, as
the average of the data under the window is different at each window position. This
scenario, shown in Figure 5.4, results in leftover template pixels not covered by the
Main Tiles
Size
Count
88
19 4
16 10 9 11
44
39 29
Right Tiles
Size
Count
84
19
16 6
9
63
Bottom Tiles
Size
Count
48
14
12 16
11
Corner Tile
Size
44
12 6
Table 5.2: Template tiling examples for the template size associated with Patient 4
(156 116 pixels).
regular set of template tiles. The main tile size chosen will affect the number and
shape of any edge tiles that are present. Table 5.2 shows the dimensions and number
of each type of tile for the 156116 pixel template size of Patient 4 for three different
main tile sizes. The included tile sizes are examples of how tile sizes affect the total
number of tiles and the presence of irregular edge case tiles. Although 4 4 pixel
tiles eliminate edge cases for Patient 4, small tile sizes do not execute efficiently on
NVIDIA GPUs. This is discussed in Section 6.2.
Blocks that are assigned tiles around the right and bottom edges of the templates
may perform the nested summation over a differently shaped area. The core computational functionality is implemented as a
device
function with the template
dimensions, shift amounts, and all other parameters passed in as run time evaluated arguments. Each block determines the necessary parameters before invoking
the function. As a result, the nested for loops over these areas may have different
loop bounds that have to be evaluated at runtime, which will prevent loop unrolling
optimizations. The non-specialized versions of the CUDA kernels incur significant
performance impacts as a result.
However, through kernel specialization, we are able to maintain adaptability and
64
handle the edge cases while preserving the benefits of compile-time optimization over
fixed values. This application is a natural fit for compiling highly-problem specific
specialized kernels. While parameters can vary widely between patients, they are
fixed for each patient after an initial setup stage. For the kernel specialized implementation, the core computational functionality is contained within a
device
template function with nearly all parameters, including the tile size, converted to template arguments that are determined at compile time. Up to four separate explicit
specializations of the processing code are instantiated within a wrapper
global
function, one for each needed tile size. The appropriate version is selected by each
thread block, resulting in different blocks executing different kernel code.
Here, kernel specialization allows the compiler to unroll loops for both the main
and edge-case tile sizes. In addition to loop unrolling, strength reduction and compiledin constants are used to calculate a number of offsets and strides, eliminating repeated
runtime evaluation of non-compute code by each thread or block.
The second reduction kernel also has non-specialized and specialized variants.
Each thread loops over the tiles accumulating each partial result for a given shift
offset. The total number of tiles, including edge cases, is provided as a compile time
parameter so the loop may be unrolled in the specialized kernel.
5.1.3.3
Other Stages
The remaining stages share a number of similarities with the numerator stage. While
the working set requirements are not as high for the frame data average and denom-
65
inator calculations as the numerator kernel since no template data is needed, the
non-tiled working set for most of the patients would still exceed the available shared
memory. As a result, the frame data averages and contribution to the denominator
are each implemented as a two kernel solution similar to that of the numerator. The
frame denominator and numerator share the same reduction sum kernel, but the
frame averages reduction relies on a different kernel that also performs a normalization to produce average values.
The final kernel from the current corr2() implementation computes the fraction
and square root operation from the denominator. The reciprocal square root function
is used on the denominator data to perform the square root and convert the division
implied by the fraction into a multiplication and combine two expensive operations
into one, leaving only an additional multiplication to produce the final value for each
shift position. As is the case with the numerator stage, the final kernel is implemented
assuming the template data is precomputed and static. Since the sliding window
shift operation is not applied to the template data, the template portion of the
denominator exists as a single scalar value for each template.
5.1.3.4
Runtime Operation
The host application responsible for coordinating I/O, data transfers, and kernel execution was built using GPU-PF. Each frame is streamed onto the GPU and processed
independently during execution. Currently, only the necessary ROI data from each
frame is pushed to the GPU, which limits the size of the data transfers. To reduce
66
complexity, the numerator kernel processes a single template at a time, requiring

multiple calls to process each frame. (This may be in addition to multiple calls to
handle large shift areas.) The kernels read both the original frame data and template
data through textures bound to pitched global memory, but reads and writes to data
generated on the GPU are direct coalesced accesses to global memory. In the case
of the frame data, new frame data transferred from the host overwrites old frame
data, allowing reuse of the same frame data texture binding between frames. The
template data is static for the duration of the processing of the current patient and
is only pushed once; each template is placed into a separate pitched global memory
allocation. GPU-PF is used to rebind the texture reference between the processing
of each template to iterate among the templates.
5.1.4
CPU Implementations
In addition to the CUDA implementations, two CPU-based versions were created.

The first is a MATLAB-based implementation that optimizes the reference MATLAB
application through eliminating the redundant computation described in Section 5.1.1
and through general MATLAB optimizations.
The second CPU implementation was written in C and and uses POSIX Threads
(Pthreads) for multithreading. Since the number of cores on most current CPUs is
in the low single digits and the memory hierarchy on multicore CPUs is very different than the memory hierarchy provided by NVIDIA, the parallelism strategy used
was significantly different than the strategy used for CUDA. Figure 5.7 provides an
67
for all local shift offsets do

for all window pixels do
Accumulate frame data average
end for
for all window pixels do
Calculate frame data and average differences
Accumulate frame data contribution to denominator
end for
for all templates do
for all current template pixels do
Accumulate numerator
end for
Calculate final fraction
Store result
end for
end for
Figure 5.7: A pseudocode representation of the computation performed by each CPU

thread.
algorithmic description of the computation performed by each CPU thread. Instead
of tiling the template, the search space is partitioned among the thread pool in contiguous sections. Each thread computes the similarity score for the full set of full-size
templates. For a given shift offset, the computation largely follows the same steps
as the CUDA implementation: frame data averages, frame data contribution to the
denominator, numerator, and final fraction. An optimization not utilized in the GPU
version is saving a modified version of the current frame data that has the current
frame data average subtracted out. This reduces the total number of floating-point
operations. This is not utilized on the GPU as it generates extra memory traffic.
5.2
Particle Image Velocimetry
Kernel specialization was also applied to improve the performance of a particle image
velocimetry (PIV) application. PIV is an optical flow technique that attempts to
68
Figure 5.8: A graphical depiction of the physical PIV configuraiton.

determine the direction and magnitude of fluid flows. The PIV setup considered
here processes video streams of particles seeded into a moving fluid, as shown in
Figure 5.8. The particles are illuminated by laser pulses, making them more easily
distinguishable from the fluid. Tracking the individual particles between frames can
be used to determine characteristics of the fluid flow within the field of view. The
PIV work presented here builds on previous work, where the computation associated
with PIV was accelerated with a field programmable gate array (FPGA) [6, 4, 5].
To compare a pair of video frames, a patch, or mask, is extract from one image
and compared, relative to the same position, to several nearby locations in the second
image. The location offset in the second image that produces the greatest similarity
is selected to generate a local velocity estimation for the location of the patch. Doing
the same computation many times over the image pair will produce a velocity field
estimate.
In order to produce reliable estimates, it is required that flow is more or less locally
uniform, requiring the difference in time between the frames to be small relative to
69
the velocity of the fluid. The details of the constraints depend on a number of physical
factors that change between physical setups, but in general this results in a moderate
search area for each mask. This requirement makes the scale of the search similar
to the template matching application, but the size of the template is not expected
to be nearly as large. In addition, many individual template matching operations
are performed, and the mask data, analogous to the template data in the template
matching application, is location dependent. Each individual velocity estimate in the
field will use a unique patch.
The mask associated with any velocity estimate is referred to as a sub-area, as
shown in Figure 5.9. The space of mask offsets searched is defined by the dimensions
of the ROI, called the interrogation window, which is rectangular. The search space
is just the difference in dimensions between the sub-area and interrogation window
and is always dense, meaning that a similarity score is generated for every possible
offset of the sub-area within the interrogation window. As an example, consider an
interrogation window of forty pixels square and a sub-area of thirty-two pixels square.
In this case, with the sub-area at the center of the interrogation window, the sub-area
can move up to four pixels in each direction up, right, down, and left, producing a
nine-by-nine square offset space.
In the FPGA implementation, the distribution of local searches throughout an
image pair for which velocity estimates need to be generated was assumed to be a
regular grid and was defined in terms of interrogation window parameters. Start-
(0,0)
70
Image
x
(x0,y0)
Interrogation Window
(u0,v0)
Sub-area
Figure 5.9: A graphical depiction of the terminology originally used for the FPGA
PIV implementation. This image is from Benniss dissertation [6].
ing from the top left corner, the distance between adjacent velocity estimates was
controlled by a parameter called overlap. Overlap, specified in pixels for each dimension, is the number of pixels that adjacent interrogation windows overlap. Continuing
with the forty-by-forty interrogation window example, an overlap of twenty in each
direction results in a grid of velocity estimates that are twenty pixels apart. In each
dimension, adjacent interrogation windows will share half of their pixels. For a given
interrogation window, the patch and location of the estimate are based on the center
of the interrogation window.
The FPGA implementation was designed with the goal of integration with an
existing PIV test bed used by the Robot Locomotion Group (RLG) at the Massachusetts Institute of Technlology, headed by Professor Russ Tedrake [47]. Based
on updated requirements from the RLG, the CUDA version, discussed in detail below,
uses a significantly different problem definition.
The new problem specification is more flexible than the FPGA implementation
71
and is defined in terms of mask corners and offsets. The mask corners are provided
as a list of arbitrary x- and y-coordinates representing the locations of the top left
corner of each desired mask. The offsets at which to calculate similarity scores
are also specified as a list of arbitrary x- and y-coordinates. The set of offsets are
applied to each mask. The new problem specification allows the same implementation
to perform both global and local searches under different conditions. For example,
determining a flow estimate at image-wide scales can be used to determine the overall
fluid motion. When the flow is non-turbulent and a direction can be determined
ahead of time, the search space defined by the offsets can be biased in the forward
direction. On the other hand, the same problem specification can be used to define
a PIV problem instance that includes a closely clustered set of masks and offsets
around an area of turbulent flow. This can produce better estimates in turbulent
flows.
While the new problem specification provides more flexibility, it presents a more
challenging problem for implementation with CUDA. The regular spacing of interrogation windows and dense rectangular search space create a very regular data access
pattern across interrogation windows and between consecutive mask offsets. The
regularity provides rigidly structured opportunities to minimize data traffic through
the memory hierarchy, often a key performance consideration for GPUs. Depending
on the value of the overlap parameter, a significant amount of data may be shared
between adjacent interrogation windows. Adjacent mask offsets will also share all
72
but one row or column of the data from the interrogation window. The FPGA
implementation takes advantage of these opportunities.
With the new problem specification, it is still be possible to examine the values
within the list of mask corners or offsets to determine whether it is possible to reuse
any data at the top of the memory hierarchy probably shared or register memory
within a thread block. However, the limitations of fast block-local memory, specifically the fact that different thread blocks cannot share data, may limit opportunities
for data sharing.
In addition to the specification of the mask and offset distributions, the FPGA
and CUDA versions differ in two other ways: similarity score and data type. The
FPGA application uses cross-correlation in the time domain. Based on guidance
from the RLG and literature [23], the similarity score for the GPU implementation
was switched to the sum of squared differences, shown in Equation 5.10. Timedomain processing was selected since Yu [57] et al. has shown that the time domain is
more computationally efficient for the problem instances of interest, due to padding
requirements. The FPGA implementation was highly parameterized, including by
bit-widths of the data type. The processing was changed from integer to singleprecision floating-point, as per newer requirements from the RLG.
Finally, not yet considered is the selection process for a given masks velocity
based on computed similarity scores. This is a simple reduction scan for each mask
corner. In the case of cross-correlation, a maximum value is desired, while for the
ssqd(x0 , y0 , u0 , v0 ) =
73
i=m1
X j=n1
X
i=0
(I(x0 + i, y0 + j) m(u0 + i, v0 + j))2
j=0
Figure 5.10: The per-mask offset sum of squared differences similarity score, defined
in terms of the original PIV problem specification as shown in Figure 5.9.
sum of squared differences metric, a minimum value is optimal.
5.2.1
CUDA Implementation
The high-level design of the PIV CUDA implementation was based on several observations and guidance from the RLG. The target range for the number of mask
corners is between 200 and 2000 and up to 2000 offsets. At the same time, mask
dimensions will be between twenty-five and sixty-four pixels and always square. Note
that the computation associated with each mask is entirely independent, and while
significant, the target parameter ranges reduce the computation associated with a
single mask to a relatively small portion of the total computation.
Based on this, a mapping of one mask per block was chosen. At the low end of
the range, this produces more than enough blocks to fully utilize current NVIDIA
hardware, while at the same time generates significant potential per-thread block
parallelism except at extremely low counts of shift offsets and mask sizes. As with the
template matching application, each iteration of the summation loops is independent.
Another benefit of this mapping scheme is that the mask data becomes static for each
block. It should be possible to load the mask data to block-local memory only once.
A possible downside of this approach is discarding the ability to optimize for data
74
accessed by multiple masks, as described above. However, the scale of the data reuse
is highly problem instance dependent and based on the arbitrary distribution of both
the distribution of masks and corners. Going without data reuse simplifies the GPU
implementation. Reading interrogation window data through texture memory and
the L1-style caching on Fermi-architecture and later GPUs can potentially mitigate
the performance impact.
Another observation about the PIV problem as posed by the RLG is that the
reduction operation is per-mask and therefore moderate in scale no more than
2000 values. Performing this reduction at the block level is efficient, removes the need
to incur overheads associated with a second reduction kernel launch, and decreases
global memory traffic by reducing the output of the kernel to a single value per block.
Beyond a per-block reduction to find the optimal similarity score, the accumulation associated with computing the similarity score at each offset must be mapped
to a blocks threads. Based on the increasing importance of using register file over
shared memory, as discussed in Section 2.3, and that for each block the mask data is
static, register blocking the mask data was chosen, as shown in Figure 5.11. Consecutive threads load consecutive column-major mask data values into registers. While
this generates a stride for intra-thread accesses, it produces coalesced inter-thread
data access patterns for both the initial mask data load and reading ROI data from
the other video frame.
This block-level mapping decision breaks apart the mathematically associative
75
Thread
Set
Sample Mask
0
1
2
3
Figure 5.11: Example of a set of threads striped across a masks area.
accumulation of squared differences across the threads and keeps the threads in lockstep over the offset space. This inverts the thread mapping chosen for the large
template matching application, where threads operated in lock-step over the same
template tile data but for different shift offsets.
The PIV thread mapping decision provides sufficient per-thread work for each
offset and generates ILP. At the low end of the expected template size range of
twenty-five pixels per side, this still maps more than four pixels of mask data to each
thread for a thread block of 128 threads. This results in parallelism availability that
is less sensitive to the size of the offset space. Even down to a sixteen pixel mask,
128 threads-per-block results in two pixels per thread or four for a 64 thread block.
Given a value for the number of pixels of mask data assigned to each thread and
a number of threads per block, a thread block will cover a certain natural area.
For example, 128 threads per block at four pixels per thread results in a natural area
of 512 pixels. For larger masks, the mask is tiled and the block loops over each tile.
This tile looping occurs outside of the loop traversing the offset space. This allows
each mask tile to be accessed only once for the entire offset space.
However, looping over the offset space inside the mask tiling requires tracking
76
and accumulating mask tile contributions for each offset. A shared memory area
with the same number of elements as elements in the offset space is reserved for
this purpose. However, before a given mask tiles contribution for a given offset can
be accumulated, the portion calculated by each thread must be reduced to a single
value. Block-level reductions, as discussed in Section 2.2, represent a potential point
of inefficiency as fewer and fewer threads participate in each round of the reduction.
The idle threads must wait for the reduction to complete before continuing on to the
next mask offset. In the case of the PIV kernel, a reduction must be performed for
each offset for each mask tile, potentially incurring this performance impact several
thousand times during the lifetime of the kernel.
As a way to address this, the PIV kernel applies warp specialization, as described
in Section 2.3. Several different styles of warp specialization were applied to the PIV
kernel, with the three main tasks, loading data, accumulating squared differences,
and reducing accumulated data to a single value, distributed in different ways among
two- and three-stage pipelines. Each stage of the pipeline was assigned to a varying
number of warps. However, with register blocking and loop unrolling, the compute
threads in the PIV kernel produce significant instruction- and memory-level parallelism, limiting the benefit of double-buffering data loading through shared memory.
The best performance was observed with a two-stage double-buffered approach, with
one set of warps doing both the loading of new ROI data and computing the sum
square differences assigned to each thread, as shown in Figure 5.12. In this group of
77
Compute
Warps
Reduction
Warp
A
Time
Barrier
C
N-1
N+31
Figure 5.12: A depiction of the warp specialization used in the PIV kernel to remove
the reduction as a bottleneck.
warps, data is read directly by the threads consuming the values without a buffering
stage in shared memory. Once a mask tile at a given offset has been processed, each
thread writes its contribution for the current shift offset into shared memory.
The second warp group, consisting of a single warp, performs the reduction. Combined with double buffering, this allows the first group to immediately begin processing the next shift offset. Using a single warp for the reduction prevents the need for
any synchronization related to the reduction. When the number of threads producing intermediate results is greater than sixty-four, the warp accumulates extra data
within each thread until a standard reduction tree can be performed. A maximum
of five levels of reduction are needed.
With warp specialization, assigning multiple warps to the reduction task requires
the reduction warps to perform a sub-block synchronization. Sub-block synchronization is not well supported in the CUDA environment, requiring inline PTX, similar
to inline assembly in C/C++. It also results in more difficult debugging, but was
78
nonetheless successfully implemented. With relatively small reductions over an array whose element count matches the number of threads per block, the multi-warp
sub-block synchronization approach was observed to execute more slowly than the
single-warp synchronization-free alternative. With typcial block sizes, the transition
to the single warp synchronization-free stage occurs quickly, resulting in idle warps
when more than one warp participates in the reduction.
The single warp performing the reduction only accesses shared memory, and does
not compete with the other loading and computing warps, as input data for the
loading and computing warps comes from registers and global or texture memory.
The last remaining thread in the reduction tree accumulates the value into the correct
offset element in shared memory.
Once all mask tiles have been accumulated for all offsets, the final reduction scan
for the minimum value is performed. This single output is the index of the offset
coordinates that produces the optimal value.
5.2.2
For PIV, kernel specialization is used in a number of ways to improve the performance of the run-time evaluated equivalent kernel. Key benefits offered by kernel
specialization for the PIV kernel is a combination of loop unrolling and enabling dynamic register blocking. NVIDIA GPU registers cannot be dynamically addressed by
the thread that owns them. Any loops operating over register-resident data must be
static at compile time so the specific registers can be encoded into the GPU binary.
79
Here, kernel specialization converts the number of registers used for register blocking
into a dynamic variable. This allows the kernel to adjust its register usage based
on the number of registers available on a target GPU and the current problem. In
addition, kernel specialization allows many other offsets and strides to be statically
compiled into the PIV kernel. These include thread identification thresholds used to
control inter-warp divergence and how many threads are in each specialized group.
The specialized kernel was derived much like the large template matching kernel,
described in Section 5.1.3.2. The run time kernel was developed first, and then the
specialized version was created by replacing kernel arguments and CUDA built-in
references with new macro names provided at compile time. Little other optimization
was applied to the kernel, including forgoing an opportunities to remove run-time
guards for cases where the natural thread block area is an integer multiple of the
mask area and to eliminate tile looping when the natural thread block area exactly
matches the full mask area.
The kernel that evaluates parameters at run time is not, for the purposes of
comparison, fully run-time evaluated. As mentioned, register blocking values must
be fixed once ahead of time. Combined with a desire to determine the optimal value
for register blocking in the run-time evaluated kernel variant, as well as to study the
impact of incorrectly choosing this value, the register blocking level is specialized for
each value, even for the run-time evaluated kernel.
80
Axis of
Rotation
X-ray Source
Detector
Object
Cone Beam
Orbit of Source
and Detector
Figure 5.13: A graphical depiction of cone beam scanning setup.
5.3
Cone Beam Backprojection
The third application studied for kernel specialization is a CUDA cone beam backprojection implementation. In this case, backprojection is used to reconstruct threedimensional models of objects scanned using a series of two-dimensional imaging
projections. The projections determine the density of various parts of the interior
of an object based on the intensity of X-ray beams that pass through the object to
reach a detector on the other side.
In the case of X-ray cone beam computed tomography (CBCT), considered here
and shown in Figure 5.13, the X-ray beam is conical in shape, as opposed to the fan
beam used in standard CT imaging, where the X-ray beam is assumed to be two
dimensional. In a fan beam setup, a single row of detectors is used, while the cone
beam uses a two-dimensional arrays of detectors for each projection collected. A
greater amount of data is collected for each projection, enabling a higher-resolution
81
reconstruction of the interior of the object.

The collected projections are from different angles around the object of interest
but along a single axis of rotation. Feldkamp et al. [19] developed the standard
algorithm used for reconstructing a three-dimensional model for both the fan and
cone beam cases. Geometric data, including position and orientation information,
is used to project each pixel from each two-dimensional projection into an output
volume.
The CUDA implementation considered here was originally based on a MATLAB
version of the Feldkamp algorithm contained in the Image Reconstruction Toolbox
from Jeff Fessler at the University of Michigan [20]. The MATLAB implementation
uses bilinear interpolation to discretize the projection pixels into the output voxels.
James Brock and Saoni Mukherjee, from the Reconfigurable Computing Laboratory
at Northeastern [46], ported the MATLAB implementation into: (1) an OpenMP
version, (2) an OpenCL implementation, and (3) a CUDA version.
The backprojection calculations are highly parallel, making the GPU an attractive
processor target. While the full application includes weighting and filtering steps,
the backprojection kernel dominates the GPU run time and is the focus here. For
the GPU kernels, threads are mapped across voxels in the output reconstruction.
Each thread then iterates over the projection data, accumulating the contribution
of each projection, using bilinear interpolation, into the output voxel. Threads loop
over output voxels, as needed, until the entire output volume has been covered.
5.3.1
82
The cone beam backprojection kernel provides an interesting case study for kernel
specialization. Like the OpenCV examples, the kernel was not developed with kernel
specialization in mind. The inner computation loop is too complex to be unrolled by
the GPU compiler, and while many scalar intermediate parameters are calculated,
most are data dependent and cannot be optimized to static values. Wherever possible, constant parameters are propagated and kernel-wide data-dependent optimizations are utilized. An example of the latter are scalar problem-specified parameters
that determine control flow decisions within the kernel. Based on the value of these
parameters for the current problem, only one code path is compiled into the kernel.
5.4
Summary
In this chapter, the three applications to which kernel specialization has been applied
were introduced. In the case of the first two applications, large template matching
and particle image velocimetry, the CUDA implementations were created as a part of
this research kernel specialization was included in their design. The third application,
cone beam back projection is included as an example of applying kernel specialization
to an existing GPU application. The next chapter covers a number of experiments
performed with these applications on two GPUs and includes information about the
specific tests as well as performance results.
Chapter 6
Experiments and Results
This chapter covers a number of experiments performed with the applications described in the last chapter. The details of the specific parameter ranges tested are
provided, as well as details of the systems used to perform benchmarking. Then,
results are provided, followed by a discussion and observations.
6.1
Experimental Setup
To explore the performance impacts related to kernel specialization, the run-time

evaluated (non-specialized) and specialized GPU applications were each run on a
number of different data sets. For each kernel and problem set, a wide range of
implementation parameters were tested. GPU-PF was used for these experiments.
The non-specialized variants were compiled once and used across both the problem
and implementation parameter sets, with the exception of the PIV kernel, where the
number of registers assigned to register blocking must be fixed at compilation time.
In general, these variants cannot take advantage of many important optimizations
compilers can apply to GPU kernels, but a single compiled instance is adaptable to
CHAPTER 6. EXPERIMENTS AND RESULTS
84
the full range of problems studied.

In contrast, the specialized kernels are compiled for each unique problem instance
and implementation parameter set. For the experiments in this dissertation, the specialized kernels were fully-specialized. That is, every possible parameter was provided
with a fixed value derived from either the problem instance or the current implementation parameters, including scalar parameter values, pointer values, pitched memory
strides, and kernel launch configuration information such as thread block and grid
dimensions. This potentially enables a myriad of important compiler optimizations,
as discussed in Chapters 2 and 4.
The focus is on single kernel performance improvement, so many results report
differences in non-specialized and specialized GPU kernel execution times, but endto-end speedups over CPU or FPGA implementations are also reported. For the GPU
applications, timing was handled by GPU-PF, which uses CUDA events for GPUrelated activities and high-resolution host timers otherwise. Timing of individual
kernel launches are used when analyzing specific kernel performance. In other cases,
where the total run time is of interest, timing is measured from the start of processing
to the end of processing. For the GPU applications, this includes data transfers to
and from the GPU. For both the CPU and GPU applications, only the steady-state
streaming processing operation is timed. This may include file input and output as
a part of the processing stage, but initial setup overhead, such as allocating memory,
memory mapping files, allocating threads, or opening devices and finalization tasks
85
are not included.
6.1.1
Hardware and Software Configurations
All results presented were generated with two NVIDIA GPUs: a Tesla C1060 and
a Tesla C2070. The C1060 is a Compute Capable 1.3 device and contains 4 GB of
RAM. It was installed on a workstation with an Intel Core2 Duo E8400 (3 GHz clock
with 6 MB of L3 cache) and 2 GB of RAM. The C2070 is a Compute Capable 2.0
device and provides 6 GB of RAM. The host machine contains an Intel Xeon W3580
(4 Nehalem cores at 3.33 GHz with 8 MB L3 cache). Both machines run 64-bit Linux
3.2 with CUDA 4.1, GCC 4.6.3, MATLAB 2012a, and CUDA driver version 295.20.
All CPU results presented here are from the Xeon-based workstation, as it has the
more powerful processor.
6.1.2
Problem and Implementation Parameterization
Each application studied was benchmarked on a number of different problem sets and
range of implementation parameters. The total number of different benchmarking
trials for any one of the applications can be determined by multiplying the number of
problem instances by the number of implementation parameter sets. The first two applications, large template matching and PIV, both contain a number of independent
implementation parameters, while the cone beam backprojection implementation is
only parameterized by the number of threads per block. As a result, the number
of discrete kernel configurations tested by the first two applications is significantly
86
larger than the number of configurations for the backprojection application.

6.1.2.1
Template Matching
For the large template matching application, the problem instances evaluated are
listed in Table 5.1 in Section 5.1.1. The data sets are real-world clinical data used by
the researchers and represent a wide range of template sizes and total computation.
Additionally, few of the parameters are those that are considered to be GPU-friendly.
The template matching numerator computation, the focus for analyzing the benefits of specialization, offers a number of implementation parameters that can be used
to further tune the processing. The first kernel, which performs the tiled corr2(),
can be run with a varying number of threads per block and main tile sizes. The
maximum number of threads per block may affect how many times the kernel must
be called to cover the search area, as the current implementation only handles one
search offset per thread. However, increasing the number of threads per block increases shared memory usage per thread, as the entire region of interest is loaded
into shared memory and more threads will cover a greater number of shift offsets.
As discussed in Section 5.1.3.2, changing the main tile size affects the number
of main tiles and can determine whether or not edge-case tiles exist in some cases.
Increasing the main tile size will use more shared memory, as the entire tile and
corresponding region of interest are loaded into shared memory. It also represents a
possible mechanism for balancing per-thread and device-wide workloads. A smaller
tile size will generate more independent blocks but reduce the amount of work each

Parameter
Vertical main tile size
Horizontal main tile size
Tiled kernel threads per block
Reduction kernel threads per block
87
Value Range
2 4 8 10 12 16
2 4 8 10 12 16
64 96 128 160 192 224 256 288 320
32 64 96 128 160
Table 6.1: Template matching GPU implementation parameters benchmarked.

thread performs.
The second reduction kernel contains the number of threads per block as the
single implementation parameter. This parameter can be used to control the total
number of blocks in the kernel launch. However, as a thread is generated for each
shift offset, the total amount of parallelism generated among the data sets tested here
is relatively low. There is also some interaction between the tile size chosen for the
first kernel and the second kernel. Smaller tile sizes will generate more independent
blocks, which produces more intermediate values to accumulate, and thus, more work
per thread.
Table 6.1 lists the set of GPU implementation parameters tested. The first three
are for the first tiled accumulation kernel and the fourth is the single thread count
parameter for the reduction kernel. Noting the aforementioned coupling between the
main tile size in the accumulation stage and the reduction kernel, the threads per
block in the reduction kernel were varied across the range of main tile dimensions.
Likewise, thread counts for the tiled accumulation kernel were varied across the tile
size space. The number of unique configurations per data set was 6 6 9 + 6 6 5,
or 504. With two GPUs and six data sets, the total number of individual template
matching tests was 6,048.
88
While not a focus of this dissertation, the multi-threaded CPU implementation is

also parameterizable based on the number of threads used. Thread counts range from
one to eight (two threads per core with Intels Hyper-Threading simultaneous multithreading). Eight threads consistently produced the best results, with the exception
of the data set P2, where four threads were fastest, likely due to the small problem
size. The performance of the best performing thread count is used for comparison
below.
6.1.2.2
PIV
The PIV application was also tested over a wide variety of problem and implementation parameters. As the reference for performance of the PIV application was the
existing FPGA implementation, the same group of problem instances previously examined were also used to benchmark the GPU implementation. The configurations
were in terms of the original problem specification, and are listed in Table 6.2. Table 6.3 contains the same set of configurations in the new problem representation.
There are two groups of configurations among the ten. The first five configurations,
labeled A1 through A5, increase the image size while keeping other parameters constant. With a larger image, a regular grid with a given stride between interrogation
windows will contain more individual flow estimates to cover the entire image. The
second group, labeled B1 through B5, increases the size of the interrogation window
and overlap pixels while keeping the image and mask size constant. The overlap pixel
count grows in proportion with interrogation window size, which maintains a consis-

Parameter
Set
A1
A2
A3
A4
A5
B1
B2
B3
B4
B5
Image
Dimensions
320 256
512 512
1024 1024
1200 1200
1600 1200
1024 1024
1024 1024
1024 1024
1024 1024
1024 1024
Interrogation
Window Dimensions
40 40
40 40
40 40
40 40
40 40
24 24
32 32
40 40
48 48
56 56
89
Mask
Dimensions
32 32
32 32
32 32
32 32
32 32
16 16
16 16
16 16
16 16
16 16
Overlap
Counts
(20, 20)
(20, 20)
(20, 20)
(20, 20)
(20, 20)
(12, 12)
(16, 16)
(20, 20)
(24, 24)
(28, 28)
Table 6.2: The PIV problem set parameters, in terms of interrogation window and
image dimensions, used for comparing performance of the FPGA and GPU implementations.
Parameter
Set
A1
A2
A3
A4
A5
B1
B2
B3
B4
B5
Mask
Count
165
576
2500
3481
4661
7056
3969
2500
1681
1225
Mask
Offsets
81
81
81
81
81
81
289
625
1089
1681
Total
Offsets
13 365
46 656
202 500
281 961
377 541
571 536
1 147 041
1 562 500
1 830 609
2 059 225
Mask
Dimensions
32 32
32 32
32 32
32 32
32 32
16 16
16 16
16 16
16 16
16 16
Table 6.3: The PIV problem set parameters, in terms of mask and offset counts, used
for comparing performance of the FPGA and GPU implementations.
tent scenario where one-half of the area of an interrogation window overlaps with an
adjacent interrogation window in either direction. With an increasing interrogation
window size, fewer interrogation windows can fit within a constant image size. The
size of the search area also increases. In all cases, the FPGA implementation used
8-bit integers for input, while all GPU results use single-precision floating-point.
To more fully characterize the GPU implementation across the PIV problem
space, three additional problem sets were benchmarked. The first, shown in Ta-

Parameter
Set
M1
M2
M3
M4
M5
M6
M7
M8
M9
M10
M11
M12
Mask
Count
676
676
676
676
676
676
676
676
676
676
676
676
Mask
Offsets
81
81
81
81
81
81
81
81
81
81
81
81
90
Mask
Dimensions
88
11 11
16 16
25 25
32 32
43 43
48 48
57 57
64 64
75 75
88 88
96 96
Table 6.4: PIV problem set parameters used to test the impact of mask size on the
performance of the GPU implementation.
ble 6.4, varies the mask size while keeping the other parameters constant. The second
problem set, shown in Table 6.5, varies the number of offsets each mask is moved
through while keeping other values constant. In both cases, the search is dense, and
the interrogation windows are distributed so there is a fifty percent overlap in each
direction. For simplicity, the interrogation windows are organized into a regular grid,
as in the original problem specification.
Since each block processes a single mask, variations in the number of offsets and
the mask size affect how long a single block will execute. In all cases present in
the problem sets of Tables 6.4 and 6.5, 676 interrogation windows are used. This
provides more than enough independent blocks to fully saturate the 14 streaming
multiprocessors (32 CUDA Cores each) on the Tesla C2070 and the 30 multiprocessors
(8 CUDA Cores each) on the Tesla C1060. The run time should scale linearly with
the number of masks, as the A series problems in Table 6.3 is designed to test.
Parameter
Set
S1
S2
S3
S4
S5
S6
S7
S8
S9
S10
S11
Mask
Count
676
676
676
676
676
676
676
676
676
676
676
Mask
Offsets
81
169
289
441
625
841
1089
1369
1681
2025
2401
Mask
Dimensions
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
91
Interrogation
Window
Dimensions
40 40
44 44
48 48
52 52
56 56
60 60
64 64
68 68
72 72
76 76
80 80
Table 6.5: PIV problem set parameters used to test the impact of the number of
search offsets on the performance of the GPU implementation.
One additional benchmark set, shown in Table 6.6, was used to investigate interblock level performance impacts. It keeps the mask size and offset search space identical, but spaces out the interrogation windows further and further, which reduces the
common interrogation window data needed between thread blocks. Decreasing levels
of data in common between blocks puts more pressure on the block-level memories
when multiple blocks execute within the same multiprocessor, as well as the rest of
the GPU memory hierarchy not-exposed by CUDA.
Like the template matching application, the PIV implementation exposes a number of implementation-only parameters that are freely tunable for any given problem
instance. The tested parameter ranges are shown in Table 6.7. The register blocking
values adjust the amount of work each thread performs at the expense of per-thread
register usage. Increasing the register blocking value can also decrease the available
parallelism when the mask size is small. For both of these reasons, register blocking
Parameter
Set
O1
O2
O3
O4
O5
O6
O7
O8
O9
O10
Mask
Count
676
676
676
676
676
676
676
676
676
676
Mask
Offsets
81
81
81
81
81
81
81
81
81
81
Mask
Dimensions
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
32 32
Interrogation
Window
Dimensions
40 40
40 40
40 40
40 40
40 40
40 40
40 40
40 40
40 40
40 40
92
Overlap
Counts
(36, 36)
(32, 32)
(28, 28)
(25, 25)
(20, 20)
(16, 16)
(12, 12)
(8, 8)
(4, 4)
(0, 0)
Overlap
Ratio
0.9
0.8
0.7
0.625
0.5
0.4
0.3
0.2
0.1
0
Table 6.6: PIV problem set parameters used to test the impact of interrogation
window overlaps on the performance of the GPU implementation.
may be in contention with the number of threads per block.
The Main threads parameter refers to the number of threads dedicated to loading data and computing the sum of squared differences. As it was discovered that
a synchronization-free reduction with a single warp always performed faster than a
traditional wide reduction tree, the reduction that occurs for each mask offset always takes place using a single warp. When warp specialization is not used (warp
specialization itself being a Boolean implementation parameter), one warp from the
Main threads group performs the reduction while the rest wait. When warp specialization is used, an extra thirty-two threads are allocated for the reduction. Finally,
the PIV kernel variants supported reading data from the interrogation window directly through global memory or through textures. Both scenarios were tested. With
the PIV kernel, all the parameters are orthogonal, resulting in 160 implementation
parameter configurations per problem instance. With the four series of problem
configurations and two GPUs, the total number of PIV instances benchmarked was

Parameter
Register blocking factor
Main threads
Reduction threads (if present)
ROI data source
Warp specialization
93
Value Range
1248
32 64 96 128 160 192 224 256 288 320
32
global or textured global
enabled or disabled
Table 6.7: PIV GPU implementation parameters benchmarked.

Parameter
Set
V1
V2
Projection
Dimensions
64 60
512 768
Projection
Count
72
361
Output
Volume
64 60 50
512 512 768
Table 6.8: Cone beam backprojection GPU problem parameters benchmarked.

13,760.
6.1.2.3
Cone Beam Back Projection
The cone beam backprojection kernel was tested with two data sets, shown in Table 6.8. The data sets contained synthetically generated objects, referred to as phantoms, but designed to match data sets generated from a real world hardware scanner,
a Siemens Inveon multimodal scanner in CT mode [50]. Data set V1 was used for
testing and the dimensions of V2 reflect real-world dimensions. For these data sets,
the projections were spaced one degree apart along the orbit. The phantom data was
generated using the Image Reconstruction Toolbox that also contains the reference
MATLAB implementation. The processing performed for cone beam backprojection
is not data dependent, so artificial phantoms provide a good performance estimate.
Also in contrast to the template matching and PIV kernels, the cone beam backprojection kernel had less inherent parameterizability. In this case, the single parameterizable value was the number of threads per block, as shown in Table 6.9. Thread

Parameter
Threads per block
94
Value Range
32 64 96 128 160 192 224 256 288 320 352 384
Table 6.9: Cone beam backprojection GPU implementation parameters benchmarked.

counts ranged from 32 to 384 in increments of thirty-two. With two data sets and
two GPUs, the total number of backprojection instances was 48.Like the template
matching application, the CPU version can also be adjusted by how many threads
are used. The results presented here, are with four threads.
6.2
Results
Across the range of problem parameters tested for each application, kernels utilizing
kernel specialization (KS) produced notable improvements over the corresponding
fully run-time evaluated (RE) kernels, both in performance and register usage per
thread. Results compare the kernel variants to each other and list the best performance across the wide range of implementation parameters for each GPU and
problem pair.
6.2.1
Comparative Performance
To establish the soundness of the basic GPU implementations, they are first compared
to the performance baseline implementations. For template matching, the results
shown in Table 6.10, the reference is multi-threaded C. The GPU times provided are
those corresponding to the best performing set of implementation parameters, across
all tested for the kernel specialized implementations. The results include not only
Data Set
P1
P2
P3
P4
P5
P6
Execution Time
CPU
C1060
2456.753 900.918
181.105 239.989
568.893 179.304
2295.653 493.490
2199.486 423.610
1700.251 305.997
(ms)
C2070
393.164
73.394
73.006
212.082
170.186
126.311
95
Speedup vs. CPU
C1060
C2070
2.83
6.48
0.75
2.47
3.17
7.79
4.65
10.82
5.19
12.92
5.56
13.46
Table 6.10: Template matching performance results comparing the multi-threaded C

CPU implementation to the best performing CUDA implementation on two GPUs.
the specialized numerator kernel, but the entire corr2() processing chain.
As can be seen from the table, the GPU implementation performs well. On the
C2070, of more similar vintage to the Xeon used for CPU benchmarking, significant
speedup across the data sets are provided. With the older C1060, P2 is the only
data set that does not provide a speedup. For this data set, which also corresponds
to the lowest speedup on the C2070, the template size is small, at only 23 21 pixels.
With template tiling, this produces a small number of tiles, even at small tile sizes,
limiting the total parallelism.
With PIV, the reference implementation was FPGA-based. The FPGA results
were obtained using an Alpha Data ADMXRC-5LX board, which contains a XILINX
Virtex 5LX FPGA. While Virtex 5 FPGAs were first released in 2006 and have been
replaced by newer generations of the Virtex line, the clock speeds of new devices are
not significantly greater. It is not expected for a given design to execute at significantly different frequencies on different devices. Results comparing this device to the
two GPUs, both utilizing kernel specialization, are shown in Table 6.11. The available
FPGA results only provide processing time, not end-to-end times with data trans-
Data Set
A1
A2
A3
A4
A5
B1
B2
B3
B4
B5
Execution Time (ms)

FPGA C1060 C2070
3
1.473
0.649
10
4.511
2.126
40
18.887
8.867
70
26.393
12.327
90
34.952
16.772
69
15.896
9.457
138
31.801
18.688
188
51.396
25.535
219
73.780
30.024
247
113.922 38.325
96
Speedup vs. FPGA
C1060
C2070
2.04
4.62
2.22
4.70
2.12
4.51
2.65
5.68
2.57
5.37
4.34
7.30
4.34
7.38
3.66
7.36
2.97
7.29
2.17
6.44
Table 6.11: PIV performance results comparing the FPGA implementation to the
best performing CUDA implementation on two GPUs.
Data Set
V1
V2
Execution Time
CPU
C1060
320
85.869
1 929 900 209 805
(ms)
C2070
69.721
48 208
Speedup vs. CPU

C1060
C2070
3.73
4.59
9.20
40.03
Table 6.12: Cone beam backprojection results comparing the OpenMP CPU implementation with four threads to the best performing configuration on both GPUs.
fers, so the GPU numbers also include only the kernel execution time. Both GPUs
show significant and consistent speedup over the optimized FPGA implementation.
It should be noted, however, that the C1060 results do not use warp specialization
while the C2070 results do. The impact of warp specialization is examined below.
Table 6.12 shows the cone beam backprojection results comparing the OpenMP
CPU run time with four threads to each of the two GPUs. Again, both GPU implementations use kernel specialization. While the times provided include the two
preprocessing stages, the backprojection calculations dominate the total run time. As
expected, the GPUs show significant speedup on the highly parallel backprojection
step.
6.2.2
97
Kernel Specialization Performance
With the basic soundness of the GPU implementations established, this section explores the impact of kernel specialization on the performance of the various CUDA
kernels. Here, results compare the performance of single kernels as kernel specialization is a kernel-level technique.
6.2.2.1
Template Matching
The template matching application is unique among the applications tested due to
its multi-kernel nature. However, for simplicity, only the tiled accumulation kernel
from the numerator will be examined in detail. It represents between 60 and 80
percent of the total streaming run time, depending on the data set. It is the most
complicated kernel in the application and, with the associated reduction kernel, is
called most frequently.
Table 6.13 includes average per-kernel execution time (the kernel is called for
each template for each frame) and register usage counts for the specialized and nonspecialized versions. For execution times a speedup is listed, and for register usage,
a ratio between the non-specialized and specialized per-thread register count is provided. Across the data sets a clear advantage for the kernel specialized variants is
observable.
These results are for the best performing main tile size, which is also listed. It
is interesting to note that the optimal tile dimensions seem arbitrary. With the
specialized kernels, preferences for the hardware friendly values of sixteen elements
98
in the contiguous-data dimension (the first) is weakly observed. Texture memory

is used throughout, and with the higher overheads associated with loop iterations,
the lower memory performance may be less relevant. While not shown, many tile
sizes hovered around the same execution time minima; the values shown had the best
average performance over repeated trials.
6.2.2.2
PIV
While the PIV kernel consists of a single kernel, it is unique among the applications in
that it has a wide variety of implementation parameters, including two non-numerical
parameters that control design decisions: selecting data source memory and whether
or not to use warp specialization.
Table 6.14 explores the impact of these design decisions for the same ten data sets
used to compare the GPU and FPGA implementations. In each case, the execution
time corresponding to the best performing set of numerical parameters is displayed.
For each GPU, the RE Baseline row shows the kernel execution time for the kernel
using global memory and no warp specialization. Successive lines show new run
times for a modified configuration, along with speedup values. The first speedup
value is relative to the previous configuration, and the second is cumulative relative
the baseline configuration.
The table shows an interesting performance divergence between the C1060 and
C2070. In both cases, texturing improves performance when added to the run-time
evaluated kernel. This is true across the PIV benchmark results, and texturing is
P1
P2
P3
P4
P5
P6
P1
P2
P3
P4
P5
P6
Patient
Execution Time (ms)

RE
SK
Speedup
0.038 0.022
1.48
0.020 0.017
1.21
0.042 0.028
1.51
0.095 0.061
1.57
0.049 0.032
1.56
0.084 0.054
1.56
0.058 0.038
1.52
0.041 0.030
1.33
0.071 0.048
1.49
0.146 0.081
1.80
0.070 0.042
1.66
0.134 0.078
1.73
RE
20
20
20
20
20
20
29
29
29
29
29
29
Registers
SK Ratio
18
1.11
14
1.43
17
1.18
18
1.11
17
1.18
18
1.11
22
1.32
22
1.32
21
1.38
21
1.38
21
1.38
21
1.38
Tile
RE
16 4
12 2
16 4
16 16
16 12
16 12
12 10
12 2
16 4
16 10
16 16
16 12
Size
SK
88
12 2
4 12
12 16
8 16
10 16
12 10
12 2
4 16
16 12
16 8
16 8
Total Tile Count

RE
SK
56
49
22
22
60
76
80
104
42
55
81
105
30
30
22
22
60
57
120
120
30
60
126
Table 6.13: Template matching partial sums: performance and optimal configurations characteristics for the tiled
summation kernel. RE stands for runtime evaluated, and SK stands for specialized kernel.
C1060
C2070
GPU

99
100
used for the remainder of the results. However, adding warp specialization helps
with C2070, but decreases performance for most cases with the C1060. This effect is
examined in more detail in the discussion regarding Table 6.17.
Finally, kernel specialization helps both GPUs, producing speedups of around 2
across the tests, although it is greater for series A than series B, where the search
space for each mask is smaller.
Table 6.14 omitted information regarding the numerical implementation parameters. Table 6.15 provides the parameters for the best performing kernel configuration
for each problem. Texturing is used throughout, and warp specialization is used with
the C2070, but not the C1060.
It is worth noting that the results confirm the importance of increased register
file use on the newer C2070. While the number of masks and offsets change, the
mask size remains constant within each set of problem configurations. The mask size
appears to be the dominant factor in determining the optimal kernel configuration,
with the number of threads and data registers remaining constant within each set.
Between the A series and B series, there is a change in mask size from 32 32 (1024)
pixels to 1616 (256) pixels. In both cases, the mask size makes it easy to generate a
natural block area that covers the mask in one iteration. Transitioning from series
A to B, the optimal configuration occurs with the same number of data registers but
lower thread counts. This seems to confirm the notion that it is better to generate
high levels of ILP and register file usage than it is to add more threads.
A2
11.55
10.29
1.12
10.85
0.95
1.06
4.99
2.17
2.31
6.50
5.65
1.15
5.16
1.10
1.26
2.13
3.06
A3
48.37
43.09
1.12
45.42
0.95
1.06
20.69
2.20
2.34
26.70
23.01
1.16
21.73
1.06
1.23
8.87
3.01
A4
67.96
60.51
1.12
63.33
0.96
1.07
28.67
2.21
2.37
37.04
31.91
1.16
30.10
1.06
1.23
12.32
3.0
Data Set
A5
B1
89.77 35.88
79.99 32.19
1.12
1.11
84.28 35.03
0.95
0.92
1.07
1.02
38.31 17.58
2.20
1.99
2.04
2.34
49.13 25.21
42.37 23.66
1.16
1.07
39.98 20.11
1.06
1.18
1.23
1.25
16.78 9.46
2.93
2.67
B2
71.82
64.39
1.12
70.56
0.91
1.02
35.16
2.01
2.04
51.11
47.77
1.07
39.50
1.21
1.29
18.69
2.74
B3
106.29
95.40
1.11
95.16
1.00
1.12
49.58
1.92
2.14
69.82
65.27
1.07
53.71
1.22
1.30
25.53
2.73
B4
140.17
123.40
1.14
119.41
1.03
1.17
64.94
1.84
2.16
82.58
77.01
1.07
63.50
1.21
1.30
30.05
2.75
B5
202.30
184.03
1.10
133.10
1.38
1.52
89.09
1.49
2.27
101.14
95.37
1.06
72.64
1.31
1.39
38.32
2.64
Table 6.14: PIV GPU performance comparisons for several kernel variants across the FPGA benchmark set.
C2070
C1060
RE Baseline (ms)
+ Texturing (ms)
Speedup
+ Warp Specialization (ms)
Relative Speedup
Cumulative Speedup
+ Kernel Specialization (ms)
Relative Speedup
Cumulative Speedup
RE Baseline (ms)
+ Texturing (ms)
Speedup
+ Warp Specialization (ms)
Relative Speedup
Cumulative Speedup
+ Kernel Specialization (ms)
Cumulative Speedup
A1
3.54
3.13
1.13
3.29
0.95
1.08
1.53
2.15
2.32
1.92
1.71
1.12
1.51
1.13
1.27
0.65
2.95

101
A1
A2
A3
A4
A5
B1
B2
B3
B4
B5
A1
A2
A3
A4
A5
B1
B2
B3
B4
B5
Config.
Execution Time (ms)

RE
SK
Speedup
1.72
0.65
2.14
5.86
2.13
2.28
24.38
8.87
2.28
33.83
12.33
2.29
45.37
16.78
2.29
21.84
9.46
2.02
43.85
18.69
2.02
60.14
25.53
1.86
71.09
30.05
1.67
82.59
38.32
1.62
3.13
1.47
2.13
10.29
4.51
2.28
43.09
18.89
2.28
60.50
26.39
2.29
79.98
34.95
2.29
32.19
15.90
2.03
64.38
31.80
2.02
95.41
51.40
1.86
123.38 73.78
1.67
184.04 113.92
1.62
RE
23
23
29
29
29
20
20
20
20
20
32
32
32
32
32
32
32
26
26
22
Registers
SK Ratio
23
1
23
1
23
1.26
23
1.26
23
1.26
23
0.87
23
0.87
23
0.87
23
0.87
23
0.87
16
2
28
1.41
16
2
16
2
16
2
16
2
16
2
16
1.63
12
2.17
12
1.83
Register Blocking
RE
K.S
4
4
4
4
8
4
8
4
8
4
2
4
2
4
2
4
2
4
2
4
4
4
4
8
4
4
4
4
4
4
4
4
4
4
2
4
2
2
1
2
Threads
RE SK
256 256
256 256
128 256
128 256
128 256
128 64
128 64
128 64
128 64
128 64
256 288
128 128
128 256
256 256
128 256
64
64
64
64
128 96
128 128
256 128
Table 6.15: PIV GPU performance data for the FPGA benchmark set, including optimal register blocking and thread
counts.
C1060
C2070
GPU

102
103
Table 6.16 shows similar performance information for the M series tests, which
vary only mask size while keeping the number of masks and the search constant. The
SK Normalized column divides the specialized kernel execution time by the mask
area. Between the actual execution time and the normalized time, the PIV kernel
behaves as expected. The run time increases with mask area, but remains relatively
constant on a per-area basis. For both GPUs, kernel specialization offers performance benefits, with noticeable jumps at mask sizes that are multiples of sixteen
for the C2070 and powers of two (thirty-two and sixty-four) for the C1060. These
correspond with drops in the normalized time, which is based on the specialized kernel time. At these sizes, loop unrolling is combined with memory-hierarchy friendly
values that provide additional performance. These factors can also affect the optimal
parameterization at these mask sizes, where more register hungry variants perform
better. This is expected, as greater register blocking counts can be used to balance
the availability of mask data with higher memory hierarchy performance.
Table 6.17 shows similar data as the previous tables, but for the S series of tests,
which vary the number of search offsets for each mask, while keeping the number
of masks and mask size constant. Here, the SK Normalized column is the kernel
execution time divided by the number of search offsets. Two sets of results are
provided for the C1060, both with and without warp specialization.
For the C2070, after an amortization period, the normalized performance and
speedup are constant, as expected, showing a purely linear relationship with the
M1
M2
M3
M4
M5
M6
M7
M8
M9
M10
M11
M12
M1
M2
M3
M4
M5
M6
M7
M8
M9
M10
M11
M12
Config.
RE
0.989
1.354
2.014
4.293
6.013
12.164
13.296
19.269
23.860
32.722
45.002
53.014
1.274
2.284
3.171
9.370
11.951
24.543
27.410
38.837
47.549
65.380
91.185
106.913
Execution Time (ms)

SK
Speedup SK Normalized
0.868
1.14
0.0136
0.957
1.41
0.0079
0.972
2.07
0.0038
2.062
2.08
0.0033
2.460
2.44
0.0024
5.294
2.30
0.0029
5.348
2.49
0.0023
9.435
2.04
0.0029
8.805
2.71
0.0021
14.603
2.24
0.0026
19.447
2.31
0.0025
19.778
2.68
0.0021
0.771
1.65
0.0121
1.570
1.46
0.0130
1.590
1.99
0.0062
5.311
1.76
0.0085
5.208
2.29
0.0051
15.080
1.63
0.0082
15.995
1.71
0.0069
25.695
1.51
0.0079
22.051
2.15
0.0054
40.674
1.61
0.0072
56.059
1.63
0.0072
63.156
1.69
0.0069
RE
20
20
20
23
23
29
23
23
23
23
23
23
22
22
32
32
32
32
32
32
32
32
32
32
Registers
SK Ratio
13
1.54
23
0.87
23
0.87
22
1
23
1
26
1.12
45
0.51
25
0.92
45
0.51
26
0.92
26
0.92
45
0.51
8
2.75
12
1.83
16
2
17
1.88
16
2
29
1.10
20
1.6
32
1
18
1.78
32
1
32
1
32
1
Register Blocking
RE
K.S
2
1
2
4
2
4
4
4
4
4
8
4
4
8
4
4
4
8
4
4
4
4
4
8
1
1
2
2
4
4
4
4
4
4
4
8
4
4
4
8
4
4
4
8
4
8
4
8
Threads
RE SK
32
64
64
32
128 64
160 160
256 256
128 256
288 96
288 288
256 128
288 256
288 256
288 288
64
64
128 64
64
64
224 160
128 256
256 256
128 192
128 64
128 256
128 64
128 128
128 128
Table 6.16: PIV GPU performance data for the varying mask size benchmark set, including optimal register blocking
and thread counts.
C1060
C2070
GPU

104
105
number of offsets. The contents of the for() loop iterating over the mask offset
set contains the main computation (for the compute warps) and contains the main
synchronization point between exchanging shared memory buffers with the reduction
warp.
With the C1060, however, the picture is more complicated. Both with and without warp specialization, the normalized performance and speedup decrease with an
increasing number of offsets, although it is not immediately clear why. The C1060
appears to incur overhead from warp specialization at small sizes, but the added task
parallelism soon overcomes the overhead once several hundred offsets are involved.
The warp specialized version, while not purely linear like the C2070, has a more linear relationship with the number offsets than the non-warp-specialized variant. This
implies that the decreasing performance as the number of offsets increases is due
to block-wide synchronization, as the width of per-offset reduction does not change
throughout the data set and is assumed to take constant time.
The discontinuities in the normalized performance appear to be related to GPU
occupancy values. The amount of shared memory used by the kernel grows linearly with the number of offsets, and the number of resident thread blocks per SM
drops between S6 and S7 for the non-warp-specialized implementation. With warpspecialization, the occupancy threshold changes between S8 and S9. The shared
memory requirements start out higher for the warp specialized variant as it is double
buffered and reaches an occupancy threshold at a different point. With more shared
106
memory per SM, the C2070 does not experience these problems in the tested range.
The final benchmark set tested with the PIV application is one that varies the
overlap between adjacent interrogation windows. The results of these tests are shown
in Table 6.18. Here, only the non-warp specialized variant is shown for the C1060,
while the C2070 uses warp specialization. The SK Normalized column shows the
execution time relative to the O1 execution time.
As can be seen from the data, there is little impact on kernel performance for
either the C1060 or C2070. With a dense search space there is high data locality
between consecutive search offsets, which is likely significantly more important than
sharing data between interrogation windows, which are all assigned to different thread
blocks.
6.2.2.3
Cone Beam Backprojection
The backprojection kernel results are enumerated in Table 6.19. Even though there
are fewer implementation and data set parameters, there is an interesting result: for
the larger data set, the older C1060 GPU reports slower results when using kernel
specialization. This appears to be a result of higher occupancy decreasing caching
performance. The backprojection kernel uses only global memory for reads, despite
the high data locality. As a newer GPU of Compute Capability 2.0, the C2070 GPU
has an L1 cache in addition to shared memory, which appears to effectively cache the
needed data.
On the other hand, the C1060 does not have an automatically managed L1 cache,
S1
S2
S3
S4
S5
S6
S7
S8
S9
S10
S11
S1
S2
S3
S4
S5
S6
S7
S8
S9
S10
S11
S1
S2
S3
S4
S5
S6
S7
S8
S9
S10
S11
Config.
RE
6.009
12.434
21.187
32.268
45.699
61.470
79.575
99.999
122.816
147.998
175.444
11.955
24.816
42.589
64.500
91.499
124.524
166.199
209.058
260.200
417.971
499.351
12.678
26.273
44.776
68.247
96.702
131.942
173.047
217.170
266.570
321.537
383.775
Execution Time (ms)

SK
Speedup SK
2.460
2.44
5.071
2.45
8.632
2.45
13.140
2.46
18.606
2.46
25.016
2.46
32.375
2.46
40.686
2.46
49.949
2.46
60.163
2.46
71.327
2.46
5.206
2.30
10.812
2.30
18.453
2.31
28.126
2.29
39.926
2.29
56.013
2.22
84.071
1.98
113.954
1.83
144.665
1.80
252.124
1.66
302.407
1.65
5.737
2.21
11.876
2.21
19.184
2.33
29.122
2.34
41.141
2.35
56.648
2.33
73.060
2.37
91.737
2.37
135.696
1.96
163.682
1.96
196.516
1.95
Normalized
0.0304
0.0300
0.0299
0.0298
0.0298
0.0297
0.0297
0.0297
0.0297
0.0297
0.0297
0.0643
0.0640
0.0639
0.0638
0.0639
0.0666
0.0772
0.0832
0.0861
0.1245
0.1260
0.0708
0.0703
0.0664
0.0660
0.0658
0.0674
0.0671
0.0670
0.0807
0.0808
0.0818
RE
23
23
23
23
23
23
23
23
23
23
23
32
32
32
32
32
32
32
32
32
32
32
39
39
39
39
39
39
27
39
27
27
27
Registers
SK Ratio
23
1
23
1
23
1
23
1
23
1
23
1
23
1
23
1
23
1
23
1
23
1
16
2
16
2
16
2
16
2
16
2
28
1.14
28
1.14
16
2
16
2
16
2
16
2
16
2.44
16
1.83
16
2
16
1.88
16
2
16
1.10
16
1.6
16
1
16
1.78
16
1
16
1
Register Blocking
RE
K.S
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
8
4
8
4
4
4
4
4
4
4
4
8
4
8
4
8
4
8
4
8
4
8
4
4
4
8
4
4
4
4
4
4
4
Threads
RE SK
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
128 256
128 256
128 256
128 256
128 256
128 128
256 128
256 256
256 256
256 256
256 256
160 256
160 256
160 256
160 256
160 256
160 256
160 256
160 256
256 256
256 256
256 256
Table 6.17: PIV GPU performance data for the varying search benchmark set, including optimal register blocking and
thread counts.
specialization
warp
with
C1060
specialization
warp
without
C1060
specialization
warp
with
C2070
GPU

107
O1
O2
O3
O4
O5
O6
O7
O8
O9
O10
O1
O2
O3
O4
O5
O6
O7
O8
O9
O10
Config.
RE
5.973
5.994
6.003
6.011
6.010
6.012
6.016
6.016
6.0261
6.048
11.808
11.953
11.954
11.945
11.955
11.955
11.971
12.161
12.116
12.166
Execution Time (ms)

SK
Speedup SK Normalized
2.456
2.43
1.000
2.457
2.44
1.001
2.456
2.44
1.001
2.460
2.44
1.002
2.459
2.44
1.001
2.461
2.44
1.002
2.462
2.44
1.003
2.463
2.44
1.003
2.464
2.45
1.003
2.467
2.45
1.005
5.198
2.27
1.000
5.199
2.30
1.000
5.201
2.30
1.000
5.198
2.30
1.000
5.200
2.30
1.000
5.231
2.29
1.006
5.244
2.28
1.009
5.273
2.31
1.014
5.278
2.30
1.015
5.284
2.30
1.016
RE
20
20
20
23
23
29
23
23
23
23
32
32
32
32
32
32
32
32
32
32
Registers
SK Ratio
13
1.54
23
0.87
23
0.87
22
1
23
1
26
1.12
45
0.51
25
0.92
45
0.51
26
0.92
16
2.75
16
1.83
16
2
16
1.88
16
2
16
1.10
16
1.6
16
1
16
1.78
16
1
Register Blocking
RE
K.S
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
Threads
RE SK
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
256 256
128 256
128 256
128 256
128 256
128 256
128 256
128 256
128 256
128 256
128 256
Table 6.18: PIV GPU performance data for the varying overlap benchmark set, including optimal register blocking and
thread counts.
C1060
C2070
GPU

108
GPU
C2070
C1060
Config
V1
V2
V1
V2
Execution Time (ms)

RE
KS
Speedup
13.160
11.228
1.17
46 604
24 400
1.91
14.036
10.501
1.34
175 789 188 806
0.93
RE
32
32
34
34
109
Registers
KS Ratio
26
1.23
24
1.33
25
1.36
21
1.62
Thread
RE
320
160
128
224
Count
KS
160
192
320
32
Table 6.19: Performance comparisons for the backprojection kernels.

forcing the SM to refetch the data on each access. Table 6.20 lists the C1060 GPU
occupancy numbers and execution time for the run-time evaluated and specialized
kernels on the V2 data set. The best performance, for both specialized and run-time
evaluated kernels, occurs at low occupancy (low number of active warps per SM).
Threads and blocks are distributed through the output volume, and a kernel execution
configuration with more active warps will increase the range of input data needed by
a streaming multiprocessor will grow. While shared memory is not utilized by this
kernel, even older GPUs do contain a caching memory hierarchy outside the SM [56]
and below shared memory. Fewer active threads may improve the performance of
these caches for what is a memory bound application.
Each individual block is confined to a plane in the output space, but adjacent
blocks may wrap around from the bottom corner of an output slice back to the
top of the next slice, decreasing shared data. This may explain why the lowest
execution time for the run-time evaluated kernel does not occur with the minimum
number of active threads. There are still few active warps, but only one active thread
block.
Finally, when the memory hierarchy is not a bottleneck, as is the case with the

Threads
per Block
32
64
96
128
160
192
224
256
288
320
352
384
Execution
RE
187 434
209 185
197 427
206 569
200 387
210 942
175 789
185 377
210 850
201 404
203 691
205 907
Time (ms)
KS
188 807
209 923
210 678
209 529
219 732
217 779
208 039
196 602
213 612
217 476
212 521
218 412
Blocks/SM
RE
KS
6
8
6
8
3
5
3
5
2
4
2
4
1
2
1
2
1
2
1
2
1
2
1
2
110
Warps/SM
RE
KS
6
8
12
16
9
15
12
20
10
20
12
24
7
14
8
16
9
18
10
20
11
22
12
24
Table 6.20: Occupancy and execution data for the C1060 on the V2 data set.
first data set, kernel specialization does provide a benefit.
6.3
Analysis
There are a number of overall conclusions that can be made across the set of applications and both GPUs. First, kernel specialization provides significant performance
improvements. Execution times are noticeably lower. These benefits are derived
on top of good baseline implementations of real-world scientific applications of nontrivial complexity. Simpler kernels may provide even greater performance improvements.
Both the PIV and cone beam backprojection application results show that kernel specialization provides significant benefits beyond those associated with loop unrolling. In the case of the PIV application, the run-time evaluated kernel was allowed
to cheat so that it could take advantage of variable register blocking. This results
in static unrolled loops over the core data loading and computation loop within the
111
PIV kernel. Despite this advantage, the fully specialized kernel still provides significant performance advantages over a wide range of parameters. Constant folding
and propagation, as well as some additional strength reduction add significantly to
performance.
At the same time, kernel specialization generally provides significant reductions
in the number of registers per thread used by a given kernel, assuming all other configuration values are equal. This has two important benefits: 1) When not otherwise
constrained, fewer registers per thread may result in more active warps per streaming
multiprocessor. This may allow the hardware to achieve better performance despite
high latencies. 2) It enables more sophisticated and resource intensive kernels to fit
within the same budget.
That second factor is important for another of the key advantages offered by
kernel specialization: increased adaptability without a performance penalty. The 51
data sets chosen cover a wide range of irregular parameters. Kernel specialization
provides significant performance improvements over the high-performing baseline implementations by removing overheads associated with flexibility.
Related to this is the ability to include additional implementation parameters for
further specializing the behavior of a given kernel. From the data in the various
tables in this section, a wide range of implementation parameters, across different
problems and the two GPU devices, are required to achieve optimal performance.
In contrast to the tables previously discussed, which reflect the best performing
112
set of specialized and run-time evaluated kernel parameters, Tables 6.21 and 6.22
show the performance of a single fixed and specialized configuration against the
highest performing configuration on a given data set for the template matching and
PIV applications, respectively. These scenarios represent typical CUDA development
practices where a single set of compile-time (specialized) parameters are used regardless of the incoming problem parameters or target hardware characteristics, other
than compute capability.
Table 6.21 shows the relative performance of various fixed main tile sizes and
thread counts compared to the optimal values for each tested data set. Similarly,
Table 6.22 shows the performance of several sets of fixed PIV implementation parameters against the best performing PIV configuration for the M series of problem
configurations. Between requirements for register blocking and the need to leverage important performance optimizations, these implementation parameters would
likely be fixed when the kernel was compiled ahead of time in a scenario without
kernel specialization. Both tables use values that are typically seen in GPGPU applications. With kernel specialization, there is no longer a need for fixed parameter
values, allowing unique values to be used for each problem and GPU.
In both cases, it is possible to select a configuration with reasonable performance.
For the C2070, a fixed configuration can achieve about 85 and 80 percent of peak kernel performance for the template matching and PIV applications, respectively. For
the C1060, the same values are about 80 and 90 percent, respectively. While accept-
Data Set
P1
P2
P3
P4
P5
P6
Average
Minimum
C2070
128 Threads
256 Threads
8 8 16 16 8 8 16 16
0.60
0.19
0.84
0.53
0.34
0.17
0.85
0.55
0.69
0.40
0.94
0.73
0.59
0.62
0.88
0.96
0.69
0.49
0.73
0.57
0.87
0.91
0.87
0.91
0.63
0.46
0.86
0.71
0.34
0.17
0.73
0.53
113
C1060
128 Threads
256 Threads
8 8 16 16 8 8 16 16
0.37
0.21
0.42
0.35
0.41
0.24
0.85
0.55
0.70
0.48
0.96
0.75
0.68
0.67
0.86
0.95
0.68
0.59
0.77
0.70
0.90
0.93
0.90
0.93
0.62
0.52
0.79
0.71
0.37
0.21
0.42
0.35
Table 6.21: Percentage of the peak performance for the template matching application with various fixed main tile sizes and thread counts.
able on average, in each case the minimum relative performance is significantly lower,
demonstrating the benefits of making these implementation parameters adjustable.
These results, as well as others in this chapter, further reinforce the complexity
of development on NVIDIA CUDA hardware. Performance can be unpredictable
and highly sensitive to variation. The fixed configurations in Tables 6.21 and 6.22
represent several possible intuitive kernel configurations that only sometimes correlate with the optimal configuration. To further demonstrate the non-predictability
of CUDA kernel parameterization, Figures 6.1 and 6.2 show the relative performance
of the PIV kernel on C1060 and C2070, respectively, across the M series tests, where
only the mask size is changing. Each plot shows performance and is scaled independently between zero and one, with the best combination of register blocking and
thread count marked with a white square.
As is visible from both the figures and the preceding tables, kernel performance is
highly non-linear and dependent on the correct selection of implementation parameter
values. As more free implementation parameters are available, manual selection of
64 Threads
0.89
0.93
1.00
0.69
0.65
0.67
0.63
0.78
0.58
0.72
0.68
0.58
0.73
0.58
0.89
0.94
1.00
0.78
0.82
0.80
0.89
0.91
0.87
0.90
0.90
0.88
0.88
0.78
4 Data Registers
128 Threads 256
0.62
0.65
0.69
0.70
0.85
0.90
0.78
0.99
0.76
0.97
0.89
0.76
0.80
0.62
0.68
0.83
0.89
0.85
0.92
0.93
0.96
0.98
0.98
0.98
0.97
0.97
0.91
0.68
Threads
0.35
0.36
0.39
0.77
1.00
1.00
0.73
0.97
0.77
1.00
1.00
0.77
0.76
0.35
0.35
0.48
0.57
0.90
1.00
0.99
0.91
0.96
1.00
0.98
0.99
0.99
0.84
0.35
64 Threads
0.94
0.93
0.72
0.56
0.93
0.54
0.46
0.57
0.84
0.52
0.49
0.83
0.69
0.46
0.76
0.85
0.91
0.83
0.93
0.90
0.98
1.00
0.99
1.00
0.99
0.99
0.93
0.76
8 Data Registers
128 Threads 256
0.67
0.67
0.68
0.82
0.71
0.77
0.64
0.81
1.00
0.75
0.72
1.00
0.77
0.64
0.41
0.60
0.63
0.92
0.98
0.98
0.94
0.97
1.00
0.99
1.00
1.00
0.87
0.41
Threads
0.34
0.35
0.37
0.66
0.62
0.74
0.56
0.82
0.98
0.77
0.74
0.63
0.63
0.34
0.19
0.33
0.36
0.68
0.71
1.00
0.83
0.95
0.93
0.96
0.95
0.94
0.74
0.19
Table 6.22: Percentage of the peak performance for the PIV application with various fixed data register counts and
thread counts.
C1060
C2070
M1
M2
M3
M4
M5
M6
M7
M8
M9
M10
M11
M12
Average
Minimum
M1
M2
M3
M4
M5
M6
M7
M8
M9
M10
M11
M12
Average
Minimum
Data Set

114
115
Figure 6.1: Contour plots of performance relative to the peak for each of the data
sets in Table 6.4 on the Tesla C1060. The location of peak performance is marked
with a white square.
optimal values becomes speculative and error prone. Autotuning techniques, a main
focus of many other related research projects, are therefore highly complimentary to
kernel specialization.
In this chapter, a number of experiments and their results were presented. Kernel
specialization was shown to have a number of important advantages for GPGPU
kernels. In the next chapter, plans for future work are discussed.
116
Figure 6.2: Contour plots of performance relative to the peak for each of the data
sets in Table 6.4 on the Tesla C2070. The location of peak performance is marked
with a white square.
Chapter 7
Conclusions and Future Work
In this chapter, ideas for future work and conclusions are presented.
7.1
Conclusions
While GPGPU computing can provide significant performance advantages over general purpose processors, it often comes at the expense of either adaptability or maximum performance. Static-value optimizations applied at compile time require choosing between these objectives. This dissertation has explored kernel specialization, a
technique where compilation is delayed until fixed parameter values are known. By
lowering the penalty associated with increased parameterization, kernel specialization allows for a single GPU implementation to offer greater adaptability without
sacrificing performance.
Using several real-world applications, kernel specialization was examined from
two angles: improved performance and reduction in per-thread register usage for
a given level of adaptability as well as the importance of adjusting normally static
implementation parameters in optimizing performance. For the first, kernels belong-
CHAPTER 7. CONCLUSIONS AND FUTURE WORK
118
ing to non-trivial case studies, aided by kernel specialization, were shown to exhibit
adaptability not only to a wide range of problem parameters but also to two different NVIDIA GPU generations. The maximum performance observed for each data
set occurred with a different set of implementation parameters. Maximum performance can only be achieved with the ability to dynamically adjust implementation
parameters that are usually considered static.
Combined with autotuning, which can be used to help select optimal values from
an implementation parameter space that may be extremely large, kernel specialization can be used to create libraries of highly-parameterized GPGPU kernel implementations that can be effectively applied across different hardware and problem
configurations. This is of particular use in problem areas not covered by emerging
domain-specific tools.
7.2
Future Work
There are three main areas in which future research efforts will be applied: improving
the existing GPU applications covered in this dissertation, adding new capabilities
to GPU-PF, and further exploring the characteristics of kernel specialization.
7.2.1
Existing Applications
First, there are possibilities for improving the performance of the existing GPU applications. In the template matching application, the non-numerator stages, which
represent approximately 20 to 40 percent of the GPU run time could benefit from the
119
same kernel specialization optimizations that were applied to the numerator. The
application may also benefit from utilizing more of the register file instead of shared
memory for data storage. For the PIV kernel, implementing data reuse through
shared memory could improve performance. Additionally, more time could be spent
studying warp specialization, as the current CUDA tools provide little insight into
the relative performance of different warp groups. Finally, the cone beam backprojection kernel could be augmented to use texture memory. It is likely that there are
additional opportunities for using kernel specialization to improve the performance
of the kernel.
7.2.2
GPU-PF
Beyond the CUDA GPU implementations, there are a number of possible enhancements to the GPU Prototyping Framework. Of primary interest is creating an
OpenCL back end. Currently, the framework uses the driver-level CUDA API, which
should facilitate a port as it is more similar to OpenCL than the higher-level run-time
API. OpenCL presents a possible advantage for kernel specialization: the OpenCL
specification includes an API for compiling the C-like syntax of OpenCL kernels at
runtime. OpenCL support would enable targeting a wider variety of GPU vendors
and platforms. This may soon be possible with the CUDA language, as NVIDIA
has opened up an increasing portion of the CUDA development tools to third party
modification. Beyond additional platforms, CUDA evolution may also include better support for generating specialized binaries at run time without having to call
120
to a separate compiler application. Any developments along these lines could be

incorporated into GPU-PF.
In addition, a better mechanism for expressing task parallelism and utilizing
CUDA streams could be implemented. Given the limitations of task scheduling
with CUDA streams, in which internal CUDA synchronization points are easily introduced, current support for flagging a group of CUDA actions for execution distributed among a set of streams is often sufficient. However, with multiple CUDA
contexts on one or more GPUs, and simultaneous CPU and GPU execution, more
robust task graph support could be investigated.
To further the aim of rapid prototyping, integration with existing auto-tuning
tools and/or applying optimization techniques to the high-dimensional performance
data cubes to automatically generate effective configuration policies could be implemented. With the additional adaptability to both problem and implementation
parameters that kernel specialization offers, it is important to provide tools for selecting near optimal configurations as quickly as possible.
7.2.3
While kernel specialization has been shown to be beneficial for both performance
and register usage reduction, additional work in this area could be done. There may
be major advantages between the benefits of fixing some parameters over others.
Additional experiments with subsets of optimizations enabled and disabled would
help elucidate the contours of this space. As mentioned, some key performance
121
optimizations require fixed values, like loop unrolling and register blocking. However,
others, such as pointer value inlining, may be less important. There is a natural tradeoff between kernel specializations and adaptability without recompilation. If pointer
or other values change frequently, avoiding recompilation may outweigh the benefits
of kernel specialization. Fixing as many parameters as possible for a given usage
scenario is a key objective.
Finally, more advanced GPU binary generations techniques, such as fusing multiple kernels into one, could be examined. With ever increasing support for a C++
class and template features, more advanced compile time specialization than what
was studied in this dissertation could be investigated. There is room for increasing
the adaptability of a single kernel compilation unit to different algorithms, problem
instances, and hardware configurations. In addition, developing a library of specialization abstractions that allow toggling the specialization state of a particular
variable would improve the usability of kernel specialization.
This research would be driven by applying kernel specialization to a wider variety
of applications and platforms. A larger corpus of knowledge would help inform the
benefits, limitations, and trade-offs of kernel specialization.
Appendix A
Glossary
Abbreviation
CT
FPGA
GPU
GPU-PF
GPGPU
ILP
IR
IRT
JIT
PIV
PTX
RCL
RE
RLG
ROI
SK
SM
TLP
Term
computed tomography
field programmable gate array
graphics processing unit
GPU Prototyping Framework
general purpose GPU computing
instruction level parallelism
intermediate representation
Image Reconstruction Toolbox
just-in-time
particle image velocimetry
Parallel Thread Execution; NVIDIA CUDAs IR
Reconfigurable Computing Laboratory
run-time evaluated
Robot Locomotion Group
region of interest
specialized kernel
streaming multiprocessor
thread level parallelism
122
Appendix B
Flexibly Specializable Kernel
Listing B.1: A CUDA C GPU kernel designed to demonstrate flexible kernel specialization. The kernel can be compiled both with and without specialization.
1
#include g p u F u n c t i o n s . cuh
2
3
4
5
#i f n d e f CT LOOPS
#define CT LOOPS 0
#endif
6
7
8
9
#i f n d e f LOOPS COUNT
#define LOOPS COUNT 0
#endif
10
11
12
13
#i f n d e f CT ARGS
#define CT ARGS 0
#endif
14
15
16
17
#i f n d e f ARG A
#define ARG A 0
#endif
18
19
20
21
#i f n d e f ARG B
#define ARG B 0
#endif
22
23
24
25
#i f n d e f CT BLOCK DIMS
#define CT BLOCK DIMS 0
#endif
26
27
28
29
#i f n d e f BLOCK DIM X
#define BLOCK DIM X 1
#endif
30
31
32
33
#i f n d e f BLOCK DIM Y
#define BLOCK DIM Y 1
#endif
34
35
36
37
38
#i f n d e f BLOCK DIM Z
#define BLOCK DIM Z 1
#endif
123
APPENDIX B. FLEXIBLY SPECIALIZABLE KERNEL
39
40
41
extern C {
global
42
43
44
45
46
void mathTest ( i n t
int
int
int
int
in ,
out ,
argA ,
argB ,
loopCount ) ;
47
48
49
global
void mathTest ( i n t in , i n t out , i n t argA , i n t argB , i n t loopCount ) {
int acc = 0 ;
50
51
52
53
typedef gpu : : c t r t : : num<CT LOOPS, int , LOOPS COUNT > l o o p s ;

typedef gpu : : c t r t : : mult<CT ARGS, gpu : : num<int , ARG A>, gpu : : num<int , ARG B> >
computedStride ;
typedef gpu : : c t r t : : BlockDims<CT BLOCK DIMS, BLOCK DIM X, BLOCK DIM Y,
BLOCK DIM Z> bDims ;
54
55
56
57
const i n t count = l o o p s : : op ( loopCount ) ;

const unsigned i n t s t r i d e = c o m p u t e d S t r i d e : : op ( argA , argB ) ;
const unsigned i n t o f f s e t = b l o c k I d x . x bDims : : opX ( ) + t h r e a d I d x . x ;
58
59
60
61
62
63
64
65
f o r ( i n t i = 0 ; i < count ; i ++){

#i f d e f CT PTR IN
a c c += ( ( i n t ) PTR IN + o f f s e t + i s t r i d e ) ;
#e l s e
a c c += ( i n + o f f s e t + i s t r i d e ) ;
#endif
}
66
67
68
69
70
71
72
73
#i f d e f CT PTR OUT
( ( i n t )PTR OUT + o f f s e t ) = a c c ;
#e l s e
( out + o f f s e t ) = a c c ;
#endif
return ;
}
124
Appendix C
Sample Run Time Evaluated PTX
The PTX in Listing C.2 was generated from the kernel source in Appendix B with all
parameters evaluated at run time. No compile-time specialization is used by leaving
all of the macro definitions in Listing B.1 undefined. The nvcc command line used
to generate the PTX is provided in Listing C.1.
Listing C.1: The nvcc command line used to generate the PTX in C.2.
mathTest2.cu file contained the source of Listing B.1.
The
nvcc v e r b o s e ptxaso p t i o n s v e r b o s e keep c u b i n m64 I . . / gpui n c l u d e gpu

a r c h i t e c t u r e compute 20 gpucode sm 20 mathTest2 . cu
Listing C.2: The run-time adaptable PTX produced by calling nvcc on the CUDA C
source in Appendix B without any fixed parameters.
//
// Generated by NVIDIA NVVM Compiler
// Compiler b u i l t on Thu Jan 12 1 7 : 4 6 : 0 1 2012 ( 1 3 2 6 4 0 8 3 6 1 )
// Cuda c o m p i l a t i o n t o o l s , r e l e a s e 4 . 1 , V0 . 2 . 1 2 2 1
//
. version 3 . 0
. t a r g e t sm 20
. a d d r e s s s i z e 64
.
.
.
.
.
file
file
file
file
file
1
2
3
4
5
mathTest2 . cpp3 . i
mathTest2 . cu
. . / gpui n c l u d e / g p u A t t r i b u t e s . cuh
. . / gpui n c l u d e /gpuMath . cuh
. . / gpui n c l u d e /gpuNum . cuh
125
APPENDIX C. SAMPLE RUN TIME EVALUATED PTX
. e n t r y mathTest (
. param . u64 mathTest
)
{
. r e g . pred %p<3>;
. reg . s32
%r <29>;
. reg . s64
%r l <12>;
l d . param . u64
l d . param . u64
l d . param . u32
c v t a . t o . global . u64
c v t a . t o . global . u64
mov . u32
mov . u32
mov . u32
mad . l o . s 3 2
c v t . u64 . u32
s e t p . gt . s 3 2
@%p1 b r a
mov . u32
bra . u n i
BB0 2 :
l d . param . u32
l d . param . u32
mul . l o . s 3 2
mov . u32
mov . u32
mov . u32
BB0 3 :
c v t . u64 . u32
add . s 6 4
s h l . b64
add . s 6 4
l d . global . u32
add . s 3 2
add . s 3 2
add . s 3 2
l d . param . u32
setp . l t . s32
@%p2 b r a
BB0 4 :
s h l . b64
add . s 6 4
s t . global . u32
ret ;
}
param
param
param
param
param
0,
1,
2,
3,
4
%r l 4 , [ mathTest param 0 ] ;
%r l 5 , [ mathTest param 1 ] ;
%r3 , [ mathTest param 4 ] ;
%r l 1 , %r l 5 ;
%r l 2 , %r l 4 ;
%r12 , %n t i d . x ;
%r13 , %c t a i d . x ;
%r14 , %t i d . x ;
%r15 , %r12 , %r13 , %r14 ;
%r l 3 , %r15 ;
%p1 , %r3 , 0 ;
BB0 2 ;
%r28 , 0 ;
BB0 4 ;

%r4 , %r24 , %r23 ;
%r28 , 0 ;
%r27 , %r28 ;
%r26 , %r28 ;
%r l 6 , %r26 ;
%r l 7 , %r l 6 , %r l 3 ;
%r l 8 , %r l 7 , 2 ;
%r l 9 , %r l 2 , %r l 8 ;
%r20 , [% r l 9 ] ;
%r28 , %r20 , %r28 ;
%r26 , %r26 , %r4 ;
%r27 , %r27 , 1 ;
%p2 , %r27 , %r25 ;
BB0 3 ;
%r l 1 0 , %r l 3 , 2 ;
%r l 1 1 , %r l 1 , %r l 1 0 ;
[%r l 1 1 ] , %r28 ;
126
Appendix D
Sample Kernel Specialized PTX
The PTX in Listing D.2 was generated from the kernel source in Appendix B for the
case where every parameter was fixed at compile time. The kernel is fully specialized.
In this example, a loop iteration count of five, and a one-dimensional block of 128
threads, was used. The argA and argB inputs were fixed at 3 and 7, respectively.
The input pointer was set to 0x200ca0200, and the output pointer to 0x200b80000.
(Kernels are compiled after memory is allocated so pointer values are known.) The
nvcc command line used to generate the PTX is provided in Listing D.1.
Listing D.1: The nvcc command line used to generate the PTX in D.2.
mathTest2.cu file contained the source of Listing B.1
The
nvcc v e r b o s e ptxaso p t i o n s v e r b o s e keep c u b i n m64 I . . / gpui n c l u d e gpu

a r c h i t e c t u r e compute 20 gpucode sm 20 DCT ARGS=1 DARG A=3 DARG B=7
DCT LOOPS=1 DLOOPS COUNT=5 DCT PTR IN=1 DPTR IN=0x 200c a020 0 DCT PTR OUT=1
DPTR OUT=0x200b80000 DCT BLOCK DIMS=1 DBLOCK DIM X=128 mathTest2 . cu
Listing D.2: Specialized PTX produced by calling nvcc on the CUDA C source in
Appendix B and specifying all parameters on the command line.
//
// Generated by NVIDIA NVVM Compiler
// Compiler b u i l t on Thu Jan 12 1 7 : 4 6 : 0 1 2012 ( 1 3 2 6 4 0 8 3 6 1 )
// Cuda c o m p i l a t i o n t o o l s , r e l e a s e 4 . 1 , V0 . 2 . 1 2 2 1
//
127
APPENDIX D. SAMPLE KERNEL SPECIALIZED PTX
. version 3 . 0
. t a r g e t sm 20
. a d d r e s s s i z e 64
.
.
.
.
.
file
file
file
file
file
1
2
3
4
5
mathTest2 . cpp3 . i
mathTest2 . cu
. . / gpui n c l u d e /gpuNum . cuh
. . / gpui n c l u d e /gpuMath . cuh
. . / gpui n c l u d e / g p u A t t r i b u t e s . cuh
. e n t r y mathTest (
. param . u64 mathTest param
)
{
. reg . s32
%r <20>;
. reg . s64
%r l <2>;
mov . u32
s h l . b32
mov . u32
add . s 3 2
mul . wide . u32
l d . u32
l d . u32
add . s 3 2
l d . u32
add . s 3 2
l d . u32
add . s 3 2
l d . u32
add . s 3 2
s t . u32
ret ;
}
0,
1,
2,
3,
4
%r1 , %c t a i d . x ;
%r2 , %r1 , 7 ;
%r3 , %t i d . x ;
%r4 , %r2 , %r3 ;
%r l 1 , %r4 , 4 ;
%r5 , [% r l 1 +8603173460];
%r7 , [% r l 1 +8603173376];
%r9 , %r5 , %r7 ;
%r10 , [% r l 1 +8603173544];
%r12 , %r10 , %r9 ;
%r13 , [% r l 1 +8603173628];
%r15 , %r13 , %r12 ;
%r16 , [% r l 1 +8603173712];
%r18 , %r16 , %r15 ;
[%r l 1 +8601993216] , %r18 ;
128
Appendix E
OpenCV Kernel Source
The following sample is provided as a real-world example of the coding techniques
required to achieve best performance on GPUs. The listing is from the OpenCV
computer vision librarys CUDA module and implements row filtering as it is applied
in image processing [42, 41].
Listing E.1: Unmodified OpenCV CUDA example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
/M//////////////////////////////////////////////////////////////////////////////
//
// IMPORTANT: READ BEFORE DOWNLOADING, COPYING, INSTALLING OR USING .
//
// By downloading , c o p y i n g , i n s t a l l i n g or u s i n g t h e s o f t w a r e you a g r e e t o t h i s
license .
//
I f you do n o t a g r e e t o t h i s l i c e n s e , do n o t download , i n s t a l l ,
// copy or u s e t h e s o f t w a r e .
//
//
//
L i c e n s e Agreement
//
For Open Source Computer V i s i o n L i b r a r y
//
// C o p y r i g h t (C) 2000 2008 , I n t e l C o r p o r a t i o n , a l l r i g h t s r e s e r v e d .
// C o p y r i g h t (C) 2009 , Willow Garage I n c . , a l l r i g h t s r e s e r v e d .
// C o p y r i g h t (C) 1993 2011 , NVIDIA C o r p o r a t i o n , a l l r i g h t s r e s e r v e d .
// Third p a r t y c o p y r i g h t s a r e p r o p e r t y o f t h e i r r e s p e c t i v e owners .
//
// R e d i s t r i b u t i o n and u s e i n s o u r c e and b i n a r y forms , w i t h or w i t h o u t m o d i f i c a t i o n ,
// a r e p e r m i t t e d p r o v i d e d t h a t t h e f o l l o w i n g c o n d i t i o n s a r e met :
//
//
R e d i s t r i b u t i o n s o f s o u r c e code must r e t a i n t h e a b o v e c o p y r i g h t n o t i c e ,
//
t h i s l i s t o f c o n d i t i o n s and t h e f o l l o w i n g d i s c l a i m e r .
//
//
R e d i s t r i b u t i o n s i n b i n a r y form must r e p r o d u c e t h e a b o v e c o p y r i g h t n o t i c e ,
//
t h i s l i s t o f c o n d i t i o n s and t h e f o l l o w i n g d i s c l a i m e r i n t h e d o c u m e n t a t i o n
//
and/ or o t h e r m a t e r i a l s p r o v i d e d w i t h t h e d i s t r i b u t i o n .
129
APPENDIX E. OPENCV KERNEL SOURCE
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
//
//
130
The name o f t h e c o p y r i g h t h o l d e r s may n o t be used t o e n d o r s e or promote

products
d e r i v e d from t h i s s o f t w a r e w i t h o u t s p e c i f i c p r i o r w r i t t e n p e r m i s s i o n .
//
//
// This s o f t w a r e i s p r o v i d e d by t h e c o p y r i g h t h o l d e r s and c o n t r i b u t o r s as i s and
// any e x p r e s s or i m p l i e d w a r r a n t i e s , i n c l u d i n g , b u t n o t l i m i t e d to , t h e i m p l i e d
// w a r r a n t i e s o f m e r c h a n t a b i l i t y and f i t n e s s f o r a p a r t i c u l a r p u r p o s e a r e d i s c l a i m e d
.
// In no e v e n t s h a l l t h e I n t e l C o r p o r a t i o n or c o n t r i b u t o r s be l i a b l e f o r any d i r e c t ,
// i n d i r e c t , i n c i d e n t a l , s p e c i a l , exemplary , or c o n s e q u e n t i a l damages
// ( i n c l u d i n g , b u t n o t l i m i t e d to , procurement o f s u b s t i t u t e g o o d s or s e r v i c e s ;
// l o s s o f use , data , or p r o f i t s ; or b u s i n e s s i n t e r r u p t i o n ) however c a u s e d
// and on any t h e o r y o f l i a b i l i t y , w h e t h e r i n c o n t r a c t , s t r i c t l i a b i l i t y ,
// or t o r t ( i n c l u d i n g n e g l i g e n c e or o t h e r w i s e ) a r i s i n g i n any way o u t o f
// t h e u s e o f t h i s s o f t w a r e , even i f a d v i s e d o f t h e p o s s i b i l i t y o f such damage .
//
//M/
43
44
45
46
47
48
49
#include
#include
#include
#include
#include
#include
i n t e r n a l s h a r e d . hpp
opencv2 / gpu / d e v i c e / s a t u r a t e c a s t . hpp
opencv2 / gpu / d e v i c e / vec math . hpp
opencv2 / gpu / d e v i c e / l i m i t s . hpp
opencv2 / gpu / d e v i c e / b o r d e r i n t e r p o l a t e . hpp
opencv2 / gpu / d e v i c e / s t a t i c c h e c k . hpp
50
51
52
53
54
55
namespace cv { namespace gpu { namespace d e v i c e

{
namespace r o w f i l t e r
{
#d e f i n e MAX KERNEL SIZE 32
56
57
constant
f l o a t c k e r n e l [ MAX KERNEL SIZE ] ;
58
59
60
61
62
void l o a d K e r n e l ( const f l o a t k e r n e l [ ] , i n t k s i z e )
{
c u d a S a f e C a l l ( cudaMemcpyToSymbol ( c k e r n e l , k e r n e l , k s i z e s i z e o f ( f l o a t )
) );
}
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
template <i n t KSIZE , typename T, typename D, typename B>

global
void l i n e a r R o w F i l t e r ( const DevMem2D <T> s r c , PtrStep<D> dst ,
const i n t anchor , const B brd )
{
#i f
CUDA ARCH >= 200
const i n t BLOCK DIM X = 3 2 ;
const i n t BLOCK DIM Y = 8 ;
const i n t PATCH PER BLOCK = 4 ;
const i n t HALO SIZE = 1 ;
#e l s e
const i n t BLOCK DIM X = 3 2 ;
const i n t BLOCK DIM Y = 4 ;
const i n t PATCH PER BLOCK = 4 ;
const i n t HALO SIZE = 1 ;
#e n d i f
78
79
typedef typename TypeVec<f l o a t , V e c T r a i t s <T> : : cn > : : v e c t y p e sum t ;
80
81
shared
sum t smem [ BLOCK DIM Y ] [ ( PATCH PER BLOCK + 2 HALO SIZE )
BLOCK DIM X ] ;
82
83
84
const i n t y = b l o c k I d x . y BLOCK DIM Y + t h r e a d I d x . y ;
131
i f ( y >= s r c . rows )
return ;
85
86
87
const T s r c r o w = s r c . p t r ( y ) ;
88
89
const i n t x S t a r t = b l o c k I d x . x (PATCH PER BLOCK BLOCK DIM X) +

threadIdx . x ;
90
91
// Load l e f t h a l o
#pragma u n r o l l
f o r ( i n t j = 0 ; j < HALO SIZE ; ++j )
smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + j BLOCK DIM X ] = s a t u r a t e c a s t <
sum t >(brd . a t l o w ( x S t a r t ( HALO SIZE j ) BLOCK DIM X,
src row ) ) ;
92
93
94
95
96
// Load main d a t a
#pragma u n r o l l
f o r ( i n t j = 0 ; j < PATCH PER BLOCK ; ++j )
smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + HALO SIZE BLOCK DIM X + j
BLOCK DIM X ] = s a t u r a t e c a s t <sum t >(brd . a t h i g h ( x S t a r t + j
BLOCK DIM X, s r c r o w ) ) ;
97
98
99
100
101
// Load r i g h t h a l o
#pragma u n r o l l
smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + (PATCH PER BLOCK + HALO SIZE )
BLOCK DIM X + j BLOCK DIM X ] = s a t u r a t e c a s t <sum t >(brd .
a t h i g h ( x S t a r t + (PATCH PER BLOCK + j ) BLOCK DIM X, s r c r o w ) ) ;
102
103
104
105
106
syncthreads () ;
107
108
#pragma u n r o l l
{
const i n t x = x S t a r t + j BLOCK DIM X ;
109
110
111
112
113
if (x < src . cols )

{
sum t sum = V e c T r a i t s <sum t > : : a l l ( 0 ) ;
114
115
116
117
#pragma u n r o l l
f o r ( i n t k = 0 ; k < KSIZE ; ++k )
sum = sum + smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + HALO SIZE
BLOCK DIM X + j BLOCK DIM X an cho r + k ] c k e r n e l [ k
];
118
119
120
121
d s t ( y , x ) = s a t u r a t e c a s t <D>(sum ) ;
122
123
124
125
126
127
128
129
130
131
132
template <i n t KSIZE , typename T, typename D, template<typename> c l a s s B>

void l i n e a r R o w F i l t e r c a l l e r (DevMem2D <T> s r c , DevMem2D <D> dst , i n t anchor ,
i n t cc , c u d a S t r e a m t stream )
{
i n t BLOCK DIM X ;
i n t BLOCK DIM Y ;
i n t PATCH PER BLOCK ;
133
134
135
136
137
i f ( c c >= 2 0 )
{
BLOCK DIM X = 3 2 ;
BLOCK DIM Y = 8 ;
132
PATCH PER BLOCK = 4 ;
138
}
else
{
139
140
141
BLOCK DIM X = 3 2 ;
BLOCK DIM Y = 4 ;
142
143
144
145
146
const dim3 b l o c k (BLOCK DIM X, BLOCK DIM Y) ;

const dim3 g r i d ( divUp ( s r c . c o l s , BLOCK DIM X PATCH PER BLOCK) , divUp (
s r c . rows , BLOCK DIM Y) ) ;
147
148
149
B<T> brd ( s r c . c o l s ) ;
150
151
l i n e a r R o w F i l t e r <KSIZE , T, D><<<g r i d , b l o c k , 0 , stream >>>(s r c , dst ,

anchor , brd ) ;
cudaSafeCall ( cudaGetLastError ( ) ) ;
152
153
154
i f ( stream == 0 )
cudaSafeCall ( cudaDeviceSynchronize ( ) ) ;
155
156
157
158
159
160
161
162
template <typename T, typename D>

void l i n e a r R o w F i l t e r g p u (DevMem2Db s r c , DevMem2Db dst , const f l o a t k e r n e l [ ] ,
i n t k s i z e , i n t anchor , i n t b r d t y p e , i n t cc , c u d a S t r e a m t stream )
{
typedef void ( c a l l e r t ) (DevMem2D <T> s r c , DevMem2D <D> dst , i n t anchor ,
i n t cc , c u d a S t r e a m t stream ) ;
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
s t a t i c const c a l l e r t c a l l e r s [ 5 ] [ 3 3 ] =
{
{
0,
l i n e a r R o w F i l t e r c a l l e r < 1 , T, D,
l i n e a r R o w F i l t e r c a l l e r <10 , T, D,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
BrdRowReflect101 >,
l i n e a r R o w F i l t e r c a l l e r <30 , T, D, BrdRowReflect101 >,

l i n e a r R o w F i l t e r c a l l e r <31 , T, D, BrdRowReflect101 >,
l i n e a r R o w F i l t e r c a l l e r <32 , T, D, BrdRowReflect101>
197
198
199
200
201
},
{
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
133
0,
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
caller < 1 ,
caller < 2 ,
caller < 3 ,
caller < 4 ,
caller < 5 ,
caller < 6 ,
caller < 7 ,
caller < 8 ,
caller < 9 ,
c a l l e r <10 ,
c a l l e r <11 ,
c a l l e r <12 ,
c a l l e r <13 ,
c a l l e r <14 ,
c a l l e r <15 ,
c a l l e r <16 ,
c a l l e r <17 ,
c a l l e r <18 ,
c a l l e r <19 ,
c a l l e r <20 ,
c a l l e r <21 ,
c a l l e r <22 ,
c a l l e r <23 ,
c a l l e r <24 ,
c a l l e r <25 ,
c a l l e r <26 ,
c a l l e r <27 ,
c a l l e r <28 ,
c a l l e r <29 ,
c a l l e r <30 ,
c a l l e r <31 ,
c a l l e r <32 ,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate >,
BrdRowReplicate>
0,
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
caller < 1 ,
caller < 2 ,
caller < 3 ,
caller < 4 ,
caller < 5 ,
caller < 6 ,
caller < 7 ,
caller < 8 ,
caller < 9 ,
c a l l e r <10 ,
c a l l e r <11 ,
c a l l e r <12 ,
c a l l e r <13 ,
c a l l e r <14 ,
c a l l e r <15 ,
c a l l e r <16 ,
c a l l e r <17 ,
c a l l e r <18 ,
c a l l e r <19 ,
c a l l e r <20 ,
c a l l e r <21 ,
c a l l e r <22 ,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
},
{
260
261
262
263
264
265
266
267
268
269
270
271
272
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
<23 ,
<24 ,
<25 ,
<26 ,
<27 ,
<28 ,
<29 ,
<30 ,
<31 ,
<32 ,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant >,
BrdRowConstant>
0,
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
caller < 1 ,
caller < 2 ,
caller < 3 ,
caller < 4 ,
caller < 5 ,
caller < 6 ,
caller < 7 ,
caller < 8 ,
caller < 9 ,
c a l l e r <10 ,
c a l l e r <11 ,
c a l l e r <12 ,
c a l l e r <13 ,
c a l l e r <14 ,
c a l l e r <15 ,
c a l l e r <16 ,
c a l l e r <17 ,
c a l l e r <18 ,
c a l l e r <19 ,
c a l l e r <20 ,
c a l l e r <21 ,
c a l l e r <22 ,
c a l l e r <23 ,
c a l l e r <24 ,
c a l l e r <25 ,
c a l l e r <26 ,
c a l l e r <27 ,
c a l l e r <28 ,
c a l l e r <29 ,
c a l l e r <30 ,
c a l l e r <31 ,
c a l l e r <32 ,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect >,
BrdRowReflect>
0,
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
caller < 1 ,
caller < 2 ,
caller < 3 ,
caller < 4 ,
caller < 5 ,
caller < 6 ,
caller < 7 ,
caller < 8 ,
caller < 9 ,
c a l l e r <10 ,
c a l l e r <11 ,
c a l l e r <12 ,
c a l l e r <13 ,
c a l l e r <14 ,
c a l l e r <15 ,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
},
{
273
306
134
},
{
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
linearRowFilter
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
caller
<16 ,
<17 ,
<18 ,
<19 ,
<20 ,
<21 ,
<22 ,
<23 ,
<24 ,
<25 ,
<26 ,
<27 ,
<28 ,
<29 ,
<30 ,
<31 ,
<32 ,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
T,
135
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
D,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>,
BrdRowWrap>
340
};
341
342
loadKernel ( kernel , k s i z e ) ;
343
344
c a l l e r s [ b r d t y p e ] [ k s i z e ] ( ( DevMem2D <T>) s r c , (DevMem2D <D>) dst , anchor ,

cc , stream ) ;
345
346
347
348
349
350
351
352
353
354
template void l i n e a r R o w F i l t e r g p u <uchar , f l o a t >(DevMem2Db s r c , DevMem2Db

dst , const f l o a t k e r n e l [ ] , i n t k s i z e , i n t anchor , i n t b r d t y p e , i n t cc ,
c u d a S t r e a m t stream ) ;
template void l i n e a r R o w F i l t e r g p u <uchar4 , f l o a t 4 >(DevMem2Db s r c , DevMem2Db
template void l i n e a r R o w F i l t e r g p u <s h o r t 3 , f l o a t 3 >(DevMem2Db s r c , DevMem2Db
template void l i n e a r R o w F i l t e r g p u <i n t
, f l o a t >(DevMem2Db s r c , DevMem2Db
template void l i n e a r R o w F i l t e r g p u <f l o a t , f l o a t >(DevMem2Db s r c , DevMem2Db
} // namespace r o w f i l t e r
}}} // namespace cv { namespace gpu { namespace d e v i c e
Appendix F
OpenCV Specialized Kernel Source
The following source listing is a hypothetical version of the same OpenCV GPU
kernel in Appendix E. A detailed explanation is provided in Section 4.2.
Listing F.1: Modified OpenCV CUDA example: this portion is specialized

1
2
3
4
5
6
#include
#include
#include
#include
#include
#include
7
8
9
10
11
12

{
{
constant
f l o a t c k e r n e l [ KSIZE ] ;
13
14
15
16
17
void l o a d K e r n e l ( const f l o a t k e r n e l [ ] )
{
c u d a S a f e C a l l ( cudaMemcpyToSymbol ( c k e r n e l , k e r n e l , KSIZE s i z e o f ( f l o a t )
) );
}
18
19
20
21
22
template <typename T, typename D, typename B>

global
void l i n e a r R o w F i l t e r ( const DevMem2D <T> s r c , PtrStep<D> dst ,
const B brd )
{
typedef typename TypeVec<f l o a t , V e c T r a i t s <T> : : cn > : : v e c t y p e sum t ;
23
24
shared
sum t smem [ BLOCK DIM Y ] [ ( PATCH PER BLOCK + 2 HALO SIZE )
BLOCK DIM X ] ;
25
26
const i n t y = b l o c k I d x . y BLOCK DIM Y + t h r e a d I d x . y ;
27
28
29
i f ( y >= s r c . rows )
return ;
30
136
APPENDIX F. OPENCV SPECIALIZED KERNEL SOURCE
137
const T s r c r o w = s r c . p t r ( y ) ;
31
32
const i n t x S t a r t = b l o c k I d x . x (PATCH PER BLOCK BLOCK DIM X) +

threadIdx . x ;
33
34
// Load l e f t h a l o
#pragma u n r o l l
smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + j BLOCK DIM X ] = s a t u r a t e c a s t <
sum t >(brd . a t l o w ( x S t a r t ( HALO SIZE j ) BLOCK DIM X,
src row ) ) ;
35
36
37
38
39
// Load main d a t a
#pragma u n r o l l
smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + HALO SIZE BLOCK DIM X + j
BLOCK DIM X ] = s a t u r a t e c a s t <sum t >(brd . a t h i g h ( x S t a r t + j
BLOCK DIM X, s r c r o w ) ) ;
40
41
42
43
44
// Load r i g h t h a l o
#pragma u n r o l l
smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + (PATCH PER BLOCK + HALO SIZE )
BLOCK DIM X + j BLOCK DIM X ] = s a t u r a t e c a s t <sum t >(brd .
a t h i g h ( x S t a r t + (PATCH PER BLOCK + j ) BLOCK DIM X, s r c r o w ) ) ;
45
46
47
48
49
syncthreads () ;
50
51
#pragma u n r o l l
{
const i n t x = x S t a r t + j BLOCK DIM X ;
52
53
54
55
56
if (x < src . cols )

{
sum t sum = V e c T r a i t s <sum t > : : a l l ( 0 ) ;
57
58
59
60
#pragma u n r o l l
f o r ( i n t k = 0 ; k < KSIZE ; ++k )
sum = sum + smem [ t h r e a d I d x . y ] [ t h r e a d I d x . x + HALO SIZE
BLOCK DIM X + j BLOCK DIM X ANCHOR + k ] c k e r n e l [ k
];
61
62
63
64
d s t ( y , x ) = s a t u r a t e c a s t <D>(sum ) ;
65
66
67
68
69
70
}
71
72
73
74
75
extern C {
void l i n e a r R o w F i l t e r c a l l e r (DevMem2D <T TYPENAME> s r c , DevMem2D <D TYPENAME> dst
, const f l o a t k e r n e l [ ] , c u d a S t r e a m t stream )
{
using namespace cv : : gpu : : d e v i c e : : r o w f i l t e r ;
76
77
loadKernel ( kernel ) ;
78
79
80
const dim3 b l o c k (BLOCK DIM X, BLOCK DIM Y) ;

const dim3 g r i d ( divUp ( s r c . c o l s , BLOCK DIM X PATCH PER BLOCK) , divUp ( s r c .
rows , BLOCK DIM Y) ) ;
81
82
B TYPENAME<T TYPENAME> brd ( s r c . c o l s ) ;
138
83
l i n e a r R o w F i l t e r <KSIZE , T TYPENAME, D TYPENAME><<<g r i d , b l o c k , 0 , stream >>>(

s r c , dst , anchor , brd ) ;
cudaSafeCall ( cudaGetLastError ( ) ) ;
84
85
86
i f ( stream == 0 )
cudaSafeCall ( cudaDeviceSynchronize ( ) ) ;
87
88
89
90
}
} // e x t e r n C
Listing F.2: Modified OpenCV CUDA example: this portion is compiled into the
host program
1
2
3
4
5
6
#include
#include
#include
#include
#include
#include
7
8
9
10
11
12
13
14
15
16
17
18

{
{
s t a t i c const char brdNames [ ] = {
BrdRowReflect101 ,
BrdRowReplicate ,
BrdRowConstant ,
BrdRowReflect ,
BrdRowWrap
};
19
20
21
22
23
template <typename T, typename D>

void l i n e a r R o w F i l t e r g p u (DevMem2Db s r c , DevMem2Db dst , const f l o a t k e r n e l [ ] ,
i n t k s i z e , i n t anchor , i n t b r d t y p e , i n t cc , c u d a S t r e a m t stream )
{
typedef void ( c a l l e r t ) (DevMem2D <T> s r c , DevMem2D <D> dst , const f l o a t
k e r n e l [ ] , c u d a S t r e a m t stream ) ;
24
25
26
27
i n t BLOCK DIM X ;
i n t BLOCK DIM Y ;
i n t PATCH PER BLOCK ;
28
29
30
31
32
33
34
35
36
37
38
39
40
i f ( c c >= 2 0 )
{
BLOCK DIM X = 3 2 ;
BLOCK DIM Y = 8 ;
}
else
{
BLOCK DIM X = 3 2 ;
BLOCK DIM Y = 4 ;
}
41
42
43
44
45
caller t caller = ( caller t ) specialize (

r o w f i l t e r . cu , l i n e a r R o w F i l t e r c a l l e r ,
B TYPENAME , brdNames [ b r d t y p e ] ,
T TYPENAME , t y p e t r a i t s <T> : : name ,
139
D TYPENAME , t y p e t r a i t s <D> : : name ,

KSIZE , t o S t r i n g ( k s i z e ) ,
ANCHOR , t o S t i n g ( an ch or ) ,
BLOCK DIM X , t o S t r i n g (BLOCK DIM X) ,
BLOCK DIM Y , t o S t r i n g (BLOCK DIM Y) ,
PATCH PER BLOCK , t o S t r i n g (PATCH PER BLOCK) ) ;
46
47
48
49
50
51
52
53
54
c a l l e r ( ( DevMem2D <T>) s r c , (DevMem2D <D>) dst , k e r n e l , stream ) ;
55
56
57
58
59
60
61
62
63
template void l i n e a r R o w F i l t e r
dst , const f l o a t k e r n e l [ ]
g p u <uchar , f l o a t >(DevMem2Db s r c , DevMem2Db

, i n t k s i z e , i n t anchor , i n t b r d t y p e , i n t cc ,
g p u <uchar4 , f l o a t 4 >(DevMem2Db s r c , DevMem2Db
g p u <s h o r t 3 , f l o a t 3 >(DevMem2Db s r c , DevMem2Db
g p u <i n t
, f l o a t >(DevMem2Db s r c , DevMem2Db
g p u <f l o a t , f l o a t >(DevMem2Db s r c , DevMem2Db
64
65
66
Appendix G
Sample GPU-PF Log Output
The following output is a highly abridged example of the output provided by the GPU
Prototyping Framework. The particular output show below is from the template
matching application (see Section 5.1) run on the data set for Patient 1.
Before execution of the first pipeline iteration new parameter values are registered
and propagated to resources and actions, as shown in Listing G.1. Some one-time
events, such as GPU memory allocation or transferring static data to the GPU also
occur.
Listing G.1: Initial application refresh, Part 1

[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
0 ] Updates :
0]
Zero Param Update
0]
0
0]
Data Type Update
0]
single
0]
Per Frame S c h e d u l e Update
0]
P e r i o d 1 1 ; Delay 0
0]
Per Template S c h e d u l e Update
0]
P e r i o d 0 ; Delay 0
0]
Per Frame End S c h e d u l e Update
0]
P e r i o d 1 1 ; Delay 11
0]
S h i f t S i z e X Update
0]
37
0]
S h i f t Area Update
0]
703
0]
Template Area Update
0]
2862.000000
0]
Template Dimensions Update
140
APPENDIX G. SAMPLE GPU-PF LOG OUTPUT
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
0]
[ 5 3 54 1 2 ]
0]
Template Dimensions Z Update
0]
One Time S c h e d u l e Update
0]
P e r i o d 1; Delay 0
0]
Frame Data Host Update
0]
I n i t i a l [ 0 , 0 , 0 ] to [ 8 8 , 71 , 0 ]
0]
Stride [0 , 0 , 1]
0]
Frame Data D e v i c e Extent Update
0]
[ 8 9 , 72 , 1 ] 4 (25632 bytes )
0]
Frame Data Host Extent Update
0]
[ 8 9 , 72 , 442] 4 (11329344 bytes )
0]
Template Data Extent Update
0]
[ 5 3 , 54 , 12] 4 (137376 bytes )
0]
Template Denominator Extent Update
0]
[ 1 2 , 1 , 1 ] 4 (48 bytes )
0]
Frame A v e r a g e s Extent Update
0]
[ 3 7 , 19 , 1 ] 4 (2812 bytes )
0]
Frame Denominator Extent Update
0]
[ 3 7 , 19 , 1 ] 4 (2812 bytes )
0]
Numerator Extent Update
0]
[ 7 0 3 , 12 , 1 ] 4 (33744 bytes )
0]
F i n a l M u l t i p l i c a t i o n Extent Update
0]
[ 7 0 3 , 12 , 1 ] 4 (33744 bytes )
0]
G l o b a l P i t c h e d Frame Data Update
0]
G l o b a l P i t c h e d Frame Data A l l o c a t i o n
0]
A l l o c a t e d [ 8 9 , 7 2 , 1 ] 4 , ( p512 )
0]
Memory Mappped Host Frame Data Update
0]
Memory Mappped Host Frame Data A l l o c a t i o n
0]
[ 8 9 , 72 , 442] 4
0]
Update Frame Data S u b s e t Update
0]
[ 8 9 , 72 , 1 ] 4
0]
S t a r t i n g at [ 0 , 0 , 0 ]
0]
25632 b y t e s s t r i d e
0]
Per Frame S c h e d u l e Exe Group Update
0]
Frame Data S u b s e t Extent Update
0]
[ 8 9 , 72 , 1 ] 4 (25632 bytes )
0]
Memory Copy from Host Frame Data S u b s e t t o G l o b a l P i t c h e d Frame Data
Update
0]
Per Frame S c h e d u l e Exe Group Update
0]
G l o b a l P i t c h e d Template Data Update
0]
G l o b a l P i t c h e d Template Data A l l o c a t i o n
0]
A l l o c a t e d [ 5 3 , 5 4 , 1 2 ] 4 , ( p512 )
0]
Memory Mappped Host Template Data Update
0]
Memory Mappped Host Template Data A l l o c a t i o n
0]
[ 5 3 , 54 , 12] 4
0]
Memory Copy from Host Template Data t o G l o b a l P i t c h e d Template Data
Update
0]
Copied [ 5 3 , 5 4 , 1 2 ] 4 ( 1 3 7 3 7 6 b y t e s )
0]
from Template Data <0x 7 f 4 6 4 9 4 d e 0 0 0 >
0]
t o Template Data <0x200300000>
0]
G l o b a l L i n e a r Template Denominator Update
0]
G l o b a l L i n e a r Template Denominator A l l o c a t i o n
0]
Allocated [12 , 1 , 1] 4
0]
Memory Mappped Host Template Denominator Update
0]
Memory Mappped Host Template Denominator A l l o c a t i o n
0]
[12 , 1 , 1] 4
0]
Memory Copy from Host Template Denominator t o G l o b a l L i n e a r Template
Denominator Update
0]
Copied [ 1 2 , 1 , 1 ] 4 ( 4 8 b y t e s )
0]
from Template Denominator <0x7f46494dd000>
0]
t o Template Denominator <0x200400000>
0]
G l o b a l P i t c h e d Frame A v e r a g e s Update
0]
G l o b a l P i t c h e d Frame A v e r a g e s A l l o c a t i o n
141
[
[
[
[
[
[
[
[
[
[
[
[
[
0]
0]
0]
0]
0]
0]
0]
0]
0]
0]
0]
0]
0]
142
A l l o c a t e d [ 3 7 , 1 9 , 1 ] 4 , ( p512 )
G l o b a l L i n e a r Frame Denominator Update
G l o b a l L i n e a r Frame Denominator A l l o c a t i o n
Allocated [ 3 7 , 19 , 1 ] 4
G l o b a l P i t c h e d Numerator Update
G l o b a l P i t c h e d Numerator A l l o c a t i o n
A l l o c a t e d [ 7 0 3 , 1 2 , 1 ] 4 , ( p3072 )
G l o b a l P i t c h e d F i n a l M u l t i p l i c a t i o n Update
Global Pitched Final M u l t i p l i c a t i o n A l l o c a t i o n
A l l o c a t e d [ 7 0 3 , 1 2 , 1 ] 4 , ( p3072 )
PageLocked Host F i n a l M u l t i p l i c a t i o n Update
PageLocked Host F i n a l M u l t i p l i c a t i o n A l l o c a t i o n
[ 7 0 3 , 12 , 1 ] 4
Still within the initial application refresh, the log segment in Listing G.2 shows
the compilation and loading of the numerator stage kernels.
Listing G.2: Initial application refresh, Part 2

[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
0]
Numerator P a r t s Update
0]
Numerator P a r t s A l l o c a t i o n
0]
Using system ( ) t o c a l l : nvcc c u b i n keep keepdir /home/ nmoore /
gpup f / b i n d e v i c e debug 0 o /home/ nmoore /gpup f / b i n /
c o r r 2 T i l e d N u m e r a t o r P a r t s C o m b i n e d 0 1 6 1 2 2 5 3 0 4 4 5 2 7 6 3 7 . debug . c u b i n I /home/ nmoore /gpu
p f /gpui n c l u d e DGPU DEBUG gpua r c h i t e c t u r e compute 20 gpucode sm 20
DREGULAR TILE SIZE X=16 DREGULAR TILE SIZE Y=2 DBOTTOM TILE SIZE X=5
DRIGHT TILE SIZE Y=0 DGRID DIM X=4 DGRID DIM Y=27 DSHIFT X=37 /home/ nmoore /
gpup f / apps /tm/ k e r n e l s / c o r r 2 T i l e d N u m e r a t o r P a r t s C o m b i n e d . cu
0]
Numerator R e d u c t i o n Update
0]
Numerator R e d u c t i o n A l l o c a t i o n
0]
Using system ( ) t o c a l l : nvcc c u b i n keep keepdir /home/ nmoore /
gpup f / b i n d e v i c e debug 0 o /home/ nmoore /gpup f / b i n / c o r r 2 T i l e d S u m s P i t c h e d 0 1 0 8
. debug . c u b i n I /home/ nmoore /gpup f /gpui n c l u d e DGPU DEBUG gpua r c h i t e c t u r e
compute 20 gpucode sm 20 DSUMS UNROLL COUNT=108 /home/ nmoore /gpup f / apps /tm/
k e r n e l s / c o r r 2 T i l e d S u m s P i t c h e d . cu
0]
G l o b a l P i t c h e d Numerator P a r t s Update
0]
G l o b a l P i t c h e d Numerator P a r t s A l l o c a t i o n
0]
A l l o c a t e d [ 7 0 3 , 1 0 8 , 1 2 ] 4 , ( p3072 )
0]
Numerator P a r t s S u b s e t I n f o Update
0]
I n i t i a l [ 0 , 0 , 0 ] to [ 7 0 2 , 107 , 0 ]
0]
Stride [0 , 0 , 1]
0]
Update Numerator P a r t s S u b s e t Update
0]
[ 7 0 3 , 108 , 1 ] 4
0]
S t a r t i n g at [ 0 , 0 , 0 ]
0]
331776 b y t e s s t r i d e
0]
Per Template S c h e d u l e Exe Group Update
0]
Numerator P a r t s S u b s e t Extent Update
0]
[ 7 0 3 , 108 , 1 ] 4 (303696 bytes )
0]
Numerator P a r t s Update
0]
27 r e g i s t e r s p e r t h r e a d
0]
Numerator R e d u c t i o n Update
0]
23 r e g i s t e r s p e r t h r e a d
0]
Bind t e x t u r e frameDataTex Update
0]
from module Numerator P a r t s
0]
Bound t o global Frame Data
143
The log segment in Listing G.3 shows a single pipeline iteration. The iteration
shown is a the last iteration for the current frame, resulting in a final reduction and
file I/O, as described in Section 5.1.3. The frame averages and denominator are not
executed as they run only once per frame while the numerator executes for each
template. This log segment is itself abridged, with the missing portion marked by a
string of periods.
Listing G.3: Pipeline iteration

[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
47] Iteration 47:

4 7 ] Execute : Pre E x e c u t i o n S t a g e
47]
+Per Template S c h e d u l e Exe Group
47]
Update Numerator P a r t s S u b s e t
47]
O f f s e t : 3649536
47]
S t e p s : 11
47]
47]
Template Data Walk
47]
S l i c e : 11
47]
P o i n t e r : Template Data D e v i c e S t a c k Ref Walk S l i c e 11 <0x200265400 >;
Extent : [ 5 3 , 5 4 , 1 ] 4 ( p512 )
47]
Bind t e x t u r e templateDataTex
47]
from module Numerator P a r t s
47]
Bound t o global Template Data D e v i c e S t a c k Ref
4 7 ] Execute : Data Push S t a g e
4 7 ] Execute : E x e c u t i o n S t a g e
47]
47]
Numerator F u l l S h i f t Subqueue
4 7 ] Execute : Numerator F u l l S h i f t Subqueue
47]
+Numerator Every I t r S c h e d u l e Exe Group
47]
Numerator F u l l S h i f t O f f s e t Step
47]
0
47]
Numerator P a r t s k e r n e l e x e c u t i o n
47]
( 4 27 1 ) Grid
47]
( 7 4 1 1 ) Block
47]
1048 b y t e s SM
47]
9 K e r n e l Arguments :
47]
P o i n t e r : Numerator P a r t s S u b s e t ( 0 x200b7b000 )
47]
P i t c h : Numerator P a r t s S u b s e t ( 3 0 7 2 )
47]
Integer : 0
47]
Integer : 2
47]
I n t e g e r : 16
47]
Integer : 2
47]
Integer : 5
47]
Integer : 0
47]
I n t e g e r : 37
47]
47]
47]
2
47]
47]
( 4 27 1 ) Grid
[
[
[
[
[
[
[
[
[
[
[
[
47]
47]
47]
47]
47]
47]
47]
47]
47]
47]
47]
47]
( 7 4 1 1 ) Block
1048 b y t e s SM
Integer : 2
Integer : 2
I n t e g e r : 16
Integer : 2
Integer : 5
Integer : 0
I n t e g e r : 37
............................................................
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[

47]
47]
47]
16
47]
47]
( 4 27 1 ) Grid
47]
( 7 4 1 1 ) Block
47]
1048 b y t e s SM
47]
47]
47]
47]
I n t e g e r : 16
47]
Integer : 2
47]
I n t e g e r : 16
47]
Integer : 2
47]
Integer : 5
47]
Integer : 0
47]
I n t e g e r : 37
47]
47]
( 4 27 1 ) Grid
47]
( 3 7 1 1 ) Block
47]
692 b y t e s SM
47]
47]
47]
47]
I n t e g e r : 18
47]
Integer : 1
47]
I n t e g e r : 16
47]
Integer : 2
47]
Integer : 5
47]
Integer : 0
47]
I n t e g e r : 37
47]
+Per Frame End S c h e d u l e Exe Group
47]
Numerator R e d u c t i o n k e r n e l e x e c u t i o n
47]
( 1 1 12 1 ) Grid
47]
( 6 4 1 1 ) Block
47]
0 b y t e s SM
47]
47]
P o i n t e r : Numerator ( 0 x200209000 )
47]
P o i n t e r : Numerator P a r t s ( 0 x200800000 )
47]
P i t c h : Numerator ( 3 0 7 2 )
47]
P i t c h : Numerator P a r t s ( 3 0 7 2 )
47]
I n t e g e r : 703
47]
I n t e g e r : 108
47]
F i n a l M u l t i p l y k e r n e l e x e c u t i o n
47]
( 1 1 12 1 ) Grid
47]
( 6 4 1 1 ) Block
47]
0 b y t e s SM
144
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
145
47]
47]
P o i n t e r : Numerator ( 0 x200209000 )
47]
P o i n t e r : Frame Denominator ( 0 x200600000 )
47]
P o i n t e r : F i n a l M u l t i p l i c a t i o n ( 0 x200212000 )
47]
P i t c h : Numerator ( 3 0 7 2 )
47]
Pitch : Final M u l t i p l i c a t i o n (3072)
47]
I n t e g e r : 703
4 7 ] Execute : Data P u l l S t a g e
47]
47]
Memory Copy from G l o b a l P i t c h e d F i n a l M u l t i p l i c a t i o n t o Host Pinned
Final Multiplication
47]
Copied [ 7 0 3 , 1 2 , 1 ] 4 ( 3 3 7 4 4 b y t e s )
47]
from F i n a l M u l t i p l i c a t i o n <0x200212000>
47]
t o F i n a l M u l t i p l i c a t i o n <0x200700000>
4 7 ] Execute : Post E x e c u t i o n S t a g e
47]
47]
Write f i l e /home/ nmoore /gpup f / b i n /gpuOut . b i n
Listing G.4 shows a log segment containing GPU-PF timing output, which includes the number of times and average time for each operation execution. This
segment is illustrative of the typical level of overhead resulting in granular timing.
The item titled Frame Averages Full Shift Subqueue is a GPU-PF abstraction that
wraps the four Frame Averages kernels that follow in a single object. In this run,
only main and bottom tiles existed. This produces fine-grained results for the two
kernels, as each launch is timed individually, but the cumulative overhead can be
seen in the amount of time it takes to execute the whole wrapper object. The same
effect can be seen in the stage summary, which lists a cumulative time obtained by
summing the averages, and a total time, which measures the real end-to-end stage
time.
Listing G.4: Per-operation timing

Timing f o r queue Pre E x e c u t i o n S t a g e ( 5 3 0 4 e x e c u t i o n s ) :
Update Frame Data S u b s e t : 442 e v e n t s 0 . 0 0 0 ms
Update Numerator P a r t s S u b s e t : 5304 e v e n t s 0 . 0 0 0 ms
Template Data S t a c k Sub Queue : 1 e v e n t s 0 . 1 6 0 ms
Template Data Walk : 5304 e v e n t s 0 . 0 0 0 ms
Bind t e x t u r e templateDataTex : 5304 e v e n t s 0 . 0 0 1 ms
146
Cumulative O p e r a t i o n Time : 0 . 0 0 1 ms
Avg T o t a l Queue Time : 0 . 0 0 1 ms
Timing f o r queue Data Push S t a g e ( 5 3 0 4 e x e c u t i o n s ) :

Memory Copy from Host Frame Data S u b s e t t o G l o b a l P i t c h e d Frame Data : 442 e v e n t s
0 . 0 2 3 ms
Timing f o r queue E x e c u t i o n S t a g e ( 5 3 0 4 e x e c u t i o n s ) :
Frame A v e r a g e s F u l l S h i f t Subqueue : 442 e v e n t s 0 . 2 5 2 ms
Frame A v e r a g e s Main P a r t s k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 0 8 ms
Frame A v e r a g e s Bottom P a r t s k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 0 7 ms
Frame A v e r a g e s Right P a r t s k e r n e l e x e c u t i o n : 0 e v e n t s 0 . 0 0 0 ms
Frame A v e r a g e s Corner P a r t s k e r n e l e x e c u t i o n : 0 e v e n t s 0 . 0 0 0 ms
Frame A v e r a g e s R e d u c t i o n k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 1 5 ms
Frame Denominator F u l l S h i f t Subqueue : 442 e v e n t s 0 . 2 5 0 ms
Frame Denominator Main P a r t s k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 0 8 ms
Frame Denominator Bottom P a r t s k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 0 7 ms
Frame Denominator Right P a r t s k e r n e l e x e c u t i o n : 0 e v e n t s 0 . 0 0 0 ms
Frame Denominator Corner P a r t s k e r n e l e x e c u t i o n : 0 e v e n t s 0 . 0 0 0 ms
Frame Denominator R e d u c t i o n k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 1 4 ms
Numerator F u l l S h i f t Subqueue : 5304 e v e n t s 0 . 1 6 2 ms
Numerator P a r t s k e r n e l e x e c u t i o n : 5304 e v e n t s 0 . 0 1 0 ms
Numerator R e d u c t i o n k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 4 0 ms
F i n a l M u l t i p l y k e r n e l e x e c u t i o n : 442 e v e n t s 0 . 0 0 6 ms
Timing f o r queue Data P u l l S t a g e ( 5 3 0 4 e x e c u t i o n s ) :

Memory Copy from G l o b a l P i t c h e d F i n a l M u l t i p l i c a t i o n t o Host Pinned F i n a l
M u l t i p l i c a t i o n : 442 e v e n t s 0 . 0 1 9 ms
Timing f o r queue Post E x e c u t i o n S t a g e ( 5 3 0 4 e x e c u t i o n s ) :

Write f i l e /home/ nmoore /gpup f / b i n /gpuOut . b i n : 442 e v e n t s 0 . 0 4 4 ms
The log segment in Listing G.5, here referred to as Example A, provides an

example of global timing information for each stage of the application. The first
three lines provide the end-to-end total application time, the amount of time spent
setting up the application, compiling kernels, and allocating resources, and actually
executing the application pipeline, respectively.
147
After, the Component Times provides some more granular information about
application overheads not associated with executing the application pipeline. The
Application Initialization Time includes CUDA library initialization overhead,
which can be quite large. Options Processing refers to parsing the configuration file for the application, Implementation Build refers to the application using
the GPU-PF API to construct the representation of the application, and Implementation Update refers to the argument setup and memory and other resource
allocation phase that takes place before iterative execution. The Implementation
Update Time includes kernel compilation. For this run, individual operation timing
was disabled.
Listing G.5: High-level timing, Example A

E n t i r e A p p l i c a t i o n Runtime : 1 4 0 8 3 . 3 1 6 ms
Setup Time : 1 2 3 1 0 . 7 7 2 ms
I t e r a t i v e Time : 6 6 1 . 0 0 2 ms
Component Times :
A p p l i c a t i o n I n i t i a l i z a t i o n Time : 5 5 . 0 1 8 ms
I m p l e m e n t a t i o n I n i t i a l i z a t i o n Time : 0 . 0 3 6 ms
O pt i on s P r o c e s s i n g Time : 0 . 0 4 6 ms
I m p l e m e n t a t i o n B u i l d Time : 0 . 3 0 4 ms
I m p l e m e n t a t i o n Update Time : 1 2 2 5 5 . 4 4 9 ms
I m p l e m e n t a t i o n D e a l l o c a t i o n Time : 1 . 6 1 6 ms
Listing G.6, shows GPU-PF log output for the same application as Examples A in
Listings G.5. The application was rerun without clearing the compiled GPU binary
cache so that the CUDA kernels did not have to be recompiled. The difference in
Implementation Update Time between Example A and B indicates the overhead
incurred by kernel compilation.
148
Listing G.6: High-level timing, Example B

E n t i r e A p p l i c a t i o n Runtime : 2 4 7 4 . 9 4 8 ms
Setup Time : 5 9 . 2 8 6 ms
I t e r a t i v e Time : 2 2 9 2 . 0 5 8 ms
Component Times :
I m p l e m e n t a t i o n Update Time : 3 . 6 5 7 ms
Listing G.7, shows GPU-PF log output for the same application as Examples
A and B in Listings G.5 and G.6. The application was rerun without individual
operation timing and without clearing the compiled GPU binary cache so that the
CUDA kernels did not have to be recompiled. Disable individual operation timing
removes nearly all of the timing overhead, significantly reducing overall application
run times.
Listing G.7: High-level timing, Example C

E n t i r e A p p l i c a t i o n Runtime : 8 4 4 . 9 6 7 ms
Setup Time : 6 2 . 8 0 0 ms
I t e r a t i v e Time : 6 5 8 . 5 6 5 ms
Component Times :
I m p l e m e n t a t i o n Update Time : 3 . 6 3 3 ms
Bibliography
[1] AMD Accelerated Parllel Processing (APP) SDK Website. http://developer.
amd.com/gpu/AMDAPPSDK/Pages/default.aspx, last accessed 1 April 2011.
[2] M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. A compiler framework for optimization of affine loop
nests for GPGPUs. In ICS 08: Proceedings of the 22nd annual international
conference on Supercomputing, pages 225234, New York, NY, USA, 2008. ACM.
[3] M. Bauer, H. Cook, and B. Khailany. CudaDMA: optimizing GPU memory
bandwidth via warp specialization. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis,
SC 11, pages 12:112:11, New York, NY, USA, 2011. ACM.
[4] A. Bennis. Implementing a Highly Parameterized Digital PIV System On Reconfigurable Hardware. PhD thesis, Northeastern University, 2010.
[5] A. Bennis, M. Leeser, and G. Tadmor. The effect of parameterization on a
reconfigurable implementation of PIV. In ERSA 09: Proceedings of the 2009
149
BIBLIOGRAPHY
150
International Conference on Engineering of Reconfigurable Systems and Algorithms, 2009.

[6] A. Bennis, M. Leeser, and G. Tadmor. Implementing a highly parameterized
digital PIV system on reconfigurable hardware. In Application-specific Systems,
Architectures and Processors, 2009. ASAP 2009. 20th IEEE International Conference on, pages 32 37, 2009.
[7] B. Bergen, M. Daniels, and P. Weber.
A hybrid programming model for
compressible gas dynamics using OpenCL. In Parallel Processing Workshops

(ICPPW), 2010 39th International Conference on, pages 397404, September
2010.
[8] K. J. Brown, A. K. Sujeeth, H. Lee, T. Rompf, H. Chafi, M. Odersky, and
K. Olukotun. A heterogeneous parallel framework for domain-specific languages.
In PACT, pages 89100, 2011.
[9] B. Catanzaro, M. Garland, and K. Keutzer. Copperhead: compiling an embedded data parallel language. In Proceedings of the 16th ACM symposium on
Principles and practice of parallel programming, PPoPP 11, pages 4756, New
York, NY, USA, 2011. ACM.
[10] B. Catanzaro, S. A. Kamil, Y. Lee, K. Asanovi, J. Demmel, K. Keutzer, J. Shalf,
K. A. Yelick, and A. Fox. SEJITS: Getting productivity and performance with
selective embedded JIT specialization. Technical Report UCB/EECS-2010-23,
BIBLIOGRAPHY
151
University of California at Berkeley, 2010. http://www.eecs.berkeley.edu/Pubs/

TechRpts/2010/EECS-2010-23.html.
[11] H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun. A domain-specific approach to heterogeneous parallelism. In Proceedings of
the 16th ACM symposium on Principles and practice of parallel programming,
PPoPP 11, pages 3546, New York, NY, USA, 2011. ACM.
[12] J. W. Choi, A. Singh, and R. W. Vuduc. Model-driven autotuning of sparse
matrix-vector multiply on GPUs. In PPoPP 10: Proceedings of the 15th ACM
SIGPLAN symposium on Principles and practice of parallel programming, pages
115126, New York, NY, USA, 2010. ACM.
[13] CUDA Toolkit Release Archive.
http://developer.nvidia.com/object/cuda
archive.html, last accessed 1 April 2011.

[14] Y. Cui, J. G. Dy, G. C. Sharp, B. Alexander, and S. B. Jiang. Multiple templatebased fluoroscopic tracking of lung tumor mass without implanted fiducial markers. Physics in Medicine and Biology, 52(20):622942, 2007.
[15] V. Damian, A. Sandu, M. Damian, F. Potra, and G. R. Carmichael. The kinetic
preprocessor kpp-a software environment for solving chemical kinetics. Computers & Chemical Engineering, 26(11):1567 1579, 2002.
BIBLIOGRAPHY
152
[16] Z. DeVito, N. Joubert, F. Palacios, S. Oakley, M. Medina, M. Barrientos,

E. Elsen, F. Ham, A. Aiken, K. Duraisamy, E. Darve, J. Alonso, and P. Hanrahan. Liszt: a domain specific language for building portable mesh-based pde
solvers. In Proceedings of 2011 International Conference for High Performance
Computing, Networking, Storage and Analysis, SC 11, pages 9:19:12, New
[17] G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark. Ocelot: a dynamic
optimization framework for bulk-synchronous applications in heterogeneous systems. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT 10, pages 353364, New York, NY,
USA, 2010. ACM.
[18] G. Dotzler, R. Veldema, and M. Klemm. JCudaMP: OpenMP/Java on CUDA.
In IWMSE 10: Proceedings of the 3rd International Workshop on Multicore
Software Engineering, pages 1017, New York, NY, USA, 2010. ACM.
[19] L. A. Feldkamp et al. Practical cone-beam algorithm. J. Opt. Soc. Am. A, 1(6),
1984.
[20] J. Fessler. Image Reconstruction Toolbox. University of Michigan. http://web.
eecs.umich.edu/fessler/code/index.html, last accessed 17 March 2012.
[21] R. Garg and J. N. Amaral. Compiling Python to a hybrid execution environment. In GPGPU 10: Proceedings of the 3rd Workshop on General-Purpose
BIBLIOGRAPHY
153
Computation on Graphics Processing Units, pages 1930, New York, NY, USA,
2010. ACM.
[22] D. Grewe and A. Lokhmotov. Automatically generating and tuning GPU code
for sparse matrix-vector multiplication from a high-level representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics
Processing Units, GPGPU-4, pages 12:112:8, New York, NY, USA, 2011. ACM.
[23] L. Gui and W. Merzkirch. A comparative study of the MQD method and several
correlation-based PIV evaluation algorithms. Experiments in Fluids, 28:3644,
2000. 10.1007/s003480050005.
[24] M.
Harris.
Corporation,
Optimizing Parallel Reduction in CUDA.

2701 San Tomas Expressway,
Santa Clara,
NVIDIA
CA 95050.
Available from http://developer.download.nvidia.com/compute/DevZone/C/

Projects/reduction.tar.gz, last accessed 20 May 2012.
[25] F. D. Igual, E. Chan, E. S. Quintana-Ort, G. Quintana-Ort, R. A. van de Geijn,
and F. G. V. Zee. The FLAME approach: From dense linear algebra algorithms
to high-performance multi-accelerator implementations. Journal of Parallel and
Distributed Computing (JPDC), 2011.
[26] C. Jang. Run fast anywhere: Managed platform GPGPU with OpenCL. In AMD
Fusion Developer Summit, 2012. http://golem5.org/bucket/afds2012/PL-4158
AFTER REVIEW DRAFT 052312.pdf, last accessed 24 May 2012.
BIBLIOGRAPHY
154
[27] A. Klockner, N. Pinto, Y. Lee, B. C. Catanzaro, P. Ivanov, and A. Fasih. PyCUDA: GPU run-time code generation for high-performance computing. CoRR,
abs/0911.3456, 2009. http://arxiv.org/abs/0911.3456, last accessed 24 May
2012.
[28] S. Lee and R. Eigenmann. OpenMPC: Extended OpenMP programming and
tuning for GPUs. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC
10, pages 111, Washington, DC, USA, 2010. IEEE Computer Society.
[29] A. Leung, O. Lhotak, and G. Lashari. Automatic parallelization for graphics
processing units. In PPPJ 09: Proceedings of the 7th International Conference
on Principles and Practice of Programming in Java, pages 91100, New York,
NY, USA, 2009. ACM.
[30] B. G. Levine, J. E. Stone, and A. Kohlmeyer. Fast analysis of molecular dynamics trajectories with graphics processing unitsradial distribution function
histogramming. Journal of Computational Physics, 230(9), 2011.
[31] J. C. Linford, J. Michalakes, M. Vachharajani, and A. Sandu. Multi-core acceleration of chemical kinetics for simulation and prediction. In SC 09: Proceedings
of the Conference on High Performance Computing Networking, Storage and
Analysis, pages 111, New York, NY, USA, 2009. ACM.
BIBLIOGRAPHY
155
[32] Y. Liu, E. Z. Zhang, and X. Shen. A cross-input adaptive framework for GPU
program optimizations. In IPDPS 2009: IEEE International Symposium on
Parallel & Distributed Processing, 2009, pages 110, May 2009.
[33] G. Mainland and G. Morrisett. Nikola: embedding compiled GPU functions in
Haskell. In Proceedings of the third ACM Haskell symposium on Haskell, Haskell
10, pages 6778, New York, NY, USA, 2010. ACM.
[34] MathWorks Parallel Computing Toolbox.
http://www.mathworks.com/
products/parallel-computing/, last accessed 24 May 2012.

[35] R. Nath, S. Tomov, T. T. Dong, and J. Dongarra. Optimizing symmetric dense
matrix-vector multiplication on gpus. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011.
[36] NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, CA 95050.
NVIDIA CUDA Programming Guide, 3.1 edition, June 2010. http://developer.
nvidia.com/object/cuda 3 1 downloads.html, last accessed 1 April 2011.
CUDA C Best Practices Guide, 4.1 edition, January 2012. http://developer.
nvidia.com/nvidia-gpu-computing-documentation, last accessed 20 May 2012.
NVIDIA CUDA C Programming Guide, 4.2 edition, April 2012.
http://
BIBLIOGRAPHY
156
developer.nvidia.com/nvidia-gpu-computing-documentation, last accessed 20

May 2012.
NVIDIA CUDA Toolkit v4.1 Release Notes Errata, 4.1 edition, January 2012.
Available from http://developer.nvidia.com/cuda-toolkit-41-archive, last accessed 21 May 2012.
Parallel Thread Execution ISA, 3.0 edition, February 2012. http://developer.
nvidia.com/nvidia-gpu-computing-documentation, last accessed 20 May 2012.
[41] OpenCV Repository.
Available athttp://code.opencv.org/projects/opencv/
repository/changes/trunk/opencv/modules/gpu/src/cuda/row filter.cu,
revi-
sion 7307, last accessed 20 May 2012.

[42] OpenCV Website. http://opencv.willowgarage.com/, last accessed 20 May 2012.
[43] M. Papakipos. The peakstream platform: High-productivity software development for multi-core processors. In Windows Hardware Engineering Conference
(WinHEC), 2007.
[44] S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke,
D. McAllister, M. McGuire, K. Morley, A. Robison, and M. Stich. Optix: a
BIBLIOGRAPHY
157
general purpose ray tracing engine. In SIGGRAPH 10: ACM SIGGRAPH

2010 papers, pages 113, New York, NY, USA, 2010. ACM.
[45] V. T. Ravi, W. Ma, D. Chiu, and G. Agrawal. Compiler and runtime support
for enabling generalized reduction computations on heterogeneous parallel configurations. In ICS 10: Proceedings of the 24th ACM International Conference
on Supercomputing, pages 137146, New York, NY, USA, 2010. ACM.
[46] Reconfigurable Computing Laboratory Website.
http://www.coe.neu.edu/
Research/rcl/index.php, last accessed 21 March 2012.

[47] Robot Locomotion Group Website. http://groups.csail.mit.edu/locomotion/,
last accessed 21 March 2012.
[48] G. Rudy, M. Hall, C. Chen, J. Chame, and M. M. Khan. A programming
language interface to describe transformations and code generation. In The 23rd
International Workshop on Languages and Compilers for Parallel Computing,
2010.
[49] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, and W.-m. W. Hwu. Program optimization space pruning for a multithreaded GPU. In CGO 08: Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, pages 195204, New
BIBLIOGRAPHY
[50] Siemens
Inveon
158
Platform
Website.
http://www.medical.siemens.com/
webapp/wcs/stores/servlet/CategoryDisplay?catalogId=-1&categoryId=
1029721&catTree=100010,1007660,1011525,1029715,1029721&langId=
-1&storeId=10001, last accessed 21 March 2012.
[51] A. Stivala, P. Stuckey, and A. Wirth. Fast and accurate protein substructure
searching with simulated annealing and GPUs. BMC Bioinformatics, 11(1):446,
2010.
[52] J. E. Stone, J. Saam, D. J. Hardy, K. L. Vandivort, W.-m. W. Hwu, and K. Schulten. High performance computation and interactive display of molecular orbitals
on GPUs and multi-core CPUs. In GPGPU-2: Proceedings of 2nd Workshop
on General Purpose Processing on Graphics Processing Units, pages 918, New
[53] G. Tan, L. Li, S. Triechle, E. Phillips, Y. Bao, and N. Sun. Fast implementation
of DGEMM on Fermi GPU. In Proceedings of 2011 International Conference for
High Performance Computing, Networking, Storage and Analysis, SC 11, pages
35:135:11, New York, NY, USA, 2011. ACM.
[54] V. Volkov. Better performance at lower occupancy. In GPU Technology Conference (GTC), 2010.
BIBLIOGRAPHY
159
[55] V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra.
In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC 08,
pages 31:131:11, Piscataway, NJ, USA, 2008. IEEE Press.
[56] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying gpu microarchitecture through microbenchmarking. In Performance
Analysis of Systems Software (ISPASS), 2010 IEEE International Symposium
on, pages 235 246, march 2010.
[57] H. Yu, M. Leeser, G. Tadmor, and S. Siegel. Real-time particle image velocimetry
for feedback loops using fpga implementation. In AIAA 06: Proceedings of the
43rd American Institute of Aeronautics and Astronautics Aerospace Sciences
Meeting and Exhibit, 2005.
[58] Y. Zhang and F. Mueller. Auto-generation and auto-tuning of 3D stencil codes
on GPU clusters. In International Symposium on Code Generation and Optimization (CGO), 2012.

GPUs

Загружено:

Сведения о документе

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

GPUs

Загружено:

Авторское право:

Доступные форматы

Kernel Specialization for Improved Adaptability and Performance on

Graphics Processing Units (GPUs)

c Copyright 2012 by Nicholas Moore

Recent CUDA Techniques . . . . . . . . . . . . . . . . . . . . . . . .

CUDA Development and Adaptability . . . . . . . . . . . . . . . . .

Implementation Parameters and Parameterization . . . . . . . . . . .

Run Time Code Generation . . . . . . . . . . . . . . . . . . . . . . .

Domain Specific Tools . . . . . . . . . . . . . . . . . . . . . . . . . .

Implementation and Tools . . . . . . . . . . . . . . . . . . . . . . . .

Validation and Parameterization . . . . . . . . . . . . . . . . .

Large Template Matching . . . . . . . . . . . . . . . . . . . . . . . .

Algorithm and Problem Space . . . . . . . . . . . . . . . . . .

CUDA Implementation Challenges . . . . . . . . . . . . . . .

Variable Tile Sizes and Kernel Specialization . . . . .

Particle Image Velocimetry . . . . . . . . . . . . . . . . . . . . . . . .

Cone Beam Backprojection . . . . . . . . . . . . . . . . . . . . . . . .

6 Experiments and Results

Hardware and Software Configurations . . . . . . . . . . . . .

Problem and Implementation Parameterization . . . . . . . .

Cone Beam Back Projection . . . . . . . . . . . . . .

Kernel Specialization Performance . . . . . . . . . . . . . . . .

Cone Beam Backprojection . . . . . . . . . . . . . . 106

7 Conclusions and Future Work

Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

Existing Applications . . . . . . . . . . . . . . . . . . . . . . . 118

Kernel Specialization . . . . . . . . . . . . . . . . . . . . . . . 120

B Flexibly Specializable Kernel

C Sample Run Time Evaluated PTX

D Sample Kernel Specialized PTX

E OpenCV Kernel Source

F OpenCV Specialized Kernel Source

G Sample GPU-PF Log Output

Example of a parallel reduction tree starting with eight elements. . .

Different template tile regions . . . . . . . . . . . . . . . . . . . . . .

are the matrix averages of A and B . . . . . . . . . . . . . .

is the matrix average of B, AC is the current template with the

Different template tile regions. . . . . . . . . . . . . . . . . . . . . . .

A pseudocode representation of the computation performed by each

A graphical depiction of the physical PIV configuraiton. . . . . . . . .

A graphical depiction of the terminology originally used for the FPGA

5.13 A graphical depiction of cone beam scanning setup. . . . . . . . . . .

CUDA thread block characteristics by compute capability . . . . . . .

The amount of register and shared memory available within each

streaming multiprocessor for NVIDIA GPUs of various compute capabilities [38]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Various parameter types provided by GPU-PF. . . . . . . . . . . . .

Various resource types provided by GPU-PF. . . . . . . . . . . . . . .

Various memory types provided by GPU-PF. . . . . . . . . . . . . . .

Various actions provided by GPU-PF. . . . . . . . . . . . . . . . . . .

Template matching GPU implementation parameters benchmarked. .

The PIV problem set parameters, in terms of interrogation window

PIV problem set parameters used to test the impact of interrogation

PIV GPU implementation parameters benchmarked. . . . . . . . . .

Cone beam backprojection GPU problem parameters benchmarked. .

Cone beam backprojection GPU implementation parameters benchmarked. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.10 Template matching performance results comparing the multi-threaded

6.11 PIV performance results comparing the FPGA implementation to the

A CUDA C GPU kernel designed to demonstrate the common ways

A CUDA C GPU kernel designed to demonstrate kernel specialization.

MATLAB function implementing the sliding window correlation for

mentations to the particular characteristics of the current GPU. Examples of these

1.2 & 1.3

2.0 & 2.1

Table 2.1: CUDA thread block characteristics by compute capability

Recent CUDA Techniques