
Peter Calvert

Parallelisation of Java for Graphics Processors
Computer Science Tripos, Part II
Trinity College
May 11, 2010
Proforma
Name: Peter Calvert
College: Trinity College
Project Title: Parallelisation of Java for Graphics Processors
Examination: Computer Science Tripos, Part II, June 2010
Word Count: 11983 words
Project Originator: Peter Calvert
Supervisors: Dr Andrew Rice and Dominic Orchard
Original Aims of the Project
The aim of the project was to allow extraction and compilation of Java virtual machine bytecode for parallel execution on graphics cards, specifically the NVIDIA CUDA framework, by both explicit and automatic means.
Work Completed
The compiler that was produced successfully extracts and compiles code from class files into CUDA C++ code, and outputs transformed classes that make use of this native code. Developers can indicate loops that should be parallelised by use of Java annotations. Loops can also be automatically detected as safe using a dependency-checking algorithm.
On benchmarks, speedups of up to a factor of 187 were measured. Evaluation
of the automatic dependency analysis showed 85% accuracy over a range of sample
code.
Special Difficulties
None.
Declaration
I, Peter Calvert of Trinity College, being a candidate for Part II of the Computer Science Tripos, hereby declare that this dissertation and the work described in it are my own work, unaided except as may be specified below, and that the dissertation does not contain material that has already been used to any substantial extent for a comparable purpose.
Signed
Date
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Project Description . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 JavaB [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Within JikesRVM [16] . . . . . . . . . . . . . . . . . . . . 3
1.3.3 JCUDA [25] . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Preparation 5
2.1 Requirements Analysis . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Development Process . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Methods of Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Development Environment . . . . . . . . . . . . . . . . . . . . . . 8
2.5 The Java Platform . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5.1 State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 NVIDIA CUDA Architecture . . . . . . . . . . . . . . . . . . . . 11
2.6.1 Thread Model . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6.2 Memory Model . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Common Compiler Analysis Techniques . . . . . . . . . . . . . . . 14
2.7.1 General Dataflow Analysis . . . . . . . . . . . . . . . . . . 14
2.7.2 Loop Detection . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7.3 Live Variable Analysis . . . . . . . . . . . . . . . . . . . . 16
2.7.4 Constant Propagation . . . . . . . . . . . . . . . . . . . . 17
2.7.5 Data Dependencies . . . . . . . . . . . . . . . . . . . . . . 17
2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 Implementation 19
3.1 Overall Implementation Structure . . . . . . . . . . . . . . . . . . 19
3.2 Internal Code Representation (ICR) . . . . . . . . . . . . . . . . . 21
3.2.1 Code Graph . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 Visitor Pattern . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Bytecode to ICR Translation . . . . . . . . . . . . . . . . 24
3.2.4 Type Inference . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Dataflow Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.1 Support for Arrays and Objects . . . . . . . . . . . . . . . 28
3.3.2 Increment Variables . . . . . . . . . . . . . . . . . . . . . . 28
3.3.3 May-Alias . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.4 Usage Information . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Loop Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Loop Trivialisation . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Kernel Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5.1 Copy In . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.2 Copy Out . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Dependency Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6.1 Annotation Based . . . . . . . . . . . . . . . . . . . . . . . 36
3.6.2 Automatic . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7 Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7.1 C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.7.2 Kernel Invocation . . . . . . . . . . . . . . . . . . . . . . . 38
3.7.3 Data Copying . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.8 Compiler Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.8.1 Feedback to the User . . . . . . . . . . . . . . . . . . . . . 41
3.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Evaluation 43
4.1 Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1 Model of Overheads . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Component Benchmarks . . . . . . . . . . . . . . . . . . . 49
4.2.3 Java Grande Benchmark Suite [7] . . . . . . . . . . . . . . 51
4.2.4 Mandelbrot Set Computation . . . . . . . . . . . . . . . . 52
4.2.5 Conway's Game of Life . . . . . . . . . . . . . . . . . . . . 53
4.2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Accuracy of Dependency Analysis . . . . . . . . . . . . . . . . . . 56
4.4 Comparison with Existing Work . . . . . . . . . . . . . . . . . . . 56
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5 Conclusions 59
5.1 Comparison with Requirements . . . . . . . . . . . . . . . . . . . 59
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.1 Further Hardware Support . . . . . . . . . . . . . . . . . . 60
5.2.2 Further Optimisations . . . . . . . . . . . . . . . . . . . . 60
5.2.3 Further Automatic Detection . . . . . . . . . . . . . . . . 61
5.3 Final Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Bibliography 63
A Dataflow Convergence Proofs 67
A.1 General Dataflow Analysis . . . . . . . . . . . . . . . . . . . . . . 67
A.2 Live Variable Analysis . . . . . . . . . . . . . . . . . . . . . . . . 68
A.3 Constant Propagation . . . . . . . . . . . . . . . . . . . . . . . . 69
B Code Generation Details 71
C Command Line Interface 73
D Sample Code Used 75
D.1 Java Grande Benchmark Suite . . . . . . . . . . . . . . . . . . . . 75
D.2 Mandelbrot Computation . . . . . . . . . . . . . . . . . . . . . . 76
D.3 Conway's Game of Life . . . . . . . . . . . . . . . . . . . . . . . . 76
E Testing Gold Standards 77
F Class Index 79
G Source Code Extract 81
H Project Proposal 83
List of Figures
1.1 Build process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1 Iterative development process. . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Development environment. . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Software model of threads under CUDA. . . . . . . . . . . . . . . . . 12
2.4 CUDA hardware architecture. . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Various examples of loops. . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Outline call graph for main classes. . . . . . . . . . . . . . . . . . . . 20
3.2 Garbage collection of unreachable blocks. . . . . . . . . . . . . . . . . 21
3.3 Unication algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Outline of kernel extraction algorithm. . . . . . . . . . . . . . . . . . 34
3.5 Form of multiple dimension kernels. . . . . . . . . . . . . . . . . . . . 35
3.6 Array and object type templates for on-GPU execution . . . . . . . . 39
4.1 Effect on copy performance (host-to-device) of single vs. multiple allocations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Comparison of measured performance with model (using CUDA SDK). 48
4.3 Values of t_d and t_h for measurements (using CUDA SDK). . . . . . 48
4.4 Fit of model (green) to component benchmarks. . . . . . . . . . . . . 50
4.5 Fit of model to Fourier Series benchmark, using previously calculated
parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6 Fit of model to Mandelbrot benchmark, using previously calculated
parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7 Speedups and overhead for Mandelbrot benchmark with fixed iteration limit (250). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.8 Speedups and overhead for Mandelbrot benchmark with fixed grid size (8000 × 8000). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.9 Overall times for simulation of Conway's Game of Life. . . . . . . . . 55
5.1 Minimum finding algorithms . . . . . . . . . . . . . . . . . . . . . . . 61
D.1 3 generations of the Game of Life. . . . . . . . . . . . . . . . . . . . . 76
List of Tables
2.1 CUDA memory spaces. . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Summary of JVM Instructions and their internal representation. . . . 22
3.2 Unication Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Tests made for each compiler state. . . . . . . . . . . . . . . . . . . . 43
4.2 Expected timings for overhead stages according to model. . . . . . . . 45
4.3 Model of overheads for component benchmark versions. . . . . . . . . 50
4.4 Model parameters, as measured using component benchmarks. . . . . 50
4.5 Speedup factors for the component benchmarks. . . . . . . . . . . . . 51
4.6 Summary of speedup factors. . . . . . . . . . . . . . . . . . . . . . . 55
4.7 Comparison of Java Grande benchmark timings with JCUDA. . . . . 56
D.1 Summary of Section 2 of the Java Grande Benchmark Suite. . . . . . 75
List of Examples
1.1 Mandelbrot Set computation (kernel highlighted) . . . . . . . . . . . 3
2.1 Example of thread divergence. . . . . . . . . . . . . . . . . . . . . . . 13
3.1 Graph for Mandelbrot computation . . . . . . . . . . . . . . . . . . . 23
3.2 UML sequence diagram for Visitor pattern operation. . . . . . . . . . 24
3.3 Basic block that causes difficulties when exporting. . . . . . . . . . . 25
3.4 Reuse of local variable locations. . . . . . . . . . . . . . . . . . . . . 27
3.5 Results from increment variable analysis computation. . . . . . . . . 28
3.6 Example inter-procedural may-alias computation. . . . . . . . . . . . 31
3.7 Non-termination of may-alias analysis. . . . . . . . . . . . . . . . . . 32
3.8 Mandelbrot control flow graph after various stages of loop detection. 33
3.9 Examples of the automatic dependency check. . . . . . . . . . . . . . 37
3.10 C++ code generation for float Cr = (x * spacing - 1.5f);. . . . 38
Acknowledgements
Much thanks is owed to everyone who has given me guidance, feedback and
encouragement throughout this project. Specically, my two supervisors, Dr
Andrew Rice and Dominic Orchard, have been invaluable in advising me at
tricky points. I owe particular thanks to Andy, who stopped me from naïvely
attempting an even more ambitious project!
CHAPTER 1
Introduction
This chapter explains the motivation for using parallel architectures, before describing the scope of this project. I also provide a short overview of other relevant work, and highlight the differences between these and the approach taken here.
1.1 Motivation
In the past, improvements in processor performance have taken the form of increased clock speeds. However, since 2002, developments have come from the use of multiple processors to solve independent parts of a problem in parallel [24]. Commodity parallel processing is now available not only as multi-core CPUs, but also graphics processors (GPUs) that allow many more threads to be executed in parallel, with the restriction that they share a program counter, i.e. single instruction multiple data (SIMD).
Unfortunately, most existing code is sequential, so the performance gains from executing it on parallel architectures are limited. Often, it must be rewritten to benefit. Automatic parallelisation aims to address this by analysing sequential code during compilation, and identifying regions that can be executed in parallel. However, determining whether dependencies exist between two regions of code is undecidable in the general case [15]. Therefore any analysis must be approximate to some extent, and developers may find that small changes in code result in disproportionate changes in performance. This suggests that a mix between explicit and automatic parallelism might be desirable, with detailed feedback in the automatic case being an important feature.
Figure 1.1: Build process. (Java source, Scala source, etc. are compiled by javac, scalac, etc. to bytecode plus libraries; the parallelising compiler transforms this bytecode, producing classes and native code that run on the JVM.)
1.2 Project Description
This project focuses on the data parallel, or SIMD, pattern used on graphics processors. For NVIDIA graphics devices, this is provided by extensions to C++ in their CUDA framework [20]. However, this framework and similar cross-platform APIs, such as OpenCL, operate at a low level, with developers manually handling data transfers and kernel invocations. Ports of CUDA to other languages generally still require kernels to be written in C++ (e.g. PyCUDA [14] for Python, and JCUDA [25] for Java).
This project allows these graphics processors to be used from a high level language through both explicit annotations (parallel for loops) and automatic analysis. For reasons of familiarity, I consider the Java Virtual Machine (JVM), although similar techniques could be applied to other virtual machines such as Microsoft's Common Language Runtime. By operating at the bytecode level, no modifications are made to the syntax of Java, and the compiler should work with languages other than Java that compile onto the JVM. The compiler fits in as an additional step in the build process (Figure 1.1), taking a class file (compiled bytecode) as input and producing a replacement along with any required supplementary files. For clarity, this report gives examples in Java rather than bytecode whenever possible.
One example used throughout this report is the computation of the Mandelbrot Set (Example 1.1). The parallelising compiler extracts lines 4 to 16 (highlighted) as a two dimensional kernel that can be executed in parallel on the GPU.
1.3 Related Work
Parallel computation is currently a huge research field. There have been many attempts at both intuitive frameworks and effective automatic analysis. Approaches for parallelising Java have included both static analyses similar to my work, and also direct ports of CUDA.
 1 public void compute() {
 2   for (int y = 0; y < size; y++) {
 3     for (int x = 0; x < size; x++) {
 4       float Zr = 0.0f, Zi = 0.0f;
 5       float Cr = (x * spacing - 1.5f), Ci = (y * spacing - 1.0f);
 6       float ZrN = 0, ZiN = 0;
 7       int i;
 8
 9       for (i = 0; (i < ITERATIONS) && (ZiN + ZrN <= LIMIT); i++) {
10         Zi = 2.0f * Zr * Zi + Ci;
11         Zr = ZrN - ZiN + Cr;
12         ZiN = Zi * Zi;
13         ZrN = Zr * Zr;
14       }
15
16       data[y][x] = (short) ((i * 255) / ITERATIONS);
17     }
18   }
19 }

Example 1.1: Mandelbrot Set computation (kernel highlighted)
1.3.1 JavaB [4]
Developed by Aart Bik in 1997, this work adopts a similar transformation approach to that of my project, although it targets multiple CPUs rather than GPUs. It detects regions of code that can be executed in parallel, and produces a modified class file that uses Java threads to exploit the parallelism. The detection is partially automatic, with user input to make the analysis more accurate. However, this input is at the level of "do variables x and y alias?", not "should this loop be run in parallel?", and is specified to the compiler rather than in source code.
1.3.2 Within JikesRVM [16]
This recent work (2009) implements automatic analysis within the JikesRVM virtual machine (originally Jalapeño [2]), operating on intermediate code in a similar manner to this project. It has the advantage over a compile-time approach that all applications are modified, but requires that users install a specific virtual machine. Unlike static approaches, it has access to runtime information. However, it cannot provide compile-time feedback, possibly resulting in unpredictable performance. The benchmarks were all written by the author, and therefore it is hard to know how effective the analysis might be on more typical code and full applications.
1.3.3 JCUDA [25]
This paper, also published in 2009, details a partial port of CUDA to an extended Java syntax, providing the same low level interface to invoke kernels and copy data. Kernels must still be written in C++. This gives an unusual mix of Java's high level approach with low level exposure to hardware. This project's approach of using annotations is preferable, since the source code may still be compiled with a standard Java compiler, which simply ignores the parallel annotations.

Their performance results, based on hand written CUDA versions of the Java Grande benchmarks [7], give a reference point for possible speedups (assuming similar hardware).
This work should not be confused with a library of the same name. The
jCUDA library provides access to a number of numerical routines, written using
CUDA, from within Java.
CHAPTER 2
Preparation
In order to complete this project successfully and develop a compiler that produced correct results with real benefits, it was crucial to be clear from the outset what was required and to have a sensible plan for achieving this. This chapter documents this process, and introduces the key concepts and theory on which the compiler implementation is based.
2.1 Requirements Analysis
Given the large array of possible directions that this project could have taken, there was a real need to set clear goals and requirements. For any inherently technical software, such as a compiler, it is difficult to separate what it must achieve from how this might be done. Requirements analysis aims to concentrate on the first of these, setting out goals that can be verified objectively.
The following core requirements (C1–C6) are derived from the success criteria set out in the project proposal. They are made from the perspective of a developer with no knowledge of compiler internals, and should capture their expectations. This gives the first property to which all compilers must adhere:
C1. Correctness: Application of the compiler to JVM bytecode should not affect the results of the code in any significant way.
Moving to the specific requirements on this project, the user should be able to gain tangible benefits from using the compiler relatively easily. As it is an optional step in the build process, this is required to warrant its inclusion.
C2. Performance: It must be possible to achieve improvements in execution
time by using the compiler.
C3. Usage: Any code modifications required to achieve speedups must be minimal and transparent to standard compilers. These modifications must make it possible to specify that a for loop is run in parallel. Furthermore, if multiple tightly-nested loops are specified, the inner body should be run in parallel across each of the dimensions.
For these benefits to be observed consistently, they should apply as universally as possible:
C4. Scope: Ideally, it should be possible for any JVM instructions to be executed on the graphics processor. However, GPU architectures place some restrictions on what is possible, and for the core of the project, support is restricted to use of basic arithmetic on primitive local variables and arrays. This notably excludes support for exceptions, monitors and objects.
For various reasons, code specified for parallel execution may not be executable as such. In this case, it is important that sufficient feedback is given:
C5. Feedback: There must be varying levels of output available that indicate the reasons why certain regions of code were not appropriate for parallel execution.
The final requirement on the core of the project ensures that the above can be verified objectively by the developer:
C6. Verifiable: Supplementary tools and pools of example code must be made available so that developers can evaluate the compiler objectively.
2.1.1 Extensions
The project proposal also outlines several areas where the project might be extended. These are formally set out below so that they can be assessed in the evaluation of the project.
E1. Automatic Detection of Loop Bounds: The number of iterations of a
loop should be inferred from the bytecode, without any user input.
E2. Automatic Dependency Checking: The compiler should detect, with
little help from annotations, regions of code for parallel execution.
Figure 2.1: Iterative development process. (A cycle of evaluation/design, refactoring, implementation and testing, with prototyping.)
E3. Runtime Checks: Some annotations (for example, any introduced by
E2) should be replaced with runtime checks (as in [16]) that can determine
whether to execute the kernel in parallel, or to use the original CPU code.
E4. Support for Objects on GPU: It would be useful to include object-
oriented code in parallel regions, within the scope of what the graphics
processor supports.
E5. Further Code Optimisation: Some optimisations that neither the virtual machine nor the GPU compiler can make (due to splitting the code between the CPU and GPU) should be reimplemented (e.g. loop invariant code motion).
E6. Code Transformations: In cases where code is not suitable for parallel execution, it may be possible to modify the code, perhaps by splitting loops into parallelisable and non-parallelisable chunks (loop fission) or by matching common patterns (e.g. minimum finding).
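To illustrate the loop fission transformation mentioned in E6, the following sketch (illustrative names, not the compiler's actual output) splits a loop that mixes an independent per-element update with a sequential reduction:

```java
public class FissionDemo {
    // Original loop: the array update is independent per iteration,
    // but the running sum is a loop-carried dependency.
    static long fused(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] * 2;   // parallelisable part
            sum += a[i];       // sequential part
        }
        return sum;
    }

    // After fission: the first loop has no cross-iteration dependencies
    // and is a candidate for the GPU; the reduction stays on the CPU.
    static long fissioned(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] * 2;
        }
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }
}
```

Both versions compute the same result; only the second exposes a dependency-free loop to a parallelising compiler.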
2.2 Development Process
The development process adopted for this project was based on an iterative style similar to the Spiral Model [5]. This enabled compiler stages to be tested on real class files as early in the timetable as possible. From this position, iterations consisted of the following main stages (Figure 2.1):
1. Evaluation of which stage or feature should be implemented next, based on
measurements and observations indicating which was most applicable.
2. Refactoring of existing code to allow the new feature to be integrated naturally.
3. Implementation of the new stage or feature.
4. Testing on an increasing pool of sample code, and fixing compiler bugs in order that more code could be compiled correctly.
In this way, feedback from each iteration informed future development, avoiding wasted time on unnecessary features. The process also suited the integrated testing strategy described in the next section.
A slight deviation compared with Boehm's original Spiral model is the omission of a separate prototyping stage between 1 and 2. This was primarily due to time constraints. However, some prototype code was written prior to starting the main implementation in order to experiment with suitable internal representations (Section 3.2).
In comparison, the classical Waterfall method would have delayed integrated tests until the later stages of the project, preventing benchmarks and measurements from directing design decisions such as selection of extensions.
2.3 Methods of Testing
Given the importance of maintaining correctness (Requirement C1), it seemed natural that testing should include full integration tests over the whole compiler. The first development iteration allowed a subset of JVM bytecode to be imported into an internal representation, and re-exported to a new class file. As more stages were added, these tests were rerun (i.e. regression testing) to ensure that correctness was maintained, and new samples were added to test new features.
The integration tests consisted of a range of self-testing Java code to test the compiler from both a black box and white box perspective. The first of these could only be done by using code written by other developers, such as the Java Grande Benchmark Suite [7]. The white box testing was done by specific examples written to cover different features of the compiler.
At a finer granularity, analysis stages of the compiler were unit tested by comparing their results, for the same pool of sample code, to a gold standard produced manually.
2.4 Development Environment
The overall development environment is presented in Figure 2.2. Here I highlight
some key aspects of this and the decisions made.
Language. The Java language was the natural choice for implementing the compiler due to familiarity, along with some use of C++ as required by CUDA.
Figure 2.2: Development environment. (Working copy on the development machine; master SVN repository on the Public Workstation Facility, replicated on every commit, via SVN over SSH with key authentication, to a duplicate repository on the SRCF, with UCS backups and the CL file server. Test machines, accessed via NFS and scp: earlybird, a shared workstation with a Core 2 Quad (2.66 GHz, 3 MB cache, 8 GB RAM) and a GeForce 9600 GT (512 MB global memory, 6 multiprocessors, 1.6 GHz); and bing, a dedicated machine with two Pentium 4s (3.20 GHz, 2 MB cache, 1 GB RAM) and a GeForce GTX 260 (896 MB global memory, 27 multiprocessors, 1.24 GHz, double precision support). Coding: Sun JDK 1.6.0.18 and NetBeans. Dissertation: LaTeX with TikZ. Evaluation: SQLite, gnuplot and Matlab.)
The availability of the ASM [6] library for reading and writing class files also influenced this decision. Note that GCC 4.3.3 was used rather than the more recent GCC 4.4 due to compatibility issues with CUDA 2.3.
Version Control. Subversion was used for storing all project files. This allowed changes to be rolled back, and code to be transferred between machines in a coherent manner. Since this dissertation was written in LaTeX, with diagrams written in TikZ and graphs produced using gnuplot and shell scripts, most binary files could be reproduced and did not need to be stored.
Backups. These were predominantly provided by the regular PWF backups made by the University Computing Service. The copy replicated on the SRCF¹ was intended to guard against accidental deletion of the master repository and to reduce downtime if the PWF became unavailable. The Computer Laboratory filespace was only used during testing, with results being transferred to the working copy immediately, and therefore did not need backing up.
Testing Hardware. Two machines with compatible graphics cards were available (earlybird and bing). Since the resources on earlybird were shared with other users and an X server, bing was generally preferred.
¹ Student Run Computing Facility (http://www.srcf.ucam.org/)
2.5 The Java Platform
The Java language and corresponding virtual machine were developed in the 1990s, and made up the first mainstream platform of their type. More recent alternatives, such as the Common Language Runtime, have used hindsight to improve the design in some areas. However, Java remains commonly used, and compilers that target the JVM are still developed by third parties for other languages.
The virtual machine is stack-based, and its instruction set can be considered RISC-like² and mostly orthogonal (i.e. each instruction is available for each type). The features below are key for this project. Java also supports garbage collection, objects, synchronisation monitors and exceptions.
Annotations. These have been available since Java 1.5 and are maintained in the compiled bytecode. They have been used widely to allow tools to modify and instrument bytecode after compilation. Source code utilising annotations also remains compatible with a standard compiler.
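A minimal sketch of how such an annotation could be declared and applied (the name @Parallel and its targets are illustrative assumptions, not the dissertation's actual annotation; RUNTIME retention is used here so that it is observable via reflection, though CLASS retention would equally keep it in the bytecode):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class AnnotationDemo {
    // A marker annotation; RetentionPolicy.CLASS (or RUNTIME) keeps it
    // in the compiled class file for post-compilation tools to read.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface Parallel { }

    // A standard Java compiler accepts this unchanged; a bytecode tool
    // can later find the annotation and transform the method.
    @Parallel
    public static void compute() { }
}
```

Because the annotation carries no semantics for javac itself, annotated source still compiles and runs normally without the parallelising compiler.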
Native Interface (JNI). This offers the facility for using code written in other languages, which may make use of system calls not abstracted by the Java libraries, at the cost of portability. JNI specifies [18] the format of shared libraries that implement native methods and the functions that allow interaction with Java objects and code.
2.5.1 State
Data within the JVM can exist in four locations (from the perspective of bytecode): the operand stack, the local variable array, static variables and the heap. All instruction operands are taken from the operand stack, and results are pushed onto this. Local variables and statics can be used to store any of the primitive datatypes³ [19]. Objects and arrays reside in the heap, and are identified by references. Monitor synchronisation support and exception handling also introduce state associated with control flow.
² Reduced instruction set computers (RISC) provide only common instructions, choosing to optimise these rather than offering more complex instructions.
³ boolean, byte, short, int, long, float, double, char and references.
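As a concrete illustration of the operand stack and local variables, a simple arithmetic statement compiles to stack operations (bytecode shown in comments, approximately as javap would print it):

```java
public class StackDemo {
    // For "int c = a + b;" in a static method, javac emits roughly:
    //   iload_0    // push local variable 0 (a) onto the operand stack
    //   iload_1    // push local variable 1 (b)
    //   iadd       // pop two ints, push their sum
    //   istore_2   // pop the sum into local variable 2 (c)
    static int add(int a, int b) {
        int c = a + b;
        return c;   // iload_2; ireturn
    }
}
```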
2.5.2 Performance
Originally, JVMs interpreted bytecode at runtime, causing very poor performance. Whilst this is still a common belief, since the introduction of Just-In-Time (JIT) compilation there have been studies suggesting that performance is comparable to that of C and C++ [17]. In some cases, the studies even show that Java can take advantage of runtime information to outperform C.
It was suggested in the late 1990s that Java might be an appropriate language for future high performance computing (HPC) applications [23]. Whilst this has never materialised, a recent study concludes that in most cases there is no reason why Java shouldn't be used for computationally expensive applications, although it does note that there are significant overheads in communication-intensive applications [3].
2.6 NVIDIA CUDA Architecture
When released in 2007, CUDA was one of the only general purpose frameworks for graphics processors. Previously, general purpose computation had to be formulated as graphics operations [22]. CUDA supports many programming constructs, including conditional and looping control flow, although it lacks support for recursion and virtual function lookups [20, Appendix B.1.4].
Operations that are invoked on the GPU are executed asynchronously from
the perspective of the CPU code. There are therefore some useful constructs
provided in the CUDA API that allow accurate timing of operations.
The framework is based on C++ with keywords for specifying whether functions should be compiled for the GPU or host, and in which memory space variables should be stored. It also adds a syntax for invoking kernels. With each new version of CUDA, the provided compiler (nvcc) moves closer to supporting all the features of C++.
2.6.1 Thread Model
The threading model exposed to software by CUDA is illustrated in Figure 2.3. Each thread must execute the same code, but is given coordinates so that it may operate on different data. The two-level approach is due to the hardware architecture, which also places limits on the dimensions of both grids and blocks.
Figure 2.3: Software model of threads under CUDA. (A grid is made up of blocks; each block is made up of threads.)

The CUDA hardware architecture is shown in Figure 2.4. Each block is assigned to a multiprocessor, which contains 8 processors each executing 4 threads. As such, 32 threads can be executed concurrently in each block, in what is called a warp. There are therefore advantages to ensuring that the number of threads in a block is a multiple of this. It is also worth noting that there is only one double precision unit per multiprocessor, and as such there is a significant penalty for performing double precision arithmetic.
Since each processor within a multiprocessor must execute the same instruc-
tions, there is a performance hit whenever threads within a single warp diverge.
This occurs when two or more threads take dierent paths through the control
ow graph, as in Example 2.1. In this case, the hardware must execute the
dierent paths sequentially.
CUDA also provides primitives for synchronization between threads, however,
these are not used in this project. Without these, the thread model can be
considered as a parallel for loop over a number of dimensions.
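This parallel-for view of a kernel launch can be pictured with a sequential stand-in. The following Python sketch is purely illustrative (the function and variable names are mine, not part of CUDA or this project): it shows how every (block, thread) coordinate pair in a 1D launch runs the same kernel, distinguished only by a global index.

```python
def launch_1d(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for a 1D CUDA launch: every (block, thread)
    coordinate runs the same kernel, distinguished only by its index."""
    for block in range(grid_dim):          # blocks in the grid
        for thread in range(block_dim):    # threads in the block
            index = block * block_dim + thread
            kernel(index, *args)

def double_element(i, src, dst):
    dst[i] = 2 * src[i]

data = [1, 2, 3, 4, 5, 6]
out = [0] * 6
launch_1d(double_element, 3, 2, data, out)   # grid of 3 blocks, 2 threads each
```

On real hardware the iterations of both loops run concurrently, which is exactly why loop bodies must be free of cross-iteration dependencies.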
2.6.2 Memory Model
The hardware model also has implications for the software memory model. As
shown in Figure 2.4, there are a variety of memory areas, each with different
properties, as summarised in Table 2.1.
When memory accesses are consecutive within a warp (i.e. thread i is reading
arr[n + i]), the hardware can coalesce these into fewer memory accesses
that utilise the full width of the memory bus.
It is worth noting that there is often less memory available on the GPU than
on the host. Computations offloaded to the GPU may therefore fail early, or be
forced to revert to CPU execution, giving a wall in performance.
[Figure: host memory connects over the PCI-e bus to device (global) memory; each of up to about 30 multiprocessors contains an instruction unit, shared memory, 8 processors with registers, and constant and texture caches.]
Figure 2.4: CUDA hardware architecture.
(Based on a figure used in various NVIDIA presentations.)
if (index & 1) s[index >> 1] = sin(in[index >> 1]);
else           c[index >> 1] = cos(in[index >> 1]);

if (index < W) s[index]     = sin(in[index]);
else           c[index - W] = cos(in[index - W]);

index ranges between 0 and 2W - 1, where W is a multiple of the warp size. The second case
runs roughly twice as fast, since there is no thread divergence (for W = 51200, the timings are
0.102ms and 0.043ms respectively).
Example 2.1: Example of thread divergence.
Memory      Location   Cached    Access       Scope                    Size (5)
Registers   On-chip    N/A (6)   Read/write   One thread               16384
Shared      On-chip    N/A (6)   Read/write   All threads in a block   16KB
Local       Off-chip   No        Read/write   One thread
Global      Off-chip   No        Read/write   All threads and host     896MB
Texture     Off-chip   Yes       Read         All threads and host
Constant    Off-chip   Yes       Read         All threads and host     64KB
Table 2.1: CUDA memory spaces.
2.7 Common Compiler Analysis Techniques
In this section, I introduce some common methods used within compilers [1], and
indicate why each is applicable.
2.7.1 General Dataflow Analysis
Dataflow analysis describes a common framework used for determining properties
of programs [13], such as which variables must be transferred to the graphics
processor (Section 2.7.3) and the behaviour of writes to variables (Sections 2.7.4
and 3.3.2).
The result of an analysis for an instruction or block of code, R(b) ∈ X, is
given by Equation 2.1, where (X, ⊑) is a complete lattice.
Definition 1. A complete lattice is a partially ordered set, in which every subset
has a unique least upper bound (its join or lub) and a unique greatest lower
bound (its meet or glb). We denote the join of the whole set as ⊤ and the meet
as ⊥.
The function children(n) is usually defined to be either the predecessor set
(forward analysis) or the successor set (backward analysis), with F_init giving the
value at entry points or exits respectively. ⊔ can be chosen either as the join
(lub) or meet (glb) operator. F_b : X → X is the transfer function that alters a
result in accordance with the instruction or block b.

    R(b) = F_b(⊔_{c ∈ children(b)} R(c))   if children(b) ≠ ∅
    R(b) = F_b(F_init)                     if children(b) = ∅        (2.1)
(5) Sizes for a GeForce GTX 260 card.
(6) Neither registers nor shared memory needs a cache, since both are accessed within a single
clock cycle.
Figure 2.5: Various examples of loops: (a) single entry/single exit; (b) multiple entries; (c) multiple exits.
Since the control flow graph may be cyclic (due to loops), R(b) must be
computed iteratively until a fixed-point solution is reached. Initially, each R(b)
is set to the least element ⊥ ∈ X. The number of iterations until convergence
depends on the order in which instructions and blocks are considered: for forward
analysis, they should be considered from start to end, and for backward analysis the
converse. For each specific dataflow analysis, it is necessary to prove that the
analysis will converge. This can be shown to be a consequence of (X, ⊑) having
finite height, and F_b being monotone. The proof of this, and also convergence of
the specific analyses that follow, is given in Appendix A.
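The iteration strategy described above can be sketched generically. This Python fragment is an illustration of the framework, not the project's code; all names are mine, and the example instantiation is a simple forward "blocks on every path" analysis with union as the join.

```python
def solve(blocks, children, transfer, join, bottom, init):
    """Iterate R(b) = F_b(join of children's results) until a fixed point."""
    result = {b: bottom for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            kids = children(b)
            value = join([result[c] for c in kids]) if kids else init
            new = transfer(b, value)
            if new != result[b]:
                result[b], changed = new, True
    return result

# Forward analysis over predecessors: collect the blocks seen on every path.
preds = {"entry": [], "a": ["entry"], "b": ["a", "entry"]}
reach = solve(["entry", "a", "b"],
              preds.__getitem__,            # children = predecessor set
              lambda b, x: x | {b},         # transfer adds the current block
              lambda xs: set().union(*xs),  # join = set union
              set(), set())
```

Convergence here follows exactly as argued above: the powerset lattice has finite height and the transfer function is monotone.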
2.7.2 Loop Detection
JVM instructions provide only unstructured control flow with branches to
arbitrary labels, and all structured information regarding loops and conditionals is
discarded at compile-time. Therefore, in order to extract loop bodies for parallel
execution, some of this structure must be reconstructed using loop detection.
Ideally, this should be done without needing user annotations (as per Requirement
C3).
Definition 2. A natural loop is defined as a loop with only a single entry point.
In general, detection is made difficult by the possibility of loops with multiple
entries and exits, as in Figure 2.5. However, by restricting detection to the case
of natural loops, a simple algorithm can be used [1, p655]. This case still includes
all loops either expressible using standard for and while constructs in high-level
languages, or suitable for GPU execution (see Section 2.6.1).
Definition 3. In a control flow graph of basic blocks, we define a block m to be
a dominator of another block n if all execution paths to n contain m.
Definition 4. A back edge is defined to be an edge whose end dominates its
start.
Since a single entry point must dominate every block in the loop body, the
edge to the entry from the end of the body must be a back edge. Therefore, each
natural loop corresponds to a back edge in the control flow graph, and can be
detected by the following simple algorithm.
Step 1 Calculate the set of dominators D(b) of each block b.
Step 2 Find any edge m → n such that n ∈ D(m). This gives a natural loop
with body from n to m.
Since each dominator of a block b must also be a dominator of all of b's
immediate predecessors, the dominator set of b, D(b) ∈ P(Blocks), can be described
by:

    D(b) = {b} ∪ ⋂_{p ∈ pred(b)} D(p)        (2.2)

This is a form of forward dataflow analysis over the lattice (P(Blocks), ⊇)
using the meet operator (i.e. set intersection) and the transfer function in
Equation 2.3. This can therefore be computed iteratively, initialising each D(b) to
the empty set ∅. Since set union is monotone, the analysis is also guaranteed to
converge.

    F_b(x) = x ∪ {b}        (2.3)

Step 2 can be performed trivially to find all loops. The set of blocks in the
loop body from n to m is given by S_n(m), defined recursively as follows:

    S_n(b) = {n}                               if b = n
    S_n(b) = {b} ∪ ⋃_{p ∈ pred(b)} S_n(p)      otherwise        (2.4)
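The two steps above can be sketched as follows. This is an illustrative Python fragment (names and the CFG encoding are mine), using the standard initialisation of dominator sets to the full block set before iterating to a fixed point, and then reading back edges off the result.

```python
def dominators(cfg, entry):
    """cfg maps each block to its successor set. Iterates
    D(b) = {b} ∪ ⋂ D(p) over predecessors p until a fixed point."""
    blocks = set(cfg)
    preds = {b: {p for p in blocks if b in cfg[p]} for b in blocks}
    dom = {b: set(blocks) for b in blocks}   # start from the full set
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks - {entry}:
            if preds[b]:
                new = {b} | set.intersection(*(dom[p] for p in preds[b]))
            else:
                new = {b}
            if new != dom[b]:
                dom[b], changed = new, True
    return dom

def natural_loops(cfg, entry):
    """Step 2: every back edge m -> n (with n dominating m) is a natural loop."""
    dom = dominators(cfg, entry)
    return [(m, n) for m in cfg for n in cfg[m] if n in dom[m]]

# A while-loop shape: entry -> header -> body -> header, header -> exit.
cfg = {"entry": {"header"}, "header": {"body", "exit"},
       "body": {"header"}, "exit": set()}
```

Running `natural_loops(cfg, "entry")` finds the single back edge from the loop body to its header.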
2.7.3 Live Variable Analysis
Definition 5. A variable is live at a given point if, on some execution path
starting from that point, the variable is read before it is written to.
Live variables [1, p608] can be calculated using backward dataflow analysis on
the lattice (P(Vars), ⊆), using the join operator (i.e. set union) and the transfer
function given in Equation 2.5, where Write(n) and Read(n) indicate the sets of
writes and reads made by an instruction n.

    F_n(x) = (x \ Write(n)) ∪ Read(n)        (2.5)
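Over a straight-line block, applying Equation 2.5 backwards from the live-out set gives the live-in set. A small illustrative sketch (names and encoding are mine, not the project's classes):

```python
def live_in(instructions, live_out):
    """instructions: list of (reads, writes) pairs in program order;
    returns the set of variables live at the block's entry."""
    live = set(live_out)
    for reads, writes in reversed(instructions):
        live = (live - set(writes)) | set(reads)   # F_n(x) = (x \ Write(n)) ∪ Read(n)
    return live

# t = a + b; c = t * t;  with only c live at the exit.
block = [({"a", "b"}, {"t"}), ({"t"}, {"c"})]
```

Here `live_in(block, {"c"})` yields {a, b}: t is killed by its own definition, while a and b must survive to the block's entry.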
2.7.4 Constant Propagation
Forward dataflow analysis can be used to determine the value of a variable at
a point in code, if it is a constant [1, p632]. For each variable v and block
b, we maintain a result R_v(b) taken from the flat lattice over constants, i.e.
({⊥} ∪ Constants ∪ {⊤}, ⊑) where:

    x ⊑ y ⇔ (x = ⊥) ∨ (x = y) ∨ (y = ⊤)        (2.6)

    R_v(b) = c ∈ Constants   if constant c is the value of v at the end of b
           = ⊤               if the value of v is not constant at the end of b
           = ⊥               if no writes are made to v before the end of b        (2.7)

This can be computed using the join operator with transfer function F_{n,v} for
variable v as follows:

    F_{n,v}(x) = c   if n assigns c to v
               = ⊤   if n writes a non-constant to v
               = x   otherwise        (2.8)
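The flat lattice makes the join very cheap: agreeing constants survive, disagreeing ones collapse to ⊤. An illustrative sketch for a single variable (the encoding of instructions as tuples is an assumption of mine):

```python
# Flat lattice: BOT = no writes yet, TOP = not a constant.
BOT, TOP = "BOT", "TOP"

def join(x, y):
    """Least upper bound in the flat lattice of Equation 2.6."""
    if x == BOT:
        return y
    if y == BOT:
        return x
    return x if x == y else TOP

def transfer(n, x):
    """Equation 2.8 for one variable; n is ('const', c), ('nonconst',)
    or ('skip',)."""
    if n[0] == "const":
        return n[1]
    if n[0] == "nonconst":
        return TOP
    return x
```

For example, joining the results of two branches that assign 3 and 4 to the same variable gives ⊤, whereas two branches both assigning 3 preserve the constant.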
2.7.5 Data Dependencies
When considering whether two instructions or regions of code can be run in
parallel, the data dependencies between them must be considered. There are three
types: true dependencies (read-after-write), anti-dependencies (write-after-read)
and output dependencies (write-after-write). We can then determine whether
there are any loop-carried dependencies that prevent the loop from being executed
in parallel. The core requirements of the project require that the programmer
will consider this before marking a loop as parallel.
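To illustrate the distinction the programmer must make (this example is mine, not taken from the project's benchmarks): the first loop below carries a true dependence across iterations and cannot be parallelised, while the second touches only iteration-local state.

```python
def prefix_sum(a):
    """Loop-carried true dependence: iteration i reads the value
    written by iteration i-1, so the iterations must run in order."""
    for i in range(1, len(a)):
        a[i] += a[i - 1]
    return a

def scale(a):
    """No loop-carried dependencies: each iteration reads and writes
    only its own element, so all iterations may run in parallel."""
    for i in range(len(a)):
        a[i] *= 2
    return a
```

Annotating the first loop as parallel would silently compute the wrong answer on a GPU; the second is safe.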
Determining dependencies automatically is desirable, but becomes hard as
soon as memory references are introduced, which Java does through objects and
arrays. The difficulty arises because writes to two distinct references can affect the
same state. Alias analysis aims to determine statically whether this may occur
at a given point in code. There are two variations of this problem: may-alias
and must-alias. For this project, may-alias is required, since by overestimating
conflicts to a memory address it will always be safe (see Section 3.3.3 for this
analysis).
In languages such as Java, with if statements, loops, dynamic storage, and
recursive data structures, alias analysis can be shown to be undecidable by
reduction to the Halting Problem [15].
2.8 Summary
This chapter has given objective and verifiable requirements for the project. The
development and testing strategy that was employed to ensure these were met has
also been outlined. Finally, brief introductions to the Java Platform, NVIDIA's
CUDA framework and some common compiler analysis techniques have been
given. It was from this base of knowledge and planning that the project was
started.
CHAPTER 3
Implementation
This chapter first outlines the overall implementation structure as well as the
central data structure. Some new analysis techniques are then introduced and
developed, before descriptions of individual compiler stages are given. Finally,
the overarching compiler tool is briefly explained.
The size of the implementation (1), despite containing a significant proportion
of boilerplate for supporting the JVM instruction set, is too large to describe in
detail. As such, this chapter gives a high-level view, identifying specific details
only when necessary. Further information is given in the appendices.
3.1 Overall Implementation Structure
The compilation process can be divided into five main stages: importing classes;
loop detection; kernel extraction, including dependency checks; code generation;
and finally exporting new class files.
The final structure of the project implementation is shown by Figure 3.1. This
gives a high-level view of class interactions, with time running (roughly) down the
page. To keep the diagram relatively simple, commonly used classes have been
left out, notably graph.* (see Section 3.2), analysis.dataflow.SimpleUsed
(see Section 3.3.4) and analysis.BlockCollector. Colour coding is used to
indicate when each class was added to the compiler.
(1) SLOCCount gives a total of 7686 lines.
[Figure: call graph over the main classes (Parallelise, bytecode.ClassFinder, bytecode.ClassImporter, bytecode.AnnotationImporter, bytecode.MethodImporter, LoopDetector, LoopTrivialiser, LoopNester, KernelExtractor, the dataflow analyses, cuda.CUDAExporter and bytecode.ClassExporter) and the external libraries JOpt Simple, ASM and NVCC.]
Figure 3.1: Outline call graph for main classes.
(Colouring denotes the development cycle on which the code was written:
1, translation between bytecode and an internal representation;
2, detection of loops for representation as structured control flow;
3, detection of loop bounds and increments to give trivial loops;
4, generation of C++ from bytecode;
5, extraction of 1D kernels based on annotations, and generation of GPU wrappers;
6, support of multiple-dimension kernels;
7, basic automatic dependency analysis.)
Figure 3.2: Garbage collection of unreachable blocks (unreachable blocks are held only by weak references, while reachable blocks are held by strong references).
3.2 Internal Code Representation (ICR)
The internal representation of classes, methods and fields under transformation
is central to the compiler. This provides similar capabilities to the Java
reflection classes, but with added support for modification. Therefore, the graph
package contains ClassNode, Method, state.Field, Annotation and Modifier
as replacements for the corresponding reflection classes. The Method class in
turn references a graph giving the implementation. It is on this graph that the
compiler analyses and transformations act.
3.2.1 Code Graph
The implementation graphs are made up of two main types of block: basic blocks
and loops. For a block b, the notation pred(b) is used to denote its immediate
predecessors in the graph, and succ(b) its successors.
Definition 6. A basic block is a sequence of instructions i_1, ..., i_n where only the
first instruction may have multiple predecessors (|pred(i_k)| = 1, for 1 < k ≤ n),
and only the last may have multiple successors (|succ(i_k)| = 1, for 1 ≤ k < n).
In my implementation, successors are represented by a standard set. However,
in order to minimise the housekeeping required when modifying the graph, the
predecessor set is stored internally as a weakly referenced list (util.WeakList);
whenever it is accessed, it is returned as a standard set. By using weak
references, any code that becomes unreachable can be garbage-collected, as shown
in Figure 3.2. A list is used internally to count how many links exist from each
predecessor, making it easy to update. For example, a switch instruction might
branch to the same block for multiple cases; if one of these were changed, it
would otherwise be necessary to determine whether to modify the predecessor set of the
destination block.
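The weak-reference scheme can be sketched as follows. This Python fragment is illustrative only (util.WeakList is a Java class; the names here are mine): blocks hold strong references forwards but only weak references backwards, so a block reachable solely through predecessor links can be collected.

```python
import gc
import weakref

class Block:
    """Strong references to successors, weak references to predecessors,
    one weak entry per incoming edge (cf. the WeakList described above)."""
    def __init__(self, name):
        self.name = name
        self.successors = set()
        self._pred_refs = []

    def add_successor(self, other):
        self.successors.add(other)
        other._pred_refs.append(weakref.ref(self))

    def predecessors(self):
        # Collapse the internal list to a set, dropping collected blocks.
        return {r() for r in self._pred_refs if r() is not None}

a = Block("a")
b = Block("b")
a.add_successor(b)
before = {p.name for p in b.predecessors()}

del a          # no strong references to block a remain
gc.collect()   # force collection of the now-unreachable block
after = b.predecessors()
```

After the collection, b's predecessor set is empty: the unreachable block disappeared without any explicit graph maintenance.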
Result-producing instructions (Producer)
  Arithmetic: *ADD, *SUB, *MUL, *DIV, *REM, *AND, *OR, *SHL, *SHR
  Negate: *NEG
  Convert: *2*
  Constant: *CONST_*, LDC, *PUSH
  Compare: *CMP*
  ArrayLength: ARRAYLENGTH
  NewArray: ANEWARRAY
  NewMultiArray: MULTIANEWARRAY
  NewObject: NEW
  CheckCast: CHECKCAST
  InstanceOf: INSTANCEOF
  Read: *LOAD, GETSTATIC, GETFIELD
  Call: INVOKE*
Stateful instructions (Stateful)
  Write: *STORE, PUTSTATIC, PUTFIELD
  Read: *LOAD, GETSTATIC, GETFIELD
  Increment: IINC
  Call: INVOKE*
Branching instructions (Branch)
  Return: RETURN
  ValueReturn: *RETURN
  Condition: IF*
  Switch: TABLESWITCH, LOOKUPSWITCH
  Throw: ATHROW
  TryCatch: N/A
Other instructions
  Unsupported: RET, JSR (these are used for finally blocks), MONITOR*
  StackOperation: SWAP, POP, POP2, DUP, DUP2, DUP_X1, DUP_X2, DUP2_X1, DUP2_X2
Table 3.1: Summary of JVM instructions and their internal representation.
The instructions within a basic block are connected in a directed acyclic graph
that gives the dataflow representation of the code. In general, this forms a graph
rather than a tree, since each instruction can be used as an argument to multiple
other instructions. Whilst the original bytecode will have an order for all
instructions within a basic block, this ordering is only important for stateful instructions.
Therefore, each basic block also holds a timeline of stateful instructions, and a
final branch instruction. This approach sits between the common techniques of
linear lists of instructions and complete dataflow graphs. A full summary of
instructions and their internal groupings is given in Table 3.1.
Definition 7. An instruction is stateful if the time at which it is executed may
affect its result or effect.
Loops are represented by the start and end blocks for the body.
As an example, the ICR for the Mandelbrot computation (Example 1.1) is
shown in Example 3.1.
[Figure: the Mandelbrot loop body as an ICR graph, built from READ, WRITE and INC nodes over the variables x, y, i, Zr, Zi, ZrN, ZiN, Cr and Ci, arithmetic nodes (MUL, ADD, SUB, DIV, conversions) and conditional branches against the fields width, height and iterations.]
Example 3.1: Graph for Mandelbrot computation.
[Figure: be:BlockExporter calls accept(ie) on w:Write, which calls back visit(w) on ie:InstructionExporter; ie then queries getState(), and the nested a:Arithmetic node is visited in the same way via accept(ie), visit(a), getOperandA() and getOperandB().]
Example 3.2: UML sequence diagram for Visitor pattern operation.
3.2.2 Visitor Pattern
In order for other classes to traverse this structure easily, the visitor pattern [10,
p331] is utilised for both the control and dataflow graphs. The abstract classes
graph.BlockVisitor and graph.CodeVisitor emulate multiple dispatch, which
is not supported natively by the JVM. With multiple dispatch, the choice of
method to invoke is based on the runtime type of all arguments. The JVM does
support single dispatch, where the runtime type of the object, but not the
arguments, is considered. The visitor pattern makes use of this in its implementation,
as shown in Example 3.2.
In addition to the above, a decorator (analysis.CodeTraverser) is provided
for the code graph that causes a child visitor to do a depth-first traversal of a
given dataflow graph.
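The double-dispatch trick can be sketched outside Java. The class names below are illustrative (they are not the compiler's classes): each node's accept method performs the first dispatch on the node's runtime type, and the choice of visit_* method on the visitor plays the role of the second.

```python
class Write:
    def accept(self, visitor):
        return visitor.visit_write(self)       # dispatch on the node's type

class Arithmetic:
    def accept(self, visitor):
        return visitor.visit_arithmetic(self)

class OpCounter:
    """One visit_* method per node class: the second half of the dispatch."""
    def __init__(self):
        self.writes = 0
        self.arithmetic = 0

    def visit_write(self, node):
        self.writes += 1

    def visit_arithmetic(self, node):
        self.arithmetic += 1

counter = OpCounter()
for node in [Write(), Arithmetic(), Write()]:
    node.accept(counter)
```

New traversals are then added by writing new visitor classes, without touching the node classes themselves, which is exactly why the pattern suits a compiler with many analyses over one graph.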
3.2.3 Bytecode to ICR Translation
The internal code representation must be interchangeable with JVM bytecode.
The stack-based nature of the JVM makes this relatively straightforward in the
standard cases, although there are some issues that make the general case more
difficult.

double[][] arr = {{0.1, 0.2}};

(1) Bytecode:
ICONST_1
ANEWARRAY "[D"
DUP
ICONST_0
ICONST_2
NEWARRAY double
DUP
ICONST_0
LDC double 0.1d
DASTORE
DUP
ICONST_1
LDC double 0.2d
DASTORE
AASTORE
ASTORE 2

(2) Graph: two NewArray nodes feeding four Write nodes; multiple arrows out of a node imply a DUP instruction (or similar) is needed.

Example 3.3: Basic block that causes difficulties when exporting.
Rather than producing import and export code from scratch, a class-reading
library, ASM [6], was used. This provides visitor-pattern access to the files, rather
than producing any data structures. The library does this to remain lightweight
and fast for applications that can perform transformations in a single pass (i.e. that do
not need to store the bytecode). It also allows use with whatever data structures
an application might require.
For importing, the timeline and dataflow graph for a basic block can be built
in a single pass through the code using symbolic execution, with the standard
operand stack containing graph nodes rather than real results. In cases where
the operand stack is not empty at the end of a basic block (as occurs with
ternary conditionals, expr ? a : b), the values are stored as being emitted
by the block, and successor blocks are marked as accepting values of the
respective types. These values can then be accessed using the RestoreStack
pseudo-instruction.
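The symbolic-execution idea can be sketched for a toy stack machine. This is an illustration in Python with a made-up four-opcode instruction set (the real importer handles the full JVM instruction set): the operand stack holds graph nodes, reads and writes are appended to the timeline, and pure operations simply combine nodes.

```python
def import_block(instructions):
    """Build a dataflow graph for one basic block by symbolic execution:
    the operand stack holds graph nodes (tuples here) instead of values."""
    stack, timeline = [], []
    for op, *args in instructions:
        if op == "CONST":
            stack.append(("Constant", args[0]))
        elif op == "LOAD":
            node = ("Read", args[0])
            timeline.append(node)          # stateful: record its position
            stack.append(node)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(("Add", a, b))    # pure: no timeline entry
        elif op == "STORE":
            node = ("Write", args[0], stack.pop())
            timeline.append(node)
    return stack, timeline

# y = x + 1 as stack code.
stack, timeline = import_block(
    [("LOAD", "x"), ("CONST", 1), ("ADD",), ("STORE", "y")])
```

The block ends with an empty operand stack and a timeline of the two stateful instructions; the write's operand is the whole Add subgraph, so ordering among pure nodes is recovered later by traversal rather than stored explicitly.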
Unfortunately, exporting to bytecode is only easy in cases where stack operations
(e.g. DUP, POP, ...) are not required, since these are represented implicitly
by the structure of the dataflow graph rather than by individual nodes (see Example
3.3). Therefore, the compiler makes use of the correct bytecode sequence that is
seen for each basic block in the input class file by maintaining a cache (2).
(2) Using a WeakHashMap so entries are not held unnecessarily if a basic block is discarded.
In the case of code inserted by the compiler, no stack operations are required
since:
- dataflow graphs form a tree (i.e. results are never used more than once, so
no DUP etc. instructions are needed);
- reads and result-producing calls occur in the timeline in the same order as
given by a depth-first search of the dataflow graph;
- results of calls are always used.
It is therefore possible to produce bytecode by performing a depth-first search of
the dataflow graph corresponding to each timeline entry in order.
It is worth noting that all code in a transformed class is exported from the
above structure. Whilst it may have been possible simply to copy unmodified
methods, or even portions of methods, from the original class, this approach
would have been less elegant, and would have required either a second pass of the original
file or storage of all the original bytecode. A consequence of this decision is that
it is necessary for all instructions (including monitor and exception operations)
to be handled by the code representation, even if they cannot be executed on a
graphics card.
3.2.4 Type Inference
Compilation to bytecode loses the majority of type information, so it is necessary
to infer types in order to copy state onto and off a graphics processor. Primitive
types are clear from the instruction used to load the value. However, reference
types can only be inferred by usage. This is achieved using a Damas-Milner style
type-checking algorithm [8]. At each instruction, we take a fresh type
corresponding to the usage and unify (Figure 3.3) this with the type maintained for
the object operated on. This process ensures that the stored type is valid for all
contexts. If unification ever fails, then this indicates that the input bytecode was
badly typed. Table 3.2 gives details of the unification operation performed for
some instructions.
Unfortunately, the existing Type class provided by ASM has a private
constructor, so could not be extended to include the unification functionality.
Therefore, graph.Type is based heavily on the ASM code, supplemented with
unification and some convenient methods for dealing with array types.
Type inference is slightly complicated by reuse of local variables: for instance,
in Example 3.4, variables i and j are likely to share a location on the local variable
stack. We can overcome this by using live variable analysis (Section 2.7.3) to
if x is a supertype of y then
    x ← y
else if y is a supertype of x then
    y ← x
else
    return failure
end if
Figure 3.3: Unification algorithm.
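The unification step can be sketched over a toy class hierarchy. This Python fragment is illustrative (the hierarchy and names are mine, and graph.Type operates on JVM type descriptors rather than strings): unifying two types narrows to the more specific one, and fails when neither is a supertype of the other.

```python
# Toy subtype hierarchy: each class maps to its direct superclass.
SUPER = {"Integer": "Number", "Number": "Object",
         "String": "Object", "Object": None}

def is_supertype(x, y):
    """True if x is y or an ancestor of y."""
    while y is not None:
        if x == y:
            return True
        y = SUPER[y]
    return False

def unify(x, y):
    """Figure 3.3: keep the more specific type; fail on unrelated types."""
    if is_supertype(x, y):
        return y
    if is_supertype(y, x):
        return x
    raise TypeError("unification failed: input bytecode badly typed")

failed = False
try:
    unify("String", "Integer")     # unrelated: no common usage type
except TypeError:
    failed = True
```

Narrowing is the right direction here: an object used both where a Number and where an Integer is expected must actually be an Integer, so the stored type remains valid for all contexts.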
Instruction            Unification performed
PUT/GETSTATIC          The value passed/returned must unify with the type of the static field.
PUT/GETFIELD           The object type must unify with the owner class of the field, and the value passed/returned must unify with the type of the field.
<T>ALOAD/<T>ASTORE     The object given must unify with an array of element type <T>.
CALL                   Each argument's type must unify with the corresponding type in the method descriptor.
Table 3.2: Unification details.
for (int i = 0; i < 10; i++) f(i);
for (int j = 0; j < 10; j++) g(j);
Example 3.4: Reuse of local variable locations.
determine the live ranges of each variable and ensure that the types across a
range are consistent. Since the unification algorithm is simple and not
time-consuming, these unification steps are integrated into the live variable analysis
code. Thus, at the end of each method import, live variable analysis is performed
on the code to infer the types.
3.3 Dataflow Analysis
A general framework for dataflow analysis was outlined in Section 2.7.1. Here,
the specific dataflow analyses that were developed for use in the compiler are
described.

i++;              R_i = 1, R_j = 0
j++;              R_i = 1, R_j = 1
if(...)
    i += 3;       R_i = 4, R_j = 1
else
    i += 2;       R_i = 3, R_j = 1
    j = i + 10;   R_i = 3, R_j = ⊤
    i++;          R_i = 4, R_j = ⊤
(after the join)  R_i = 4, R_j = ⊤

Example 3.5: Results from increment variable analysis computation.
3.3.1 Support for Arrays and Objects
The live variable analysis previously given explicitly excludes array and object
accesses. However, for analysis of JVM bytecode, this is insufficient. The simple
approach taken here defines the transfer function F_n such that array and object
variables become live when any of their elements or fields are either read or written.
The only way the variable can stop being live is if it is directly assigned a value
(e.g. a new array or object reference). This ensures safety.
3.3.2 Increment Variables
This analysis returns information about integer-typed variables for which it is
possible to statically determine the effect of a region of code. The result for each
variable is taken from a flat lattice over integers (Z ∪ {⊤}, ⊑) with:

    x ⊑ y ⇔ (x = y) ∨ (y = ⊤)        (3.1)

The result for a variable v at the end of a block b, R_v(b), has the behaviour
described by Equation 3.2 (also see Example 3.5).

    R_v(b) = n ∈ Z   if the overall effect on v is to increment by n
           = ⊤       if v is written to in a more complex manner        (3.2)

Note that this also includes decrement variables (i.e. n < 0). The results
can be calculated using forward dataflow analysis with the join operator (least
upper bound) and a transfer function as defined below. Each R_v(b) is initialised
to 0.

    F_{n,v}(X) = X + i   if n increments v by i and X ∈ Z
               = ⊤       if n writes to v in a more complex manner
               = X       otherwise        (3.3)

Theorem 1. Iterative computation of increment variables converges.
Proof. Since our lattice does not contain ⊥, we must adopt a different style of
proof. Suppose the analysis does not terminate; then there must be a loop which
increments a variable v. However, there must be an entry point, and for the
outermost loop this gives a fixed increment for v. Therefore, the join on entering the
loop will give ⊤ for v, and since ∀n, v. F_{n,v}(⊤) = ⊤, the analysis must terminate.
Hence, we have a contradiction, and our assumption of non-convergence must be
incorrect.
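Over straight-line code the transfer function composes directly. A small illustrative sketch of Equation 3.3 for a single variable (the tuple encoding of instructions is my own):

```python
TOP = "TOP"   # the variable is written in a more complex manner

def transfer(n, x):
    """Equation 3.3: n is ('inc', i) or ('complex',); x is an int or TOP."""
    if n[0] == "inc":
        return TOP if x == TOP else x + n[1]
    return TOP

def analyse(instructions):
    """Fold the transfer function over a straight-line region."""
    x = 0                     # each R_v(b) is initialised to 0
    for n in instructions:
        x = transfer(n, x)
    return x
```

Two increments of 1 and 3 yield a net increment of 4, while any complex write poisons the result to ⊤, matching the per-line results shown in Example 3.5.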
3.3.3 May-Alias
May-alias analysis is used in the compiler to establish which variables may be
affected by a write. This is then used both to determine which variables must be
copied back off the graphics card, and also in automatically detecting
dependencies. Computing may-alias sets is the most complex analysis performed in the
compiler. The approach presented here is an approximation, flagging some cases
as inaccurate.
Whilst reference states (i.e. array elements and object fields) are represented
within the compiler as chains of reads (e.g. a[i] would first read a and then
an element), for the description here states will be considered as in Equation
3.4 (where c ∈ Call represents the return value of a call). I also define loose
states (Equation 3.5) that allow comparison ignoring array indices, and a function
(Equation 3.6) to loosen states.

    State ::= v | s | c                where v ∈ Var, s ∈ Static, c ∈ Call
            | State.f | State[expr]    where f ∈ Field        (3.4)

    LooseState ::= v | s | c           where v ∈ Var, s ∈ Static, c ∈ Call
                 | LooseState.f        where f ∈ Field
                 | LooseState[]        (3.5)
    loosen(s) = loosen(p).f    if s = p.f
              = loosen(p)[]    if s = p[expr]
              = s              otherwise        (3.6)
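Equations 3.4-3.6 can be sketched concretely. The tuple encoding below is an assumption of mine for illustration, not the compiler's representation: loosening recursively erases array index expressions while preserving field chains.

```python
# States as tuples: ("var", name), ("field", base, f), ("index", base, expr).
def loosen(s):
    """Equation 3.6: erase index expressions so states can be compared
    while ignoring array indices."""
    kind = s[0]
    if kind == "field":
        return ("field", loosen(s[1]), s[2])
    if kind == "index":
        return ("index", loosen(s[1]))      # drop the index expression
    return s                                # variables, statics, call results

# a.next[i + 1] loosens to a.next[]
state = ("index", ("field", ("var", "a"), "next"), "i + 1")
```

Two writes through statically different index expressions therefore map to the same loose state, which is exactly the conservative overestimate the safety argument requires.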
Forward dataflow analysis can then compute, for each block b, a result M(b),
where M(b)(s) gives the set of states which may share the same reference as s.
For example, consider the code:

    a = b; a[f(x)] = objA; b[g(x)] = objB;

Statically, f(x) and g(x) may be unknown, so we should deduce that any
element in either array a or b could point to either objA or objB (i.e.
a[] ↦ {objA, objB}, b[] ↦ {objA, objB}).
We use the lattice over functions (LooseState → P(State), ⊑) with ⊑ as
defined in Equation 3.7. Therefore, joins can be considered as pointwise union.
M*_m : State → P(State) (Equation 3.8) gives the closure under dereferencing of a
function m : LooseState → P(State).

    f ⊑ g ⇔ ∀s. f(s) ⊆ g(s)        (3.7)

    M*_m(s) = m(loosen(s)) ∪ {x.f | x ∈ M*_m(p)}    if s = p.f
            = m(loosen(s)) ∪ {x[e] | x ∈ M*_m(a)}   if s = a[e]
            = m(loosen(s))                          otherwise        (3.8)
The transfer function is defined in Equation 3.9, where a ←τ b indicates a
write to a of a value b of type τ. The five different cases will be referred to as A to E.

    F_n(m) = λy.  Recurse(m, c)     if n = c and y = c                                  (A)
                  M*_m(x)           if n = (v ←ref x) and y = v                         (B)
                  M*_m(x) ∪ m(y)    if n = (a[·] ←ref x) and ∃a′ ∈ M*_m(a). y = a′[]    (C)
                  M*_m(x) ∪ m(y)    if n = (o.f ←ref x) and ∃o′ ∈ M*_m(o). y = o′.f     (D)
                  m(y)              otherwise                                           (E)
        (3.9)
The initial value, M_init, at the entry of a code graph must be provided, and
should indicate which states might alias.
We must also maintain a set of states R, from the lattice (P(State), ⊆), that
contains all states which might be returned from a method (3).
(3) Note that R is associated with the function rather than with any particular block.
int x = ...;                  m = {}, R = {}                                        (B)
List temp;                    m = {temp ↦ {temp}}, R = {}
List[] temp2 = new List[1];   m = {temp2 ↦ {new₀}}, R = {}                          (B)
List[] data = ...;            m = {data ↦ {data}}, R = {}                           (B)
temp = data[0];               m = {..., temp ↦ {data[0]}}, R = {}                   (B)
temp = data[x];               m = {..., temp ↦ {data[x]}}, R = {}                   (B)
temp2[0] = data[100];         m = {..., new₀[] ↦ {data[100]}}, R = {}               (C)
temp2[0] = data[x];           m = {..., new₀[] ↦ {data[100], data[x]}}, R = {}      (C)
return f(temp, temp2[0]);     m = {...}, R = {data[100], data[x]}                   (A)

List f(List a, List b)        M_init = {a ↦ {data[x]}, b ↦ {data[100], data[x]}}
  if(Math.sqrt(4.0) < 4.0)    m = {...}, R = {}
    return a;                 m = {...}, R = {data[x]}
  else
    return b;                 m = {...}, R = {data[x], data[100]}

The case of F_n that is applied is given on the right-hand side.
Example 3.6: Example inter-procedural may-alias computation.
G_n below computes R, using the current m : LooseState → P(State) as context.

    G_n(m, R) = R ∪ M*_m(s)   if n = RETURN(s)
              = R             otherwise        (3.10)

This allows Recurse(m, c) to be defined as R from recursive analysis on f
(where c = f(a_0, ..., a_n)), with M_init given by Equation 3.11. However, no alias
information other than R is returned, so if the function contains reference writes
(i.e. x ←ref y) then the analysis must be marked inaccurate.

    M_init(s) = M*_m(a_i)   if s = v_i and i ≤ n
              = M*_m(s)     if s ∈ Static
              = ∅           otherwise        (3.11)

Example 3.6 gives an example of the results achieved when the
inter-procedural case is used.
The analysis that has been described so far may not terminate (Example
3.7). Therefore, the number of iterations is bounded, and if convergence does not
occur, the analysis is flagged as inaccurate.
3.3.4 Usage Information
At various stages in the compiler, it is useful to know the set of accesses made
by a block of code. Accesses are either direct or indirect:
                      a[] ↦ {a₀}
while(a[i] != null)
  b = a[i].next;      a[] ↦ {a₀, a₀.next, ...}, b ↦ {a₀.next, a₀.next.next, ...}
  a[i] = b;           a[] ↦ {a₀, a₀.next, ...}, b ↦ {a₀.next, a₀.next.next, ...}

Example 3.7: Non-termination of may-alias analysis.
Definition 8. An access is direct if it accesses the value of a variable or static
field.
Definition 9. An access is indirect if it accesses a value in the heap, i.e. it
requires one or more dereferences. Each indirect access can be described by a list
of indices: for example, arr[i].video.data[x][y] corresponds to [i, x, y].
The class graph.dataflow.SimpleUsed collects sets of state for the categories:
variables used, statics used and state directly written. It also collects a
set of classes used. This is done simply by unioning across all instructions (i.e.
nothing is ever removed from these sets).
The case of indirect accesses is much harder to compute, due to the effects of
aliasing. Therefore, the may-alias analysis described in the previous section is
used to form sets of all state that could have been written to or read from. This
is all done within the graph.dataflow.AliasUsed class.
3.4 Loop Detection
Loop detection is done in three stages: natural loop detection, loop trivialisation and loop nesting. Example 3.8 shows the effect of these. The first is implemented as a version of the algorithms in Section 2.7.2, restricted to cases with both a single entry and single exit. This corresponds to the style of loops that can be executed in parallel on GPUs (see Section 2.6.1). Loop nesting is done by checking whether a loop is contained in the body of another.
3.4.1 Loop Trivialisation
In order to execute a loop on a graphics processor, it is necessary that the dimensions and limits of the loop can be determined. The compiler detects these automatically for trivial loops as defined below. The definition is more inclusive than that used in JavaB [4], with positive or negative increments to the loop variable permitted anywhere in the loop body.
[Control flow graphs omitted: (1) Before Loop Detection, (2) After Loop Detection, (3) After Loop Trivialisation. Numbers are unique identifiers for blocks. Red dotted arrows indicate bodies of loops. Legend: Block, Natural Loop, Trivial Loop.]

Example 3.8: Mandelbrot control flow graph after various stages of loop detection. (Automatically generated by the compiler using Graphviz, http://www.graphviz.org/)
S ← root level loops
while S is not empty do
    l ← S.remove()
    if extract(l) fails then
        S.add(l.children)
    end if
end while

Figure 3.4: Outline of kernel extraction algorithm.
Definition 10. A loop is trivial if there is only a single conditional branch that exits the loop after comparing the loop index i with an expression. Furthermore, no writes can occur before the branch, and i must be an increment variable as defined by the analysis of Section 3.3.2.
Therefore a trivial loop is defined by its index, its limit and a mapping between its increment variables (of which the index must be one) and their increments. These can be detected by the increment variables analysis in Section 3.3.2, along with inspection of the exit condition, and are represented by extended loop nodes in the code graph.
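Definition 10 can be illustrated with two Java loops; both methods are illustrative examples, not code from the compiler or its samples.

```java
class TrivialLoops {
    // Trivial: the single exit branch compares the index i with an expression,
    // and i is an increment variable. The increment may appear anywhere in the
    // body, per the definition above.
    static int trivial(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Not trivial: the loop variable's update depends on the data, so the
    // analysis of Section 3.3.2 cannot classify i as an increment variable.
    static int nonTrivial(int[] a) {
        int sum = 0;
        int i = 0;
        while (i < a.length) {
            sum += a[i];
            i += a[i] > 0 ? 1 : 2;      // data-dependent step
        }
        return sum;
    }
}
```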
3.5 Kernel Extraction
In order to extract kernels from loop bodies, the tree provided by the nesting stage
must be considered, since it is not possible to extract both an outer loop and one
of its inner loops independently. In this project, outer loops are parallelised
preferentially since this minimises the number of data copies to and from the
GPU. This gives the outline algorithm for kernel extraction shown in Figure 3.4.
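The worklist of Figure 3.4 can be sketched in Java as follows; Loop and tryExtract are illustrative stand-ins for the compiler's actual loop-nesting tree and extraction routine.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class KernelExtraction {
    static class Loop {
        final String name;
        final List<Loop> children = new ArrayList<>();
        final boolean extractable;        // stands in for dependency check + codegen
        Loop(String name, boolean extractable) {
            this.name = name;
            this.extractable = extractable;
        }
    }

    // Returns the names of the loops that were extracted as kernels.
    static List<String> extractAll(List<Loop> rootLoops) {
        List<String> kernels = new ArrayList<>();
        Queue<Loop> work = new ArrayDeque<>(rootLoops);
        while (!work.isEmpty()) {
            Loop l = work.remove();
            if (tryExtract(l)) {
                kernels.add(l.name);      // outer loop preferred: children skipped
            } else {
                work.addAll(l.children);  // fall back to the inner loops
            }
        }
        return kernels;
    }

    static boolean tryExtract(Loop l) {
        return l.extractable;
    }
}
```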
For the one-dimensional case, extract(l) simply uses a dependency checker to determine whether the loop l can be run in parallel, and if so attempts to extract it. Note that an extraction may fail due to limitations of the CUDA architecture; this type of failure is handled exactly as though the dependency check failed.
For the n-dimensional case, the first level is checked as for the 1D case. For subsequent levels, it is required that there is only one loop child and also that the form in Figure 3.5 is followed before the level may be added as a further dimension of the kernel.
for (...)                            ← Outer Loop
    v = constant (v ∈ Inc_inner)     ← Checked using Constant Propagation (Section 2.7.4)
    for (...) ...                    ← Parallel Inner Loop (checked by dependency checker)
    v += constant (v ∈ Inc_outer)

Figure 3.5: Form of multiple dimension kernels.
There may be other viable approaches that don't always select the outer loop if the compiler were capable of leaving state in GPU memory between kernel invocations, as was done in [16] with multi-pass loops, but these are not considered here.
3.5.1 Copy In
The copy in state for a kernel is the set of state that must be supplied to the
GPU for kernel execution. This is the set of variables made live by the loop body
plus any dimension indices not already in this set.
3.5.2 Copy Out
Since the kernel is executed in parallel, all direct writes should be local to the kernel (i.e. not live immediately following the loop). If this were not the case, then an output dependency would exist. The copy out set is therefore given by the indirect writes set computed by analysis.dataflow.AliasUsed (Section 3.3.4). When the may-alias analysis is flagged as inaccurate, all copy in state is included in the copy out set.
3.6 Dependency Analysis
The dependency analysis portion of the compiler is used by the kernel extraction
stage (Section 3.5) to determine whether it is safe to parallelise a given loop.
Both the user annotation and automated checks implement the same interface
(DependencyCheck) so can be used interchangeably.
3.6.1 Annotation Based
Developers can use method annotations to both express explicit parallelism and
override automatic analysis. The annotation (@Parallel) has a single property
loops that takes an array of index variable names for trivial loops which should be
executed in parallel. This still requires that the corresponding loop is detected
and found to be of a trivial form. The class must have been compiled with
debugging information so that variable names are available.
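Usage of the annotation might look as follows; the annotation declaration here is a minimal sketch of its shape (the real compiler ships its own), and the loop body is illustrative.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Sketch of the @Parallel annotation's shape: a single 'loops' property
// naming the index variables of trivial loops to execute in parallel.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Parallel {
    String[] loops();
}

class Mandelbrot {
    int width = 4, height = 4;
    int[][] data = new int[4][4];

    // Both loop nests are requested for parallel execution by index name.
    @Parallel(loops = {"y", "x"})
    void compute() {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                data[y][x] = x + y;     // illustrative body
    }
}
```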
3.6.2 Automatic
This test consists of two checks to ensure there are no loop-carried dependencies:

Direct Writes. All direct writes must be to variables that are local to the loop body, i.e. the variable must not be live either at the start of the loop body, or immediately following the loop.

Indirect Writes. Momentarily ignoring the effect of aliasing, we compare each write with all accesses (including itself) to the same loose state (i.e. states that are the same ignoring array indices, see Equation 3.5). To be sure they don't access the same location on different iterations, there must be an increment variable at the same position in each list of indices (see Definition 9 of indirect accesses). The variable must also have been incremented by the same amount in each access. Several examples are given in Example 3.9.

The effects of aliasing are managed by the AliasUsed class, which expands each write to all states it may have affected. The may-alias analysis is initialised using information provided by @Restrict annotations. When marked as such, the programmer is asserting that the variable, and all references reachable from it, do not alias with any other state. If the may-alias is flagged inaccurate, then the loop is not accepted.
3.7 Code Generation
The top level algorithm for code generation deals with the difficulty inherent in code generation for CUDA, which can fail due to both unsupported instructions (e.g. exceptions, monitors and memory allocation) and calls to methods in classes not supplied to the compiler.
short[] f(short[] data, short[] dummy) {
    if (Math.sqrt(4.0) < 4.0)
        return data;
    else
        return dummy;
}

void compute() {
    short[][] dummy = new short[height][];
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            ...
            dummy[y] = data[y];
            f(dummy[y], data[y])[x] = ...;
        }
    }
}
(1) Correct Acceptance

while (i < LIMIT) {
    arr[i] = ...;
    i++;
    arr[i] = ...;
    i++;
}
(2) False Rejection

while (i < LIMIT) {
    arr[i] = ...;
    i += 2;
    arr[i] = ...;
    i--;
}
(3) Correct Rejection

Example 3.9: Examples of the automatic dependency check.
Before outputting code for any method or kernel, all of the static fields, classes and methods on which it depends (Section 3.3.4) must be exported. This is implemented by buffering all C++ code and recursing onto a new buffer whenever a call is reached. Only when a method is completely exported, along with its recursions, is its buffer flushed. As a result, some methods may be exported and then never used, since they were exported for a kernel that later failed to export.
I will now describe how the C++ code generation itself works, before moving onto describing the launcher method that is called in place of parallelisable loops to execute the kernel. Details regarding naming conventions are given in Appendix B.
Finally, an extension of Java's PrintStream (cuda.Beautifier) indents code based on the location of curly braces. This was done to facilitate debugging.
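The brace-driven indentation idea can be sketched as follows; this is an illustrative re-implementation of the concept, not the code of cuda.Beautifier.

```java
// Sketch: indentation is derived purely from curly-brace positions in the
// emitted code, two spaces per nesting level.
class Beautifier {
    static String beautify(String code) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (String raw : code.split("\n")) {
            String line = raw.trim();
            int opens = 0, closes = 0;
            for (char c : line.toCharArray()) {
                if (c == '{') opens++;
                if (c == '}') closes++;
            }
            // A line beginning with '}' is printed one level shallower.
            int indent = depth - (line.startsWith("}") ? 1 : 0);
            for (int i = 0; i < Math.max(indent, 0); i++) out.append("  ");
            out.append(line).append("\n");
            depth += opens - closes;       // net balance sets later indentation
        }
        return out.toString();
    }
}
```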
(1) Bytecode:
    ILOAD 2 (x)
    I2F
    ALOAD 0 (this)
    GETFIELD spacing:F
    FMUL
    LDC 1.5f
    FSUB
    FSTORE 5

(2) Code Graph: Read this, Read x, Read spacing, 1.5f → Write Cr (graph omitted).

(3) C++:
    const jint t0 = v2_INT;
    const Object<Data_samples_Mandelbrot> t1 = v0_2101451235;
    const jfloat t2 = DEVPTR(t1.device)->spacing;
    const jfloat t3 = (jfloat) t0;
    const jfloat t4 = t3*t2;
    const jfloat t5 = 1.5f;
    const jfloat t6 = t4-t5;
    v5_FLOAT = t6;

Example 3.10: C++ code generation for float Cr = (x * spacing - 1.5f);.
3.7.1 C++
Exporting the basic blocks to C++ is performed with a depth-first search of each timeline entry in turn. This ensures that stateful instructions are executed in the correct order, and that all arguments are generated before their use. Results from instructions (i.e. Producers, see Table 3.1, page 22) are assigned to temporary const variables. The names of these temporary variables are stored in a map so that each instruction is only visited once. An example of a basic block and its exported form is given in Example 3.10.
Control flow is exported using a combination of while, for recognised loops, and goto, for all conditionals and loops not detected.
3.7.2 Kernel Invocation
The kernel is invoked on the graphics processor using the CUDA runtime library. This requires dimensions for both the grid of blocks and the blocks themselves (see Section 2.6.1). Dimensions are chosen using the following rules and heuristics to maximise performance and ensure the execution succeeds. For each dimension i, the grid size is denoted by g_i, the block size by b_i and the number of required iterations by r_i.

1. ∏_i b_i is less than or equal to the maximum number of threads per block. This is governed by register and shared memory usage of the kernel.

2. b_1 must be a multiple of the warp size (therefore the number of threads per block will also be a multiple of the warp size), or less than a single warp.
Object<T>:              Array<T>:
    jobject object          jarray object
    T* host                 T* host
    T* device               T* device
                            jsize length

Figure 3.6: Array and object type templates for on-GPU execution.
3. b_{i+1} > 1 ⟹ b_i ≥ r_i

4. g_i = min(⌈r_i / b_i⌉, G_i), where G_i is the maximum size of the grid in dimension i.
This means that the developer does not need to consider the specification of their specific graphics card, or have knowledge of the threading model.
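For a one-dimensional kernel, rules 2 and 4 can be sketched as follows; the hardware constants are illustrative values, not those of any specific card, and the multi-dimensional rules 1 and 3 are omitted for brevity.

```java
// Sketch of the one-dimensional dimension heuristics. Constants are
// illustrative assumptions; real values come from the device specification.
class Dimensions {
    static final int WARP_SIZE = 32;
    static final int MAX_THREADS_PER_BLOCK = 512;   // limited further by register usage
    static final int MAX_GRID_SIZE = 65535;

    // Returns {gridSize, blockSize} for r required iterations (r > 0).
    static int[] choose(int r) {
        // Rule 2: block size is a multiple of the warp size, or under one warp.
        int block = Math.min(r, MAX_THREADS_PER_BLOCK);
        if (block >= WARP_SIZE) block -= block % WARP_SIZE;
        // Rule 4: grid covers the remaining iterations, capped at the maximum.
        int grid = Math.min((r + block - 1) / block, MAX_GRID_SIZE);
        return new int[] { grid, block };
    }
}
```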
3.7.3 Data Copying
Primitive types are transferred directly into the corresponding C++ types. In the case of doubles, data must first be switched to single precision if it is to be used on cards without double precision support. For arrays of doubles, a single check is made to determine whether this is necessary in order to avoid unnecessary overheads.
For reference types, the C++ types, Object<T> and Array<T> (Figure 3.6),
are dened using template meta-programming, enabling recursive types to be
built up (e.g. Array<Array<Object<struct foo> > >). The object identier
allows objects to be switched during GPU computation (for example, reversing
the rows of a 2D array), while the host pointer is used to record the location in
host memory where the object is held. It would have been possible to free this
memory while the GPU code executed, reallocating space to perform the export.
However, I felt that the further allocation overheads outweighed any benefit.
On import, each reference is placed in a map to ensure it is not imported twice. If this did occur and both copies were modified, then only one set of changes would be preserved by the export. The map is also used as a list of objects that must be exported. Without this, an object that became unreachable as a result of the kernel might not be exported, even though it may still be reachable from elsewhere in the program.
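A minimal sketch of this bookkeeping is shown below; it uses reference identity as the map key, and the device addresses are simulated rather than taken from the real allocator.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Sketch of the import map: reference identity (not equals()) decides whether
// an object has already been imported, and the map doubles as the list of
// objects to export back to Java afterwards.
class ImportMap {
    private final Map<Object, Long> imported = new IdentityHashMap<>();
    private long nextDevicePtr = 0x1000;            // simulated device addresses

    long importObject(Object o) {
        Long ptr = imported.get(o);
        if (ptr != null) return ptr;                // already imported: reuse copy
        ptr = nextDevicePtr;
        nextDevicePtr += 64;                        // pretend 64 bytes per object
        imported.put(o, ptr);
        return ptr;
    }

    int exportCount() {
        return imported.size();                     // everything imported is exported
    }
}
```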
Arrays
Arrays with primitive elements are imported using JNI functions that force the JVM to provide direct access to the array without copying the data (GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical). This avoids the need for two copies (first into a C buffer and then onto the device) at the expense of halting the virtual machine's garbage collector.
However, for arrays with reference-typed elements, each element must be read separately and then imported appropriately, causing two copy stages.
Objects
Since CUDA devices support C structures, these can be used to represent Java objects on the graphics processor. Unfortunately, populating these via the JNI API requires a function call to access each field of each object, which creates noticeable overheads for large objects or large numbers of objects.
Memory Allocation
In order to minimise the number of memory allocations required, all device memory is allocated with a single allocation, and then divided up as needed. This also results in improvements in copy performance (see Section 4.2.1).
Similarly, the host memory for an array of objects is allocated in a single batch rather than one-by-one.
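Carving regions out of a single allocation can be sketched as follows; the 8-byte alignment and the use of plain offsets in place of device pointers are assumptions made for illustration.

```java
// Sketch: one large allocation (base) is divided up by bumping an offset,
// instead of calling the allocator once per object.
class DevicePool {
    private final long base;     // address returned by the single device allocation
    private long offset = 0;

    DevicePool(long base) {
        this.base = base;
    }

    // Hands out the next region, aligned to 8 bytes (an assumption).
    long allocate(long size) {
        long region = base + offset;
        offset += (size + 7) & ~7L;
        return region;
    }

    long totalUsed() {
        return offset;
    }
}
```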
Statics
Rather than passing statics to the kernel as arguments, which must in turn be passed on to any other methods called, they are stored in CUDA's __constant__ memory. The read-only nature of this memory is not a problem, since a static will never be directly written to (Section 3.5.2). There are also possible performance gains as it allows caching by the GPU. The restricted size of __constant__ memory (64KB for the card used in development) is unlikely to be an issue, since even on 64-bit machines, Array<T> only requires 28 bytes⁴.

⁴JVM array lengths are defined as 32-bit integers even on 64-bit machines: jsize ≡ jint ≡ int.
3.8 Compiler Tool
The compiler is brought together in tools.Parallelise. This makes calls to the stages of the compiler: import, the three stages of loop detection, kernel extraction (which in turn performs dependency analysis and code generation) and finally export. A description of the available arguments and their effects is given in Appendix C. These are parsed by an open source library, jopt-simple⁵.
The compiler also invokes the CUDA compiler (nvcc) automatically, so that a
developer does not need to understand the process of producing JNI compatible
libraries from CUDA code.
3.8.1 Feedback to the User
Compiler feedback is provided at a variety of levels⁶, ranging from just fatal errors through to full debugging information. Logging messages are managed, like command line arguments, by an external library, log4j⁷. As logging is a standard problem in many applications, it was unnecessary to implement a custom set of logging classes. Log messages are categorised by the module of the compiler and a level.
As well as controlling the verbosity of messages, when the logging level is set
to debug, debugging output is added to the generated CUDA code. This then
provides information regarding the invocation sizes used (Section 3.7.2) and a
breakdown of the GPU execution time into the following stages:
1. Importing data from Java (using JNI) and allocating any extra host memory
required, as well as calculating how much device memory to allocate.
2. Allocating device memory and copying data to the GPU.
3. Executing the kernel on the GPU.
4. Copying data back from the GPU.
5. Exporting any data back to Java as required and freeing memory resources.
⁵http://jopt-simple.sourceforge.net/
⁶The possible levels are FATAL, ERROR, WARN, INFO, DEBUG and TRACE.
⁷http://logging.apache.org/log4j/

3.9 Summary
In this chapter, I have given a complete overview of the internals of the parallelising compiler. This includes the theoretical basis for the analysis, most notably the alias and automatic dependency analysis. Appendix F gives an index of where information can be found for each class in the compiler implementation. An extract of the compiler source code is given in Appendix G.
CHAPTER 4
Evaluation
This chapter evaluates the compiler and presents a model of the overheads caused
by data copying to the GPU. An objective comparison with related work in the
literature is also provided. Descriptions of all sample code, including their origins,
are given in Appendix D.
4.1 Correctness
As described in Section 2.2, the compiler was developed by the gradual introduction of stages. It was checked that all sample code (see Appendix D) supported
Unit tests were also performed for the analysis stages (Table 4.1). These
consisted of gold standards (Appendix E) for each of the sample codes that could
be compared with the given results.
For the scope of target programs defined by C4 (see Section 2.1), all tests were passed. When moving outside this scope, specifically making use of object
Compiler Stage        Tests
Loop Detection        Correct number.
Loop Trivialisation   Correct increments and bounds.
Kernel Extraction     Correct dimensions and copy in state. Safe copy out state.
Code Generation       Specific code for different aspects (e.g. objects).
Dependency Analysis   Safe results.

Table 4.1: Tests made for each compiler stage.
inheritance, the code generation and dependency analysis stages both wrongly
assume that methods are final so that they can be exported, since it is not
possible to know what classes may later extend and override these methods. The
alternative of rejecting code generation in these cases would prevent many valid
compilations, since the final keyword is often omitted, even if applicable.
4.2 Performance
The performance benefits achievable using the compiler depend on the combination of the speedup due to parallel execution on the graphics processor, and the overheads due to data copying.
The execution speedup is difficult to predict due to the differences between GPU and CPU architectures. CPU execution time depends heavily on the amount of instruction level parallelism that can be achieved through out-of-order execution. Whilst the GPU is simpler in this respect, its performance can be affected by the locality of memory accesses (due to coalescing, see Section 2.6.2) and also the runtime effect of thread divergence (see Section 2.6.1). This section therefore comments on the measured speedups rather than trying to predict them.
The overheads are more predictable, allowing a model to be developed and then tested against measurements made on the collection of sample code.
In order to achieve fair results, benchmarks were run on the dedicated machine
(bing) with the GPU in dedicated mode. As far as possible, other programs were
terminated before benchmarking to avoid contention for CPU time. Benchmarks
were repeated 10 times and the median of these used. All execution timings were
made against wall clock time. Using CPU time would have given biased results,
since time spent on the GPU appears as I/O and would not have been included.
4.2.1 Model of Overheads
The overheads related to off-loading computation onto the graphics processor can be split into four categories as in Section 3.8.1: importing from Java; copying to the GPU; copying back from the GPU; and exporting to Java. The operations within these stages (Section 3.7.3) suggest the following costs. In general, I expect these to behave linearly (i.e. an initial latency l, plus a further cost n·t depending on the size n of the operation).

Stopping Garbage Collection (l_g). Since the critical array access JNI functions are used, there will be a constant cost for stopping garbage collection.
Stage      Overhead Time
Import     l_g + (l_r · Σ_p A(p)) + S·l_s, for all parameters p, where S is the number of statics used
Copy On    l_d · Σ_p R(p) + t_d · Σ_p M(p), for all parameters p
Copy Off   l_h · Σ_p R(p) + t_h · Σ_p M(p), for copy off parameters p
Export     l_f + (l_w · Σ_p A(p)) + (t_w · Σ_p E(p)), for copy off parameters p

Table 4.2: Expected timings for overhead stages according to model.
JNI Reads (l_r). For each read from Java, there will be a constant cost. This also applies to the critical array accesses, since no copy is performed.

Constant Setting (l_s). When statics are used, CUDA constant memory must be set.

Copies (l_d, t_d, l_h and t_h). Copies in each direction are likely to have different bandwidths. I ignore the allocation cost at the beginning of the copy on stage, since this will be negligible compared to the copy.

JNI Writes (l_w and t_w). The critical array access functions allow changes to be aborted, suggesting that a copy-on-write may occur internally. This is therefore modelled as a linear cost.

Freeing (l_f). Finally, there is the cost of freeing the used device memory.
This gives the expressions in Table 4.2 for the overheads associated with each of the four stages. These rely on knowing certain values for each parameter p of the kernel:

- The number of accesses A(p) required to read or write the parameter from Java.
- The total amount of memory E(p) that is exported by these accesses.
- The number of memory regions R(p) that this data is spread out over.
- The total amount of memory M(p) that the data occupies once in the C++ code (this is higher due to the representations shown in Figure 3.6).

These can be calculated recursively based on the type of the parameter (and array lengths), as shown in Equations 4.1 to 4.4.
Accesses:
    A(primitive)          = 0
    A(array of primitive) = 1
    A(array of τ)         = length × (1 + A(τ))
    A(object τ)           = Σ_{τ' ∈ fields(τ)} (1 + A(τ'))             (4.1)

Exported Memory:
    E(primitive)  = sizeof(primitive)
    E(array of τ) = sizeof(pointer) + (length × E(τ))
    E(object τ)   = sizeof(pointer) + Σ_{τ' ∈ fields(τ)} E(τ')         (4.2)

Memory Regions:
    R(primitive)         = 0
    R(array of object τ) = 2 + (length × Σ_{τ' ∈ fields(τ)} R(τ'))
    R(array of τ)        = 1 + (length × R(τ))
    R(object τ)          = 1 + Σ_{τ' ∈ fields(τ)} R(τ')                (4.3)

Total Memory:
    M(primitive)  = sizeof(primitive)
    M(array of τ) = 3 × sizeof(pointer) + 4 + (length × M(τ))
    M(object τ)   = 3 × sizeof(pointer) + Σ_{τ' ∈ fields(τ)} M(τ')     (4.4)
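The recursions for A and E can be sketched in Java as follows; this is a minimal sketch covering only the primitive and array cases, and the 8-byte pointer size is an assumption (a 64-bit JVM), not a measured value.

```java
// Sketch of evaluating the recursive cost functions A (JNI accesses) and
// E (exported memory) for parameter types; object fields are omitted.
class CostModel {
    static final int POINTER = 8;           // assumed pointer size (64-bit)

    static abstract class Type {
        abstract long accesses();           // A(p)
        abstract long exported();           // E(p)
    }

    static class Primitive extends Type {
        final int size;
        Primitive(int size) { this.size = size; }
        long accesses() { return 0; }
        long exported() { return size; }
    }

    static class ArrayOf extends Type {
        final Type element;
        final long length;
        ArrayOf(Type element, long length) {
            this.element = element;
            this.length = length;
        }
        long accesses() {
            if (element instanceof Primitive) return 1;   // one critical access
            return length * (1 + element.accesses());
        }
        long exported() {
            return POINTER + length * element.exported();
        }
    }
}
```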
Measurement of Copy Parameters
As a preliminary test of the copy on and copy off models, a CUDA program that measured the time taken to copy N arrays of N doubles (i.e. 8N bytes) was written in C++. It became apparent that the model only holds when the copies are within a single device memory allocation. The test program was therefore extended to allow for a variety of memory locations both on the host and the device. These were as follows:
Separate Each array is allocated separately with a call to the relevant memory
allocator.
Sequential The memory for all arrays is allocated at once, and then the locations
allocated sequentially from this pool.
Non-sequential Again the memory for all arrays is allocated at once, but the
regions of memory are allocated alternately from the start and end of this
pool. This was designed to simulate the case where the order of the copies
could not be predicted and would not be in-order.
Figure 4.1: Effect on copy performance (host-to-device) of single vs. multiple allocations. (Series: Separate, Single Allocation, Cubic Fit, As per model; Time (ms) against N.)
As shown in Figure 4.1, the predicted model (N·l_d + 8N²·t_d) is only followed in the two cases of single allocation, with the separate case exhibiting cubic behaviour. This also shows the improvement in copy performance that can be achieved by performing just a single allocation. An appropriate modification was therefore made to the compiler.
The model parameters given by gnuplot's fitting function were l_d = (8.07 ± 0.11) × 10⁻³ ms and t_d = (1.614 ± 0.002) × 10⁻⁶ ms byte⁻¹. The respective values for device-to-host copy were l_h = (8.56 ± 0.15) × 10⁻³ ms and t_h = (2.548 ± 0.003) × 10⁻⁶ ms byte⁻¹.
NVIDIA provide a similar tool in their SDK that measures memory copy performance. This gives results which can then be used to estimate t_d and t_h. Timing single copies does not give sufficient accuracy to measure the latencies, so the values for l_d and l_h are taken from above. However, as shown in Figure 4.2, for small copies (< 50KB) the model does not hold, with t_d and t_h taking varying values as shown in Figure 4.3. For the remainder of the evaluation, I continue to assume the simple linear model, but split each parameter into small and large values, for below and above 50KB respectively (i.e. t_{d,small}, t_{d,large}, ...).
Figure 4.2: Comparison of measured copy performance with the linear model (using CUDA SDK). (Series: Host to Device, Device to Host, l_d + N·t_d, l_h + N·t_h; Time (ms) against Size (bytes).)
Figure 4.3: Values of t_d and t_h for measurements, assuming constant latency (using CUDA SDK). (Time/byte (ms/byte) against Size (bytes).)
4.2.2 Component Benchmarks
Here I present a number of micro-benchmarks that compute sin²x + cos²x over a sequence of random numbers (length N). Each version of the benchmark stores the sequence in a different manner. This allows the overheads model to be tested on code produced by the compiler. It also evaluates whether speedups can be achieved when very little computation is performed. The versions produced were:
Baseline The baseline version stores the numbers in a local 1D array.
Statics In this case, the numbers are stored as a static variable.
Objects Each number is placed inside a class, and the computation is performed
as a method of this class.
Two Dimensions The numbers are stored in a rectangular array with roughly the same number of elements. The dimensions for the array were chosen as ⌈√N⌉ × ⌈N / ⌈√N⌉⌉.
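The computation performed by the baseline version can be sketched as follows; the class and method names are illustrative, not those of the benchmark source.

```java
import java.util.Random;

// Sketch of the baseline component benchmark: sin^2(x) + cos^2(x) over a
// sequence of N random numbers held in a local one-dimensional array.
class Baseline {
    static double[] compute(double[] data) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {     // the loop offloaded to the GPU
            double s = Math.sin(data[i]);
            double c = Math.cos(data[i]);
            out[i] = s * s + c * c;                 // mathematically always 1.0
        }
        return out;
    }

    static double[] randomSequence(int n, long seed) {
        Random r = new Random(seed);
        double[] data = new double[n];
        for (int i = 0; i < n; i++) data[i] = r.nextDouble();
        return data;
    }
}
```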
Using the model in the previous section, the overhead times for each of these versions can be predicted, as shown in Table 4.3. Measurements for a range of N can then be used to assess whether the model fits accurately and to give estimates for its parameters. Due to the nature of the parameters, it is necessary to consider the complete data set (e.g. in the static case l_g and l_s could be varied arbitrarily provided l_g + l_s gave a suitable value). The measured values are shown in Table 4.4 and are reasonably consistent with those measured in the previous section. The slight shift in copy latencies may be due to an overlooked difference between the C++ test program in the previous section, and the copies performed for off-loading Java. Recalculating the rates in Figure 4.3 using the new values of l_d and l_h gives values that coincide with the rates calculated here. An indication of the quality of the fits is given by the graphs in Figure 4.4.
The benchmark timings also give an indication of the execution speedup. The
results are summarised in Table 4.5. The performance when executed on the CPU
was the same for all versions.
The baseline benchmark is encouraging as it shows that even when little com-
putation is performed on the GPU, the overhead associated with transferring data
to the graphics card is not prohibitive. The statics version performs similarly, as
would be expected, with a slightly improved speedup possibly due to the array
pointer being held in constant memory which can be cached (see Section 2.6.2).
When an object array is considered, the overheads (although vastly improved by using single memory allocation) make off-loading to the GPU impractical. The
Import
    Baseline:  l_g + l_r
    Statics:   l_g + l_r + l_s
    Objects:   l_g + 2N·l_r
    2D:        l_g + 2√N·l_r

Export
    Baseline:  l_f + l_w + (8 + 8N)·t_w
    Statics:   l_f + l_w + (8 + 8N)·t_w
    Objects:   l_f + 2N·l_w + (8 + 16N)·t_w
    2D:        l_f + 2√N·l_w + (8 + 8√N + 8N)·t_w

Copy On (for Copy Off replace l_d and t_d with l_h and t_h)
    Baseline:  l_d + (28 + 8N)·t_d
    Statics:   l_d + (28 + 8N)·t_d
    Objects:   2·l_d + (28 + 32N)·t_d
    2D:        (1 + √N)·l_d + (28 + 28√N + 8N)·t_d

Table 4.3: Model of overheads for component benchmark versions.
Latencies (ms):
    l_g = 7.37 × 10⁻³    l_r = 3.76 × 10⁻⁴    l_s = 2.08 × 10⁻²    l_d = 1.05 × 10⁻²    l_h = 1.04 × 10⁻²
    l_f = 1.43 × 10⁻¹    l_w = 2.40 × 10⁻⁴

Rates (ms/byte):
    t_{d,small} = 9.56 × 10⁻⁷    t_{d,large} = 6.82 × 10⁻⁷    t_{h,small} = 1.75 × 10⁻⁶    t_{h,large} = 1.23 × 10⁻⁶    t_w = 1.97 × 10⁻⁹

Table 4.4: Model parameters, as measured using component benchmarks.
Figure 4.4: Fit of model (green) to component benchmarks. (Rows: Baseline, Statics, Objects, 2D; columns: Import, Copy On, Copy Off, Export; each panel plots Time against N.)
Version    Execute Only    Inc. Overheads
Baseline   192             40
Statics    239             41
Objects    220             0.18
2D         229             22

Table 4.5: Speedup factors for the component benchmarks.
Figure 4.5: Fit of model to Fourier Series benchmark, using previously calculated parameters. (Panels: Import, Copy On, Copy Off, Export; Time (ms) against N.)
inaccuracy of the model during the import stage may be due to unexpected
overheads associated with the map used for listing references. Further work is
needed to isolate this and make suitable improvements.
The overheads in the two-dimensional case are also much reduced by the
single memory allocation and this improves the overall speedup from 5.6 to 22.
4.2.3 Java Grande Benchmark Suite [7]
The Java Grande benchmark suite was used as a source of external unbiased code
that could be passed to the compiler. The sequential code was annotated and
fed to the compiler. Timings were then compared between the GPU and original
versions.
A full description of the suite is given in Appendix D, including an explanation of which benchmarks were used. Here I give the results of the Series and Crypt benchmarks, relating these to the hypothesised overheads model, and also the hardware characteristics of CUDA.
Series: Fourier Series of (x + 1)^x
This benchmark exhibited the biggest speedup factor (187 overall). The breakdown of the execution time shows that only 0.5% of the GPU time was due to overheads. These overheads were generally in agreement with those predicted using the parameters measured in the previous section (Figure 4.5).
Figure 4.6: Fit of model to Mandelbrot benchmark, using previously calculated parameters. (Panels: Import, Copy On, Copy Off, Export; Time (ms) against N.)
Crypt: IDEA Encryption/Decryption
Whilst only using integer operations, the graphics processor execution still achieves a significant speedup factor (8.7). This is again helped by the relatively small amount of data required for computation. The lower factor is probably due to the CPU performing better on integer benchmarks.
4.2.4 Mandelbrot Set Computation
The Mandelbrot set is defined as the set of complex values c such that the absolute value of z_n remains bounded for any value of n, where z_n is defined as:

    z_n = 0                     if n = 0
    z_n = z_{n-1}² + c          if n > 0                (4.5)

For computation, we must define a limit on the size of n (the iteration limit) and also a bound on values of z_n. Here the bound is set as 4.0 (as used in the original code). The iteration limit means that it is possible to vary the amount of computation performed on the data, altering the significance of the overheads.
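The per-point computation defined by Equation 4.5 can be sketched as follows; the class and method names are illustrative, and the usual squared-magnitude escape test against 4.0 is assumed.

```java
// Sketch of Equation 4.5: iterate z_n = z_{n-1}^2 + c until |z|^2 exceeds
// the 4.0 bound or the iteration limit is reached.
class Mandel {
    static int iterations(double cr, double ci, int limit) {
        double zr = 0.0, zi = 0.0;
        int n = 0;
        while (n < limit && zr * zr + zi * zi <= 4.0) {
            double t = zr * zr - zi * zi + cr;      // real part of z^2 + c
            zi = 2.0 * zr * zi + ci;                // imaginary part of z^2 + c
            zr = t;
            n++;
        }
        return n;
    }
}
```

Varying the limit argument varies the amount of work per point, which is what alters the significance of the overheads in this benchmark.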
Again, the measured parameters from Section 4.2.2 were used to predict the overheads, giving very accurate results (Figure 4.6). This demonstrates that the model does not suffer from overfitting.
Turning to the speedup during the actual execution portion, I first consider the case where the iteration limit is fixed at 250 (as in [16]) and the grid size is altered. Figure 4.7 plots the speedup achieved on the execute portion, and also the overall speedup when overheads are included. The reason the execute-only speedup is lower in this benchmark could be due to the effect of thread divergence (described in Section 2.6.1). This means that the calculation for each pixel takes as long as the slowest pixel in its warp.
[Figure 4.7 plots the execute-only and overall speedup factors, and the percentage overhead, against N.]
Figure 4.7: Speedups and overhead for Mandelbrot benchmark with fixed itera-
tion limit (250).
in Figure 4.8. Since the overheads are fixed, they become less significant as the
number of iterations rises, with the overall speedup tending towards the execute
speedup.
4.2.5 Conway's Game of Life
Conway's Game of Life is a cellular automaton. The evolution of each cell in a
2D grid requires independent computation (see Appendix D for details).
The simulation of such a game provides an interesting benchmark for parallel
computing, since there is a trade-off between the naïve computation that is easy to
parallelise, and more sophisticated algorithms that are less suited. In particular,
I will consider the Hashlife implementation [12], which accelerates simulation by
recording the evolution of subgrids to avoid later recomputation.
As shown in Figure 4.9, the naïve algorithm running on the GPU in fact runs
slower than on the CPU. Both are much slower than Hashlife. One reason for
this is that all data is copied back and forth from the graphics card on each
iteration, even though the data is not used by the host in between each kernel
[Figure 4.8 plots the execute-only and overall speedup factors, and the percentage overhead, against the number of iterations.]
Figure 4.8: Speedups and overhead for Mandelbrot benchmark with fixed grid
size (8000 × 8000).
invocation. Other work [16] introduces multi-pass loops, where the loop body
only consists of GPU code, allowing data to be left on the GPU. In the case of
this specific benchmark, a more advanced approach would be needed, since a new
array is used for each iteration rather than double buffering.
A second issue is the manner in which a cell's neighbours are counted. Since
the world is stored as an array of booleans, there is an if ...else control flow
structure for each neighbour. This suggests that the execution may be suffering
from thread divergence.
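A divergence-friendlier variant can be sketched as follows (an illustration, not the benchmark's actual code): storing cells as 0/1 integers lets the neighbour count be accumulated by plain addition, so all threads in a warp follow the same control flow regardless of cell contents.

```java
public class Neighbours {
  // Count live neighbours of cell (y, x) by summing 0/1 entries.
  // The dy/dx test does not depend on the cell data, so threads in a
  // warp take identical branches whatever the world contains.
  public static int count(int[][] world, int y, int x) {
    int total = 0;
    for (int dy = -1; dy <= 1; dy++) {
      for (int dx = -1; dx <= 1; dx++) {
        if (dy != 0 || dx != 0) {
          total += world[y + dy][x + dx];
        }
      }
    }
    return total;
  }
}
```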
4.2.6 Summary
These results show that significant performance improvements are possible over
a range of benchmarks. Whilst accurate predictions of execute speedups have
not been possible, the factors measured are consistent with those expected given
the number of cores on the GPU and also the number of double precision units
available. Both the execute and overall speedups for each benchmark are
summarised in Table 4.6. These are combined using the geometric mean (see [9] for
[Figure 4.9 plots overall time (ms) against grid size and generations, for both GPU and CPU.]
Figure 4.9: Overall times for simulation of Conway's Game of Life.
Benchmark                        Execute   Overall
Baseline                         192       40
Statics                          239       41
Objects                          220       0.18
2D                               229       22
Mandelbrot (250 iterations)      83        39
Mandelbrot (8000 × 8000 grid)    106       79
Life¹
Series                           189       187
Crypt                            8.7
Geometric mean                   182.4     20.6

Table 4.6: Summary of speedup factors.
reasons why this is appropriate) to give an average speedup factor of 20.6.
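The combination step can be sketched as below (a sketch; the evaluation harness's actual code is not shown in this dissertation). As [9] argues, the geometric mean is the appropriate average for normalised results such as speedup ratios.

```java
public class Summary {
  // Geometric mean computed as exp of the mean of the logs, which
  // avoids overflowing the running product for long benchmark lists.
  public static double geometricMean(double[] factors) {
    double logSum = 0.0;
    for (double f : factors) {
      logSum += Math.log(f);
    }
    return Math.exp(logSum / factors.length);
  }
}
```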
The overheads model has also been evaluated, with the parameters measured
from the component benchmarks giving accurate predictions of the overheads in
other cases. However, some aspects are not fully understood (i.e. GPU behaviour
for small copies, and object import time).
¹ The Life speedup factors were all very low (≪ 1) but varied considerably. Therefore, there
was not a suitable single value.
Benchmark                    Series (Floating Point)      Crypt (Integer)
Data Size                    10^4    10^5    10^6         3×10^6  2×10^7  5×10^7
CPU on bing (ms)             17971   182894  2878469      414     2190    5344
This Project on bing (ms)    99      968     9358         41      245     545
JCUDA on Tesla C1060 (ms)    110     1040    10140        20      160     450

Table 4.7: Comparison of Java Grande benchmark timings with JCUDA.
4.3 Accuracy of Dependency Analysis
Using the same gold standards as were used for testing (i.e. @Parallel annota-
tions), it was possible to measure the accuracy of the automatic analysis. This
showed that an accuracy of 85% (29/34) was achieved for the range of benchmarks.
In cases where the check was too conservative, the behaviour could be explained
by the may-alias and checking algorithms.
4.4 Comparison with Existing Work
This project's approach was compared with other related work in Section 1.3. In
terms of performance, published results allow some quantitative comparisons to
be made regarding the speedups achieved. Unfortunately, the JikesRVM work
[16] uses a much older card (GeForce 7800), so is incomparable. JCUDA [25]
uses a similar card (NVIDIA Tesla C1060, 1.3GHz, 240 cores) to that of bing
(NVIDIA GTX 260, 1.24GHz, 216 cores). Their work ports the Java Grande
benchmarks [7] to C++ so that the GPU performance can be compared to that
of raw Java. My results for the Series and Crypt benchmarks (Section 4.2.3)
are broadly similar, as shown in Table 4.7.
Turning to the automatic dependency analysis, neither [16] nor javab [4]
give accuracy figures for their analyses (javab instead compares the number of
parallelisable loops with the total). However, it would be expected that the
approach of [16] could use runtime information to produce more accurate results
than either this work or javab.
4.5 Summary
In this chapter, three key aspects of the project have been evaluated. First,
tests were used to demonstrate compiler correctness within the required scope.
A model for overheads was then developed and tested. It was found to be accurate
with large copies, but the bandwidth to the card behaved in a manner not
fully understood with small copies. The modelling also indicated a significant
improvement that could be made to the compiler. Execution speedups were found
provement that could be made to the compiler. Execution speedups were found
to be in line with what would be expected based on the hardware architecture.
Finally, investigations were made into the accuracy of the automatic analysis.
Some quantitative comparisons with existing work have also been made, adding
to those in Section 1.3.
CHAPTER 5
Conclusions
This dissertation has highlighted the key aspects of the project and the compiler
that it produced. It has explained the existing work and knowledge that was used
(Chapter 2), and how this allowed a novel compiler to be developed (Chapter 3).
Evaluation of the compiler (Chapter 4) has shown it to both maintain correctness
and provide significant speedups in the majority of sample cases. This chapter
assesses the project formally with respect to its goals, and suggests future work
to improve the compiler.
5.1 Comparison with Requirements
Ultimately, the project should be judged by whether it meets the requirements
that were elaborated from the project proposal in Section 2.1. The evaluation
allows each of these to be considered in this section.
The tests that were carried out during the project (Section 4.1) showed that
the compiler maintained correctness whenever it succeeded in compiling. A
marginal case is exhibited when the graphics processor's memory is exceeded;
this causes the JVM to exit gracefully with a suitable error message. Use of recursion
within parallel loops is a notable case where compilation fails, due to restrictions
of CUDA. This evidence demonstrates that the project meets Requirement C1.
The various performance benchmarks that have been evaluated (Section 4.2)
show a clear benefit from using the compiler, satisfying Requirement C2.
The annotations that the compiler uses to assess code (@Parallel and
@Restrict) are both unobtrusive and transparent to the standard Java com-
piler. Transparency allows source code containing these annotations to be built
normally for environments without compatible GPUs. The annotations also
allow explicit marking of parallel for loops of multiple dimensions. Therefore,
Requirement C3 is met. The nature of @Parallel also means that the loop
bound detection extension (E1) was fully implemented.
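For illustration, an explicitly marked loop nest might look as follows. This is a sketch: the annotation's loops element naming the index variables matches the AnnotationCheck source in Appendix G, but the Grid class and its fields are invented for the example, and the annotation is inlined here (the real one lives in the tools package) so the snippet compiles on its own.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stand-in for the compiler's tools.Parallel annotation; CLASS
// retention suffices because the compiler reads class files.
@Retention(RetentionPolicy.CLASS)
@Target(ElementType.METHOD)
@interface Parallel {
  String[] loops();
}

public class Grid {
  private final int[][] cells = new int[64][64];

  // Both loop indices are named, so a two-dimensional kernel may be
  // extracted; javac itself ignores the annotation, so the class
  // still builds and runs on machines without a compatible GPU.
  @Parallel(loops = {"y", "x"})
  public void fill() {
    for (int y = 0; y < cells.length; y++) {
      for (int x = 0; x < cells[0].length; x++) {
        cells[y][x] = y * cells[0].length + x;
      }
    }
  }

  public int get(int y, int x) {
    return cells[y][x];
  }
}
```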
The scope of code that can be compiled for GPU execution meets the
requirements of C4, although recursive code cannot be used due to restrictions in
current GPU architectures. Extension E4 for support of objects has also been
completed up to the limits of the architecture (i.e. no inheritance or allocation).
The compiler provides user feedback, giving reasons whenever parallel
compilation fails. This avoids unexplained performance changes when utilising the
automatic dependency analysis, satisfying C5.
The sample code (Appendix D) used in the evaluation has been fully
described, meaning that all claims can be checked objectively. This fulfils the final
core requirement, C6.
The implementation of simple automatic dependency checking (Section 3.6.2)
means that Extension E2 has also been completed.
5.2 Future Work
There are many additions and improvements that could be made to further
develop the compiler, of which I describe a few here.
5.2.1 Further Hardware Support
With the release of NVIDIA's new Fermi cards [21] and CUDA 3.0, it is now
possible to provide a more complete set of features for GPU execution, including
recursion and more complete object support. While recursion would be supported
automatically via nvcc, some features would require more work. Support for
allocations might be possible in some cases by pessimistically allocating space for
all possible allocations, and then freeing unused blocks after the kernel invocation.
Support for multiple graphics cards would also be useful. However, exporting
arrays back to Java, after different portions have been modified on different cards,
may cause difficulties and extra overheads.
5.2.2 Further Optimisations
There is certainly scope for further transformations within the compiler to
improve performance. For example, when copying objects onto the device, it makes
sense only to copy fields that will be used. As mentioned in the original extensions,
there may also be optimisations that neither nvcc nor the JVM can
perform, such as loop invariant code motion (see Section 2.1, E5).

for(int i = 0; i < length; i++)
  if(arr[i] < minimum)
    minimum = arr[i];

(a) Sequential

[Panel (b): a parallel reduction tree combining values pairwise with <, e.g. min(min(0.3, 0.2), min(1.7, 3.1)).]

Figure 5.1: Minimum finding algorithms
As exhibited in the Game of Life benchmark (Section 4.2.5), support for multi-
pass loops (as implemented in [16]) could also improve performance dramatically
in some iterative algorithms.
5.2.3 Further Automatic Detection
Given the undecidability of the automatic parallelisation, there will always be
scope for introduction of more accurate and sophisticated tests. However, an
alternative might be to leave a CPU version of the code in the class, selecting
which to use at runtime. This could be based, not just on correctness, but also
on whether the number of iterations justifies the expected overheads.
There is also the potential for pattern matching transformations to yield
significant benefits (albeit in a limited number of cases). For example, common
implementations of minimum, maximum and sum (Figure 5.1a) are not suitable
for parallel execution; however, the solution can be sped up using parallel reduc-
tion (Figure 5.1b).
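The reduction of Figure 5.1b can be sketched as below. It is written sequentially here, but each iteration of the inner loop is independent, so on a GPU one thread can handle each pair, halving the problem in each of O(log n) steps.

```java
public class Reduce {
  // Tree-style minimum: repeatedly fold the upper half of the array
  // into the lower half until one element remains.
  public static int min(int[] arr) {
    int[] vals = arr.clone();
    int n = vals.length;
    while (n > 1) {
      int half = (n + 1) / 2;
      // These iterations are mutually independent (each writes a
      // distinct vals[i]), unlike the loop-carried 'minimum' in the
      // sequential version of Figure 5.1a.
      for (int i = 0; i < n / 2; i++) {
        vals[i] = Math.min(vals[i], vals[half + i]);
      }
      n = half;
    }
    return vals[0];
  }
}
```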
5.3 Final Conclusions
Overall, I believe that the compiler is able to offer a higher level of abstraction
than other attempts (Section 1.3) without sacrificing performance (Section 4.4).
Bibliography
[1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles,
Techniques, & Tools. Addison-Wesley, second edition, 2007.
[2] B. Alpern, A. Cocchi, D. Lieber, M. Mergen, and V. Sarkar. Jalapeño:
a compiler-supported Java virtual machine for servers. In Workshop on
Compiler Support for Software System (WCSSS '99), volume 14, pages 87–94.
Citeseer, 1999.
[3] B. Amedro, V. Bodnartchouk, D. Caromel, C. Delbé, F. Huet, and
G. Taboada. Current State of Java for HPC. Technical Report RT-0353,
INRIA, 2008.
[4] A. Bik and D. Gannon. javab: a prototype bytecode parallelization tool. In
ACM Workshop on Java for High-Performance Network Computing, 1998.
[5] B. Boehm. A spiral model of software development and enhancement.
SIGSOFT Softw. Eng. Notes, 11(4):14–24, 1986.
[6] E. Bruneton, R. Lenglet, and T. Coupaye. ASM: a code manipulation tool to
implement adaptable systems. Adaptable and extensible component systems,
2002.
[7] J. M. Bull, L. A. Smith, M. D. Westhead, D. S. Henty, and R. A. Davey.
A methodology for benchmarking Java Grande applications. In JAVA '99:
Proceedings of the ACM 1999 conference on Java Grande, pages 81–88, New
York, NY, USA, 1999. ACM.
[8] L. Damas and R. Milner. Principal type-schemes for functional programs. In
POPL '82: Proceedings of the 9th ACM SIGPLAN-SIGACT symposium on
Principles of programming languages, pages 207–212, New York, NY, USA,
1982. ACM.
[9] P. Fleming and J. Wallace. How not to lie with statistics: the correct way
to summarize benchmark results. Communications of the ACM, 29(3):221,
1986.
[10] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements
of Reusable Object-Oriented Software. Addison-Wesley, Reading, MA, 1995.
[11] M. Gardner. Mathematical games: The fantastic combinations of John Con-
way's new solitaire game "Life". Scientific American, 223(4):120–123, 1970.
[12] R. Gosper. Exploiting regularities in large cellular spaces. Physica D: Non-
linear Phenomena, 10:75–80, 1984.
[13] G. A. Kildall. A unified approach to global program optimization. In POPL
'73: Proceedings of the 1st annual ACM SIGACT-SIGPLAN symposium on
Principles of programming languages, pages 194–206, New York, NY, USA,
1973. ACM.
[14] A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih. Py-
CUDA: GPU Run-Time Code Generation for High-Performance Computing.
Arxiv preprint arXiv:0911.3456, 2009.
[15] W. Landi. Undecidability of static analysis. ACM Letters on Programming
Languages and Systems, 1(4):323–337, 1992.
[16] A. Leung, O. Lhoták, and G. Lashari. Automatic parallelization for graph-
ics processing units. In Proceedings of the 7th International Conference on
Principles and Practice of Programming in Java, pages 91–100, 2009.
[17] J. Lewis and U. Neumann. Performance of Java versus C++. Computer
Graphics and Immersive Technology Lab, University of Southern California,
Jan. 2003 (updated 2004).
[18] S. Liang. Java Native Interface 6.0 Specification. Sun, 1999.
[19] T. Lindholm and F. Yellin. The Java(TM) Virtual Machine Specification
(2nd Edition). Prentice Hall, 1999.
[20] NVIDIA. Compute Unified Device Architecture Programming Guide,
August 2009. Version 2.3.1.
[21] NVIDIA. Fermi: NVIDIA's Next Generation CUDA Compute Architecture.
White paper, October 2009.
[22] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. Lefohn, and
T. Purcell. A survey of general-purpose computation on graphics hardware.
In Computer Graphics Forum, volume 26, pages 80–113, 2007.
[23] L. Smith and M. Bull. Java for High Performance Computing.
[24] H. Sutter. The free lunch is over: A fundamental turn toward concurrency
in software. Dr. Dobb's Journal, March 2005.
[25] Y. Yan, M. Grossman, and V. Sarkar. JCUDA: A Programmer-Friendly
Interface for Accelerating Java Programs with CUDA. In Proceedings of the
15th International Euro-Par Conference on Parallel Processing, 2009.
APPENDIX A
Dataflow Convergence Proofs
In this appendix, proofs are given for the convergence of the iterative computation
of the various dataflow analyses, as described in Sections 2.7.1 to 2.7.4.
A.1 General Dataflow Analysis
As stated in Section 2.7.1, for an analysis over the complete lattice (X, ⊑), with
transfer function F_b : X → X, convergence is guaranteed if (X, ⊑) is of finite
height and F_b is monotone.
The proof is based on that given in [1, pp. 627 to 628], but is adjusted so
that each calculation makes use of the latest result, rather than always looking
to the previous iteration.

Definition 11. A function F : X → X is monotone if a ⊑ b ⟹ F(a) ⊑ F(b).

Theorem 2. If F_b (for all b) is monotone and the lattice is of finite height, then
the dataflow analysis converges.

Proof. For all b, we consider the value of R(b) on the ith iteration (i.e. R_i(b)). If
we can show that ∀b. R_i(b) ⊑ R_{i+1}(b), then the iterative calculation must converge,
since we must either reach a fixed point, or all R(b) will eventually equal the upper
bound on the lattice (since the lattice has finite height, there are no infinite
chains).
For the case where children(b) = ∅, R_i(b) is constant, so trivially R_i(b) ⊑
R_{i+1}(b). We consider the other cases by induction.

Base Case: Since we initialise R_0(b) as ⊥, no matter what value R_1(b) takes,
we have that R_0(b) ⊑ R_1(b).
Induction Step: Now we consider R_{i+1}(b), assuming ∀x. R_{i-1}(x) ⊑ R_i(x).
Without loss of generality, we can also presume that there is an ordering of
calculations within the iteration, although this is not necessarily the same for
all iterations. We denote the set of blocks or instructions calculated before b as
calc(b). Using an inner induction proof, we can now show that ∀b. R_i(b) ⊑ R_{i+1}(b).

Inner Base Case: For calc(b) = ∅, the value of R_{i+1}(b) is calculated as:

    R_{i+1}(b) = F_b( ⨆_{c ∈ children(b)} R_i(c) )

We also know that R_i(b) was calculated as:

    R_i(b) = F_b( ⨆_{c ∈ children(b)} R'(c) )    where each R'(c) is either R_i(c) or R_{i-1}(c)

By assumption and reflexivity, we have:

    ∀c ∈ children(b). R_{i-1}(c) ⊑ R_i(c) and R_i(c) ⊑ R_i(c)

Therefore, since F_b and both join and meet¹ are monotone, we have that R_i(b) ⊑
R_{i+1}(b) if calc(b) = ∅.

Inner Induction Step: Now we assume that ∀x ∈ calc(b). R_i(x) ⊑ R_{i+1}(x).
In this case, R_{i+1}(b) is calculated as (using calc(b) as a partition):

    R_{i+1}(b) = F_b( (⨆_{c ∈ children(b) ∩ calc(b)} R_{i+1}(c)) ⊔ (⨆_{c ∈ children(b) \ calc(b)} R_i(c)) )

Again, since F_b and both meet and join are monotone, and also by our assump-
tions, we have that R_i(b) ⊑ R_{i+1}(b) if ∀x ∈ calc(b). R_i(x) ⊑ R_{i+1}(x).
Therefore, using both the inner and outer inductions in turn, we have that
∀i, b. R_i(b) ⊑ R_{i+1}(b). This proves that the iterative calculation converges. □
A.2 Live Variable Analysis
Recall that live variable analysis is performed over the lattice (℘(Vars), ⊆) with
transfer function:

    F_n(x) = (x − Write(n)) ∪ Read(n)

¹ It is a standard result of lattices that join and meet are monotone.
Theorem 3. Iterative computation of liveness information converges.

Proof. Convergence can be shown with the help of Theorem 2 by showing that
F_n is monotone and that the lattice has finite height. This is trivially the case,
since F_n is the composition of two monotone operations, set minus and set union.
Also, since ℘(Vars) is finite and has top element Vars, the lattice must have finite
height. □
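The iterative scheme whose termination Theorems 2 and 3 guarantee can be sketched as follows (a simplified illustration over node indices, not the compiler's actual data structures):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Liveness {
  // Iterate F_n(x) = (x \ Write(n)) ∪ Read(n) over all nodes until a
  // fixed point is reached; termination follows from Theorem 3, since
  // the sets only ever grow and the powerset lattice has finite height.
  public static List<Set<String>> compute(int n, int[][] succ,
                                          List<Set<String>> read,
                                          List<Set<String>> write) {
    List<Set<String>> live = new ArrayList<>();
    for (int i = 0; i < n; i++) live.add(new HashSet<>());

    boolean changed = true;
    while (changed) {
      changed = false;
      for (int i = n - 1; i >= 0; i--) {
        Set<String> in = new HashSet<>();
        for (int s : succ[i]) in.addAll(live.get(s));  // join over successors
        in.removeAll(write.get(i));                    // x \ Write(n)
        in.addAll(read.get(i));                        // ... ∪ Read(n)
        if (!in.equals(live.get(i))) {
          live.set(i, in);
          changed = true;
        }
      }
    }
    return live;
  }
}
```

For a straight-line program "a=1; b=a; use(b)", only a is live into the second node and only b into the third.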
A.3 Constant Propagation
Recall that constant propagation is performed over the lattice ({⊥, ⊤} ∪
Constants, ⊑) and transfer function F_{n,v}, where:

    x ⊑ y ⟺ (x = ⊥) ∨ (y = ⊤)

    F_{n,v}(x) = c    if n assigns c to v
                 ⊤    if n writes a non-constant to v
                 x    otherwise

Theorem 4. Iterative computation of constant propagation converges.

Proof. First we show that F_{n,v} is monotone. The definition of F_{n,v} can be con-
sidered in two cases. When n writes to v, F_{n,v} is simply a constant function, so
is trivially monotone. Equally, when n does not write to v, F_{n,v} is the identity
function, so is also monotone.
We can also show that the lattice is of finite height. By definition of ⊑, the only
increasing chains are ⊥ ⊑ ⊤ and, for each c ∈ Constants, ⊥ ⊑ c ⊑ ⊤.
Therefore, as for other dataflow analyses, convergence is guaranteed by these
two properties according to Theorem 2.
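Where control-flow paths merge, the incoming lattice values must be joined; on this flat lattice the join can be sketched as below (using null for ⊥ and a sentinel object for ⊤, which is an assumption of this sketch rather than the compiler's actual representation):

```java
public class FlatLattice {
  // Sentinel representing ⊤ (definitely not a single constant).
  public static final Object TOP = new Object();

  // Join on the flat lattice ⊥ ⊑ c ⊑ ⊤: ⊥ is the identity, equal
  // constants are preserved, and anything else collapses to ⊤.
  public static Object join(Object a, Object b) {
    if (a == null) return b;      // ⊥ ⊔ b = b
    if (b == null) return a;      // a ⊔ ⊥ = a
    if (a.equals(b)) return a;    // c ⊔ c = c
    return TOP;                   // differing constants, or ⊤ involved
  }
}
```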
APPENDIX B
Code Generation Details
This appendix describes the naming conventions used within the code generation
stage of the compiler.
Within a kernel or method, the temporary variables used are simply named
consecutively (i.e. t1, t2, ...). For local variables, it is necessary to append
a type suffix, since the same local variable might be used for different types in
different live ranges (as in Example 3.4). This gives names of the form vi_TypeSort for
variable i, where the type sort is any of the primitive types, or a unique number
for reference types.
Kernel launcher methods (i.e. those called as a replacement for the loops)
are named using the hashcode of the internal object representing the kernel (i.e.
kernel_<hashcode>, or kernel_M<hashcode> if the value is negative). This
gives a unique name amongst the kernels exported, and is unlikely to conflict
with any methods within the original class.
JNI specifies a mangling scheme for converting Java method names to C++
[18, Table 2-1]. This must be adopted for the launcher, but is also used by the
compiler for method and static variable names with altered prefixes in place of
Java_ (Static_ for statics and none for methods). This is necessary to ensure
that there are no conflicts in naming (e.g. a naïve approach might result in
ClassX_Test.f() and ClassX.Test_f() both mapping to ClassX_Test_f).
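The core of the scheme from [18] can be sketched as below (short names only: the full JNI scheme also escapes ';', '[' and non-ASCII characters, and appends argument signatures for overloaded methods).

```java
public class Mangle {
  // Escape one name fragment: '/' (the package separator) becomes '_',
  // while a literal '_' becomes "_1" so the two cases cannot collide.
  private static String escape(String s) {
    StringBuilder out = new StringBuilder();
    for (char c : s.toCharArray()) {
      if (c == '/') out.append('_');
      else if (c == '_') out.append("_1");
      else out.append(c);
    }
    return out.toString();
  }

  // prefix is "Java_" for JNI itself, or the compiler's substitutes
  // (Static_ for statics, empty for methods).
  public static String method(String prefix, String className, String methodName) {
    return prefix + escape(className) + "_" + escape(methodName);
  }
}
```

Under this scheme ClassX_Test.f() becomes ClassX_1Test_f while ClassX.Test_f() becomes ClassX_Test_1f, so the clash produced by the naïve approach disappears.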
APPENDIX C
Command Line Interface
The command line interface to the compiler has a number of optional arguments
that affect its behaviour. These are shown in the table below:

Option          Description
--cuda          Directory into which the CUDA toolkit was installed;
                should contain bin/nvcc.
--jdk           Directory into which the JDK was installed; should con-
                tain an include directory with the JNI header files.
--includes      Directory in which the compiler's include files are stored
                (parallel.h et al.).
--library       Name of the shared library that should be generated by
                the compiler (defaults to libparallel).
--classpath     Paths (separated by :) in which input classes can be
                found.
--output        Output directory for the shared library and modified
                class files.
--log           Log level for feedback, accepting each of the Log4J pos-
                sibilities.
--detect        Dependency checking method. This can be either manual
                (default), auto or combined.
--generate      When specified, the shared library is not compiled and
                the C++ code is saved.
--nonportable   Allows bytecode from the core Java class library to be
                compiled onto the GPU. This may allow more code to be
                compiled, but is not portable between library versions.
Below an example of the compiler output is given, for automatic detection,
with the logging level set to INFO:
bing:dist$ java -jar Parallel.jar --log info --detect auto samples.Mandelbrot
INFO [core]: Considering samples/Mandelbrot.<init>(I)V
INFO [core]: Considering samples/Mandelbrot.compute()V
INFO [loops.detect]: Natural loop on line 62.
INFO [loops.detect]: Natural loop on line 63.
INFO [loops.detect]: Natural loop on line 73.
INFO [loops.trivialise]: Loop has multiple exit points (line 73).
INFO [loops.trivialise]: Trivial loop found (line 62): y#1 (I) <
READ ->samples/Mandelbrot.height [I] {y#1 (I)=1}
INFO [loops.trivialise]: Trivial loop found (line 63): x#2 (I) <
READ ->samples/Mandelbrot.width [I] {x#2 (I)=1}
INFO [check.Basic]: Accepted loop (line 62) based on basic test.
INFO [check.Basic]: Accepted loop (line 63) based on basic test.
INFO [extract]: Kernel of 2 dimensions extracted (line 63).
INFO [extract]: Copy In: [Var#0 (Lsamples/Mandelbrot;), Var#2 (I), Var#1 (I)]
INFO [extract]: Copy Out: [Var#0 (Lsamples/Mandelbrot;)]
INFO [core]: Considering samples/Mandelbrot.main([Ljava/lang/String;)V
INFO [core]: Considering samples/Mandelbrot.output(Ljava/io/File;)V
INFO [loops.detect]: Natural loop on line 90.
INFO [loops.detect]: Natural loop on line 89.
INFO [loops.trivialise]: Trivial loop found (line 89): y#4 (I) <
READ ->samples/Mandelbrot.height [I] {y#4 (I)=1}
INFO [loops.trivialise]: Trivial loop found (line 90): x#5 (I) <
READ ->samples/Mandelbrot.width [I] {x#5 (I)=1}
INFO [check.Basic]: Alias analysis not accurate enough to judge loop (line 89).
INFO [check.Basic]: Alias analysis not accurate enough to judge loop (line 90).
INFO [core]: Considering samples/Mandelbrot.<init>(II)V
INFO [core]: Considering samples/Mandelbrot.run(I)J
APPENDIX D
Sample Code Used
This appendix gives further details on the sample code used in the evaluation.
D.1 Java Grande Benchmark Suite [7]
The suite is split into 3 distinct sections. The first concentrates on testing the
performance of low level operations such as arithmetic, and is not relevant to
this project. The second provides 7 kernel benchmarks, while the third concen-
trates on larger scale applications. A summary of the Section 2 benchmarks
available¹ is given in Table D.1.
Benchmarks that could not be parallelised through use of parallel for loops
were not considered, since the goal was to use unmodified code.
¹ Version 2.0 of the sequential suite was used.

Benchmark   Description                       Used
Series      Fourier coefficient analysis.     ✓
LUFact      LU factorisation.
SOR         Successive over-relaxation.
HeapSort    Integer sorting.
Crypt       IDEA encryption.                  ✓
FFT         Fast Fourier transform.
Sparse      Sparse matrix multiplication.

Table D.1: Summary of Section 2 of the Java Grande Benchmark Suite.
Figure D.1: 3 generations of the Game of Life.
D.2 Mandelbrot Computation
A brief description of the Mandelbrot set is given in Section 4.2.4. The routine
used is from The Computer Language Benchmarks Game². Whilst the bench-
marks are now considered a bad way of comparing performance of languages,
they are still valid when comparing performance of different compilers (or run-
times) for a single language.
The only modification made to the source code was to re-express the
do { ... } while(...); loop as a standard while(...) { ... } loop. This
allows trivialisation of the loop.
D.3 Conway's Game of Life
Conway's Game of Life is a cellular automaton. The evolution of each cell in a
2D grid is described by three simple rules (quoted from [11]), considered with
respect to the cell's eight neighbours (an example evolution is given in Figure
D.1):
1. Survivals: Every counter with two or three neighboring counters survives
for the next generation.
2. Deaths: Each counter with four or more neighbors dies (is removed)
from overpopulation. Every counter with one neighbor or none dies from
isolation.
3. Births: Each empty cell adjacent to exactly three neighbors (no more,
no fewer) is a birth cell. A counter is placed on it at the next move.
The source code used for both the naïve algorithm and Hashlife is that devel-
oped by Dr Andrew Rice for use in a Java programming course³.
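The three rules can be condensed into a single transition function, sketched here for illustration (this is not the course code used in the benchmarks):

```java
public class LifeRule {
  // alive: current state of the cell; n: number of live neighbours (0..8).
  public static boolean next(boolean alive, int n) {
    if (alive) {
      return n == 2 || n == 3;  // rule 1 survives; rule 2 dies otherwise
    } else {
      return n == 3;            // rule 3: birth on exactly three
    }
  }
}
```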
² http://shootout.alioth.debian.org/
³ http://www.cl.cam.ac.uk/teaching/0809/ProgJava/
APPENDIX E
Testing Gold Standards
The gold standard for loop trivialisation is given in the table below. Similar style
checks were made for both loop detection and kernel extraction.
Sample Details of Trivial Loops
Component Benchmarks
Base (Trigonometry) 27 (i < nums.length, i=+1), 34 (i < nums.length, i=+1), 43
(j < nums.length, j=+1)
Static (Statics) 28 (i < nums.length, i=+1), 35 (i < nums.length, i=+1), 44
(j < nums.length, j=+1)
Objects (Objects) 33 (i < nums.length, i=+1), 40 (i < nums.length, i=+1), 49
(j < nums.length, j=+1)
2D (MultiDimension) 27 (k < nums.length, k=+1), 28 (l < nums[0].length,
l=+1), 36 (k < nums.length, k=+1), 37 (l <
nums[0].length, l=+1), 47 (i < nums.length, i=+1),
48 (j < nums[0].length, j=+1)
Java Grande Benchmarks
JGFCryptBench 48 (i < array rows, i=+1)
IDEATest 115 (i < 8, i=+1), 130 (j < array rows, j=+1), 154 (k < 52,
k=+1), 157 (k < 8, i=+1), 174 (i < 52, i=+1), 222 (i < 7,
i=+1), 273 (i < text1.length, i=+8, i1=+8, i2=+8), 291 (r
!= 0, j=-1)
JGFSeriesBench 56 (i < 4, i=+1), 57 (j < 2, j=+1)
SeriesTest 103 (i < array rows, i=+1), 169 (nsteps > 0, nsteps=-1)
Mandelbrot 62 (y < height, y=+1), 63 (x < width, x=+1), 89 (y <
height, y=+1), 90 (x < width, x=+1)
ReverseArray 16 (i < 3, i=+1), 20 (j < 3, j=+1), 24 (i < 3, i=+1)
The majority of benchmarks tested a range of the code generation features.
Since many benchmarks were represented by an object at the top level, this
immediately tested object support. However, several benchmarks were used for
ensuring test coverage of other features:
Statics Tested support for static class fields.
MultiDimension Tested support for arrays, and arrays of arrays.
ReverseArray Tested support for manipulation of references on the GPU.
Objects Tested support for full use of objects, involving modication of multiple
classes and invoking instance methods.
Testing of the automatic dependency analysis could be done against the
@Parallel annotations that were already in place to mark parallel loops.
APPENDIX F
Class Index
SLOC¹ Class Name Relevant Sections
105 analysis.dataflow.Dataflow
57 analysis.dataflow.ReachingConstants 2.7.4
153 analysis.dataflow.LiveVariable 2.7.3, 3.2.4, 3.3.1
330 analysis.dataflow.AliasUsed 3.3.3, 3.3.4
71 analysis.dataflow.IncrementVariables 3.3.2
131 analysis.dataflow.SimpleUsed 3.3.4
7 analysis.dependency.DependencyCheck 3.6
174 analysis.dependency.BasicCheck 3.6.2
32 analysis.dependency.AnnotationCheck 3.6.1
23 analysis.dependency.CombinedCheck
104 analysis.loops.LoopDetector 2.7.2
141 analysis.loops.LoopTrivialiser 3.4.1
40 analysis.loops.LoopNester 2.7.2
16 analysis.AliasMap
35 analysis.BlockCollector
72 analysis.CanonicalState State in 3.3.3
17 analysis.CodeTraverser 3.2.2
25 analysis.InstructionCollector
154 analysis.KernelExtractor 3.5
71 analysis.LooseState LooseState in 3.3.3
80 bytecode.AnnotationImporter
119 bytecode.BlockExporter 3.2.3
462 bytecode.InstructionExporter 3.2.3
99 bytecode.ClassImporter
76 bytecode.ClassExporter
626 bytecode.MethodImporter 3.2.3
81 bytecode.ClassFinder
320 cuda.Helper Appendix B
258 cuda.CppGenerator 3.7.1
108 cuda.BlockExporter 3.7.1
182 cuda.CUDAExporter 3.7
40 cuda.Beautifier 3.7
60 debug.ControlFlowOutput e.g. Example 3.8
20 debug.LinePropagator
32 exceptions.UnsupportedInstruction
1108 graph.instructions.* Table 3.1
10 graph.state.State
57 graph.state.ArrayElement
69 graph.state.Variable
75 graph.state.Field
49 graph.state.InstanceField
36 graph.Annotation 3.2
85 graph.BasicBlock 3.2.1
64 graph.Block 3.2.1
22 graph.BlockVisitor 3.2.2
152 graph.ClassNode 3.2
39 graph.CodeVisitor 3.2.2
71 graph.Kernel
29 graph.Loop 3.2.1
111 graph.Method 3.2
123 graph.Modifier 3.2
32 graph.TrivialLoop 3.2.1, 3.4.1
254 graph.Type 3.2.4
36 tools.Benchmark
202 tools.Parallelise 3.8
9 tools.Restrict
10 tools.Parallel
51 util.Utils
28 util.EquatableWeakReference
23 util.ConsList
16 util.MapIterable
25 util.Tree
54 util.WeakList 3.2.1
23 util.TransformIterable
10 parallel.h
50 parallel/launch.h 3.7.2
24 parallel/types.h
212 parallel/memory.h 3.7.3
206 parallel/transfer.h 3.7.3
7686 (total)
¹ As calculated by SLOCCount: http://www.dwheeler.com/sloccount/.
APPENDIX G
Source Code Extract
/*
 * Parallelising JVM Compiler
 * Part II Project, Computer Science Tripos
 *
 * Copyright (c) 2009, 2010 Peter Calvert, University of Cambridge
 */

package analysis.dependency;

import graph.Annotation;
import graph.Method;
import graph.TrivialLoop;
import graph.Type;

import java.util.Collections;
import java.util.List;

import org.apache.log4j.Logger;

/**
 * Checks dependencies based on annotations on the containing method.
 */
public class AnnotationCheck implements DependencyCheck {
  /**
   * Names of loop indices that should be run in parallel in the current
   * context.
   */
  private List<String> loopIndices;

  /**
   * Sets the context in which loops should be considered.
   *
   * @param method Method in which loops that follow are contained.
   */
  @Override
  public void setContext(Method method) {
    Annotation annotation = method.getAnnotation(
      Type.getObjectType("tools/Parallel")
    );

    if (annotation == null) {
      loopIndices = Collections.emptyList();
    } else {
      loopIndices = (List<String>) annotation.get("loops");
    }
  }

  /**
   * Checks whether it is safe to execute the given <code>TrivialLoop</code> in
   * parallel based on the name of the loop index.
   *
   * @param loop Trivial loop to check.
   * @return <code>true</code> if safe to run in parallel,
   *         <code>false</code> otherwise.
   */
  @Override
  public boolean check(TrivialLoop loop) {
    if (loopIndices.contains(loop.getIndex().getName())) {
      Logger.getLogger("annotation").info("Accepted " + loop + " for parallelisation.");
      return true;
    } else {
      Logger.getLogger("annotation").info("Rejected " + loop + " for parallelisation.");
      return false;
    }
  }
}
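For illustration, the following sketch shows how a developer might mark a method for <code>AnnotationCheck</code> to accept its loop. The exact definition of the <code>tools.Parallel</code> annotation is not part of this extract; the <code>loops</code> element and its meaning are inferred from the code above, so this reconstruction is an assumption.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical reconstruction of tools.Parallel: AnnotationCheck reads a
// "loops" element listing the names of loop indices that are safe to
// parallelise, so a String[] element of that name is assumed here.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Parallel {
  String[] loops();
}

public class Example {
  // The developer asserts that iterations over index "i" are independent.
  @Parallel(loops = {"i"})
  static void scale(float[] data, float factor) {
    for (int i = 0; i < data.length; i++) {
      data[i] *= factor;
    }
  }

  public static void main(String[] args) {
    float[] data = {1.0f, 2.0f, 3.0f};
    scale(data, 2.0f);
    System.out.println(java.util.Arrays.toString(data));
  }
}
```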
APPENDIX H
Project Proposal
Peter Calvert
Trinity College
prc33
Computer Science Tripos Part II Individual Project Proposal
Parallelisation of Java for Graphics Processors
October 22, 2009
Project Originator: Peter Calvert
Resources Required: See attached Project Resource Form
Project Supervisors: Dr Andrew Rice and Dominic Orchard
Signatures:
Directors of Studies: Dr Arthur Norman and Dr Sean Holden
Signatures:
Overseers: Dr David Greaves and Dr Marcelo Fiore
Signatures:
Introduction and Description of the Work
In the past, improvements in computational performance have taken the form
of higher clock speeds. However, more recently, increased performance has come
from the use of multiple processors, to solve independent parts of a problem in
parallel.
Graphics processors (GPUs) are a good example of this, and are commonly architected as stream processors, meaning that they can apply the same set of instructions across a grid in parallel. As a result of this, there has been significant recent interest in using them for more general computation. In particular, they are suited to running loops in parallel.
However, it is a well-known problem that developers find it hard to reason about the interactions of code running in parallel. Furthermore, most existing code is sequential, and thus there are no performance gains from executing it on parallel architectures. It must be recompiled, or in some cases rewritten, to benefit.
Automatic parallelisation aims to address this by analysing existing sequential
code, and identifying areas that can be run in parallel.
This project aims to make it possible to utilise parallel processors by compiling
appropriate loops for GPU execution. Initially, developer input will be required
to determine whether the conversion maintains correctness. However, as the
project develops, it is hoped that some of these decisions can be automated. The
project will be evaluated both by the performance gain resulting from parallel
computation, and also by the scope of the analyses made.
The compilation will be made from Java Virtual Machine (JVM) bytecode, since it is possible to compile a number of languages¹ for it (including Ruby, Python and Scala). It is also relatively simple, and libraries exist to aid in its analysis².
The Low Level Virtual Machine (LLVM) would have been a viable alternative
for similar reasons, but was dismissed due to lack of familiarity.
The target of the compilation will be NVIDIA's devices, due to the complete framework (CUDA) that they have made available to allow GPU kernels to be written alongside CPU code, which will make development easier. A more standardised approach, OpenCL, is still at the draft stage.
While it is in general undecidable whether a loop's iterations are independent, there are solutions given certain constraints that could be introduced. There are also transformations that could be applied beforehand to remove some dependencies. A major difficulty often experienced relates to checking whether variables alias, so this will be left as a check for the user to make. These
¹ http://en.wikipedia.org/wiki/List_of_JVM_languages
² ASM (http://asm.ow2.org/)
automatic extensions could be evaluated in terms of the accuracy of their analysis,
and also the proportion of loops in sample code that they can consider.
Resources Required
Access will be needed to a suitable graphics processor that supports the NVIDIA
CUDA architecture. However, during development it will be possible to use the
emulation mode included in the NVIDIA development tools.
Starting Point
This project will be undertaken starting from the following knowledge and experience:
General knowledge of JVM bytecode from the Part IB course Compiler Construction.
Successful compilation and execution of a couple of CUDA examples under
the emulation environment.
Rudimentary code put together during the first week of Michaelmas term that produces an unrefined graph of JVM code using the ASM library, and then detects loops in this.
Preliminary reading over the long vacation into compiler optimisation techniques and dependency analysis.
Further knowledge will be gained during Michaelmas term of Part II from the
Optimizing Compilers course.
Substance and Structure of the Project
In order to allow any compilation or analysis to occur, the Java bytecode must first be read in and represented in a suitable structure for both control and data flow analysis. This will be a graph of basic blocks, within each of which a data flow graph will be contained. To allow the compiler, analysis and transformers to traverse the structure, a variant of the visitor pattern should be implemented.
The project can then be divided into the following stages; starred items are being considered as possible extensions rather than core parts:
1. Detection of loops within the control flow graph (JVM bytecode represents control flow in an unstructured manner) and insertion of the appropriate loop nodes. This can be done using analysis of each basic block's dominators.
2. Wrappers that can transfer the various JVM primitive types and arrays to the GPU. This would be done using the Java Native Interface (JNI).
At this stage it is also necessary to be able to invoke the kernels over the
required dimensions, converting these into a suitable grid of blocks for the
size of GPU available.
3. Compilation of loop bodies for execution on a NVIDIA CUDA compatible
GPU. Since NVIDIA already provide a C compiler for this, the simplest
approach here is to generate C code from JVM bytecode.
4. Detection of which variables need to be passed into the CUDA kernel.
5. Transformation of the Java class to use the relevant wrappers in place of
the original loop code.
* Automatic detection of the loop variable and its bounds rather than prompts to the user. This will be characterised as the variable that is used in the exit condition, and which is written to only by a single INCREMENT instruction on each iteration (this instruction also accepts negative increments for the case of a decrementing loop).
* Basic dependency analysis of variable and field usage; for array accesses, the relatively simple GCD test should be used (allowing analysis where array usages are of the form ax + b).
* Support for compiling object oriented JVM code to CUDA C.
* Loop-invariant code motion: this is a common optimisation used by most compilers; however, since the code here is split and passed to two separate compilers, neither has scope to move code from inside the loop to the outside.
* Runtime checks for aliases and regularly shaped arrays.
* A constrained version of loop fission (or loop distribution) in which we
require that the loop body does not contain conditional blocks (i.e. just
sequential instructions and nested loops). This splits existing loops into
multiple loops, so that at least some of these can be run in parallel, even if
the combined loop could not.
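The GCD test mentioned above can be sketched as follows. This is an illustrative implementation, not taken from the project's source: for subscripts a1·x + b1 and a2·y + b2, the same array element can be touched only if the equation a1·x − a2·y = b2 − b1 has an integer solution, which requires gcd(a1, a2) to divide b2 − b1.

```java
// Illustrative sketch of the GCD dependence test for array subscripts of
// the form a*x + b (not the project's actual code). If gcd(a1, a2) does
// not divide b2 - b1, the two accesses can never coincide, so the loop
// carries no dependence between them.
public class GcdTest {
  static int gcd(int a, int b) {
    a = Math.abs(a);
    b = Math.abs(b);
    while (b != 0) {
      int t = a % b;
      a = b;
      b = t;
    }
    return a;
  }

  /** Returns false only when the two subscripts can never coincide. */
  static boolean mayDepend(int a1, int b1, int a2, int b2) {
    int g = gcd(a1, a2);
    if (g == 0) {
      // Both subscripts are constants: they coincide exactly when equal.
      return b1 == b2;
    }
    return (b2 - b1) % g == 0;
  }

  public static void main(String[] args) {
    // a[2*i] vs a[2*i + 1]: gcd(2, 2) = 2 does not divide 1, so the
    // accesses are independent and the loop may be parallelised.
    System.out.println(mayDepend(2, 0, 2, 1));
    // a[2*i] vs a[4*i + 6]: gcd(2, 4) = 2 divides 6, dependence possible.
    System.out.println(mayDepend(2, 0, 4, 6));
  }
}
```

Note that the test is conservative: when it reports a possible dependence, the accesses may still never overlap within the loop's actual bounds.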
Using existing code from a benchmark suite³, as well as other code that can be sourced, an evaluation will then be drawn up of the performance gains that can be achieved. Additionally, these gains will be compared with those made by hand-written parallel versions of some of the benchmarks. The success of the automatic checks at detecting safe loops will also be evaluated. Where safe loops were not detected as such, it will be noted (when obvious) what further analysis or transformations may have helped. This could then be used to guide any future work.
Success Criteria
The core parts of the project will have been a success if:
1. Existing Java code (that has had GPU areas manually marked) can be run
using CUDA hardware, producing the same results.
2. The performance of CUDA-enabled benchmarks can be compared to their
original running time, and also to the running time when the conversion to
CUDA code is done by hand.
3. In some cases, an overall speed-up can be found. However, this will not
always be possible due to the transfer overhead associated with using the
GPU. Given sufficiently large problem sizes, this overhead should become negligible.
The automatic detection extension to the project will have been a success if
common dependency analysis techniques can be evaluated based on their ability
to detect loops that are safe for parallelisation.
Timetable and Milestones
The timeline below is structured into two-week slots. In allocating work to slots, there were several aims in mind:
To have a general structure in place that allows independent testing of
separate components as early as possible.
³ Java Grande (http://www2.epcc.ed.ac.uk/computing/research_activities/java_grande/sequential.html)
To attempt the most difficult and risky parts of the project early on, so that there is plenty of recovery time if problems do arise.
To implement all required features and evaluate these before extensions are
incorporated.
To write a draft dissertation as the work is done, rather than leaving it as
a big job for the end.
Slot 0: 1st October to 16th October
Discuss with Researchers, Overseers and Director of Studies the feasibility
of the project idea, along with background reading to assess the existing
work in the area, and the quantity of work entailed.
Arrange with Project Supervisors a schedule of meetings to ensure the
project stays on track.
Organise access to equipment for the project (i.e. a capable computer with
CUDA GPU), as well as setting up a regular backup system.
Milestones: Project proposal and availability of CUDA GPU.
Slot 1: 17th October to 30th October
Experiment with CUDA and gain familiarity with what it can do.
Rework the preliminary flow-graph-producing code, taking more care over the data structure.
Based on the algorithms being used, implement traversal facilities for the flow graph that give easy access to the relevant information and structure.
Rework the preliminary loop detection code using the structure from above.
Milestone: Be able to read in JVM class files and represent both the control flow and data flow inherent in them, recovering loop structures.
Slot 2: 31st October to 13th November
Produce code that can transfer primitive Java types and also arrays onto a
GPU.
Produce code that can invoke a compiled CUDA kernel from Java.
Milestone: Implementation of all required CUDA wrappers in JNI.
Slot 3: 14th November to 27th November
Produce code that can detect which variables need to be transferred to and
from the GPU for a given block of code.
Produce code that generates valid CUDA C for a given section of JVM
bytecode.
Milestones: Be able to detect which variables need to be transferred to and
from the GPU, and be able to generate CUDA C from bytecode.
Slot 4: 28th November to 11th December
Use this time to consolidate and tidy up any loose ends in the code, and test it
on a wider range of JVM bytecode.
Due to end of term events and also a ski holiday (4th to 13th December), less
work has been scheduled for this slot.
Slot 5: 12th December to 25th December
Tie components together to be able to produce rewritten class files that invoke GPU kernels rather than the original loops.
Start drafting a dissertation for the core parts of the project, using notes
made whilst this was implemented in slots 1 to 3.
Milestones: Core implementation complete, and dissertation with most struc-
ture drafted along with content for the core preparation/implementation.
Slot 6: 26th December to 8th January
Catch-up time to fix non-critical bugs that have been put off during previous slots.
Source as many benchmarks and suitable applications written in JVM languages as possible (ideally containing a couple of hundred loops in total across all the code).
Work out which loops in the collected benchmark code are safe.
Evaluate the performance improvements from the CUDA compilation for
the benchmark code.
Milestone: Extensive set of benchmarks for CPU and CUDA versions.
Slot 7: 9th January to 22nd January
Manually produce CUDA versions of some of the benchmarks, and add the
performance of these to the evaluation.
Start writing the evaluation section of dissertation based on the results.
Decide on whether to implement extensions, and if so how much of the
automated detection to attempt.
Slot 8: 23rd January to 5th February
Prepare the required progress report and the accompanying presentation.
Work on extensions / catch up.
Milestones: Progress report and presentation.
Slot 9: 6th February to 19th February
Further extensions and catch up time.
Milestone: Complete code base.
Slot 10: 20th February to 5th March
Update the dissertation with details of any extension work, and prepare it to
draft standard (based on the work already achieved).
Milestone: Complete draft dissertation.
Slot 11: 6th March to 19th March
End of Lent term / Easter holiday, emphasis on revision.
Slots 12 and 13: 20th March to 16th April
Easter holiday, emphasis on revision.
Slot 14: 17th April to 30th April
This coincides with the beginning of Easter term. This time will be spent finalising the dissertation and proofreading it.
Milestone: Printed dissertation ready to hand in.
Slot 15: 1st May to 14th May
This slot ends with the final deadline for the dissertation. It is intended that this slot won't be used, and therefore it provides some buffer time for any serious issues.