The Fifth International Symposium on Computational Wind Engineering (CWE2010)
Chapel Hill, North Carolina, USA, May 23-27, 2010
ABSTRACT: Recent advances in Graphics Processing Unit (GPU) architecture offer tremendous potential for applications in computational wind engineering. GPU computing is more cost effective than traditional distributed parallel computing using MPI. For distributed parallel computing with a Preconditioned Conjugate Gradient (PCG) solver under MPI, the speed-up is about 4-12 with an efficiency of 20 to 70% (Sarkar, 2008), and the computing cluster is relatively expensive. A GPU computing desktop can be assembled for about $1300 and can achieve a speed-up of 10 to 20 relative to the CPU for solving wind engineering problems. Several solvers suitable for implementation on the GPU are surveyed and compared with serial CPU performance. Furthermore, methods to combine FORTRAN and C programs are illustrated for swift implementation of GPU computing for the portions of a FORTRAN code which need acceleration.
1 INTRODUCTION
Time dependent phenomena in wind engineering require high performance computing. Current parallel computing techniques for CPU platforms achieve an efficiency of 10 to 20% (using MPI), while the cost of the system varies from $50K for 40 processors to close to a million dollars for 500 or more processors. Recent graphics cards, costing around $300 and containing 240 processors, can be used for parallel computing. A complete GPU computing system can be acquired for $1000 to $10,000 and can achieve speed-ups in the range of 10 to 100. This is an economical and viable computing option available in a personal computer. Instead of achieving parallelism in the traditional distributed form, inexpensive shared memory computing can be realized using GPU computing.
GPU computing can be accomplished using software development toolkits such as CUDA (NVIDIA, 2008) or OpenCL (Khronos, 2008). CUDA, an extension to the C language, is currently a popular choice for programming GPUs; however, it is a closed, proprietary parallel programming environment supporting only NVIDIA GPUs. OpenCL (Open Computing Language) is an open standard for parallel computing on CPUs, GPUs, and hardware accelerators using the same source code. OpenCL addresses the limitations of a proprietary environment by being an open standard supported by multiple suppliers, and by supporting CPUs and hardware accelerators (CELL, DSPs) as compute devices alongside GPU devices. With OpenCL, developers can use a single standard toolset and language to target all currently available parallel processors, and the same code can be adapted to future parallel computing processors.
CUDA was selected for the applications discussed herein because OpenCL is still new, with approved implementations only available in late 2009. A major challenge in implementing CUDA code is selecting numerical algorithms which are suitable for GPU computing. In serial computing, preconditioned conjugate gradient (PCG) (Selvam, 1996) or multigrid (MG) solvers are widely used. In GPU computing, the serial incomplete Cholesky decomposition preconditioner cannot be used. A systematic review of numerical techniques that can be used to solve the momentum and pressure equations will be presented, with performance comparisons against
serial computing implementations. The other challenge in GPU computing at this time is the C language requirement. Since most existing codes are in FORTRAN, converting them to C takes considerable time and effort. So, for a quick increase in the performance of existing codes, a method to combine FORTRAN and GPU computing using C is illustrated with a simple example. The performance of the GPU computing solutions is compared with a complete serial FORTRAN code for a bridge flow problem.
In this work, an existing computer model (Selvam, 2002) to study flow around bridges is considered as a benchmark for GPU computing. The existing 2D model was modified for 3D flow and implemented in parallel using the CUDA toolkit. The computer model is based on the finite difference method (FDM), and the CUDA tools are used to compile the program. In the solution of the Navier-Stokes (NS) equations, the pressure solution takes more than 80% of the computer time, so an efficient solver for the pressure equation on the GPU is very important. The performance of different solvers is compared and an efficient solver for GPU computing is proposed.
There are two major issues to be considered in GPU computing when using global memory. The first issue is the manner in which the coefficients are stored; minimizing memory conflicts improves code performance. In the case of the FDM, storing the coefficients ap, ae, aw, etc. consecutively, each in its own contiguous vector, improves performance. Storing the coefficients in CSR form (Saad, 1996), in which the ap, ae, etc. of each point are contiguous, is not efficient on the GPU. In this work the coefficients ap, ae, aw, etc. are each stored consecutively in their own vector. The second issue involves understanding the architecture of NVIDIA GPUs: performance is improved by assigning the number of threads in a block in multiples of 16 (called a half-warp) in each direction. For the solution of Poisson's equation the unknowns are stored in a 128x128x128 mesh, and hence they are divisible by 16. Due to space considerations the performance differences for these issues are not discussed here; for details refer to the internal report (Selvam, 2010).
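The difference between the two storage layouts can be made concrete with a small C sketch (the function and variable names below are illustrative, not from the original code). With one contiguous vector per coefficient, consecutive grid points, which map to consecutive GPU threads, read consecutive addresses and the accesses coalesce; with the CSR-like interleaved form, consecutive threads are a full stencil width apart in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Layout used in this work: one contiguous vector per coefficient.
   Thread t reads ap[t], ae[t], ... -- neighbouring threads touch
   neighbouring addresses, which coalesces on the GPU.               */
static size_t idx_per_coeff(size_t point, size_t npoints, size_t coeff) {
    return coeff * npoints + point;   /* all ap first, then all ae, ... */
}

/* CSR-like layout: the 7 stencil coefficients of each point stored
   together.  Neighbouring threads are then 7 entries apart, giving
   strided (uncoalesced) access.                                     */
static size_t idx_interleaved(size_t point, size_t ncoeff, size_t coeff) {
    return point * ncoeff + coeff;
}
```

For a 7-point stencil the interleaved layout forces a stride of 7 between the reads of adjacent threads, which is the behaviour the text identifies as inefficient on the GPU.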
The CUDA GPU directly supports only two-dimensional blocks; therefore the third dimension needs to be handled explicitly. The implementation is illustrated very well in Micikevicius (2009), and that procedure is used in this work for the z-dimension.
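The traversal pattern can be sketched serially in plain C (the array name, grid size, and the 7-point Laplacian with unit spacing are all illustrative). On the GPU, each thread of a 2D block owns one (i,j) column and marches explicitly along z, keeping its z-neighbours in registers; here the two outer loops stand in for the 2D block.

```c
#include <assert.h>

#define N 8   /* demo grid points per direction */

/* Serial sketch of the Micikevicius (2009) traversal: each (i,j)
   "thread" marches along z, reusing the z-neighbour values instead of
   reloading them, and writes the 7-point Laplacian (unit spacing).   */
static void laplacian_zmarch(const double *u, double *out) {
    for (int j = 1; j < N - 1; j++)
      for (int i = 1; i < N - 1; i++) {        /* one "thread" per (i,j) */
        double below  = u[i + N*(j + N*0)];    /* plane k-1 */
        double centre = u[i + N*(j + N*1)];    /* plane k   */
        for (int k = 1; k < N - 1; k++) {      /* explicit z march */
            double above = u[i + N*(j + N*(k+1))];
            out[i + N*(j + N*k)] =
                u[(i-1) + N*(j + N*k)] + u[(i+1) + N*(j + N*k)]
              + u[i + N*((j-1) + N*k)] + u[i + N*((j+1) + N*k)]
              + below + above - 6.0 * centre;
            below  = centre;                   /* slide the z window */
            centre = above;
        }
      }
}
```

Applied to the discrete field u = i^2 + j^2 + k^2, this stencil returns exactly 6 at every interior point, mirroring the continuous test problem used below.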
A Poisson equation of the form Uxx + Uyy + Uzz = 6, where the subscripts denote differentiation, is solved in the unit cube. The exact solution is U = x^2 + y^2 + z^2, and this value of U is specified on the entire boundary (Dirichlet boundary condition). The equations are iterated until the absolute sum of the residual is less than 10^-4 times the number of unknowns. The number of points in each of the x, y and z directions is 128. The speed-up is calculated as:
Speed up (SU) = cpu time (serial code) / gpu time (parallel code)
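The benchmark problem and its stopping criterion can be reproduced serially in a few lines of C. The sketch below uses a much smaller 17^3 grid (so it runs in a moment) and plain Gauss-Seidel in place of the GPU solvers; the grid size and solver choice are illustrative, not the paper's configuration.

```c
#include <assert.h>
#include <math.h>

#define M 17                      /* grid points per direction (demo size) */
#define AT(i, j, k) ((i) + M*((j) + M*(k)))

/* Solve Uxx + Uyy + Uzz = 6 on the unit cube, with the exact solution
   U = x^2 + y^2 + z^2 prescribed on the boundary, iterating until the
   absolute sum of the residual falls below 1e-4 times the number of
   unknowns (the paper's criterion).  Returns the maximum error against
   the exact solution.                                                  */
static double solve_poisson(double *u) {
    double h = 1.0 / (M - 1), h2 = h * h;
    long nunk = (long)(M - 2) * (M - 2) * (M - 2);
    double res;
    int sweeps = 0;
    do {
        res = 0.0;
        for (int k = 1; k < M - 1; k++)
          for (int j = 1; j < M - 1; j++)
            for (int i = 1; i < M - 1; i++) {
                double nb = u[AT(i-1,j,k)] + u[AT(i+1,j,k)]
                          + u[AT(i,j-1,k)] + u[AT(i,j+1,k)]
                          + u[AT(i,j,k-1)] + u[AT(i,j,k+1)];
                double r = nb - 6.0 * u[AT(i,j,k)] - 6.0 * h2;
                res += fabs(r);
                u[AT(i,j,k)] += r / 6.0;      /* Gauss-Seidel update */
            }
    } while (res > 1e-4 * nunk && ++sweeps < 100000);
    double err = 0.0;
    for (int k = 0; k < M; k++)
      for (int j = 0; j < M; j++)
        for (int i = 0; i < M; i++) {
            double x = i * h, y = j * h, z = k * h;
            double e = fabs(u[AT(i,j,k)] - (x*x + y*y + z*z));
            if (e > err) err = e;
        }
    return err;
}
```

Because the 7-point stencil is exact for quadratic solutions, the discretization error is zero and the remaining error reflects only the iteration tolerance.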
The computer time was measured using the command $ time ./a.out in the Linux terminal. This gives the total time taken by the CPU and GPU together to perform the work, including the overhead of allocating and copying data to/from the GPU global memory. The computation is done using
double precision for comparison. At this time double precision calculation is less efficient on current GPUs compared with single precision; more recent NVIDIA hardware performs better than what is reported here. The method for compiling with the CUDA double precision option is explained in the Appendix.
For performance comparison, red-black Gauss-Seidel (RB-GS) or Successive Over-Relaxation (RB-SOR), multigrid using GS (MG-GS), and Jacobi Conjugate Gradient (JCG) solvers are considered. These solvers can be easily implemented for both CPU and GPU computing. The incomplete Cholesky decomposition preconditioned CG (PCG) can only be implemented in serial code, and this is taken as the baseline CPU time for comparison purposes. Line solvers are not suitable for GPU computing using normal programming methods; however, different techniques can be used to implement them on the GPU, as discussed by Zhang et al. (2010). For MG-GS a block size of 8x32 for larger grids and 4x4 for smaller grids is attempted. In the MG solver the number of intervals in each direction should be divisible by 2, implying the number of grid points would be 129. This does not match the GPU preferred block size and hence the performance is affected; techniques to adjust for this issue in the code are considered. Since the RB-GS solver considers only half the points in a plane during one loop, its GPU performance is lower compared to Jacobi iterations. In Table 1 the speed-up and computer time for each solver are reported. The relaxation factor used in the GS solvers is reported as rf. The extensions .c and .f denote C and FORTRAN programs respectively, and the extension .cu denotes GPU CUDA programs.
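The red-black ordering can be sketched in serial C on a 1D model problem (the 1D reduction and all names are illustrative). Points of one colour depend only on the other colour, so each half-sweep is fully parallel; but a GPU kernel launched over all points does useful work at only half of them, which is why RB-GS trails Jacobi in raw GPU utilisation.

```c
#include <assert.h>
#include <math.h>

#define NP 17   /* 1D demo grid, boundaries at indices 0 and NP-1 */

/* One red-black Gauss-Seidel sweep for u'' = f on a uniform 1D grid:
   update the even-indexed ("red") unknowns first, then the odd
   ("black") ones.  Within each colour every point depends only on the
   other colour, so the half-sweep is trivially parallel.            */
static void rb_sweep(double *u, const double *f, double h2) {
    for (int colour = 0; colour < 2; colour++)
        for (int i = 1; i < NP - 1; i++)
            if (i % 2 == colour)
                u[i] = 0.5 * (u[i-1] + u[i+1] - h2 * f[i]);
}
```

Repeated sweeps converge to the discrete solution; for u'' = 2 with u = x^2 on the boundary the discrete solution is exactly x^2.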
From Table 1, the speed-up from CPU to GPU varies from 8 to 24. The maximum speed-up is achieved using the JCG solver. MG solvers on the CPU are generally better than PCG solvers if proper iterative smoothers are used; these serial smoothers cannot be used in GPU computing, and hence only MG-GS is attempted. Additionally, the MG solution moves from one grid to another, which is not a preferred access pattern in GPU computing. Hence, at this time, either RB-SOR or JCG is very competitive. The drawback of the SOR solver is determining the optimal relaxation factor for the problem, so the engineer who runs the code may have to supply the rf factor. JCG requires no such factor and hence is not dependent on the user.
The results show that JCG is the preferable solver: it achieved a speed-up of 24 compared with the same solver on the CPU, and an overall speed-up of 7 relative to the CPU baseline PCG solver. An overall speed-up of 7 relative to CPU computing, for the minimal cost of a graphics card, is a cost-effective computing solution. On MPI distributed platforms with an efficiency of 30% per processor, more than 20 processors would be needed to achieve comparable performance. So, for computational problems where serial work is not required, GPU computing is a preferable technique. In these GPU codes only global memory access was utilized; further improvement of at least 10 to 20% additional speed-up can be achieved if shared memory optimizations are used (Nickolls, 2008).
Table 1. Computer time and speed-up for the solvers considered.
_____________________________________________________________________________
Prog. Name   Comp. Time   Speed up   Overall    Comments & other details
                                     Speed up
_____________________________________________________________________________
Block size 32x8
Mg-gs.cu     5.8 s        14         2.7        MG-GS-R-B, mgl=3, rf=1.0, niter=131x4
Mg-gs.cu     5.9 s        8          2.6        MG-GS-R-B, mgl=4, rf=1.0, niter=135x4
Block size 16x16
Mg-gs.cu     6.6 s        12.3       2.4        MG-GS-R-B, mgl=3, rf=1.0, niter=131x4
Mg-gs.cu     6.7 s        12         2.4        MG-GS-R-B, mgl=4, rf=1.0, niter=135x4
_____________________________________________________________________________
4 CONCLUSIONS
GPU computing techniques for wind engineering applications were introduced. In the solution of the NS equations, the pressure equation takes more than 80% of the computer time, and solvers suitable for the pressure solution were compared using GPU implementations. The JCG solver is currently the most viable solver, with red-black SOR the next most suitable. For the benchmark problem, JCG achieved a speed-up of 24 relative to the serial CPU code and 7 relative to the PCG baseline solver on the CPU.
Since GPU computing code is written in the C language, it takes extensive time to convert FORTRAN codes to C. An alternative approach of combining FORTRAN and C for GPU computing is illustrated with examples. Additional performance may be achieved by using multiple GPU devices (up to 4) in a computer system; however, this also requires partitioning the solvers across multiple kernels. For large problems, combining GPU (shared memory) with MPI (distributed memory) parallel computing is a viable alternative. At this time GPU computing is an inexpensive alternative to traditional distributed parallel computing.
5 ACKNOWLEDGEMENTS
We wish to acknowledge Professor Antti Hellsten of the Finnish Meteorological Institute for his comments, which helped improve the paper. We also wish to acknowledge the support provided by Mr. Kendell Connor, Computer Technician, Department of Civil Engineering, University of Arkansas, in purchasing the graphics hardware and assembling the GPU computing platform in a short time for a cost of $1300.
6 REFERENCES
Barrett, R. et al., 1994, Templates for the solution of linear systems, SIAM, Philadelphia. Also available on the web: http://www.netlib.org/templates/
Cohen, J.F. and Molemaker, M.J., 2009, A fast double precision code using CUDA, Proceedings of Parallel CFD 2009.
Khronos, 2008, OpenCL, http://www.khronos.org/opencl
Micikevicius, P., 2009, 3D finite difference computation on GPUs using CUDA, Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units (Washington, D.C., Mar. 8), GPGPU-2, Vol. 383, ACM, New York, NY, pp. 79-84.
Nickolls, J. et al., 2008, Scalable parallel programming, ACM Queue, March/April, pp. 40-53.
NVIDIA, 2008, NVIDIA Corporation, CUDA programming guide, Version 2.3.
Saad, Y., 1996, Iterative methods for sparse linear systems, PWS Publishing Co., Boston.
Sarkar, 2008, 3-D multiphase flow modeling of spray cooling using parallel computing, Dissertation, University of Arkansas.
Selvam, R.P., 1996, Computation of flow around Texas Tech building using k-ε and Kato-Launder k-ε turbulence model, Engineering Structures, 18, pp. 856-860.
Selvam, R.P., 2002, Computer modeling for bridge aerodynamics, in Wind Engineering, K. Kumar (Ed.), Phoenix Publishing House, New Delhi, India, pp. 11-25.
Selvam, R.P., 2010, Parallel computing using GPU for CFD applications, Department of Civil Engineering, University of Arkansas, Fayetteville, AR 72701, email: rps@uark.edu.
Thibault, J.C. and Senocak, I., 2009, CUDA implementation of Navier-Stokes solver on multi-GPU desktop platforms for incompressible flows, AIAA-2009-758, 47th AIAA Aerospace Sciences Meeting, pp. 1-15, Jan. 5-8, Orlando, Florida.
Zhang, Y., Cohen, J. and Owens, J.D., 2010, Fast tridiagonal solvers on the GPU, PPoPP'10, January 9-14, 2010, Bangalore, India.
7 APPENDIX-FORTRAN+GPU COMPUTING
Usually the pressure solution part of the Navier-Stokes equations takes 80% of the computing time at each time step; hence this portion is a suitable candidate to accelerate using GPU computing. From the FORTRAN code, the GPU code is called in the usual manner as a subroutine. The points to be noted are: 1. In FORTRAN the variables are used as two- or three-dimensional arrays, while in C they are one-dimensional, so careful attention should be paid to translating the indices from one to the other. 2. In FORTRAN, arrays range from 1 to N, whereas in C they range from 0 to N-1 for a variable dimensioned N. In the example, the arrays are indexed from 1 to N in both C and FORTRAN to reduce this conflict.
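The index translation in point 1 can be written down explicitly. A FORTRAN array declared A(NI,NJ,NK) is column-major and 1-based; a C routine receiving the same memory as a flat pointer can recover element A(i,j,k) with the macro below (the macro name is illustrative):

```c
#include <assert.h>

/* FORTRAN element A(i,j,k) of an array declared A(ni,nj,nk) sits at
   flat (0-based) offset (i-1) + ni*((j-1) + nj*(k-1)): the first
   index varies fastest (column-major storage).                      */
#define F_IDX(i, j, k, ni, nj) \
    (((i) - 1) + (ni) * (((j) - 1) + (nj) * ((k) - 1)))
```

The first index varying fastest is the opposite of C's row-major convention, which is exactly the conflict point 1 warns about.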
The C CUDA GPU program is developed in the normal manner, except that the main function is renamed kernel_add_ as shown in the listing. To allow switching between double and single precision floating point variables, the REAL type is defined using #define, so the precision can be changed by editing one line of the program. In this example REAL is defined as double. The program is named cuadd.cu; the extension for CUDA source code is .cu.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

//#define REAL float
#define REAL double

// simple kernel function that adds two vectors (1-based indexing)
__global__ void vect_add(REAL *a, REAL *b, REAL *c, int n)
{
  int idx = threadIdx.x + 1;
  if (idx <= n) c[idx] = a[idx] + b[idx];
} // end vect_add

// function called from the main FORTRAN program
extern "C" void kernel_add_(REAL *a, REAL *b, REAL *c, int *nd)
{
  int n = *nd, itot = n + 2, size;
  REAL *a_d, *b_d, *c_d;   // device (GPU) vectors
  int NBX = 1, N = n;      // one block of N threads on the GPU
  // allocate memory on the GPU
  size = itot * sizeof(REAL);
  cudaMalloc((void **)&a_d, size);
  cudaMalloc((void **)&b_d, size);
  cudaMalloc((void **)&c_d, size);
  // copy input vectors from CPU to GPU
  cudaMemcpy(a_d, a, size, cudaMemcpyHostToDevice);
  cudaMemcpy(b_d, b, size, cudaMemcpyHostToDevice);
  // launch the kernel on the GPU
  vect_add<<<NBX, N>>>(a_d, b_d, c_d, n);
  // copy the result back from GPU to CPU
  cudaMemcpy(c, c_d, size, cudaMemcpyDeviceToHost);
  // free GPU memory
  cudaFree(a_d);
  cudaFree(b_d);
  cudaFree(c_d);
}
First, cuadd.cu is compiled to create the object file cuadd.o using the CUDA nvcc compiler driver; the double precision option is invoked with the compiler flag -arch=sm_13. Once cuadd.o is created, the FORTRAN code is linked with the CUDA object using the following commands:
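As a sketch, assuming CUDA is installed under /usr/local/cuda and the FORTRAN driver file is named main.f (both assumptions, not taken from the original listing), the sequence would be:

```shell
# compile the CUDA source to an object file, with double precision enabled
nvcc -c -arch=sm_13 cuadd.cu
# link the FORTRAN driver against the CUDA object and the CUDA runtime
gfortran -O3 main.f cuadd.o -L/usr/local/cuda/lib64 -lcudart
```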
The above commands create the a.out file, which can be run from the terminal as $ ./a.out. On our 64-bit system the CUDA runtime library directory is lib64; one needs to verify the correct library path for their own installation. To improve the serial performance, the gfortran -O3 compiler option can be added.