
The Fifth International Symposium on Computational Wind Engineering (CWE2010)

Chapel Hill, North Carolina, USA May 23-27, 2010

GPU Computing For Wind Engineering

R. Panneer Selvam (a), Kurt Landrus (b)


(a) BELL 4190, University of Arkansas, Fayetteville, AR 72701, USA, rps@uark.edu
(b) University of Arkansas, Fayetteville, AR 72701, USA, kal@uark.edu

ABSTRACT: Recent advances in Graphics Processing Unit (GPU) architecture offer tremendous potential for applications in computational wind engineering. GPU computing is more cost effective than traditional distributed parallel computing using MPI. With distributed parallel computing, a Preconditioned Conjugate Gradient (PCG) solver in MPI achieves a speed up of about 4-12 with an efficiency of 20 to 70% (Sarkar, 2008), and the computing cluster is relatively expensive. A GPU computational desktop can be assembled for about $1300 and can achieve a speed up of 10 to 20 relative to the CPU for solving wind engineering problems. Several solvers suitable for implementation on the GPU are surveyed and compared with serial CPU performance. Furthermore, methods to combine FORTRAN and C programs are illustrated for swift implementation of GPU computing for the portions of a FORTRAN code which need acceleration.

1 INTRODUCTION

Time dependent phenomena in wind engineering require high performance computing. Current parallel computing techniques for CPU platforms using MPI achieve an efficiency of 10 to 20%, while the cost of the system varies from $50K for 40 processors to close to a million dollars for 500 or more processors. Recent graphics cards costing around $300 and containing 240 processors can be used for parallel computing. A complete GPU computing system can be acquired for $1000 to $10,000 and can achieve speed ups in the range of 10 to 100. This is an economical and viable computing option available in a personal computer. Instead of achieving parallelism in the traditional distributed form, inexpensive shared memory computing can be realized using GPU computing.
GPU computing can be accomplished using software development toolkits such as CUDA (NVIDIA, 2008) or OpenCL (Khronos, 2008). CUDA, an extension to the C language, is currently a popular choice for programming GPUs; however, it is a closed, proprietary parallel programming environment supporting only NVIDIA GPUs. OpenCL (Open Computing Language) is an open standard for parallel computing on CPUs, GPUs, and hardware accelerators using the same source code. OpenCL addresses the limitations of a proprietary toolkit by being an open standard supported by multiple suppliers, and by supporting CPUs and hardware accelerators (CELL, DSPs) as compute devices along with GPU devices. With OpenCL, developers can use a single standard toolset and language to target all the currently available parallel processors, and code can be adapted more easily to future parallel computing processors.
CUDA was selected for the applications discussed herein because OpenCL is still new, with approved implementations only becoming available in late 2009. A major challenge in implementing CUDA code is selecting numerical algorithms which are suitable for GPU computing. In serial computing, preconditioned conjugate gradient (PCG) (Selvam, 1996) or multi-grid (MG) solvers are widely used. In GPU computing, the serial incomplete Cholesky decomposition preconditioner cannot be used. A systematic review of different numerical techniques that can be used to solve the momentum and pressure equations is presented, with performance comparisons against serial computing implementations. The other challenge at this time in using GPU computing is the C language requirement. Since most existing codes are in FORTRAN, converting them to C takes considerable time and effort. Therefore, for a quick increase in the performance of existing codes, a method to combine FORTRAN and GPU computing using C is illustrated with a simple example. The performance of the GPU computing solutions is compared with a complete serial FORTRAN code for a bridge flow problem.

In this work, the existing computer model (Selvam, 2002) used to study flow around bridges is considered as a benchmark application for GPU computing. The existing 2D model was modified for 3D flow and implemented in parallel using the CUDA toolkit. The computer model is based on the finite difference method (FDM), and the CUDA tools are used to compile the program for the GPU. In the solution of the Navier-Stokes (NS) equations, the pressure solution takes more than 80% of the computer time, so an efficient solver for the pressure equation on the GPU is very important. The performance of different solvers is compared and an efficient solver for GPU computing is proposed.

2 SOLVERS FOR GPU COMPUTING


2.1 Status of GPU Computing
Only limited work is available in the literature on GPU computing for wind engineering. Thibault and Senocak (2009) used a point Jacobi solver to compute flow in an urban environment. They used single, dual, and quad GPU configurations with shared and distributed parallel computing and demonstrated speed ups of 30 to 100 relative to a single CPU core. The Jacobi solver for the pressure equation is well suited to GPU computing, but its convergence rate is very slow (Barrett et al., 1994; Saad, 1996). Cohen and Molemaker (2009) used a multi-grid solver with a Gauss-Seidel red-black smoother for a general benchmark problem. They illustrated how to improve the multi-grid solver at different levels using GPU computing and showed maximum speed ups in the range of 5 to 8.5. Each of these projects used shared memory in their GPU code implementations. Shared memory is high-speed memory located on the GPU chip itself; its use increases performance by 10 to 20% but requires additional programming work. In this study only global memory, which is located on the GPU card but is external to the chip, is used.

2.2 GPU Computing


In programming for GPU computing, the data is copied from host memory to GPU (device) global memory. Any computation where the same type of operation is repeated for many points can benefit from GPU computing, as long as the data is independent. A high ratio of computation to memory access is preferred. On the CPU, floating point computation is expensive while memory access is cheap, so calculations are often saved in lookup tables in memory. On the GPU the reverse is true: there are so many processor cores that it is usually cheaper to redo a calculation each time than to incur memory access penalties. In addition, each point computation should be independent of the next. If the data has dependencies on previous calculations, the work becomes serial computing, as in the case of standard Gauss-Seidel (GS). With red-black GS, the computations within each color are independent, which makes the method highly suitable for GPU computing. Increasing the proportion of parallel operations on the GPU relative to the serial CPU portion of a code results in a greater increase in performance.
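
As an illustration of this data independence, a minimal CUDA sketch of one colored sweep of red-black GS for a two dimensional five-point Laplacian is given below. The kernel is launched twice per iteration, once for each color, and the kernel name, array layout and unit-coefficient discretization are illustrative assumptions rather than extracts from the code used in this study.

// Sketch: one colored sweep of red-black Gauss-Seidel for the 2-D
// five-point Laplacian (interior points only, Dirichlet boundaries).
// color = 0 updates the "red" points, color = 1 the "black" points.
__global__ void rbgs_sweep2d(double *u, const double *f,
                             int nx, int ny, double h2, int color)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  int j = blockIdx.y*blockDim.y + threadIdx.y;
  if (i < 1 || i >= nx-1 || j < 1 || j >= ny-1) return;
  if (((i + j) & 1) != color) return;   // update one color per launch
  int id = i + nx*j;                    // 1-D index, x varies fastest
  // every point of one color depends only on points of the other color,
  // so all threads in this sweep are independent of each other
  u[id] = 0.25*( u[id-1] + u[id+1] + u[id-nx] + u[id+nx] - h2*f[id] );
}

Over-relaxation (SOR) is obtained from the same sweep by blending the old value with the GS update using the relaxation factor rf.
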
There are two major issues to be considered in GPU computing when using global memory. The first issue is the way in which the coefficients are stored, since minimizing memory conflicts improves the code performance. In the case of the FDM, storing each coefficient (ap, ae, aw, etc.) contiguously over all points, either as separate vectors or consecutively in a single array, improves the performance. Storing the coefficients in CSR form (Saad, 1996), in which ap, ae, etc. are contiguous for each point, is not efficient on the GPU. In this work the coefficients ap, ae, aw, etc. are stored consecutively in a single vector. The second issue involves the architecture of the NVIDIA GPUs: performance is improved by choosing the number of threads in a block to be a multiple of 16 (a half-warp) in each direction. For the solution of Poisson's equation the unknowns are stored on a 128x128x128 mesh, so the dimensions are divisible by 16. Due to space considerations the performance differences arising from these issues are not discussed here; for details refer to the internal report (Selvam, 2010).
The CUDA GPU used here directly supports only two dimensional thread blocks, therefore the third dimension needs to be handled explicitly. The implementation is illustrated very well in Micikevicius (2009), and that procedure is used in this work for the z direction.
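
As a sketch of this approach, the kernel below assigns one thread to each (i, j) column using a two dimensional block and grid, and each thread marches through the k direction explicitly; it performs a Jacobi update of the seven-point Poisson stencil, and the names and layout are again illustrative assumptions only.

// Sketch: 2-D thread blocks cover the x-y plane; each thread loops over
// the z direction explicitly (one (i,j) column per thread).
__global__ void jacobi7pt(const double *u, double *unew, const double *f,
                          int nx, int ny, int nz, double h2)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  int j = blockIdx.y*blockDim.y + threadIdx.y;
  if (i < 1 || i >= nx-1 || j < 1 || j >= ny-1) return;
  int sxy = nx*ny;                      // distance between k-planes
  for (int k = 1; k < nz-1; k++) {
    int id = i + nx*j + sxy*k;
    unew[id] = ( u[id-1] + u[id+1] + u[id-nx] + u[id+nx]
               + u[id-sxy] + u[id+sxy] - h2*f[id] ) / 6.0;
  }
}

A launch of the form dim3 block(16,16); dim3 grid(nx/16, ny/16); jacobi7pt<<<grid, block>>>(...) keeps the block dimensions at multiples of the 16-thread half-warp discussed above, which is one reason the 128x128x128 benchmark mesh is convenient.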

2.3 GPU Platform Used


There are several different NVIDIA graphics cards available on the market at various price points. In this work a GTX275 graphics card with 1.8 GB of global memory was used, at a cost of approximately $300. The GTX275 GPU was installed in a custom-built computer system which included a quad core AMD Phenom II processor at 3 GHz and 8 GB of host SDRAM. The NVIDIA CUDA toolkit is supported on Windows, Mac OS X, and Linux operating systems; the operating system used here was Ubuntu 9.04 64-bit Linux. The total cost of the assembled system in August 2009 was $1300.
At this time CUDA development can be accomplished only using the C programming language. Since most of our existing codes are in FORTRAN, converting them to C and then modifying them for GPU computing takes considerable time. To save time, portions of the code can be written in C and linked with the FORTRAN code; the method for doing this is illustrated in Appendix A using a simple vector addition example. The drawback of this approach, as implemented here, is that the arrays A, X and B for the solution of AX=B need to be copied from CPU host memory to GPU global memory and back at every time step, and the GPU memory is also allocated and freed at every time step. This causes considerable overhead and adversely affects performance. Converting the entire program to C avoids this overhead.
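
One way to reduce this overhead while still driving the solver from FORTRAN is to allocate the device arrays once and reuse them at every time step, so that only the data transfers remain in the time loop. A minimal sketch of this arrangement is given below; the routine names and the flat array interface are hypothetical and are not part of the code described in this paper.

// Sketch: persistent device buffers reused across time steps (compiled
// with nvcc and called from FORTRAN as gpu_solver_init/step/finish).
#include <cuda.h>

static double *a_d = 0, *x_d = 0, *b_d = 0;   // device copies of A, X, B
static size_t nbytes = 0;

extern "C" void gpu_solver_init_(int *n)       // once, before the time loop
{
  nbytes = (size_t)(*n)*sizeof(double);
  cudaMalloc((void **)&a_d, nbytes);
  cudaMalloc((void **)&x_d, nbytes);
  cudaMalloc((void **)&b_d, nbytes);
}

extern "C" void gpu_solver_step_(double *a, double *x, double *b)
{
  // only the data that changed is copied in; no allocation per step
  cudaMemcpy(a_d, a, nbytes, cudaMemcpyHostToDevice);
  cudaMemcpy(b_d, b, nbytes, cudaMemcpyHostToDevice);
  cudaMemcpy(x_d, x, nbytes, cudaMemcpyHostToDevice);   // initial guess
  // ... launch the solver kernels on a_d, x_d, b_d here ...
  cudaMemcpy(x, x_d, nbytes, cudaMemcpyDeviceToHost);   // solution only
}

extern "C" void gpu_solver_finish_(void)       // once, after the time loop
{
  cudaFree(a_d); cudaFree(b_d); cudaFree(x_d);
}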

2.4 Comparison of Solvers for Poisson’s Equations

The Poisson’s equation of the form Uxx+Uyy+Uzz=6 where subscripts are differentiation is
solved in the unit square region. The actual solution is U=x2+y2+z2. The U value is specified all
around (Dirichlet boundary condition). The equations are iterated until the absolute sum of the
residue is less than 10-4 times the number of unknowns. The number of points in the x, y and z
directions are 128. The speedup is calculated as:
Speed up (SU) = cpu time(serial code)/gpu time (parallel code)
The computer time was calculated using the command $time ./a.out in the linux terminal. This
gives the total time taken by CPU and GPU together to perform the work including overhead of
allocating and copying data to/from the GPU global memory. The computation is done using
The Fifth International Symposium on Computational Wind Engineering (CWE2010)
Chapel Hill, North Carolina, USA May 23-27, 2010

double precision for comparison. At this time double precision calculation is less efficient on
current GPUs compared with single precision. The recent NIVIDA hardware is better than what
is reported. The method for compiling with support for CUDA double precision option is
explained in Appendix A.
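
For reference, the convergence test described above can be implemented as a per-point residual kernel followed by a sum over all unknowns; the sketch below evaluates the seven-point stencil residual and, for simplicity, leaves the summation to the host (a device-side reduction would be faster). The names and layout are illustrative assumptions.

// Sketch: pointwise absolute residual |f - Lap(u)| for the Poisson
// benchmark (interior points only, Dirichlet boundaries).
__global__ void residual7pt(const double *u, const double *f, double *r,
                            int nx, int ny, int nz, double h2)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  int j = blockIdx.y*blockDim.y + threadIdx.y;
  if (i < 1 || i >= nx-1 || j < 1 || j >= ny-1) return;
  int sxy = nx*ny;
  for (int k = 1; k < nz-1; k++) {
    int id = i + nx*j + sxy*k;
    double lap = ( u[id-1] + u[id+1] + u[id-nx] + u[id+nx]
                 + u[id-sxy] + u[id+sxy] - 6.0*u[id] ) / h2;
    r[id] = fabs(f[id] - lap);
  }
}
// Host side: copy r back, form sum|r| and stop the iteration when
// sum|r| < 1.0e-4 * (number of unknowns), as stated above.
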
To compare performance, red-black Gauss-Seidel (RB-GS), red-black Successive Over Relaxation (RB-SOR), multi-grid with GS smoothing (MG-GS), and Jacobi-preconditioned Conjugate Gradient (JCG) solvers are considered. These solvers can be implemented easily in both CPU and GPU computing. The incomplete Cholesky preconditioned CG (PCG) can only be run as serial code, and its time is taken as the baseline CPU time for comparison purposes. Line solvers are not suitable for GPU computing using normal programming methods, although special techniques can be used to implement them on the GPU, as discussed by Zhang et al. (2010). For MG-GS, a block size of 8x32 is used for the larger grids and 4x4 for the smaller grids. In the MG solver the number of intervals in each direction should be divisible by 2, which implies 129 grid points per direction. This does not match the preferred GPU block size and hence the performance is affected; techniques to address this issue in the code are being considered. Since the RB-GS solver updates only half the points in a plane during each sweep, its GPU performance is lower than that of Jacobi iterations. In Table 1 the speed up and the computer time for each solver are reported. The relaxation factor used in the GS solvers is reported as rf. The extensions .f and .c denote FORTRAN and C programs respectively, and the extension .cu denotes GPU CUDA programs.
From Table 1 the speed up from CPU to GPU varies from 8 to 24, with the maximum speed up achieved by the JCG solver. MG solvers are generally better than PCG solvers on the CPU if suitable iterative smoothers are used; those serial smoothers cannot be used on the GPU, and hence only MG-GS is attempted. Additionally, the MG solution moves from one grid level to another, which is not a preferred pattern in GPU computing. Hence, at this time, RB-SOR and JCG are the most competitive options. The drawback of the SOR solver is determining the optimal relaxation factor, so the engineer running the code may have to supply the rf value; JCG requires no such factor and therefore does not depend on user input. The results show that JCG is the preferable solver: it achieved a speed up of 24 compared with the serial JCG code on the CPU and an overall speed up of 7 relative to the CPU baseline PCG solver (a CPU-side sketch of the JCG iteration is given after Table 1 for reference). An overall speed up of 7 relative to CPU computing, obtained for the minimal cost of a graphics card, is a cost effective computing solution. On MPI distributed platforms with an efficiency of 30% per processor, more than 20 processors would be needed to achieve comparable performance. For computational problems that require little serial work, GPU computing is therefore a preferable technique. In these GPU codes only global memory access was utilized; further improvements of at least 10 to 20% can be achieved if shared memory optimizations are used (Nickolls et al., 2008).

Table 1. Comparison of Different Solvers
_____________________________________________________________________________
Prog. Name   Comp. Time   Speed up   Overall    Comments & other details
                                     Speedup
_____________________________________________________________________________
Pcg.f        15.6 s       -          -          ICCG (Selvam, 1996), niter=40, CPU
Jcg.f        53.5 s       -          -          JCG, niter=219, CPU
Jcg.cu       2.2 s        24.3       7.1        JCG, niter=219, GPU
Gs.c         371 s        -          -          rf=1.0, niter=660x7, RB-GS, CPU
SOR.c        36 s         -          -          rf=1.9, niter=63x7, RB-SOR, CPU
Gs.cu        36 s         10.3       0.4        rf=1.0, niter=660x7, RB-GS, GPU
SOR.cu       4.0 s        9          3.9        rf=1.9, niter=63x7, RB-SOR, GPU
Mg-gs.c      81 s         -          -          MG-GS, R-B, mgl=3, rf=1.0, niter=196x4, CPU
Mg-gs.c      46 s         -          -          MG-GS, R-B, mgl=4, rf=1.0, niter=196x4, CPU
Block size 32x8:
Mg-gs.cu     5.8 s        14         2.7        MG-GS, R-B, mgl=3, rf=1.0, niter=131x4, GPU
Mg-gs.cu     5.9 s        8          2.6        MG-GS, R-B, mgl=4, rf=1.0, niter=135x4, GPU
Block size 16x16:
Mg-gs.cu     6.6 s        12.3       2.4        MG-GS, R-B, mgl=3, rf=1.0, niter=131x4, GPU
Mg-gs.cu     6.7 s        12         2.4        MG-GS, R-B, mgl=4, rf=1.0, niter=135x4, GPU
_____________________________________________________________________________
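
For reference, the following is a compact CPU-side sketch of the Jacobi-preconditioned CG iteration on which the Jcg.f and Jcg.cu results above are based in spirit. It is written for a generic seven-point coefficient set (ap, ae, aw, an, as, at, ab) with the boundary links folded into the right-hand side, and every vector operation shown (matrix-vector product, dot products, axpy updates and the diagonal scaling) maps to one simple GPU kernel. The routine and array names are illustrative and are not taken from the codes used in this paper.

/* Sketch: Jacobi (diagonal) preconditioned conjugate gradient for a
   7-point FDM system A*u = b.  Each coefficient is stored as one vector
   over all points (structure of arrays), as described in Section 2.2.   */
#include <stdlib.h>
#include <math.h>

static double dot(const double *x, const double *y, int n)
{ double s = 0.0; for (int i = 0; i < n; i++) s += x[i]*y[i]; return s; }

/* q = A*p for the 7-point stencil; off-grid links are assumed zero. */
static void matvec(const double *ap, const double *ae, const double *aw,
                   const double *an, const double *as, const double *at,
                   const double *ab, const double *p, double *q,
                   int nx, int ny, int nz)
{
  int sxy = nx*ny;
  for (int id = 0; id < nx*ny*nz; id++) {
    double s = ap[id]*p[id];
    if (id%nx < nx-1)      s -= ae[id]*p[id+1];
    if (id%nx > 0)         s -= aw[id]*p[id-1];
    if ((id/nx)%ny < ny-1) s -= an[id]*p[id+nx];
    if ((id/nx)%ny > 0)    s -= as[id]*p[id-nx];
    if (id/sxy < nz-1)     s -= at[id]*p[id+sxy];
    if (id/sxy > 0)        s -= ab[id]*p[id-sxy];
    q[id] = s;
  }
}

/* JCG: no relaxation factor is needed; the preconditioner is M = diag(ap). */
int jcg(const double *ap, const double *ae, const double *aw,
        const double *an, const double *as, const double *at,
        const double *ab, double *u, const double *b,
        int nx, int ny, int nz, double tol, int maxit)
{
  int n = nx*ny*nz, it;
  double *r = malloc(n*sizeof(double)), *z = malloc(n*sizeof(double));
  double *p = malloc(n*sizeof(double)), *q = malloc(n*sizeof(double));
  matvec(ap, ae, aw, an, as, at, ab, u, q, nx, ny, nz);
  for (int i = 0; i < n; i++) {
    r[i] = b[i] - q[i]; z[i] = r[i]/ap[i]; p[i] = z[i];
  }
  double rho = dot(r, z, n);
  for (it = 0; it < maxit; it++) {
    double rsum = 0.0;
    for (int i = 0; i < n; i++) rsum += fabs(r[i]);
    if (rsum < tol) break;                         /* e.g. tol = 1e-4 * n  */
    matvec(ap, ae, aw, an, as, at, ab, p, q, nx, ny, nz);
    double alpha = rho/dot(p, q, n);
    for (int i = 0; i < n; i++) { u[i] += alpha*p[i]; r[i] -= alpha*q[i]; }
    for (int i = 0; i < n; i++) z[i] = r[i]/ap[i];  /* Jacobi preconditioner */
    double rho1 = dot(r, z, n);
    for (int i = 0; i < n; i++) p[i] = z[i] + (rho1/rho)*p[i];
    rho = rho1;
  }
  free(r); free(z); free(p); free(q);
  return it;                                        /* iterations performed */
}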

3 WIND ENGINEERING APPLICATION


3.1 Bridge Aerodynamics Application
At this time the existing 2D FDM code (Selvam, 2002) has been modified to solve 3D flow over bridge sections. The code solves the momentum equations using line solvers and the pressure equation using the PCG solver. Due to time constraints in converting the entire code into C, only the pressure equation solver was converted. The main FORTRAN code and the pressure solver in C are combined and compiled as illustrated in Appendix A. The challenge in this program is the use of periodic boundary conditions in the I and K directions, so the solvers developed in Section 2 cannot be used directly; at this time only the SOR and GS solvers could be implemented, and the other solvers will be implemented in future work. For the comparisons the grid used is 213x90x11. The pressure solver is iterated until the average error per point is reduced to 10^-4. For comparison purposes only one time step is considered at this time.
The serial code using the PCG solver took 5.8 s for 181 iterations, and the sequential JCG took 7.5 s for 717 iterations. The SOR solver with a relaxation factor of 1.99 took 12.6 s on the CPU and 4.7 s on the GPU. A speed up of 2.7 is therefore achieved compared with the serial SOR computation, and 1.2 compared with the PCG solver. Additional performance benefits can be achieved if the JCG solver is implemented on the GPU. At this time the grid dimensions were not adjusted to suit the GPU warp size of 32; doing so may improve the performance by an additional 30%.
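
A small sketch of how the periodic directions can be handled inside a GPU kernel is given below: the neighbour index is simply wrapped at the ends of the I and K directions before the stencil is applied. The helper name and the plain index-wrapping approach are illustrative assumptions only.

// Sketch: periodic neighbour indexing in the I and K directions, so the
// stencil kernels need no special boundary branches in those directions.
__device__ int wrap(int i, int n)
{
  if (i < 0)  return i + n;    // neighbour of the first plane
  if (i >= n) return i - n;    // neighbour of the last plane
  return i;
}
// Inside a stencil kernel the east/west (I) and top/bottom (K) neighbours
// of point (i, j, k) would then be addressed as, for example,
//   int ie = wrap(i+1, ni), iw = wrap(i-1, ni);
//   int kt = wrap(k+1, nk), kb = wrap(k-1, nk);
//   int idE = ie + ni*(j + nj*k);    // with the I index varying fastest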

4 CONCLUSIONS

GPU computing techniques for wind engineering applications were introduced. In the solution of the NS equations, the pressure equation takes more than 80% of the computer time, and solvers suitable for the pressure solution were compared using GPU implementations. The JCG solver is currently the most viable solver, with the next most suitable being red-black SOR. For the benchmark problem, JCG achieved a speed up of 24 relative to the serial CPU code and 7 relative to the baseline PCG solver on the CPU.
Since GPU computing code is written in the C language, it takes extensive time to convert FORTRAN codes to C. An alternative approach of combining FORTRAN and C for GPU computing is illustrated with examples. Additional performance may be achieved by using multiple GPU devices (up to 4) in a computer system; however, this also requires partitioning the solver work across the GPU devices. For large problems, combining GPU (shared memory) parallelism with MPI (distributed memory) parallelism is a viable alternative. At this time GPU computing is an inexpensive alternative to traditional distributed parallel computing.

5 ACKNOWLEDGEMENTS

We wish to acknowledge Professor Antti Hellsten of the Finnish Meteorological Institute for his comments, which helped improve the paper. We also wish to acknowledge the support provided by Mr. Kendell Connor, Computer Technician, Department of Civil Engineering, University of Arkansas, in purchasing the graphics hardware and assembling the GPU computing platform in a short time for a cost of $1300.

6 REFERENCES

Barrett, R. et al., 1994, Templates for the solution of linear systems, SIAM, Philadelphia. Also available on the web: http://www.netlib.org/templates/
Cohen, J.F. and Molemaker, M.J., 2009, A fast double precision code using CUDA, Proceedings of Parallel CFD 2009.
Khronos, 2008, OpenCL, http://www.khronos.org/opencl
Micikevicius, P., 2009, 3D finite difference computation on GPUs using CUDA, Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-2), Washington, D.C., Mar. 8, Vol. 383, ACM, New York, NY, pp. 79-84.
Nickolls, J. et al., 2008, Scalable parallel programming, ACM Queue, March/April, pp. 40-53.
NVIDIA, 2008, CUDA programming guide, Version 2.3, NVIDIA Corporation.
Saad, Y., 1996, Iterative methods for sparse linear systems, PWS Publishing Co., Boston.
Sarkar, 2008, 3-D multiphase flow modeling of spray cooling using parallel computing, Dissertation, University of Arkansas.
Selvam, R.P., 1996, Computation of flow around Texas Tech building using k-ε and Kato-Launder k-ε turbulence model, Engineering Structures, 18, pp. 856-860.
Selvam, R.P., 2002, Computer modeling for bridge aerodynamics, in: K. Kumar (Ed.), Wind Engineering, Phoenix Publishing House, New Delhi, India, pp. 11-25.
Selvam, R.P., 2010, Parallel computing using GPU for CFD applications, internal report, Department of Civil Engineering, University of Arkansas, Fayetteville, AR 72701, email: rps@uark.edu.
Thibault, J.C. and Senocak, I., 2009, CUDA implementation of a Navier-Stokes solver on multi-GPU desktop platforms for incompressible flows, AIAA-2009-758, 47th AIAA Aerospace Sciences Meeting, Jan. 5-8, Orlando, Florida, pp. 1-15.
Zhang, Y., Cohen, J. and Owens, J.D., 2010, Fast tridiagonal solvers on the GPU, PPoPP'10, January 9-14, 2010, Bangalore, India.

7 APPENDIX A: FORTRAN + GPU COMPUTING

Usually the pressure solution part of the Navier-Stokes equations takes 80% of the computing time at each time step; hence this portion is a suitable candidate for acceleration using GPU computing. In the FORTRAN code the GPU routine is called in the usual manner as a subroutine. The points to be noted are: 1. In FORTRAN the variables are used as two or three dimensional arrays, whereas in C they are one dimensional, so careful attention should be paid to the mapping of indices from one layout to the other. 2. In FORTRAN array indices range from 1 to N, whereas in C they range from 0 to N-1 for an N-dimensioned variable. In the example the working index range is 1 to N in both the C and the FORTRAN code to reduce this conflict.
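
As a sketch of point 1, a FORTRAN array dimensioned A(NI,NJ,NK) is stored in column-major order, so inside the C/CUDA code the element A(i,j,k) can be reached through a one-based index macro such as the following (the macro name is illustrative):

/* Sketch: one-based, column-major index of the FORTRAN element A(i,j,k)
   in the flat array passed from FORTRAN (the first index varies fastest). */
#define IDX(i, j, k, ni, nj)  ( ((i)-1) + (ni)*((j)-1) + (ni)*(nj)*((k)-1) )

/* usage with a pointer double *a received from FORTRAN:
     double app = a[IDX(i, j, k, ni, nj)];                                  */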

7.1 FORTRAN Code

To illustrate the calling of a C CUDA GPU subroutine from FORTRAN, a simple application of adding two vectors A and B is provided. The program is called fadd.f. Double precision is the preferred computational type for large systems and is used here. The source code can be compiled with the gfortran or f95 compilers on Linux systems. The subroutine kernel_add executes C(i) = A(i) + B(i) on the GPU. The fadd.f source is listed as follows:

C     PROGRAM fadd.f, Feb. 12, 2010
C     R. Panneer Selvam, email: rps@uark.edu
C     ph: 479-575-5356
C     Illustration of calling GPU-CUDA through FORTRAN
C     Perform c(i)=a(i)+b(i) using CUDA-GPU
      PARAMETER(N=8)
      IMPLICIT REAL*8 (A-H,O-Z)
      DIMENSION A(0:N+1),B(0:N+1),C(0:N+1)
      DO I=1,N
        A(I)=I
        B(I)=3.
      END DO
C.....CALL ADD
      call kernel_add(a,b,c,n)
      PRINT *,'C=A+B',(C(I),I=1,N)
      STOP
      END

7.2 C CUDA Code

The C CUDA GPU program is developed in the normal way except that, instead of a main function, the entry point is a function named kernel_add_ as shown in the listing (the trailing underscore matches the external symbol name generated by gfortran). To allow switching between double and single precision floating point variables, the REAL type is defined using #define, so the precision can be changed by editing one line of the program; in this example REAL is defined as double. The program is named cuadd.cu; the extension for CUDA source code is .cu.

//prog. cuadd.cu, Feb. 12, 2010
//R. Panneer Selvam, email: rps@uark.edu, ph: 479-575-5356
//Perform: c=a+b for fadd.f
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
//#define REAL float
#define REAL double
// simple kernel function that adds two vectors
__global__ void vect_add(REAL *a, REAL *b, REAL *c, int n)
{
  int idx = threadIdx.x + 1;              // 1-based index to match FORTRAN
  if (idx <= n) c[idx] = a[idx] + b[idx];
}//end add
// function called from the main FORTRAN program
extern "C" void kernel_add_(REAL *a, REAL *b, REAL *c, int *nd)
{
  int n = *nd, itot = n + 2, size;
  REAL *a_d, *b_d, *c_d;                  // declare GPU vectors
  int NBX = 1, N = n;                     // uses 1 block of N threads on the GPU
  // allocate memory on the GPU
  size = itot*sizeof(REAL);
  cudaMalloc((void **)&a_d, size);
  cudaMalloc((void **)&b_d, size);
  cudaMalloc((void **)&c_d, size);
  // copy vectors from CPU to GPU
  cudaMemcpy(a_d, a, size, cudaMemcpyHostToDevice);
  cudaMemcpy(b_d, b, size, cudaMemcpyHostToDevice);
  // call the kernel on the GPU
  vect_add<<< NBX, N >>>(a_d, b_d, c_d, n);
  // copy the result back from GPU to CPU
  cudaMemcpy(c, c_d, size, cudaMemcpyDeviceToHost);
  // free GPU memory
  cudaFree(a_d);
  cudaFree(b_d);
  cudaFree(c_d);
  return;
}

7.3 Compilation Details

First cuadd.cu is compiled to create the object file cuadd.o using the CUDA nvcc compiler driver. The double precision option is invoked in CUDA using the compiler flag -arch=sm_13.

$nvcc -c -arch=sm_13 cuadd.cu

Once cuadd.o is created, the FORTRAN code is compiled and linked with the CUDA object file using the following command:

$gfortran -L /usr/local/cuda/lib64 -I /usr/local/cuda/include -lcudart -lcuda fadd.f cuadd.o

The above command creates the a.out file, which can be run from the terminal as $ ./a.out. On our system (which is 64 bit) the CUDA library directory is lib64; make sure the correct library directory is used for your system. To improve the serial performance, the gfortran -O3 compiler option can be added.
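
For reference, with A(I)=I and B(I)=3 set in fadd.f, the run should print C(I)=I+3 for I=1 to 8. The output looks roughly as follows; the exact spacing and number of digits depend on the compiler's list-directed output:

$ ./a.out
 C=A+B   4.0000000000000000   5.0000000000000000   6.0000000000000000
   7.0000000000000000   8.0000000000000000   9.0000000000000000
   10.000000000000000   11.000000000000000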
