Seismic imaging using GPGPU accelerated Reverse Time Migration
CS315A Final Project Report
Nader W. Moussa (nwmoussa@stanford.edu)
CS 315A Parallel Computer Architecture and Programming, Winter 2009

Abstract: In this report, I outline the implementation and preliminary benchmarking of a parallelized program that performs reverse time migration (RTM) seismic imaging using the Nvidia CUDA platform for scientific computing, accelerated by a general purpose graphics processing unit (GPGPU). This novel software architecture provides access to the massively parallel computational capabilities of a high performance GPU system, which is leveraged for its high numerical throughput.

Index Terms: GPGPU, reverse time migration, seismic imaging, parallel computing, FDTD wave propagation

I. INTRODUCTION
This report summarizes the preliminary technical results of the parallel implementation of Reverse Time Migration. This work was completed in conjunction with the Stanford Exploration Project (SEP) and in fulfillment of the requirements for the final project in CS315A (Parallel Computer Architecture and Programming).
A. Algorithm Overview
Reverse Time Migration (RTM) is a technique for generating images of subsurface geological structures. This algorithm is prevalent in geophysical data processing, as it has preferable numerical and physical properties compared to competing algorithms, and thus generates better images [1].
Seismic imaging is used in many industrial and scientific applications to generate a 2D or 3D representation of subsurface topography and geological structure. In the petroleum industry, reflection seismography enables informed decision-making for resource extraction; RTM has developed a reputation over the last few years for increasing the quality of computed image results, but at greater computational cost than alternative techniques such as One-Way Wave Propagation. However, RTM has great potential for parallelism at both coarse and fine granularity due to its numerical structure.
Briefly summarized, seismic data is acquired by generating a shot, an impulse-like source which creates acoustic and elastic waves that travel into the earth. A series of receivers is located elsewhere in the survey area and each sensor records the received acoustic waveform. These recordings contain information about the material that the wave has passed through or reflected from. Seismic imaging algorithms such as RTM perform the computational task of combining the received sensor data and inverting them to determine the subsurface topography.


Figure 1 Seismic imaging, showing the shot at the surface (left),
and modeling the forward wave propagation as it reflects off a
source to arrive at a receiver at the surface (right). At times t
s

and t
r
where the forward-modeled and reverse-modeled
wavefields cross paths (correlate), the imaging condition
indicates the presence of a subsurface reflector. (Image courtesy
Zhang and Sun, Practical Issues in RTM, [1]).
RTM models two wavefields: a forward wavefield p_F produced by the seismic shot, and a reverse wavefield p_R estimated from the recordings of received data at each sensor. The reverse wavefield is modeled as traveling backwards (reverse-time) and its numerical wave operator is adjusted accordingly. To generate an image, an imaging condition is applied. The most straightforward implementation is a direct cross-correlation between p_F and p_R. Together, the forward wavefield computation, reverse wavefield computation, and imaging condition calculation comprise the entire image generation process, or migration.
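As a concrete illustration of this step (a sketch only, with illustrative names, not the project's imaging code), a zero-lag cross-correlation imaging condition can be written as a simple CUDA kernel that is launched once per time step and accumulates the pointwise product of the two wavefields into the image:

// Illustrative zero-lag cross-correlation imaging condition (sketch, not the project kernel).
// Called once per time step; g_image accumulates the sum over t of pF(x,t) * pR(x,t).
__global__ void imaging_condition_kernel(const float* g_pF,   // forward wavefield at this time step
                                         const float* g_pR,   // reverse wavefield at this time step
                                         float* g_image,      // accumulated image
                                         int npoints)         // total number of grid points
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < npoints)
        g_image[i] += g_pF[i] * g_pR[i];   // pointwise correlate and accumulate
}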
In addition to numerical accuracy, the RTM algorithm provides several geophysical imaging advantages. Correct implementation of the algorithm inherently accounts for inhomogeneous, complex velocity behaviors of subsurface layers, with no additional code complexity. This enables the algorithm to correctly model the true amplitudes for many wave behaviors, including diffractions, refractions, multipath wave propagation, and evanescent waves. Altogether, the result is a
better image of the subsurface geography with fewer
computationally-induced artifacts.
Several tiers of computational parallelism are available
for a large seismic survey. At the highest level of abstraction,
a data-set may be very large, and can be broken into spatially
separate regions of largely independent RTM imaging
processes. This is a Single Program, Multiple Data (SPMD)
approach, and is commonly used without any relevant data
communication between different programs.
On each RTM migration, the three stages have potential for parallelism, but there is a severe data dependency limitation. The imaging condition requires both of the computed wavefields p_F and p_R for each time step. Unfortunately, because the two are computed in opposite time directions, this usually involves computing the complete wavefield p_F, writing it to disk, and reading its precomputed values for imaging condition correlation as soon as that time-step is available from the reverse-time wavefield.
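Under the disk-staging scheme just described, the overall control flow looks roughly like the following sketch; the function names are placeholders, not the project's actual routines.

// Hypothetical driver loop illustrating the forward/reverse data dependency.
// The five functions below stand in for the corresponding project routines.
void step_forward(int t);  void write_snapshot(int t);
void step_reverse(int t);  void read_snapshot(int t);  void correlate(int t);

void migrate_one_shot(int nt)
{
    // Pass 1: propagate the source wavefield pF forward in time, saving each step to disk.
    for (int t = 0; t < nt; t++) {
        step_forward(t);     // advance pF to time step t
        write_snapshot(t);   // stage pF(t) to disk for later use
    }
    // Pass 2: propagate the receiver wavefield pR backwards in time; as each time step
    // of pR becomes available, reload the matching pF snapshot and apply the imaging condition.
    for (int t = nt - 1; t >= 0; t--) {
        step_reverse(t);     // advance pR (reverse time) to time step t
        read_snapshot(t);    // reload the precomputed pF(t)
        correlate(t);        // image += pF(t) * pR(t)
    }
}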
At the finest level of parallelism, the individual wavefield
propagation steps can take advantage of vectorization,
floating-point math optimizations, and numerical
reorganization, to reduce computational load. The imaging
condition can also benefit from parallelization, because it is
essentially a large 2D or 3D correlation. This is easily
vectorizable, if a sufficiently large vector processor exists.
This multi-tiered parallelism inspires the implementation used throughout this research project.

B. CUDA Programming Methodology
Nvidia's Compute Unified Device Architecture (CUDA) is a novel software interface and compiler technology for general purpose GPU programming [3]. The CUDA technology includes a software interface, utility toolkit, and compiler suite designed to give the programmer access to the massively parallel capabilities of the modern GPU without requiring logical operations to be constructed as graphical instructions. CUDA's latest release, version 2.1, exposes certain features that are currently available only in the Tesla T10 GPU series. Below, all specifications are given based on the T10 GPU capabilities with CUDA 2.1 software.
CUDA programs involve two parts: host code, which runs on the main computer's CPU(s); and device code, which is compiled and linked with the Nvidia driver to run on the GPU device. Most device code is a kernel, the basic functional design block for parallelized device code. Kernels are prepared and dispatched by host code. When dispatched, the host specifies parallelism parameters, and the kernel is assigned to independent threads which are mapped to device hardware for parallel execution.
The coarsest kernel parallelism is the block, which contains several copies of threads running the same code. Each block structure maps to a hardware multiprocessor. Blocks subdivide a large problem into manageable units which execute independently; notably, inter-block synchronization and communication is difficult without using expensive global memory space or coarse barriers. Inside each block are up to 512 threads, organized into sub-groups called warps of up to 32 threads. At this level of parallelism, shared memory and thread synchronization are very cheap, and specific hardware instructions exist for thread synchronization. As of Compute Capability 1.3, available on the Tesla T10, synchronization voting can be used to enable single-cycle inter-thread control. As an extra performance boost, threads which are running the same instructions are optimized with Single Instruction, Multiple Thread (SIMT) hardware, sharing the instruction fetch and decode logic and efficiently pipelining operations. If conditional program control flow requires different instructions, the threads must then serialize some of these pipeline stages; peak performance is achieved when all conditional control-flow is identical for threads in a single warp. In the case of FDTD wave propagation code, this is generally applicable and all threads operate in SIMT mode.
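As a concrete illustration of these concepts (the kernel and the launch sizes below are illustrative examples, not taken from the project code), a kernel is defined and dispatched with explicit block and thread counts, and performs best when all threads in a warp take the same branch:

// Illustrative kernel definition and launch; sizes are examples only.
__global__ void scale_kernel(float* data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)                                       // same branch outcome for nearly every
        data[i] *= alpha;                            // thread in a warp, so little divergence
}

void launch_example(float* d_data, int n)
{
    dim3 threads(256, 1, 1);                  // threads per block (hardware maximum is 512)
    dim3 grid((n + 255) / 256, 1, 1);         // enough blocks to cover all n elements
    scale_kernel<<<grid, threads>>>(d_data, 1.5f, n);
}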

C. Hardware Platform
For the purposes of this paper, testing was performed on an HP ProLiant server with an attached Nvidia Tesla S1070 rack-mounted GPGPU unit. This unique platform implements the CUDA 2.1 software specification, with hardware Compute Capability 1.3 GPU acceleration [3]. The S1070 provides four Tesla T10 GPUs, which perform vector-style parallelism for general purpose computing.
The basic architecture consists of a Host System, using regular CPUs and running a standard Linux operating system. Attached is the Device, a 1U rack-mounted GPGPU accelerator which provides the parallelism discussed in earlier sections. The CUDA technology uses this terminology, host and device, to refer to the various hardware and software abstractions that apply to either the CPU or GPU systems. Although the S1070 has four GPUs, it is considered one device; similarly, there is one host, although it has 8 CPU cores in this system.
Below is a summary of the system specifications:

Host: HP ProLiant DL360 G5 [5], [6]
2x Quad-Core Intel Xeon E5430 @ 2.66 GHz
6144 KB L2 Cache (per core)
L2 Cache block size: 64 B
32 GB main memory
1333 MHz Front Side Bus
PCI-e #1: 8 lanes @ 250 MB/s each
PCI-e #2: 8 lanes @ 250 MB/s each
Gigabit Ethernet connection to SEP Intranet

Device: Nvidia Tesla S1070 Computing System [7]
4x T10 GPU @ 1.44 GHz
30 Streaming Multiprocessors (SM) per GPU
8 Scalar Processor cores per SM
32 threads per warp
16K 32-bit registers per block
16 KB shared memory per block
4 GB addressable main memory per GPU
Memory controller interconnect for data transfer to host and other GPU address spaces
Concurrent Copy & Execution feature: Direct Memory Access (DMA) style asynchronous transfer available on T10 GPUs
Programmable in CUDA
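These specifications can be confirmed at runtime with the standard CUDA device-query calls; a minimal sketch (printing only a subset of the available fields) is shown below.

// Minimal device-query sketch to confirm the specifications listed above.
#include <stdio.h>
#include <cuda_runtime.h>

void print_devices(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);   // number of visible GPUs (4 when both host interconnects are connected)
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %d SMs, %.2f GHz, %lu MB global memory, CC %d.%d\n",
               i, prop.name, prop.multiProcessorCount, prop.clockRate / 1.0e6,
               (unsigned long)(prop.totalGlobalMem >> 20), prop.major, prop.minor);
    }
}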

Figure 2 Schematic representation of Host-Device (CPU-GPU) interconnection and memory structure.

The S1070 is a very recently released commercial platform specifically tailored for high-performance scientific computing. It is supported only on a few specific platforms and environments. The following section details the practical implications for system and driver configuration.
During early work, I configured the ProLiant system to run Ubuntu 8.10 and Nvidia 180.22 drivers. This required recompiling Nvidia Debian kernel modules (.ko files). This successfully connected to the S1070 system but incorrectly identified it as an Nvidia C1060. Downgrading the operating system to Ubuntu 8.04 correctly connected and recognized the Nvidia S1070 but resulted in an unstable system which occasionally crashed. Per advice from Nvidia, I switched the ProLiant operating system to CENTOS, and had significantly more success. However, the Nvidia 180.22 drivers have not been fully tested on the 1U rack-mount S1070 systems with four GPUs, resulting in several system hang-ups and unexpected, non-repeatable crashes. It should be noted that the GPGPU driver for the 1U Tesla system confuses some elements of the Linux operating system (notably, automatic configurations for X11), as it appears to be a video accelerator and display driver although it cannot be connected to a physical display monitor.
Another potential configuration issue is the presence of two Host Interconnects on the S1070 1U unit. The Nvidia documentation mentions that these allow the S1070 to optionally connect to two separate host CPU systems. However, even though only one host is used in our system, both interconnects should be used anyway. Connecting and configuring only one card results in access to only 2 out of the 4 available Tesla T10 GPUs on the S1070 1U server. Though this results in a functional system, it limits the coarse-grain parallelism discussed earlier, which leverages multiple GPUs working on independent tasks. Using both interconnects also doubles the PCI-e bandwidth available to the S1070 memory controller.

A successful, stable system was eventually set up with:

CENTOS 5.2 for x86_64
Nvidia Tesla Driver (Linux x86_64) 177.70.11
Both Host Interconnect Cards (HIC) installed in the ProLiant, with two cables connected to the S1070 unit

Figure 3 Rack mount view of the SEP Tesla system. At top is the S1070 1U 4xGPU GPGPU Computing Appliance. Below is the HP ProLiant Xeon 64-bit, 8-core (2xSMP, 4xCMP) system, tesla0.stanford.edu, which runs the host operating system.
D. Evaluation Metrics
There are many ways to compare and evaluate parallelization schemes for RTM. Because the GPGPU approach is so novel, it is difficult to perform direct comparison with other parallelization schemes of Reverse Time Migration. Other hardware platforms do not provide the same software abstractions, so this excludes many metrics from side-by-side comparison.
Of course, certain performance metrics are directly comparable to serial or parallel CPU RTM implementations, such as:
Total Execution Time
Cost ($) per FLOPS
FLOPS per Watt

Other internal performance metrics of my implementation can be compared to academic and industrial research progress in high-performance GPGPU wave propagation. Wave propagation has been previously implemented in Finite Difference Time Domain (FDTD) form for nearly identical hardware [2], so the forward- and reverse-wavefield computation performance can be directly compared to such an implementation. FDTD performance measurements include:
Maximum Computational Grid Size
Block Subdivision Size
Wavefield Grid Points Per Second
Numerical Order of Spatial Derivatives

One goal of SEP's investigations into various parallelization technologies is to subjectively evaluate the feasibility for future performance, ease of development, and maintainability of code. Technologies like the CUDA/GPGPU approach are compared subjectively to other systems, such as the SiCortex SC072 Desktop Cluster, as well as conventional multicore and multi-node CPU parallelization. The following metrics can be roughly estimated for each technology, noting that there is some ambiguity in direct comparisons across widely varying exotic architectures:

Cost ($) per FLOPS
FLOPS per Watt
FLOP operations needed for complete migration
Execution Time for complete migration
Image Quality (subjective geophysical analysis)
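Several of the metrics above, such as total execution time and wavefield grid points per second, can be measured directly with CUDA event timers. The sketch below shows one way this could be done; the kernel arguments and the NX/NY sizes are illustrative placeholders rather than the project's actual benchmarking harness.

// Hypothetical timing harness (illustrative; not the project's benchmark code).
float time_one_step(dim3 grid, dim3 threads,
                    float* d_wv0, float* d_wv1, float* d_wv2, float* d_vel,
                    dim3 wvfield_sz, int NX, int NY)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                               // begin timed region
    waveprop_kernel<<<grid, threads>>>(d_wv0, d_wv1, d_wv2, d_vel, wvfield_sz);
    cudaEventRecord(stop, 0);                                // end timed region
    cudaEventSynchronize(stop);                              // wait for the kernel to finish

    float elapsed_ms = 0.0f;
    cudaEventElapsedTime(&elapsed_ms, start, stop);

    // Grid points per second for one propagation step of an NX x NY wavefield.
    double pts_per_s = (double)NX * NY / (elapsed_ms / 1000.0);
    printf("Elapsed: %.3f ms, %.3e wavefield grid points/s\n", elapsed_ms, pts_per_s);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return elapsed_ms;
}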

As I am not trained as an interpretational geologist,
subjective assessment of the image quality is difficult, but it is
widely established in industrial contexts that the correct
implementation of RTM yields better images
making and analysis. Certain computational
enhance this effect by enabling higher-accuracy RTM, (e.g.
using higher-order wavefield operators). By
cheap floating-point math, the GPGPU approach
operations per each data point, allowing more accurate wave
modeling without execution time overhead. The result is a
subjectively better migrated image. Quantitative metrics to
describe the degree of image focus do exist bu
seem to be in widespread use.
Finally, it is worth noting the benefits of GPGPU parallelization from a software engineering and code-maintenance standpoint. CUDA is designed to be simple, existing as a set of extensions to standard C programming. As such, the system is easy to learn for most programmers; the code is systematically separated into host setup code and device parallelization code; and CUDA can interoperate with C or C++, allowing functional- or object-oriented system design as the situation requires.

II. IMPLEMENTATION
Note: A version of my CUDA kernel, implementing the 2nd-order 2D wave-solver, has been included as an appendix.
I developed a wave propagation kernel, implemented in CUDA, for use in forward- and reverse-time wave propagation. I also implemented a simple correlation imaging condition. Due to time constraints, I was not able to implement a more advanced imaging condition with true-amplitude correction, noise-removal, and angle-gather gain compensation.
For the purposes of this report, I will focus on single-GPU kernels. Significant progress was made towards multi-GPU asynchronous parallelization, but this code is not yet ready to present benchmark results. The eventual goal is to perform the forward, reverse, and imaging subroutines on independent GPUs. However, preliminary benchmark results cast doubt on whether that approach will decrease total execution time, because the bottleneck appears to be host-device transfer time rather than computational limitations.
Implementation of an eighth-order spatial derivative added negligible computational overhead to the problem, compared to the naive second-order wave operator. This suggests that other, more sophisticated time-stepping methods, such as Arbitrary Difference Precise Integration (ADPI) wave solvers [4], may also have negligible overhead. Such methods will enable coarser time-steps without the numerical stability limitations that are inherent to FDTD approaches. This may reduce overall execution time.
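For reference, a common way to build such a higher-order operator uses the standard 8th-order central-difference coefficients for the second derivative. The device function below is a sketch of that stencil along one axis; it is illustrative of the approach rather than a copy of the project kernel, and it reuses the row-major IDX_ indexing macro from Appendix IV.C. Because each additional pair of stencil taps costs only a few extra multiply-adds per grid point, the arithmetic cost grows slowly, which is consistent with the negligible overhead observed here.

// Illustrative 8th-order second-derivative stencil (one axis); not the project kernel.
__device__ float d2_dx2_8th(const float* f, int y, int x, float inv_dx2)
{
    // Standard central-difference coefficients for d2/dx2, 8th-order accurate.
    const float c0 = -205.0f/72.0f, c1 = 8.0f/5.0f, c2 = -1.0f/5.0f,
                c3 = 8.0f/315.0f,   c4 = -1.0f/560.0f;
    float sum = c0 * f IDX_(y, x)
              + c1 * (f IDX_(y, x-1) + f IDX_(y, x+1))
              + c2 * (f IDX_(y, x-2) + f IDX_(y, x+2))
              + c3 * (f IDX_(y, x-3) + f IDX_(y, x+3))
              + c4 * (f IDX_(y, x-4) + f IDX_(y, x+4));
    return sum * inv_dx2;   // scale by 1/dx^2
}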
The expansion of the solver to a full 3D model space will require significant extra programming. The code base for the 2D model is intended to be extensible, and the CUDA framework allows block indexing to subdivide a computational space into 3 dimensions, assigning an (X,Y,Z) coordinate to each block and each thread. Due to time constraints, full 3D modeling was not completed or benchmarked.
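As a sketch of how that 3D indexing could look (illustrative only, since the current code is 2D), each thread can own one (x,z) column and loop over depth, mirroring the column-per-thread design of the existing 2D kernel:

// Illustrative 3D extension sketch (not implemented in the current code).
// A 2D grid of 2D blocks covers the (x,z) plane; each thread loops over depth y.
__global__ void waveprop3d_kernel(float* g_wvfield, dim3 sz /* ...other fields... */)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;   // horizontal grid point
    unsigned int z = blockIdx.y * blockDim.y + threadIdx.y;   // crossline grid point
    if (x >= sz.x || z >= sz.z) return;                       // guard partial blocks
    for (unsigned int y = 0; y < sz.y; y++) {
        // ... apply the 3D wave operator at grid point (x, y, z) ...
    }
}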
The host code does not perform significant parallelism. Earlier efforts involved pthread parallelism on the host CPUs for data preprocessing while the input loaded from disk, but this workload parallelism was negligible compared to the overall execution time.
The result of my implementation is a propagation program, waveprop0, and an imaging program, mgcorr0, written in CUDA. These are piped together with a set of Unix shell scripts to manage the overall RTM sequence for forward and reverse-time propagation with imaging condition.
Future implementations will seek to integrate these programs into one set, but the overlying data-dependence issue remains to be solved theoretically before the processes can be entirely converted to a streaming methodology. For trivial-sized problems, the entire computational result of one-way wave propagation can remain in graphics device memory for use, but this approach has inherent problem-size limitations, so it was not heavily pursued.

III. PERFORMANCE AND BENCHMARK SUMMARY
For the sake of simplicity and consistency, I tested my RTM code on synthetic data. I used a simple subsurface velocity model with a few reflecting layers and one point refractor. This same velocity model has been used by other SEP students and researchers, and although it does not represent the complex subsurface behavior of a real earth model, it provides sufficient complexity to evaluate the correct functionality of the RTM implementation. Further work will apply my RTM code on real data sets and compare against more traditional (serial) RTM implementations.
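For illustration (the actual SEP test model is not reproduced here), a comparable layered velocity model can be generated with a few lines of host code; the layer velocities below are arbitrary example values:

// Hypothetical layered velocity model generator; the real test model comes from SEP.
#include <stdlib.h>

float* make_layered_velmodel(int nx, int ny)
{
    float* vel = (float*) malloc((size_t)nx * ny * sizeof(float));
    for (int y = 0; y < ny; y++) {
        // Three flat layers: velocity increases with depth, so each
        // interface produces a reflection in the synthetic data.
        float v = (y < ny/3) ? 1500.0f : (y < 2*ny/3) ? 2500.0f : 3500.0f;
        for (int x = 0; x < nx; x++)
            vel[y*nx + x] = v;   // row-major, matching the kernel's indexing
    }
    return vel;
}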
Figure 4 Multi-GPU approach, performing separate subroutines of the overall RTM algorithm on separate GPUs. This will enable coarser parallelism and may help hide memory transfer time. The current benchmarking code does not yet implement this methodology.
Unless otherwise noted, these benchmark results used a 1,000 x 1,000 wavefield space, modeled in 2D. The values obtained for these benchmarks used internal wave propagation kernel version (/svn/work/sep/cuda, r1029). Note that this does not necessarily match the kernel implementations shown in Appendix IV.C.


Figure 5 Imaging condition applied, revealing subsurface
reflectors. The wave diffraction off of the sharp corners is a
result of the abrupt end to the layers, which would not occur in
real (non-synthetic) subsurface data.

GPU execution time is shown for a 1,000,000 point grid (1000x1000 2D computational space), and is compared to a serial implementation of RTM on the CPU. Due to time constraints, I was not able to compare the GPGPU parallelization to other parallel RTM versions.

Evidently, the GPU parallelization has a dramatic effect on the total execution time, reducing it by a factor of more than 10x. With 240x as many cores, however, this is far below a linear speedup. Closer inspection of the CUDA execution time indicates the following behavior:


Figure 7 Breakdown of program execution time for the CUDA
implementation. Very little time is spent executing numerical
processing code, with the vast majority of time spent in host-
device memory transfer over the PCI-e bus (between the
ProLiant CPU system and the Nvidia Tesla S1070).

Evidently, memory transfer between host and device is
the severely limiting factor. The overwhelming majority of
the time, 91%, is spent transferring data over the PCI-e bus to
the GPU. This suggests the first place to begin optimization
(and at the same time, it suggests that next-generation product
development efforts should focus on improving bus
bandwidth). This result was very unexpected; while memory
stalls are necessarily part of any computing architecture, a
90% overhead is highly undesirable. Effort to amortize this cost or hide the latency with pipelining can yield very little return, as the total computational execution time for the wave propagation and correlations is under 2 seconds. Significant investigation should be devoted to asynchronous memory transfers, which are available on the T10 GPU.
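A minimal sketch of how such a transfer could be overlapped with kernel execution using CUDA streams is shown below; the buffer names and sizes are placeholders, and page-locked (pinned) host memory is required for the copy to actually proceed asynchronously.

// Illustrative overlap of transfer and compute using CUDA streams (not in the current code).
// h_pinned must come from cudaMallocHost (page-locked) for the copy to be truly asynchronous.
void overlap_step(float* h_pinned, float* d_next,
                  float* d_wv0, float* d_wv1, float* d_wv2, float* d_vel,
                  dim3 grid, dim3 threads, dim3 wvfield_sz, size_t nbytes)
{
    cudaStream_t copy_stream, exec_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&exec_stream);

    // Stage the next chunk of data while the current time-step is being computed.
    cudaMemcpyAsync(d_next, h_pinned, nbytes, cudaMemcpyHostToDevice, copy_stream);
    waveprop_kernel<<<grid, threads, 0, exec_stream>>>(d_wv0, d_wv1, d_wv2, d_vel, wvfield_sz);

    cudaStreamSynchronize(copy_stream);   // next chunk is now resident on the device
    cudaStreamSynchronize(exec_stream);   // current time-step has finished

    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(exec_stream);
}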
An optimistic interpretation of this result is that the coarse-grain, multi-GPU approach may increase available bus bandwidth (see the discussion of PCI-e bus parallelism in Section I.C). Two opportunities arise: first, parallelizing the PCI-e transfer, followed by an on-device asynchronous memory copy over the onboard memory controller, should immediately work around the PCI-e bottleneck. Secondly, data dependencies may be sparser than the current implementation assumes, and it may be possible to perform a significant amount of kernel computation on independent GPUs without sharing data over the device memory controller (as explored in Figure 4).
Figure 6 Total execution time for RTM imaging (seconds), comparing the serial implementation on the CPU (executed on the ProLiant Xeon host) against the single-GPU (240-core) CUDA parallelization.

Figure 7 data (CUDA execution time breakdown): Forward Wave Propagation 67 ms (~0%); Reverse Wave Propagation 67 ms (~0%); Cross Correlation 0.230 s (1%); Disk Access (Host) 2.183 s (8%); Memory Transfer 26.869 s (91%).
IV. CONCLUSION
The dramatic speedup of the computational kernel
provides strong motivation for continued work in GPGPU
parallelism. Benchmark results suggest that the most
important area to tackle is Host-Device (PCI-e) bus
bandwidth, which accounts for 90% of the total system
utilization time.
At present, the implementation does not have any tasks
for the high-performing Xeon processors on the host. These
are suitable for performing a lot of useful work, so it is
possible that additional post-processing could be performed on
them.
Another suggested research area is the implementation of
compression during transfer. Velocity models, which contain
large quantities of redundant data, could easily be compressed;
seismic records will probably not compress well with a
lossless algorithm such as GZIP (LZ77) because they do not
contain the same amount of redundancy. Quantification of the
average compression ratios for real data sets is reserved for
future work.
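As a starting point for that quantification, the lossless compression ratio of a given buffer can be estimated on the host with a few lines of zlib code (an illustrative sketch; compression is not part of the current implementation):

// Illustrative compression-ratio probe using zlib; not part of the current implementation.
#include <stdlib.h>
#include <zlib.h>

double compression_ratio(const Bytef* buf, uLong nbytes)
{
    uLongf out_len = compressBound(nbytes);            // worst-case compressed size
    Bytef* out = (Bytef*) malloc(out_len);
    double ratio = 1.0;
    if (compress(out, &out_len, buf, nbytes) == Z_OK)   // default compression level
        ratio = (double)nbytes / (double)out_len;       // >1 means the data compressed
    free(out);
    return ratio;
}

Running this probe over a velocity-model buffer and over a recorded seismic trace buffer would give a first estimate of whether compressing PCI-e transfers is worthwhile.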

REFERENCES
[1] Yu Zhang and James Sun. Practical Issues in Reverse Time Migration. First Break, Vol. 27, Jan 2009.
[2] Paulius Micikevicius. 3D Finite Difference Computation on GPUs using CUDA. (Draft technical report). Nvidia Corporation, 2008.
[3] CUDA Programming Guide, version 2.1. Nvidia Corporation,
December 12, 2008.
[4] Lei Jia, Wei Shi, and Jie Guo. Arbitrary-Difference Precise Integration
Method for the Computation of Electromagnetic Transients in Single-
Phase Nonuniform Transmission Line. IEEE Transactions on Power
Delivery, Vol. 23, No. 3, July 2008.
[5] HP ProLiant DL360 G5 Server series specifications.
http://h10010.www1.hp.com/wwpc/us/en/en/WF06a/15351-15351-
3328412-241644-241475-1121486.html
[6] HP ProLiant DL360 Generation 5 Overview.
http://h18004.www1.hp.com/products/quickspecs/12476_na/12476_na.h
tml
[7] Nvidia S1070 Computing System.
http://www.nvidia.com/object/product_tesla_s1070_us.html

ACKNOWLEDGEMENT
This work was carried out in conjunction with the Stanford
Exploration Project (SEP). Bob Clapp, Abdullah al-Theyab,
and Biondo Biondi all contributed invaluable technical
insights and conceptual guidance.
APPENDIX
A. Acronyms Quick-Reference
CUDA Compute Unified Device Architecture
FDTD Finite Difference, Time Domain (wave simulation)
GPU Graphics processing unit
GPGPU General purpose graphics processing unit
RTM Reverse Time Migration
SEP Stanford Exploration Project
SM Streaming Multiprocessor
SPMD Single program, multiple data


B. CUDA Software and Hardware Mapping
This table is a quick reference which explains the
approximate relationship between CUDA software
abstractions and their physical implementation in hardware.
This concise summary does not elaborate on details, which
can be found in the Programming Guide [3].

Software Model                                        Hardware Implementation
Element       Maximum                                 Physical Unit                            #
------------------------------------------------------------------------------------------------
Thread        512 threads per block, arranged in a    Scalar Processor (Streaming Core,        8
              3D block not exceeding 512x512x64       Core, SP); each core executes one
              in <x,y,z>                              thread at a time
Warp          Each 32 threads are statically          SP pipeline; a full warp (32 threads)    16
              assigned to a warp                      executes in 4 clock cycles
                                                      (pipelined 4-deep across 8 cores)
Block         65535x65535 in <x,y>                    Streaming Multiprocessor (SM)            30
Kernel Grid   Problem/simulation representation       GPU; only one kernel is running on       4
                                                      the GPU at a time (more are possible,
                                                      but this is complicated)

C. CUDA FDTD Kernel Implementation
/***************************************************
 CUDA Code for Column-based wave propagation equation

 Column kernel for GPU implementation of Wave Propagation

 Device Code (kernel runs in parallel threads on GPU)

 $Revision: 1072 $
 $Id: waveprop_kernel.cu 1072 2009-02-06 02:48:17Z nwmoussa $
 $Date: 2009-02-05 19:48:17 -0800 (Thu, 05 Mar 2009) $
 $Author: nwmoussa $
***************************************************/

#ifndef _WAVEPROP_COLUMN_KERNEL_CU_
#define _WAVEPROP_COLUMN_KERNEL_CU_

// Row-major indexing into a 2D wavefield array (macro name reconstructed;
// the original identifier was garbled in extraction).
#define IDX_(y,x) [(y)*WAVEFIELD_SZ_X+(x)]

/* Column-based wave propagation code */
__global__ void
waveprop_kernel( float* g_wvfield0,   // INPUT:  Wave values at (t-1)
                 float* g_wvfield1,   // INPUT:  Wave values at ( t )
                 float* g_wvfield2,   // OUTPUT: Wave values at (t+1) (to be computed here)
                 float* g_vel,        // INPUT:  Velocity model (should be read-only)
                 dim3 wvfield_sz      // INPUT:  Size of the array for each of the pointers
) {

    // __shared__ float s_wvfield[COLS_PER_BLK][WAVEFIELD_SZ_Y];

    // Threads are indexed as a "column number" in the current block
    if (threadIdx.x == 0) {
        // Cache handler code (optional performance tweak)
        //memcpy(s_wvfield, g_wvfield, COLS_PER_BLK*WAVEFIELD_SZ_Y*sizeof(float));
    }
    __syncthreads();   // Now, all threads are ready to run

    // Grid boundary conditions.  // TODO: Generalize from the 2nd order case
    if (threadIdx.x == 0 || threadIdx.x == COLS_PER_BLK-1) {
        // ... (boundary conditions) ...
    }

    unsigned int x, y;   // x: Horizontal grid point
                         // y: Vertical (depth) grid point

    // Numeric wave-equation solver
    // Iterate one column over depth
    // TODO: Add constant-size dt, dx, dy, dz into the Laplacians
    // TODO: Get [1 -2 1] constants from texture memory
    x = threadIdx.x;
    for (y = 1; y < WAVEFIELD_SZ_Y-1; y++) {   // interior points; edges are left to the boundary code above

        g_wvfield2 IDX_(y,x) = -1*g_wvfield0 IDX_(y,x) + 2*g_wvfield1 IDX_(y,x) +
            g_vel IDX_(y,x)*g_vel IDX_(y,x)*
            (1*g_wvfield1 IDX_(y-1,x) - 2*g_wvfield1 IDX_(y,x) + 1*g_wvfield1 IDX_(y+1,x) +
             1*g_wvfield1 IDX_(y,x-1) - 2*g_wvfield1 IDX_(y,x) + 1*g_wvfield1 IDX_(y,x+1));

    }

    // Store result back to the global wavefield
    __syncthreads();
    if (threadIdx.x == 0) {
        // When shared memory (cache) is used... we need to offload it
        // memcpy(g_wvfield, s_wvfield, COLS_PER_BLK*sizeof(float));
    }

    return;
}

#endif   // #ifndef _WAVEPROP_COLUMN_KERNEL_CU_

D. CUDA Host Code for Wave Propagation
/***************************************************
 CUDA Code for Column-based wave propagation equation

 Main program source code file
 This code executes on the Host (Xeon)

 $Revision: 1074 $
 $Id: waveprop.cu 1074 2009-02-06 02:52:20Z nwmoussa $
 $Date: 2009-02-05 19:52:20 -0800 (Thu, 05 Mar 2009) $
 $Author: nwmoussa $
***************************************************/

#include <stdlib.h>       // Standard C library
#include <stdio.h>        // Console and File I/O
#include <string.h>       // Some basic C string functions
#include <math.h>         // Math library including exponentials and trigonometry

// Project Includes
#include "wave_params.h"  // Wave Equation project parameters
#include "cutil_min.h"    // Minimalist implementation of the CUDA Utility Toolkit

#include <waveprop_kernel.cu>   // Main kernel code for columnized wave equation solver


/* Function Declarations */
void wave_init();

//TODO: These externs should eventually be moved to a separate header file
extern "C" int load_vel(const char* velocity_filename, void* velbuffer);


/* Main program entry point */
int main( int argc, char** argv) {
    wave_init();
    return 0;
}

/* Wave initialization and parallel-kernel caller. */
void wave_init() {

    // Assign environment to a single GPU (S1070 allows a value of {0, 1, 2, 3})
    cudaSetDevice( 0 );   // GPU 0

    // Fill a local dim3 variable with the computational grid size.
    dim3 wvfield_sz(WAVEFIELD_SZ_X, WAVEFIELD_SZ_Y, WAVEFIELD_SZ_Z);             // Wavefield size for one time-step
    int wvfield_nbytes = wvfield_sz.x*wvfield_sz.y*wvfield_sz.z*sizeof(float);   // Size in bytes

    // Allocate Host Memory for Velocity Model, to be read from disk
    float* h_velmodel;
    h_velmodel = (float*) malloc(wvfield_nbytes);   // pthread_fork here, read from disk while cuda mallocs
    load_vel("veldata.bin", h_velmodel);

    // Allocate GPU device memory:
    float* d_g_velmodel;   // Velocity Model stored at GPU

    // Wave Field round-robin buffers
    float* d_g_wvfield0;   // (x,y) plane at time (t-1)
    float* d_g_wvfield1;   // (x,y) plane at time ( t )
    float* d_g_wvfield2;   // (x,y) plane at time (t+1)
    // After each timestep, we transfer wvfield0 back to Host (or something)
    // Then set (in sequence):
    //   wvfld0 = wvfld1,  (data in here has finished transfer and is not needed)
    //   wvfld1 = wvfld2,
    //   wvfld2 = wvfld0,  New data for (t+1) will overwrite old wvfld0.

    cudaMalloc( (void**) &d_g_wvfield0, wvfield_nbytes );
    cudaMalloc( (void**) &d_g_wvfield1, wvfield_nbytes );
    cudaMalloc( (void**) &d_g_wvfield2, wvfield_nbytes );
    cudaMalloc( (void**) &d_g_velmodel, wvfield_nbytes );

    // Memory error check
    if ((d_g_wvfield0 == NULL) || (d_g_wvfield1 == NULL) || (d_g_wvfield2 == NULL) || (d_g_velmodel == NULL)) {
        printf("Failed to allocate enough GPU Device Memory.\n");
        printf("  Bytes Per Wavefield: %d\n", wvfield_nbytes);
        printf("  Wavefield Buffers:\n  d_wvfld0: \t%p\n  d_wvfld1: \t%p\n  d_wvfld2: \t%p\n",
               (void*)d_g_wvfield0, (void*)d_g_wvfield1, (void*)d_g_wvfield2);
        printf("  Velocity Model:\n  d_vel: \t%p\n", (void*)d_g_velmodel);
    }

    // Take the Velocity Model over to the GPU
    //cudaMemcpy( destination, source, num_bytes, direction)
    cudaMemcpy( d_g_velmodel, h_velmodel, wvfield_nbytes, cudaMemcpyHostToDevice);

    // TODO: Initial conditions, if any, for wvfield0,1,2? cudaMemcpy them over to the GPU.

    /* Define the geometry of the parallel solver for the kernels */
    dim3 grid( BLK_MAX_X, BLK_MAX_Y, BLK_MAX_Z);
    dim3 threads( COLS_PER_BLK, 1, 1 );

    /* Call the kernel in parallel. No dynamic shared memory allocated. */
    waveprop_kernel<<<grid, threads, 0>>>(
        d_g_wvfield0,
        d_g_wvfield1,
        d_g_wvfield2,
        d_g_velmodel,
        wvfield_sz
    );

    // Deallocate memory on Host (CPU) and Device (GPU)
    free(h_velmodel);
    cudaFree(d_g_wvfield0);
    cudaFree(d_g_wvfield1);
    cudaFree(d_g_wvfield2);
    cudaFree(d_g_velmodel);

    cudaThreadExit();
}