
Image Convolution Processing: a GPU versus FPGA Comparison

Lucas M. Russo, Emerson C. Pedrino, Edilson Kato
Federal University of Sao Carlos - DC
Rodovia Washington Luís, km 235 - SP-310
13565-905 São Carlos - São Paulo - Brazil
lucas_russo@comp.ufscar.br; emerson, kato@dc.ufscar.br

Valentin Obac Roda
Federal University of Rio Grande do Norte - DEE
Campus Universitário Lagoa Nova
59072-970 Natal - Rio Grande do Norte - Brazil
valentin@ct.ufrn.br

Abstract—Convolution is one of the most important operators used in image processing. With the constant need to increase performance in high-end applications and the rise in popularity of parallel architectures, such as GPUs and those implemented in FPGAs, comes the need to compare these architectures in order to determine which of them performs better and in which scenarios. In this article, convolution was implemented in each of the aforementioned architectures with the following languages: CUDA for GPUs and Verilog for FPGAs. In addition, the same algorithm was also implemented in MATLAB, using predefined operations, and in C using a regular x86 quad-core processor. Comparative performance measures, considering the execution time and the number of clock cycles, were taken and discussed in the paper. Overall, it was possible to achieve a CUDA speedup of roughly 200x in comparison to C, 70x in comparison to Matlab and 20x in comparison to the FPGA.

Keywords—Image processing; Convolution; GPU; CUDA; FPGA

I. INTRODUCTION

In 2006, Nvidia Corporation announced a new general purpose parallel computing architecture based on the GPGPU (General-Purpose Computing on Graphics Processing Units) paradigm: CUDA (Compute Unified Device Architecture) [1]. CUDA is a GPGPU architecture and belongs to the SPMD (single process, multiple data; or single program, multiple data) category of parallel programming, a model based on the execution of the same program by different processors, each supplied with different input data, without the strict coordination requirement among them that the SIMD (single instruction, multiple data) model imposes. Central to the model are the so-called kernels: C-style functions that are executed in parallel by multiple threads and that, when called from the application, dynamically allocate a hierarchical processing structure specified by the user. Portions of sequential code are usually interleaved with the kernel executions in a CUDA program flow; for this reason, it constitutes a heterogeneous programming model.

The CUDA model was conceived to implement so-called transparent scalability effectively, i.e., the ability of the programming model to adapt itself to the available hardware in such a way that the number of processors can be scaled without altering the algorithm and, at the same time, to reduce the development time of parallel or heterogeneous solutions. All the aforementioned model abstractions are particularly suitable and easily adapted to the field of digital image processing, given that many applications in this area operate independently on a pixel-by-pixel or pixel-window basis.

Many years before the advent of the CUDA architecture, Xilinx brought the first FPGA chip to the market in 1985 [2]. The FPGA is, basically, a highly customizable integrated circuit that has been used in a variety of scientific fields, such as digital signal processing, voice recognition, bioinformatics, computer vision, digital image processing and other applications that require high performance, including real-time systems and high-performance computing.

The comparison between CUDA and FPGAs has been documented in various works in different application domains. Asano et al. [3] compared the use of CUDA and FPGAs in image processing applications, namely two-dimensional filters, stereo vision and k-means clustering; Che et al. [4] compared their use in three application algorithms: Gaussian Elimination, Data Encryption Standard (DES), and Needleman-Wunsch; Kestur et al. [5] developed a comparison for BLAS (Basic Linear Algebra Subroutines); Park et al. [6] analyzed the performance of integer and floating-point algorithms; and Weber et al. [7] compared the architectures using a Quantum Monte Carlo application.

In this work, CUDA and a dedicated FPGA architecture are used and compared on the implementation of convolution, an operation often used in image processing.
II. METHODOLOGY

All CPU (i.e., Matlab and C) and GPU (i.e., CUDA) execution times were obtained on the following configuration:

Component | Description
Hardware  | Processor: Intel Core i5 750 (8 MB cache L2); Motherboard: ASUS P7P55DE-PRO; RAM memory: 2 x 2 GB Corsair (DDR2-800); Graphics board: XFX Nvidia GTX 295, 896 MB
Software  | Windows 7 Professional 64-bit; Visual Studio 2008 SP1
Drivers   | Nvidia video driver version 190.38; Nvidia CUDA toolkit version 2.3
FPGA      | Cyclone II EP2C35F672 on Terasic DE2 board; Quartus II 10.1 software with SOPC Builder, NIOS II EDS 10.1 and ModelSim 6.6d simulation tool, for the implementation of the algorithms

The main comparison parameters presented in this article are the execution time and the number of clock cycles of the implemented algorithms. To obtain them, different approaches were used according to the architecture profiled.

In C, the Windows Performance Counters were used through the functions QueryPerformanceCounter() and QueryPerformanceFrequency(). The former returns the current value of the high-resolution counter, so the difference between two readings, divided by the counter frequency returned by the latter, gives the elapsed time.
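As an illustration (not the paper's actual code; convolve_image is a placeholder), the measurement pattern with these Windows API calls is:

#include <windows.h>
#include <stdio.h>

/* Hypothetical placeholder for the separable convolution routine. */
void convolve_image(void);

int main(void)
{
    LARGE_INTEGER freq, start, stop;

    QueryPerformanceFrequency(&freq);   /* counter ticks per second     */
    QueryPerformanceCounter(&start);    /* reading before the workload  */

    convolve_image();                   /* code section being profiled  */

    QueryPerformanceCounter(&stop);     /* reading after the workload   */

    double elapsed_s = (double)(stop.QuadPart - start.QuadPart)
                       / (double)freq.QuadPart;
    printf("Execution time: %f s\n", elapsed_s);
    return 0;
}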
In CUDA, the Event Management API provides functionality to create, destroy and record events. Hence, it is possible to measure the amount of time it took to execute a specific part of the code, such as a kernel call, in the manner described in [1]. Concerning the clock cycles, the clock() function was used within the kernel to obtain the measurement.
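A minimal sketch of this pattern follows; the kernel name convKernel and the launch configuration are placeholders, not the kernels used in this work:

#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate the measurement pattern;
// the real line/column convolution kernels take image and mask arguments.
__global__ void convKernel(void)
{
    clock_t t0 = clock();              // device cycle counter at start
    /* ... convolution work would go here ... */
    clock_t t1 = clock();              // device cycle counter at end
    (void)(t1 - t0);                   // elapsed clock cycles for this thread
}

void timeKernel(void)
{
    cudaEvent_t start, stop;
    float elapsed_ms = 0.0f;
    dim3 grid(1), block(64);           // placeholder launch configuration

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);         // event recorded before the launch
    convKernel<<<grid, block>>>();
    cudaEventRecord(stop, 0);          // event recorded after the launch

    cudaEventSynchronize(stop);        // wait for the kernel to finish
    cudaEventElapsedTime(&elapsed_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}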
In Matlab, a simple approach is provided by a built-in stopwatch, controlled with the tic and toc syntax. The first starts the timer and the second stops it, displaying the time, in seconds, needed to execute the statements between tic and toc. The Matlab number of clock cycles was not measured, since no simple way to do so was found.
Finally, on the FPGA, it is possible to infer the execution time directly from the implemented architecture. With knowledge of the clock rate, explicitly defined by the designer, and of the number of clock cycles taken to process the input data, extracted from the waveforms or from the architecture itself, the following expression can be used:

execution time = number of clock cycles / clock frequency    (1)
III. CONVOLUTION

Mathematically, convolution can be expressed as a linear combination, or sum of products, of the mask coefficients with the input function:

g(x) = Σ_{s=-a}^{a} w(s) f(x - s)    (2)

where f denotes the input function and w the mask. It is implicit that equation (2) is applied for every point x of the input function.

It is possible to extend the convolution operation to two dimensions as follows:

g(x, y) = Σ_{s=-a}^{a} Σ_{t=-b}^{b} w(s, t) f(x - s, y - t)    (3)

where, for a mask w of size SxT, a = (S - 1)/2 and b = (T - 1)/2.

In convolution there is a limitation regarding the boundaries of the input image, since near the borders the mask is positioned in such a way that some mask values do not overlap the input image. Two approaches are commonly used in the context of image processing: padding the edges of the input image with zeros or clamping the edges of the input image with the closest border pixel. In this work the first choice is used, as in Gonzalez [8].

Considering an image of size MxN pixels and a mask of size SxT, the multiplication is the most costly operation. Hence, (MN)(ST) multiplications are performed and, consequently, the algorithm belongs to O(MNST).

If a mask w(x,y) can be decomposed into w1(x) and w2(y) in such a way that w(x,y) = w1(x) w2(y), where w1(x) is a vector of size (Sx1) and w2(y) is a vector of size (1xT), the 2-D convolution can be performed as two 1-D convolutions. In this case the convolution is said to be separable and the algorithmic complexity decays to O(MN(S+T)), allowing for a more flexible implementation. For example, a 15x15 mask applied to a 640x480 image requires about 69 million multiplications in the direct formulation, but only about 9.2 million in the separable one. Hence, the separable convolution formula can be expressed as in equation (4):

g(x, y) = Σ_{s=-a}^{a} w1(s) Σ_{t=-b}^{b} w2(t) f(x - s, y - t)    (4)
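As an illustrative example of such a decomposition (not taken from the paper), the common 3x3 smoothing mask w = (1/16)[1 2 1; 2 4 2; 1 2 1] factors into w1 = (1/4)[1; 2; 1] and w2 = (1/4)[1 2 1], since w1*w2 reproduces w; the image can then be filtered with the 3-tap column mask followed by the 3-tap line mask instead of the full 3x3 mask.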

IV. IMPLEMENTATION

The separable convolution was implemented in C, CUDA and Matlab (built-in function), and the regular convolution [Eq. 3] was implemented in the FPGA. The reason to implement the regular convolution in the FPGA was performance limitations: the separable algorithm, although reducing the total number of operations performed [Eq. 4], requires the image data stream to be processed twice, once for lines and once for columns. Consequently, the column filter alone would take as much time as the regular convolution to process the entire image, due to the time required to fill the shift register and to the streaming interface, which can transmit only one pixel per clock cycle.
A. C Implementation

The C implementation of convolution was based on [Eq. 4] and is fairly straightforward. The implemented sequential separable algorithm follows.

/* Line convolution */
for i ← 0 to number_of_lines - 1
    for j ← 0 to number_of_columns - 1
        g(i, j) ← 0
        for l ← -b to b
            if j - l >= 0 and j - l < number_of_columns
                g(i, j) ← g(i, j) + f(i, j - l) * w2(b + l)
            end-if
        end-for
    end-for
end-for

/* Column convolution */
for i ← 0 to number_of_lines - 1
    for j ← 0 to number_of_columns - 1
        o(i, j) ← 0
        for k ← -a to a
            if i - k >= 0 and i - k < number_of_lines
                o(i, j) ← o(i, j) + g(i - k, j) * w1(a + k)
            end-if
        end-for
    end-for
end-for

The image was first loaded into memory with the OpenCV C library. Then the line convolution (with mask w2, of size 2*b+1) was applied to every pixel, followed by the column convolution (with mask w1, of size 2*a+1).
B. Matlab Implementation
For Matlab, the conv2() built-in function was used to
perform the convolution.

C. CUDA Implementation

On CUDA, the algorithm is implemented through two different kernels: the first part is implemented by the line kernel and the second one by the column kernel. The development of the convolution was based on the algorithm of Podlozhnyuk [9], extended to support real images of any size as inputs.

Line Convolution Kernel

The threads were grouped in 2-D blocks of size (4x16), or 4 lines by 16 columns, and, in turn, the blocks were grouped in a 2-D grid whose size depends on the dimensions of the input image. Each thread in the block is responsible for fetching six pixels from the input image to per-block shared memory. By doing this, the accesses to the global device memory are reduced, as long as the memory coalescing restrictions are guaranteed.

Figs. 1 and 2 illustrate the general idea with a block size of (4x4) instead of the real dimensions (4x16), for display purposes. The first line of the 2-D block (Fig. 1) is mapped to the first line of the input image (Fig. 2). In this way, thread 0,0 is mapped to the pixels S0,0 (left apron region), S0,0+block_size, S0,0+2*block_size, S0,0+3*block_size, S0,0+4*block_size, S0,0+5*block_size (right apron region), and so on. Moreover, concerning the actual block size, 384 pixels, that is, 16x6 (number of pixels loaded for each line) x 4 (number of block lines), are loaded to shared memory.

Figure 1. Example of a 2-D block of size 16 (i.e., 16 threads) for the line and column kernels.

Figure 2. Example of an image region with 96 pixels mapped to the 16 threads of the line kernel. Pixels with the same color are mapped to the same thread (see Fig. 1).

After the loading stage, all threads within a block must synchronize their execution, since, in the next stage, threads are going to access elements that were loaded by other ones. In order to do that, a call to the CUDA API function __syncthreads() is issued, and the program can then proceed correctly.

In the final stage, each thread is assigned the task of calculating four output pixels, which are in the same positions as the ones the thread was mapped to in the main region (Fig. 2).

Concerning the flexibility of the convolution filter, some image widths are not multiples of (BLOCK_SIZE_ROW_X * NUM_LOADS_PER_THREAD), in which BLOCK_SIZE_ROW_X is the number of columns per block and NUM_LOADS_PER_THREAD is the number of pixels fetched from the main region. In order to handle this, the line kernel is launched again with the following offset from the beginning of the image:

rowOffset = width - BLOCK_SIZE_ROW_X * NUM_LOADS_PER_THREAD    (5)

By doing this, every column between rowOffset and the last column is calculated, even if some of them were previously calculated. Hence, it is not necessary to determine exactly which was the last calculated column in order to construct rowOffset, which would increase the verification overhead. Additionally, the memory coalescing requirements are automatically satisfied for NUM_LOADS_PER_THREAD equal to four and BLOCK_SIZE_ROW_X equal to 16.
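For illustration, a simplified CUDA row-convolution kernel using per-block shared memory and zero padding is sketched below. It is not the kernel used in this work: the names (rowConvolutionKernel, d_mask), the tile size and the strided multi-pixel-per-thread loop are assumptions made only to make the loading, synchronization and computation stages concrete.

#include <cuda_runtime.h>

#define KERNEL_RADIUS 7               // e.g., a 15-tap mask
#define BLOCK_W       16              // threads per block in x
#define BLOCK_H       4               // threads per block in y
#define TILE_W        (BLOCK_W * 4)   // output pixels per block row

// Mask stored in constant memory; filled from the host with
// cudaMemcpyToSymbol before the launch.
__constant__ float d_mask[2 * KERNEL_RADIUS + 1];

// Launched with blockDim = (BLOCK_W, BLOCK_H),
// gridDim = (ceil(width / TILE_W), ceil(height / BLOCK_H)).
__global__ void rowConvolutionKernel(float *dst, const float *src,
                                     int width, int height)
{
    __shared__ float tile[BLOCK_H][TILE_W + 2 * KERNEL_RADIUS];

    const int tileStartX = blockIdx.x * TILE_W;
    const int y = blockIdx.y * BLOCK_H + threadIdx.y;
    const bool rowValid = (y < height);

    // Cooperative load of the tile and its left/right aprons
    // (zero outside the image, i.e., zero padding).
    for (int i = threadIdx.x; i < TILE_W + 2 * KERNEL_RADIUS; i += BLOCK_W) {
        int x = tileStartX - KERNEL_RADIUS + i;
        tile[threadIdx.y][i] =
            (rowValid && x >= 0 && x < width) ? src[y * width + x] : 0.0f;
    }
    __syncthreads();        // all loads must finish before any thread computes

    if (!rowValid) return;

    // Each thread produces several output pixels of the tile.
    for (int i = threadIdx.x; i < TILE_W; i += BLOCK_W) {
        int x = tileStartX + i;
        if (x >= width) break;
        float sum = 0.0f;
        for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k)
            sum += d_mask[KERNEL_RADIUS + k]
                   * tile[threadIdx.y][i + KERNEL_RADIUS - k];
        dst[y * width + x] = sum;
    }
}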

Column Convolution Kernel

The threads in the column filter are divided in 2-D blocks of size 8x16, or 8 lines by 16 columns. As in the line kernel, the grid size depends on the input image and each thread is responsible for fetching six pixels from the input image to shared memory.

The decision to use 16 columns in this block size was made considering the memory coalescing requirements (i.e., half-warp access to contiguous memory positions). Conversely, the 8 lines in the block size tend to reduce the number of apron pixels loaded to shared memory, reducing the ratio (number of apron pixels / number of output pixels) and increasing memory reuse, since these pixels will also be loaded into the shared memories of other blocks.

In the same way as the line kernel, the column kernel is launched again for image heights that are not multiples of (BLOCK_SIZE_COLUMN_Y * NUM_LOADS_PER_THREAD), in which BLOCK_SIZE_COLUMN_Y is the number of lines per block and NUM_LOADS_PER_THREAD is the number of fetches per thread from the main region of the image. Thus, the following offset was used:

columnOffset = height - BLOCK_SIZE_COLUMN_Y * NUM_LOADS_PER_THREAD    (6)

D. FPGA Implementation

For the FPGA implementation, the architecture depicted in Fig. 3 was developed with the assistance of SOPC Builder and Verilog HDL coding.

The architecture is responsible for the following functions. A grayscale or binary image in JPEG format, stored in Flash memory, is converted to RAW format, that is, to a matrix of integer values between 0 (i.e., black) and 255 (i.e., white). This conversion was performed by the NIOS II Fast Core softcore processor [10], the C library libjpeg [11] and the wrapper function for decompressing JPEG images available in [12]. Therefore, this constitutes the application software layer and is controlled by the NIOS II EDS.

Figure 3. Pipeline architecture for image processing.
After that, the decompressed image is written to the pixel buffer (the lower addresses of the SDRAM chip) so that the DMA (Pixel Buffer DMA Controller) is able to access it without interrupting the processor and transmit the pixels to the remaining part of the pipeline (Image Processing Pipeline). The image is then processed by various streaming components, interconnected through the Avalon Streaming Interface [13], which constitute the pipeline (Image Processing Pipeline, Fig. 3). Firstly, the image is processed by the user streaming component. Next, each pixel (8-bit grayscale) is converted to 30-bit RGB (RGB Resampler). Then, a dual-clock queue (Clock Crossing Bridge) acts as a bridge between two clock domains (100 MHz, the general design clock, and 25 MHz, the VGA clock). Lastly, a VGA controller is used to display the processed image.

The implemented convolution module is interfaced to the rest of the architecture by means of the Avalon Streaming Interface.

This module is based on a finite state machine with four states: DATA_FILL_BUFFER_STATE, DATA_PROCESSING_STATE_1, DATA_PROCESSING_STATE_2 and DATA_END_PROCESSING_STATE.

Firstly, upon the reset signal, every position of the shift register is initialized with the value 0. The first state (DATA_FILL_BUFFER_STATE) consists in reading the input interface at each clock rising edge and storing the value read in the shift register (Fig. 4). This state lasts until floor(KS/2)*(IW + 1) pixels have been read, or, in other words, until the first valid pixel is positioned at the center coordinate of the gray region of Fig. 4.

Figure 4. Layout of the convolution module shift register. KS (Kernel Size) denotes the size, in one dimension, of the kernel used; that means, for a 3x3 kernel, KS = 3. IW (Image Width) denotes the width of the input image; that means, for a 640x480 image, IW = 640. The grey area indicates the pixels used in the convolution calculation.

Next, the present state changes to DATA_PROCESSING_STATE_1. The first moment of this state is depicted in Fig. 5.

Figure 5. Example of an image with pixel values ranging from 0 to 255 (left) and the values associated with the shift register with KS = 3 at the first moment of the DATA_PROCESSING_STATE_1 state. The x symbol indicates a value not considered.

From this state until the end of the state machine (i.e., the DATA_END_PROCESSING_STATE state), the convolution sum (Eq. 3) is applied to the gray area and, at every clock cycle, an output pixel is made available at the output interface (i.e., the Avalon Streaming source interface). It is important to mention that this calculation is performed by a parallel combinational circuit submodule (Convolution Operation, Fig. 6). The present state changes to DATA_PROCESSING_STATE_2 when all input pixels have been read, or, more specifically, when the number of pixels read equals the number of pixels of the input image.

This state is similar to DATA_PROCESSING_STATE_1 with the exception that, as there are no more pixels to be read, the value zero is forced into the shift register. By doing this, and similarly to what happens for the first input pixels in the DATA_PROCESSING_STATE_1 state, a border treatment is performed, since the border pixels (i.e., those without a complete neighborhood) have their values convolved with neighbor pixels or with the value 0. The last moment of this state is similar to the one depicted in Fig. 5, with the exception that the values considered are the ones located near the end of the image (i.e., the bottom-right side).
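The sliding-window behaviour described above can be summarized by a small behavioural C model; it is only an illustration under assumed sizes (KS = 3, a small image width), not the Verilog module:

#include <stdio.h>
#include <string.h>

#define KS 3                              /* kernel size (one dimension)  */
#define IW 8                              /* image width (illustration)   */
#define REG_LEN ((KS - 1) * IW + KS)      /* a common line-buffer length  */

static int reg[REG_LEN];                  /* models the shift register    */

/* One "clock cycle": shift everything by one and insert the new pixel
   (the state machine inserts zeros here once the input is exhausted).    */
static void shift_in(int pixel)
{
    memmove(reg + 1, reg, (REG_LEN - 1) * sizeof reg[0]);
    reg[0] = pixel;
}

/* The gray-area taps: element (r, c) of the window sits r*IW + c positions
   behind the newest pixel, so the KS x KS neighborhood is read directly
   from fixed register positions by the combinational convolution block.  */
static int window_convolution(const int w[KS][KS])
{
    int acc = 0;
    for (int r = 0; r < KS; r++)
        for (int c = 0; c < KS; c++)
            acc += w[r][c] * reg[r * IW + c];
    return acc;
}

int main(void)
{
    const int w[KS][KS] = { {1, 1, 1}, {1, 1, 1}, {1, 1, 1} };

    for (int i = 0; i < REG_LEN; i++)     /* fill stage: stream pixels in */
        shift_in(100);                    /* constant test image          */

    printf("window sum = %d\n", window_convolution(w));  /* prints 900    */
    return 0;
}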

Lastly, after the calculation of the last pixel, the present state changes to the DATA_END_PROCESSING_STATE state. This last state is responsible for resetting the shift register and the counters to their default values (i.e., zero in this case). Next, the state machine returns to its initial state, DATA_FILL_BUFFER_STATE.

It is important to highlight that only kernels up to 5x5 and images up to 640x480 are supported by the FPGA convolution module. This is due to the memory and logic element limitations of the DE2 board. Therefore, results involving different kernel and image sizes were estimated considering the module architecture itself.

Figure 6. Convolution module block diagram.

The number of clock cycles was obtained disregarding the loading period (i.e., the DATA_FILL_BUFFER_STATE state). After this state, until the end of the state machine, one output pixel is available at the output module interface at every clock cycle.
V. RESULTS AND COMPARISON

The graphs for the convolution operation comparing the execution time for various grayscale image resolutions with a mask size of 15, as well as the number of clock cycles and the CUDA speedup with respect to Matlab, C and the FPGA architecture, are presented in Figures 7, 8 and 9.

Figure 7. Average execution times of the convolution with mask size of 15 and various image resolutions.

For the convolution application, it is possible to observe that the execution time graph (Fig. 7) for all three architectures behaves exponentially (note the log scale on the y-axis) and tends to maintain the same growth rate as the image resolution increases. The exception to this is the Matlab implementation for the image resolutions 3300x2400 and 4096x4096, possibly related to cache performance.

Figure 8. Number of clock cycles of the convolution with mask size of 15 and various image resolutions.

The speedup graph (Fig. 9) shows an approximately steady speedup of CUDA with regard to the other implementations across the various image resolutions. Again, the exception to this is the Matlab implementation for the two last image samples. The reason for this good CUDA speedup is the large amount of arithmetic operations, the high granularity and the high resource utilization of the GPU in comparison to C, Matlab and the FPGA.

Figure 9. Speedup of the convolution with mask size of 15 and various image resolutions.

The C implementation did not exploit parallelism and served as a control implementation for comparison with the other, parallel algorithms. Because of that, its execution time and number of clock cycles were worse than those of the other implementations.
The FPGA implementation explored some parallelism, as with the parallel convolution calculation, but lacked a truly parallel approach because of the FPGA and development board (DE2) resource limitations. Therefore, owing to the parallel calculation, its execution times were worse only than those of a truly parallel implementation (i.e., CUDA). It is interesting to notice that the FPGA number of clock cycles (Fig. 8) was small, even less than CUDA for the image size of 512x512. Besides, the clock rate used in the FPGA was relatively low (i.e., 100 MHz in this design) compared to CUDA (i.e., 1242 MHz for the processor clock).

Another limitation of the FPGA used was the absence of multiplier blocks, which would improve significantly the performance of the convolution operation. Hence, using a larger FPGA with more resources and a faster clock should increase the performance of the FPGA and possibly overcome the CUDA implementation.

Regarding the FPGA architecture, it can be seen from the graphs that it performed well, although worse than CUDA, and kept a steady growth in execution time and number of clock cycles. It must be noticed that there are denser FPGAs available that can operate at higher clock rates, which would certainly increase the FPGA performance and possibly even surpass the GPU performance.

Finally, it is possible to improve the performance of the FPGA algorithm even further. Dividing the input image region into several squares and providing that each region be transmitted to parallel convolution modules through different data streams can improve the performance roughly by the number of convolution modules. However, by doing this, more FPGA die area is consumed, which could possibly make it impractical.

A positive point in favor of the FPGA is that it can operate on a small board, with the peripheral interfaces integrated, while the GPU board needs to be connected to a PC, which is certainly much larger and more power consuming.
VI. CONCLUSIONS

In this paper, we presented a comparison between CUDA, C, Matlab and an FPGA for the convolution of grayscale images. Based on the results presented, it can be inferred that CUDA presents the best performance in execution time, number of clock cycles and speedup in comparison to C, Matlab and the implemented FPGA architecture, and that its advantage increases with the growth in image resolution. That is due to the fact that CUDA tends to better exploit massive amounts of data, such as high resolution images, based on its inherent features such as multiple pipelines, high theoretical peak GFLOPS and high bandwidth [14].
ACKNOWLEDGMENTS

The authors are grateful to FAPESP, grants number 2010/04675-4 and 2009/17736-4, to the Department of Computer Science, Federal University of Sao Carlos, and to the Department of Electrical Engineering, Federal University of Rio Grande do Norte, for the support throughout this work.
REFERENCES

[1] Nvidia Corporation. (2009). Nvidia CUDA Programming Guide. [Online]. Available: http://developer.nvidia.com/object/cuda_2_3_downloads.html
[2] Xilinx. (2010). Our History. [Online]. Available: www.xilinx.com/company/history
[3] S. Asano, T. Maruyama and Y. Yamaguchi, "Performance Comparison of FPGA, GPU and CPU in Image Processing," in International Conference on Field Programmable Logic and Applications - FPL 2009, Prague, 2009, pp. 126-131.
[4] S. Che, J. Li, J. W. Sheaffer, K. Skadron and J. Lach, "Accelerating Compute-Intensive Applications with GPUs and FPGAs," in Symposium on Application Specific Processors - SASP 2008, Anaheim, 2008, pp. 101-107.
[5] S. Kestur, J. D. Davis and O. Williams, "BLAS Comparison on FPGA, CPU and GPU," in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Lixouri, Kefalonia, 2010, pp. 288-293.
[6] S. J. Park, D. R. Shires and B. J. Henz, "Coprocessor Computing with FPGA and GPU," in DoD HPCMP Users Group Conference, 2008. DOD HPCMP UGC, Seattle, 2008, pp. 366-370.
[7] R. Weber, A. Gothandaraman, R. J. Hinde and G. D. Peterson, "Comparing Hardware Accelerators in Scientific Applications: A Case Study," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 58-68, 2011.
[8] R. C. Gonzalez and R. E. Woods, "Image Enhancement in the Spatial Domain," in Digital Image Processing, 3rd ed. Prentice Hall, 2008.
[9] V. Podlozhnyuk. (2007, Jun.). Image Convolution with CUDA. [Online]. Available: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_64_website/projects/convolutionSeparable/doc/convolutionSeparable.pdf
[10] Altera Co. NIOS II Processor. [Online]. Available: www.altera.com/devices/processor/nios2/ni2-index.html
[11] Independent JPEG Group. libjpeg. [Online]. Available: www.ijg.org
[12] Altera Co. Nios II System Architect Design. [Online]. Available: www.altera.com/support/examples/nios2/exm-system-architect.html
[13] Altera Co. (2011). Avalon Streaming Interface, Chap. 5. [Online]. Available: www.altera.com/literature/manual/mnl_avalon_spec.pdf
[14] D. B. Kirk and W. W. Hwu, "Introduction," in Programming Massively Parallel Processors: A Hands-on Approach, 1st ed. Morgan Kaufmann, 2010.
