Comparison
Lucas M. Russo, Emerson C. Pedrino, Edilson Kato
I. INTRODUCTION
Description
  Hardware: Processor Intel Core i5 750 (8 MB L2 cache); Motherboard ASUS P7P55DE-PRO; RAM 2 x 2 GB Corsair (DDR2-800); Graphics board XFX Nvidia GTX 295, 896 MB
  Software: Windows 7 Professional 64-bit; Visual Studio 2008 SP1
  Drivers: Nvidia video driver version 190.38; Nvidia CUDA toolkit version 2.3
  FPGA: Cyclone II EP2C35F672 on Terasic DE2 board; Quartus II 10.1 with SOPC Builder, NIOS II EDS 10.1 and ModelSim 6.6d simulation tool, for the implementation of the algorithms
CONVOLUTION
image with the closest border pixel. In this work the first choice is used, as in GONZALES [8].
Considering an image of size MxN pixels and a mask of size SxT, multiplication is the most costly operation. Hence, (MN)(ST) multiplications are performed and, consequently, the algorithm belongs to O(MNST).
If a mask w(x,y) can be decomposed into w1(x) and w2(y) in such a way that w(x,y) = w1(x)w2(y), where w1(x) is a vector of size (Sx1) and w2(y) is a vector of size (1xT), the 2D convolution can be performed as two 1D convolutions. In this way, the convolution is said to be separable and the algorithmic complexity decays, allowing for a more flexible implementation. Hence, the separable convolution formula can be expressed as in equation 4:

g(x,y) = Σs w1(s) Σt w2(t) f(x-s, y-t),  with s = -a,...,a and t = -b,...,b    (4)
IV. IMPLEMENTATION
/* Column Convolution */
for i ← 0 to number_of_lines - 1
    for j ← 0 to number_of_columns - 1
        o(i, j) ← 0
        for k ← -a to a
            if i-k >= 0 and i-k < number_of_lines
                o(i, j) ← o(i, j) + g(i-k, j) * w1(a+k)
            end-if
        end-for
    end-for
end-for
C. CUDA Implementation

On CUDA, the algorithm is implemented through two different kernels: the first part is implemented through the line kernel and the second one is implemented through the column kernel. The development of the convolution was based on the algorithm of Podlozhnyuk [9], extended to support real images of any size as inputs.
Line Convolution Kernel

The threads were grouped in 2-D blocks of size (4x16), or 4 lines by 16 columns, and, in turn, they were grouped in a 2-D grid with size depending on the dimensions of the input image. Each thread in the block is responsible for fetching six pixels from the input image to per-block shared memory. By doing this, the access to the Global Device Memory is reduced, as long as the memory coalescing restrictions are guaranteed.
Figs. 1 and 2 illustrate the general idea with a block size of (4x4) instead of its real dimensions (4x16), for image displaying purposes. The first line of the 2-D block (Fig. 1) is mapped to the first line of the input image (Fig. 2). In this way, thread(0,0) is mapped to the pixels S(0,0) (left apron region), S(0,0+block_size), S(0,0+2*block_size), S(0,0+3*block_size), S(0,0+4*block_size) and S(0,0+5*block_size) (right apron region), and so on. Moreover, concerning the actual block size, 384 pixels, that is, 16x6 (number of pixels loaded for each line) x 4 (number of block lines), are loaded to shared memory.
rowOffset = width - BLOCK_SIZE_ROW_X * NUM_LOADS_PER_THREAD    (5)

By doing this, every column between rowOffset and the last column will be calculated, even if some of them were previously calculated. Hence, it is not necessary to determine exactly which was the last calculated column in order to construct rowOffset, which would increase the verification overhead. Additionally, the memory coalescing requirements are automatically satisfied for NUM_LOADS_PER_THREAD equal to four and BLOCK_SIZE_ROW_X equal to 16.
Column Convolution Kernel

The threads, in the column filter, are divided into 2-D blocks of size (8x16), or 8 lines by 16 columns. As in the line kernel, the grid size depends on the input image and each thread is responsible for fetching six pixels from the input image to shared memory.

The decision for 16 columns in this block size was made considering the memory coalescing requirements (i.e., half-warp access to contiguous memory positions). Conversely, the 8 lines in the block size tend to reduce the number of apron pixels loaded to shared memory, reduce the ratio (number of apron pixels/number of output pixels) and increase memory reuse, since they will be loaded to other shared memories.
In the same way as the line kernel, the column kernel is launched again for images not multiple of (BLOCK_SIZE_COLUMN_Y*NUM_LOADS_PER_THREAD), in which BLOCK_SIZE_COLUMN_Y is the number of lines per block and NUM_LOADS_PER_THREAD is the number of fetches per thread from the image main region. Thus, the following offset was used:
Figure 1. Example of a 2-D block of size 16 (i.e., 16 threads) for line and column kernels.
columnOffset = height - BLOCK_SIZE_COLUMN_Y * NUM_LOADS_PER_THREAD    (6)
After the loading stage, all threads within a block must synchronize their execution, since, in the next stage, threads are going to access elements that were loaded by other ones. In order to do that, a call to the CUDA API function __syncthreads() is issued and the program can proceed correctly.
D. FPGA Implementation

For the FPGA implementation, the architecture depicted in Fig. 3 was developed with the assistance of SOPC Builder and Verilog HDL coding.
In the final stage, each thread is assigned the task of calculating four output pixels, which are in the same positions as the ones that the thread was mapped to in the main region (Fig. 2).
The architecture is responsible for the following functions. A grayscale or binary image in JPEG format, stored on Flash Memory, is converted to RAW format, that is, to a matrix of integer values between 0 (i.e., black) and 255 (i.e., white).
Concerning the flexibility of the convolution filter, some images are not multiple of (BLOCK_SIZE_ROW_X*NUM_LOADS_PER_THREAD), in which BLOCK_SIZE_ROW_X is the number of columns per block and NUM_LOADS_PER_THREAD is the number of pixels fetched from the main region. In order to solve this, the line kernel is launched again.
Such conversion was performed by the softcore processor NIOS II Fast Core [10], the C library libjpeg [11] and the wrapper function for decompressing JPEG images available in [12]. Therefore, this constitutes the application software layer and is controlled by the NIOS II EDS.
Figure 2. Example of an image region with 96 pixels mapped to 16 threads of the line kernel. Pixels with the same color are mapped to the same thread (see Fig. 1).
After that, the decompressed image is written to the pixel buffer (SDRAM chip lower addresses) and, in this way, the DMA (Pixel Buffer DMA Controller) is able to access it without interrupting the processor and transmit the pixels to the remaining of the pipeline (Image Processing Pipeline). Then, the image is processed through various streaming components, with their interface called Avalon Streaming Interface [13], constituting the pipeline (Image Processing Pipeline, Fig. 3).
Firstly, the image is processed by the User Streaming Component. Following, each pixel (8-bit grayscale) is converted to 30-bit RGB (RGB Resampler). Then, there is a dual-clock queue (Clock Crossing Bridge) acting as a bridge between two clock domains (100 MHz, the general design clock, and 25 MHz, the VGA clock). And, lastly, a VGA controller is used to display the processed image.
The implemented convolution module is interfaced to the rest of the architecture by means of the Avalon Streaming Interface.
This module is based on a finite state machine with four states: DATA_FILL_BUFFER_STATE, DATA_PROCESSING_STATE_1, DATA_PROCESSING_STATE_2 and DATA_END_PROCESSING_STATE.
Firstly, upon the reset signal, every position of the shift register is initialized with the value 0. The first state (DATA_FILL_BUFFER_STATE) consists of reading the input interface at each clock rise and storing the value read in the shift register (Fig. 4). This state lasts until floor(KS/2)*(1 + IW) pixels have been read or, in other words, until the first valid pixel is positioned in the center gray region coordinate of Fig. 4.
Figure 4. Layout of the convolution module shift register. KS (Kernel Size) denotes the size, in one dimension, of the used kernel; that means, for a 3x3 kernel, KS = 3. IW (Image Width) denotes the width of the input image; that means, for a 640x480 image, IW = 640. The grey area indicates the pixels used in the convolution calculation.
Next, the present state is modified to DATA_PROCESSING_STATE_1. The first moment of this state is depicted in Fig. 5.
From this state until the end of the state machine (i.e., state DATA_END_PROCESSING_STATE), the convolution sum (Eq. 3) will be applied to the gray area and, at every clock cycle, an output pixel will be available at the output interface (i.e., Avalon Streaming Source Interface). It is important to mention that this calculation is performed by a parallel combinational circuit submodule (Convolution Operation, Fig. 6).

The present state will change to DATA_PROCESSING_STATE_2 when all input pixels are read or, more specifically, when the number of pixels read equals the number of pixels of the input image.
This state is similar to DATA_PROCESSING_STATE_1, with the exception that, as there are no pixels left to be read, the value zero will be forced into the shift register. By doing this, and similarly to the first input pixels of the DATA_PROCESSING_STATE_1 state, a border treatment is performed, since the border pixels (i.e., those without a complete neighborhood) will have their values convoluted with neighbor pixels or with values 0. The last moment of this state is similar to the one depicted in Fig. 5, with the exception that the values considered are the ones located near the end of the image (i.e., bottom-right side).
For the convolution application, it is possible to observe that the execution time graph (Fig. 7) for all three architectures behaves as an exponential (note the log scale on the y-axis) and tends to maintain the same growth rate as the image resolution increases. The exception to this is the Matlab implementation for image resolutions 3300x2400 and 4096x4096, possibly related to cache performance.
Lastly, after the calculation of the last pixel, the present state is modified to the DATA_END_PROCESSING_STATE state. This last state is responsible for resetting the shift register and the counters to their default values (i.e., zero in this case). Next, the state machine returns to its initial state, DATA_FILL_BUFFER_STATE.
It is important to highlight that only kernels up to 5x5 and images up to 640x480 are supported in the FPGA convolution module. This is due to the DE2 board memory and logic elements limitation. Therefore, results involving different kernel and image sizes were estimated considering the module architecture itself.
The number of clock cycles was obtained disregarding the loading period (i.e., state DATA_FILL_BUFFER_STATE). After this state, until the end of the state machine, at every clock cycle one output pixel is available at the output module interface.
V. RESULTS AND COMPARISON

The graphs for the convolution operation comparing the execution time for various grayscale image resolutions with a mask size of 15, as well as the number of clock cycles and the CUDA speedup over Matlab, C and the FPGA architecture, are presented in figures 7, 8 and 9.
Figure 7. Average execution times of the convolution with mask size of 15 and various image resolutions.
The speedup graph (Fig. 9) shows an approximately steady speedup of CUDA in regard to the other implementations across the various image resolutions. Again, an exception to this is the Matlab implementation for the two last image samples. The reason for this good CUDA speedup is the large amount of arithmetic operations, the high granularity and the high resource utilization of the GPU in comparison to C, Matlab and FPGA.
The C implementation did not exploit parallelism and served as a control implementation for comparison to the other, parallel algorithms. Because of that, its execution time and number of clock cycles performed worse than the other implementations.
The FPGA implementation explored some parallelism, as with the parallel convolution calculation, but lacked a true parallelism approach because of the FPGA and development board (DE2) resource limitations. Therefore, due to the parallel calculation, its execution times were only worse than those of a true parallel implementation (i.e., CUDA). It is interesting to notice that the FPGA number of clock cycles (Fig. 8) was small, even less than CUDA for an image size of 512x512. Besides, the clock rate used in the FPGA was relatively low (i.e., 100 MHz in this design) compared to CUDA (i.e., 1242 MHz for the processor clock).
Another limitation of the used FPGA was the absence of multiplier blocks, which would improve significantly the performance of the convolution operation. Hence, using a larger FPGA with more resources and a faster clock should increase the performance of the FPGA and possibly overcome the CUDA implementation.
Regarding the FPGA architecture, it can be seen from the graphs that it performed well, although worse than CUDA, and kept a steady growth in execution time and number of clock cycles. It must be noticed that there are denser FPGAs available that can operate at higher clock rates, which would certainly increase the performance, possibly even surpassing the GPU.
Finally, it is possible to improve the performance of the FPGA algorithms even more. Dividing the input image region into various squares and providing that each region be transmitted to parallel convolution modules through different data streams can improve the performance roughly by the number of convolution modules. However, by doing this, more FPGA die area is consumed, which could possibly make it impractical.
ACKNOWLEDGMENTS
Figure 8. Number of clock cycles of the convolution with mask size of 15 and various image resolutions.
The authors are grateful to FAPESP, grants number 2010/04675-4 and 2009/17736-4, to the Department of Computer Science, Federal University of Sao Carlos, and to the Department of Electrical Engineering, Federal University of Rio Grande do Norte, for the support throughout this work.
REFERENCES
Figure 9. Speedup of the convolution with mask size of 15 and various image resolutions.
VI. CONCLUSIONS

In this paper, we presented a comparison between CUDA, C, Matlab and FPGA for the convolution of grayscale images. Based on the results presented, it is inferable that CUDA presents the best performance in execution time, number of clock cycles and speedup in comparison to C, Matlab and the implemented FPGA architecture, and that its advantage increases with the growth in image resolution. That is due to the fact that CUDA tends to better exploit massive amounts of data, such as high resolution images, based on its inherent features such as multiple pipelines, a high theoretical peak of GFLOPS and high bandwidth [14].
[1] Nvidia Corporation. (2009). Nvidia CUDA Programming Guide. [Online]. Available: http://developer.nvidia.com/object/cuda_2_3_downloads.html
[2] Xilinx Co. (2010). Our History. [Online]. Available: www.xilinx.com/company/history
[3] S. Asano, T. Maruyama and Y. Yamaguchi, "Performance Comparison of FPGA, GPU and CPU in Image Processing," in International Conference on Field Programmable Logic and Applications - FPL 2009, Prague, 2009, pp. 126-131.
[4] S. Che, J. Li, J. W. Sheaffer, K. Skadron and J. Lach, "Accelerating Compute-Intensive Applications with GPUs and FPGAs," in Symposium on Application Specific Processors - SASP 2008, Anaheim, 2008, pp. 101-107.
[5] S. Kestur, J. D. Davis and O. Williams, "BLAS Comparison on FPGA, CPU and GPU," in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Lixouri, Kefalonia, 2010, pp. 288-293.
[6] S. J. Park, D. R. Shires and B. J. Henz, "Coprocessor Computing with FPGA and GPU," in DoD HPCMP Users Group Conference 2008, Seattle, 2008, pp. 366-370.
[7] R. Weber, A. Gothandaraman, R. J. Hinde and G. D. Peterson, "Comparing Hardware Accelerators in Scientific Applications: A Case Study," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 58-68, 2011.
[8] R. C. Gonzales and R. E. Woods, "Image Enhancement in the Spatial Domain," in Digital Image Processing, 3rd ed. Prentice Hall, 2008.
[9] V. Podlozhnyuk. (2007, Jun.). Image Convolution with CUDA. [Online]. Available: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_64_website/projects/convolutionSeparable/doc/convolutionSeparable.pdf
[10] Altera Co. NIOS II Processor. [Online]. Available: www.altera.com/devices/processor/nios2/ni2-index.html
[11] Independent JPEG Group. libjpeg. [Online]. Available: www.ijg.org
[12] Altera Co. Nios II System Architect Design. [Online]. Available: www.altera.com/support/examples/nios2/exm-system-architect.html
[13] Altera Co. (2011). Avalon Streaming Interface, Chap. 5. [Online]. Available: www.altera.com/literature/manual/mnl_avalon_spec.pdf
[14] D. B. Kirk and W. W. Hwu, "Introduction," in Programming Massively Parallel Processors: A Hands-on Approach, 1st ed. Morgan Kaufmann, 2010.