Abstract—Heterogeneous system architectures are becoming more and more of a commodity in the scientific community. While it remains challenging to fully exploit such architectures, the benefits in performance and hybrid speed-up, by using a host processor and accelerators in parallel in a non-monolithic manner, are significant. Hereby, the energy efficiency is becoming an increasingly critical challenge for future high-performance computing (HPC) systems, which want to exceed the Exascale barrier with several competing architecture concepts, ranging from high-performance CPUs combined with GPUs acting as floating-point accelerators, to computationally weak CPUs paired with dedicated and highly-performant FPGA-based accelerators. In this paper, we realize and evaluate a hybrid computing approach based on a two-dimensional seismic streaming algorithm with several heterogeneous system architectures, including conventional HPC approaches based on powerful CPUs and GPUs. Furthermore, we elaborate the effort on an embedded system platform claiming to be a mini supercomputer [1]. Several CPU and accelerator combinations are utilized in a manual work-sharing manner with the aim of achieving significant performance speed-ups and a detailed energy-efficiency study. Based on roofline models and experimental evaluations, the paper provides an insight into the fact that hybrid computing is mostly unconditionally beneficial for balanced systems regarding performance as well as energy efficiency, aiding the programmer in the decision whether or not costly, manually tuned, homogeneous implementations are worthwhile.

Keywords—narrow stencil computing, (embedded) hybrid work-sharing, multi-GPU processing, performance evaluation

I. INTRODUCTION

Accelerators in heterogeneous systems are mostly used to offload entire to-be-accelerated computations in a monolithic manner, meaning that the CPU waits for the accelerator to finish its calculation [2]. Efforts previously spent on hand-tuning algorithms to exploit the specific properties of that CPU would thereby be wasted. Several techniques developed by AMD (see HSA [3]) and Nvidia (see OpenACC [4]) address the serial-versus-parallel distribution on a heterogeneous system, but are missing a simultaneous parallel heterogeneous execution of parallel code on both host processor and accelerator.

Some attempts at exploiting shared computation, leveraging the benefits of hand-tuned CPU code and simultaneously running accelerators, have been evaluated with promising results [5], [6]. OpenMP seems to take the direction towards a parallel heterogeneous computation (see the proposed OpenMP clause hetero() [7]), which likewise shows significant performance improvements on some benchmarks [8]. Hence, this paper targets a hybrid approach that is analyzed with regard to speed-up and energy efficiency. Both are crucial aspects of future HPC systems. For this, a seismic streaming algorithm was chosen, which provides required features of typical and future HPC applications, and was therefore subsequently implemented and evaluated. In order to cover a fair amount of heterogeneous architectures, not only typical HPC systems were chosen but also a dedicated embedded system. This was done with respect to recent developments showing the potential benefits of clustered low-performance CPUs with dedicated HPC accelerators. As a result of these measurements, the paper presents an overview of suitable aspects of a hybrid implementation framework, addressing the question whether or not costly manual optimizations are beneficial for the given instruction-set architectures (ISAs).

The paper is structured as follows: Section II reviews the actual situation of the HPC market and future trends towards Exascale. Section III provides an overview of the chosen seismic algorithm, while Section IV explains the implemented application, providing various hybrid execution models. The evaluated heterogeneous systems are introduced in Section V, which also includes a homogeneous performance projection based on roofline models. Experimental results are analyzed in Section VI. The paper closes with Section VII, which contains concluding remarks and an outlook to future work.

II. STATE OF THE ART

The HPC market is driven mainly by three major challenges: energy efficiency, computing power, and reliability. Some of them can only be addressed by a trade-off during hardware design, while others can also be addressed plainly by software. Since 2007, HPC systems have been ranked by energy efficiency, showing that heterogeneous systems improve this efficiency by a wide margin [9]. In such heterogeneous systems, mostly GPUs distributed by the vendor Nvidia and, alternatively, the Many-Integrated-Core architecture offered by Intel (e.g. the no. 1 system TSUBAME-KFC [10]) are utilized as accelerators. Based on such a heterogeneous approach, the major direction nowadays seems to be the use of a strong CPU, interconnected to an accelerator via a high-performance bus (with one notable exception being the IBM BlueGene/Q series).

Although the Top500-listed systems are driven by performant architectures like x86-64, POWER and SPARC, scientists are investigating the energy-efficient but computationally weak ARM architecture as a potential CPU for future Exascale systems [11]. Additionally, two further major trends can be identified, which drive heterogeneous computing towards the actual Exascale target:
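The manual work-sharing evaluated in this paper amounts to statically partitioning the stencil grid between host and accelerator before each run. A minimal sketch of such a static split, assuming rows are divided in proportion to the measured throughput of each device; the function and parameter names are illustrative, not taken from the paper's implementation:

```c
/* Split `rows` grid rows between host and accelerator in proportion to
 * their measured sustained throughputs (e.g. GFlops). Returns the number
 * of rows assigned to the host; the accelerator processes the remainder.
 * This is only a sketch of the idea -- the paper tunes the partitioning
 * manually per system. */
static int host_share(int rows, double host_gflops, double acc_gflops)
{
    if (host_gflops + acc_gflops <= 0.0)
        return 0;                               /* nothing to base a split on */
    double frac = host_gflops / (host_gflops + acc_gflops);
    int host_rows = (int)(rows * frac + 0.5);   /* round to nearest row */
    if (host_rows < 0)    host_rows = 0;
    if (host_rows > rows) host_rows = rows;     /* clamp to valid range */
    return host_rows;
}
```

For a 748-row grid and a GPU three times as fast as the CPU, this assigns 187 rows to the host and 561 to the accelerator; a real implementation would additionally exchange halo rows at the partition boundary each time step.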
TABLE I. CONFIGURATION OF THE MAIN PROCESSORS.

Processor                      Frequency  Cores (Threads)  Compiler   SDK                   Global Memory
Intel Core i7-2600k            3.40 GHz   4 (8)            GCC 4.6.2  Intel OpenCL SDK 1.5  8 GB
Intel Xeon E5405               2.00 GHz   4 (4)            GCC 4.3.2  Intel OpenCL SDK 1.5  8 GB
Intel Core i5-660              3.33 GHz   2 (4)            GCC 4.6.2  Intel OpenCL SDK 1.5  8 GB
Xilinx Zynq XC7Z020-1CLG400C   677 MHz    2 (2)            LLVM 3.4   -                     1 GB

TABLE II. CONFIGURATION OF THE ACCELERATORS.

Accelerator                       Frequency  SDK                                     Interconnect                                    Accelerator Memory  Global Memory
Nvidia GeForce GTX560Ti (GF114)   822 MHz    Nvidia OpenCL 295.20                    PCIe 2.0 x16                                    1 GB                8 GB
Nvidia Tesla C1060                602 MHz    Nvidia OpenCL 285.05.09                 Both via PCIe 2.0 x16 and Host Interface Card   4 GB                8 GB
Nvidia GeForce 8600GTS            675 MHz    Nvidia OpenCL 290.10                    Both via PCIe 2.0 x16 and second over DMI       256 MB              8 GB
Adapteva Epiphany E16G301         600 MHz    Epiphany Libs 2014-06-25 (E-GCC 4.8.2)  Custom AXI BUS to eLink glue-logic              512 KB              1 GB

TABLE III. [caption lost in extraction; column headers deduced from the values]

Processor                              Peak Performance single/double [GFlops]  Memory Bandwidth [GB/s]  TDP [W]  [GFlops/W]
CPU: Intel Core i7-2600k               217.6 / 108.8                            21.2                     95       2.3
GPU: Nvidia GeForce GTX560Ti (GF114)   1263.4 / 105.3                           128.3                    170      7.4
CPU: Intel Xeon E5405                  149.1 / 74.6                             25.6                     80       1.8
GPU: Nvidia Tesla C1060                933.0 / 78.0                             102                      187.8    5.0

... interconnected to GPUs featuring OpenCL support) has been developed which contains all of the CPU implementations (default, aligned, unaligned, aligned not grouped); threaded SSE ...

Fig. 4. Roofline Models: Intel Core i7-2600k & Nvidia GeForce GTX560Ti (GF114)
[Plot residue removed: attainable performance [GFlops] over operational intensity [FLOPs/byte], with peak single/double precision and peak stream bandwidth ceilings and the stencil's operational intensity marked; further roofline panels (among them Xeon E5405 & Tesla C1060) were likewise reduced to axis residue.]
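The homogeneous performance projections of the roofline panels follow the model of [22]: attainable performance is bounded by the minimum of the peak floating-point rate and peak memory bandwidth times the kernel's operational intensity. A sketch of that bound, fed with the Table III values; the stencil operational intensity of 0.5 FLOPs/byte used below is an assumed illustrative number, not a figure from the paper:

```c
/* Roofline bound [22]: attainable GFlops is the minimum of the peak
 * floating-point rate and the memory-bound rate, i.e. peak bandwidth
 * [GB/s] times operational intensity [FLOPs/byte]. */
static double roofline(double peak_gflops, double bw_gbs, double oi)
{
    double mem_bound = bw_gbs * oi;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}
```

At a low (assumed) intensity of 0.5 FLOPs/byte, the i7-2600k is capped at roughly 10.6 GFlops by its 21.2 GB/s of bandwidth and the GTX560Ti at roughly 64 GFlops, far below both devices' compute peaks, which is consistent with the paper's observation that the streaming stencil is bandwidth-dominated.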
Fig. 8. Evaluation of the various implementations / combinations: Intel Core i7-2600k & Nvidia GeForce GTX560Ti (GF114)
Fig. 9. Evaluation of the various implementations / combinations: Intel Xeon E5405 & 2x Nvidia Tesla C1060
Fig. 10. Evaluation of the various implementations / combinations: Intel Core i5-660 & 2x Nvidia GeForce 8600GTS
[Bar-chart residue removed for Figs. 8 to 10: speed-up of the implementation variants (Pthr. plain C; OpenCL CPU; Pthr. SSE (aligned); their hybrid combinations with OpenCL GPU (std) and OpenCL GPU (img); OpenCL GPU alone) at the resolutions 2200x748px, 4400x1496px, 6600x2244px and 8800x2992px.]
Fig. 11. Performance and energy efficiency results: Intel Core i7-2600k & Nvidia GeForce GTX560Ti (GF114)
Fig. 13. Performance and energy efficiency results: Intel Core i5-660 & 2x Nvidia GeForce 8600GTS
[Bar-chart residue removed for Figs. 11 and 13: speed-up and energy [nJ/stencil] for the CPU, CPU & ACC, and ACC configurations.]

... (Figs. 4 to 7). It is recognizable that all of the Intel CPUs have nearly the same expected attainable GFlops, which is based on the almost equal peak memory bandwidth (table III and Figs. 11 to 13). However, the attached GPUs provide significantly more attainable Flops and, due to their inequality, they will be the decisive criteria for the hybrid approach. The system equipped with the Nvidia 8600GTS seems to be more balanced, because the GPU does not even achieve twice the amount of GFlops compared to its CPU (Fig. 6). The ARM system is rather unusual, because the accelerator is substantially weaker if compared to its CPU or to the GPUs. In addition, the Epiphany accelerator shares the CPU's memory (Fig. 7).

Roofline models are a growing instrument for performing rough, quick-and-simple estimations of a specific ISA's performance regarding various algorithms. Still, they lack important criteria such as energy efficiency and reliability, which are significantly growing in their importance.

VI. EXPERIMENTAL RESULTS

In order to have a common baseline, the generated speed-ups were measured against the plain C implementation. Depending on the available memory size, various seismic image resolutions (2200x748, 4400x1496, 6600x2244 and 8800x2992; time steps: 3000) were analyzed. Based on the homogeneous variants, several resulting hybrid combinations were evaluated, while three of them are illustrated (Figs. 9 and 10). These show that the homogeneous SSE-tuned algorithm runs faster than the OpenCL version. This indicates certain overheads, such as scheduling, which deserve additional introspection. It turns out, however, that if both mentioned implementations are used in a hybrid computation, the smart techniques implemented within OpenCL now provide a significant speed-up, which results from a more efficient dynamic utilization of the CPU in contrast to the static SSE-tuned code. Focusing just on the GPUs, it is beneficial to use the image-processing primitives to achieve the highest speed-up. For this paper, the implementation and partitioning with the highest speed-up was used to determine the maximum hybrid speed-up. Because the evaluated algorithm falls into the streaming category, which does not profit highly from caches (just from the prefetching mechanism), the average speed-up of the various image resolutions and the energy efficiencies have been used for further analysis (Figs. 11 to 14).

The GPU in the first system comes with a significant speed-up of about 26.96 (Fig. 11). Using the hybrid variant slightly decreases the speed-up as well as the energy efficiency, which is a result of the hereby-inefficient CPU and its little speed-up compared to the GPU. For stronger GPUs as seen here, it seems to be of significance to use the CPU only to shuffle data back and forth to keep the GPU busy, instead of partitioning the to-be-computed data in a hybrid fashion.

The second evaluated system (Fig. 12) comes with two GPUs, both of which create a major speed-up. Contrary to the expectation of a doubled performance using two GPUs, we see a drop in further speed-up as a result of synchronization overhead that needs to be handled by the CPU. The latter seems to be too slow for synchronizing both GPUs, but for reasons of rather small data exchange, the PCIe bus cannot be the limiting factor. This results in no further speed-up if the CPU is also used in a hybrid computation. It was determined that one thread on the CPU needs to take responsibility for one GPU. Also, the energy efficiency is significantly better when using the GPUs instead of the CPU. With two GPUs, synchronization and data exchange overhead leads to slightly larger energy consumption per stencil and less energy efficiency. Still, the resulting efficiency is significantly superior to any computation including the CPU.

As shown in Section V-A, the third system is well balanced due to the Flops, resulting in a speed-up which is almost the same on the CPU as well as on a single GPU. It seems that
the CPU is fast enough to handle synchronization and data exchange, leading to a doubled speed-up if both GPUs are used and an even further improved speed-up if hybrid work-sharing is utilized⁴ (Fig. 12). However, energy efficiency drops when using the hybrid work-sharing approach (Fig. 13), but utilizing only the two GPUs does not decrease it noticeably. Some could argue that the CPU needs to be included into the calculation of the GPUs' energy efficiency. But in fact, the GPUs only require some weak embedded engine, which can restart the GPUs for each time step and realize both eventual synchronization and data exchange fast enough.

Fig. 14. Performance and energy efficiency results: Xilinx Zynq XC7Z020-1CLG400C & Adapteva Epiphany E16G301
[Bar-chart residue removed: speed-up and energy [nJ/stencil] for the CPU, CPU & ACC, and ACC configurations.]

Because aggressive compiler optimizations were allowed throughout the implementations, the speed-up of the NEON code on the fourth system is not as significant as expected (Fig. 14). Still, the speed-up on the Xilinx Zynq could be improved by changing the memory layout accesses to more suitable ones. Indicating a high energy efficiency of 133 nJ/stencil, the Zynq is quite as energy-efficient as an Intel i5-660, while only the Intel i7-2600k provides a clearly better result. Because of the limited bandwidth of the shared memory bus, the accelerator does not provide any speed-up. The same applies especially for the hybrid case. An upper-bound speed-up on the Epiphany can be measured if data fetching is not considered. If all 16 cores are used, it takes 15861 cycles to compute a tiny 128x128 image⁵. This results in a theoretical speed-up of 43.7 compared to the naive CPU implementation⁶.

VII. CONCLUSION

Numerous algorithms have been extensively hand-tuned to fit to ISAs. Keeping the cost and time of such optimizations in mind, the resulting code should be exploited in heterogeneous systems. The hybrid approach is thereby a good fit to achieve a tremendous performance boost while utilizing both the already present CPU and accelerators, which are becoming more and more of a commodity these days. This is particularly the case for well-balanced systems in terms of Flops, memory, and bus bandwidth. On systems with a stronger imbalance in the FLOP ratio between CPU and accelerator, it can be seen that the hybrid approach does not lead to a speed-up due to the communication and synchronization. In such cases, utilization of multiple accelerators can achieve further speed-up. During the evaluation, it was recognizable that the CPU requires at least one thread per accelerator to keep the accelerators permanently active, which decreases the amount of computation that can possibly be performed by the CPU. On the targeted embedded system, the limiting factor was the bus bandwidth between CPU and accelerator.

As a next step and based on this study, the authors' plan is to evaluate the new OpenMP clause hetero, which can potentially achieve hybrid computation in a similar manner as presented in this paper. Subsequent research shall be focused especially on dynamic run-time partitioning (adaptive load balancing) to address recent GPUs with boost support.

⁴ This might be attributed to the different ages/generations of the used CPUs and GPUs (i5-660: release year 2010; 8600GTS: 2007), so that the CPU can easily keep up with synchronization. In comparison, the second system features a reverse set-up (E5405: 2007; C1060: 2009), resulting in insufficient speed-up.
⁵ Using 3 out of 4 SRAM banks; double buffering; 1 time step.
⁶ Image: 2200x748; time steps: 3000; time spent: 348010.967 ms.

REFERENCES

[1] Primeur Magazine, "A live report from the Adapteva A-1 smallest supercomputer in the world launch at ISC'14," www.primeurmagazine.com/weekly/AE-PR-07-14-104.html, June 2014, ISC 2014 Session: HPC Startups: Innovation Brought to Life.
[2] T. Grosser, A. Gremm, S. Veith, G. Heim, W. Rosenstiel, V. Medeiros, and M. Eusebio de Lima, "Exploiting heterogeneous computing platforms by cataloging best solutions for resource intensive seismic applications," in INTENSIVE 2011, The Third International Conference on Resource Intensive Applications and Services, 2011, pp. 30–36.
[3] G. Kyriazis, "HSA: A Technical Review," 1st ed., AMD, August 2012.
[4] "The OpenACC API," 2nd ed., OpenACC, August 2013.
[5] S. Ohshima, K. Kise, T. Katagiri, and T. Yuba, "Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment," in 7th International Meeting on High Performance Computing for Computational Science (VECPAR'06), 2006, pp. 41–50.
[6] T. Odajima, T. Boku, T. Hanawa, J. Lee, and M. Sato, "GPU/CPU work sharing with parallel language XcalableMP-dev for parallelized accelerated computing," in ICPP Workshops '12, 2012, pp. 97–106.
[7] T. R. W. Scogland, W.-c. Feng, B. Rountree, and B. R. de Supinski, "CoreTSAR: Adaptive Worksharing for Heterogeneous Systems," in ISC'14, Leipzig, Germany, June 2014.
[8] T. R. W. Scogland, B. Rountree et al., "Heterogeneous task scheduling for accelerated OpenMP," in IPDPS'12, May 2012, pp. 144–155.
[9] W.-c. Feng and K. Cameron, "The Green500 list: Encouraging sustainable supercomputing," Computer, vol. 40, no. 12, pp. 50–55, Dec. 2007.
[10] Green500, "The Green500 List - June 2014," http://green500.org/lists/green201406, 2014.
[11] N. Puzovic, "Mont-Blanc: Towards energy-efficient HPC systems," in Conf. Computing Frontiers, ser. CF '12. ACM, 2012, pp. 307–308.
[12] P. Dlugosch, D. Brown et al., "An efficient and scalable semiconductor architecture for parallel automata processing," IEEE Transactions on Parallel and Distributed Systems, vol. 99, no. PrePrints, p. 1, 2014.
[13] A. Putnam, A. Caulfield et al., "A reconfigurable fabric for accelerating large-scale datacenter services," in ISCA'14, June 2014.
[14] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present, and future," Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820, 2003.
[15] A. Gara, "The long term impact of codesign," in SC Companion '12, 2012, pp. 2212–2246.
[16] O. Yilmaz, Seismic Data Analysis, 2nd ed., ser. Investigations in Geophysics. Society of Exploration Geophysicists, Jan. 2001, vol. 10.
[17] B. Biondi and G. Shan, "Prestack imaging of overturned reflections by reverse time migration," in Expanded Abstracts, Soc. of Expl. Geophys., 72nd Ann. Internat. Mtg., 2002, pp. 1284–1287.
[18] R. Clapp, H. Fu, and O. Lindtjorn, "Selecting the right hardware for reverse time migration," The Leading Edge, vol. 29, no. 1, 2010.
[19] M. Perrone, "Finding oil with cells: Seismic imaging using a cluster of Cell processors," in Second SHARCNET Symposium on GPU and Cell Computing, May 2009.
[20] A. Gremm, "Acceleration, clustering, and performance evaluation of seismic applications," Master's thesis, Eberhard-Karls-Universität Tübingen, June 2011.
[21] P. Siegl, "Hybrid acceleration of a seismic application by combining traditional methods with OpenCL," Master's thesis, Eberhard-Karls-Universität Tübingen, April 2012.
[22] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
[23] G. Ofenbeck, R. Steinmann, V. C. Cabezas, D. G. Spampinato, and M. Püschel, "Applying the roofline model," in ISPASS'14, 2014.
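Footnotes 5 and 6 contain everything needed to re-derive the quoted theoretical Epiphany speed-up of 43.7; the following cross-check of that arithmetic uses only the values stated there (15861 cycles at 600 MHz for one 128x128 time step, versus 348010.967 ms for 3000 time steps of a 2200x748 image on the naive CPU code):

```c
/* Epiphany upper bound (footnote 5): 15861 cycles at 600 MHz compute
 * one time step of a 128x128 image, i.e. 128*128 stencil updates. */
static double epiphany_ns_per_stencil(void)
{
    return 15861.0 / 600e6 / (128.0 * 128.0) * 1e9;   /* ~1.61 ns/stencil */
}

/* Naive CPU baseline (footnote 6): 2200x748 image, 3000 time steps,
 * 348010.967 ms total run time. */
static double cpu_ns_per_stencil(void)
{
    return 348010.967e-3 / (2200.0 * 748.0 * 3000.0) * 1e9;  /* ~70.5 ns */
}
```

Dividing the two per-stencil times gives a ratio of about 43.7, reproducing the theoretical speed-up stated in Section VI.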