
Comparing C++ Compilers' Parallel-Programming Performance

by Jeff Cogswell | December 19, 2013

The Intel and g++ compilers give you plenty of options for generating vectorized code. But how well does each one actually perform?

Many of today's C++ compilers can be used for parallel programming. There are two main ways to program in parallel: multicore programming and vectorized programming.

Multicore programming is just what it sounds like: your code can run simultaneously on multiple cores. There are different ways to make this happen, whereby you specify directives or use extensions to the language so that a loop can, for example, run its iterations simultaneously on as many cores as it can obtain from the processor.

Vectorized programming takes place on a single core. The processor core has multiple registers, which are storage areas built into the processor. You can store numbers in the registers and perform mathematical operations on them. But these registers are much larger than the numbers you're storing. For example, a single-precision floating-point number is typically 32 bits. (The size can vary, but there is an IEEE standard that most compiler vendors try to adhere to, which states that single-precision numbers get 32 bits.) The register, however, depending on the processor, might be 256 bits; that means you can store eight separate single-precision numbers next to each other within the single register. Then, using a single assembler instruction, you can perform a mathematical operation on all eight numbers together. This is known as SIMD, or Single Instruction, Multiple Data. For example, you might put the numbers 10, 20, 30, and 40 inside a register, and your operation might be to double each number. With a single operation, then, you would double all four numbers to get 20, 40, 60, and 80.

To accomplish vectorized programming, you carefully code your algorithms so that the code inside the loops can be executed as vectorized statements. That means sticking primarily to mathematical operations that the processor supports at the vectorization level. Then you can choose to let the compiler auto-vectorize whichever loops it sees fit. Or you can specify exactly which loops you want vectorized, although the compiler will skip vectorizing any such loops when it just isn't possible.
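To make that doubling example concrete, here is a minimal sketch of my own (not from the tests below) that performs the same operation explicitly with SSE intrinsics: four single-precision numbers are packed into one 128-bit register and doubled with a single SIMD multiply. The auto-vectorizer effectively generates this kind of code for you.

#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_* functions
#include <cstdio>

int main() {
    // Pack four single-precision numbers into one 128-bit register.
    // (_mm_set_ps takes its arguments in reverse order.)
    __m128 values = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

    // One SIMD multiply doubles all four numbers at once.
    __m128 doubled = _mm_mul_ps(values, _mm_set1_ps(2.0f));

    // Copy the result back to ordinary memory so we can print it.
    float out[4];
    _mm_storeu_ps(out, doubled);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 20 40 60 80
    return 0;
}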

Also, you can, of course, use both multicore and vectorized programming at the same time. The compiler can take your C++ loops and create vectorized assembly code for you. But just how good is the vectorized code? And do you have much control over the generated code? First, it's obviously important that you RTFM and fully understand the compiler options (especially since the defaults may not be what you want or think you're getting). But even then, do you trust that the compiler is generating the best code for you? That's what I'm going to find out here. I'm going to look at the actual generated assembly code and share it with you so we can see exactly what we're getting.
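As a rough sketch of what combining the two approaches can look like (my own illustration; the tests below stick to single-core auto-vectorization), OpenMP can spread an outer loop across cores while the inner loop is left to the compiler's auto-vectorizer. ROWS and COLS are made-up sizes for illustration. Both compilers of this era accept this, with -fopenmp on g++ and -openmp on the Intel compiler, plus -O3.

#define ROWS 1024   /* illustration values, not from the tests below */
#define COLS 4096

void add_rows(double *a, const double *b) {
    // Multicore: OpenMP splits the outer (row) loop across threads.
    #pragma omp parallel for
    for (long j = 0; j < ROWS; j++) {
        // Vectorization: the inner loop is a plain streaming add that the
        // compiler's auto-vectorizer can turn into SIMD instructions.
        for (long i = 0; i < COLS; i++) {
            a[j * COLS + i] += b[j * COLS + i];
        }
    }
}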

Prior Tests From Other People


We don't want to re-invent the wheel here. As such, I decided to start where some of the earlier tests have left off. There's an excellent test from Lockless Inc. (the Lockless article referenced throughout this piece) of how g++ can create vectorized code, when it doesn't, and how to fuss with it to get it to do so. In that article, the author digs into the generated assembly code to see whether vectorization is occurring. I would say one important take-home message from the piece is that, even if you think you're going to get vectorized code, you still might not. There are two things you can do: First, the compilers can actually tell you if a loop is vectorized (it's an option, and you'll want to turn it on). Second, you still might want to look at the assembly code, not only to make sure you're getting vectorized code, but also that the code you're getting is actually better than the serial version. So for our tests, I'm going to take several of the author's functions and try them out on the Intel compiler, then compare the resulting code, as well as what we have to do to get vectorized code.
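As a quick preview of those reporting options (both commands appear again later in this article, with the compiler versions I tested):

icc -S -vec-report2 test1.cpp
g++ -S -O3 -ftree-vectorizer-verbose=1 gcc5.cpp

The first asks the Intel compiler to report which loops it vectorized; the second does the same for g++'s tree vectorizer.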

Initial Tests on Intel Compiler


For the first test, I went with the first test presented in the Lockless article, compiled with the Intel compiler:

#include <stdlib.h>
#include <math.h>

#define SIZE (1L << 16)

void test1(double *a, double *b)
{
    int i;

    for (i = 0; i < SIZE; i++) {
        a[i] += b[i];
    }
}

int main()
{
    double *a = new double[SIZE];
    double *b = new double[SIZE];

    test1(a, b);
}

After turning on the diagnostics to actually show me what's vectorized,

icc -S -vec-report2 test1.cpp

I can see that the loop was indeed vectorized with this message:

LOOP WAS VECTORIZED

(The -S tells the compiler to generate a file containing the assembly code.) But how good is the vectorization? Thanks to that -S option, we can look at the resulting assembler code. For space reasons, I won't list the whole thing here, just the code inside the loop:

movsd   (%rsi,%rdx,8), %xmm0
movsd   16(%rsi,%rdx,8), %xmm1
movsd   32(%rsi,%rdx,8), %xmm2
movsd   48(%rsi,%rdx,8), %xmm3
movhpd  8(%rsi,%rdx,8), %xmm0
movhpd  24(%rsi,%rdx,8), %xmm1
movhpd  40(%rsi,%rdx,8), %xmm2
movhpd  56(%rsi,%rdx,8), %xmm3
addpd   (%rdi,%rdx,8), %xmm0
addpd   16(%rdi,%rdx,8), %xmm1
addpd   32(%rdi,%rdx,8), %xmm2
addpd   48(%rdi,%rdx,8), %xmm3
movaps  %xmm0, (%rdi,%rdx,8)
movaps  %xmm1, 16(%rdi,%rdx,8)
movaps  %xmm2, 32(%rdi,%rdx,8)
movaps  %xmm3, 48(%rdi,%rdx,8)

We're working with double-precision numbers here, which are 64 bits. But how big are our registers? Well, we didn't specify a SIMD architecture. By default, we're getting size 128. The first eight lines of this code are juggling our numbers around, and then there are four actual addition operations. Each addition is performed with the ADDPD opcode. But there are four of these additions, one each for registers xmm0, xmm1, xmm2, and xmm3. That's because something separate from vectorization is also happening here, something called loop unrolling. I'll discuss that shortly; it's sufficient for now to say it's an optimization that is independent of the vectorization.

Let's see if we can tweak the compiler a bit. I want to target two different generations of SIMD: SSE4.2 and the newer AVX. We can do so using the -x option. Without changing the C++ code, and adding the option -xSSE4.2, we end up with this assembler code:

movups  (%rsi,%rdx,8), %xmm0
movups  16(%rsi,%rdx,8), %xmm1
movups  32(%rsi,%rdx,8), %xmm2
movups  48(%rsi,%rdx,8), %xmm3
addpd   (%rdi,%rdx,8), %xmm0
addpd   16(%rdi,%rdx,8), %xmm1
addpd   32(%rdi,%rdx,8), %xmm2
addpd   48(%rdi,%rdx,8), %xmm3
movups  %xmm0, (%rdi,%rdx,8)
movups  %xmm1, 16(%rdi,%rdx,8)
movups  %xmm2, 32(%rdi,%rdx,8)
movups  %xmm3, 48(%rdi,%rdx,8)

And with the option -xAVX we get this:

vmovupd     (%rsi,%rax,8), %xmm0
vmovupd     32(%rsi,%rax,8), %xmm3
vmovupd     64(%rsi,%rax,8), %xmm6
vmovupd     96(%rsi,%rax,8), %xmm9
vinsertf128 $1, 48(%rsi,%rax,8), %ymm3, %ymm4
vinsertf128 $1, 16(%rsi,%rax,8), %ymm0, %ymm1
vinsertf128 $1, 80(%rsi,%rax,8), %ymm6, %ymm7
vinsertf128 $1, 112(%rsi,%rax,8), %ymm9, %ymm10
vaddpd      (%rdi,%rax,8), %ymm1, %ymm2
vaddpd      32(%rdi,%rax,8), %ymm4, %ymm5
vaddpd      64(%rdi,%rax,8), %ymm7, %ymm8
vaddpd      96(%rdi,%rax,8), %ymm10, %ymm11
vmovupd     %ymm2, (%rdi,%rax,8)
vmovupd     %ymm5, 32(%rdi,%rax,8)
vmovupd     %ymm8, 64(%rdi,%rax,8)
vmovupd     %ymm11, 96(%rdi,%rax,8)

Without spending too much time explaining all this, the AVX version operates with twice the register size. The opcodes start with a v to signify AVX. But in both cases, we're dealing with unaligned data. We see that with the letter u in the move operations; we can get better performance by aligning the data. The author of the Lockless article found the same to be true with g++. In fact, we can use the same restrict keyword, provided we add the -restrict compiler option. Also, we can turn up our diagnostics report to let us know whether our data is aligned. By using -vec-report6, we can see status reports such as:

reference a has aligned access

And if we forget to align one of our variables, we'll see in the report:

reference b has unaligned access

To align the data for Intel, we can use code similar to what the Lockless article used, except with a different intrinsic:

void test1(double * restrict a, double * restrict b)
{
    int i;

    __assume_aligned(a, 64);
    __assume_aligned(b, 64);

    for (i = 0; i < SIZE; i++) {
        a[i] += b[i];
    }
}

And now, inspecting the assembler, we see that we're using MOVAPS instead of MOVUPS. The A means aligned:

movaps  (%rdi,%rax,8), %xmm0
movaps  16(%rdi,%rax,8), %xmm1
movaps  32(%rdi,%rax,8), %xmm2
movaps  48(%rdi,%rax,8), %xmm3
addpd   (%rsi,%rax,8), %xmm0
addpd   16(%rsi,%rax,8), %xmm1
addpd   32(%rsi,%rax,8), %xmm2
addpd   48(%rsi,%rax,8), %xmm3
movaps  %xmm0, (%rdi,%rax,8)
movaps  %xmm1, 16(%rdi,%rax,8)
movaps  %xmm2, 32(%rdi,%rax,8)
movaps  %xmm3, 48(%rdi,%rax,8)
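One caveat worth noting (my own addition, not from the tests above): __assume_aligned is only a promise to the compiler. The memory itself actually has to be allocated on a 64-byte boundary, or the aligned loads can fault at runtime. A minimal sketch of one way to satisfy that promise, using the _mm_malloc/_mm_free helpers available with both compilers:

#include <xmmintrin.h>   // pulls in _mm_malloc / _mm_free

#define SIZE (1L << 16)

void test1(double * restrict a, double * restrict b);  // defined above

int main()
{
    // Allocate both arrays on a 64-byte boundary so that the
    // __assume_aligned(..., 64) promise made in test1() is actually true.
    double *a = static_cast<double *>(_mm_malloc(SIZE * sizeof(double), 64));
    double *b = static_cast<double *>(_mm_malloc(SIZE * sizeof(double), 64));

    test1(a, b);

    _mm_free(a);
    _mm_free(b);
}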

But then I encountered a surprise. If any readers can provide an explanation here, I welcome it (and I might put in a call to Intel on this one): if I keep the code as-is, with the alignment all set, and then target AVX, I get the opcodes that start with V, but they have a U for unaligned instead of an A for aligned:

vmovupd (%rdi,%rax,8), %ymm0
vmovupd 32(%rdi,%rax,8), %ymm2
vmovupd 64(%rdi,%rax,8), %ymm4
vmovupd 96(%rdi,%rax,8), %ymm6
vaddpd  (%rsi,%rax,8), %ymm0, %ymm1
vaddpd  32(%rsi,%rax,8), %ymm2, %ymm3
vaddpd  64(%rsi,%rax,8), %ymm4, %ymm5
vaddpd  96(%rsi,%rax,8), %ymm6, %ymm7
vmovupd %ymm1, (%rdi,%rax,8)
vmovupd %ymm3, 32(%rdi,%rax,8)
vmovupd %ymm5, 64(%rdi,%rax,8)
vmovupd %ymm7, 96(%rdi,%rax,8)

This is the case even though the -vec-report6 output stated that the data was aligned. The only reference I can find is a discussion forum on intel.com where an Intel employee states, "The compiler never uses VMOVAPD even though it would be valid." From the comments on my previous articles, it's clear some of you really know your stuff, so if you know what's up, feel free to chime in.
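If you want to rule out the possibility that the pointers simply aren't aligned at runtime, a quick sanity check like the following (a debugging aid of my own, not something from the article) can be dropped into the test harness:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Returns true if the pointer sits on the given byte boundary.
static bool is_aligned(const void *p, std::size_t boundary) {
    return (reinterpret_cast<std::uintptr_t>(p) % boundary) == 0;
}

// Example use inside main(), right before calling test1():
//     std::printf("a aligned: %d, b aligned: %d\n",
//                 is_aligned(a, 64), is_aligned(b, 64));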

Loop Unrolling
When we turned up the diagnostic level to 6, we also saw a message:

unroll factor set to 4

This refers to the fact that, in addition to vectorizing our code, the compiler is also using four separate registers to do four sets of additions, which is why we see four ADDPD instructions in the code. This is called loop unrolling, and it's a completely separate issue from the vectorization. I was surprised to see that the Intel compiler did that by default without me asking for it. We can turn the unrolling off with -unroll0. Then we see three lines of vectorized addition, identical to those in the Lockless report:

movaps (%rdi,%rax,8), %xmm0
addpd  (%rsi,%rax,8), %xmm0
movaps %xmm0, (%rdi,%rax,8)

Because this loop unrolling is a separate issue from vectorization, I won't use it to claim that Intel's generated vectorized code is somehow better because of it. The g++ compiler also supports unrolling under certain conditions.
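To illustrate what an unroll factor of 4 means, here is a hand-written sketch of my own, roughly what the compiler does internally before vectorizing: the loop body is replicated so that each pass through the new loop does four additions.

#define SIZE (1L << 16)

void test1_unrolled(double *a, double *b)
{
    // Assumes SIZE is a multiple of 4, as it is here; a real compiler
    // would also emit a remainder loop for any leftover iterations.
    for (long i = 0; i < SIZE; i += 4) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
}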

Embedded Loops
The Lockless article tackles the issue of embedded loops, which gave the g++ compiler some trouble. Here's some code similar to theirs:

void test5(double * restrict a, double * restrict b)
{
    int i, j;

    __assume_aligned(a, 64);
    __assume_aligned(b, 64);

    for (j = 0; j < SIZE; j++) {
        for (i = 0; i < SIZE; i++) {
            a[i + j * SIZE] += b[i + j * SIZE];
        }
    }
}

I get a diagnostic report that the loop can't be vectorized. However, the author of the Lockless piece didn't tackle trying to get just the inner loop to vectorize. With the Intel compiler, we can do so with a pragma:

void test1(double * restrict a, double * restrict b)
{
    int i, j;

    __assume_aligned(a, 64);
    __assume_aligned(b, 64);

    for (j = 0; j < SIZE; j++) {
#pragma simd
        for (i = 0; i < SIZE; i++) {
            a[i + j * SIZE] += b[i + j * SIZE];
        }
    }
}

This works; the inner loop gets vectorized. The Lockless author instead combined the loops into a single loop, like so:

for (i = 0; i < SIZE * SIZE; i++) {
    x[i] += y[i];
}

which in the end is the same as any other loop, just with a bigger upper limit. He then declares: "So, a rule of thumb might be; two loops bad, one loop good." That's a common problem in parallel programming, and there are different ways you can tackle it. According to Intel, the default is to try to vectorize the innermost loop; in my test, though, I didn't end up with a vectorized inner loop. The compiler tried, but it decided there was a loop dependency and backed off. Only after adding the pragma did it vectorize.

But what about g++? Did the Lockless author miss something? In that article the author was using g++ 4.7. I'm using 4.8.1. And it turns out I can force the inner loop to vectorize if I add optimization level 3, like so:

g++ -S -O3 gcc5.cpp

(The -S gives me the assembly output in a file.) I can see the vectorized statements in the assembly output:

movhpd 8(%rcx,%rax), %xmm0
addpd  (%r8,%rax), %xmm0
movapd %xmm0, (%r8,%rax)

And I can also see it in a report if I turn on the diagnostics like so:

g++ -S -O3 -ftree-vectorizer-verbose=1 gcc5.cpp

which gives me:

Vectorizing loop at gcc5.cpp:16

(Line 16 is the first line of the inner loop.) So did 4.7 not allow it? If you're writing code that uses advanced techniques like vectorization, you're generally going to want to make sure you're running the latest version of the compiler. However, it turns out 4.7 can vectorize the inner loop too. I installed the 4.7 compiler and compiled the above with the same options:

g++ -S -O3 -ftree-vectorizer-verbose=1 gcc5.cpp

and indeed the inner loop did vectorize:

16: LOOP VECTORIZED.
gcc5.cpp:7: note: vectorized 1 loops in function.

So it appears the inner loop can get vectorized with just a tiny bit of coaxing. (Incidentally, the assembly code generated by the 4.7 and 4.8.1 compilers is a good bit different in ways aside from the vectorization, which is interesting. I don't have space to put the entire listings here. Perhaps you might want to explore that yourself; or, if there's interest, I can do an analysis in a future article to find out why and whether the differences could potentially impact your development projects.)

More Operations
A simple add operation isn't all that useful. The processor's vectorization support includes a large set of different operations from which you can construct sophisticated computations. The Lockless article describes an operation that looks like this:

for (i = 0; i < SIZE; i++) {
    x[i] = ((y[i] > z[i]) ? y[i] : z[i]);
}

Compiling this with the Intel compiler results in the following vectorized assembly:

movaps X2(,%rax,8), %xmm0
movaps 16+X2(,%rax,8), %xmm1
movaps 32+X2(,%rax,8), %xmm2
movaps 48+X2(,%rax,8), %xmm3
maxpd  X3(,%rax,8), %xmm0
maxpd  16+X3(,%rax,8), %xmm1
maxpd  32+X3(,%rax,8), %xmm2
maxpd  48+X3(,%rax,8), %xmm3
movaps %xmm0, X(,%rax,8)
movaps %xmm1, 16+X(,%rax,8)
movaps %xmm2, 32+X(,%rax,8)
movaps %xmm3, 48+X(,%rax,8)

Aside from the unrolled loops, this code is basically the same as what g++ produced. The next step becomes a bit more complex. It looks like a small change, but it's significant. The author of the Lockless article changed the algorithm to include the left-hand item, x[i], inside the operation itself, like so:

x[i] = ((y[i] > z[i]) ? x[i] : z[i]);

With this one, we're seeing something different happen from the g++ experiments. The Lockless report states that this didn't vectorize with g++. But here, in my tests with the Intel compiler, it did just fine. The code is a bit long due to the complexity and the loop unrolling, but it has several operations that use the packed notation, and the diagnostics stated that vectorization took place.

As a final test, the Lockless author performed this operation:

double test21(double * restrict a)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double y = 0;

    for (i = 0; i < SIZE; i++) {
        y += x[i];
    }

    return y;
}

When I see this code, an alarm goes off in my mind. It's actually fine, and will indeed vectorize. But if we try to go a step further and add multicore to it, we're going to have a problem with the y variable and some race conditions. Those race conditions will require what's called a reducer. For now, since we're not trying to implement a multicore solution and are only using vectorization on a single core, we'll be fine; and indeed, this code vectorized just fine as well. What's interesting is that not only did it vectorize, but I saw the same elegant results as with the g++ compiler in the Lockless report. The compiler ultimately did what's called a horizontal operation, using an UNPCKHPD opcode.
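For completeness, here is a minimal sketch of my own (not from the article, which stays single-core) of how that race condition is typically handled once you do add multicore on top: an OpenMP reduction clause gives each thread a private copy of y and combines the copies when the parallel region ends.

#include <cstddef>

#define SIZE (1L << 16)

double sum_multicore(const double *x)
{
    double y = 0.0;

    // reduction(+:y) gives every thread its own private accumulator and
    // adds them together at the end, avoiding the race on y.
    #pragma omp parallel for reduction(+:y)
    for (long i = 0; i < SIZE; i++) {
        y += x[i];
    }

    return y;
}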

Conclusions
In general, both compilers give you plenty of options for generating vectorized code. Sometimes you have to do a little work, though, to get the compiler to actually vectorize your code. But it was clear that the Intel compiler was more willing to vectorize the code. The g++ compiler resisted at times, requiring different command-line options and occasionally a bit of reworking of the code; this is actually no surprise, since one might argue that Intel stays a step ahead with its compilers: Intel can be building the compilers for its processors before the processors are even released. The g++ team, however, can only work with what's available to them. (That, and they don't have huge amounts of dollars to dump into it like Intel does.) Either way, the g++ compiler held up well against the Intel compiler.

I was troubled by how different the generated assembly code was between the 4.7 and 4.8.1 compilers, not just in the vectorization but throughout the code. But that's another comparison for another day.
