Access Patterns Matter, Part 2
A couple of readers pointed out some improvements and corrections to my last post on GPU access patterns. These were pretty significant, so I thought it'd be worth doing a follow-up post to see how they change things.
First of all, I meant to operate on both arrays, A and B, but through some sloppy coding I ended up only using A. Incidentally, I did some back-of-the-envelope calculations to figure out the memory bandwidth I was getting, and I was surprised to see that I was getting close to twice the theoretical peak for the cards I was working with. It looks like that's because I was only reading half the data I thought I was. Here are the corrected figures (the experiment is the same other than the small corrections to my code):
| Kernel | Tesla C1060 | GeForce GTX 460 | ATI Radeon HD 6750M |
|----------------------|-----------|----------|-----------|
| MyAdd                | 2.764 ms  | 4.524 ms | 36.325 ms |
| MyAdd_2D             | 10.560 ms | 0.763 ms | 4.273 ms  |
| MyAdd_2D_unweave     | 0.740 ms  | 0.100 ms | 2.170 ms  |
| MyAdd_col            | 2.777 ms  | 4.527 ms | 26.686 ms |
| MyAdd_2D_col         | 10.391 ms | 0.961 ms | 7.723 ms  |
| MyAdd_2D_unweave_col | 12.398 ms | 0.708 ms | 3.413 ms  |
We're slower across the board, but the overall shape of the data is about the same. Interestingly, the fastest kernels are not much slower than they were before.
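As an aside, here's the shape of that back-of-the-envelope bandwidth calculation. This is a minimal sketch under my own accounting assumptions (read A, read B, write C); the matrix size and timing in the example are made up, and the actual values come from the benchmark setup.

```c
#include <stdio.h>

/* Rough effective-bandwidth estimate for C = A +/- B over N x N floats.
 * Hypothetical helper, not part of the original benchmark code. */
double effective_gbps(int N, double ms) {
    /* Intended traffic: read A, read B, write C = 3 * N^2 floats.
     * The bug meant B was never actually read, so the kernel moved less
     * data than this formula assumes -- which is why the computed figure
     * came out above the cards' theoretical peak. */
    double bytes = 3.0 * (double)N * (double)N * sizeof(float);
    return bytes / (ms * 1e-3) / 1e9; /* bytes per second -> GB/s */
}

int main(void) {
    /* Made-up example values: a 2048 x 2048 run finishing in 1.0 ms. */
    printf("%.1f GB/s\n", effective_gbps(2048, 1.0));
    return 0;
}
```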
Next, reddit user ser999 pointed out that we could forgo the branch entirely with some clever arithmetic. Instead of doing
```c
if (i % 2 == 0)
    get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j);
else
    get(C, N, i, j) = get(A, N, i, j) - get(B, N, i, j);
```
we could instead do this:
```c
get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j) * (1 - ((i & 1) << 1));
```
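Since the benchmarks ran on both NVIDIA and ATI cards, the original kernels were presumably OpenCL; below is a CUDA-flavored sketch of roughly what the variants look like, including my guess at the "unweave" trick from part 1. The get macro, the thread-to-index mapping, and the bounds checks are my assumptions, not necessarily what the actual benchmark code does.

```cuda
// Hypothetical reconstructions; the real kernels are in the repository
// linked at the end of this post. Assumes N x N row-major float
// matrices and a get() macro along these lines:
#define get(M, N, i, j) (M)[(size_t)(i) * (N) + (j)]

// Branching version. With this (assumed) thread mapping, consecutive
// threads in a warp get consecutive i, so they alternate between the
// two sides of the branch -- that's the thread divergence.
__global__ void MyAdd_2D(float *C, const float *A, const float *B, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= N || j >= N) return;
    if (i % 2 == 0)
        get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j);
    else
        get(C, N, i, j) = get(A, N, i, j) - get(B, N, i, j);
}

// One way to "unweave" (an illustration of the idea, not necessarily
// the original mapping; assumes N is even): the first half of the
// threads handle even rows and the second half handle odd rows, so
// every thread in a warp takes the same side of the branch.
__global__ void MyAdd_2D_unweave(float *C, const float *A, const float *B, int N) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (t >= N || j >= N) return;
    int i = (t < N / 2) ? 2 * t : 2 * (t - N / 2) + 1; // remapped row index
    if (i % 2 == 0)
        get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j);
    else
        get(C, N, i, j) = get(A, N, i, j) - get(B, N, i, j);
}

// ser999's branch-free version: (i & 1) is 0 for even i and 1 for odd
// i, so (1 - ((i & 1) << 1)) evaluates to +1 or -1 with no branch.
__global__ void MyAdd_2D_nobranch(float *C, const float *A, const float *B, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= N || j >= N) return;
    get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j) * (1 - ((i & 1) << 1));
}
```

Note that the remapping in the unweave sketch doubles the stride between consecutive threads' accesses, so it trades divergence for a somewhat worse access pattern.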
The arithmetic version isn't as clear as the code we had before, but perhaps a sufficiently smart compiler could perform this optimization itself, so the programmer could still write the nicer code. To be fair, my thread divergence optimization wasn't great for code readability either. The important thing is, how does this "no branching" version perform? The table below shows its performance along with the "unweaved" version from before for comparison.
| Kernel | Tesla C1060 | GeForce GTX 460 | ATI Radeon HD 6750M |
|-----------------------|-----------|----------|----------|
| MyAdd_2D_unweave      | 0.740 ms  | 0.100 ms | 2.170 ms |
| MyAdd_2D_nobranch     | 9.969 ms  | 0.731 ms | 3.381 ms |
| MyAdd_2D_unweave_col  | 12.398 ms | 0.708 ms | 3.413 ms |
| MyAdd_2D_col_nobranch | 9.982 ms  | 0.729 ms | 3.465 ms |
For row-wise access, the "unweave" variant always wins. For column-wise access, the "nobranch" version wins on the C1060, while the GTX 460 and the ATI card do better with the "unweave" variant. In the GTX 460 and ATI column-wise cases, however, the two perform basically the same.
So why is this? Branches are often pretty expensive, especially when they create thread divergence. However, by removing the branch we always have to do a multiplication, and we also have to convert an integer value into a floating point value. Multiplication in particular is a fairly expensive operation. In this case, the branch isn't so bad.
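For anyone trying to reproduce these numbers, a typical way to time a kernel like this on NVIDIA hardware uses CUDA events. This is a hedged sketch of such a harness, not the actual benchmark driver (which lives in the repository linked below); it assumes the device buffers have already been allocated and filled.

```cuda
// Hypothetical timing harness using CUDA events, sketched from memory.
// Assumes d_A, d_B, d_C are device buffers that are already allocated
// and initialized, and that grid/block describe the launch under test.
float time_kernel(float *d_C, const float *d_A, const float *d_B,
                  int N, dim3 grid, dim3 block) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    MyAdd_2D_nobranch<<<grid, block>>>(d_C, d_A, d_B, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop); // block until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```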
As before, the code from this post is available at https://github.com/eholk/bench-thread-diverge.