Access Patterns Matter, Part 2
A couple of readers pointed out some improvements and corrections to my last post on GPU access patterns. These were pretty significant, so I thought it'd be worth doing a follow-up post to see how they change things.
First of all, I meant to operate on both arrays, A and B, but through some sloppy coding I ended up only using A. Incidentally, I did some back-of-the-envelope calculations to figure out the memory bandwidth I was getting, and I was surprised to see that I was getting close to twice the theoretical peak for the cards I was working with. It looks like that's because I was only reading half the data I thought I was. Here are the corrected figures (the experiment is the same other than the small corrections to my code):
| Kernel | Tesla C1060 | GeForce GTX 460 | ATI Radeon HD 6750M |
|----------------------|-----------|----------|-----------|
| MyAdd                | 2.764 ms  | 4.524 ms | 36.325 ms |
| MyAdd_2D             | 10.560 ms | 0.763 ms | 4.273 ms  |
| MyAdd_2D_unweave     | 0.740 ms  | 0.100 ms | 2.170 ms  |
| MyAdd_col            | 2.777 ms  | 4.527 ms | 26.686 ms |
| MyAdd_2D_col         | 10.391 ms | 0.961 ms | 7.723 ms  |
| MyAdd_2D_unweave_col | 12.398 ms | 0.708 ms | 3.413 ms  |
We're slower across the board, but the overall shape of the data is about the same. Interestingly, the fastest kernels are not much slower than they were before.
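As an aside, here's the shape of that back-of-the-envelope bandwidth calculation. This is a minimal sketch under my own accounting assumptions (read A, read B, write C); the matrix size and timing in the example are made up, and the actual values come from the benchmark setup.

```c
#include <stdio.h>

/* Rough effective-bandwidth estimate for C = A +/- B over N x N floats.
 * Hypothetical helper, not part of the original benchmark code. */
double effective_gbps(int N, double ms) {
    /* Intended traffic: read A, read B, write C = 3 * N^2 floats.
     * The bug meant B was never actually read, so the kernel moved less
     * data than this formula assumes -- which is why the computed figure
     * came out above the cards' theoretical peak. */
    double bytes = 3.0 * (double)N * (double)N * sizeof(float);
    return bytes / (ms * 1e-3) / 1e9; /* bytes per second -> GB/s */
}

int main(void) {
    /* Made-up example values: a 2048 x 2048 run finishing in 1.0 ms. */
    printf("%.1f GB/s\n", effective_gbps(2048, 1.0));
    return 0;
}
```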
Next, reddit user ser999 pointed out that we could forgo the branch entirely with some clever arithmetic. Instead of doing
```c
if (i % 2 == 0)
    get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j);
else
    get(C, N, i, j) = get(A, N, i, j) - get(B, N, i, j);
```
we could instead do this:
```c
get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j) * (1 - ((i & 1) << 1));
```
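Since the benchmarks ran on both NVIDIA and ATI cards, the original kernels were presumably OpenCL; below is a CUDA-flavored sketch of roughly what the variants look like, including my guess at the "unweave" trick from part 1. The get macro, the thread-to-index mapping, and the bounds checks are my assumptions, not necessarily what the actual benchmark code does.

```cuda
// Hypothetical reconstructions; the real kernels are in the repository
// linked at the end of this post. Assumes N x N row-major float
// matrices and a get() macro along these lines:
#define get(M, N, i, j) (M)[(size_t)(i) * (N) + (j)]

// Branching version. With this (assumed) thread mapping, consecutive
// threads in a warp get consecutive i, so they alternate between the
// two sides of the branch -- that's the thread divergence.
__global__ void MyAdd_2D(float *C, const float *A, const float *B, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= N || j >= N) return;
    if (i % 2 == 0)
        get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j);
    else
        get(C, N, i, j) = get(A, N, i, j) - get(B, N, i, j);
}

// One way to "unweave" (an illustration of the idea, not necessarily
// the original mapping; assumes N is even): the first half of the
// threads handle even rows and the second half handle odd rows, so
// every thread in a warp takes the same side of the branch.
__global__ void MyAdd_2D_unweave(float *C, const float *A, const float *B, int N) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (t >= N || j >= N) return;
    int i = (t < N / 2) ? 2 * t : 2 * (t - N / 2) + 1; // remapped row index
    if (i % 2 == 0)
        get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j);
    else
        get(C, N, i, j) = get(A, N, i, j) - get(B, N, i, j);
}

// ser999's branch-free version: (i & 1) is 0 for even i and 1 for odd
// i, so (1 - ((i & 1) << 1)) evaluates to +1 or -1 with no branch.
__global__ void MyAdd_2D_nobranch(float *C, const float *A, const float *B, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= N || j >= N) return;
    get(C, N, i, j) = get(A, N, i, j) + get(B, N, i, j) * (1 - ((i & 1) << 1));
}
```

Note that the remapping in the unweave sketch doubles the stride between consecutive threads' accesses, so it trades divergence for a somewhat worse access pattern.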
The arithmetic version isn't as clear as the code we had before, but perhaps a sufficiently smart compiler could perform this optimization itself, so the programmer could still write the nicer code. To be fair, my thread divergence optimization wasn't great for code readability either. The important thing is, how does this "no branching" version perform? The table below shows its performance along with the "unweaved" version from before for comparison.
| Kernel | Tesla C1060 | GeForce GTX 460 | ATI Radeon HD 6750M |
|-----------------------|-----------|----------|----------|
| MyAdd_2D_unweave      | 0.740 ms  | 0.100 ms | 2.170 ms |
| MyAdd_2D_nobranch     | 9.969 ms  | 0.731 ms | 3.381 ms |
| MyAdd_2D_unweave_col  | 12.398 ms | 0.708 ms | 3.413 ms |
| MyAdd_2D_col_nobranch | 9.982 ms  | 0.729 ms | 3.465 ms |
For row-wise access, the "unweave" variant always wins. For column-wise access, the "nobranch" version wins on the C1060, while the GTX 460 and the ATI card do better with the "unweave" variant. In the GTX 460 and ATI column-wise cases, however, the two perform basically the same.
So why is this? Branches are often pretty expensive, especially when they create thread divergence. However, by removing the branch we always have to do a multiplication, and we also have to convert an integer value into a floating point value. Multiplication in particular is a fairly expensive operation. In this case, the branch isn't so bad.
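For anyone trying to reproduce these numbers, a typical way to time a kernel like this on NVIDIA hardware uses CUDA events. This is a hedged sketch of such a harness, not the actual benchmark driver (which lives in the repository linked below); it assumes the device buffers have already been allocated and filled.

```cuda
// Hypothetical timing harness using CUDA events, sketched from memory.
// Assumes d_A, d_B, d_C are device buffers that are already allocated
// and initialized, and that grid/block describe the launch under test.
float time_kernel(float *d_C, const float *d_A, const float *d_B,
                  int N, dim3 grid, dim3 block) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    MyAdd_2D_nobranch<<<grid, block>>>(d_C, d_A, d_B, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop); // block until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```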
As before, the code from this post is available at https://github.com/eholk/bench-thread-diverge.