A look at GPU memory transfer
One of the trickier parts of programming with multiple devices is managing the transfer of data between them. This applies whether you're programming a cluster or a single machine with a CPU and a GPU. Transferring data takes time, and you must be careful that the transfer time doesn't outweigh the performance gains from parallelizing your algorithm. Transfer time is usually thought of as having two components: the time due to latency and the time due to bandwidth. The total time to transfer the data is then,
$$ T_\mathit{total} = T_L + T_B $$
where \(T_L\) is the time due to latency and \(T_B\) is the time due to bandwidth. Typically, the \(T_L\) term is a constant that does not depend on how much data is sent. For example, between two computers on the Internet, the latency term might be something like 35 ms. Between main memory and the CPU, this term is on the order of 100 nanoseconds.
The \(T_B\) term normally depends on the size of the data being transferred. So, if the size of the data is \(S\) and the bandwidth is \(B\), we'd have,
$$ T_B = \frac{S}{B} $$
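As a rough illustration, the model can be evaluated directly. This is a minimal sketch: the function name `transfer_time` is hypothetical, the 35 ms latency is the Internet example from above, and the 10 MB/s bandwidth is an assumed figure chosen only to make the arithmetic easy to follow.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Total transfer time T_total = T_L + S / B, in seconds."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


LATENCY = 0.035   # 35 ms, the Internet example from the text
BANDWIDTH = 10e6  # ASSUMPTION: 10 MB/s, picked only for illustration

small = transfer_time(1_000, LATENCY, BANDWIDTH)        # latency term dominates
large = transfer_time(100_000_000, LATENCY, BANDWIDTH)  # bandwidth term dominates

print(f"1 KB:   {small:.4f} s")
print(f"100 MB: {large:.4f} s")
```

For the small transfer almost all of the time is latency, while for the large one the latency is lost in the noise.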
Sometimes there is a minimum amount of data that you can transfer. For example, many hard drives have a 512-byte sector size. These drives transfer data in units of 512 bytes, so even if you only need 4 bytes from the disk, you still spend as much time as you would to copy 512 bytes.
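One way to fold a minimum transfer unit into the model is to round the size up to a whole number of units before computing the bandwidth term. The sketch below does that; the names `padded_size` and `transfer_time_with_min_unit` are hypothetical, and the 5 ms latency and 100 MB/s bandwidth are assumed values for an imaginary disk, not measurements.

```python
def padded_size(size_bytes, unit_bytes):
    """Round size_bytes up to the next multiple of unit_bytes."""
    units = (size_bytes + unit_bytes - 1) // unit_bytes
    return units * unit_bytes


def transfer_time_with_min_unit(size_bytes, latency_s, bandwidth_bytes_per_s, unit_bytes):
    """T_L + S'/B, where S' is the size rounded up to whole transfer units."""
    return latency_s + padded_size(size_bytes, unit_bytes) / bandwidth_bytes_per_s


SECTOR = 512       # sector size from the text
LATENCY = 0.005    # ASSUMPTION: 5 ms
BANDWIDTH = 100e6  # ASSUMPTION: 100 MB/s

t4 = transfer_time_with_min_unit(4, LATENCY, BANDWIDTH, SECTOR)
t512 = transfer_time_with_min_unit(512, LATENCY, BANDWIDTH, SECTOR)
t513 = transfer_time_with_min_unit(513, LATENCY, BANDWIDTH, SECTOR)

print(t4 == t512)   # reading 4 bytes costs the same as reading a full sector
print(t513 > t512)  # one byte more than a sector costs a second sector
```

Plotted against size, this model is a staircase: flat within each unit, with a jump at every unit boundary.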
My research group had a hypothesis that there is a similar minimum unit of data transfer for GPUs. Furthermore, we suspected this unit was fairly large. If so, GPU programs should combine transfers so that they pay the latency overhead as few times as possible. In some cases we could even get away with transferring more data than necessary in order to reduce the number of transfer operations.
In order to test this hypothesis, I wrote a simple program that copies data between the CPU and GPU in varying sizes. We expected to see a line that was basically flat up to a certain threshold size and then see the transfer time increase linearly. Here's an example of what we saw in practice.
EDIT (2022-12-16): This was an embedded Google Spreadsheet chart, but apparently at some point in the last ten years they deprecated this link so now the chart doesn't show up.
This is on a Core i7-2600K with an NVIDIA GTX460 GPU.
We see the general shape we expected. Up until about 8 KB, every transfer takes around 6 or 7 microseconds regardless of size. Beyond that, the transfer time increases linearly with size.
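For anyone who wants to repeat this kind of measurement, a size sweep might be structured like the sketch below. This is not the original test program: `time_copy` and `host_copy` are hypothetical names, and `host_copy` only copies within host memory as a stand-in so that the example runs anywhere. A real test would pass in a function that performs a host-to-device copy (for example, a wrapper around CUDA's `cudaMemcpy`) and waits for it to finish.

```python
import time


def time_copy(copy_fn, size_bytes, repeats=20):
    """Return the best wall-clock time (seconds) over `repeats` copies."""
    src = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        copy_fn(src)
        best = min(best, time.perf_counter() - start)
    return best


def host_copy(buf):
    # Stand-in only: this copies within host memory, NOT to a GPU.
    return bytes(buf)


# Sweep power-of-two sizes from 4 bytes to 4 MB.
sizes = [2 ** k for k in range(2, 23)]
results = [(size, time_copy(host_copy, size)) for size in sizes]

for size, seconds in results:
    print(f"{size:>10} B  {seconds * 1e6:10.2f} us")
```

Taking the best of several repeats is one simple way to reduce noise from the operating system; averaging or reporting a median would be equally reasonable choices.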
Although the shape matched what we expected, many of the implications we anticipated do not hold. We expected the threshold to be in the range of several megabytes. Instead, it was at 8 KB, and it seems unlikely that your code will benefit from running on the GPU if you only have 8 KB of data. The second conclusion we expected to draw was that it is okay to overestimate the data to transfer. This is also invalid: the data your program typically works with is so far above the threshold that you actually want to minimize the amount of data transferred, in order to minimize the contribution of the bandwidth term to the total transfer time.
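To put rough numbers on this trade-off, the sketch below compares sending two buffers separately against sending them as one transfer that also includes the unused gap between them. It uses the simple \(T_L + S/B\) model with the 6.5 microsecond floor observed above as the latency. The 6 GB/s bandwidth is an assumption, not a figure from these measurements, and the function names are hypothetical.

```python
LATENCY_S = 6.5e-6   # per-transfer floor observed above (6 to 7 microseconds)
BANDWIDTH_BPS = 6e9  # ASSUMPTION: 6 GB/s; not taken from the measurements


def transfer_time(size_bytes):
    """Simple model: T_L + S / B, in seconds."""
    return LATENCY_S + size_bytes / BANDWIDTH_BPS


def separate(buffer_bytes, count):
    """Send each buffer in its own transfer, paying the latency each time."""
    return count * transfer_time(buffer_bytes)


def combined(buffer_bytes, count, gap_bytes):
    """Send everything at once, including the unused gaps between buffers."""
    total = count * buffer_bytes + (count - 1) * gap_bytes
    return transfer_time(total)


# Extra bytes that cost as much to send as one additional transfer's latency.
break_even_gap = LATENCY_S * BANDWIDTH_BPS

KB, MB = 1024, 1024 * 1024
for size in (KB, MB):
    sep = separate(size, 2) * 1e6
    com = combined(size, 2, size) * 1e6
    print(f"2 x {size:>7} B: separate {sep:8.2f} us, combined {com:8.2f} us")
print(f"break-even gap: {break_even_gap:.0f} bytes")
```

Under these assumptions, merging two transfers only pays off when the extra data is smaller than a few tens of kilobytes, which is consistent with the conclusion above that over-transferring is rarely worthwhile at realistic data sizes.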
Another lesson here is that it's important to test your intuition before basing design decisions on it. This test was easy to write, yet the decisions we would have made based on our untested assumptions might have had expensive and long-lasting consequences.