Optimizing performance of your CUDA code

Recap of the obvious:

DRAM bank:

On a higher level, a DRAM bank (sketched above) does the following:

  1. The incoming row address is decoded by the Row Decoder, which activates the corresponding row in the DRAM array.

  2. The contents of that row are sensed and amplified by the Sense Amplifier, temporarily stored in Column Latches.

  3. The contents of the Column Latches are then passed through a multiplexer (MUX), where the specific columns are selected based on the Column address.

The key factor here is that if the next access targets the same row address, we can skip Step 1 and Step 2 (the slowest steps) and jump directly to Step 3, fetching the required column through the multiplexer. This is called a row-buffer hit, and it is exactly what coalesced memory accesses on a GPU are designed to exploit.

Memory coalescing on GPUs:

Let’s look at an example:

int idx = blockDim.x*blockIdx.x + threadIdx.x;
C[idx] = A[idx] + B[idx];

Here consecutive threads in a warp access consecutive elements of A, B, and C, so each warp's loads and stores are coalesced into a small number of memory transactions.
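To make the contrast concrete, here is a hedged sketch of a coalesced kernel next to an uncoalesced one. The kernel names and the stride parameter are illustrative, not from the original text:

```cuda
// Coalesced: consecutive threads read consecutive addresses, so a
// warp's 32 loads fall into the same (or adjacent) DRAM rows and
// can be served as a few wide transactions.
__global__ void add_coalesced(const float *A, const float *B,
                              float *C, int n) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n)
        C[idx] = A[idx] + B[idx];
}

// Uncoalesced: each thread jumps by `stride` elements, so threads in
// the same warp touch addresses far apart; the hardware cannot merge
// the accesses, and each one may open a different DRAM row.
__global__ void add_strided(const float *A, const float *B,
                            float *C, int n, int stride) {
    int idx = (blockDim.x * blockIdx.x + threadIdx.x) * stride;
    if (idx < n)
        C[idx] = A[idx] + B[idx];
}
```

On most GPUs the strided version runs many times slower for large strides, even though both kernels perform the same number of arithmetic operations.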

Banked memories and channels:

Other important considerations:

Credits: