Occupancy of an SM:
First, know the specs of your GPU (e.g., an A100). Code:
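The original props.cu is not reproduced here; a minimal sketch that could print this kind of information with cudaGetDeviceProperties (field selection and formatting are illustrative) looks like:

// props.cu -- query and print basic device properties (illustrative sketch)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Number of CUDA devices: %d\n", count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, dev);
        printf("Device %d: %s\n", dev, p.name);
        printf("Compute capability: %d.%d\n", p.major, p.minor);
        printf("Total global memory: %.4f GB\n", p.totalGlobalMem / 1e9);
        printf("Shared memory per block: %zu KB\n", p.sharedMemPerBlock / 1024);
        printf("Shared memory per SM: %zu KB\n", p.sharedMemPerMultiprocessor / 1024);
        printf("Registers per block: %d\n", p.regsPerBlock);
        printf("Warp size: %d\n", p.warpSize);
        printf("Max threads per block: %d\n", p.maxThreadsPerBlock);
        printf("Max threads per SM: %d\n", p.maxThreadsPerMultiProcessor);
        printf("Max threads dimensions: %d x %d x %d\n",
               p.maxThreadsDim[0], p.maxThreadsDim[1], p.maxThreadsDim[2]);
        printf("Max grid size: %d x %d x %d\n",
               p.maxGridSize[0], p.maxGridSize[1], p.maxGridSize[2]);
        printf("Clock rate: %d\n", p.clockRate);
        printf("Total constant memory: %zu\n", p.totalConstMem);
        printf("Texture alignment: %zu\n", p.textureAlignment);
        printf("Multiprocessor count (#SMs): %d\n", p.multiProcessorCount);
    }
    return 0;
}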
$ nvcc -o props props.cu && ./props
Number of CUDA devices: 1
Device 0: NVIDIA A100 80GB PCIe
Compute capability: 8.0
Total global memory: 79.1384 GB
Shared memory per block: 48 KB
Shared memory per SM: 164 KB
Registers per block: 65536
Warp size: 32
Max threads per block: 1024
Max threads per SM: 2048
Max threads dimensions: 1024 x 1024 x 64
Max grid size: 2147483647 x 65535 x 65535
Clock rate: 1410000
Total constant memory: 65536
Texture alignment: 512
Multiprocessor count (#SMs): 108
The occupancy per SM is defined as follows: \(\frac{\# \text{active warps per SM}}{\text{max # of warps per SM}}\).
For instance, if I choose a block size of 32 threads, at most 32 blocks can be resident on one SM (the hardware limit on resident blocks per SM), so I would have only 1024 threads per SM (32 blocks × 32 threads). That yields an occupancy of 1024/2048 = 1/2, since the maximum number of threads per SM on an A100 is 2048 (64 warps per SM).
Notice how the shared memory per SM (164 KB) is NOT equal to {shared memory per block (48 KB)} × {max number of blocks per SM}. If your block uses too much shared memory, that alone will limit the number of blocks that can be assigned to an SM, and hence the occupancy.
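The CUDA runtime can report these limits for a concrete kernel. Below is a minimal sketch (the kernel my_kernel is a hypothetical placeholder) that asks how many blocks of a given size fit on one SM, accounting for registers and shared memory, and derives the occupancy:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *out) {   // hypothetical kernel, only used for the query
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 32;          // threads per block
    int maxBlocksPerSM = 0;
    // How many blocks of this size can be resident on one SM,
    // given this kernel's register and shared-memory usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, my_kernel,
                                                  blockSize, /*dynamicSMemSize=*/0);

    float occupancy = (float)(maxBlocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("Resident blocks/SM: %d, occupancy: %.2f\n", maxBlocksPerSM, occupancy);
    return 0;
}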
Performance bounds and arithmetic intensity:
Bounds or bottlenecks:
The question you ask: why can't your program run faster? Two answers:
- If only my GPU could do more operations/second -> Compute bound.
  - Data transfer is not the issue; there is just not enough compute throughput.
- If only my GPU could move more data/second -> Memory bound.
  - Some compute cores might be idle, waiting for data.
Arithmetic intensity:
The arithmetic (or computational) intensity is defined as the ratio of floating-point operations performed to bytes accessed from global memory. For instance, it is equal to 0.25 FLOP/byte in the following example: each iteration performs 2 floating-point operations (one multiply and one add) while reading 8 bytes (two 4-byte floats) from global memory, i.e., 2/8 = 0.25.
for (int k = 0; k < width; k++) {
    Pvalue += M[row*width + k] * N[k*width + col];
}
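For context, this loop is the body of a naive matrix-multiply kernel. A sketch of the full kernel it might sit in (square width × width matrices; the surrounding kernel and index computation are assumed, not shown in the original):

__global__ void matmul_naive(const float *M, const float *N, float *P, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float Pvalue = 0.0f;
        for (int k = 0; k < width; k++) {
            // 2 FLOPs (multiply + add) per 8 bytes read from global memory
            Pvalue += M[row*width + k] * N[k*width + col];
        }
        P[row*width + col] = Pvalue;
    }
}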
On an A100, the DRAM peak bandwidth is 1555 GB/second; hence, for the simple example above, we can do at best ~389 GFLOPS (giga floating-point operations per second). This amounts to only about 2% of the (theoretical) peak single-precision throughput of ~19,500 GFLOPS. Reaching that peak would require an arithmetic intensity of roughly 12.5 FLOPs per byte accessed.
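Spelling out the arithmetic behind these two numbers:
\[
1555\ \tfrac{\text{GB}}{\text{s}} \times 0.25\ \tfrac{\text{FLOP}}{\text{byte}} \approx 389\ \text{GFLOP/s},
\qquad
\frac{19{,}500\ \text{GFLOP/s}}{1555\ \text{GB/s}} \approx 12.5\ \tfrac{\text{FLOP}}{\text{byte}}.
\]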
Desired compute intensity:
Consider we have the following specs:
- GPU FLOPs: 14 TFLOP/s (14,000 GFLOP/s)
- Memory Bandwidth: 900 GB/s
The desired compute-to-memory-access ratio (to be compute bound rather than memory bound) is:
\[\frac{14{,}000 \,\text{GFLOP/s}}{900 \,\text{GB/s}} \approx 15.6 \,\text{FLOP/byte}.\]
⚠ Important > Each single-precision floating-point value occupies 4 bytes (32 bits). Thus, per single-precision value accessed from global memory, the kernel needs to perform
\[15.6 \,\tfrac{\text{FLOP}}{\text{byte}} \times 4 \,\tfrac{\text{bytes}}{\text{value}} \approx 62 \,\text{FLOPs per single-precision access}\]
to keep such a GPU busy with compute rather than waiting on memory.