GPU architecture:

(Figure: GPU architecture)

Block scheduling:

Synchronisation and transparent scalability:

Warps and SIMD hardware:

(Figure: warps)

Within each thread block assigned to an SM, threads are further organized into warps. A warp is the fundamental unit of thread scheduling on NVIDIA GPUs: a group of 32 threads executed in Single Instruction, Multiple Data (SIMD) fashion, meaning all 32 threads issue the same instruction at any given time, each operating on its own data. The sections below look at how warps behave within the GPU architecture.
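Because a block is carved into warps in order of thread index, each thread can derive its own warp and lane from that index. A minimal device-code sketch (hypothetical kernel and array names, assuming a 1D block):

```cuda
__global__ void warp_lane(int *warp_id, int *lane_id) {
    int t = threadIdx.x;           // linear index within the block
    warp_id[t] = t / warpSize;     // which warp this thread belongs to (warpSize is 32)
    lane_id[t] = t % warpSize;     // this thread's lane within its warp
}
```

Threads 0–31 land in warp 0, threads 32–63 in warp 1, and so on; a block whose size is not a multiple of 32 still occupies whole warps, with the last warp only partially filled.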

Parallels with the von Neumann computer model:

The von Neumann computer model:

(Figure: the von Neumann model)

Executing threads in warps maps onto this model as follows:

(Figure: the von Neumann model adapted to a GPU)

Some notes on the analogy:

Control divergence:

Warp scheduling and latency tolerance:

Resource partitioning and occupancy:

Querying device properties:

#include <iostream>
#include <cuda_runtime.h>

int main(){

    // Query the number of CUDA-capable devices.
    int devCount;
    cudaGetDeviceCount(&devCount);

    std::cout << "Number of CUDA devices: " << devCount << std::endl;

    cudaDeviceProp devProp;

    // Query and print the properties of each device.
    for (int i = 0; i < devCount; i++){
        cudaGetDeviceProperties(&devProp, i);
        std::cout << "Device " << i << ": " << devProp.name << std::endl;
        std::cout << "Compute capability: " << devProp.major << "." << devProp.minor << std::endl;
        std::cout << "Total global memory: " << devProp.totalGlobalMem << std::endl;
        std::cout << "Shared memory per block: " << devProp.sharedMemPerBlock << std::endl;
        std::cout << "Registers per block: " << devProp.regsPerBlock << std::endl;
        std::cout << "Warp size: " << devProp.warpSize << std::endl;
        std::cout << "Max threads per block: " << devProp.maxThreadsPerBlock << std::endl;
        std::cout << "Max threads dimensions: " << devProp.maxThreadsDim[0] << " x " << devProp.maxThreadsDim[1] << " x " << devProp.maxThreadsDim[2] << std::endl;
        std::cout << "Max grid size: " << devProp.maxGridSize[0] << " x " << devProp.maxGridSize[1] << " x " << devProp.maxGridSize[2] << std::endl;
        std::cout << "Clock rate: " << devProp.clockRate << std::endl;
        std::cout << "Total constant memory: " << devProp.totalConstMem << std::endl;
        std::cout << "Texture alignment: " << devProp.textureAlignment << std::endl;
        std::cout << "Multiprocessor count: " << devProp.multiProcessorCount << std::endl;
    }
    return 0;
}

Compiling and running this on my remote A100 machine prints:

$ nvcc -o props props.cu
$ ./props
Number of CUDA devices: 1
Device 0: NVIDIA A100 80GB PCIe
Compute capability: 8.0
Total global memory: 84974239744
Shared memory per block: 49152
Registers per block: 65536
Warp size: 32
Max threads per block: 1024
Max threads dimensions: 1024 x 1024 x 64
Max grid size: 2147483647 x 65535 x 65535
Clock rate: 1410000
Total constant memory: 65536
Texture alignment: 512
Multiprocessor count: 108