olibartfast/Parallel Programming Cornerstones for CUDA.md

Parallel Programming Cornerstones in CUDA

This gist outlines the core parallel programming patterns (cornerstones) used in CUDA to leverage modern NVIDIA GPU architectures. These patterns are fundamental for exploiting CUDA’s Single Instruction, Multiple Thread (SIMT) execution model, memory hierarchy, and advanced thread management features for efficient parallel computation.

A key theme in modern CUDA development is the use of highly optimized libraries. For many common patterns, NVIDIA CUB provides state-of-the-art implementations. While the original standalone CUB repository is now archived, CUB is actively maintained as part of the CUDA C++ Core Libraries (CCCL). CCCL unifies CUB, Thrust, and libcu++ into a cohesive standard library included in every CUDA Toolkit. Using libraries like CUB is the recommended best practice for achieving maximum performance and reliability.

The cornerstones covered include:

Reduction
Prefix Scan
Map
Tiled Partitioning
Stencil
Gather/Scatter
Cooperative Groups
Warp-Level Primitives
Stream Parallelism

1. Reduction

Description: Aggregates a dataset into a single value (e.g., sum, min, max) using a hierarchical, parallel approach.

CUDA Relevance: Utilizes shared memory and thread synchronization to perform partial reductions at the block level, minimizing costly global memory accesses. A second kernel or a grid-striding loop often combines these partial results.

Use Cases: Summing arrays, finding maxima, computing vector norms.

Key Considerations

Minimize thread divergence within warps.
Use tree-based reduction for efficiency.
Leverage shared memory to reduce global memory latency.

Modern Practice

Use cub::DeviceReduce from the CCCL-integrated CUB library for production.
Perform reduction in two stages: block-level, then global.

Example: Block-Level Reduction

__global__ void sumReduction(float *input, float *output, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? input[i] : 0;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) output[blockIdx.x] = sdata[0];
}

2. Prefix Scan (Parallel Scan)

Description: Computes cumulative results across an array (e.g., inclusive/exclusive prefix sum) where each element depends on all prior inputs.

CUDA Relevance: Essential in stream compaction, sorting, and memory allocation.

Use Cases: Running totals, partitioning, histogram equalization.

Key Considerations

Use Hillis-Steele or Blelloch algorithms.
Requires fine-grained synchronization.

Modern Practice

Use cub::DeviceScan, which includes inclusive and exclusive variants.
Supports decoupled look-back for low-latency, memcpy-like performance.

3. Map

Description: Applies a function independently to each element in a dataset.

CUDA Relevance: Ideal for CUDA’s thread-level parallelism.

Use Cases: Element-wise operations like scaling, filtering, or transformations.

Key Considerations

Ensure coalesced memory access.
No synchronization required.

Example

__global__ void mapScale(float *input, float *output, float scale, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = input[idx] * scale;
    }
}

4. Tiled Partitioning

Description: Splits data into tiles that fit in shared memory to reduce global memory traffic and improve data reuse.

CUDA Relevance: Crucial for memory-bound algorithms like GEMM and convolutions.

Use Cases: Tiled matrix multiplication, convolutions.

Key Considerations

Select tile sizes to fit in shared memory.
Requires __syncthreads() for synchronization.

Modern Practice

Use Tensor Cores via cuBLAS or WMMA API.
Tune tile sizes per GPU architecture (e.g., Hopper, Ampere).

5. Stencil

Description: Each element is updated based on neighboring values.

CUDA Relevance: Common in PDE solvers, image processing.

Use Cases: Gaussian blur, edge detection, heat diffusion.

Key Considerations

Cache halos/ghost zones in shared memory.
Handle boundary conditions efficiently.

Modern Practice

Use padding to reduce bank conflicts.
Overlap computation and memory prefetch using cooperative groups.

6. Gather/Scatter

Description:

Gather: Reads from arbitrary indices.
Scatter: Writes to arbitrary indices.

CUDA Relevance: Enables irregular memory access patterns like sparse ops.

Use Cases: Histogramming, graph processing, sparse matrix ops.

Key Considerations

Use atomics (e.g., atomicAdd) for scatter.
Coalesce reads when possible to reduce pressure.

Example: Histogram with Scatter

__global__ void histogram(unsigned int *input, int *bins, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        atomicAdd(&bins[input[idx]], 1);
    }
}

7. Cooperative Groups

Description: Allows group-level synchronization and coordination beyond thread blocks.

CUDA Relevance: Enables grid-wide reductions, persistent kernels.

Use Cases: Grid-level synchronization, producer-consumer models.

Key Considerations

Requires cooperative_groups API.
Use only when needed due to overhead.
Launch kernels with the correct API to enable cooperative execution.

8. Warp-Level Primitives

Description: Enables communication within a warp without shared memory.

CUDA Relevance: Very low-latency; great for warp-local reductions or ballots.

Use Cases: Voting, warp-level reductions.

Key Considerations

Use _sync suffixes (e.g., __shfl_down_sync).
Avoids synchronization overhead.

Example: Warp Reduction

__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}

9. Stream Parallelism

Description: Uses CUDA streams to overlap data transfers and computation.

CUDA Relevance: Maximizes throughput by exploiting concurrency.

Use Cases: Pipeline processing of datasets, latency hiding.

Key Considerations

Use cudaMemcpyAsync, pinned memory.
Separate streams allow concurrent kernel launches.

Modern Practice

Use streams for task-level parallelism, not just transfer/compute overlap.

General Optimization Tips

Libraries

Prefer CUB, cuBLAS, cuFFT, or cuDNN over hand-written kernels.

Memory

Coalesce global memory accesses across warps.
Use shared memory as a fast user-managed cache.

Synchronization

Use __syncthreads() and warp-sync intrinsics judiciously.
Avoid unnecessary barriers to maintain pipeline efficiency.

Profiling

Use Nsight Compute or Visual Profiler to identify bottlenecks.
Tune for occupancy, memory throughput, and instruction efficiency.

olibartfast/Parallel Programming Cornerstones for CUDA.md

Select an option

No results found

Select an option

No results found

Parallel Programming Cornerstones in CUDA

1. Reduction

Key Considerations

Modern Practice

Example: Block-Level Reduction

2. Prefix Scan (Parallel Scan)

Key Considerations

Modern Practice

3. Map

Key Considerations

Example

4. Tiled Partitioning

Key Considerations

Modern Practice

5. Stencil

Key Considerations

Modern Practice

6. Gather/Scatter

Key Considerations

Example: Histogram with Scatter

7. Cooperative Groups

Key Considerations

8. Warp-Level Primitives

Key Considerations

Example: Warp Reduction

9. Stream Parallelism

Key Considerations

Modern Practice

General Optimization Tips

Libraries

Memory

Synchronization

Profiling

References