PART 0 Prerequisites (compilation, environment, cluster)
PART I System Diagnostics (Nsight Systems)
S0 Annotate the code (NVTX)
S1 Capture the trace
S1b nsys analyze and automatic reports
S2 Read visual patterns
S3 Fix system-level problems
S4 Identify the dominant kernel
S5 Decide to drill into Nsight Compute
S5b Integrated workflow: System Trace > Profile Kernel
PART II Kernel Diagnostics (Nsight Compute)
Step 0 Before profiling: know your algorithm
Step 0b Source-modification technique (mem-only / math-only)
Step 1 Classify the kernel
Step 1b Use the interactive Roofline
Step 2 Identify the limiting subsystem
Step 2b Use the interactive Memory Chart
Step 3 Quantify the problem and estimate potential speedup
Step 4 Locate in the Source tab
Step 4F Advanced Source Page features
Step 5 Choose and apply a fix
Step 5b Profile Series: test configurations automatically
Step 6 Re-profile and iterate
Step 6b Visual baselines and Source Comparison
PART III Advanced Topics
PART IV Common Pitfalls
ANNEX A Reference commands
ANNEX B Ridge-point tables (FP32, FP16, Tensor)
ANNEX C Video resources and documentation
ANNEX D Cluster practical notes (VNC, X11, modules)
Always compile with optimizations and line information. Never use the -G flag (device debug) for profiling: it disables compiler optimizations and makes every measured metric unrepresentative of the code you actually ship.
nvcc -O3 -arch=sm_90 --generate-line-info -Xptxas -v mykernel.cu -o mykernel
The -Xptxas -v flag prints register usage and any spilling at compile time. The output shows the number of registers per thread, static shared memory bytes, and spill stores/loads if any.
For CMake:
# In CMakeLists.txt:
set(CMAKE_CUDA_ARCHITECTURES 90)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --generate-line-info -Xptxas -v")
# Build:
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release ..
ninja
Use the exact architecture of the GPU being profiled: sm_70 (Volta), sm_75 (Turing), sm_80 (A100), sm_86 (RTX 3090), sm_89 (RTX 4090), sm_90 (H100), sm_100 (B200). A wrong architecture produces PTX code that gets re-compiled at launch (JIT) with potentially sub-optimal choices.
The flag -Xptxas -dlcm=ca forces L1 caching for global loads (default on most architectures). The flag -Xptxas -dlcm=cg disables L1 caching and only caches in L2, useful for streaming patterns where L1 does not help. Test both and measure.
Load the appropriate CUDA module (module load cuda/12.x). For Nsight Compute or Nsight Systems GUI, prefer VNC on the compute node (better responsiveness than X11).
VNC procedure: launch vncserver on the compute node (set a password the first time), create an SSH tunnel with ssh -L 5902:localhost:5901 user@node, then connect with TigerVNC Viewer on localhost:5902.
X11 alternative: install MobaXterm (Windows), XQuartz (Mac), or native X server (Linux) and use ssh -Y user@cluster. Latency is higher than VNC, especially for heavy GUIs.
GPU profiling always starts at the system level. There is no point optimizing a kernel if the GPU spends half its time waiting for data from the CPU, or if memory transfers are not overlapped with compute.
Note: Sergey Mashenko (SHARCNET) offers a nuance. For a brand-new kernel whose memory layout is not yet validated, starting directly with Nsight Compute can be relevant to verify coalescing and structure before integrating the kernel into a pipeline. For an existing kernel in a complete pipeline, always start with Nsight Systems.
Without NVTX markers, the Nsight Systems trace is a succession of anonymous colored bars. With markers, each pipeline phase is visually identifiable.
In C/C++:
#include <nvtx3/nvToolsExt.h>
nvtxRangePush("data_loading");
// ... data loading ...
nvtxRangePop();
nvtxRangePush("forward_pass");
myKernel<<<grid, block, 0, stream>>>(args);
nvtxRangePop();
In Python (PyTorch):
torch.cuda.nvtx.range_push("forward")
output = model(input)
torch.cuda.nvtx.range_pop()
NVTX markers also serve as filters in Nsight Compute (see --nvtx-include later).
nsys profile --trace=cuda,nvtx,osrt --output=baseline ./myapp
For multi-GPU or DDP, add nccl:
nsys profile --trace=cuda,nvtx,osrt,nccl -o multi_gpu ./myapp
To limit capture to a specific NVTX range:
nsys profile --capture-range=nvtx --nvtx-capture=main_loop -o focused ./myapp
Open the .nsys-rep file in nsys-ui.
Before even opening the timeline visually, you can ask Nsight Systems to produce an automatic diagnostic report. This report detects the most common performance anti-patterns (GPU idle, non-overlapped memcpy, excessive synchronization) and presents them as text with recommendations.
nsys analyze baseline.nsys-rep
The cost is zero (a few seconds) and it immediately orients the diagnosis. The report can also be exported as JSON for CI/CD integration:
nsys analyze --format json baseline.nsys-rep > analysis.json
You can also generate specific statistical reports:
nsys stats --report gpukernsum baseline.nsys-rep
nsys stats --report gpumemsizesum baseline.nsys-rep
nsys stats --report cudaapisum baseline.nsys-rep
The nsys analyze report is a complement, not a replacement for the timeline. It gives you in 10 seconds a list of candidate problems that you will then confirm or refute visually. The benefit is not missing a pattern you would not have spotted by eye.
The Nsight Systems trace tells a story. Each visual pattern has a precise meaning.
GPU idle (white bands). The GPU is waiting. Look at what is happening on the CPU side at the same time. Common causes: synchronous data loading, non-pinned memory (munmap/mmap visible on the osrt row), blocking CPU post-processing, excessive synchronization (cudaDeviceSynchronize between every kernel).
Sequential memcpy then compute with no overlap. The program is not using multiple CUDA streams. Copy and compute are serialized when they could overlap. In the trace, blue bars (memcpy) and green bars (kernel) follow each other with no overlap.
Dominant NCCL block. Inter-GPU communication is the bottleneck. Check topology with nvidia-smi topo -m. If GPUs communicate via PCIe rather than NVLink, gradient compression can help. Caution: compression is beneficial on PCIe but can hurt on NVLink where bandwidth is already high and compression overhead dominates.
CPU work grouped at the start, GPU busy afterward. Good pattern: the CPU prepares while the GPU computes. The pipeline is healthy.
Munmap/mmap correlated with GPU idle. Host memory is not pinned. The driver must copy into an internal pinned buffer before DMA transfer, blocking overlap.
CUDA Graphs in the timeline. If the application uses CUDA Graphs, the timeline shows graph launches as unified blocks. A graph that relaunches frequently with a high launch cost (visible as a gap between the end of the previous graph and the start of the next) may indicate a granularity problem. Very short graphs can have a relatively high launch overhead.
Slow data loading. Increase num_workers, enable pin_memory=True and persistent_workers=True in PyTorch DataLoader. Use non-blocking copies (.to(device, non_blocking=True)) with separate CUDA streams for copy and compute. Typical result: 2x throughput gain.
Non-pinned memory. Replace cudaMalloc + cudaMemcpy with cudaMallocHost (or cudaHostAlloc) for host memory. Non-pinned memory prevents memcpy/kernel overlap because the driver must first copy into an internal pinned buffer.
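As an illustration of the pinned-memory and overlap pattern, here is a minimal CUDA C++ sketch. The process kernel, buffer names, and launch parameters are hypothetical; the point is that cudaMemcpyAsync on a separate stream can only overlap the kernel if the host buffer came from cudaMallocHost.
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real compute phase.
__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void copy_and_compute(float* h_next, float* d_next, float* d_current, int n) {
    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    // h_next must come from cudaMallocHost (pinned); with pageable memory the
    // driver stages through an internal buffer and the copy cannot overlap.
    cudaMemcpyAsync(d_next, h_next, n * sizeof(float),
                    cudaMemcpyHostToDevice, copy_stream);

    // Compute on the current batch while the next batch is being transferred.
    process<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_current, n);

    cudaStreamSynchronize(copy_stream);
    cudaStreamSynchronize(compute_stream);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
}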
Blocking post-processing. Move post-processing to separate workers, use pre-allocated buffers, and non-blocking copies with CUDA events for synchronization. Typical result: 4x gain on post-processing.
Inter-GPU communication. Check NVLink vs PCIe. Adjust bucket_cap_mb (PyTorch DDP, between 25 and 200 MB). Test gradient compression (FP16, BF16, PowerSGD) only if communication is the limiting factor.
nsys stats --report gpukernsum baseline.nsys-rep
This command produces a table sorted by cumulative time. The kernel at the top of the list is the priority target.
Re-capture after system fixes. If the GPU is busy most of the time and a single kernel accounts for more than 30% of total GPU time, move to Part II.
In recent versions, Nsight Compute integrates a System Trace activity directly. The workflow: in Nsight Compute UI, choose the System Trace activity at launch. Once the timeline is collected, identify the target kernel directly in the timeline. Right-click the kernel and choose Profile Kernel. A new profiling activity opens, pre-configured to profile exactly that kernel.
This workflow avoids manual round-trips between nsys and ncu. It is particularly useful when you do not know the exact kernel name (template kernels with complex demangled names) or when you want to profile a specific instance of a kernel among hundreds of launches.
Before opening Nsight Compute, you must answer two questions on paper. This is the "napkin math": a sanity check of data transferred and arithmetic intensity that tells you in advance what kind of bottleneck to expect.
First question: what is the theoretical arithmetic intensity of my algorithm?
This is the ratio between the number of compute operations and the number of bytes the algorithm must transfer at minimum. This is not a measurement of the kernel; it is a property of the algorithm itself.
Intensity = theoretical FLOP / minimum bytes transferred
For a GEMM multiplying an M×K matrix by a K×N matrix in float32:
FLOP = 2 * M * K * N (1 multiply + 1 add per multiply-accumulate)
Bytes = (M*K + K*N + M*N) * 4 (read A, read B, write C)
Intensity = 2*M*K*N / ((M*K + K*N + M*N) * 4)
For M=N=K=4096, the intensity is approximately 682 FLOP/byte.
For a reduction (sum of an array of N floats):
FLOP = N
Bytes = N * 4
Intensity = 0.25 FLOP/byte
Second question: should my algorithm be compute-bound or memory-bound?
Compare the intensity to the ridge point of your GPU (see Annex B for complete tables).
Ridge point = Peak FLOPS / Peak bandwidth
If your algorithm's intensity is above the ridge point, the algorithm should be compute-bound. If below, it should be memory-bound.
Why this is essential. If your algorithm has an intensity of 682 (GEMM) but Nsight Compute says the kernel is memory-bound, there is a serious efficiency problem. The kernel is transferring far more data than it should. This diagnosis is impossible without having calculated the theoretical intensity beforehand.
Conversely, if your algorithm has an intensity of 0.25 (reduction) and the kernel is memory-bound, that is normal and expected. Trying to make it compute-bound makes no sense.
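To make the napkin math concrete, here is a small host-side C++ helper that compares theoretical intensities to a ridge point. The peak numbers are illustrative placeholders (roughly an H100 SXM in FP32); substitute your GPU's values from Annex B.
#include <cstdio>

int main() {
    const double peak_flops = 67e12;    // FLOP/s, assumed example value
    const double peak_bw    = 3.35e12;  // bytes/s, assumed example value
    const double ridge      = peak_flops / peak_bw;   // ~20 FLOP/byte

    const double M = 4096, N = 4096, K = 4096;
    const double gemm = (2.0 * M * N * K) / ((M * K + K * N + M * N) * 4.0); // ~682

    const double n = 1 << 24;
    const double reduction = n / (n * 4.0);                                  // 0.25

    printf("ridge point      : %6.1f FLOP/byte\n", ridge);
    printf("FP32 GEMM 4096^3 : %6.1f FLOP/byte -> %s\n", gemm,
           gemm > ridge ? "expect compute-bound" : "expect memory-bound");
    printf("FP32 reduction   : %6.2f FLOP/byte -> %s\n", reduction,
           reduction > ridge ? "expect compute-bound" : "expect memory-bound");
    return 0;
}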
Impact of data type. For FP16, each byte carries twice as many values as FP32. The effective arithmetic intensity therefore doubles. For an FP16 GEMM, the intensity rises to approximately 1364 FLOP/byte. Additionally, Tensor Cores process FP16 operations at a much higher throughput (see Annex B FP16 and Tensor Core tables). The ridge point must be recalculated accordingly.
This technique, introduced by Paulius Micikevicius (NVIDIA, GTC 2010), is complementary to profiling. It consists of creating two modified versions of the kernel to separately measure the time spent on memory and the time spent on compute.
Memory-only version. Remove as much arithmetic as possible without changing the memory access pattern. Verify with the profiler that the number of load/store instructions is identical to the original version.
Math-only version. Remove global memory accesses. You must trick the compiler so it does not remove dead code: store the result inside a conditional that never executes but whose condition depends on the computed value and a parameter unknown to the compiler.
__global__ void kernel_math_only(..., int flag) {
    // ... all the computation ...
    value = temp + coeff * vsq;
    if (1 == value * flag)
        g_output[out_idx] = value;
}
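The memory-only variant is symmetric: keep the loads and stores, strip the arithmetic. A hedged sketch in the same style (g_input, in_idx, and out_idx stand for the original kernel's pointers and indexing; keep the access pattern identical and confirm with the profiler that the load/store instruction counts match the original):
__global__ void kernel_mem_only(const float* g_input, float* g_output, ...) {
    // Same indexing, same number and width of LDG/STG as the original kernel;
    // the arithmetic is reduced to a pass-through so the measured time is
    // dominated by memory traffic.
    float value = g_input[in_idx];
    g_output[out_idx] = value;
}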
Watch occupancy. Removing code changes the register count and therefore occupancy. If occupancy changes, the timings are no longer comparable. After modification, check occupancy with the profiler. If needed, add artificial shared memory at launch to force the same occupancy:
kernel<<<grid, block, extra_smem, stream>>>(...)
Interpretation. Compare the full-kernel time to mem-only and math-only:
If full is close to max(mem, math), the mem/math overlap is good and the kernel is bound by whichever is slower. If full is significantly larger than max(mem, math), overlap is poor and latency is a problem. The sum mem+math compared to full gives the degree of overlap. If the sum is close to full, there is virtually no overlap.
This technique gives a diagnosis that the profiler does not provide directly in this form. It is particularly useful for latency-bound kernels where the profiler shows low throughputs everywhere without clearly indicating the cause.
Warm-up matters: skip the first executions where caches and TLB are not yet warm.
ncu --kernel-name "myKernel" --launch-skip 5 --launch-count 1 \
--set full -o baseline ./myapp
Useful filtering options: --kernel-name matches an exact name by default and accepts regular expressions with the regex: prefix (--kernel-name "regex:kernel(A|B)"). For MPI applications, use --target-processes all. To limit the number of profiled kernels, -c N. To select specific sections instead of the full set: --section ComputeWorkloadAnalysis --section MemoryWorkloadAnalysis (balance between coverage and profiling time).
List available sections and sets:
ncu --list-sections
ncu --list-sets
Prefer Profile mode (complete collection, post-hoc analysis) over Interactive mode for multiple kernels. Profile mode generates an .ncu-rep file that can be analyzed later in the UI without relaunching the application.
Where to look. Open the .ncu-rep file in ncu-ui. The Details page is shown by default. The first section is GPU Speed Of Light Throughput. It contains a bar chart showing SM Throughput (%) and Memory Throughput (%) side by side. Below the chart, the Throughput Breakdown details the sub-metrics of each category.
Read the automatic rule. At the bottom of the Speed Of Light section, one or more rules display automatically (if not yet applied, click Apply Rules at the top of the Details page). The bottleneck rule directly tells you whether the kernel is compute-bound, memory-bound, or latency-bound, and directs you to the next section to inspect with a clickable link. For example:
"This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons."
This is your starting point. The rule does the first triage for you.
Verify and understand the diagnosis. To understand why the rule reaches its conclusion and for ambiguous cases where it is wrong, use the following thresholds on the numbers displayed in the section:
| SM % | Mem % | Verdict |
|---|---|---|
| > 60 | < 60 | Compute-bound |
| < 60 | > 60 (DRAM > 60) | Memory-bound DRAM |
| < 60 | > 60 (DRAM < 30) | Internal congestion |
| > 60 | > 60 | Balanced |
| < 40 | < 40 | Latency-bound |
The internal congestion case (Memory high, DRAM low) is the most frequent in tiled kernels. It means the internal memory pipeline (L1, shared memory, L2) is saturated but VRAM is not the problem. This is typically a symptom of bank conflicts, excessive wavefronts, or a volume of internal memory instructions too high relative to compute instructions.
Understanding "Memory Throughput". The percentage shown in Speed-of-Light is not DRAM alone. It is the maximum across all memory subsystems (L1, L2, DRAM, shared). To see the breakdown, click the Throughput Breakdown in the same section. Each sub-metric is listed with its percentage. This is essential to understand which subsystem contributes to the displayed number.
Consistency check with Step 0. If the theoretical intensity is above the ridge point but the kernel is memory-bound, the kernel is inefficient: it wastes bandwidth. If the intensity is below the ridge point but the kernel is compute-bound, the kernel is inefficient: it performs too many unnecessary computations.
The throughput trap. A faster kernel can display lower throughput. This is normal: throughput measures resource utilization, not algorithmic efficiency. If the kernel is 2x faster but uses 2x less bandwidth (better algorithm), throughput drops while actual performance increases. Speed-of-Light does not judge algorithmic efficiency, only resource utilization. Always compare execution time (shown in the result header: gpu__time_duration.sum) alongside the percentages.
Where to find it. In recent versions, Roofline sections (FP32, FP16, FP64, Tensor) are included in the full set. On the Details page, look for the Roofline Chart section. If it does not appear, verify you profiled with --set full. Rooflines can also be collected individually with --section SpeedOfLight_RooflineChart.
How to read it. The X-axis is measured arithmetic intensity (FLOP/byte), the Y-axis is measured performance (GFLOP/s). The horizontal ceiling is peak compute. The sloped ceiling is peak memory. The ridge point is the intersection. The kernel appears as a dot on the chart.
If the dot is on or near the sloped ceiling, the kernel is using bandwidth well and is memory-bound as expected. If the dot is on or near the horizontal ceiling, the kernel is using compute well. If the dot is far below both ceilings, the kernel is latency-bound. This is the same information as the threshold table in Step 1, but visualized.
Baseline/optimized comparison. The power of the interactive roofline is in comparison. After optimizing a kernel, open both reports in the same ncu-ui instance. Use Add Baseline on the first result. On the roofline, both dots appear and you immediately see whether the optimization moved the kernel toward the ceiling (improved throughput) or to the right (improved arithmetic intensity, meaning less bandwidth waste).
Multi-kernel overlay. If a report contains multiple profiled kernels, they all appear on the same roofline. Useful for comparing different kernels and identifying which one is farthest from its ceiling and therefore most promising to optimize.
Manual verification. The ridge point shown in the roofline should match the calculation from Step 0 (Peak FLOPS / Peak bandwidth). If it does not, verify the correct data type is selected in the roofline (FP32 vs FP16 vs Tensor). Since version 2025.1, the ridge-point formula is viewable in the Metric Details tool window by hovering over the point.
You now know the classification. The Speed Of Light rule directed you to the next section. You drill into that section to find which specific component is saturated.
Where to look. Compute Workload Analysis section on the Details page. The bar chart shows the utilization rate of each pipeline. The pipeline with the highest % is the bottleneck.
| Pipeline | Typical instructions |
|---|---|
| FMA | FFMA, FMUL, FADD |
| ALU | IADD3, LEA, IMAD, LOP3 |
| Tensor | HMMA, IMMA, DMMA |
| SFU (XU) | MUFU.SIN/COS/EX2/RCP |
| FP64 | DADD, DMUL, DFMA |
If the FMA pipeline is above 80%, the kernel is at the FP32 compute ceiling. Remaining gains are small. If another pipeline (SFU, ALU) is saturated but not FMA, there are parasitic instructions consuming throughput.
Accidental FP64 trap. In Python (Numba, CuPy), literal constants like 0.5 are FP64 by default. Even a few FP64 instructions can saturate the FP64 pipeline (very slow on consumer GPUs: 1/64 of FP32 throughput on RTX 4090) and create a bottleneck. The Compute Workload Analysis section shows the per-pipeline breakdown: if FP64 appears unexpectedly, cast constants to float32 (np.float32(0.5) or float32(0.5)). In CUDA C++, use the f suffix on literals (0.5f instead of 0.5).
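A minimal CUDA C++ illustration of the trap (hypothetical kernels; the only difference is the literal):
// The literal 0.5 is a double: the multiply is promoted to FP64 and lands on
// the slow DMUL pipeline, plus conversion instructions appear in the SASS.
__global__ void scale_bad(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5;
}

// With the f suffix the whole expression stays on the FP32/FMA pipeline.
__global__ void scale_good(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f;
}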
Manual throughput verification. If you need to verify the displayed number or compare with an independent theoretical calculation, the metric sm__sass_thread_inst_executed_op_ffma_pred_on.sum gives the total number of executed FFMAs (at thread level). Each FFMA = 2 FLOP (1 multiply + 1 add). Therefore:
Effective GFLOPS = (ffma_count * 2) / (kernel_duration_in_ns)
Compare with the theoretical peak of your GPU. But the Compute Workload Analysis section already displays the percentage of peak for each pipeline, so this calculation is only needed for independent verification.
Where to look. Memory Workload Analysis section on the Details page. The memory diagram (Memory Chart) shows data flows between DRAM, L2, L1/TEX, and SM with bytes and % of Peak annotated on each arrow. DRAM Throughput is displayed directly in GB/s and as a percentage of peak.
Read the rule. The rule in this section indicates whether the access pattern is problematic and can point to the offending source lines.
Quick interpretation. If % of Peak DRAM is above 75%, the kernel uses bandwidth well and remaining gains are limited. If % of Peak is below 50%, there is waste (non-coalesced accesses, excessive L2 traffic, etc.).
Manual verification. If needed:
Effective BW (GB/s) = dram__bytes.sum / gpu__time_duration.sum
But the Memory Chart already displays this value.
Where to look. Memory Workload Analysis section, Shared Memory and L1/TEX Cache tables. The Memory Chart is particularly useful here (see Step 2b) as it visually shows that traffic is concentrated in internal stages.
This is the most subtle case. DRAM is not the problem; the internal pipeline is.
First indicator: shared loads vs global loads ratio. Look in the memory diagram at the number of shared memory requests compared to global memory requests. A ratio below 5 means poor reuse and sub-optimal tiling. Between 5 and 20 is normal for a tiled kernel. Above 20, the volume of shared instructions is very high. A ratio above 20 with low DRAM means the kernel spends most of its time on LDS/STS instructions instead of FFMA. The LSU pipeline is saturated by internal traffic.
Second indicator: bank conflicts. In the Shared Memory table, the reliable method for measuring bank conflicts is to look at the L1 Wavefronts Shared and L1 Wavefronts Shared Ideal columns in the Source tab. The hardware metric l1tex__data_bank_conflicts_pipe_lsu_mem_shared* is not a pure bank-conflict counter: it includes other arbitration cycles and can overestimate the actual number of conflicts. This is confirmed by the NVIDIA Nsight Compute team. The correct method is to sum L1 Wavefronts Shared Excessive in the Source view, which represents the difference between actual and ideal wavefronts for each instruction.
If the Wavefronts/Requests ratio on shared instructions is greater than 1, each shared access takes multiple passes. A ratio of 3 means shared throughput is divided by 3.
Detailed wavefront interpretation (confirmed by NVIDIA forums). For an LDS.128 instruction (32 threads x 16 bytes = 512 bytes), the theoretical minimum is 4 wavefronts (512 / 128 bytes per wavefront). If Nsight Compute reports 32 wavefronts, there are 28 excessive wavefronts, which corresponds exactly to the bank-conflict counter. When wavefronts are at the theoretical minimum, there is no bank conflict even if multiple threads access the same bank (because full bandwidth is used). A bank conflict is defined as additional wavefronts beyond the theoretical optimum, not simply the fact that multiple threads access the same bank.
Third indicator: FMA vs total instructions ratio. Instruction Statistics section on the Details page. The percentage of instructions that are FFMA is displayed. If the ratio is below 30% in a kernel that is supposed to be compute-intensive (GEMM), the compute units are starved by internal memory instructions.
The Speed Of Light rule directs you to Scheduler Statistics and Warp State Statistics with a clickable link. Follow it.
Scheduler Statistics section. Look at No Eligible (% of cycles where no warp is ready), displayed directly in the diagram. If above 30%, the scheduler spends its time waiting. Also look at Eligible Warps Per Active Cycle. If below 1, there are not enough ready warps to hide latency.
Read the Issue Slot Utilization rule. This rule, in the Scheduler Statistics section, gives a precise diagnosis. It indicates how many cycles elapse on average between two instruction issues per scheduler, how many warps are active vs eligible, and suggests causes. For example: "Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 6.8 cycles."
Warp State Statistics section. Look at the dominant stall reason in the bar chart. Hover over each bar to see the numeric value and the metric definition. Two columns matter: "All stalls" (all stall cycles) and "Not-issued" (stalls that actually reduce throughput because no eligible warp could be scheduled). Focus on "Not-issued" for actionable optimizations.
Read the CPI Stall rule. The rule at the bottom of this section identifies the dominant stall and directly gives the percentage of lost cycles and the cause. For example: "On average each warp of this kernel spends 82.8 cycles being stalled waiting for the local/global instruction queue to be not full. This represents about 75.9% of the total average of 109.1 cycles between issuing two instructions." This is the information you would manually extract from the chart bars, but the rule phrases it in natural language and suggests actions.
Stalls and their meaning:
stall_long_scoreboard: waiting for memory data (LDG/TEX). HBM latency 200-500 cycles, L2 100-200 cycles. Points to a memory-latency problem.
stall_short_scoreboard: compute dependency chain or shared-memory wait (20-30 cycles). Caution: this category also includes everything going through the MIO pipeline: S2R (variable latency, around 20 cycles), CS2R (fixed latency), and DEPBAR injected by the compiler. On Hopper, if you see unexpected short-scoreboard stalls on a CS2R instruction or after a DEPBAR, it is often the compiler inserting an implicit dependency barrier, not a shared-memory problem.
stall_barrier: excessive __syncthreads() cost. If this stall dominates, reduce the number of barriers, group more work between barriers, or replace with __syncwarp() when only intra-warp synchronization is needed (a sketch follows this list).
stall_not_selected: warps are eligible but the scheduler alternates. Good sign: indicates sufficient occupancy to hide latency.
stall_mio_throttle / stall_lg_throttle: memory instruction queues are full. Too many load/store instructions are issued without compute between them. Solutions: interleave compute, use wider loads (LDS.128 instead of LDS.32), or reduce total access count.
stall_math_pipe_throttle: the compute pipeline is saturated. In principle positive (compute-bound), but can indicate an imbalance if combined with low SM%.
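When stall_barrier dominates and the producing and consuming threads are guaranteed to belong to the same warp, the block-wide barrier can be narrowed to a warp-level one. A hedged sketch (hypothetical kernels, assumed to be launched with 256 threads per block):
// Block-wide barrier: correct, but all 8 warps wait on each other even though
// each warp only reads back data written by its own lanes.
__global__ void exchange_block_barrier(float* out, const float* in) {
    __shared__ float row[8][32];                 // one row per warp
    int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    row[warp][lane] = in[gid];
    __syncthreads();
    out[gid] = row[warp][31 - lane];
}

// Warp-level barrier: producer and consumer share a warp, so __syncwarp() is
// sufficient and the block-wide stall disappears.
__global__ void exchange_warp_barrier(float* out, const float* in) {
    __shared__ float row[8][32];
    int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    row[warp][lane] = in[gid];
    __syncwarp();
    out[gid] = row[warp][31 - lane];
}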
Occupancy section. Compare Theoretical Occupancy and Achieved Occupancy, displayed directly in the section. If theoretical is low (below 50%) and achieved is also low, the problem is too many registers or too much shared memory per block. If theoretical is good (above 50%) but achieved is low, the grid is too small. If both are good, occupancy is not the problem.
If theoretical occupancy is limited by registers, note the number of registers per thread (shown in the Launch Statistics section). On all architectures since Turing, each SM has 65,536 32-bit registers split across 4 sub-partitions (16,384 per sub-partition). The maximum warps per sub-partition is:
max_warps_per_sub = 16384 / (registers_per_thread * 32)
With 128 registers/thread: 16384 / (128 * 32) = 4 warps per sub-partition, or 16 per SM out of a max of 64 (Hopper). That is 25% occupancy.
Caution: more occupancy is not always better. A classic case from the NVIDIA forums (A100, CUDA 12.2) shows a kernel with 4x more occupancy running 60% slower because migrating private data to shared memory multiplied memory instructions by 6.5x, creating MIO/LG Throttle saturation. Occupancy is a means of hiding latency, not a goal in itself. A kernel with few warps but a high FFMA/instruction ratio can be faster than a kernel with many warps drowning in memory instructions.
Where to find it. The Memory Chart is the diagram in the Memory Workload Analysis section. It visualizes data flows between physical units: DRAM, L2, L1/TEX, shared memory, and SM.
How to read it. Each arrow between two units shows the volume of data transferred (in bytes or sectors) and throughput (in % of Peak). Thick arrows with high % of Peak indicate saturated paths. Thin or absent arrows indicate under-utilized paths. Since version 2024.4, inactive elements can be hidden to simplify reading. Since version 2025.1, the chart supports zoom and pan.
Why it is useful. The Memory Chart gives in 2 seconds the diagnosis that takes 5 minutes with raw metrics. If the DRAM-to-L2 arrow is thick with high % of Peak, the kernel is DRAM-bound. If the L1-to-SM arrow is thick but the DRAM-to-L2 arrow is thin, traffic is internal (bank conflicts, excessive LDS volume). If the shared-memory-to-SM arrow dominates everything, the kernel is drowning in shared instructions.
What the chart does not show. The chart shows volumes and throughputs, not causes. It tells you where traffic is concentrated, not why. For the why, you must drill into the tables of the same section and into the Source tab. The chart guides; the metrics confirm.
Metric Details interaction. Clicking an element of the chart opens the Metric Details tool window with the exact metric and its value. This is the link between the visual and the number.
You know which subsystem is the bottleneck. Now you quantify the waste and estimate the potential speedup before writing any code. If the potential speedup is below 5%, it is not the right battle.
For some of these quantifications, the Details page gives the number directly. For others, a manual calculation is needed because the tool does not know the properties of your algorithm.
Where to look on the Details page. Memory Workload Analysis section, L1/TEX Cache table, Global Load row. The Sectors per Request column is displayed directly.
Read the rule. The Source Counters rule (on the Details page) identifies source lines with non-coalesced accesses and displays the ratio. For example: "The memory access pattern for loads from L1TEX to global memory is not optimal. The ratio of requested bytes to total bytes moved is 0.25."
Understanding the number. A perfect access generates 4 sectors of 32 bytes per request (128 bytes = a cache line for 32 threads * 4 bytes). Every additional sector is waste.
BW waste = (Sectors/Req - 4) / Sectors/Req
Potential speedup (if DRAM-bound) = Sectors/Req / 4
Example: Sectors/Req = 16. Waste = 75%. Potential speedup = 4x. This is huge; fix this first.
Example: Sectors/Req = 5. Waste = 20%. Potential speedup = 1.25x. Notable but not critical.
If the kernel is not DRAM-bound, the real speedup will be less than the potential because the bottleneck is elsewhere.
What the Details page gives. Sectors/Req = 4 (coalescing OK) but the total number of bytes transferred can be excessive. dram__bytes.sum is shown in the Memory Workload Analysis section.
What the Details page does not give. The tool does not know the theoretical minimum bytes your algorithm should transfer. This calculation is necessarily manual (your napkin math from Step 0).
Traffic overhead = dram__bytes.sum / theoretical_bytes
If this ratio is above 2, the kernel re-reads data it should reuse (lack of tiling) or writes unnecessary intermediate data (kernels that could be fused).
Method with the Source page. Enable the L1 Wavefronts Shared and L1 Wavefronts Shared Ideal columns. Sum L1 Wavefronts Shared Excessive (= actual - ideal) across all LDS and STS instructions. Do not use l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum or the _op_st variant because these hardware counters include arbitration cycles that are not bank conflicts. They systematically overestimate.
Average N-way = sum(Wavefronts Shared) / sum(Wavefronts Shared Ideal)
To estimate the speedup with Amdahl's formula: if the kernel spends a fraction f of its time on shared loads (estimable from Warp State Statistics, proportion of cycles in MIO Throttle or Short Scoreboard attributed to LDS) and bank conflicts multiply that time by n:
Speedup = 1 / ((1 - f) + f/n)
Concrete example: a kernel spends 60% of its time on shared loads and the bank conflicts are 3-way (each load takes 3 passes instead of 1). If the conflicts are eliminated, the shared time drops from 0.6 to 0.2, the total time from 1.0 to 0.6, and the speedup is 1.67x.
Beware of false positives (NVIDIA forums). Cases exist where the Source view shows 50% excessive wavefronts on an LDGSTS instruction while the Details page shows no bank conflicts. Excessive wavefronts can also come from address divergence, not only bank conflicts. Always cross-check both views.
Details vs Source discrepancy. After optimization, the Source view may show zero excessive wavefronts while the Details page still displays non-zero bank conflicts. The remaining bank conflicts are aggregated across all shared instructions in the kernel, including those in the prologue/epilogue. If the instructions in the hot loop are at zero, the residual is generally negligible.
The shared/global ratio is high, the FMA/total ratio is low, but bank conflicts are low. The problem is the raw volume of memory instructions. This diagnosis is not displayed directly by the tool — you must calculate it.
Memory instructions per FFMA = (LDS + STS + LDG + STG) / FFMA
In a well-optimized GEMM with thread coarsening (TM=TN=8), this ratio should be approximately 0.25 (4 FFMA per LDS in the outer product). Without thread coarsening (TM=TN=1), the ratio is close to 1 or higher. Each memory instruction consumes a scheduler slot that could have been an FFMA.
The problem is not the volume of traffic but the time spent waiting for results. Typical latency by access type:
HBM = 200-500 cycles
L2 = 100-200 cycles
Shared = 20-30 cycles
Register = 4-6 cycles (FMA/ALU)
If stalls are dominated by global accesses and occupancy is low, increasing occupancy is the solution. If occupancy is already reasonable (above 50%) and stalls remain high, the problem is code structure: the compiler does not have enough independent instructions to interleave between the load and the consumption.
A case from the NVIDIA forums illustrates the diagnosis: each warp waits on average 92.7 cycles on L1TEX accesses, representing 90.4% of the average 102.5-cycle gap between instructions. The kernel is clearly latency-bound on memory.
Where to look on the Details page. Warp State Statistics section, Avg. Not Predicated Off Threads Per Warp field. The Instruction Statistics section also displays Avg. Executed Instructions vs Avg. Issued Instructions — a significant difference indicates serialization (divergence or bank conflicts).
Waste calculation:
Divergence waste = 1 - (Avg Predicated-On Threads / 32)
Example: Avg Predicated-On Threads = 24 on FFMAs. Waste = 25%. Potential speedup if divergence is eliminated = 32/24 = 1.33x.
Divergence on LDGs has a different impact than divergence on FFMAs. Inactive threads on an LDG still generate the memory transfer (inactive lanes produce useless data). Inactive threads on an FFMA do not consume additional throughput but waste compute potential.
Where to look. The Launch Statistics section on the Details page displays the number of registers per thread. The compilation output (nvcc -Xptxas -v) shows spill stores / spill loads. Since version 2025.1, the metric sass__inst_executed_register_spilling directly counts spilling instructions. In the Source view, the compiler annotates lines with spilling warnings since version 2024.4.
In Nsight Compute, look for STL and LDL instructions in the Source view. Each spill store + spill load pair adds approximately 40-100 cycles (passing through L1 in local memory). If the spilling is in the inner loop (high Instructions Executed on STL/LDL), the impact is multiplied by the number of iterations. If the spilling is in the prologue or epilogue (Instructions Executed = W), the impact is generally negligible.
You arrive in the Source tab with a precise hypothesis from Step 3. You never go there "just to look."
Shortcut from rules. Some rules on the Details page place source markers directly on the responsible SASS lines. If a rule identified a problem (non-coalesced access, bank conflict, dominant stall), click the source reference in the rule text to jump directly to the offending line in the Source tab. This is the fastest path.
Compilation prerequisites. Compile with --generate-line-info (alias -lineinfo) to have CUDA-C / PTX / SASS correlation. Profile with ncu --set full to have all source metrics. Never compile with -G for profiling (disables all device optimizations).
Hotspot navigation. Use the up/down arrows to jump between critical lines (those with the most cycles or the most excessive wavefronts). The Navigation menu at the top of the Source page lets you select the metric to navigate on (for example stall_long_sb to jump to the line with the most long scoreboard stalls). The Live Registers column shows the number of live registers at each SASS line, helping identify register pressure zones.
Before diagnosing, you must be able to navigate SASS. A kernel breaks down into identifiable phases.
The prologue: the first instructions of the kernel. You find S2R (reading threadIdx, blockIdx — variable latency, around 20 cycles), CS2R (reading special registers — fixed latency on Hopper), LDC (reading kernel arguments from constant memory), LEA and IMAD (computing base addresses). Live Registers rise progressively. Instructions Executed equals W (number of warps in the grid). This is never the hotspot.
The main loop: identifiable by a BAR.SYNC (__syncthreads) at the beginning or end, and a BRA that jumps back to a label above. Inside, you find LDG/STS (loading tiles from global memory to shared), a BAR.SYNC, then LDS/FFMA (loading from shared to registers and computing). Instructions Executed are high (W * iterations). This is where 90% of the time is spent.
The epilogue: STG instructions (writing results to global memory), then EXIT. Low Instructions Executed (W). Rarely critical unless the grid is very small.
To calculate the number of loop iterations:
W = (gridDim.x * gridDim.y * gridDim.z * blockDim.x * blockDim.y * blockDim.z) / 32
iterations = Instructions_Executed(instruction in the loop) / W
SASS instruction dictionary.
| Opcode | Pipeline | Notes |
|---|---|---|
| FFMA, FMUL, FADD | FMA | |
| HFMA2 | FP16 (2 ops/instr) | |
| HMMA, IMMA, DMMA | Tensor Core | |
| IMAD | ALU (integer multiply-add) | |
| LEA | ALU (address computation) | special case of IMAD |
| IADD3 | ALU (3-operand addition) | |
| ISETP | ALU (comparison) | |
| LOP3, SHF | ALU (logic, shift) | |
| MUFU.* | SFU (transcendentals) | sin, cos, rsqrt, exp2, log2 |
| DADD, DMUL, DFMA | FP64 | |
| LDG.E, LDG.E.64, LDG.E.128 | LSU (global load) | size varies with vectorization |
| STG.E, STG.E.64, STG.E.128 | LSU (global store) | |
| LDGSTS | LSU (async global to shared) | Ampere+, bypasses registers (cp.async) |
| LDS, LDS.64, LDS.128 | LSU (shared load) | |
| STS, STS.64, STS.128 | LSU (shared store) | |
| LDL, STL | LSU (local memory) | spills + local arrays too large for registers |
| LDC | IDC (constant memory) | |
| ATOM, RED | LSU (atomics) | |
| BAR.SYNC | CBU (__syncthreads) | |
| BRA | CBU (branch) | |
| BSSY, BSYNC | CBU (divergence) | |
| DEPBAR | CBU (async barrier) | cp.async dependencies, Ampere+ |
| EXIT | CBU (kernel end) | |
| S2R | MIO (threadIdx etc, variable latency) | |
| CS2R | MIO (special registers, fixed latency) | clock, performance counters |
| MOV | register copy | no dedicated pipeline, executed on ALU |
| PRMT | ALU (byte permutation) | useful for swizzle/byte shuffle |
| SEL | ALU (conditional select) | ternary a ? b : c |
| F2F, F2I, I2F, I2I | ALU (type conversion) | |
| SHFL | LSU (intra-warp shuffle) | exchange between threads without shared memory |
Predicates: @P0 FFMA ... means "execute if P0 is true". @!P0 LDG.E ... means "execute if P0 is false". This is the GPU's mechanism for handling conditions without branching. Predicates are set by ISETP instructions.
Memory suffixes: .E = extended addressing (64-bit). .128 = vectorized 128-bit access (float4). .U = unaligned.
The 64-bit address computation pattern. A very common sequence in any kernel is computing a 64-bit address from 32-bit components:
IMAD R6.CC, R3, R5, c[0x0][0x20]
IMAD.HI.X R7, R3, R5, c[0x0][0x24]
LD.E R2, [R6]
The first instruction computes R6 = low32(R3*R5) + low32(param), saving the carry in CC. The second computes R7 = high32(R3*R5) + high32(param) + carry. The pair R7:R6 forms the 64-bit address. The suffixes .CC (write carry) and .HI.X (upper half + carry-in) are the key to understanding this pattern. c[0x0][0x20] is typically the first kernel parameter (base pointer of an array) stored in constant memory bank 0.
To identify the structure: find all BAR.SYNC and all BRA. BAR.SYNC instructions separate phases (loading / computing). BRA instructions that jump to a label above are loops. BRA instructions that jump to a label below are conditional jumps (loop exit, boundary conditions). The mental model is:
prologue > loop(LDG/STS > BAR.SYNC > LDS/FFMA > BAR.SYNC > BRA) > epilogue > EXIT
Hypothesis: Sectors/Req > 4 on global loads.
Fast path. The Source Counters rule on the Details page often identifies the offending source lines and places focus metrics. Click the reference to jump directly to the line in the Source tab.
Detailed path. Enable the L2 Theoretical Sectors Global Excessive column in the SASS view. Sort in descending order. The LDG and STG instructions at the top of the list are the culprits.
Click on the offending LDG instruction. The CUDA-C view shows the corresponding source line. Note the address register of the instruction (for example LDG.E R10, [R16]). Trace back in the SASS to find how R16 was computed: it is a chain of LEA/IMAD/IADD3 reflecting your indexing. If the final index multiplies threadIdx.x by a stride (a form like threadIdx.x * stride + threadIdx.y), adjacent threads are separated by stride elements and are not accessing contiguous addresses.
Hypothesis: L1 Wavefronts Shared Excessive > 0.
Enable the L1 Wavefronts Shared and L1 Wavefronts Shared Ideal columns. The difference between these two columns (Excessive column) gives the exact number of additional wavefronts due to bank conflicts, instruction by instruction. Sort by L1 Wavefronts Shared Excessive in descending order.
The LDS and STS instructions at the top are the problematic accesses. For each instruction, note the base register. For example:
LDS R30, [R58+0x000] // base R58
LDS R31, [R58+0x040] // base R58
LDS R32, [R58+0x080] // base R58
LDS R40, [R59+0x000] // base R59
LDS R41, [R59+0x040] // base R59
Group the LDS by base register. Compare the excessive counts between groups. The group with the most excessive wavefronts corresponds to the shared array that has the bank conflicts.
Look at the offsets: if offsets are multiples of 0x80 (128 bytes), threads periodically fall on the same banks (shared memory has 32 banks of 4 bytes, so a 128-byte cycle). Strides of 0x40 (64 bytes = 16 banks) create systematic 2-way conflicts.
To identify which shared array is involved: trace back to the beginning of the kernel and find the STS instructions that write to the same base register. The CUDA-C correlation on those STS will tell you As[...] or Bs[...].
Hypothesis: Long Scoreboard dominates, stalls are high.
Fast path. In the Navigation menu of the Source page, select stall_long_sb then click the "jump to max" arrow. The page jumps directly to the line with the most long scoreboard stalls. This is the consumer: the instruction waiting for a result.
Detailed path. Sort by Attributed Stalls in descending order. The instruction at the top is the consumer. Look at its Scoreboard Dependencies column: it is waiting for an SBx (for example SB0).
Trace back in the SASS to find the instruction that set this SB0. It is typically an LDG located higher up. Count the number of SASS instructions between the LDG (SB set) and the consumer (SB wait). This number is the intra-warp hiding. For HBM latency (200-500 cycles), this is almost never sufficient on its own: it is occupancy (other warps take over during the wait) that does the heavy lifting.
Subtlety on stall attribution. The stall is always attributed to the instruction that waits (the consumer), not the one that produces the result (the producer). Additionally, the compiler can inject DEPBAR (dependency barriers) that do not appear in the source code. These DEPBAR create stalls that seem orphaned — they appear on an instruction with no obvious dependency. If you see a short scoreboard stall on a CS2R or just after a DEPBAR, it is probably the compiler inserting a barrier to force completion of earlier instructions. This is not a problem to optimize — it is a scheduling artifact.
To measure whether other warps effectively take over: in Scheduler Statistics, look at Issued Warp Per Active Cycle. If this value is close to 1, the scheduler always finds an eligible warp despite the stalls: inter-warp hiding is working. If this value is below 0.5, the scheduler does not have enough eligible warps.
Hypothesis: low occupancy due to registers, presence of local memory.
Look for STL (Store Local) and LDL (Load Local) instructions in the SASS view. Check their Instructions Executed. If they are in the inner loop (Instructions Executed = W * iterations), each iteration pays the cost of spilling. If they are in the prologue (Instructions Executed = W), the impact is small.
Look at the Live Registers column. The peak of live registers is where the compiler ran out of space. Typically, this is a point where many LDGs are in flight simultaneously (all destination registers are live at the same time because the data has not yet been consumed).
Since version 2024.4, the compiler directly annotates lines in the Source view with spilling warnings. Look for these annotations — they indicate the responsible variables.
Hypothesis: low Avg. Predicated-On Threads on critical instructions.
Enable the Avg. Predicated-On Threads Executed column. Look at the values on FFMAs and LDGs in the inner loop.
32 means all threads active, no problem. 28-31 means a few border threads inactive, normal. 16-27 means significant divergence to investigate. Below 16 means severe divergence.
To find the cause: look for BSSY/BSYNC in the inner loop. The zone between BSSY and BSYNC is a hardware-managed divergence zone. Trace back to the ISETP that sets the predicate used by the divergent branch.
Regex and value search. Since version 2024.4, the Source page search bar supports regular expressions and metric-value searches. You can search for all LDS instructions with more than 100 excessive wavefronts, or all FFMAs with a predicated-on below 24. This is a massive accelerator for large kernels.
Source View Profiles. You can save a set of active columns as a named profile (for example "bank_conflicts" with columns Wavefronts Shared, Wavefronts Shared Ideal, Wavefronts Shared Excessive) and apply it automatically when opening any report. This avoids reconfiguring columns every session.
Inline Tables. For kernels with heavy inlining, Inline Tables (since 2024.4) show inlined functions with hyperlinks to source line numbers and SASS addresses. This is essential when SASS-to-source correlation seems incorrect — a SASS instruction can be attributed to the wrong source line due to inlining or compiler code motion.
Reliability of SASS-to-source correlation. The CUDA-C / SASS correlation is not always reliable, especially when the compiler does aggressive inlining, code motion, or software pipelining. If a SASS instruction seems attributed to an unrelated source line, check the Inline Tables and the intermediate PTX view. PTX is often better correlated to source than the final SASS.
The tool identified the problem and the line of code. Now you fix it.
Non-coalesced accesses. Reorganize the indexing so that adjacent threads (consecutive threadIdx.x) access contiguous addresses in memory. If the indexing is of the form array[threadIdx.y * stride + threadIdx.x], the stride must be a multiple of 32 elements (128 bytes in float32). If this is impossible (imposed data structure), use shared memory as a transposition buffer.
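A hedged before/after sketch (hypothetical copy kernels over a row-major rows x cols float matrix, 2D grid):
// Bad mapping: consecutive threadIdx.x values walk down a column, so each
// thread of a warp touches a different cache line (Sectors/Req well above 4).
__global__ void copy_strided(float* dst, const float* src, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // threadIdx.x -> row
    int col = blockIdx.y;
    if (row < rows) dst[row * cols + col] = src[row * cols + col];
}

// Good mapping: consecutive threadIdx.x values read consecutive addresses, so
// a warp touches one 128-byte line per request (Sectors/Req = 4).
__global__ void copy_coalesced(float* dst, const float* src, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // threadIdx.x -> column
    int row = blockIdx.y;
    if (col < cols) dst[row * cols + col] = src[row * cols + col];
}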
Bank conflicts. Add one column of padding to the shared array (for example float As[TILE][TILE+1] instead of float As[TILE][TILE]). Alternatively, reorganize accesses so that adjacent threads access different banks.
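A hedged sketch of the fix on a shared-memory transpose tile (assumes blockDim = (32, 32) and n a multiple of TILE; only the +1 in the declaration changes):
#define TILE 32

__global__ void transpose_tile(float* out, const float* in, int n) {
    // Without the +1, the column read below has a stride of 32 floats between
    // lanes: all 32 threads hit the same bank (32-way conflict). With the +1,
    // the row stride is 33 floats and consecutive lanes land in distinct banks.
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // conflict-free stores
    __syncthreads();
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y]; // column read
}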
LSU saturation. Increase thread coarsening: each thread computes more output elements, which increases the FFMA/LDS ratio. In a GEMM, going from TM=TN=1 to TM=TN=8 reduces the memory instructions per FFMA ratio from 1+ to approximately 0.25.
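A hedged sketch of the register-tiled inner loop; the shared tiles As and Bs, their dimensions, and row0/col0 are assumed to come from the surrounding GEMM kernel:
// Per k step: 8 + 8 = 16 LDS feed 64 FFMA, i.e. 0.25 memory instructions per
// FFMA, versus roughly 1 per FFMA when each thread computes a single element.
template <int BK, int BM, int BN, int TM = 8, int TN = 8>
__device__ void outer_product(const float (&As)[BK][BM], const float (&Bs)[BK][BN],
                              float (&acc)[TM][TN], int row0, int col0) {
    float a_reg[TM], b_reg[TN];
    for (int k = 0; k < BK; ++k) {
        for (int i = 0; i < TM; ++i) a_reg[i] = As[k][row0 + i];   // 8 LDS
        for (int j = 0; j < TN; ++j) b_reg[j] = Bs[k][col0 + j];   // 8 LDS
        for (int i = 0; i < TM; ++i)
            for (int j = 0; j < TN; ++j)
                acc[i][j] += a_reg[i] * b_reg[j];                  // 64 FFMA
    }
}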
Unmasked latency. If occupancy is low due to registers, use launch_bounds or -maxrregcount to limit registers and increase occupancy. If occupancy is already reasonable, increase the number of independent accesses per thread (prefetching, multi-buffering) to give the compiler more instructions to interleave.
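For the register-limit route, a minimal example with __launch_bounds__ (hypothetical kernel; the two numbers mean at most 256 threads per block and at least 4 resident blocks per SM, so the compiler caps registers accordingly):
__global__ void __launch_bounds__(256, 4)
my_kernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // If the cap is too aggressive the compiler will spill: re-check STL/LDL
    // in the Source view after applying this attribute.
    if (i < n) out[i] = in[i] * in[i];
}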
Divergence. Reorganize data or the thread-to-data mapping so that threads within the same warp follow the same path. If divergence is unavoidable (boundary conditions), minimize it by restricting it to border warps.
Spilling. Increase the register limit if occupancy allows it. Restructure the code to reduce the number of simultaneously live variables (consume loaded data before loading new data).
The problem. Choosing the best block size, the best shared memory / L1 split, or the best number of registers per thread requires testing many combinations. Doing it manually (recompile, relaunch, re-profile for each combination) is tedious.
The solution. Profile Series lets you automatically profile the same kernel with different launch configurations in a single session. In the UI, when stopped on a kernel launch in Interactive mode, choose Profile Series from the Profile menu. A dialog opens that lets you enable and configure the parameters to vary: block size (blockDim.x, blockDim.y, blockDim.z), shared memory size, etc.
For each combination of values, a complete profile is collected. Results appear in the same report, each with a description indicating the modified parameters. You can then compare results to identify the optimal combination.
When to use it. After identifying and fixing a specific problem (Step 5), and before moving to the next iteration. Profile Series is particularly useful for kernels where block size has a strong impact on occupancy and access patterns.
After each correction, re-profile with the same collection parameters. Compare execution time (gpu__time_duration.sum) and key metrics.
The ADO cycle (Analysis-Driven Optimization). The process is cyclical: profile, identify the most important limiter, fix, re-profile, identify the new most important limiter, fix, and so on. At each iteration, the limiter changes because you solved the previous one. The process stops when the potential speedup of the next optimization is below your threshold (typically 5-10%), or when the kernel reaches a high percentage of the theoretical peak (75%+ of peak memory for a memory-bound kernel, 80%+ of peak compute for a compute-bound kernel).
Verifying optimality. For a memory-bound kernel, compare the achieved bandwidth (shown in Memory Workload Analysis) with the achievable bandwidth of your GPU. If both are close (say within 10%), the kernel is near-optimal and it is time to move on. For a compute-bound kernel, compare the achieved FMA or Tensor throughput with the peak. The NVIDIA ADO Part 3 article shows an example where the averaging kernel reaches 823 GB/s on a V100 with 740 GB/s measured by bandwidthTest — the kernel is optimal, there is nothing left to gain.
Baselines in the UI. Open the baseline report (before optimization) and the optimized report in the same ncu-ui instance. Select the baseline report, then click Add Baseline (at the top of the Details page or in the Profile menu). Switch back to the optimized report: each metric on the Details page now displays two values — the current value and the percentage change relative to the baseline. Improvements appear in green, regressions in red.
On the Roofline, both dots appear simultaneously, letting you see whether the optimization moved the kernel toward the ceiling.
To remove baselines, use Clear Baselines in the Profile menu.
Source Comparison. Since version 2025.1, the Source page supports direct SASS comparison between two results with a difference heatmap. To enable it, open both reports, set the baseline, then on the Source page of the current result, activate comparison mode. Instruction-by-instruction differences are colored and you can choose to compare by Opcode or by Full Instruction (Diff By menu).
This is the natural tool for understanding exactly what the compiler changed between two versions of your code, and for verifying that your modification has the expected effect at the SASS level.
The L2 cache is shared across all SMs. Its size varies by architecture (50 MB on H100, 40 MB on A100). A kernel with a working set larger than L2 will see frequent misses. The metric lts__t_sectors_srcunit_tex_op_read.sum in the Memory Workload Analysis section shows L2 traffic volume. Compare with DRAM traffic: a high L2 traffic / DRAM traffic ratio indicates a good hit rate. A ratio close to 1 indicates L2 is not helping for this kernel.
For kernels with a streaming pattern, disabling L1 caching with -Xptxas -dlcm=cg can reduce L1 pollution from data that will not be reused.
Atomics (ATOM, RED in SASS) are serialized when multiple threads target the same address. The Memory Workload Analysis section shows the volume of atomic transactions. If atomic throughput is the bottleneck, consider hierarchical reductions (warp-shuffle then atomic per warp instead of atomic per thread).
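A hedged sketch of the hierarchical pattern for a global sum (hypothetical kernel; out is assumed to be zero-initialized before launch):
__global__ void sum_hierarchical(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    // Warp-level tree reduction: after 5 shuffle steps, lane 0 holds the
    // warp's partial sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);

    if ((threadIdx.x % 32) == 0)
        atomicAdd(out, v);    // one atomic per warp instead of one per thread
}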
Tensor Cores are identified in SASS by HMMA (FP16), IMMA (INT8), DMMA (FP64 on A100+) instructions. The Compute Workload Analysis section shows Tensor pipeline utilization. If the code uses FP16 or BF16 but the Tensor pipeline is at 0%, operations are not mapped to Tensor Cores — verify the use of wmma or mma.sync in PTX, or cuBLAS / cuDNN APIs with the corresponding data types.
Arithmetic intensity changes dramatically with Tensor Cores. The Tensor ridge point is much higher than the FP32 ridge point (see Annex B). A kernel that is compute-bound in FP32 can become memory-bound in FP16 Tensor because peak compute explodes but bandwidth remains the same.
PM Sampling collects metrics periodically during kernel execution, producing an intra-kernel timeline. It is in the PM Sampling section of the Details page. Unlike standard metrics that are averages over the entire kernel duration, PM Sampling shows temporal evolution.
This is useful for kernels that change behavior during execution: a kernel that starts by loading data (memory-bound) then switches to a compute phase (compute-bound) will have misleading average metrics that correspond to neither phase.
The PM Sampling: Warp States section is distinct from standard PM Sampling. It periodically samples warp states (stall reasons) over the kernel's duration. You can see if a kernel starts with Long Scoreboard stalls (data loading) then switches to Math Pipe Throttle stalls (compute phase) then returns to Long Scoreboard stalls (writing results).
Metrics from different groups come from different passes. The timeline is therefore reconstructed from multiple executions of the kernel. For the data to be consistent, the kernel must be deterministic.
If some SMs finish well before others, the tail of the kernel is wasted time. This imbalance is visible in PM Sampling (some SMs become idle before others) and in the Achieved Occupancy vs Theoretical Occupancy metric if the grid is not a multiple of the number of SMs.
Nsight Compute supports four replay modes:
Kernel Replay: the default mode. The kernel is replayed multiple times, with save/restore of written memory between passes. Simple and reliable, but the save/restore overhead can be high for kernels that write a lot of memory.
Application Replay: the entire application is relaunched multiple times. No memory save/restore, but the relaunch cost is paid for each pass. The application must be deterministic. Useful for kernels that interact with the host during execution (they hang in Kernel Replay because the host does not respond to subsequent passes).
Range Replay: a range of API calls and kernels is captured and replayed. Metrics are associated with the entire range, not an individual kernel. Supports concurrent kernels and CUDA Graphs. Since version 2025.1, instruction-level source metrics are available in Range Replay.
Application Range Replay: like Range Replay but the application is relaunched for each pass instead of capturing and replaying the range. No save/restore, no limitation on supported APIs. The cost is that of relaunching the entire application.
The choice of mode depends on the kernel. For a standard isolated kernel, Kernel Replay suffices. For CUDA Graphs or concurrent kernels, Range Replay or Application Range Replay are necessary.
Nsight Compute is extensible. You can create your own sections (.section files in Protocol Buffer format) and your own analysis rules (.py files in Python).
Custom sections. Each section is a text file that defines a unique identifier, a display name, the metrics to collect, and the visualization (tables, bar charts, timelines). Files are in the sections/ directory of the Nsight Compute installation. You can modify existing sections or create new ones. Available metrics are listed by ncu --query-metrics. You can also define derived metrics (arithmetic combinations of existing metrics) directly in the section file.
Custom rules. A rule is a Python file that implements at minimum get_identifier() and apply(handle). The apply function receives an NvRules context that gives access to collected metrics. The rule can emit messages (info, warning, error), focus metrics, source markers, and speedup estimates via the Frontend interface.
Minimal rule example:
import NvRules

def get_identifier():
    return "MyCustomRule"

def get_name():
    return "My Custom Rule"

def get_description():
    return "Checks if DRAM throughput is below 50% of peak."

def apply(handle):
    ctx = NvRules.get_context(handle)
    action = ctx.range_by_idx(0).action_by_idx(0)
    dram_pct = action.metric_by_name(
        "dram__throughput.avg.pct_of_peak_sustained_elapsed"
    ).as_double()
    fe = ctx.frontend()
    if dram_pct < 50.0:
        fe.message(NvRules.IFrontend.MsgType_WARNING,
                   "DRAM throughput is only {:.1f}% of peak.".format(dram_pct))
    else:
        fe.message(NvRules.IFrontend.MsgType_MSG,
                   "DRAM throughput is {:.1f}% of peak.".format(dram_pct))
Existing rules (shipped with the tool) are in the same sections/ directory and illustrate advanced usage: tables, charts, cross-links between sections, speedup estimates.
Team use case. A team can create rules specific to their code patterns (for example: detect an FFMA/LDS ratio below 1 in GEMM kernels, alert if spilling exceeds a threshold). These rules are distributed as simple Python files and are applied automatically at each profiling session.
The Nsight Compute Python API allows you to programmatically open an .ncu-rep file and extract all metrics. This is the tool for automating data extraction, building dashboards, and doing performance regression.
import ncu_report

report = ncu_report.load_report("baseline.ncu-rep")
for range_idx in range(report.num_ranges()):
    r = report.range(range_idx)
    for action_idx in range(r.num_actions()):
        action = r.action(action_idx)
        kernel_name = action.name()
        duration = action.metric_by_name("gpu__time_duration.sum").as_double()
        sm_pct = action.metric_by_name(
            "sm__throughput.avg.pct_of_peak_sustained_elapsed"
        ).as_double()
        print(f"{kernel_name}: {duration:.2f} ns, SM {sm_pct:.1f}%")
Combined with custom NvRules, the Python API allows building a CI/CD pipeline that automatically profiles critical kernels, extracts metrics, compares them to a baseline, and fails the build if a regression is detected.
For CI/CD, the typical workflow is: profile in CLI with ncu --set full -o current.ncu-rep, then use the Python API to extract key metrics and compare them to a stored baseline. Metrics to track are typically gpu__time_duration.sum (execution time), dram__bytes.sum (memory traffic), sm__throughput.avg.pct_of_peak_sustained_elapsed (SM utilization), and any kernel-specific metric.
On Hopper and later architectures, a GPU's SMs can be partitioned between multiple CUDA contexts (Green Contexts, MIG, or MPS). When profiling on a cluster with MIG enabled, the kernel sees only a subset of SMs. This affects achievable occupancy, max warps, and effective bandwidth.
Nsight Compute (since 2025.1) displays the TPC mask in the Launch Statistics section and in the Resources window, indicating which TPCs are assigned to the context. If you profile on a MIG partition and metrics seem abnormally low, verify that you are comparing with the partition's peak, not the full GPU's.
CUDA Graphs pose specific challenges for profiling. In Kernel Replay mode, individual graph nodes can be profiled separately. In Graph mode (enabled in profiling options), the entire graph is profiled as a single entity, preserving concurrent behavior between nodes.
Graph mode is necessary when graph nodes must execute in parallel for correct functioning or performance. Instruction-level source metrics are not available in Graph mode.
In the Nsight Systems timeline, graphs appear as unified blocks. To identify which node is the bottleneck, first profile in Node mode (individual kernel) to obtain detailed metrics for each node, then in Graph mode to verify concurrent behavior.
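For orientation, a sketch of where graph nodes come from, assuming the CUDA 12 runtime API (kernel names and launch configurations are illustrative):

#include <cuda_runtime.h>

__global__ void kernA() {}
__global__ void kernB() {}
__global__ void kernC() {}

void run_graph(cudaStream_t stream)
{
    cudaGraph_t graph;
    cudaGraphExec_t exec;

    // Each kernel launched during capture becomes one node of the graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernA<<<1, 128, 0, stream>>>();
    kernB<<<1, 128, 0, stream>>>();
    kernC<<<1, 128, 0, stream>>>();
    cudaStreamEndCapture(stream, &graph);

    // In Node mode each captured kernel is profiled individually;
    // Graph mode profiles the instantiated graph as a single entity.
    cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12 signature (flags = 0)
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);
}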
Compiling with -G for profiling. The -G flag (device debug) disables all compiler optimizations. Collected metrics reflect unoptimized code, not production code. Never use -G for profiling.
Forgetting --generate-line-info. Without this flag, source/SASS correlation is not available and the Source tab is useless.
Profiling the first executions of a kernel. The first executions have cold caches and an unwarmed TLB. Use --launch-skip to skip the first iterations.
Comparing throughputs instead of times. A faster kernel can have lower throughput if the algorithm is better (transfers less data). Always compare gpu__time_duration.sum between versions.
Targeting maximum occupancy. More occupancy is not always better. If increasing occupancy forces the compiler to spill or requires migrating private data to shared memory, the cost of additional memory instructions can cancel the benefit of better latency hiding.
Using hardware bank conflict counters. The l1tex__data_bank_conflicts_pipe_lsu_mem_shared* metrics overestimate actual bank conflicts. Use excessive wavefronts in the Source view instead.
Instrumenting the kernel with timer reads. CS2R instructions for reading globaltimer or clockhi/clocklo force the compiler to insert DEPBAR that serialize the pipeline. The instrumented kernel behaves differently from the non-instrumented kernel. The additional stalls observed (short scoreboard on CS2R) are instrumentation artifacts, not problems of the original kernel.
Profiling on a MIG partition unknowingly. If the GPU is partitioned with MIG, the kernel sees only a subset of SMs and bandwidth. Throughput metrics are correct relative to the partition but are not comparable to the full GPU specs. Check the TPC mask in Launch Statistics.
Confusing the Speed Of Light Memory metric with DRAM alone. Memory Throughput in SOL is the max of all memory subsystems (L1, L2, DRAM, shared), not DRAM only. Open the Throughput Breakdown to see the detail.
Ignoring automatic rules. The rules on the Details page are the first thing to read after profiling. They identify the dominant limiter, point to the section to inspect, and often identify the offending source line. Ignoring rules and going directly to raw metrics is a waste of time in the majority of cases.
Blindly trusting automatic rules. Rules are heuristics based on thresholds. They can be wrong in edge cases (kernel balanced between compute and memory, kernel with very different phases). Always verify the rule's diagnosis with your own reasoning and manual thresholds.
Nsight Systems trace capture:
nsys profile --trace=cuda,nvtx,osrt -o baseline ./myapp
Nsight Systems automatic report:
nsys analyze baseline.nsys-rep
Nsight Systems kernel stats:
nsys stats --report gpukernsum baseline.nsys-rep
Nsight Compute full profile:
ncu --kernel-name "myKernel" --launch-skip 5 --launch-count 1 \
--set full -o baseline ./myapp
Nsight Compute specific sections:
ncu --kernel-name "myKernel" --launch-skip 5 --launch-count 1 \
--section SpeedOfLight --section MemoryWorkloadAnalysis \
-o quick ./myapp
Nsight Compute NVTX filtering:
ncu --nvtx --nvtx-include "range@main_loop" \
--set full -o focused ./myapp
Nsight Compute MPI applications:
mpirun -np 4 ncu --target-processes all --set full \
-o mpi_profile ./myapp
Nsight Compute list sections and sets:
ncu --list-sections
ncu --list-sets
Nsight Compute list available metrics:
ncu --query-metrics
Nsight Compute specific metrics in CLI:
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
./myapp
FP32:
| GPU | Peak GFLOPS | Peak BW (GB/s) | Ridge (FLOP/B) |
|---|---|---|---|
| V100 SXM2 | 15 700 | 900 | 17.4 |
| A100 SXM | 19 500 | 2 039 | 9.6 |
| H100 SXM | 66 900 | 3 350 | 20.0 |
| RTX 4090 | 82 600 | 1 008 | 81.9 |
FP16 (non-Tensor):
| GPU | Peak GFLOPS | Peak BW (GB/s) | Ridge (FLOP/B) |
|---|---|---|---|
| V100 SXM2 | 31 400 | 900 | 34.9 |
| A100 SXM | 78 000 | 2 039 | 38.3 |
| H100 SXM | 133 800 | 3 350 | 39.9 |
| RTX 4090 | 165 200 | 1 008 | 163.9 |
Tensor Cores (FP16):
| GPU | Peak TFLOPS | Peak BW (GB/s) | Ridge (FLOP/B) |
|---|---|---|---|
| V100 SXM2 | 125 | 900 | 138.9 |
| A100 SXM | 312 | 2 039 | 153.0 |
| H100 SXM | 989 | 3 350 | 295.2 |
An algorithm with arithmetic intensity above the ridge point should be compute-bound. Below, it should be memory-bound. If the profiler shows the opposite, there is an efficiency problem to investigate.
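As a worked example with the H100 SXM numbers above:
ridge = peak FLOPS / peak bandwidth
FP32: 66 900 GFLOP/s / 3 350 GB/s ≈ 20 FLOP/B
FP16 Tensor: 989 000 GFLOP/s / 3 350 GB/s ≈ 295 FLOP/B
A kernel measured at 60 FLOP/B is therefore expected to be compute-bound against the FP32 peak but memory-bound against the Tensor peak.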
ADO (Analysis-Driven Optimization) series by Robert Crovella, NVIDIA:
Part 1 Preparing for Analysis: https://developer.nvidia.com/blog/analysis-driven-optimization-preparing-for-analysis-with-nvidia-nsight-compute-part-1/
Part 2 Analyzing and Improving Performance: https://developer.nvidia.com/blog/analysis-driven-optimization-analyzing-and-improving-performance-with-nvidia-nsight-compute-part-2
Part 3 Finishing the Analysis: https://developer.nvidia.com/blog/analysis-driven-optimization-finishing-the-analysis-with-nvidia-nsight-compute-part-3
NVIDIA Nsight Compute videos:
SOL Analysis: https://www.youtube.com/watch?v=uHN5fpfu8As
Guided Analysis: https://www.youtube.com/watch?v=04dJ-aePYpE
Memory Analysis: https://www.youtube.com/watch?v=GCkdiHk6fUY
Introduction to Nsight Compute (NCSA): https://www.youtube.com/watch?v=nYSdsJE2zMs
GPU Performance Analysis (CUDA Training Series): https://www.youtube.com/watch?v=nhTjq0P9uc8
Blog posts:
Using Nsight Compute to Inspect your Kernels: https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/
Presentations:
Paulius Micikevicius Analysis-Driven Optimization (GTC 2010): https://www.nvidia.com/content/gtc-2010/pdfs/2012_gtc2010.pdf
Official documentation:
Nsight Compute User Guide: https://docs.nvidia.com/nsight-compute/NsightCompute/index.html
Nsight Compute Profiling Guide: https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html
Nsight Compute Customization Guide: https://docs.nvidia.com/nsight-compute/CustomizationGuide/index.html
NvRules API Reference: https://docs.nvidia.com/nsight-compute/NvRulesAPI/index.html
Nsight Systems User Guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html
Forums:
Nsight Compute Forum: https://forums.developer.nvidia.com/c/developer-tools/nsight-compute/114
Load CUDA and tools:
module load cuda/12.x
VNC on the compute node:
# On the compute node:
vncserver :1
# On your local machine:
ssh -L 5902:localhost:5901 user@node
# Then connect TigerVNC Viewer on localhost:5902
X11 alternative:
ssh -Y user@cluster
ncu-ui # or nsys-ui
X11 latency is higher than VNC, especially for heavy interfaces like ncu-ui with large reports. VNC is preferable for extended use.
Transfer reports for local analysis:
scp user@cluster:path/to/baseline.ncu-rep ./
ncu-ui baseline.ncu-rep
This is often the most comfortable option: collect on the cluster via CLI, analyze locally in the UI.