Nsight Compute - a primer on profiling

Every time I run ncu for profiling, my head spins with the amount of data thrown at me, so here's a checklist for my future reference on how to study a profiling report.

The only things that matter are:

  • DRAM bandwidth
  • Compute throughput
  • Latency / occupancy

Profiling is just figuring out which one is saturated.

Step 1 — Find the expensive kernels

Open Summary → Duration and sort descending. Only analyze kernels that dominate runtime.

Rule of thumb:

  • focus on kernels that contribute more than 5-10% of total runtime

Ignore everything else. Optimization effort should scale with time share.
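
A quick sketch of that filter in Python, assuming the per-kernel durations have already been exported into a dict. The kernel names, the numbers, and the `dominant_kernels` helper are all made up for illustration; nothing here is an Nsight Compute API.

```python
def dominant_kernels(durations_us, min_share=0.05):
    """Return (name, share) pairs for kernels at or above min_share of total time,
    sorted by share in descending order."""
    total = sum(durations_us.values())
    if total <= 0:
        return []
    shares = {name: t / total for name, t in durations_us.items()}
    ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, share) for name, share in ranked if share >= min_share]


# Hypothetical durations in microseconds, summed over all launches of each kernel.
example = {
    "gemm_kernel": 820.0,
    "softmax_kernel": 130.0,
    "bias_add_kernel": 40.0,
    "memset_kernel": 10.0,
}

for name, share in dominant_kernels(example):
    print(f"{name}: {share:.1%}")
```

With these made-up numbers only the first two kernels clear the 5% bar, so those are the only ones worth opening in detail.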

Step 2 — Identify the bottleneck

Go to Speed Of Light Throughput

Look at:

  • Memory Throughput
  • Compute (SM) Throughput

Decision rule:

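memory-bound: Memory > 60% and Compute < 30%
compute-bound: Compute > 60%
latency/occupancy-bound: Both < 30%

These thresholds are rough. Anything that falls between them is a mixed case, so look at both sides before deciding.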
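
The decision rule written out as a small Python function, mostly so the thresholds sit in one place. The function name and the "mixed" fallback for numbers that match none of the three rules are my own additions; the inputs are the two percentages read off the Speed Of Light section.

```python
def classify_bottleneck(memory_pct, compute_pct):
    """Apply the rule-of-thumb thresholds to the two Speed Of Light percentages."""
    if compute_pct > 60:
        return "compute-bound"
    if memory_pct > 60 and compute_pct < 30:
        return "memory-bound"
    if memory_pct < 30 and compute_pct < 30:
        return "latency/occupancy-bound"
    # Assumption: the rules above leave gaps, so label everything else as mixed.
    return "mixed"


# Hypothetical readings: (memory throughput %, compute throughput %).
for mem, comp in [(85, 20), (40, 75), (15, 10), (50, 45)]:
    print(mem, comp, classify_bottleneck(mem, comp))
```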

Step 3 — Check if the hardware is already saturated

Go to Memory Workload Analysis → Max Bandwidth

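If DRAM bandwidth is already saturated, tuning inside the kernel (block size, unrolling and the like) will not help, because the kernel is already moving bytes as fast as the hardware allows.

The only improvements left: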

  • reduce memory traffic
  • fuse kernels
  • increase arithmetic intensity
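
To see why fusion helps a saturated kernel, here is a back-of-the-envelope sketch in Python. It assumes a hypothetical two-kernel elementwise chain on float32 data (`y = a*x + b`, then `z = max(y, 0)`), and assumes the intermediate `y` makes a round trip through DRAM instead of staying in cache. The helper names are mine.

```python
BYTES_PER_FLOAT = 4


def traffic_bytes(n_elements, passes):
    """DRAM bytes moved when each pass reads or writes one float32 array."""
    return n_elements * BYTES_PER_FLOAT * passes


def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved


n = 1 << 20
flops = 3 * n  # multiply + add + max per element

# Unfused: read x, write y, read y, write z -> 4 passes over memory.
unfused = traffic_bytes(n, passes=4)
# Fused: read x, write z -> 2 passes, y only ever lives in registers.
fused = traffic_bytes(n, passes=2)

print(arithmetic_intensity(flops, unfused))  # 0.1875
print(arithmetic_intensity(flops, fused))    # 0.375
```

Same FLOPs, half the bytes: for a bandwidth-bound chain, that is roughly the speedup available under these assumptions.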

Step 4 — Check occupancy / latency hiding

Go to Occupancy

Look at:

  • Achieved Occupancy
  • Waves Per SM (reported under Launch Statistics, not Occupancy)

Rules of thumb:

  • Waves per SM < 2 → latency likely exposed
  • Low occupancy → usually register or shared-memory pressure
  • High occupancy → latency is probably not the issue

If occupancy is healthy, stop worrying about block sizes.
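
A rough Python sketch of where these numbers come from. It assumes waves per SM is grid size divided by (SM count × resident blocks per SM), and it ignores register and shared-memory allocation granularity, so treat the result as an estimate. The per-SM limits are placeholder defaults and the helper names are mine; take the real values from the device properties.

```python
def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  max_threads_per_sm=2048, regs_per_sm=65536,
                  smem_per_sm=98304, max_blocks_per_sm=32):
    """Estimate resident blocks per SM and name the limiting resource.
    Default limits are placeholder values, not tied to a specific GPU."""
    limits = {
        "threads": max_threads_per_sm // threads_per_block,
        "registers": regs_per_sm // (regs_per_thread * threads_per_block),
        "blocks": max_blocks_per_sm,
    }
    if smem_per_block > 0:
        limits["shared_mem"] = smem_per_sm // smem_per_block
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter


def waves_per_sm(grid_blocks, num_sms, active_blocks_per_sm):
    """How many full rounds of resident blocks the grid needs."""
    return grid_blocks / (num_sms * active_blocks_per_sm)


# Hypothetical launch: 432 blocks of 256 threads, 64 registers per thread, 108 SMs.
blocks, limiter = blocks_per_sm(threads_per_block=256, regs_per_thread=64,
                                smem_per_block=0)
occupancy = blocks * 256 / 2048
waves = waves_per_sm(grid_blocks=432, num_sms=108, active_blocks_per_sm=blocks)
print(blocks, limiter, occupancy, waves)
```

In this made-up launch, registers cap the SM at 4 resident blocks (50% theoretical occupancy) and the whole grid fits in a single wave, which the rule of thumb above would flag.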

Step 5 — Check memory locality

Go to Memory Workload Analysis

Look at:

  • L2 Hit Rate
  • L1 Hit Rate

High L2 hit rate → data reuse
Low L2 hit rate → streaming workload

Streaming kernels are typically bandwidth-bound.
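
For a streaming kernel there is a simple sanity check: with no reuse, every byte crosses DRAM once, so runtime cannot drop below bytes moved divided by peak bandwidth. A small Python sketch, assuming a hypothetical float32 vector add. The peak bandwidth and the measured duration are placeholder numbers, not from any real device or report.

```python
def min_runtime_us(bytes_moved, peak_bandwidth_gb_per_s):
    """Lower bound on runtime in microseconds if every byte crosses DRAM once."""
    return bytes_moved / (peak_bandwidth_gb_per_s * 1e9) * 1e6


n = 1 << 24
bytes_moved = 3 * n * 4  # vector add: read a, read b, write c (float32)

bound_us = min_runtime_us(bytes_moved, peak_bandwidth_gb_per_s=1000.0)  # placeholder peak
measured_us = 250.0  # placeholder measured kernel duration

fraction_of_peak = bound_us / measured_us
print(f"lower bound: {bound_us:.1f} us, running at {fraction_of_peak:.0%} of peak bandwidth")
```

If the fraction is already close to 100%, the kernel itself is done and the remaining wins have to come from moving fewer bytes.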

Step 6 — Decide the optimization strategy

Once you know the bottleneck:

  1. Memory bound
     • reduce global memory passes
     • fuse kernels
     • improve coalescing
     • tile into shared memory
  2. Compute bound
     • reduce instruction count
     • increase ILP / vectorization
     • use tensor cores / faster math
  3. Latency bound
     • increase occupancy
     • reduce register usage
     • increase work per thread

Mental model

Profiling answers one question: which hardware resource is saturated?

Once that is clear, the optimization path becomes obvious. Everything else in the Nsight report is supporting evidence.

Final Checklist:

  1. Sort by duration (in Summary). Then, for the kernels with the highest duration:
  2. Determine bound type (Speed Of Light Throughput → Compute (SM) vs Memory Throughput)
  3. Check bandwidth usage (Memory Workload Analysis → Max Bandwidth)
  4. Check occupancy (Occupancy → Achieved Occupancy)
  5. Decide if optimization is algorithmic or kernel-level

(living-doc)