float2 vs float4 in GPUs
Continuing my journey through llm.c, I saw this in train_gpt2_fp32.cu, lines 74-75:
// use of float4 leads to using 128-bit LDG / STG instructions in SASS,
// very helpful in memory-bound kernels like encoder_forward
At first glance, this seems counterintuitive: why is float2 (64 bits) slower than float4 (128 bits)? Shouldn't the bigger data transfer take more time?
why is float4 more efficient than float2?
Part 1
Found some answers here:
- Stack Overflow: memory controller / vector transaction
At the instruction level, a multi-word vector load or store only requires a single instruction to be issued, so the bytes per instruction ratio is higher and total instruction latency for a particular memory transaction is lower. At the memory controller level, a vector sized transaction request from a warp results in a larger net memory throughput per transaction, so the bytes per transaction ratio is higher. Fewer transaction requests reduces memory controller contention and can produce higher overall memory bandwidth utilisation.
Let’s break this down.
But before that, a simple legend:
- float = 32 bits = 4 bytes
- float2 = 64 bits = 8 bytes
- float4 = 128 bits = 16 bytes
The two best ways to look at this are Assembly and Memory.
- Assembly
The scheduler has to do a bunch of work to fetch, decode, and issue instructions.
// PTX Pseudo-code
// Loading 128 bits via float2 requires TWO instructions:
ld.global.v2.f32 {%r1, %r2}, [ptr]; // Load first 64 bits
ld.global.v2.f32 {%r3, %r4}, [ptr+8]; // Load second 64 bits
// Loading 128 bits via float4 requires ONE instruction:
ld.global.v4.f32 {%r1, %r2, %r3, %r4}, [ptr]; // Load all 128 bits
Every instruction I'm issuing consumes a slot in the scheduler pipeline. By using float4, I'm moving the same amount of data with half the instruction count. Issue bandwidth per warp scheduler is finite, so halving the number of load instructions directly reduces pressure on the LSU pipeline. The scoreboard tracks these as separate memory instructions, not as bytes. Also, fewer instructions mean smaller code size, putting less pressure on the Instruction Cache (I-Cache).
- Memory
If we look at it on a per-warp basis:
On Ampere-class GPUs, L2 cache lines are 128 bytes and memory transactions are internally broken into 32-byte sectors. A warp’s accesses are coalesced into these units.
For float2:
Total Data: 8 bytes × 32 threads = 256 bytes.
Transactions: In the ideal, fully coalesced case, this maps to two 128-byte cache lines (internally composed of 32-byte sectors).
To match the data volume of a single float4 load (512 bytes), we would need to issue this instruction twice.
Total Cost for 512 bytes in float2 = 2 Instructions, 4 Transactions.
For float4:
Total Data: 16 bytes × 32 threads = 512 bytes.
Transactions: In the ideal, fully coalesced case, this maps to four 128-byte cache lines (internally composed of 32-byte sectors).
Total Cost for 512 bytes in float4 = 1 Instruction, 4 Transactions.
The saturation is the same (4 transactions for 512 bytes), but the instruction overhead is halved.
The total DRAM traffic is identical. The improvement does not come from higher memory bandwidth usage per transaction, but from reducing the number of issued load instructions required to generate that traffic.
Caveat: Vectorization does not change coalescing rules. If accesses are misaligned or strided, both float2 and float4 will degrade similarly. The benefit assumes contiguous, properly aligned accesses. Proper 16-byte alignment is required for LDG.128; misalignment may force split transactions and eliminate the benefit.
At this point I kind of understood the concept, but I still needed a little more clarity: it didn't make sense that something could saturate the memory bus to the same level yet have less overhead just because the instruction count is lower.
So to answer that, I researched what instructions actually do, and here's what I found.
GPUs hide DRAM latency through memory-level parallelism: many memory requests must be in flight simultaneously.
Each SM maintains a scoreboard that tracks outstanding load/store instructions, not bytes. A warp can only have a limited number of pending memory instructions before it stalls.
If a warp can have N pending memory instructions, then:
- Using LDG.64, each instruction contributes 256 bytes per warp (8B × 32 threads).
- Using LDG.128, each instruction contributes 512 bytes per warp (16B × 32 threads).
For a fixed scoreboard limit N, LDG.128 places twice as many bytes in flight per entry as LDG.64. This increases in-flight memory per warp without increasing total DRAM traffic.
float4 is beneficial when kernels are memory-bound (like in our llm.c example) and limited by load/store issue bandwidth or outstanding memory instruction capacity. Vectorized loads reduce instruction count and increase bytes per scoreboard entry, improving in-flight memory under warp-level limits. On recent NVIDIA architectures (Volta, Turing, Ampere, Hopper), this optimization is most impactful in bandwidth-bound kernels and negligible in compute-bound ones.