gfx906 KV-Cache Read/Write Kernel Study
This note benchmarks KV-cache layouts on gfx906 for decode-like kernels.
Layouts tested
- HSD: [head][seq][dim] (dim contiguous within each sequence position)
- HDS: [head][dim][seq] (seq contiguous for each dim lane)
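To make the strides concrete, here is a minimal sketch of the flat-offset formulas the two layouts imply (the seq_len / head_dim parameter names are illustrative, not taken from the benchmark code):

```cpp
#include <cstddef>

// HSD: [head][seq][dim] -- dim is contiguous within one sequence position.
inline size_t hsd_offset(size_t head, size_t seq, size_t dim,
                         size_t seq_len, size_t head_dim) {
    return (head * seq_len + seq) * head_dim + dim;
}

// HDS: [head][dim][seq] -- seq is contiguous for one dim lane.
inline size_t hds_offset(size_t head, size_t seq, size_t dim,
                         size_t seq_len, size_t head_dim) {
    return (head * head_dim + dim) * seq_len + seq;
}
```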
Measured write behavior (new-token update)
Measured on a real gfx906 GPU (float cache, dim=128, heads=32, seq=4096):
- write_hsd_x4: ~357.6 GB/s
- write_hsd_x1: ~357.6 GB/s
- write_hds_x4: ~54.4 GB/s
- write_hds_x1: ~14.0 GB/s
Takeaway:
- For decode token writes, HSD is dramatically better than HDS. HDS writes are highly strided and expensive (see the sketch below).
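A minimal HIP sketch of the new-token write under HSD with the x4 path; kernel and parameter names are illustrative (not the benchmark's own), and it assumes head_dim is a multiple of 4 and a 16-byte-aligned cache allocation:

```cpp
#include <hip/hip_runtime.h>

__global__ void write_new_token_hsd_x4(float* __restrict__ kv,        // [heads][seq_len][head_dim]
                                        const float* __restrict__ src, // [heads][head_dim] (new token)
                                        int seq_len, int head_dim, int pos) {
    int head = blockIdx.x;
    int lane = threadIdx.x;            // one thread per float4 chunk of dim
    if (lane < head_dim / 4) {
        const float4* s = reinterpret_cast<const float4*>(src + (size_t)head * head_dim);
        float4* d = reinterpret_cast<float4*>(kv + (size_t)(head * seq_len + pos) * head_dim);
        d[lane] = s[lane];             // contiguous store, maps to global_store_dwordx4
    }
}
// Launch: write_new_token_hsd_x4<<<num_heads, head_dim / 4>>>(kv, src, seq_len, head_dim, pos);
// Under HDS the same update scatters: element d of the new token lands at
// kv[(head * head_dim + d) * seq_len + pos], so consecutive d are seq_len floats
// apart -- the strided pattern behind the collapsed HDS write bandwidth.
```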
Measured read behavior depends on traversal pattern
A) Dot-style decode traversal (per-seq dot over dim)
Kernel pattern: each block handles one (head, seq) row and threads span dim.
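A minimal HIP sketch of this traversal over an HSD cache; names and the naive shared-memory reduction are illustrative, and it assumes head_dim % 4 == 0 with blockDim.x == head_dim / 4 (a power of two, at most 256 here):

```cpp
#include <hip/hip_runtime.h>

__global__ void dot_read_hsd_x4(const float* __restrict__ k_cache, // [heads][seq_len][head_dim]
                                const float* __restrict__ q,       // [heads][head_dim]
                                float* __restrict__ scores,        // [heads][seq_len]
                                int seq_len, int head_dim) {
    int head = blockIdx.y;
    int seq  = blockIdx.x;
    int lane = threadIdx.x;

    const float4* krow = reinterpret_cast<const float4*>(
        k_cache + (size_t)(head * seq_len + seq) * head_dim);
    const float4* qrow = reinterpret_cast<const float4*>(q + (size_t)head * head_dim);

    float4 kv = krow[lane];            // contiguous load, maps to global_load_dwordx4
    float4 qv = qrow[lane];
    float partial = kv.x * qv.x + kv.y * qv.y + kv.z * qv.z + kv.w * qv.w;

    __shared__ float red[256];         // blockDim.x <= 256 assumed
    red[lane] = partial;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (lane < stride) red[lane] += red[lane + stride];
        __syncthreads();
    }
    if (lane == 0) scores[(size_t)head * seq_len + seq] = red[0];
}
// Launch: dim3 grid(seq_len, num_heads);
//         dot_read_hsd_x4<<<grid, head_dim / 4>>>(k_cache, q, scores, seq_len, head_dim);
```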
- read_dot_hsd_x4: ~1.76 TB/s
- read_dot_hds_x4: ~0.37 TB/s
Takeaway:
- For attention-score style decode reads, HSD is the right layout.
B) Dim-fixed streaming over seq
Kernel pattern: each thread keeps a fixed dim lane and streams over seq.
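A minimal HIP sketch of this pattern over an HDS cache; names are illustrative, the plain sum stands in for whatever per-lane math a real kernel would do, and it assumes seq_len % 4 == 0 with a 16-byte-aligned cache allocation:

```cpp
#include <hip/hip_runtime.h>

__global__ void stream_read_hds_x4(const float* __restrict__ kv, // [heads][head_dim][seq_len]
                                   float* __restrict__ out,      // [heads][head_dim]
                                   int seq_len, int head_dim) {
    int head = blockIdx.x;
    int dim  = threadIdx.x;            // fixed dim lane for this thread
    if (dim >= head_dim) return;

    const float4* lane = reinterpret_cast<const float4*>(
        kv + (size_t)(head * head_dim + dim) * seq_len);

    float acc = 0.0f;
    for (int s4 = 0; s4 < seq_len / 4; ++s4) {
        float4 v = lane[s4];           // contiguous in seq for this thread
        acc += v.x + v.y + v.z + v.w;
    }
    out[(size_t)head * head_dim + dim] = acc;
}
// Launch: stream_read_hds_x4<<<num_heads, head_dim>>>(kv, out, seq_len, head_dim);
```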
- read_hsd_x1: ~45.0 GB/s
- read_hsd_x4: ~41.5 GB/s
- read_hds_x4: ~73.7 GB/s
Takeaway:
- If the kernel is explicitly dim-fixed streaming over the sequence, HDS can be better.
ISA mapping confirmed
Disassembly confirms the expected scalar and vector memory paths:
- scalar read/write: global_load_dword, global_store_dword
- vector read/write: global_load_dwordx4, global_store_dwordx4
Recommended default for LLM decode kernels on gfx906
- Keep the canonical KV layout as HSD ([head][seq][dim]).
- Use x4 vectorized loads/stores when naturally aligned (see the sketch after this list).
- Optimize decode math kernels around dot-style traversal (per-seq rows), where HSD is strong.
- Only use HDS when a specific kernel is dim-fixed seq-streaming and dominates runtime.
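For the "x4 when naturally aligned" point, a small host-side guard can capture what that means in practice; the function name is illustrative and would gate dispatch between the x4 and scalar variants of a kernel:

```cpp
#include <cstdint>

inline bool can_use_x4(const float* base, int head_dim) {
    // float4 (dwordx4) accesses need a 16-byte-aligned base address and a dim
    // that splits into whole float4 chunks.
    return (reinterpret_cast<uintptr_t>(base) % 16 == 0) && (head_dim % 4 == 0);
}
```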
Practical implication
For general-purpose LLM kernels, optimizing around HSD gives strong read+write balance.
HDS is a specialized alternative, not a universal default.
References
- LLVM gfx906 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
- LLVM GFX9 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX9.html