gfx906 Latency-Hiding Ops (Measured)
This note summarizes instruction patterns on gfx906 (MI50/MI60) that are most useful for hiding latency in quant/dequant-style kernels.
Scope
Focus is on:
- wave-lane exchange (DPP, DS permute)
- LDS width (b32/b64/b128)
- global load width (dword vs dwordx4)
- scheduling behavior (s_waitcnt placement)
All kernels were compiled with --offload-arch=gfx906 and validated against the emitted ISA.
Key measured findings
1) Use DPP first for row-local shuffles
v_mov_b32_dpp row shift (row_shr:1) vs LDS+barrier equivalent:
- dpp_row_shr: ~1778-1784 Gxchg/s
- lds_row_shr: ~906 Gxchg/s
Takeaway: for row-local lane movement, DPP delivers roughly 2x the throughput of the LDS path and removes the barrier overhead.
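A minimal sketch of the DPP variant (not the measured kernel itself), assuming HIP/clang's __builtin_amdgcn_mov_dpp builtin; dpp_ctrl 0x111 encodes row_shr:1.

```cpp
#include <hip/hip_runtime.h>

// Sketch only: shift each lane's value one lane toward higher lane ids
// within its 16-lane row. dpp_ctrl 0x111 = row_shr:1, row_mask/bank_mask
// 0xF = all rows/banks active, bound_ctrl = true writes 0 into lanes that
// have no source lane (lane 0 of each row).
__device__ int row_shr1(int v) {
    return __builtin_amdgcn_mov_dpp(v, 0x111, 0xF, 0xF, true);
}
```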
2) Use ds_bpermute_b32 for general in-wave exchange
XOR-neighbor exchange benchmark:
- ds_bpermute_b32: ~962-970 Gxchg/s
- LDS store+load+barriers equivalent: ~905-907 Gxchg/s
Takeaway: ds_bpermute_b32 is consistently better than LDS exchange when the shuffle pattern is not DPP-friendly.
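As a sketch (not the benchmark kernel itself), the XOR-neighbor exchange can be written with the clang builtin __builtin_amdgcn_ds_bpermute and HIP's __lane_id() helper; ds_bpermute addresses lanes in bytes, so the source lane id is scaled by 4.

```cpp
#include <hip/hip_runtime.h>

// Sketch only: each lane reads the value held by the lane whose id differs
// in the bits set in xor_mask. ds_bpermute_b32 is a "read from lane" op,
// hence the << 2 to form a byte address.
__device__ int xor_exchange(int v, unsigned xor_mask) {
    unsigned src_lane = __lane_id() ^ xor_mask;
    return __builtin_amdgcn_ds_bpermute((int)(src_lane << 2), v);
}
```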
3) Prefer wide LDS ops for staging
Pure LDS streaming kernels (instruction forms confirmed in ISA):
- ds_read/write_b32 (l1): typically ~1.9-3.9 TB/s
- ds_read/write_b64 (l2): typically ~4.3-8.8 TB/s
- ds_read/write_b128 (l4): typically ~9.5-11.2 TB/s
Takeaway: b128 LDS accesses are the strongest baseline for LDS-heavy staging paths.
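A hedged sketch of what a b128 staging path looks like in source: 16-byte vector types on 16-byte-aligned LDS storage usually lower to ds_write_b128/ds_read_b128 on gfx906, but this should be confirmed in the emitted ISA. The kernel name and sizes are illustrative, not the measured benchmark.

```cpp
#include <hip/hip_runtime.h>

// Sketch only: stage one float4 per thread through LDS with b128 accesses.
// Assumes a block size of 256 threads; tile[] is naturally 16-byte aligned.
__global__ void lds_stage_b128(const float4* __restrict__ in,
                               float4* __restrict__ out) {
    __shared__ float4 tile[256];
    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;

    tile[t] = in[g];                       // expected: ds_write_b128
    __syncthreads();

    // Read a neighbouring thread's element so the LDS round trip is real work.
    out[g] = tile[(t + 1) % blockDim.x];   // expected: ds_read_b128
}
```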
4) Wide global loads help when memory path is healthy
The compiler emits:
- scalar path: global_load_dword
- vector path: global_load_dwordx4
In uncongested runs, dwordx4 outperformed scalar (~867-873 GB/s vs ~814 GB/s).
On a shared machine, global-memory numbers varied run-to-run; treat this as directionally positive, not a fixed constant.
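A hedged sketch of the vector path: loading through a 16-byte-aligned float4 pointer typically produces global_load_dwordx4 on gfx906. Kernel and parameter names are illustrative.

```cpp
#include <hip/hip_runtime.h>

// Sketch only: one 16-byte load and one 16-byte store per thread.
__global__ void copy_dwordx4(const float4* __restrict__ src,
                             float4* __restrict__ dst, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) dst[i] = src[i];   // expected: global_load_dwordx4 + global_store_dwordx4
}
```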
Scheduling behavior that matters
In ILP kernels, the compiler issues multiple loads up front and delays the waits:
- VMEM: staged s_waitcnt vmcnt(3..0)
- LDS: staged s_waitcnt lgkmcnt(...)
That pattern is the core latency-hiding mechanism on gfx906: keep multiple memory operations in flight before consuming results.
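A hedged sketch of the source shape that produces this schedule: four independent strided loads are issued before any result is consumed, so the compiler can emit them back-to-back and place a staged s_waitcnt vmcnt(...) in front of the arithmetic. The unroll factor and names are illustrative.

```cpp
#include <hip/hip_runtime.h>

// Sketch only: keep four global loads in flight before the first use.
__global__ void sum4_ilp(const float* __restrict__ in,
                         float* __restrict__ out, int n) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    if (tid + 3 * stride < n) {
        float a = in[tid];               // four independent loads issued first...
        float b = in[tid + stride];
        float c = in[tid + 2 * stride];
        float d = in[tid + 3 * stride];
        out[tid] = (a + b) + (c + d);    // ...results consumed only afterwards
    }
}
```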
What is not available on gfx906 (relevant to hiding)
Assembler probes on gfx906 rejected:
- s_clause
- s_waitcnt_depctr
- s_delay_alu
So latency control is mainly through ILP/wave occupancy and careful s_waitcnt timing, not newer explicit dependency-control instructions.
Practical checklist
- Row-local shuffle: use v_mov_b32_dpp.
- Arbitrary in-wave shuffle: use ds_bpermute_b32 / ds_permute_b32.
- LDS staging: default to ds_read/write_b128 where alignment allows.
- Global staging: prefer global_load_dwordx4 for contiguous packed data.
- Structure loops to issue multiple independent loads before first use.
- Avoid immediate waits after each load; let the compiler keep the VMEM/LDS queues populated.
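A minimal sketch combining these points for a dequant-style staging path (the packed layout, the block size of 256, and the 8-bit unpack are assumptions for illustration): a wide global load feeds a wide LDS write, and consumption happens only after the whole tile is resident.

```cpp
#include <hip/hip_runtime.h>

// Sketch only: global_load_dwordx4 -> ds_write_b128 -> barrier -> ds_read_b128.
__global__ void stage_and_unpack(const uint4* __restrict__ packed,
                                 float* __restrict__ out) {
    __shared__ uint4 tile[256];                 // b128-wide LDS staging
    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;

    tile[t] = packed[g];                        // 16-byte global load + LDS store
    __syncthreads();                            // whole tile resident before use

    // Read a neighbouring thread's word so the LDS staging is actually needed.
    uint4 q = tile[(t + 1) % blockDim.x];
    out[g] = (float)(q.x & 0xFF) + (float)(q.y & 0xFF)
           + (float)(q.z & 0xFF) + (float)(q.w & 0xFF);
}
```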
References
- LLVM gfx906 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
- LLVM GFX9 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX9.html
- LLVM AMDGPU modifier syntax: https://llvm.org/docs/AMDGPUModifierSyntax.html