gfx906 dot4/dot8 Exploration (2026-02-21)

This page documents how dot4/dot8 instructions behave on gfx906 (MI50/MI60 class), including semantics, limits, theoretical TOPS, and real measurements.

1) Instruction Mapping and Semantics

Primary source:

LLVM AMDGPU usage/reference: https://llvm.org/docs/AMDGPUUsage.html

Mapped intrinsics:

llvm.amdgcn.sdot4 -> v_dot4_i32_i8
llvm.amdgcn.udot4 -> v_dot4_u32_u8
llvm.amdgcn.sdot8 -> v_dot8_i32_i4
llvm.amdgcn.udot8 -> v_dot8_u32_u4

Semantics:

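dot4 takes two packed 32-bit operands, each holding four 8-bit values (signed for sdot4, unsigned for udot4).
dot8 takes two packed 32-bit operands, each holding eight 4-bit values (signed for sdot8, unsigned for udot8).
Both add the sum of the element-wise products into a 32-bit accumulator (src2).
The fourth intrinsic operand is the clamp enable (i1).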
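
The semantics above can be written down as a small host-side reference model. This is a sketch for checking expectations on a CPU, not the hardware definition: the function names (sdot4_ref and friends) are hypothetical, clamp is left out, and the accumulator is assumed to wrap modulo 2^32.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low `bits` bits of v (bits is 4 or 8 here).
static inline int32_t sign_extend(uint32_t v, unsigned bits) {
    const uint32_t sign = 1u << (bits - 1);
    v &= (1u << bits) - 1u;
    return static_cast<int32_t>(v ^ sign) - static_cast<int32_t>(sign);
}

// Signed dot product over Lanes fields of Bits bits each, added to acc.
// The running sum is kept in uint32_t so that overflow wraps modulo 2^32.
template <unsigned Lanes, unsigned Bits>
int32_t sdot_ref(uint32_t a, uint32_t b, int32_t acc) {
    uint32_t sum = static_cast<uint32_t>(acc);
    for (unsigned i = 0; i < Lanes; ++i) {
        const int32_t x = sign_extend(a >> (i * Bits), Bits);
        const int32_t y = sign_extend(b >> (i * Bits), Bits);
        sum += static_cast<uint32_t>(x * y);
    }
    return static_cast<int32_t>(sum);
}

// Unsigned variant: fields are zero-extended instead of sign-extended.
template <unsigned Lanes, unsigned Bits>
uint32_t udot_ref(uint32_t a, uint32_t b, uint32_t acc) {
    const uint32_t mask = (1u << Bits) - 1u;
    uint32_t sum = acc;
    for (unsigned i = 0; i < Lanes; ++i)
        sum += ((a >> (i * Bits)) & mask) * ((b >> (i * Bits)) & mask);
    return sum;
}

inline int32_t  sdot4_ref(uint32_t a, uint32_t b, int32_t acc)  { return sdot_ref<4, 8>(a, b, acc); }
inline uint32_t udot4_ref(uint32_t a, uint32_t b, uint32_t acc) { return udot_ref<4, 8>(a, b, acc); }
inline int32_t  sdot8_ref(uint32_t a, uint32_t b, int32_t acc)  { return sdot_ref<8, 4>(a, b, acc); }
inline uint32_t udot8_ref(uint32_t a, uint32_t b, uint32_t acc) { return udot_ref<8, 4>(a, b, acc); }

// Worked examples: the same bit patterns give different results when the
// fields are read as signed or unsigned.
inline void check_dot_ref() {
    assert(sdot4_ref(0xFF02FE01u, 0x04030201u, 0) == -1);     // 1*1 + (-2)*2 + 2*3 + (-1)*4
    assert(udot4_ref(0xFF02FE01u, 0x04030201u, 0u) == 1535u); // 1*1 + 254*2 + 2*3 + 255*4
    assert(sdot8_ref(0x11111111u, 0xFFFFFFFFu, 0) == -8);     // eight times 1 * (-1)
    assert(udot8_ref(0x11111111u, 0xFFFFFFFFu, 0u) == 120u);  // eight times 1 * 15
}
```

The examples are chosen so that the result does not depend on which field counts as lane 0, since both operands are unpacked in the same order.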

Per-target syntax confirms availability on gfx906:

https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html

Contrast:

v_mfma* is not listed on the gfx906 syntax page but does appear on the gfx908 page:
https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX908.html

2) Codegen Validation on Real Host

Host:

fox@192.168.1.28 (ROCm installed)
rocminfo: 4x gfx906, each 60 CUs, 1725 MHz

Direct compile test:

clang -target amdgcn-amd-amdhsa -mcpu=gfx906 -nogpulib -O3 -S

Observed lowering:

__builtin_amdgcn_sdot4 -> v_dot4_i32_i8
__builtin_amdgcn_udot4 -> v_dot4_u32_u8
__builtin_amdgcn_sdot8 -> v_dot8_i32_i4
__builtin_amdgcn_udot8 -> v_dot8_u32_u4
Setting the clamp flag emits the same instruction with the clamp modifier (... clamp).
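
A source file of roughly the following shape could be fed to the compile command above to inspect the lowering. It is a sketch, not the file used for the test: the file name and the wrapper names (sdot4_wrap, sdot4_sat) are hypothetical, only the builtin names come from this page, and the host branch is a portable stand-in that assumes clamp saturates the exact sum.

```cpp
// dot_builtins.cpp (hypothetical file name)
#if defined(__AMDGCN__)
// Device compile: call the builtins listed above. The last argument is the
// clamp enable, passed here as a literal.
extern "C" int sdot4_wrap(int a, int b, int acc) {
    return __builtin_amdgcn_sdot4(a, b, acc, false);
}
extern "C" int sdot4_sat(int a, int b, int acc) {
    return __builtin_amdgcn_sdot4(a, b, acc, true);
}
#else
// Host compile: portable stand-ins so the same file builds without a GPU.
#include <cassert>

// Exact dot4 sum in 64 bits (cannot overflow for 8-bit inputs).
static long long dot4_exact(int a, int b, int acc) {
    long long sum = acc;
    for (int i = 0; i < 4; ++i) {
        const int x = static_cast<signed char>(static_cast<unsigned>(a) >> (8 * i));
        const int y = static_cast<signed char>(static_cast<unsigned>(b) >> (8 * i));
        sum += x * y;
    }
    return sum;
}
extern "C" int sdot4_wrap(int a, int b, int acc) {
    // Keep only the low 32 bits to mimic wraparound.
    return static_cast<int>(static_cast<unsigned>(dot4_exact(a, b, acc)));
}
extern "C" int sdot4_sat(int a, int b, int acc) {
    // Assumption: clamp saturates the exact sum to the int32 range.
    const long long lo = -2147483647LL - 1, hi = 2147483647LL;
    const long long s = dot4_exact(a, b, acc);
    return static_cast<int>(s < lo ? lo : (s > hi ? hi : s));
}
inline void check_host_fallback() {
    assert(sdot4_wrap(0x04030201, 0x01010101, 5) == 15);
    assert(sdot4_sat(0x7f7f7f7f, 0x7f7f7f7f, 0x7fffffff) == 0x7fffffff);
}
#endif
```

The functions are non-static so that the device compile keeps them and the generated assembly contains one dot instruction per wrapper.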

3) Clamp and Overflow Behavior (Measured)

Measured with small HIP kernels on gfx906:

sdot4 positive overflow:
- no clamp: wraps
- clamp: saturates to INT_MAX (0x7fffffff)
sdot4 negative overflow:
- no clamp: wraps
- clamp: saturates to INT_MIN (0x80000000)
udot4 overflow:
- no clamp: wraps
- clamp: saturates to UINT_MAX (0xffffffff)
sdot8 positive overflow:
- no clamp: wraps
- clamp: saturates to INT_MAX

Takeaway:

Accumulator is 32-bit and can overflow.
Use clamp when saturating behavior is required.
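
The wrap and saturate outcomes listed above can be reproduced with a host-side model. This is a sketch under a stated assumption: clamp is modeled as saturating the exact (64-bit) sum, which matches the measured end results but is not taken from an ISA definition. The names sdot4_model and udot4_model are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Extract one signed 8-bit lane from a packed 32-bit word.
static inline int32_t lane_i8(uint32_t word, unsigned lane) {
    return static_cast<int8_t>(word >> (8 * lane));
}

// Signed dot4 with optional clamp. Assumption: clamp saturates the exact sum.
int32_t sdot4_model(uint32_t a, uint32_t b, int32_t acc, bool clamp) {
    int64_t sum = acc;
    for (unsigned i = 0; i < 4; ++i) sum += lane_i8(a, i) * lane_i8(b, i);
    if (clamp) {
        if (sum > std::numeric_limits<int32_t>::max()) return std::numeric_limits<int32_t>::max();
        if (sum < std::numeric_limits<int32_t>::min()) return std::numeric_limits<int32_t>::min();
    }
    // No clamp (or no overflow): keep the low 32 bits, i.e. wrap.
    return static_cast<int32_t>(static_cast<uint32_t>(sum));
}

// Unsigned dot4 with optional clamp.
uint32_t udot4_model(uint32_t a, uint32_t b, uint32_t acc, bool clamp) {
    uint64_t sum = acc;
    for (unsigned i = 0; i < 4; ++i)
        sum += static_cast<uint64_t>((a >> (8 * i)) & 0xffu) * ((b >> (8 * i)) & 0xffu);
    if (clamp && sum > std::numeric_limits<uint32_t>::max())
        return std::numeric_limits<uint32_t>::max();
    return static_cast<uint32_t>(sum);
}

inline void check_clamp_model() {
    const int32_t imax = std::numeric_limits<int32_t>::max();
    const int32_t imin = std::numeric_limits<int32_t>::min();
    // Positive overflow: 4 * 127 * 127 added to INT_MAX.
    assert(sdot4_model(0x7f7f7f7fu, 0x7f7f7f7fu, imax, true) == imax);
    assert(sdot4_model(0x7f7f7f7fu, 0x7f7f7f7fu, imax, false) < 0);   // wrapped
    // Negative overflow: 4 * 127 * (-128) added to INT_MIN.
    assert(sdot4_model(0x7f7f7f7fu, 0x80808080u, imin, true) == imin);
    assert(sdot4_model(0x7f7f7f7fu, 0x80808080u, imin, false) > 0);   // wrapped
    // Unsigned overflow: 4 * 255 * 255 added to UINT_MAX.
    assert(udot4_model(0xffffffffu, 0xffffffffu, 0xffffffffu, true) == 0xffffffffu);
    assert(udot4_model(0xffffffffu, 0xffffffffu, 0xffffffffu, false) == 260099u);
}
```

The inputs are illustrative, not the ones used in the HIP kernels; the sdot8 case follows the same pattern with eight 4-bit lanes.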

4) Theoretical Throughput (MI50 config from host)

Using the host-reported configuration (60 CUs at 1725 MHz):

dot4 theoretical:
- 26.496 TMAC/s
- 52.992 TOPS (counting MAC as 2 ops)
dot8 theoretical:
- 52.992 TMAC/s
- 105.984 TOPS (counting MAC as 2 ops)

Formula used:

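TMAC/s = CUs * 64 lanes * MACs_per_instruction * clock_Hz / 1e12
TOPS = 2 * TMAC/s
MACs_per_instruction is 4 for dot4 and 8 for dot8.
The formula assumes every lane retires one dot instruction per clock.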
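
The arithmetic behind these numbers can be checked with a few lines of host code. This is a minimal sketch of the formula as stated on this page; the function names peak_tmacs and peak_tops are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Peak MAC rate in TMAC/s, assuming 64 lanes per CU and one dot instruction
// per lane per clock (the assumption behind the formula above).
double peak_tmacs(int compute_units, int macs_per_instruction, double clock_mhz) {
    const double lanes_per_cu = 64.0;
    return compute_units * lanes_per_cu * macs_per_instruction * clock_mhz * 1e6 / 1e12;
}

// Peak in TOPS, counting each MAC as two operations.
double peak_tops(int compute_units, int macs_per_instruction, double clock_mhz) {
    return 2.0 * peak_tmacs(compute_units, macs_per_instruction, clock_mhz);
}

inline void check_peak() {
    // Host configuration: 60 CUs at 1725 MHz.
    assert(std::fabs(peak_tmacs(60, 4, 1725.0) - 26.496) < 1e-9);   // dot4
    assert(std::fabs(peak_tops(60, 4, 1725.0) - 52.992) < 1e-9);
    assert(std::fabs(peak_tmacs(60, 8, 1725.0) - 52.992) < 1e-9);   // dot8
    assert(std::fabs(peak_tops(60, 8, 1725.0) - 105.984) < 1e-9);
}
```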

5) Real Throughput Measurements (All 4 GPUs)

Benchmark A: dependency-chained accumulator

blocks=2048, threads=256, iters=65536
Across all 4 cards:
- sdot4: ~21.7 to 22.3 TOPS
- udot4: ~22.25 to 22.63 TOPS
- sdot8: ~43.5 to 44.4 TOPS
- udot8: ~44.5 to 44.6 TOPS

Benchmark B: ILP4 (4 independent accumulators)

Same launch geometry as Benchmark A.
Across all 4 cards:
- sdot4_ilp4: ~43.0 to 44.4 TOPS
- sdot8_ilp4: ~85.3 to 86.2 TOPS

Interpretation:

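dot8 delivers about 2x the throughput of dot4 in both patterns.
ILP materially improves achieved throughput by reducing dependency stalls: ILP4 roughly doubles the dependency-chained numbers.
ILP4 results reach roughly 80% to 84% of the simple theoretical peak; the dependency-chained results reach roughly 41% to 43%.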
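
The fraction of peak can be recomputed from the measured ranges and the theoretical TOPS figures from section 4. A small sketch, with the hypothetical helper name efficiency_percent:

```cpp
#include <cassert>
#include <cmath>

// Achieved throughput as a percentage of the simple theoretical peak.
double efficiency_percent(double measured_tops, double peak_tops) {
    return 100.0 * measured_tops / peak_tops;
}

inline void check_efficiency() {
    const double dot4_peak = 52.992;    // TOPS, from section 4
    const double dot8_peak = 105.984;   // TOPS, from section 4
    // ILP4, low end of sdot4 and high end of sdot8.
    assert(std::fabs(efficiency_percent(43.0, dot4_peak) - 81.14) < 0.05);
    assert(std::fabs(efficiency_percent(86.2, dot8_peak) - 81.33) < 0.05);
    // Dependency-chained sdot4, low end.
    assert(std::fabs(efficiency_percent(21.7, dot4_peak) - 40.95) < 0.05);
}
```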

6) Practical Optimization Guidance

Use dot8 when quantization/layout supports 4-bit packing; it delivers about 2x dot4 arithmetic density.
Keep multiple independent accumulators per thread to reduce dependency throttling.
Track 32-bit accumulator range; enable clamp where saturation is needed.
On gfx906, optimize around v_dot* and memory behavior; do not assume MFMA.
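
The multiple-accumulator advice can be illustrated with the following sketch. It is plain C++ so it can be checked on a host; in a HIP kernel the functions would carry __device__ qualifiers and each thread would loop over its own slice. The names dot4_step, dot_chained and dot_ilp4 are hypothetical, and the host branch is a stand-in that assumes unclamped wraparound.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One dot4 accumulate step. On an AMDGCN device compile this uses the builtin
// named on this page; elsewhere it falls back to a portable wrapping version.
static inline int32_t dot4_step(int32_t a, int32_t b, int32_t acc) {
#if defined(__AMDGCN__)
    return __builtin_amdgcn_sdot4(a, b, acc, false);
#else
    uint32_t sum = static_cast<uint32_t>(acc);
    for (int i = 0; i < 4; ++i) {
        const int32_t x = static_cast<int8_t>(static_cast<uint32_t>(a) >> (8 * i));
        const int32_t y = static_cast<int8_t>(static_cast<uint32_t>(b) >> (8 * i));
        sum += static_cast<uint32_t>(x * y);
    }
    return static_cast<int32_t>(sum);
#endif
}

// Single accumulator: every step depends on the previous one.
int32_t dot_chained(const int32_t* a, const int32_t* b, std::size_t n) {
    int32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc = dot4_step(a[i], b[i], acc);
    return acc;
}

// Four independent accumulators: four dependency chains can be in flight.
int32_t dot_ilp4(const int32_t* a, const int32_t* b, std::size_t n) {
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 = dot4_step(a[i + 0], b[i + 0], acc0);
        acc1 = dot4_step(a[i + 1], b[i + 1], acc1);
        acc2 = dot4_step(a[i + 2], b[i + 2], acc2);
        acc3 = dot4_step(a[i + 3], b[i + 3], acc3);
    }
    for (; i < n; ++i) acc0 = dot4_step(a[i], b[i], acc0);   // remainder
    // Combine with wrapping addition so the result matches the chained sum.
    return static_cast<int32_t>(static_cast<uint32_t>(acc0) + static_cast<uint32_t>(acc1) +
                                static_cast<uint32_t>(acc2) + static_cast<uint32_t>(acc3));
}

inline void check_ilp4() {
    const int32_t a[6] = {0x01010101, 0x01010101, 0x01010101, 0x01010101, 0x01010101, 0x01010101};
    const int32_t b[6] = {0x02020202, 0x02020202, 0x02020202, 0x02020202, 0x02020202, 0x02020202};
    assert(dot_chained(a, b, 6) == 48);   // 6 words * 4 lanes * (1 * 2)
    assert(dot_ilp4(a, b, 6) == 48);
}
```

Because the unclamped accumulation wraps modulo 2^32, splitting the sum across accumulators does not change the result. With clamp enabled that equivalence no longer holds in general, since each partial accumulator would saturate on its own.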

References

LLVM AMDGPU usage/reference:
https://llvm.org/docs/AMDGPUUsage.html
LLVM gfx906 instruction syntax:
https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
LLVM gfx908 instruction syntax (contrast):
https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX908.html
ROCm GPU architecture specs:
https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html
