FP32 vs Quant-Dequant + Dot on gfx906 (Measured)
Date: 2026-02-21
Hardware: AMD Instinct MI50/MI60 (gfx906), 60 CUs, 1725 MHz
Host: fox@192.168.1.28 (ROCm installed)
Question:
- If activations start in FP32, is it worth quantizing/dequantizing them on gfx906 to compute with
dot4/dot2, or is pure FP32 better?
Experiment Setup
Three paths were benchmarked on-device with HIP:
pure_fp32
- FP32 values stay FP32.
- Compute via FP32 FMA only.
qdq_int8_dot4
- In kernel hot loop: FP32 activation -> INT8 quantize (pack) ->
__builtin_amdgcn_sdot4 -> dequantize.
qdq_fp16_dot2
- In kernel hot loop: FP32 activation -> FP16 conversion ->
__builtin_amdgcn_fdot2.
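To make the per-iteration work of the qdq_int8_dot4 path concrete, here is a host-side C++ reference model of the three steps (quantize and pack, 4-way int8 dot, dequantize). It is a sketch only and not the benchmark kernel: the symmetric quantization with a caller-supplied scale is an assumption (the scheme actually used is not stated here), and the names quantize_pack4, sdot4_ref, and dequantize are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Host-side reference model, NOT the benchmark kernel.
// Assumption: symmetric quantization, q = round(x / scale), clamped to [-127, 127].
inline uint32_t quantize_pack4(const float x[4], float scale) {
    assert(scale > 0.0f);
    uint32_t packed = 0;
    for (int i = 0; i < 4; ++i) {
        long q = std::lround(x[i] / scale);
        q = std::max(-127L, std::min(127L, q));
        // Keep the low 8 bits (two's complement) and place them in byte lane i.
        packed |= (static_cast<uint32_t>(q) & 0xFFu) << (8 * i);
    }
    return packed;
}

// Scalar model of a 4-way signed int8 dot product with int32 accumulate,
// the arithmetic that v_dot4_i32_i8 performs (saturation is not modelled).
inline int32_t sdot4_ref(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        const int8_t ai = static_cast<int8_t>((a >> (8 * i)) & 0xFFu);
        const int8_t bi = static_cast<int8_t>((b >> (8 * i)) & 0xFFu);
        acc += static_cast<int32_t>(ai) * static_cast<int32_t>(bi);
    }
    return acc;
}

// Map the int32 accumulator back to FP32 using both operand scales.
inline float dequantize(int32_t acc, float scale_a, float scale_b) {
    return static_cast<float>(acc) * scale_a * scale_b;
}
```

Every use pays for four divide/round/clamp/pack steps plus the final rescale around a single dot instruction, which is the overhead the hot-loop measurement captures.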
All paths were normalized to the same arithmetic payload per loop iteration (8 MACs/thread/iter), and reported as effective TOPS (counting MAC as 2 ops).
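The effective-TOPS figure follows directly from that normalization. A minimal sketch of the arithmetic, with a hypothetical helper name and hypothetical launch parameters (the thread count, iteration count, and timings of the actual runs are not given here):

```cpp
#include <cassert>
#include <cstdint>

// Effective TOPS = 2 * threads * iterations * MACs-per-iteration / seconds / 1e12.
// One MAC is counted as 2 ops (multiply + add), matching the convention above.
inline double effective_tops(uint64_t threads, uint64_t iters,
                             uint64_t macs_per_iter, double seconds) {
    assert(seconds > 0.0);
    const double ops = 2.0 * static_cast<double>(threads) *
                       static_cast<double>(iters) *
                       static_cast<double>(macs_per_iter);
    return ops / seconds / 1e12;
}
```

For example, 1,000,000 threads running 1,000,000 iterations of 8 MACs each in 4.0 s would give effective_tops(1000000, 1000000, 8, 4.0) = 4.0.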
Core Result (On-the-Fly QDQ in Hot Loop)
Stable best results across cards (after reruns):
- pure_fp32: ~5.95 TOPS
- qdq_fp16_dot2: ~4.19 TOPS
- qdq_int8_dot4: ~2.00 TOPS
Conclusion for this scenario:
- When activations start as FP32 and conversion is done in the hot loop, pure FP32 wins.
- dot2 is slower than FP32 (~0.70x of FP32 throughput).
- dot4 is much slower than FP32 (~0.34x of FP32 throughput).
Amortized Conversion Check (Conversion Once, Reuse Many Times)
A second benchmark converted once outside the hot loop, then reused converted values:
- fp32_reuse: ~13.0 TOPS
- dot4_reuse: ~21.7 TOPS
- dot2_reuse: ~21.9 TOPS
Interpretation:
- If conversion cost is amortized by reuse (GEMM-like behavior), dot paths can outperform pure FP32.
- If conversion and dequantization are paid on every use, they do not.
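The amortization argument can be written as a simple cost model. This is an illustrative sketch, not something fitted to the measurements: costs are in arbitrary units (for example cycles per use), and the helper names and example values are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Average cost of one use when a one-time conversion is shared by `reuse` uses.
inline double cost_per_use(double convert_cost, double compute_cost, double reuse) {
    assert(reuse >= 1.0);
    return compute_cost + convert_cost / reuse;
}

// Reuse count above which the dot path is cheaper than pure FP32:
//   dot_compute + convert_cost / r < fp32_compute
//   =>  r > convert_cost / (fp32_compute - dot_compute)
// Returns infinity when the dot compute itself is not faster than FP32.
inline double break_even_reuse(double convert_cost, double dot_compute,
                               double fp32_compute) {
    if (dot_compute >= fp32_compute) return INFINITY;
    return convert_cost / (fp32_compute - dot_compute);
}
```

With a hypothetical conversion cost of 6, dot compute cost of 1, and FP32 compute cost of 2, the break-even reuse is 6: at reuse 1 the dot path costs 7 per use (worse than FP32), and at reuse 12 it costs 1.5 (better).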
Practical Recommendation
- For per-use FP32 activations:
  - Use pure FP32 on gfx906.
- For high-reuse kernels (where conversion is amortized):
  - Dot paths (dot4/dot2) can be worthwhile.
  - Optimize for reuse depth before deciding.
- Do not rely on theoretical dot throughput alone:
  - End-to-end cost is dominated by conversion/packing when done in the hot path.
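For context on the last point, theoretical peak throughput can be estimated from the hardware line at the top of this note. The sketch below takes 60 CUs and 1725 MHz from that line; the 64 lanes per CU and the per-lane rates (2 ops/clock for FP32 FMA, 4 for dot2, 8 for dot4) are assumptions based on public gfx906 specifications, not values measured here, and peak_tops is a hypothetical helper.

```cpp
#include <cassert>

// Peak TOPS = CUs * lanes per CU * clock (Hz) * ops per lane per clock / 1e12.
// Assumption: every lane issues one such instruction per clock.
inline double peak_tops(int compute_units, int lanes_per_cu,
                        double clock_mhz, int ops_per_lane_per_clock) {
    assert(compute_units > 0 && lanes_per_cu > 0 && clock_mhz > 0.0);
    return static_cast<double>(compute_units) * lanes_per_cu *
           clock_mhz * 1e6 * ops_per_lane_per_clock / 1e12;
}
```

Under those assumptions, peak_tops(60, 64, 1725.0, 2) is about 13.2 (FP32), peak_tops(60, 64, 1725.0, 4) about 26.5 (dot2), and peak_tops(60, 64, 1725.0, 8) about 53.0 (dot4). The measured dot4_reuse figure of ~21.7 TOPS is well below that dot4 peak even with conversion amortized, and the hot-loop ~2.00 TOPS is far below it, which is why the theoretical number is a poor guide on its own.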
Instruction Validation Notes
Codegen validation on gfx906:
- __builtin_amdgcn_sdot4 -> v_dot4_i32_i8
- __builtin_amdgcn_fdot2 -> v_dot2_f32_f16
Related references:
- LLVM AMDGPU usage/reference: https://llvm.org/docs/AMDGPUUsage.html
- LLVM gfx906 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html