Introduction

Welcome to the GFX906 Wiki! This repository serves as a clean, markdown-based documentation hub for the GFX906 project.

In this book you will find:

  • An overview of the project and its goals
  • Step‑by‑step getting‑started instructions
  • Detailed usage guides and reference material
  • Contribution guidelines for community members

Feel free to explore the chapters listed in the SUMMARY.md file and start learning about GFX906. Happy reading!

Getting Started

Welcome to the GFX906 Wiki! This guide will help you set up the project locally and start exploring the documentation.

Prerequisites

  • Git – to clone the repository.
  • Rust (stable) – required for building mdbook.
  • mdBook – can be installed via cargo install mdbook (the GitHub Actions workflow does this automatically).

Clone the Repository

git clone https://github.com/yourusername/wiki-gfx906.git
cd wiki-gfx906

Build the Book Locally

  1. Install mdbook if you haven't already:

    cargo install mdbook
    
  2. Build the book:

    mdbook build
    

    The generated static site will appear in book/.

  3. Preview it locally:

    mdbook serve
    

    Open http://localhost:3000 in your browser to view the wiki.

Adding New Content

  1. Create a new markdown file in src/, e.g.:

    touch src/new_section.md
    
  2. Edit src/SUMMARY.md to include the new file:

    - [New Section](./new_section.md)
    
  3. Re‑run mdbook build or mdbook serve to see your changes.

Contributing

  • Fork the repository.
  • Create a feature branch.
  • Commit your changes with clear messages.
  • Open a Pull Request targeting main.

All contributions are welcome! See the Contributing page for detailed guidelines.

Deploy to GitHub Pages

The repository includes a GitHub Actions workflow (.github/workflows/mdbook.yml) that automatically builds the book and publishes it to GitHub Pages on every push to main. No manual steps are required.


Happy documenting! 🎉

Usage

This page explains how to work with the GFX906 Wiki using mdBook. It covers building the book locally, previewing it, and adding new content.

Prerequisites

  • Git – to clone the repository.
  • Rust (stable) – required for cargo and mdbook.
  • mdBook – install with:
cargo install mdbook

Building the Book

From the repository root:

mdbook build wiki-gfx906

The generated static site will be placed in wiki-gfx906/book/.

Previewing Locally

Run a local development server that watches for changes:

mdbook serve wiki-gfx906

Open http://localhost:3000 in your browser. The server automatically reloads when you edit markdown files.

Adding New Content

  1. Create a markdown file in wiki-gfx906/src/, e.g.:

    touch wiki-gfx906/src/advanced_topics.md
    
  2. Edit SUMMARY.md to include the new file. For example, add:

    - [Advanced Topics](./advanced_topics.md)
    
  3. Write your content in the new file using standard Markdown syntax.

  4. Rebuild or serve the book to see the changes.

Common Patterns

Code Blocks

fn main() {
    println!("Hello, mdBook!");
}

Inline Code

Use backticks for inline code.

Links

  • Relative link to another chapter: [Getting Started](./getting_started.md)
  • External link: https://github.com/

Contributing

If you want to contribute improvements:

  1. Fork the repository.
  2. Create a feature branch.
  3. Make your changes.
  4. Open a Pull Request targeting main.

All contributions are welcome. See the Contributing page for detailed guidelines.

Deploying to GitHub Pages

The repository includes a GitHub Actions workflow (.github/workflows/mdbook.yml) that automatically:

  1. Installs mdbook.
  2. Builds the book.
  3. Deploys the book/ directory to GitHub Pages on every push to main.

No manual steps are required after the initial setup.


Enjoy writing and reading the GFX906 documentation! 🎉

Reference

This reference provides a quick overview of the most important concepts, commands, and configuration options for the GFX906 wiki built with mdBook.

Project Structure

wiki-gfx906/
├── book/               # Generated static site (output of `mdbook build`)
├── src/                # Source markdown files
│   ├── SUMMARY.md      # Table of contents for the book
│   ├── intro.md
│   ├── getting_started.md
│   ├── usage.md
│   ├── reference.md    # ← This file
│   └── contributing.md
├── book.toml           # mdBook configuration
└── .github/
    └── workflows/
        └── mdbook.yml  # GitHub Actions CI/CD pipeline

mdBook Commands

| Command | Description |
|---|---|
| mdbook build | Compile the book into static HTML files under book/. |
| mdbook serve | Run a local development server (default: http://localhost:3000) that watches for changes. |
| cargo install mdbook | Install the mdbook binary (required for the above commands). |

Configuration (book.toml)

[book]
title = "Wiki GFX906"
author = ["Your Name"]
description = "Documentation and wiki for gfx906 project"
language = "en"
multilingual = false
src = "src"

[output.html]
default-theme = "light"
preferred-dark-theme = "navy"
git-repository-url = "https://github.com/yourusername/wiki-gfx906"
edit-url-template = "https://github.com/yourusername/wiki-gfx906/edit/main/{path}"
  • src – Directory that holds the markdown source files.
  • default-theme / preferred-dark-theme – Control the visual theme of the generated site.
  • edit-url-template – Enables the “Edit on GitHub” link for each page.

GitHub Actions Workflow (.github/workflows/mdbook.yml)

| Step | Purpose |
|---|---|
| actions/checkout@v4 | Checks out the repository. |
| actions/setup-rust@v1 | Installs the stable Rust toolchain. |
| cargo install mdbook | Installs mdbook on the runner. |
| mdbook build wiki-gfx906 | Generates the static site. |
| actions/upload-pages-artifact@v3 | Packages the book/ directory for deployment. |
| actions/deploy-pages@v3 | Publishes the artifact to GitHub Pages. |

The workflow triggers on pushes to main and can also be started manually via workflow_dispatch.

Adding New Content

  1. Create a Markdown file in src/ (e.g., src/new_topic.md).
  2. Add an entry in src/SUMMARY.md:
    - [New Topic](./new_topic.md)
    
  3. Rebuild (mdbook build) or serve (mdbook serve) to see the changes.

Frequently Asked Questions

  • Do I need to run mdbook build on every push?
    No. The GitHub Actions workflow automatically builds and deploys on every push to main.

  • How can I change the theme?
    Edit default-theme and preferred-dark-theme in book.toml and rebuild.

  • Where are the generated files stored?
    In the book/ directory, which is ignored by Git by default.

Contributing Guidelines

Refer to src/contributing.md for the full contribution process, commit style, and code‑of‑conduct.


Happy documenting! 🎉

Studies (2026-02-21)

Collected exploration notes from February 21, 2026:

  • MI50/MI60 architecture baseline
  • dot4/dot8 behavior and limits
  • special ISA useful for quant/dequant
  • latency-hiding operations
  • LDS layout standard for LLM blocks
  • KV-cache read/write layout study
  • FP32 vs QDQ dot study

MI50/MI60 (gfx906) Architecture Baseline

This note captures the highest-value architectural facts I could verify from primary sources for optimization work on:

  • Radeon Instinct MI60 (gfx906)
  • Radeon Instinct MI50 16GB (gfx906)
  • Radeon Instinct MI50 32GB (gfx906) (listed in current ROCm tables)

Data was cross-checked on February 21, 2026.

1) Ground-Truth SKU/Resource Table (ROCm Hardware Specs)

From AMD ROCm’s Instinct architecture table:

| GPU | LLVM target | VRAM (GiB) | CUs | Wavefront | LDS/CU | L2 | L1 Vector | L1 Scalar | L1 I$ | VGPR file | SGPR file | GFXIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MI60 | gfx906 | 32 | 64 | 64 | 64 KiB | 4 MiB | 16 KiB | 16 KiB / 3 CUs | 32 KiB / 3 CUs | 256 KiB | 12.5 KiB | 9.0 |
| MI50 (32GB) | gfx906 | 32 | 60 | 64 | 64 KiB | 4 MiB | 16 KiB | 16 KiB / 3 CUs | 32 KiB / 3 CUs | 256 KiB | 12.5 KiB | 9.0 |
| MI50 (16GB) | gfx906 | 16 | 60 | 64 | 64 KiB | 4 MiB | 16 KiB | 16 KiB / 3 CUs | 32 KiB / 3 CUs | 256 KiB | 12.5 KiB | 9.0 |

Optimization implication:

  • Treat MI60 and MI50 as the same ISA/feature family (gfx906) with CU-count and memory-capacity differences as the main SKU split.

2) Launch-Era Product Capabilities (AMD 2018 IR Release)

From AMD’s MI50/MI60 launch release (Nov 6, 2018):

  • MI60 peak throughput (theoretical): FP16 29.5 TFLOPS, FP32 14.8 TFLOPS, FP64 7.4 TFLOPS.
  • MI50 peak throughput (theoretical): FP16 26.8 TFLOPS, FP32 13.4 TFLOPS, FP64 6.7 TFLOPS.
  • Both boards: 300W envelope.
  • Memory/interconnect claims: up to 1 TB/s HBM2 bandwidth, MI60 at 32GB HBM2 ECC, MI50 at 16GB HBM2 ECC (launch), dual IF links up to 200 GB/s P2P, and PCIe Gen4 x16 up to 64 GB/s host link bandwidth.

Date clarification:

  • The 2018 launch material lists MI50 as 16GB.
  • The current ROCm table includes both MI50 16GB and MI50 32GB entries.

3) Compute-Unit and Scheduling Model (HIP Hardware Docs)

ROCm HIP hardware documentation (GCN-oriented model) highlights:

  • Wavefront model is 64 lanes for this class of architecture.
  • CU execution core is modeled as four SIMD16 vector units.
  • Sequencer organization allows up to 40 resident wavefronts per CU (4 pools x up to 10 each), subject to resource limits.
  • Per-cycle issue model can include one instruction to each SIMD path, plus scalar/branch/LDS paths.
  • Resource constraints that gate occupancy: wave slots, VGPR, SGPR, LDS.

Optimization implication:

  • Occupancy is constrained by register and LDS pressure before nominal wave-slot maxima in many real kernels.
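A rough host-side sketch of how these limits interact, using the resource numbers from the table above (the helper name and the simplification of allocation granularity are mine, so treat the results as upper bounds, not exact occupancy):

```cpp
#include <algorithm>
#include <cstdio>

// Rough gfx906 per-CU occupancy estimate from the resource table above.
// Simplified: allocation granularity is ignored, so results are upper bounds.
int estimateWavesPerCU(int vgprsPerWave, int ldsBytesPerWorkgroup, int wavesPerWorkgroup) {
    const int kSimdsPerCU      = 4;         // four SIMD16 units per CU
    const int kMaxWavesPerSimd = 10;        // up to 10 resident waves per SIMD (40 per CU)
    const int kVgprsPerSimd    = 256;       // 64 KiB VGPR file per SIMD / (64 lanes * 4 B)
    const int kLdsBytesPerCU   = 64 * 1024; // 64 KiB LDS per CU

    int byVgpr = kSimdsPerCU * std::min(kMaxWavesPerSimd, kVgprsPerSimd / vgprsPerWave);
    int byLds  = ldsBytesPerWorkgroup > 0
                     ? (kLdsBytesPerCU / ldsBytesPerWorkgroup) * wavesPerWorkgroup
                     : kSimdsPerCU * kMaxWavesPerSimd;
    return std::min({kSimdsPerCU * kMaxWavesPerSimd, byVgpr, byLds});
}

int main() {
    // Example: 96 VGPRs/wave and 32 KiB LDS per 256-thread (4-wave) workgroup
    // -> limited to 8 waves/CU, well below the 40-wave nominal maximum.
    std::printf("estimated waves/CU = %d\n", estimateWavesPerCU(96, 32 * 1024, 4));
    return 0;
}
```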

4) Memory Hierarchy and Data Movement Facts

4.1 Caches/LDS behavior

From HIP hardware docs and Vega 7nm ISA:

  • LDS is a software-managed on-CU scratchpad with 32 banks, 4-byte bank width.
  • LDS bank conflicts are a first-order performance limiter for shared-memory-heavy kernels.
  • Vector L1 is per-CU, write-through, with 64-byte line granularity and typical 16 KiB size.
  • L2 is shared and is the coherence point for GPU memory traffic.
  • Vega 7nm ISA documents a 4 MiB shared L2 and CU-array scalar/instruction front-end caches.

From AMD 2018 release details:

  • Dual Infinity Fabric Links (xGMI) per GPU with stated up to 200 GB/s aggregate P2P bandwidth.
  • PCIe Gen4 x16 stated up to 64 GB/s host-device transport peak.

Optimization implication:

  • Multi-GPU collectives can benefit significantly when topology actually uses IF links.
  • Host-staging and transfer overlap should assume PCIe constraints unless direct GPU-GPU paths are active.

5) ISA/Compiler-Surface Constraints Specific to gfx906

From LLVM AMDGPU usage/reference and per-target assembler docs:

  • gfx906 target IDs are published as:
    • gfx906:sramecc-:xnack-
    • gfx906:sramecc-:xnack+
  • sramecc not available on gfx906 in this target model.
  • xnack is compiler-visible and relevant for demand-paging/page-migration behavior.
  • wavefrontsize64 is the relevant mode for this generation.
  • Current LLVM gfx906 assembler syntax docs list v_dot* instructions (for example v_dot2_f32_f16, v_dot4_i32_i8, v_dot8_i32_i4).
  • v_mfma* instructions are not listed on the LLVM gfx906 instruction page; they are listed on newer targets (for example gfx908 docs).

Optimization implication:

  • Build artifacts must match the intended XNACK mode (xnack- vs xnack+) for predictable paging/fault behavior and performance.
  • Prefer instruction paths actually listed for gfx906 (v_dot* and standard vector paths), and do not assume MFMA availability on MI50/MI60.

6) Deep-Learning Instruction Path (gfx906-safe view)

From LLVM gfx906 assembler docs and AMD launch material:

  • gfx906 shows mixed-precision/dot instruction forms (v_dot*) in LLVM assembler syntax docs.
  • AMD launch material describes optimized deep-learning operations (DLOPS) for MI50/MI60.
  • MFMA (v_mfma*) should not be assumed for gfx906 based on current per-target LLVM docs.

Optimization implication:

  • For MI50/MI60, prioritize kernels that exploit documented gfx906 dot/mixed-precision instruction paths plus efficient memory behavior.

7) Practical Optimization Baseline Checklist

Use this as the default starting point for kernel tuning on MI50/MI60:

  1. Target compile: Use --offload-arch=gfx906:xnack- or gfx906:xnack+ explicitly (do not leave ambiguous across environments).
  2. Launch geometry: Workgroup sizes in multiples of 64. Sweep occupancy-sensitive block sizes while watching VGPR/LDS pressure.
  3. Register/LDS budget: Keep LDS layouts bank-friendly (avoid many lanes hitting same bank). Track whether VGPR or LDS is the first occupancy limiter.
  4. Memory behavior: Coalesce global accesses for L1/L2 efficiency. Prefer data reuse in LDS where bank conflicts remain controlled.
  5. Multi-GPU: Verify actual IF-link topology; optimize collectives/partitioning for P2P when present.
  6. Math path: Prefer gfx906-documented dot/mixed-precision paths (v_dot*) and avoid assuming MFMA availability.

8) References (Primary Sources)

  • AMD ROCm GPU architecture specs (Instinct table):
    https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html
  • AMD ROCm HIP hardware implementation:
    https://rocm.docs.amd.com/projects/HIP/en/latest/understand/hardware_implementation.html
  • LLVM AMDGPU usage/reference (target features, restrictions, target IDs):
    https://llvm.org/docs/AMDGPUUsage.html
  • LLVM gfx906 instruction syntax (per-target assembler reference):
    https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
  • LLVM gfx908 instruction syntax (contrast target showing MFMA forms):
    https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX908.html
  • AMD IR launch release (Nov 6, 2018), MI60/MI50: https://ir.amd.com/news-events/press-releases/detail/859/amd-unveils-worlds-first-7nm-datacenter-gpus----powering-the-next-era-of-artificial-intelligence-cloud-computing-and-high-performance-computing-hpc
  • AMD Vega 7nm Shader ISA PDF: https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf

gfx906 dot4/dot8 Exploration (2026-02-21)

This page documents how dot4/dot8 instructions behave on gfx906 (MI50/MI60 class), including semantics, limits, theoretical TOPS, and real measurements.

1) Instruction Mapping and Semantics

Primary source:

  • LLVM AMDGPU usage/reference: https://llvm.org/docs/AMDGPUUsage.html

Mapped intrinsics:

  • llvm.amdgcn.sdot4 -> v_dot4_i32_i8
  • llvm.amdgcn.udot4 -> v_dot4_u32_u8
  • llvm.amdgcn.sdot8 -> v_dot8_i32_i4
  • llvm.amdgcn.udot8 -> v_dot8_u32_u4

Semantics:

  • dot4 uses two packed i32 operands that each hold 4x8-bit values.
  • dot8 uses two packed i32 operands that each hold 8x4-bit values.
  • Both add into a 32-bit accumulator (src2).
  • Fourth intrinsic operand is clamp enable (i1).
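A minimal HIP sketch of these semantics (kernel name and data layout are illustrative; the clamp operand must be a compile-time constant, left disabled here for wrapping behavior):

```cpp
#include <hip/hip_runtime.h>

// dot4: each i32 operand packs 4 signed 8-bit values; the sum of products is
// added to the 32-bit accumulator src2. The last operand is the clamp enable.
__global__ void sdot4_demo(const int* __restrict__ a_packed,
                           const int* __restrict__ b_packed,
                           int* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int acc = 0;
    acc = __builtin_amdgcn_sdot4(a_packed[i], b_packed[i], acc, false);  // wrapping accumulate
    out[i] = acc;
}
```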

Per-target syntax confirms availability on gfx906:

  • https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html

Contrast:

  • v_mfma* is not listed on gfx906 syntax page (but appears on gfx908):
    https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX908.html

2) Codegen Validation on Real Host

Host:

  • fox@192.168.1.28 (ROCm installed)
  • rocminfo: 4x gfx906, each 60 CUs, 1725 MHz

Direct compile test:

  • clang -target amdgcn-amd-amdhsa -mcpu=gfx906 -nogpulib -O3 -S

Observed lowering:

  • __builtin_amdgcn_sdot4 -> v_dot4_i32_i8
  • __builtin_amdgcn_udot4 -> v_dot4_u32_u8
  • __builtin_amdgcn_sdot8 -> v_dot8_i32_i4
  • __builtin_amdgcn_udot8 -> v_dot8_u32_u4
  • Setting the clamp flag emits the instruction with the clamp modifier.

3) Clamp and Overflow Behavior (Measured)

Measured with small HIP kernels on gfx906:

  • sdot4 positive overflow:
    • no clamp: wraps
    • clamp: saturates to INT_MAX (0x7fffffff)
  • sdot4 negative overflow:
    • no clamp: wraps
    • clamp: saturates to INT_MIN (0x80000000)
  • udot4 overflow:
    • no clamp: wraps
    • clamp: saturates to UINT_MAX (0xffffffff)
  • sdot8 overflow-ish case:
    • no clamp: wraps
    • clamp: saturates to INT_MAX

Takeaway:

  • Accumulator is 32-bit and can overflow.
  • Use clamp when saturating behavior is required.

4) Theoretical Throughput (MI50 config from host)

Using measured host-reported config (60 CUs @ 1725 MHz):

  • dot4 theoretical:
    • 26.496 TMAC/s
    • 52.992 TOPS (counting MAC as 2 ops)
  • dot8 theoretical:
    • 52.992 TMAC/s
    • 105.984 TOPS (counting MAC as 2 ops)

Formula used:

  • TMAC/s = CU * 64 lanes * MACs_per_instruction * clock
  • TOPS = 2 * TMAC/s
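Worked out for the dot4 case with the host-reported configuration (60 CUs, 64 lanes, 4 MACs per instruction, 1.725 GHz):

```latex
\mathrm{TMAC/s} = 60 \times 64 \times 4 \times 1.725\,\mathrm{GHz} = 26.496\ \mathrm{TMAC/s},
\qquad
\mathrm{TOPS} = 2 \times 26.496 = 52.992
```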

5) Real Throughput Measurements (All 4 GPUs)

Benchmark A: dependency-chained accumulator

  • blocks=2048, threads=256, iters=65536
  • Across all 4 cards:
    • sdot4: ~21.7 to 22.3 TOPS
    • udot4: ~22.25 to 22.63 TOPS
    • sdot8: ~43.5 to 44.4 TOPS
    • udot8: ~44.5 to 44.6 TOPS

Benchmark B: ILP4 (4 independent accumulators)

  • same launch geometry
  • Across all 4 cards:
    • sdot4_ilp4: ~43.0 to 44.4 TOPS
    • sdot8_ilp4: ~85.3 to 86.2 TOPS

Interpretation:

  • dot8 is ~2x dot4 throughput in both patterns.
  • ILP materially improves achieved throughput by reducing dependency stalls.
  • ILP4 results are roughly 81% of the simple theoretical peak.

6) Practical Optimization Guidance

  1. Use dot8 when quantization/layout supports 4-bit packing; it delivers about 2x dot4 arithmetic density.
  2. Keep multiple independent accumulators per thread to reduce dependency throttling.
  3. Track 32-bit accumulator range; enable clamp where saturation is needed.
  4. On gfx906, optimize around v_dot* and memory behavior; do not assume MFMA.
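A minimal HIP sketch of the independent-accumulator pattern from item 2 (illustrative kernel, not the benchmark source):

```cpp
#include <hip/hip_runtime.h>

// Four independent accumulators keep the dot pipeline busy instead of stalling
// on a single dependency chain. Each i32 operand of sdot8 packs 8 signed 4-bit values.
__global__ void sdot8_ilp4(const int* __restrict__ a, const int* __restrict__ b,
                           int* __restrict__ out, int n_packed) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i + 3 >= n_packed) return;
    int acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    acc0 = __builtin_amdgcn_sdot8(a[i + 0], b[i + 0], acc0, false);
    acc1 = __builtin_amdgcn_sdot8(a[i + 1], b[i + 1], acc1, false);
    acc2 = __builtin_amdgcn_sdot8(a[i + 2], b[i + 2], acc2, false);
    acc3 = __builtin_amdgcn_sdot8(a[i + 3], b[i + 3], acc3, false);
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc0 + acc1 + acc2 + acc3;
}
```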

References

  • LLVM AMDGPU usage/reference:
    https://llvm.org/docs/AMDGPUUsage.html
  • LLVM gfx906 instruction syntax:
    https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
  • LLVM gfx908 instruction syntax (contrast):
    https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX908.html
  • ROCm GPU architecture specs:
    https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html

gfx906 Special ISA for Quant/Dequant (MI50/MI60)

This note focuses on gfx906 instructions that are especially useful for quantization/dequantization and data movement pipelines.

Verified instruction support on gfx906

I validated support with llvm-mc -triple=amdgcn-amd-amdhsa -mcpu=gfx906.

| Instruction | Status on gfx906 | Why it matters |
|---|---|---|
| v_dot4_i32_i8 | supported | int8x4 dot-accumulate |
| v_dot8_i32_i4 | supported | int4x8 dot-accumulate |
| v_dot2_f32_f16 | supported | fp16x2 dot into fp32 |
| v_dot4c_i32_i8 | not supported | cannot rely on dot4c lowering |
| v_dot8c_i32_i4 | not supported | cannot rely on dot8c lowering |
| v_pack_b32_f16 | supported | pack 2xf16 into one dword |
| v_cvt_pkrtz_f16_f32 | supported | direct pack+convert f32->2xf16 |
| v_pk_add_f16 / v_pk_mul_f16 / v_pk_fma_f16 | supported | packed fp16 math (2 lanes/op) |
| v_mov_b32_dpp | supported | wave-lane rearrange without LDS |
| ds_bpermute_b32 / ds_permute_b32 | supported | lane gather/scatter style exchange |
| v_perm_b32 | supported | byte permutation within registers |
| v_bfe_i32 | supported | fast nibble/bitfield extraction |
| v_lshl_or_b32 | supported | pack/insert bits efficiently |
| SDWA forms (*_sdwa) | supported | byte/word select in ALU/convert ops |

Complete SDWA variant sweep on gfx906

  • I extracted all v_*_sdwa mnemonics from LLVM GFX9 syntax docs (AMDGPUAsmGFX9).
  • Total mnemonics found: 239.
  • I assembled each mnemonic with llvm-mc -triple=amdgcn-amd-amdhsa -mcpu=gfx906 using a multi-template operand probe.
  • Result: 239/239 assembled successfully, 0 unsupported, 0 unresolved.
  • Runtime spot checks on hardware passed for representative SDWA ops:
    • v_cvt_f32_i32_sdwa
    • v_add_u32_sdwa

Interpretation: all documented GFX9 SDWA opcode variants are available on gfx906 at instruction level. This is opcode-availability coverage, not an exhaustive test of every legal modifier combination.

What the compiler emitted in real qdq kernels

Built and disassembled HIP kernels on real gfx906 (hipcc -O3 --offload-arch=gfx906 -S).

  • FP32 -> INT8 pack4 path emitted:
    • v_rndne_f32, v_cvt_i32_f32, v_med3_i32 (saturating clamp to [-128,127])
    • v_lshlrev_b32, v_perm_b32, v_or3_b32 (packing)
  • INT8 unpack + dequant path emitted:
    • v_cvt_f32_i32_sdwa ... src0_sel:BYTE_{0..3} (byte extract + sign-extend + convert)
  • INT4 unpack + dequant path emitted:
    • v_bfe_i32 for nibble extraction + sign extension, then v_cvt_f32_i32
  • Wave shuffle path (__shfl_xor) emitted:
    • ds_bpermute_b32
  • Packed fp16 math path emitted:
    • v_pk_fma_f16
  • FP32 -> packed fp16 storage path emitted:
    • v_cvt_f16_f32 + v_pack_b32_f16
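For context, a HIP-level sketch of the kind of int8 pack/unpack source that tends to produce this style of codegen (helper names and the fixed-scale scheme are illustrative, not the exact kernels disassembled above):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// Quantize 4 floats to int8 and pack into one dword; the compiler can lower the
// clamp to v_med3_i32 and the packing to shift/perm/or sequences.
__device__ inline int pack4_int8(float4 v, float inv_scale) {
    int q0 = max(-128, min(127, __float2int_rn(v.x * inv_scale)));
    int q1 = max(-128, min(127, __float2int_rn(v.y * inv_scale)));
    int q2 = max(-128, min(127, __float2int_rn(v.z * inv_scale)));
    int q3 = max(-128, min(127, __float2int_rn(v.w * inv_scale)));
    return (q0 & 0xff) | ((q1 & 0xff) << 8) | ((q2 & 0xff) << 16) | ((q3 & 0xff) << 24);
}

// Reverse path: byte extract + sign-extend + convert, typically lowered to
// v_cvt_f32_i32_sdwa byte selects on gfx906.
__device__ inline float4 unpack4_dequant(int packed, float scale) {
    float4 r;
    r.x = float(int8_t(packed      )) * scale;  // byte 0
    r.y = float(int8_t(packed >>  8)) * scale;  // byte 1
    r.z = float(int8_t(packed >> 16)) * scale;  // byte 2
    r.w = float(int8_t(packed >> 24)) * scale;  // byte 3
    return r;
}
```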

High-value instruction families for qdq work

  1. Dot instructions (v_dot4_*, v_dot8_*, v_dot2_f32_f16)
  • Use when data is already packed/quantized (or conversion cost is amortized).
  2. SDWA instructions (*_sdwa)
  • Best for byte/word extraction directly inside an ALU/convert op (helps i8 dequant).
  3. Bitfield/pack ops (v_bfe_*, v_lshl_or_b32, v_perm_b32, shifts/ands)
  • Core tools for nibble/byte unpack and repack (especially int4/int8 layouts).
  4. Packed fp16 ops (v_pack_b32_f16, v_cvt_pkrtz_f16_f32, v_pk_*_f16)
  • Useful bridge path when dequantizing into fp16 or doing fp16 pre/post transforms.
  5. Wave data movement (v_mov_b32_dpp, ds_bpermute_b32, ds_permute_b32)
  • Useful for lane remap/reorder without global memory traffic.

Practical limits and caveats

  • dot4c/dot8c are not available on gfx906; only use dot4/dot8 forms.
  • gfx906 dot instructions are available, but v_mfma* instructions are not listed for this target.
  • SDWA selects byte/word sublanes (BYTE_0..3, WORD_0..1, DWORD), not arbitrary bitfields.
  • DPP/DS lane ops are wave-level operations; they are not global cross-wave data movement.
  • clamp behavior matters for integer dot/arith overflow paths; enable only when required.


gfx906 Latency-Hiding Ops (Measured)

This note summarizes instruction patterns on gfx906 (MI50/MI60) that are most useful for hiding latency in quant/dequant-style kernels.

Scope

Focus is on:

  • wave-lane exchange (DPP, DS permute)
  • LDS width (b32/b64/b128)
  • global load width (dword vs dwordx4)
  • scheduling behavior (s_waitcnt placement)

All kernels were compiled for --offload-arch=gfx906 and validated with emitted ISA.

Key measured findings

1) Use DPP first for row-local shuffles

v_mov_b32_dpp row shift (row_shr:1) vs LDS+barrier equivalent:

  • dpp_row_shr: ~1778 to 1784 Gxchg/s
  • lds_row_shr: ~906 Gxchg/s

Takeaway: for row-local lane movement, DPP gives about 2x the throughput and removes barrier overhead.

2) Use ds_bpermute_b32 for general in-wave exchange

XOR-neighbor exchange benchmark:

  • ds_bpermute_b32: ~962 to 970 Gxchg/s
  • LDS store+load+barriers equivalent: ~905 to 907 Gxchg/s

Takeaway: ds_bpermute_b32 is consistently better than LDS exchange when shuffle pattern is not DPP-friendly.
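In HIP source this path is normally reached through the warp shuffle intrinsics rather than hand-written DS ops; a minimal sketch of the XOR-neighbor exchange (kernel name illustrative):

```cpp
#include <hip/hip_runtime.h>

// Butterfly (XOR-neighbor) exchange within a 64-lane wave. On gfx906 the
// __shfl_xor intrinsic lowers to ds_bpermute_b32 (see the qdq codegen notes).
__global__ void xor_neighbor_sum(const float* __restrict__ in,
                                 float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    // Exchange with the lane whose ID differs in bit 0, without touching LDS.
    float partner = __shfl_xor(v, 1, 64);
    out[i] = v + partner;
}
```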

3) Prefer wide LDS ops for staging

Pure LDS streaming kernels (instruction forms confirmed in ISA):

  • ds_read/write_b32 (l1): typically ~1.9 to 3.9 TB/s
  • ds_read/write_b64 (l2): typically ~4.3 to 8.8 TB/s
  • ds_read/write_b128 (l4): typically ~9.5 to 11.2 TB/s

Takeaway: b128 LDS accesses are the strongest baseline for LDS-heavy staging paths.
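A minimal staging sketch: expressing the LDS tile in 16-byte elements is usually enough for the compiler to emit ds_write_b128/ds_read_b128, assuming aligned indexing (illustrative only):

```cpp
#include <hip/hip_runtime.h>

// float4 LDS traffic typically lowers to ds_write_b128 / ds_read_b128 on gfx906.
__global__ void lds_stage_b128(const float4* __restrict__ in,
                               float4* __restrict__ out) {
    __shared__ float4 tile[256];                 // 4 KiB of LDS per block
    int t = threadIdx.x;                         // launched with 256 threads per block
    tile[t] = in[blockIdx.x * blockDim.x + t];   // 16-byte LDS write
    __syncthreads();
    out[blockIdx.x * blockDim.x + t] = tile[t];  // 16-byte LDS read
}
```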

4) Wide global loads help when memory path is healthy

Compiler emits:

  • scalar path: global_load_dword
  • vector path: global_load_dwordx4

In uncongested runs, dwordx4 outperformed scalar (~867-873 GB/s vs ~814 GB/s). On a shared machine, global-memory numbers varied run-to-run; treat this as directionally positive, not a fixed constant.

Scheduling behavior that matters

In ILP kernels, compiler issues multiple loads first and delays waits:

  • VMEM: staged s_waitcnt vmcnt(3..0)
  • LDS: staged s_waitcnt lgkmcnt(...)

That pattern is the core latency-hiding mechanism on gfx906: keep multiple memory operations in flight before consuming results.

What is not available on gfx906 (relevant to hiding)

Assembler probes on gfx906 rejected:

  • s_clause
  • s_waitcnt_depctr
  • s_delay_alu

So latency control is mainly through ILP/wave occupancy and careful s_waitcnt timing, not newer explicit dependency-control instructions.

Practical checklist

  1. Row-local shuffle: use v_mov_b32_dpp.
  2. Arbitrary in-wave shuffle: use ds_bpermute_b32 / ds_permute_b32.
  3. LDS staging: default to ds_read/write_b128 where alignment allows.
  4. Global staging: prefer global_load_dwordx4 for contiguous packed data.
  5. Structure loops to issue multiple independent loads before first use.
  6. Avoid immediate waits after each load; let compiler keep VMEM/LDS queues populated.
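A sketch combining items 4-6: independent dwordx4 loads are issued back to back so the compiler can stage s_waitcnt rather than waiting after each load (illustrative kernel, not a tuned one):

```cpp
#include <hip/hip_runtime.h>

// Issue several independent global loads before the first use, keeping multiple
// VMEM operations in flight (staged s_waitcnt vmcnt(...)).
__global__ void sum4_pipelined(const float4* __restrict__ in,
                               float* __restrict__ out, int n_vec4) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i + 3 >= n_vec4) return;
    // Four independent global_load_dwordx4 issued back to back.
    float4 a = in[i + 0];
    float4 b = in[i + 1];
    float4 c = in[i + 2];
    float4 d = in[i + 3];
    // First use happens only after all loads have been issued.
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        (a.x + a.y + a.z + a.w) + (b.x + b.y + b.z + b.w) +
        (c.x + c.y + c.z + c.w) + (d.x + d.y + d.z + d.w);
}
```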


gfx906 LDS Layout Standard for LLM Blocks

This note defines a practical LDS layout standard for gfx906 kernels (GEMM/attention-style tiles) based on measured bank-pressure behavior.

Why this matters

Many LLM kernels read one operand row-wise and another operand effectively column-wise from LDS. On gfx906, column-style accesses can collapse bandwidth if row stride aliases LDS banks.

Measured result (key experiment)

Microbenchmark on real gfx906 using ds_read_b128/ds_write_b128:

  • contiguous vec4 access baseline: ~4257 GB/s
  • column-style access with ld=32 vec4: ~1865 GB/s
  • same column-style access with ld=33 vec4 padding: ~3974 GB/s

Interpretation:

  • ld=32 (power-of-two stride) is a bad default for column-like LDS reads.
  • adding one vec4 of padding per row (ld=33) recovers most bandwidth.

Instruction forms confirmed

Disassembly for all variants used:

  • ds_write_b128
  • ds_read_b128

So the improvement is layout/bank behavior, not a different opcode path.

Layout standard for gfx906

  1. Use 16-byte vectorized LDS payloads (uint4/float4/packed int blocks).
  2. Keep base LDS buffers 16-byte aligned.
  3. For tiles consumed row-wise only: use natural row stride.
  4. For tiles that will be consumed column-wise (or transposed access), use padded leading dimension in vec4 units: ld_vec = logical_ld_vec + 1.
  5. Prefer ds_read/write_b128 staging paths over scalar LDS traffic.
  • A-like operand (row-consumed): no pad needed.
  • B-like operand (column-consumed by waves): +1 vec4 pad per row.
  • If LDS budget is tight, test +1 first before more complex swizzles.

Practical formula

If a row has K_vec vec4 elements, allocate:

  • stride_vec = K_vec for row-only reads
  • stride_vec = K_vec + 1 for column-like reuse

LDS footprint increase is modest (~1/K_vec fractional overhead) and often worth it.
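A sketch of the +1 rule applied to a column-consumed (B-like) square tile; sizes, names, and indexing are illustrative rather than a tuned kernel:

```cpp
#include <hip/hip_runtime.h>

constexpr int ROWS   = 32;            // tile rows
constexpr int K_VEC  = 32;            // logical row length in float4 units
constexpr int LD_VEC = K_VEC + 1;     // padded leading dimension (the "+1 rule")

// Row-wise fill, column-wise (transposed) consumption. The extra vec4 per row
// breaks the ld=32 power-of-two bank aliasing measured above.
__global__ void padded_tile_demo(const float4* __restrict__ in,
                                 float4* __restrict__ out) {
    __shared__ float4 tileB[ROWS * LD_VEC];      // 32 * 33 * 16 B ~= 16.5 KiB
    int t   = threadIdx.x;                       // launched with ROWS * K_VEC = 1024 threads
    int row = t / K_VEC;
    int col = t % K_VEC;
    tileB[row * LD_VEC + col] = in[blockIdx.x * ROWS * K_VEC + t];  // ds_write_b128
    __syncthreads();
    // Column-style read: consecutive lanes hit addresses LD_VEC apart, so the
    // padded stride avoids the bank aliasing seen with ld=32.
    out[blockIdx.x * ROWS * K_VEC + t] = tileB[col * LD_VEC + row]; // ds_read_b128
}
```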

Caveat

Numbers come from controlled microbenchmarks; final kernel gains depend on occupancy, VMEM pressure, and math mix. Still, this +1 rule is a strong first choice on gfx906.


gfx906 KV-Cache Read/Write Kernel Study

This note benchmarks KV-cache layouts on gfx906 for decode-like kernels.

Layouts tested

  • HSD: [head][seq][dim] (dim contiguous inside sequence position)
  • HDS: [head][dim][seq] (seq contiguous for each dim lane)

Measured write behavior (new-token update)

Measured on real gfx906 (float cache, dim=128, heads=32, seq=4096):

  • write_hsd_x4: ~357.6 GB/s
  • write_hsd_x1: ~357.6 GB/s
  • write_hds_x4: ~54.4 GB/s
  • write_hds_x1: ~14.0 GB/s

Takeaway:

  • For decode token writes, HSD is dramatically better than HDS.
  • HDS writes are highly strided and expensive.

Measured read behavior depends on traversal pattern

A) Dot-style decode traversal (per-seq dot over dim)

Kernel pattern: each block handles one (head, seq) row and threads span dim.

  • read_dot_hsd_x4: ~1.76 TB/s
  • read_dot_hds_x4: ~0.37 TB/s

Takeaway:

  • For attention-score style decode reads, HSD is the right layout.

B) Dim-fixed streaming over seq

Kernel pattern: each thread keeps fixed dim lane and streams seq.

  • read_hsd_x1: ~45.0 GB/s
  • read_hsd_x4: ~41.5 GB/s
  • read_hds_x4: ~73.7 GB/s

Takeaway:

  • If the kernel is explicitly dim-fixed streaming over sequence, HDS can be better.

ISA mapping confirmed

Disassembly confirms expected vector paths:

  • scalar read/write: global_load_dword, global_store_dword
  • vector read/write: global_load_dwordx4, global_store_dwordx4

Recommended defaults

  1. Keep the canonical KV layout as HSD ([head][seq][dim]).
  2. Use x4 vectorized loads/stores when naturally aligned.
  3. Optimize decode math kernels around dot-style traversal (per-seq rows), where HSD is strong.
  4. Only use HDS when a specific kernel is dim-fixed seq-streaming and dominates runtime.
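A sketch of the HSD indexing and the decode-time token write these recommendations assume (names and shapes are illustrative):

```cpp
#include <hip/hip_runtime.h>

// HSD layout: [head][seq][dim], with dim contiguous in memory.
__device__ __forceinline__ size_t kv_index_hsd(int head, int seq, int dim,
                                               int max_seq, int head_dim) {
    return (size_t(head) * max_seq + seq) * head_dim + dim;
}

// Decode-time write of one new token's K vector: consecutive threads write
// consecutive dims, so the stores coalesce (and can vectorize to dwordx4).
__global__ void write_new_token_hsd(float* __restrict__ k_cache,
                                    const float* __restrict__ k_new,
                                    int seq_pos, int max_seq, int head_dim) {
    int head = blockIdx.x;        // one block per head
    int dim  = threadIdx.x;       // head_dim threads per block
    k_cache[kv_index_hsd(head, seq_pos, dim, max_seq, head_dim)] =
        k_new[head * head_dim + dim];
}
```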

Practical implication

For general-purpose LLM kernels, optimizing around HSD gives strong read+write balance. HDS is a specialized alternative, not a universal default.


FP32 vs Quant-Dequant + Dot on gfx906 (Measured)

Date: 2026-02-21
Hardware: AMD Instinct MI50/MI60 (gfx906), 60 CUs, 1725 MHz
Host: fox@192.168.1.28 (ROCm installed)

Question:

  • If activations start in FP32, is it worth quantizing/dequantizing them on gfx906 to compute with dot4/dot2, or is pure FP32 better?

Experiment Setup

Three paths were benchmarked on-device with HIP:

  1. pure_fp32
  • FP32 values stay FP32.
  • Compute via FP32 FMA only.
  2. qdq_int8_dot4
  • In the kernel hot loop: FP32 activation -> INT8 quantize (pack) -> __builtin_amdgcn_sdot4 -> dequantize.
  3. qdq_fp16_dot2
  • In the kernel hot loop: FP32 activation -> FP16 conversion -> __builtin_amdgcn_fdot2.
All paths were normalized to the same arithmetic payload per loop iteration (8 MACs/thread/iter), and reported as effective TOPS (counting MAC as 2 ops).
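To make the contrast concrete, a sketch of the per-iteration work in the pure FP32 and int8-QDQ paths (simplified: fixed scales, no data reuse; not the benchmark source):

```cpp
#include <hip/hip_runtime.h>

// pure_fp32: one FMA per MAC, no conversions.
__device__ inline float fp32_step(float a, float b, float acc) {
    return fmaf(a, b, acc);
}

// qdq_int8_dot4: quantize + pack both operands, dot, dequantize -- every iteration.
__device__ inline float qdq_int8_step(float4 a, float4 b, float acc,
                                      float scale_a, float scale_b) {
    int pa = 0, pb = 0;
    const float ia = 1.0f / scale_a, ib = 1.0f / scale_b;
    float av[4] = {a.x, a.y, a.z, a.w};
    float bv[4] = {b.x, b.y, b.z, b.w};
    for (int k = 0; k < 4; ++k) {
        pa |= (max(-128, min(127, __float2int_rn(av[k] * ia))) & 0xff) << (8 * k);
        pb |= (max(-128, min(127, __float2int_rn(bv[k] * ib))) & 0xff) << (8 * k);
    }
    int dot = __builtin_amdgcn_sdot4(pa, pb, 0, false);  // 4 int8 MACs
    return acc + float(dot) * scale_a * scale_b;          // dequantize result
}
```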

Core Result (On-the-Fly QDQ in Hot Loop)

Stable best results across cards (after reruns):

  • pure_fp32: ~5.95 TOPS
  • qdq_fp16_dot2: ~4.19 TOPS
  • qdq_int8_dot4: ~2.00 TOPS

Conclusion for this scenario:

  • When activations start as FP32 and conversion is done in the hot loop, pure FP32 wins.
  • dot2 is slower than FP32.
  • dot4 is much slower than FP32.

Amortized Conversion Check (Conversion Once, Reuse Many Times)

A second benchmark converted once outside the hot loop, then reused converted values:

  • fp32_reuse: ~13.0 TOPS
  • dot4_reuse: ~21.7 TOPS
  • dot2_reuse: ~21.9 TOPS

Interpretation:

  • If conversion cost is amortized by reuse (GEMM-like behavior), dot paths can outperform pure FP32.
  • If conversion/deconversion is paid every use, they do not.

Practical Recommendation

  1. For per-use FP32 activations:
  • Use pure FP32 on gfx906.
  2. For high-reuse kernels (where conversion is amortized):
  • Dot paths (dot4/dot2) can be worthwhile.
  • Optimize for reuse depth before deciding.
  3. Do not rely on theoretical dot throughput alone:
  • End-to-end cost is dominated by conversion/packing when done in the hot path.

Instruction Validation Notes

Codegen validation on gfx906:

  • __builtin_amdgcn_sdot4 -> v_dot4_i32_i8
  • __builtin_amdgcn_fdot2 -> v_dot2_f32_f16

Related references:

  • LLVM AMDGPU usage/reference: https://llvm.org/docs/AMDGPUUsage.html
  • LLVM gfx906 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html

Contributing

Thank you for considering contributing to the GFX906 Wiki! This guide outlines how you can help improve the documentation.

Easy Contribution

If you want to simply add or edit some markdown files without touching the console, follow this method.

To edit a markdown file on the wiki:

  1. Open it on GitHub and click the "edit" button. You will be prompted to create a fork if you don't already have one; do so.

  2. Make your edits and be sure to keep the markdown clean as outlined in Making Changes.

  3. Commit and repeat this process until you are done.

  4. Submit a pull request to the "main" branch.

To add your own existing markdown files:

  1. Click the Fork button on the top right of the repository page.

  2. Click "Add file" on the main page of your fork. Click "Upload Files". Make sure your files contain clean markdown. See Making Changes.

  3. Upload all files you want to add and click "Commit Changes". See Commit Messages.

  4. Update the SUMMARY.md file following the process above to add links to your files to the sidebar. Please try to respect the existing structure of the sidebar.

  5. Submit a pull request to the "main" branch.

Local deployment

  1. Fork the repository
    Click the Fork button on the top right of the repository page.

  2. Clone your fork

    git clone https://github.com/<your-username>/wiki-gfx906.git
    cd wiki-gfx906
    
  3. Create a feature branch

    git checkout -b <branch-name>
    
  4. Install mdBook (if you haven’t already)

    cargo install mdbook
    
  5. Build and preview locally

    mdbook serve
    

    Open http://localhost:3000 in your browser to see your changes live.

Making Changes

  • Add or edit content in the src/ directory.
  • Update SUMMARY.md to include any new pages you add.
  • Keep markdown clean and consistent:
    • Use headings (#, ##, …) to structure sections.
    • Prefer fenced code blocks with language identifiers.
    • Use relative links for internal navigation.

Commit Messages

Write clear, concise commit messages. Follow this format:

<type>: <short description>

<optional longer description>

Common <type> values:

  • docs: documentation updates
  • fix: typo or small correction
  • feat: new page or major addition

Pull Request Process

  1. Push your branch to your fork:

    git push origin <branch-name>
    
  2. Open a Pull Request against the main branch of the upstream repository.

    • Provide a descriptive title and summary of changes.
    • Link to any relevant issue(s) (e.g., Closes #42).
  3. Review – maintainers will review your PR. Respond to feedback promptly.

Code of Conduct

We expect all contributors to behave respectfully. Harassment and discrimination of any kind will not be tolerated. See the CODE_OF_CONDUCT.md for details.

License

By contributing you agree that your contributions will be licensed under the same license as the project (see LICENSE).


Happy documenting! 🎉

How to install ROCm 7.x on Linux

The steps outlined here are based on this thread. There is also a video guide for Ubuntu 24.04 LTS here.

Approach

AMD has stopped shipping the tensor files for gfx906 with newer ROCm releases, even though the hardware is still compatible. The workaround is simple: add the missing tensor files back yourself.

ROCm Quick install

  1. Go to https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html and copy & paste the outlined commands.

  2. During the installation, you may be prompted to add a key if you have secure boot enabled.

  3. After completing the install, do NOT reboot yet.

Getting the missing tensor files

The missing tensor files can be found in the Arch Linux rocblas package (https://archlinux.org/packages/extra/x86_64/rocblas/). Despite being built for ROCm 6.4, they work:

  1. Download the rocblas package: https://archlinux.org/packages/extra/x86_64/rocblas/download/

  2. Go to the location you downloaded it to and extract it:

cd Downloads/ && tar -xf rocblas-6.4.4-1-x86_64.pkg.tar.zst

  3. There should now be two folders, opt/ and usr/.

  4. Copy all the files containing the string "gfx906" to /opt/rocm/lib/rocblas/library (sudo privileges required):

sudo cp opt/rocm/lib/rocblas/library/*gfx906* /opt/rocm/lib/rocblas/library

  5. Now reboot.

  6. If you enrolled a key for secure boot, you will get a blue screen with some options. Select "Enroll MOK" and type in the password you assigned earlier.

  7. Check if it worked by running sudo update-alternatives --display rocm

Post-install (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/post-install.html)

  1. Configure the system linker by specifying where to find the shared objects (.so files) for ROCm applications:
sudo tee --append /etc/ld.so.conf.d/rocm.conf <<EOF
/opt/rocm/lib
/opt/rocm/lib64
EOF
sudo ldconfig
  2. Add the paths to your ~/.bashrc:
echo 'export PATH=$PATH:/opt/rocm-7.2.0/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/opt/rocm-7.2.0/lib' >> ~/.bashrc
source ~/.bashrc

That's it, enjoy!

How to install ComfyUI on Linux

The following instructions are aimed at Ubuntu 24.04 LTS using ROCm 7.2. If you are using a different distro, I recommend uv and a Python 3.12 virtual environment, i.e. uv venv venv --python 3.12

Install

  1. Clone the repository:
git clone https://github.com/comfyanonymous/ComfyUI.git
  2. Change to the ComfyUI directory and create a Python virtual environment:
cd ComfyUI
python3 -m venv venv
  3. Activate the virtual environment:
source venv/bin/activate
  4. Update pip:
pip install --upgrade pip wheel setuptools
  5. Install the PyTorch wheels; you can experiment with different versions for more stability or newer features:
pip install torch torchvision torchaudio triton --index-url https://download.pytorch.org/whl/rocm7.1
  6. Install the requirements for ComfyUI:
pip install -r requirements.txt

Verify it

  1. Run python3 main.py to check that it installed properly; you can exit afterwards, as there are still a few more steps.

Creating a script to make it easier to run

  1. Use your favourite text editor to create the script, with whatever name and location you like, e.g.
cd ~
nano run-comfyui.sh
  2. Insert the following, adjusting the ROCm environment paths accordingly:
#!/bin/bash
export PATH=$PATH:/opt/rocm-7.2.0/bin
export LD_LIBRARY_PATH=/opt/rocm-7.2.0/lib

cd ComfyUI
source venv/bin/activate
python3 main.py --use-split-cross-attention --disable-smart-memory --front-end-version Comfy-Org/ComfyUI_frontend@latest
  3. Make the script executable:
chmod +x run-comfyui.sh

Note:

These are the parameters that seem to work best for Z-Image Turbo, but more testing is needed, including with other models. You may come across many other environment variables, but I haven't seen any perceptible difference from them on gfx906. Feel free to remove --front-end-version if you experience problems with the latest frontend version.

Fixing missing ROCm environment paths:

If for any reason you get a missing tensor files error in ComfyUI, please check the "installing_ROCm_7.x" guide to obtain them. If you still encounter the error, it means the environment is not set properly. You can also manually add the files to ComfyUI:

sudo cp ~/rocblas7.1/rocblas/*gfx906* ~/ComfyUI/venv/lib/python3.12/site-packages/torch/lib/rocblas/library/
To install the ComfyUI-Manager extension (optional):

cd ComfyUI/custom_nodes
git clone https://github.com/ltdrdata/ComfyUI-Manager comfyui-manager

That's it, now you can run ComfyUI by running the script ./run-comfyui.sh

Nightly ROCm builds with gfx90x support (including gfx906): https://therock-nightly-tarball.s3.amazonaws.com