Introduction
Welcome to the GFX906 Wiki! This repository serves as a clean, markdown-based documentation hub for the GFX906 project.
In this book you will find:
- An overview of the project and its goals
- Step‑by‑step getting‑started instructions
- Detailed usage guides and reference material
- Contribution guidelines for community members
Feel free to explore the chapters listed in the SUMMARY.md file and start learning about GFX906. Happy reading!
Getting Started
Welcome to the GFX906 Wiki! This guide will help you set up the project locally and start exploring the documentation.
Prerequisites
- Git – to clone the repository.
- Rust (stable) – required for building mdBook.
- mdBook – can be installed via `cargo install mdbook` (the GitHub Actions workflow does this automatically).
Clone the Repository
```sh
git clone https://github.com/yourusername/wiki-gfx906.git
cd wiki-gfx906
```
Build the Book Locally
- Install mdBook if you haven't already: `cargo install mdbook`
- Build the book with `mdbook build`. The generated static site will appear in `book/`.
- Preview it locally with `mdbook serve`, then open http://localhost:3000 in your browser to view the wiki.
Adding New Content
- Create a new markdown file in `src/`, e.g. `touch src/new_section.md`.
- Edit `src/SUMMARY.md` to include the new file: `- [New Section](./new_section.md)`
- Re-run `mdbook build` or `mdbook serve` to see your changes.
Contributing
- Fork the repository.
- Create a feature branch.
- Commit your changes with clear messages.
- Open a Pull Request targeting `main`.
All contributions are welcome! See the Contributing page for detailed guidelines.
Deploy to GitHub Pages
The repository includes a GitHub Actions workflow (`.github/workflows/mdbook.yml`) that automatically builds the book and publishes it to GitHub Pages on every push to `main`. No manual steps are required.
Happy documenting! 🎉
Usage
This page explains how to work with the GFX906 Wiki using mdBook. It covers building the book locally, previewing it, and adding new content.
Prerequisites
- Git – to clone the repository.
- Rust (stable) – required for `cargo` and `mdbook`.
- mdBook – install with `cargo install mdbook`.
Building the Book
From the repository root:
```sh
mdbook build wiki-gfx906
```

The generated static site will be placed in `wiki-gfx906/book/`.
Previewing Locally
Run a local development server that watches for changes:
```sh
mdbook serve wiki-gfx906
```
Open http://localhost:3000 in your browser. The server automatically reloads when you edit markdown files.
Adding New Content
- Create a markdown file in `wiki-gfx906/src/`, e.g. `touch wiki-gfx906/src/advanced_topics.md`.
- Edit `SUMMARY.md` to include the new file. For example, add `- [Advanced Topics](./advanced_topics.md)`.
- Write your content in the new file using standard Markdown syntax.
- Rebuild or serve the book to see the changes.
Common Patterns
Code Blocks
```rust
fn main() {
    println!("Hello, mdBook!");
}
```
Inline Code
Use backticks for inline code.
Links
- Relative link to another chapter: `[Getting Started](./getting_started.md)`
- External link: https://github.com/
Contributing
If you want to contribute improvements:
- Fork the repository.
- Create a feature branch.
- Make your changes.
- Open a Pull Request targeting `main`.
All contributions are welcome. See the Contributing page for detailed guidelines.
Deploying to GitHub Pages
The repository includes a GitHub Actions workflow (`.github/workflows/mdbook.yml`) that automatically:

- Installs `mdbook`.
- Builds the book.
- Deploys the `book/` directory to GitHub Pages on every push to `main`.
No manual steps are required after the initial setup.
Enjoy writing and reading the GFX906 documentation! 🎉
Reference
This reference provides a quick overview of the most important concepts, commands, and configuration options for the GFX906 wiki built with mdBook.
Project Structure
```text
wiki-gfx906/
├── book/                  # Generated static site (output of `mdbook build`)
├── src/                   # Source markdown files
│   ├── SUMMARY.md         # Table of contents for the book
│   ├── intro.md
│   ├── getting_started.md
│   ├── usage.md
│   ├── reference.md       # ← This file
│   └── contributing.md
├── book.toml              # mdBook configuration
└── .github/
    └── workflows/
        └── mdbook.yml     # GitHub Actions CI/CD pipeline
```
mdBook Commands
| Command | Description |
|---|---|
| `mdbook build` | Compile the book into static HTML files under `book/`. |
| `mdbook serve` | Run a local development server (default: http://localhost:3000) that watches for changes. |
| `cargo install mdbook` | Install the `mdbook` binary (required for the above commands). |
Configuration (book.toml)
```toml
[book]
title = "Wiki GFX906"
authors = ["Your Name"]
description = "Documentation and wiki for gfx906 project"
language = "en"
multilingual = false
src = "src"

[output.html]
default-theme = "light"
preferred-dark-theme = "navy"
git-repository-url = "https://github.com/yourusername/wiki-gfx906"
edit-url-template = "https://github.com/yourusername/wiki-gfx906/edit/main/{path}"
```
- `src` – directory that holds the markdown source files.
- `default-theme` / `preferred-dark-theme` – control the visual theme of the generated site.
- `edit-url-template` – enables the "Edit on GitHub" link for each page.
GitHub Actions Workflow (.github/workflows/mdbook.yml)
| Step | Purpose |
|---|---|
| `actions/checkout@v4` | Checks out the repository. |
| `actions/setup-rust@v1` | Installs the stable Rust toolchain. |
| `cargo install mdbook` | Installs `mdbook` on the runner. |
| `mdbook build wiki-gfx906` | Generates the static site. |
| `actions/upload-pages-artifact@v3` | Packages the `book/` directory for deployment. |
| `actions/deploy-pages@v3` | Publishes the artifact to GitHub Pages. |
The workflow triggers on pushes to `main` and can also be started manually via `workflow_dispatch`.
Adding New Content
- Create a Markdown file in `src/` (e.g., `src/new_topic.md`).
- Add an entry in `src/SUMMARY.md`: `- [New Topic](./new_topic.md)`
- Rebuild (`mdbook build`) or serve (`mdbook serve`) to see the changes.
Frequently Asked Questions
- Do I need to run `mdbook build` on every push?
  No. The GitHub Actions workflow automatically builds and deploys on every push to `main`.
- How can I change the theme?
  Edit `default-theme` and `preferred-dark-theme` in `book.toml` and rebuild.
- Where are the generated files stored?
  In the `book/` directory, which is ignored by Git by default.
Contributing Guidelines
Refer to `src/contributing.md` for the full contribution process, commit style, and code of conduct.
Happy documenting! 🎉
Studies (2026-02-21)
Collected exploration notes from February 21, 2026:
- MI50/MI60 architecture baseline
- dot4/dot8 behavior and limits
- special ISA useful for quant/dequant
- latency-hiding operations
- LDS layout standard for LLM blocks
- KV-cache read/write layout study
- FP32 vs QDQ dot study
MI50/MI60 (gfx906) Architecture Baseline
This note captures the highest-value architectural facts I could verify from primary sources for optimization work on:
- Radeon Instinct MI60 (gfx906)
- Radeon Instinct MI50 16GB (gfx906)
- Radeon Instinct MI50 32GB (gfx906)

All three are listed in the current ROCm tables.
Data was cross-checked on February 21, 2026.
1) Ground-Truth SKU/Resource Table (ROCm Hardware Specs)
From AMD ROCm’s Instinct architecture table:
| GPU | LLVM target | VRAM (GiB) | CUs | Wavefront | LDS/CU | L2 | L1 Vector | L1 Scalar | L1 I$ | VGPR file | SGPR file | GFXIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MI60 | gfx906 | 32 | 64 | 64 | 64 KiB | 4 MiB | 16 KiB | 16 KiB / 3 CUs | 32 KiB / 3 CUs | 256 KiB | 12.5 KiB | 9.0 |
| MI50 (32GB) | gfx906 | 32 | 60 | 64 | 64 KiB | 4 MiB | 16 KiB | 16 KiB / 3 CUs | 32 KiB / 3 CUs | 256 KiB | 12.5 KiB | 9.0 |
| MI50 (16GB) | gfx906 | 16 | 60 | 64 | 64 KiB | 4 MiB | 16 KiB | 16 KiB / 3 CUs | 32 KiB / 3 CUs | 256 KiB | 12.5 KiB | 9.0 |
Optimization implication:
- Treat MI60 and MI50 as the same ISA/feature family (`gfx906`), with CU count and memory capacity as the main SKU differences.
2) Launch-Era Product Capabilities (AMD 2018 IR Release)
From AMD’s MI50/MI60 launch release (Nov 6, 2018):
- MI60 peak throughput (theoretical): FP16 29.5 TFLOPS, FP32 14.8 TFLOPS, FP64 7.4 TFLOPS.
- MI50 peak throughput (theoretical): FP16 26.8 TFLOPS, FP32 13.4 TFLOPS, FP64 6.7 TFLOPS.
- Both boards: 300W envelope.
- Memory/interconnect claims: up to 1 TB/s HBM2 bandwidth, MI60 at 32GB HBM2 ECC, MI50 at 16GB HBM2 ECC (launch), dual IF links up to 200 GB/s P2P, and PCIe Gen4 x16 up to 64 GB/s host link bandwidth.
Date clarification:
- The 2018 launch material lists MI50 as 16GB.
- The current ROCm table includes both MI50 16GB and MI50 32GB entries.
3) Compute-Unit and Scheduling Model (HIP Hardware Docs)
ROCm HIP hardware documentation (GCN-oriented model) highlights:
- Wavefront model is 64 lanes for this class of architecture.
- CU execution core is modeled as four SIMD16 vector units.
- Sequencer organization allows up to 40 resident wavefronts per CU (4 pools x up to 10 each), subject to resource limits.
- Per-cycle issue model can include one instruction to each SIMD path, plus scalar/branch/LDS paths.
- Resource constraints that gate occupancy: wave slots, VGPR, SGPR, LDS.
Optimization implication:
- Occupancy is constrained by register and LDS pressure before nominal wave-slot maxima in many real kernels.
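For scale, an illustrative worked example from the table above: the 256 KiB VGPR file per CU spans 4 SIMDs × 64 lanes × 4 bytes, i.e. a budget of 256 VGPRs per lane. A kernel using 128 VGPRs therefore caps at 2 waves per SIMD (8 per CU) against the nominal 40-wave maximum, which is why register pressure usually binds first.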
4) Memory Hierarchy and Data Movement Facts
4.1 Caches/LDS behavior
From HIP hardware docs and Vega 7nm ISA:
- LDS is a software-managed on-CU scratchpad with 32 banks, 4-byte bank width.
- LDS bank conflicts are a first-order performance limiter for shared-memory-heavy kernels.
- Vector L1 is per-CU, write-through, with 64-byte line granularity and typical 16 KiB size.
- L2 is shared and is the coherence point for GPU memory traffic.
- Vega 7nm ISA documents a 4 MiB shared L2 and CU-array scalar/instruction front-end caches.
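A concrete failure mode implied by the 32-bank × 4-byte organization: 32-bit accesses whose addresses differ by a multiple of 128 bytes land in the same bank, so lanes striding by 32 floats serialize on one bank. This is the same power-of-two pathology measured as `ld=32` in the LDS layout study later in these notes.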
4.2 Interconnect and host link
From AMD 2018 release details:
- Dual Infinity Fabric Links (xGMI) per GPU with stated up to 200 GB/s aggregate P2P bandwidth.
- PCIe Gen4 x16 stated up to 64 GB/s host-device transport peak.
Optimization implication:
- Multi-GPU collectives can benefit significantly when topology actually uses IF links.
- Host-staging and transfer overlap should assume PCIe constraints unless direct GPU-GPU paths are active.
5) ISA/Compiler-Surface Constraints Specific to gfx906
From LLVM AMDGPU usage/reference and per-target assembler docs:
- `gfx906` target IDs are published as `gfx906:sramecc-:xnack-` and `gfx906:sramecc-:xnack+`.
- `sramecc` is not available on `gfx906` in this target model.
- `xnack` is compiler-visible and relevant for demand-paging/page-migration behavior.
- `wavefrontsize64` is the relevant mode for this generation.
- Current LLVM gfx906 assembler syntax docs list `v_dot*` instructions (for example `v_dot2_f32_f16`, `v_dot4_i32_i8`, `v_dot8_i32_i4`).
- `v_mfma*` instructions are not listed on the LLVM gfx906 instruction page; they are listed on newer targets (for example gfx908 docs).
Optimization implication:
- Build artifacts must match the intended XNACK mode (`xnack-` vs `xnack+`) for predictable paging/fault behavior and performance.
- Prefer instruction paths actually listed for gfx906 (`v_dot*` and standard vector paths), and do not assume MFMA availability on MI50/MI60.
6) Deep-Learning Instruction Path (gfx906-safe view)
From LLVM gfx906 assembler docs and AMD launch material:
- gfx906 shows mixed-precision/dot instruction forms (`v_dot*`) in the LLVM assembler syntax docs.
- AMD launch material describes optimized deep-learning operations (DLOPS) for MI50/MI60.
- MFMA (`v_mfma*`) should not be assumed for gfx906 based on current per-target LLVM docs.
Optimization implication:
- For MI50/MI60, prioritize kernels that exploit documented gfx906 dot/mixed-precision instruction paths plus efficient memory behavior.
7) Practical Optimization Baseline Checklist
Use this as the default starting point for kernel tuning on MI50/MI60:
- Target compile: use `--offload-arch=gfx906:xnack-` or `gfx906:xnack+` explicitly (do not leave it ambiguous across environments).
- Launch geometry: use workgroup sizes in multiples of 64. Sweep occupancy-sensitive block sizes while watching VGPR/LDS pressure.
- Register/LDS budget: keep LDS layouts bank-friendly (avoid many lanes hitting the same bank). Track whether VGPR or LDS is the first occupancy limiter.
- Memory behavior: coalesce global accesses for L1/L2 efficiency. Prefer data reuse in LDS where bank conflicts remain controlled.
- Multi-GPU: verify the actual IF-link topology; optimize collectives/partitioning for P2P when present.
- Math path: prefer the gfx906-documented dot/mixed-precision paths (`v_dot*`) and avoid assuming MFMA availability.
8) References (Primary Sources)
- AMD ROCm GPU architecture specs (Instinct table): https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html
- AMD ROCm HIP hardware implementation: https://rocm.docs.amd.com/projects/HIP/en/latest/understand/hardware_implementation.html
- LLVM AMDGPU usage/reference (target features, restrictions, target IDs): https://llvm.org/docs/AMDGPUUsage.html
- LLVM gfx906 instruction syntax (per-target assembler reference): https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
- LLVM gfx908 instruction syntax (contrast target showing MFMA forms): https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX908.html
https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX908.html - AMD IR launch release (Nov 6, 2018), MI60/MI50: https://ir.amd.com/news-events/press-releases/detail/859/amd-unveils-worlds-first-7nm-datacenter-gpus----powering-the-next-era-of-artificial-intelligence-cloud-computing-and-high-performance-computing-hpc
- AMD Vega 7nm Shader ISA PDF: https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf
gfx906 dot4/dot8 Exploration (2026-02-21)
This page documents how dot4/dot8 instructions behave on gfx906 (MI50/MI60 class), including semantics, limits, theoretical TOPS, and real measurements.
1) Instruction Mapping and Semantics
Primary source:
- LLVM AMDGPU usage/reference: https://llvm.org/docs/AMDGPUUsage.html
Mapped intrinsics:
- `llvm.amdgcn.sdot4` -> `v_dot4_i32_i8`
- `llvm.amdgcn.udot4` -> `v_dot4_u32_u8`
- `llvm.amdgcn.sdot8` -> `v_dot8_i32_i4`
- `llvm.amdgcn.udot8` -> `v_dot8_u32_u4`
Semantics:
- dot4 uses two packed `i32` operands that each hold 4x 8-bit values.
- dot8 uses two packed `i32` operands that each hold 8x 4-bit values.
- Both add into a 32-bit accumulator (`src2`).
- The fourth intrinsic operand is a clamp enable (`i1`).
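As a concrete illustration of those semantics, here is a minimal HIP sketch (kernel and buffer names are hypothetical) that performs one sdot4 step per thread:

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// Minimal sketch: one sdot4 step per thread. a_packed/b_packed each hold
// 4x int8 lanes in one i32; the third operand is the 32-bit accumulator
// (src2); the final 'false' leaves clamp disabled, so overflow wraps.
__global__ void sdot4_example(const int32_t* a_packed, const int32_t* b_packed,
                              int32_t* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = __builtin_amdgcn_sdot4(a_packed[i], b_packed[i], /*src2=*/0, false);
}
```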
Per-target syntax confirms availability on gfx906:
- https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
Contrast:
- `v_mfma*` is not listed on the gfx906 syntax page (but appears on gfx908): https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX908.html
2) Codegen Validation on Real Host
Host:
- `fox@192.168.1.28` (ROCm installed)
- `rocminfo`: 4x `gfx906`, each 60 CUs, 1725 MHz
Direct compile test:
```sh
clang -target amdgcn-amd-amdhsa -mcpu=gfx906 -nogpulib -O3 -S
```
Observed lowering:
- `__builtin_amdgcn_sdot4` -> `v_dot4_i32_i8`
- `__builtin_amdgcn_udot4` -> `v_dot4_u32_u8`
- `__builtin_amdgcn_sdot8` -> `v_dot8_i32_i4`
- `__builtin_amdgcn_udot8` -> `v_dot8_u32_u4`
- The clamp flag emits a `... clamp` modifier.
3) Clamp and Overflow Behavior (Measured)
Measured with small HIP kernels on gfx906:
- `sdot4` positive overflow:
  - no clamp: wraps
  - clamp: saturates to `INT_MAX` (`0x7fffffff`)
- `sdot4` negative overflow:
  - no clamp: wraps
  - clamp: saturates to `INT_MIN` (`0x80000000`)
- `udot4` overflow:
  - no clamp: wraps
  - clamp: saturates to `UINT_MAX` (`0xffffffff`)
- `sdot8` overflow-ish case:
  - no clamp: wraps
  - clamp: saturates to `INT_MAX`
Takeaway:
- Accumulator is 32-bit and can overflow.
- Use clamp when saturating behavior is required.
4) Theoretical Throughput (MI50 config from host)
Using measured host-reported config (60 CUs @ 1725 MHz):
- dot4 theoretical: 26.496 TMAC/s = 52.992 TOPS (counting a MAC as 2 ops)
- dot8 theoretical: 52.992 TMAC/s = 105.984 TOPS (counting a MAC as 2 ops)

Formula used:

```text
TMAC/s = CUs * 64 lanes * MACs_per_instruction * clock
TOPS = 2 * TMAC/s
```
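As a sanity check with the host numbers above: for dot4, 60 CUs × 64 lanes × 4 MACs/instruction × 1.725 GHz = 26.496 TMAC/s, i.e. 52.992 TOPS. dot8 doubles MACs_per_instruction to 8 and therefore doubles both figures.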
5) Real Throughput Measurements (All 4 GPUs)
Benchmark A: dependency-chained accumulator
- `blocks=2048`, `threads=256`, `iters=65536`
- Across all 4 cards:
  - `sdot4`: ~21.7 to 22.3 TOPS
  - `udot4`: ~22.25 to 22.63 TOPS
  - `sdot8`: ~43.5 to 44.4 TOPS
  - `udot8`: ~44.5 to 44.6 TOPS
Benchmark B: ILP4 (4 independent accumulators)
- same launch geometry
- Across all 4 cards:
  - `sdot4_ilp4`: ~43.0 to 44.4 TOPS
  - `sdot8_ilp4`: ~85.3 to 86.2 TOPS
Interpretation:
- dot8 is ~2x dot4 throughput in both patterns.
- ILP materially improves achieved throughput by reducing dependency stalls.
- ILP4 results are roughly 81% of the simple theoretical peak.
6) Practical Optimization Guidance
- Use dot8 when quantization/layout supports 4-bit packing; it delivers about 2x dot4 arithmetic density.
- Keep multiple independent accumulators per thread to reduce dependency throttling.
- Track 32-bit accumulator range; enable clamp where saturation is needed.
- On gfx906, optimize around `v_dot*` and memory behavior; do not assume MFMA. A minimal ILP sketch follows below.
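Here is a hedged sketch of the ILP4 accumulator pattern from Benchmark B (kernel and buffer names are hypothetical; the loop body is the part that matters):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// Four independent sdot8 chains per thread: consecutive v_dot8 issues do
// not depend on each other, so the SIMD is not stalled by the accumulator
// dependency of a single chain.
__global__ void sdot8_ilp4(const int32_t* a, const int32_t* b, int32_t* out,
                           int iters) {
    int32_t x = a[threadIdx.x], y = b[threadIdx.x];
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (int i = 0; i < iters; ++i) {
        acc0 = __builtin_amdgcn_sdot8(x, y, acc0, false);
        acc1 = __builtin_amdgcn_sdot8(x, y, acc1, false);
        acc2 = __builtin_amdgcn_sdot8(x, y, acc2, false);
        acc3 = __builtin_amdgcn_sdot8(x, y, acc3, false);
    }
    out[threadIdx.x] = acc0 + acc1 + acc2 + acc3;
}
```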
References
- LLVM AMDGPU usage/reference: https://llvm.org/docs/AMDGPUUsage.html
- LLVM gfx906 instruction syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
- LLVM gfx908 instruction syntax (contrast): https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX908.html
- ROCm GPU architecture specs: https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html
gfx906 Special ISA for Quant/Dequant (MI50/MI60)
This note focuses on gfx906 instructions that are especially useful for quantization/dequantization and data movement pipelines.
Verified instruction support on gfx906
I validated support with `llvm-mc -triple=amdgcn-amd-amdhsa -mcpu=gfx906`.
| Instruction | Status on gfx906 | Why it matters |
|---|---|---|
| `v_dot4_i32_i8` | supported | int8x4 dot-accumulate |
| `v_dot8_i32_i4` | supported | int4x8 dot-accumulate |
| `v_dot2_f32_f16` | supported | fp16x2 dot into fp32 |
| `v_dot4c_i32_i8` | not supported | cannot rely on dot4c lowering |
| `v_dot8c_i32_i4` | not supported | cannot rely on dot8c lowering |
| `v_pack_b32_f16` | supported | pack 2xf16 into one dword |
| `v_cvt_pkrtz_f16_f32` | supported | direct pack+convert f32 -> 2xf16 |
| `v_pk_add_f16` / `v_pk_mul_f16` / `v_pk_fma_f16` | supported | packed fp16 math (2 lanes/op) |
| `v_mov_b32_dpp` | supported | wave-lane rearrange without LDS |
| `ds_bpermute_b32` / `ds_permute_b32` | supported | lane gather/scatter style exchange |
| `v_perm_b32` | supported | byte permutation within registers |
| `v_bfe_i32` | supported | fast nibble/bitfield extraction |
| `v_lshl_or_b32` | supported | pack/insert bits efficiently |
| SDWA forms (`*_sdwa`) | supported | byte/word select in ALU/convert ops |
Complete SDWA variant sweep on gfx906
- I extracted all `v_*_sdwa` mnemonics from the LLVM GFX9 syntax docs (AMDGPUAsmGFX9).
- Total mnemonics found: 239.
- I assembled each mnemonic with `llvm-mc -triple=amdgcn-amd-amdhsa -mcpu=gfx906` using a multi-template operand probe.
- Result: 239/239 assembled successfully, 0 unsupported, 0 unresolved.
- Runtime spot checks on hardware passed for representative SDWA ops: `v_cvt_f32_i32_sdwa`, `v_add_u32_sdwa`.
Interpretation: all documented GFX9 SDWA opcode variants are available on gfx906 at instruction level.
This is opcode-availability coverage, not an exhaustive test of every legal modifier combination.
What the compiler emitted in real qdq kernels
Built and disassembled HIP kernels on real gfx906 (`hipcc -O3 --offload-arch=gfx906 -S`).

- FP32 -> INT8 pack4 path emitted:
  - `v_rndne_f32`, `v_cvt_i32_f32`, `v_med3_i32` (saturating clamp to [-128, 127])
  - `v_lshlrev_b32`, `v_perm_b32`, `v_or3_b32` (packing)
- INT8 unpack + dequant path emitted:
  - `v_cvt_f32_i32_sdwa ... src0_sel:BYTE_{0..3}` (byte extract + sign-extend + convert)
- INT4 unpack + dequant path emitted:
  - `v_bfe_i32` for nibble extraction + sign extension, then `v_cvt_f32_i32`
- Wave shuffle path (`__shfl_xor`) emitted: `ds_bpermute_b32`
- Packed fp16 math path emitted: `v_pk_fma_f16`
- FP32 -> packed fp16 storage path emitted: `v_cvt_f16_f32` + `v_pack_b32_f16`
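For reference, a hedged sketch of the source pattern behind the FP32 -> INT8 pack4 lowering above (the helper name and scale handling are illustrative, not the exact benchmark code):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// Quantize 4 fp32 values to int8 and pack them into one dword. The
// round + clamp + byte-pack steps are what lowered to v_rndne_f32 /
// v_cvt_i32_f32 / v_med3_i32 and the shift/perm/or packing sequence.
__device__ uint32_t quant_pack4(const float x[4], float inv_scale) {
    uint32_t packed = 0;
    for (int i = 0; i < 4; ++i) {
        int q = __float2int_rn(x[i] * inv_scale);   // round-to-nearest-even
        q = min(max(q, -128), 127);                 // saturating clamp
        packed |= (uint32_t)(q & 0xff) << (8 * i);  // byte pack
    }
    return packed;
}
```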
High-value instruction families for qdq work
- Dot instructions (`v_dot4_*`, `v_dot8_*`, `v_dot2_f32_f16`)
  - Use when data is already packed/quantized (or conversion cost is amortized).
- SDWA instructions (`*_sdwa`)
  - Best for byte/word extraction directly inside an ALU/convert op (helps i8 dequant).
- Bitfield/pack ops (`v_bfe_*`, `v_lshl_or_b32`, `v_perm_b32`, shifts/ands)
  - Core tools for nibble/byte unpack and repack (especially int4/int8 layouts).
- Packed fp16 ops (`v_pack_b32_f16`, `v_cvt_pkrtz_f16_f32`, `v_pk_*_f16`)
  - Useful bridge path when dequantizing into fp16 or doing fp16 pre/post transforms.
- Wave data movement (`v_mov_b32_dpp`, `ds_bpermute_b32`, `ds_permute_b32`)
  - Useful for lane remap/reorder without global memory traffic.
Practical limits and caveats
- `dot4c`/`dot8c` are not available on `gfx906`; only use the `dot4`/`dot8` forms.
- `gfx906` dot instructions are available, but `v_mfma*` instructions are not listed for this target.
- SDWA selects byte/word sublanes (`BYTE_0..3`, `WORD_0..1`, `DWORD`), not arbitrary bitfields.
- DPP/DS lane ops are wave-level operations; they are not global cross-wave data movement.
- `clamp` behavior matters for integer dot/arith overflow paths; enable it only when required.
References
- LLVM gfx906 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
- LLVM GFX9 full syntax (instruction families + SDWA/DPP forms): https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX9.html
- LLVM AMDGPU modifier syntax (DPP/SDWA/op_sel/clamp): https://llvm.org/docs/AMDGPUModifierSyntax.html
- LLVM AMDGPU usage (dot intrinsics and lowering notes): https://llvm.org/docs/AMDGPUUsage.html
gfx906 Latency-Hiding Ops (Measured)
This note summarizes instruction patterns on gfx906 (MI50/MI60) that are most useful for hiding latency in quant/dequant-style kernels.
Scope
Focus is on:
- wave-lane exchange (DPP, DS permute)
- LDS width (`b32`/`b64`/`b128`)
- global load width (`dword` vs `dwordx4`)
- scheduling behavior (`s_waitcnt` placement)
All kernels were compiled with `--offload-arch=gfx906` and validated against the emitted ISA.
Key measured findings
1) Use DPP first for row-local shuffles
`v_mov_b32_dpp` row shift (`row_shr:1`) vs an LDS+barrier equivalent:

- `dpp_row_shr`: ~1778 to 1784 Gxchg/s
- `lds_row_shr`: ~906 Gxchg/s
Takeaway: for row-local lane movement, DPP gives about 2x the throughput and removes barrier overhead.
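A hedged sketch of the DPP path follows; the control encodings are the assumption to double-check (0x111 should encode `row_shr:1`, with full row/bank masks):

```cpp
#include <hip/hip_runtime.h>

// row_shr:1 via the clang DPP builtin: each lane reads the lane one below
// it within its row of 16. bound_ctrl=true makes invalid source lanes
// read 0 rather than leaving the destination unchanged.
__global__ void dpp_row_shr1(const int* in, int* out) {
    int v = in[threadIdx.x];
    out[threadIdx.x] = __builtin_amdgcn_mov_dpp(v, 0x111, 0xf, 0xf, true);
}
```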
2) Use ds_bpermute_b32 for general in-wave exchange
XOR-neighbor exchange benchmark:
- `ds_bpermute_b32`: ~962 to 970 Gxchg/s
- LDS store+load+barriers equivalent: ~905 to 907 Gxchg/s
Takeaway: `ds_bpermute_b32` is consistently better than LDS exchange when the shuffle pattern is not DPP-friendly.
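In HIP source this path is usually reached through `__shfl_xor`, which the qdq study above saw lowering to `ds_bpermute_b32`. A minimal sketch (kernel name is hypothetical):

```cpp
#include <hip/hip_runtime.h>

// XOR-neighbor exchange without LDS staging or barriers: each lane swaps
// its value with lane (laneId ^ 1) inside the wave.
__global__ void xor_exchange(const float* in, float* out) {
    float v = in[threadIdx.x];
    float partner = __shfl_xor(v, 1);
    out[threadIdx.x] = v + partner;
}
```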
3) Prefer wide LDS ops for staging
Pure LDS streaming kernels (instruction forms confirmed in ISA):
- `ds_read/write_b32` (l1): typically ~1.9 to 3.9 TB/s
- `ds_read/write_b64` (l2): typically ~4.3 to 8.8 TB/s
- `ds_read/write_b128` (l4): typically ~9.5 to 11.2 TB/s
Takeaway: `b128` LDS accesses are the strongest baseline for LDS-heavy staging paths.
4) Wide global loads help when memory path is healthy
Compiler emits:
- scalar path: `global_load_dword`
- vector path: `global_load_dwordx4`
In uncongested runs, `dwordx4` outperformed scalar (~867-873 GB/s vs ~814 GB/s).
On a shared machine, global-memory numbers varied run-to-run; treat this as directionally positive, not a fixed constant.
Scheduling behavior that matters
In ILP kernels, the compiler issues multiple loads first and delays the waits:
- VMEM: staged `s_waitcnt vmcnt(3..0)`
- LDS: staged `s_waitcnt lgkmcnt(...)`
That pattern is the core latency-hiding mechanism on gfx906: keep multiple memory operations in flight before consuming results.
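A hedged sketch of the source shape that lets the compiler stage waits this way (names and sizes are illustrative):

```cpp
#include <hip/hip_runtime.h>

// Four independent dwordx4 loads are issued before any value is consumed,
// so the compiler can emit staged s_waitcnt vmcnt(3..0) instead of a full
// wait after each load.
__global__ void staged_loads(const float4* in, float* out, int n4) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (base + 3 >= n4) return;
    float4 a = in[base + 0];
    float4 b = in[base + 1];
    float4 c = in[base + 2];
    float4 d = in[base + 3];
    out[base / 4] = a.x + b.y + c.z + d.w;  // consume only after all issues
}
```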
What is not available on gfx906 (relevant to hiding)
Assembler probes on gfx906 rejected:
- `s_clause`
- `s_waitcnt_depctr`
- `s_delay_alu`
So latency control is mainly through ILP/wave occupancy and careful s_waitcnt timing, not newer explicit dependency-control instructions.
Practical checklist
- Row-local shuffle: use `v_mov_b32_dpp`.
- Arbitrary in-wave shuffle: use `ds_bpermute_b32` / `ds_permute_b32`.
- LDS staging: default to `ds_read/write_b128` where alignment allows.
- Global staging: prefer `global_load_dwordx4` for contiguous packed data.
- Structure loops to issue multiple independent loads before first use.
- Avoid immediate waits after each load; let the compiler keep the VMEM/LDS queues populated.
References
- LLVM gfx906 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
- LLVM GFX9 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX9.html
- LLVM AMDGPU modifier syntax: https://llvm.org/docs/AMDGPUModifierSyntax.html
gfx906 LDS Layout Standard for LLM Blocks
This note defines a practical LDS layout standard for gfx906 kernels (GEMM/attention-style tiles) based on measured bank-pressure behavior.
Why this matters
Many LLM kernels read one operand row-wise and another operand effectively column-wise from LDS. On gfx906, column-style accesses can collapse bandwidth if row stride aliases LDS banks.
Measured result (key experiment)
Microbenchmark on real gfx906 using `ds_read_b128`/`ds_write_b128`:
- contiguous vec4 access baseline: ~4257 GB/s
- column-style access with `ld=32` vec4: ~1865 GB/s
- same column-style access with `ld=33` vec4 padding: ~3974 GB/s
Interpretation:
- `ld=32` (power-of-two stride) is a bad default for column-like LDS reads.
- Adding one vec4 of padding per row (`ld=33`) recovers most of the bandwidth.
Instruction forms confirmed
Disassembly for all variants used:
- `ds_write_b128`
- `ds_read_b128`
So the improvement is layout/bank behavior, not a different opcode path.
Layout standard for gfx906
- Use 16-byte vectorized LDS payloads (`uint4` / `float4` / packed int blocks).
- Keep base LDS buffers 16-byte aligned.
- For tiles consumed row-wise only: use the natural row stride.
- For tiles that will be consumed column-wise (or with transposed access), use a padded leading dimension in vec4 units: `ld_vec = logical_ld_vec + 1` (see the sketch after this list).
- Prefer `ds_read/write_b128` staging paths over scalar LDS traffic.
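A minimal sketch of the padded-tile rule (tile size and the transposed read are illustrative; the point is the +1 leading dimension):

```cpp
#include <hip/hip_runtime.h>

constexpr int TILE   = 32;        // logical tile: 32 x 32 float4 elements
constexpr int LD_VEC = TILE + 1;  // padded leading dimension (the "ld=33" case)

// Fill row-wise, read column-wise (transposed). Without the +1 pad the
// column read strides by a power of two and aliases LDS banks.
__global__ void padded_tile_demo(const float4* src, float4* dst) {
    __shared__ float4 tile[TILE][LD_VEC];
    int r = threadIdx.y, c = threadIdx.x;  // assumes blockDim = (32, 32)
    tile[r][c] = src[r * TILE + c];        // ds_write_b128 path
    __syncthreads();
    dst[c * TILE + r] = tile[c][r];        // column-style ds_read_b128
}
```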
Recommended defaults
- A-like operand (row-consumed): no pad needed.
- B-like operand (column-consumed by waves): +1 vec4 pad per row.
- If the LDS budget is tight, test +1 first before more complex swizzles.
Practical formula
If a row has `K_vec` vec4 elements, allocate:

- `stride_vec = K_vec` for row-only reads
- `stride_vec = K_vec + 1` for column-like reuse

The LDS footprint increase is modest (about a `1/K_vec` fractional overhead) and often worth it.
Numbers come from controlled microbenchmarks; final kernel gains depend on occupancy, VMEM pressure, and math mix.
Still, this +1 rule is a strong first choice on gfx906.
References
- LLVM gfx906 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
- LLVM GFX9 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX9.html
gfx906 KV-Cache Read/Write Kernel Study
This note benchmarks KV-cache layouts on gfx906 for decode-like kernels.
Layouts tested
- HSD: `[head][seq][dim]` (dim contiguous inside a sequence position)
- HDS: `[head][dim][seq]` (seq contiguous for each dim lane)
Measured write behavior (new-token update)
Measured on real gfx906 (float cache, dim=128, heads=32, seq=4096):
- `write_hsd_x4`: ~357.6 GB/s
- `write_hsd_x1`: ~357.6 GB/s
- `write_hds_x4`: ~54.4 GB/s
- `write_hds_x1`: ~14.0 GB/s
Takeaway:
- For decode token writes, HSD is dramatically better than HDS.
- HDS writes are highly strided and expensive.
Measured read behavior depends on traversal pattern
A) Dot-style decode traversal (per-seq dot over dim)
Kernel pattern: each block handles one (head, seq) row and threads span dim.
- `read_dot_hsd_x4`: ~1.76 TB/s
- `read_dot_hds_x4`: ~0.37 TB/s
Takeaway:
- For attention-score style decode reads, HSD is the right layout.
B) Dim-fixed streaming over seq
Kernel pattern: each thread keeps fixed dim lane and streams seq.
- `read_hsd_x1`: ~45.0 GB/s
- `read_hsd_x4`: ~41.5 GB/s
- `read_hds_x4`: ~73.7 GB/s
Takeaway:
- If the kernel is explicitly dim-fixed streaming over the sequence, HDS can be better.
ISA mapping confirmed
Disassembly confirms expected vector paths:
- scalar read/write: `global_load_dword`, `global_store_dword`
- vector read/write: `global_load_dwordx4`, `global_store_dwordx4`
Recommended default for LLM decode kernels on gfx906
- Keep the canonical KV layout as HSD (`[head][seq][dim]`).
- Use x4 vectorized loads/stores when naturally aligned.
- Optimize decode math kernels around dot-style traversal (per-seq rows), where HSD is strong; a sketch follows below.
- Only use HDS when a specific kernel is dim-fixed seq-streaming and dominates the runtime.
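A hedged sketch of the dot-style traversal over HSD (kernel shape and the reduction are illustrative; assumes blockDim.x = dim/4 = 32):

```cpp
#include <hip/hip_runtime.h>

constexpr int DIM = 128;  // matches the measured config (dim=128)

// One block per (head, seq) row; threads span dim with float4 loads, so
// the contiguous [dim] axis of HSD maps to global_load_dwordx4.
__global__ void attn_scores_hsd(const float4* __restrict__ k_cache,
                                const float4* __restrict__ q,
                                float* __restrict__ scores, int seq_len) {
    int head = blockIdx.y, seq = blockIdx.x, lane = threadIdx.x;  // lane in 0..31
    const float4* row = k_cache + ((size_t)head * seq_len + seq) * (DIM / 4);

    float4 kv = row[lane];
    float4 qv = q[head * (DIM / 4) + lane];
    float partial = kv.x * qv.x + kv.y * qv.y + kv.z * qv.z + kv.w * qv.w;

    for (int off = 16; off > 0; off >>= 1)  // reduce over the 32 lanes
        partial += __shfl_down(partial, off);
    if (lane == 0) scores[head * seq_len + seq] = partial;
}
```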
Practical implication
For general-purpose LLM kernels, optimizing around HSD gives strong read+write balance.
HDS is a specialized alternative, not a universal default.
References
- LLVM gfx906 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
- LLVM GFX9 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX9.html
FP32 vs Quant-Dequant + Dot on gfx906 (Measured)
Date: 2026-02-21
Hardware: AMD Instinct MI50/MI60 (gfx906), 60 CUs, 1725 MHz
Host: fox@192.168.1.28 (ROCm installed)
Question:
- If activations start in FP32, is it worth quantizing/dequantizing them on gfx906 to compute with dot4/dot2, or is pure FP32 better?
Experiment Setup
Three paths were benchmarked on-device with HIP:
- `pure_fp32`: FP32 values stay FP32; compute via FP32 FMA only.
- `qdq_int8_dot4`: in the kernel hot loop, FP32 activation -> INT8 quantize (pack) -> `__builtin_amdgcn_sdot4` -> dequantize.
- `qdq_fp16_dot2`: in the kernel hot loop, FP32 activation -> FP16 conversion -> `__builtin_amdgcn_fdot2`.
All paths were normalized to the same arithmetic payload per loop iteration (8 MACs/thread/iter), and reported as effective TOPS (counting MAC as 2 ops).
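To make the per-use conversion cost concrete, here is a hedged sketch of one `qdq_int8_dot4` step (the quantization scheme and names are illustrative, not the exact benchmark kernel):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// One hot-loop step: quantize+pack both operands, one sdot4, dequantize.
// The pack/unpack work is paid on every use, which is why this path loses
// to pure FP32 FMA in the hot-loop benchmark.
__device__ float qdq_dot4_step(const float a[4], const float b[4],
                               float scale_a, float scale_b) {
    uint32_t pa = 0, pb = 0;
    for (int i = 0; i < 4; ++i) {
        int qa = min(max(__float2int_rn(a[i] / scale_a), -128), 127);
        int qb = min(max(__float2int_rn(b[i] / scale_b), -128), 127);
        pa |= (uint32_t)(qa & 0xff) << (8 * i);
        pb |= (uint32_t)(qb & 0xff) << (8 * i);
    }
    int32_t acc = __builtin_amdgcn_sdot4((int32_t)pa, (int32_t)pb, 0, false);
    return acc * scale_a * scale_b;  // dequantize back to fp32
}
```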
Core Result (On-the-Fly QDQ in Hot Loop)
Stable best results across cards (after reruns):
- `pure_fp32`: ~5.95 TOPS
- `qdq_fp16_dot2`: ~4.19 TOPS
- `qdq_int8_dot4`: ~2.00 TOPS
Conclusion for this scenario:
- When activations start as FP32 and conversion is done in the hot loop, pure FP32 wins.
- dot2 is slower than FP32; dot4 is much slower than FP32.
Amortized Conversion Check (Conversion Once, Reuse Many Times)
A second benchmark converted once outside the hot loop, then reused converted values:
- `fp32_reuse`: ~13.0 TOPS
- `dot4_reuse`: ~21.7 TOPS
- `dot2_reuse`: ~21.9 TOPS
Interpretation:
- If conversion cost is amortized by reuse (GEMM-like behavior), dot paths can outperform pure FP32.
- If conversion/deconversion is paid every use, they do not.
Practical Recommendation
- For per-use FP32 activations: use pure FP32 on gfx906.
- For high-reuse kernels (where conversion is amortized): dot paths (dot4/dot2) can be worthwhile; optimize for reuse depth before deciding.
- Do not rely on theoretical dot throughput alone: end-to-end cost is dominated by conversion/packing when it is done in the hot path.
Instruction Validation Notes
Codegen validation on gfx906:
- `__builtin_amdgcn_sdot4` -> `v_dot4_i32_i8`
- `__builtin_amdgcn_fdot2` -> `v_dot2_f32_f16`
Related references:
- LLVM AMDGPU usage/reference: https://llvm.org/docs/AMDGPUUsage.html
- LLVM gfx906 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
Contributing
Thank you for considering contributing to the GFX906 Wiki! This guide outlines how you can help improve the documentation.
Easy Contribution
If you want to simply add or edit some markdown files without touching the console, follow this method.
To edit a markdown file on the wiki:
- Open it on GitHub and click the "edit" button. You will be prompted to create a fork if you don't already have one. Do so.
- Make your edits, keeping the markdown clean as outlined in Making Changes.
- Commit, and repeat this process until you are done.
- Submit a pull request to the `main` branch.
To add your own existing markdown files:
- Click the Fork button on the top right of the repository page.
- Click "Add file" on the main page of your fork, then click "Upload files". Make sure your files contain clean markdown (see Making Changes).
- Upload all the files you want to add and click "Commit changes" (see Commit Messages).
- Update the SUMMARY.md file, following the process above, to add links to your files in the sidebar. Please try to respect the existing structure of the sidebar.
- Submit a pull request to the `main` branch.
Local deployment
- Fork the repository: click the Fork button on the top right of the repository page.
- Clone your fork:

  ```sh
  git clone https://github.com/<your-username>/wiki-gfx906.git
  cd wiki-gfx906
  ```

- Create a feature branch:

  ```sh
  git checkout -b <branch-name>
  ```

- Install mdBook (if you haven't already):

  ```sh
  cargo install mdbook
  ```

- Build and preview locally:

  ```sh
  mdbook serve
  ```

  Open http://localhost:3000 in your browser to see your changes live.
Making Changes
- Add or edit content in the `src/` directory.
- Update `SUMMARY.md` to include any new pages you add.
- Keep markdown clean and consistent:
  - Use headings (`#`, `##`, …) to structure sections.
  - Prefer fenced code blocks with language identifiers.
  - Use relative links for internal navigation.
Commit Messages
Write clear, concise commit messages. Follow this format:
```text
<type>: <short description>

<optional longer description>
```

Common `<type>` values:

- `docs`: documentation updates
- `fix`: typo or small correction
- `feat`: new page or major addition
Pull Request Process
- Push your branch to your fork:

  ```sh
  git push origin <branch-name>
  ```

- Open a Pull Request against the `main` branch of the upstream repository.
  - Provide a descriptive title and summary of changes.
  - Link to any relevant issue(s) (e.g., Closes #42).
- Review – maintainers will review your PR. Respond to feedback promptly.
Code of Conduct
We expect all contributors to behave respectfully. Harassment and discrimination of any kind will not be tolerated. See the CODE_OF_CONDUCT.md for details.
License
By contributing you agree that your contributions will be licensed under the same license as the project (see LICENSE).
Happy documenting! 🎉
How to install ROCm 7.x on Linux
The steps outlined here are based on this thread. There is also a video guide for Ubuntu 24.04 LTS here.
Approach
AMD has stopped shipping the tensor files for gfx906 with newer ROCm releases, despite the hardware still being compatible. This is a simple workaround in which we add the missing tensor files ourselves.
ROCm Quick install
- Go to https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html and copy and paste the outlined commands.
- During the installation, you may be prompted to add a key if you have Secure Boot enabled.
- After completing the install, do NOT reboot yet.
Getting the missing tensor files
The missing tensor files can be found in the Arch Linux repository: https://archlinux.org/packages/extra/x86_64/rocblas/. Despite being built for ROCm 6.4, it will work:
- Download the rocblas package: https://archlinux.org/packages/extra/x86_64/rocblas/download/
- Go to the location you downloaded it to and extract it (`unzstd` only decompresses, so unpack the resulting tar as well):

  ```sh
  cd Downloads/ && unzstd rocblas-6.4.4-1-x86_64.pkg.tar.zst && tar -xf rocblas-6.4.4-1-x86_64.pkg.tar
  ```

- There should now be two folders, `opt/` and `usr/`.
- Copy all the files containing the string "gfx906" to `/opt/rocm/lib/rocblas/library` (sudo privileges required):

  ```sh
  sudo cp opt/rocm/lib/rocblas/library/*gfx906* /opt/rocm/lib/rocblas/library
  ```

- Now reboot.
- If you enrolled a key for Secure Boot, you will get a blue screen with some options. Select "Enroll MOK" and type in the password you assigned earlier.
- Check that it worked by running `sudo update-alternatives --display rocm`.
Post-install (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/post-install.html)
- Configure the system linker by specifying where to find the shared objects (.so files) for ROCm applications:
```sh
sudo tee --append /etc/ld.so.conf.d/rocm.conf <<EOF
/opt/rocm/lib
/opt/rocm/lib64
EOF
sudo ldconfig
```
- Add the ROCm paths to your shell profile:

```sh
echo 'export PATH=$PATH:/opt/rocm-7.2.0/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/opt/rocm-7.2.0/lib' >> ~/.bashrc
source ~/.bashrc
```
That's it, enjoy!
How to install ComfyUI on Linux
The following instructions are aimed at Ubuntu 24.04 LTS using ROCm 7.2
If you are using a different distro, I recommend uv with a Python 3.12 virtual environment, i.e. `uv venv venv --python 3.12`.
Install
- Clone the repository:

  ```sh
  git clone https://github.com/comfyanonymous/ComfyUI.git
  ```

- Change to the ComfyUI directory and create a Python virtual environment:

  ```sh
  cd ComfyUI
  python3 -m venv venv
  ```

- Activate the virtual environment:

  ```sh
  source venv/bin/activate
  ```

- Update pip:

  ```sh
  pip install --upgrade pip wheel setuptools
  ```

- Install the PyTorch wheels (you can experiment with different versions for more stability or newer features):

  ```sh
  pip install torch torchvision torchaudio triton --index-url https://download.pytorch.org/whl/rocm7.1
  ```

- Install the requirements for ComfyUI:

  ```sh
  pip install -r requirements.txt
  ```
Verify it
- Run `python3 main.py` to check that it installed properly; you can exit afterwards, as we still have a few more steps.
Creating a script to make it easier to run
- Use your favourite text editor to create the script, with a name you like, in a location you want, e.g.:

  ```sh
  cd ~
  nano run-comfyui.sh
  ```

- Insert the following, adjusting the ROCm environment accordingly:

  ```sh
  #!/bin/bash
  export PATH=$PATH:/opt/rocm-7.2.0/bin
  export LD_LIBRARY_PATH=/opt/rocm-7.2.0/lib
  cd ComfyUI
  source venv/bin/activate
  python3 main.py --use-split-cross-attention --disable-smart-memory --front-end-version Comfy-Org/ComfyUI_frontend@latest
  ```

- Make the script executable:

  ```sh
  chmod +x run-comfyui.sh
  ```
Note:
These are the parameters that seem to work the best for Z-Image Turbo, but more testing is needed - also with other models. You may come across many other environment variables, but I haven't seen any perceivable differences on gfx906.
Feel free to remove `--front-end-version` if you experience problems with the latest version.
Fixing missing ROCm environment paths:
If for any reason you get a missing tensor files error in ComfyUI, please check the "installing_ROCm_7.x" guide to obtain them. If you still encounter the error, the environment is not set properly. You can also manually add the files to ComfyUI:

```sh
sudo cp ~/rocblas7.1/rocblas/*gfx906* ~/ComfyUI/venv/lib/python3.12/site-packages/torch/lib/rocblas/library/
```
(Optional but recommended) Install ComfyUI Manager
```sh
cd ComfyUI/custom_nodes
git clone https://github.com/ltdrdata/ComfyUI-Manager comfyui-manager
```
That's it! You can now run ComfyUI with `./run-comfyui.sh`.
Nightly ROCm builds with gfx90x support (including gfx906): https://therock-nightly-tarball.s3.amazonaws.com