Introduction

Welcome to the GFX906 Wiki! This repository serves as a clean, markdown-based documentation hub for the GFX906 project.

In this book you will find:

  • An overview of the project and its goals
  • Step‑by‑step getting‑started instructions
  • Detailed usage guides and reference material
  • Contribution guidelines for community members

Feel free to explore the chapters listed in the SUMMARY.md file and start learning about GFX906. Happy reading!

Getting Started

Welcome to the GFX906 Wiki! This guide will help you set up the project locally and start exploring the documentation.

Prerequisites

  • Git – to clone the repository.
  • Rust (stable) – required for building mdbook.
  • mdBook – can be installed via cargo install mdbook (the GitHub Actions workflow does this automatically).

Clone the Repository

git clone https://github.com/yourusername/wiki-gfx906.git
cd wiki-gfx906

Build the Book Locally

  1. Install mdbook if you haven't already:

    cargo install mdbook
    
  2. Build the book:

    mdbook build
    

    The generated static site will appear in book/.

  3. Preview it locally:

    mdbook serve
    

    Open http://localhost:3000 in your browser to view the wiki.

Adding New Content

  1. Create a new markdown file in src/, e.g.:

    touch src/new_section.md
    
  2. Edit src/SUMMARY.md to include the new file:

    - [New Section](./new_section.md)
    
  3. Re‑run mdbook build or mdbook serve to see your changes.

Contributing

  • Fork the repository.
  • Create a feature branch.
  • Commit your changes with clear messages.
  • Open a Pull Request targeting main.

All contributions are welcome! See the Contributing page for detailed guidelines.

Deploy to GitHub Pages

The repository includes a GitHub Actions workflow (.github/workflows/mdbook.yml) that automatically builds the book and publishes it to GitHub Pages on every push to main. No manual steps are required.


Happy documenting! 🎉

Usage

This page explains how to work with the GFX906 Wiki using mdBook. It covers building the book locally, previewing it, and adding new content.

Prerequisites

  • Git – to clone the repository.
  • Rust (stable) – required for cargo and mdbook.
  • mdBook – install with:
cargo install mdbook

Building the Book

From the repository root:

mdbook build wiki-gfx906

The generated static site will be placed in wiki-gfx906/book/.

Previewing Locally

Run a local development server that watches for changes:

mdbook serve wiki-gfx906

Open http://localhost:3000 in your browser. The server automatically reloads when you edit markdown files.

Adding New Content

  1. Create a markdown file in wiki-gfx906/src/, e.g.:

    touch wiki-gfx906/src/advanced_topics.md
    
  2. Edit SUMMARY.md to include the new file. For example, add:

    - [Advanced Topics](./advanced_topics.md)
    
  3. Write your content in the new file using standard Markdown syntax.

  4. Rebuild or serve the book to see the changes.

Common Patterns

Code Blocks

fn main() {
    println!("Hello, mdBook!");
}

Inline Code

Use backticks for inline code.

Links

  • Relative link to another chapter: [Getting Started](./getting_started.md)
  • External link: https://github.com/

Contributing

If you want to contribute improvements:

  1. Fork the repository.
  2. Create a feature branch.
  3. Make your changes.
  4. Open a Pull Request targeting main.

All contributions are welcome. See the Contributing page for detailed guidelines.

Deploying to GitHub Pages

The repository includes a GitHub Actions workflow (.github/workflows/mdbook.yml) that automatically:

  1. Installs mdbook.
  2. Builds the book.
  3. Deploys the book/ directory to GitHub Pages on every push to main.

No manual steps are required after the initial setup.


Enjoy writing and reading the GFX906 documentation! 🎉

Reference

This reference provides a quick overview of the most important concepts, commands, and configuration options for the GFX906 wiki built with mdBook.

Project Structure

wiki-gfx906/
├── book/               # Generated static site (output of `mdbook build`)
├── src/                # Source markdown files
│   ├── SUMMARY.md      # Table of contents for the book
│   ├── intro.md
│   ├── getting_started.md
│   ├── usage.md
│   ├── reference.md    # ← This file
│   └── contributing.md
├── book.toml           # mdBook configuration
└── .github/
    └── workflows/
        └── mdbook.yml  # GitHub Actions CI/CD pipeline

mdBook Commands

| Command | Description |
|---|---|
| mdbook build | Compile the book into static HTML files under book/. |
| mdbook serve | Run a local development server (default: http://localhost:3000) that watches for changes. |
| cargo install mdbook | Install the mdbook binary (required for the above commands). |

Configuration (book.toml)

[book]
title = "Wiki GFX906"
author = ["Your Name"]
description = "Documentation and wiki for gfx906 project"
language = "en"
multilingual = false
src = "src"

[output.html]
default-theme = "light"
preferred-dark-theme = "navy"
git-repository-url = "https://github.com/yourusername/wiki-gfx906"
edit-url-template = "https://github.com/yourusername/wiki-gfx906/edit/main/{path}"
  • src – Directory that holds the markdown source files.
  • default-theme / preferred-dark-theme – Control the visual theme of the generated site.
  • edit-url-template – Enables the “Edit on GitHub” link for each page.

GitHub Actions Workflow (.github/workflows/mdbook.yml)

| Step | Purpose |
|---|---|
| actions/checkout@v4 | Checks out the repository. |
| actions/setup-rust@v1 | Installs the stable Rust toolchain. |
| cargo install mdbook | Installs mdbook on the runner. |
| mdbook build wiki-gfx906 | Generates the static site. |
| actions/upload-pages-artifact@v3 | Packages the book/ directory for deployment. |
| actions/deploy-pages@v3 | Publishes the artifact to GitHub Pages. |

The workflow triggers on pushes to main and can also be started manually via workflow_dispatch.

Adding New Content

  1. Create a Markdown file in src/ (e.g., src/new_topic.md).
  2. Add an entry in src/SUMMARY.md:
    - [New Topic](./new_topic.md)
    
  3. Rebuild (mdbook build) or serve (mdbook serve) to see the changes.

Frequently Asked Questions

  • Do I need to run mdbook build on every push?
    No. The GitHub Actions workflow automatically builds and deploys on every push to main.

  • How can I change the theme?
    Edit default-theme and preferred-dark-theme in book.toml and rebuild.

  • Where are the generated files stored?
    In the book/ directory, which is ignored by Git by default.

Contributing Guidelines

Refer to src/contributing.md for the full contribution process, commit style, and code‑of‑conduct.


Happy documenting! 🎉

Studies (2026-02-21)

Collected exploration notes from February 21, 2026:

  • MI50/MI60 architecture baseline
  • dot4/dot8 behavior and limits
  • special ISA useful for quant/dequant
  • latency-hiding operations
  • LDS layout standard for LLM blocks
  • KV-cache read/write layout study
  • FP32 vs QDQ dot study

MI50/MI60 (gfx906) Architecture Baseline

This note captures the highest-value architectural facts I could verify from primary sources for optimization work on:

  • Radeon Instinct MI60 (gfx906)
  • Radeon Instinct MI50 16GB (gfx906)
  • Radeon Instinct MI50 32GB (gfx906) (listed in current ROCm tables)

Data was cross-checked on February 21, 2026.

1) Ground-Truth SKU/Resource Table (ROCm Hardware Specs)

From AMD ROCm’s Instinct architecture table:

| GPU | LLVM target | VRAM (GiB) | CUs | Wavefront | LDS/CU | L2 | L1 Vector | L1 Scalar | L1 I$ | VGPR file | SGPR file | GFXIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MI60 | gfx906 | 32 | 64 | 64 | 64 KiB | 4 MiB | 16 KiB | 16 KiB / 3 CUs | 32 KiB / 3 CUs | 256 KiB | 12.5 KiB | 9.0 |
| MI50 (32GB) | gfx906 | 32 | 60 | 64 | 64 KiB | 4 MiB | 16 KiB | 16 KiB / 3 CUs | 32 KiB / 3 CUs | 256 KiB | 12.5 KiB | 9.0 |
| MI50 (16GB) | gfx906 | 16 | 60 | 64 | 64 KiB | 4 MiB | 16 KiB | 16 KiB / 3 CUs | 32 KiB / 3 CUs | 256 KiB | 12.5 KiB | 9.0 |

Optimization implication:

  • Treat MI60 and MI50 as the same ISA/feature family (gfx906) with CU-count and memory-capacity differences as the main SKU split.

2) Launch-Era Product Capabilities (AMD 2018 IR Release)

From AMD’s MI50/MI60 launch release (Nov 6, 2018):

  • MI60 peak throughput (theoretical): FP16 29.5 TFLOPS, FP32 14.8 TFLOPS, FP64 7.4 TFLOPS.
  • MI50 peak throughput (theoretical): FP16 26.8 TFLOPS, FP32 13.4 TFLOPS, FP64 6.7 TFLOPS.
  • Both boards: 300W envelope.
  • Memory/interconnect claims: up to 1 TB/s HBM2 bandwidth, MI60 at 32GB HBM2 ECC, MI50 at 16GB HBM2 ECC (launch), dual IF links up to 200 GB/s P2P, and PCIe Gen4 x16 up to 64 GB/s host link bandwidth.

Date clarification:

  • The 2018 launch material lists MI50 as 16GB.
  • The current ROCm table includes both MI50 16GB and MI50 32GB entries.

3) Compute-Unit and Scheduling Model (HIP Hardware Docs)

ROCm HIP hardware documentation (GCN-oriented model) highlights:

  • Wavefront model is 64 lanes for this class of architecture.
  • CU execution core is modeled as four SIMD16 vector units.
  • Sequencer organization allows up to 40 resident wavefronts per CU (4 pools x up to 10 each), subject to resource limits.
  • Per-cycle issue model can include one instruction to each SIMD path, plus scalar/branch/LDS paths.
  • Resource constraints that gate occupancy: wave slots, VGPR, SGPR, LDS.

Optimization implication:

  • Occupancy is constrained by register and LDS pressure before nominal wave-slot maxima in many real kernels.
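A rough host-side sketch of how these limits interact, using the resource numbers from the table above (the helper name and the simplification of allocation granularity are mine, so treat the results as upper bounds, not exact occupancy):

```cpp
#include <algorithm>
#include <cstdio>

// Rough gfx906 per-CU occupancy estimate from the resource table above.
// Simplified: allocation granularity is ignored, so results are upper bounds.
int estimateWavesPerCU(int vgprsPerWave, int ldsBytesPerWorkgroup, int wavesPerWorkgroup) {
    const int kSimdsPerCU      = 4;         // four SIMD16 units per CU
    const int kMaxWavesPerSimd = 10;        // up to 10 resident waves per SIMD (40 per CU)
    const int kVgprsPerSimd    = 256;       // 64 KiB VGPR file per SIMD / (64 lanes * 4 B)
    const int kLdsBytesPerCU   = 64 * 1024; // 64 KiB LDS per CU

    int byVgpr = kSimdsPerCU * std::min(kMaxWavesPerSimd, kVgprsPerSimd / vgprsPerWave);
    int byLds  = ldsBytesPerWorkgroup > 0
                     ? (kLdsBytesPerCU / ldsBytesPerWorkgroup) * wavesPerWorkgroup
                     : kSimdsPerCU * kMaxWavesPerSimd;
    return std::min({kSimdsPerCU * kMaxWavesPerSimd, byVgpr, byLds});
}

int main() {
    // Example: 96 VGPRs/wave and 32 KiB LDS per 256-thread (4-wave) workgroup
    // -> limited to 8 waves/CU, well below the 40-wave nominal maximum.
    std::printf("estimated waves/CU = %d\n", estimateWavesPerCU(96, 32 * 1024, 4));
    return 0;
}
```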

4) Memory Hierarchy and Data Movement Facts

4.1 Caches/LDS behavior

From HIP hardware docs and Vega 7nm ISA:

  • LDS is a software-managed on-CU scratchpad with 32 banks, 4-byte bank width.
  • LDS bank conflicts are a first-order performance limiter for shared-memory-heavy kernels.
  • Vector L1 is per-CU, write-through, with 64-byte line granularity and typical 16 KiB size.
  • L2 is shared and is the coherence point for GPU memory traffic.
  • Vega 7nm ISA documents a 4 MiB shared L2 and CU-array scalar/instruction front-end caches.

From AMD 2018 release details:

  • Dual Infinity Fabric Links (xGMI) per GPU with stated up to 200 GB/s aggregate P2P bandwidth.
  • PCIe Gen4 x16 stated up to 64 GB/s host-device transport peak.

Optimization implication:

  • Multi-GPU collectives can benefit significantly when topology actually uses IF links.
  • Host-staging and transfer overlap should assume PCIe constraints unless direct GPU-GPU paths are active.

5) ISA/Compiler-Surface Constraints Specific to gfx906

From LLVM AMDGPU usage/reference and per-target assembler docs:

  • gfx906 target IDs are published as:
    • gfx906:sramecc-:xnack-
    • gfx906:sramecc-:xnack+
  • sramecc not available on gfx906 in this target model.
  • xnack is compiler-visible and relevant for demand-paging/page-migration behavior.
  • wavefrontsize64 is the relevant mode for this generation.
  • Current LLVM gfx906 assembler syntax docs list v_dot* instructions (for example v_dot2_f32_f16, v_dot4_i32_i8, v_dot8_i32_i4).
  • v_mfma* instructions are not listed on the LLVM gfx906 instruction page; they are listed on newer targets (for example gfx908 docs).

Optimization implication:

  • Build artifacts must match the intended XNACK mode (xnack- vs xnack+) for predictable paging/fault behavior and performance.
  • Prefer instruction paths actually listed for gfx906 (v_dot* and standard vector paths), and do not assume MFMA availability on MI50/MI60.

6) Deep-Learning Instruction Path (gfx906-safe view)

From LLVM gfx906 assembler docs and AMD launch material:

  • gfx906 shows mixed-precision/dot instruction forms (v_dot*) in LLVM assembler syntax docs.
  • AMD launch material describes optimized deep-learning operations (DLOPS) for MI50/MI60.
  • MFMA (v_mfma*) should not be assumed for gfx906 based on current per-target LLVM docs.

Optimization implication:

  • For MI50/MI60, prioritize kernels that exploit documented gfx906 dot/mixed-precision instruction paths plus efficient memory behavior.

7) Practical Optimization Baseline Checklist

Use this as the default starting point for kernel tuning on MI50/MI60:

  1. Target compile: Use --offload-arch=gfx906:xnack- or gfx906:xnack+ explicitly (do not leave ambiguous across environments).
  2. Launch geometry: Workgroup sizes in multiples of 64. Sweep occupancy-sensitive block sizes while watching VGPR/LDS pressure.
  3. Register/LDS budget: Keep LDS layouts bank-friendly (avoid many lanes hitting same bank). Track whether VGPR or LDS is the first occupancy limiter.
  4. Memory behavior: Coalesce global accesses for L1/L2 efficiency. Prefer data reuse in LDS where bank conflicts remain controlled.
  5. Multi-GPU: Verify actual IF-link topology; optimize collectives/partitioning for P2P when present.
  6. Math path: Prefer gfx906-documented dot/mixed-precision paths (v_dot*) and avoid assuming MFMA availability.

8) References (Primary Sources)

  • AMD ROCm GPU architecture specs (Instinct table):
    https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html
  • AMD ROCm HIP hardware implementation:
    https://rocm.docs.amd.com/projects/HIP/en/latest/understand/hardware_implementation.html
  • LLVM AMDGPU usage/reference (target features, restrictions, target IDs):
    https://llvm.org/docs/AMDGPUUsage.html
  • LLVM gfx906 instruction syntax (per-target assembler reference):
    https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
  • LLVM gfx908 instruction syntax (contrast target showing MFMA forms):
    https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX908.html
  • AMD IR launch release (Nov 6, 2018), MI60/MI50: https://ir.amd.com/news-events/press-releases/detail/859/amd-unveils-worlds-first-7nm-datacenter-gpus----powering-the-next-era-of-artificial-intelligence-cloud-computing-and-high-performance-computing-hpc
  • AMD Vega 7nm Shader ISA PDF: https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf

gfx906 dot4/dot8 Exploration (2026-02-21)

This page documents how dot4/dot8 instructions behave on gfx906 (MI50/MI60 class), including semantics, limits, theoretical TOPS, and real measurements.

1) Instruction Mapping and Semantics

Primary source:

  • LLVM AMDGPU usage/reference: https://llvm.org/docs/AMDGPUUsage.html

Mapped intrinsics:

  • llvm.amdgcn.sdot4 -> v_dot4_i32_i8
  • llvm.amdgcn.udot4 -> v_dot4_u32_u8
  • llvm.amdgcn.sdot8 -> v_dot8_i32_i4
  • llvm.amdgcn.udot8 -> v_dot8_u32_u4

Semantics:

  • dot4 uses two packed i32 operands that each hold 4x8-bit values.
  • dot8 uses two packed i32 operands that each hold 8x4-bit values.
  • Both add into a 32-bit accumulator (src2).
  • Fourth intrinsic operand is clamp enable (i1).
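A minimal HIP sketch of these semantics (kernel name and data layout are illustrative; the clamp operand must be a compile-time constant, left disabled here for wrapping behavior):

```cpp
#include <hip/hip_runtime.h>

// dot4: each i32 operand packs 4 signed 8-bit values; the sum of products is
// added to the 32-bit accumulator src2. The last operand is the clamp enable.
__global__ void sdot4_demo(const int* __restrict__ a_packed,
                           const int* __restrict__ b_packed,
                           int* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int acc = 0;
    acc = __builtin_amdgcn_sdot4(a_packed[i], b_packed[i], acc, false);  // wrapping accumulate
    out[i] = acc;
}
```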

Per-target syntax confirms availability on gfx906:

  • https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html

Contrast:

  • v_mfma* is not listed on gfx906 syntax page (but appears on gfx908):
    https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX908.html

2) Codegen Validation on Real Host

Host:

  • fox@192.168.1.28 (ROCm installed)
  • rocminfo: 4x gfx906, each 60 CUs, 1725 MHz

Direct compile test:

  • clang -target amdgcn-amd-amdhsa -mcpu=gfx906 -nogpulib -O3 -S

Observed lowering:

  • __builtin_amdgcn_sdot4 -> v_dot4_i32_i8
  • __builtin_amdgcn_udot4 -> v_dot4_u32_u8
  • __builtin_amdgcn_sdot8 -> v_dot8_i32_i4
  • __builtin_amdgcn_udot8 -> v_dot8_u32_u4
  • Setting the clamp flag emits the instruction with the clamp modifier.

3) Clamp and Overflow Behavior (Measured)

Measured with small HIP kernels on gfx906:

  • sdot4 positive overflow:
    • no clamp: wraps
    • clamp: saturates to INT_MAX (0x7fffffff)
  • sdot4 negative overflow:
    • no clamp: wraps
    • clamp: saturates to INT_MIN (0x80000000)
  • udot4 overflow:
    • no clamp: wraps
    • clamp: saturates to UINT_MAX (0xffffffff)
  • sdot8 overflow-ish case:
    • no clamp: wraps
    • clamp: saturates to INT_MAX

Takeaway:

  • Accumulator is 32-bit and can overflow.
  • Use clamp when saturating behavior is required.

4) Theoretical Throughput (MI50 config from host)

Using measured host-reported config (60 CUs @ 1725 MHz):

  • dot4 theoretical:
    • 26.496 TMAC/s
    • 52.992 TOPS (counting MAC as 2 ops)
  • dot8 theoretical:
    • 52.992 TMAC/s
    • 105.984 TOPS (counting MAC as 2 ops)

Formula used:

  • TMAC/s = CU * 64 lanes * MACs_per_instruction * clock
  • TOPS = 2 * TMAC/s
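Worked out for the dot4 case with the host-reported configuration (60 CUs, 64 lanes, 4 MACs per instruction, 1.725 GHz):

```latex
\mathrm{TMAC/s} = 60 \times 64 \times 4 \times 1.725\,\mathrm{GHz} = 26.496\ \mathrm{TMAC/s},
\qquad
\mathrm{TOPS} = 2 \times 26.496 = 52.992
```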

5) Real Throughput Measurements (All 4 GPUs)

Benchmark A: dependency-chained accumulator

  • blocks=2048, threads=256, iters=65536
  • Across all 4 cards:
    • sdot4: ~21.7 to 22.3 TOPS
    • udot4: ~22.25 to 22.63 TOPS
    • sdot8: ~43.5 to 44.4 TOPS
    • udot8: ~44.5 to 44.6 TOPS

Benchmark B: ILP4 (4 independent accumulators)

  • same launch geometry
  • Across all 4 cards:
    • sdot4_ilp4: ~43.0 to 44.4 TOPS
    • sdot8_ilp4: ~85.3 to 86.2 TOPS

Interpretation:

  • dot8 is ~2x dot4 throughput in both patterns.
  • ILP materially improves achieved throughput by reducing dependency stalls.
  • ILP4 results are roughly 81% of the simple theoretical peak.

6) Practical Optimization Guidance

  1. Use dot8 when quantization/layout supports 4-bit packing; it delivers about 2x dot4 arithmetic density.
  2. Keep multiple independent accumulators per thread to reduce dependency throttling.
  3. Track 32-bit accumulator range; enable clamp where saturation is needed.
  4. On gfx906, optimize around v_dot* and memory behavior; do not assume MFMA.
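A minimal HIP sketch of the independent-accumulator pattern from item 2 (illustrative kernel, not the benchmark source):

```cpp
#include <hip/hip_runtime.h>

// Four independent accumulators keep the dot pipeline busy instead of stalling
// on a single dependency chain. Each i32 operand of sdot8 packs 8 signed 4-bit values.
__global__ void sdot8_ilp4(const int* __restrict__ a, const int* __restrict__ b,
                           int* __restrict__ out, int n_packed) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i + 3 >= n_packed) return;
    int acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    acc0 = __builtin_amdgcn_sdot8(a[i + 0], b[i + 0], acc0, false);
    acc1 = __builtin_amdgcn_sdot8(a[i + 1], b[i + 1], acc1, false);
    acc2 = __builtin_amdgcn_sdot8(a[i + 2], b[i + 2], acc2, false);
    acc3 = __builtin_amdgcn_sdot8(a[i + 3], b[i + 3], acc3, false);
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc0 + acc1 + acc2 + acc3;
}
```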

References

  • LLVM AMDGPU usage/reference:
    https://llvm.org/docs/AMDGPUUsage.html
  • LLVM gfx906 instruction syntax:
    https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html
  • LLVM gfx908 instruction syntax (contrast):
    https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX908.html
  • ROCm GPU architecture specs:
    https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html

gfx906 Special ISA for Quant/Dequant (MI50/MI60)

This note focuses on gfx906 instructions that are especially useful for quantization/dequantization and data movement pipelines.

Verified instruction support on gfx906

I validated support with llvm-mc -triple=amdgcn-amd-amdhsa -mcpu=gfx906.

| Instruction | Status on gfx906 | Why it matters |
|---|---|---|
| v_dot4_i32_i8 | supported | int8x4 dot-accumulate |
| v_dot8_i32_i4 | supported | int4x8 dot-accumulate |
| v_dot2_f32_f16 | supported | fp16x2 dot into fp32 |
| v_dot4c_i32_i8 | not supported | cannot rely on dot4c lowering |
| v_dot8c_i32_i4 | not supported | cannot rely on dot8c lowering |
| v_pack_b32_f16 | supported | pack 2xf16 into one dword |
| v_cvt_pkrtz_f16_f32 | supported | direct pack+convert f32->2xf16 |
| v_pk_add_f16 / v_pk_mul_f16 / v_pk_fma_f16 | supported | packed fp16 math (2 lanes/op) |
| v_mov_b32_dpp | supported | wave-lane rearrange without LDS |
| ds_bpermute_b32 / ds_permute_b32 | supported | lane gather/scatter style exchange |
| v_perm_b32 | supported | byte permutation within registers |
| v_bfe_i32 | supported | fast nibble/bitfield extraction |
| v_lshl_or_b32 | supported | pack/insert bits efficiently |
| SDWA forms (*_sdwa) | supported | byte/word select in ALU/convert ops |

Complete SDWA variant sweep on gfx906

  • I extracted all v_*_sdwa mnemonics from LLVM GFX9 syntax docs (AMDGPUAsmGFX9).
  • Total mnemonics found: 239.
  • I assembled each mnemonic with llvm-mc -triple=amdgcn-amd-amdhsa -mcpu=gfx906 using a multi-template operand probe.
  • Result: 239/239 assembled successfully, 0 unsupported, 0 unresolved.
  • Runtime spot checks on hardware passed for representative SDWA ops:
    • v_cvt_f32_i32_sdwa
    • v_add_u32_sdwa

Interpretation: all documented GFX9 SDWA opcode variants are available on gfx906 at instruction level. This is opcode-availability coverage, not an exhaustive test of every legal modifier combination.

What the compiler emitted in real qdq kernels

Built and disassembled HIP kernels on real gfx906 (hipcc -O3 --offload-arch=gfx906 -S).

  • FP32 -> INT8 pack4 path emitted:
    • v_rndne_f32, v_cvt_i32_f32, v_med3_i32 (saturating clamp to [-128,127])
    • v_lshlrev_b32, v_perm_b32, v_or3_b32 (packing)
  • INT8 unpack + dequant path emitted:
    • v_cvt_f32_i32_sdwa ... src0_sel:BYTE_{0..3} (byte extract + sign-extend + convert)
  • INT4 unpack + dequant path emitted:
    • v_bfe_i32 for nibble extraction + sign extension, then v_cvt_f32_i32
  • Wave shuffle path (__shfl_xor) emitted:
    • ds_bpermute_b32
  • Packed fp16 math path emitted:
    • v_pk_fma_f16
  • FP32 -> packed fp16 storage path emitted:
    • v_cvt_f16_f32 + v_pack_b32_f16
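For context, a HIP-level sketch of the kind of int8 pack/unpack source that tends to produce this style of codegen (helper names and the fixed-scale scheme are illustrative, not the exact kernels disassembled above):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// Quantize 4 floats to int8 and pack into one dword; the compiler can lower the
// clamp to v_med3_i32 and the packing to shift/perm/or sequences.
__device__ inline int pack4_int8(float4 v, float inv_scale) {
    int q0 = max(-128, min(127, __float2int_rn(v.x * inv_scale)));
    int q1 = max(-128, min(127, __float2int_rn(v.y * inv_scale)));
    int q2 = max(-128, min(127, __float2int_rn(v.z * inv_scale)));
    int q3 = max(-128, min(127, __float2int_rn(v.w * inv_scale)));
    return (q0 & 0xff) | ((q1 & 0xff) << 8) | ((q2 & 0xff) << 16) | ((q3 & 0xff) << 24);
}

// Reverse path: byte extract + sign-extend + convert, typically lowered to
// v_cvt_f32_i32_sdwa byte selects on gfx906.
__device__ inline float4 unpack4_dequant(int packed, float scale) {
    float4 r;
    r.x = float(int8_t(packed      )) * scale;  // byte 0
    r.y = float(int8_t(packed >>  8)) * scale;  // byte 1
    r.z = float(int8_t(packed >> 16)) * scale;  // byte 2
    r.w = float(int8_t(packed >> 24)) * scale;  // byte 3
    return r;
}
```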

High-value instruction families for qdq work

  1. Dot instructions (v_dot4_*, v_dot8_*, v_dot2_f32_f16)
  • Use when data is already packed/quantized (or conversion cost is amortized).
  2. SDWA instructions (*_sdwa)
  • Best for byte/word extraction directly inside an ALU/convert op (helps i8 dequant).
  3. Bitfield/pack ops (v_bfe_*, v_lshl_or_b32, v_perm_b32, shifts/ands)
  • Core tools for nibble/byte unpack and repack (especially int4/int8 layouts).
  4. Packed fp16 ops (v_pack_b32_f16, v_cvt_pkrtz_f16_f32, v_pk_*_f16)
  • Useful bridge path when dequantizing into fp16 or doing fp16 pre/post transforms.
  5. Wave data movement (v_mov_b32_dpp, ds_bpermute_b32, ds_permute_b32)
  • Useful for lane remap/reorder without global memory traffic.

Practical limits and caveats

  • dot4c/dot8c are not available on gfx906; only use dot4/dot8 forms.
  • gfx906 dot instructions are available, but v_mfma* instructions are not listed for this target.
  • SDWA selects byte/word sublanes (BYTE_0..3, WORD_0..1, DWORD), not arbitrary bitfields.
  • DPP/DS lane ops are wave-level operations; they are not global cross-wave data movement.
  • clamp behavior matters for integer dot/arith overflow paths; enable only when required.


gfx906 Latency-Hiding Ops (Measured)

This note summarizes instruction patterns on gfx906 (MI50/MI60) that are most useful for hiding latency in quant/dequant-style kernels.

Scope

Focus is on:

  • wave-lane exchange (DPP, DS permute)
  • LDS width (b32/b64/b128)
  • global load width (dword vs dwordx4)
  • scheduling behavior (s_waitcnt placement)

All kernels were compiled for --offload-arch=gfx906 and validated with emitted ISA.

Key measured findings

1) Use DPP first for row-local shuffles

v_mov_b32_dpp row shift (row_shr:1) vs LDS+barrier equivalent:

  • dpp_row_shr: ~1778 to 1784 Gxchg/s
  • lds_row_shr: ~906 Gxchg/s

Takeaway: for row-local lane movement, DPP gives about 2x the throughput and removes barrier overhead.

2) Use ds_bpermute_b32 for general in-wave exchange

XOR-neighbor exchange benchmark:

  • ds_bpermute_b32: ~962 to 970 Gxchg/s
  • LDS store+load+barriers equivalent: ~905 to 907 Gxchg/s

Takeaway: ds_bpermute_b32 is consistently better than LDS exchange when shuffle pattern is not DPP-friendly.
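In HIP source this path is normally reached through the warp shuffle intrinsics rather than hand-written DS ops; a minimal sketch of the XOR-neighbor exchange (kernel name illustrative):

```cpp
#include <hip/hip_runtime.h>

// Butterfly (XOR-neighbor) exchange within a 64-lane wave. On gfx906 the
// __shfl_xor intrinsic lowers to ds_bpermute_b32 (see the qdq codegen notes).
__global__ void xor_neighbor_sum(const float* __restrict__ in,
                                 float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    // Exchange with the lane whose ID differs in bit 0, without touching LDS.
    float partner = __shfl_xor(v, 1, 64);
    out[i] = v + partner;
}
```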

3) Prefer wide LDS ops for staging

Pure LDS streaming kernels (instruction forms confirmed in ISA):

  • ds_read/write_b32 (l1): typically ~1.9 to 3.9 TB/s
  • ds_read/write_b64 (l2): typically ~4.3 to 8.8 TB/s
  • ds_read/write_b128 (l4): typically ~9.5 to 11.2 TB/s

Takeaway: b128 LDS accesses are the strongest baseline for LDS-heavy staging paths.
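A minimal staging sketch: expressing the LDS tile in 16-byte elements is usually enough for the compiler to emit ds_write_b128/ds_read_b128, assuming aligned indexing (illustrative only):

```cpp
#include <hip/hip_runtime.h>

// float4 LDS traffic typically lowers to ds_write_b128 / ds_read_b128 on gfx906.
__global__ void lds_stage_b128(const float4* __restrict__ in,
                               float4* __restrict__ out) {
    __shared__ float4 tile[256];                 // 4 KiB of LDS per block
    int t = threadIdx.x;                         // launched with 256 threads per block
    tile[t] = in[blockIdx.x * blockDim.x + t];   // 16-byte LDS write
    __syncthreads();
    out[blockIdx.x * blockDim.x + t] = tile[t];  // 16-byte LDS read
}
```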

4) Wide global loads help when memory path is healthy

Compiler emits:

  • scalar path: global_load_dword
  • vector path: global_load_dwordx4

In uncongested runs, dwordx4 outperformed scalar (~867-873 GB/s vs ~814 GB/s). On a shared machine, global-memory numbers varied run-to-run; treat this as directionally positive, not a fixed constant.

Scheduling behavior that matters

In ILP kernels, compiler issues multiple loads first and delays waits:

  • VMEM: staged s_waitcnt vmcnt(3..0)
  • LDS: staged s_waitcnt lgkmcnt(...)

That pattern is the core latency-hiding mechanism on gfx906: keep multiple memory operations in flight before consuming results.

What is not available on gfx906 (relevant to hiding)

Assembler probes on gfx906 rejected:

  • s_clause
  • s_waitcnt_depctr
  • s_delay_alu

So latency control is mainly through ILP/wave occupancy and careful s_waitcnt timing, not newer explicit dependency-control instructions.

Practical checklist

  1. Row-local shuffle: use v_mov_b32_dpp.
  2. Arbitrary in-wave shuffle: use ds_bpermute_b32 / ds_permute_b32.
  3. LDS staging: default to ds_read/write_b128 where alignment allows.
  4. Global staging: prefer global_load_dwordx4 for contiguous packed data.
  5. Structure loops to issue multiple independent loads before first use.
  6. Avoid immediate waits after each load; let compiler keep VMEM/LDS queues populated.
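A sketch combining items 4-6: independent dwordx4 loads are issued back to back so the compiler can stage s_waitcnt rather than waiting after each load (illustrative kernel, not a tuned one):

```cpp
#include <hip/hip_runtime.h>

// Issue several independent global loads before the first use, keeping multiple
// VMEM operations in flight (staged s_waitcnt vmcnt(...)).
__global__ void sum4_pipelined(const float4* __restrict__ in,
                               float* __restrict__ out, int n_vec4) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i + 3 >= n_vec4) return;
    // Four independent global_load_dwordx4 issued back to back.
    float4 a = in[i + 0];
    float4 b = in[i + 1];
    float4 c = in[i + 2];
    float4 d = in[i + 3];
    // First use happens only after all loads have been issued.
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        (a.x + a.y + a.z + a.w) + (b.x + b.y + b.z + b.w) +
        (c.x + c.y + c.z + c.w) + (d.x + d.y + d.z + d.w);
}
```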


gfx906 LDS Layout Standard for LLM Blocks

This note defines a practical LDS layout standard for gfx906 kernels (GEMM/attention-style tiles) based on measured bank-pressure behavior.

Why this matters

Many LLM kernels read one operand row-wise and another operand effectively column-wise from LDS. On gfx906, column-style accesses can collapse bandwidth if row stride aliases LDS banks.

Measured result (key experiment)

Microbenchmark on real gfx906 using ds_read_b128/ds_write_b128:

  • contiguous vec4 access baseline: ~4257 GB/s
  • column-style access with ld=32 vec4: ~1865 GB/s
  • same column-style access with ld=33 vec4 padding: ~3974 GB/s

Interpretation:

  • ld=32 (power-of-two stride) is a bad default for column-like LDS reads.
  • adding one vec4 of padding per row (ld=33) recovers most bandwidth.

Instruction forms confirmed

Disassembly for all variants used:

  • ds_write_b128
  • ds_read_b128

So the improvement is layout/bank behavior, not a different opcode path.

Layout standard for gfx906

  1. Use 16-byte vectorized LDS payloads (uint4/float4/packed int blocks).
  2. Keep base LDS buffers 16-byte aligned.
  3. For tiles consumed row-wise only: use natural row stride.
  4. For tiles that will be consumed column-wise (or transposed access), use padded leading dimension in vec4 units: ld_vec = logical_ld_vec + 1.
  5. Prefer ds_read/write_b128 staging paths over scalar LDS traffic.
  • A-like operand (row-consumed): no pad needed.
  • B-like operand (column-consumed by waves): +1 vec4 pad per row.
  • If LDS budget is tight, test +1 first before more complex swizzles.

Practical formula

If a row has K_vec vec4 elements, allocate:

  • stride_vec = K_vec for row-only reads
  • stride_vec = K_vec + 1 for column-like reuse

LDS footprint increase is modest (~1/K_vec fractional overhead) and often worth it.
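A sketch of the +1 rule applied to a column-consumed (B-like) square tile; sizes, names, and indexing are illustrative rather than a tuned kernel:

```cpp
#include <hip/hip_runtime.h>

constexpr int ROWS   = 32;            // tile rows
constexpr int K_VEC  = 32;            // logical row length in float4 units
constexpr int LD_VEC = K_VEC + 1;     // padded leading dimension (the "+1 rule")

// Row-wise fill, column-wise (transposed) consumption. The extra vec4 per row
// breaks the ld=32 power-of-two bank aliasing measured above.
__global__ void padded_tile_demo(const float4* __restrict__ in,
                                 float4* __restrict__ out) {
    __shared__ float4 tileB[ROWS * LD_VEC];      // 32 * 33 * 16 B ~= 16.5 KiB
    int t   = threadIdx.x;                       // launched with ROWS * K_VEC = 1024 threads
    int row = t / K_VEC;
    int col = t % K_VEC;
    tileB[row * LD_VEC + col] = in[blockIdx.x * ROWS * K_VEC + t];  // ds_write_b128
    __syncthreads();
    // Column-style read: consecutive lanes hit addresses LD_VEC apart, so the
    // padded stride avoids the bank aliasing seen with ld=32.
    out[blockIdx.x * ROWS * K_VEC + t] = tileB[col * LD_VEC + row]; // ds_read_b128
}
```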

Caveat

Numbers come from controlled microbenchmarks; final kernel gains depend on occupancy, VMEM pressure, and math mix. Still, this +1 rule is a strong first choice on gfx906.


gfx906 KV-Cache Read/Write Kernel Study

This note benchmarks KV-cache layouts on gfx906 for decode-like kernels.

Layouts tested

  • HSD: [head][seq][dim] (dim contiguous inside sequence position)
  • HDS: [head][dim][seq] (seq contiguous for each dim lane)

Measured write behavior (new-token update)

Measured on real gfx906 (float cache, dim=128, heads=32, seq=4096):

  • write_hsd_x4: ~357.6 GB/s
  • write_hsd_x1: ~357.6 GB/s
  • write_hds_x4: ~54.4 GB/s
  • write_hds_x1: ~14.0 GB/s

Takeaway:

  • For decode token writes, HSD is dramatically better than HDS.
  • HDS writes are highly strided and expensive.

Measured read behavior depends on traversal pattern

A) Dot-style decode traversal (per-seq dot over dim)

Kernel pattern: each block handles one (head, seq) row and threads span dim.

  • read_dot_hsd_x4: ~1.76 TB/s
  • read_dot_hds_x4: ~0.37 TB/s

Takeaway:

  • For attention-score style decode reads, HSD is the right layout.

B) Dim-fixed streaming over seq

Kernel pattern: each thread keeps fixed dim lane and streams seq.

  • read_hsd_x1: ~45.0 GB/s
  • read_hsd_x4: ~41.5 GB/s
  • read_hds_x4: ~73.7 GB/s

Takeaway:

  • If the kernel is explicitly dim-fixed streaming over sequence, HDS can be better.

ISA mapping confirmed

Disassembly confirms expected vector paths:

  • scalar read/write: global_load_dword, global_store_dword
  • vector read/write: global_load_dwordx4, global_store_dwordx4

Recommended defaults

  1. Keep the canonical KV layout as HSD ([head][seq][dim]).
  2. Use x4 vectorized loads/stores when naturally aligned.
  3. Optimize decode math kernels around dot-style traversal (per-seq rows), where HSD is strong.
  4. Only use HDS when a specific kernel is dim-fixed seq-streaming and dominates runtime.
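A sketch of the HSD indexing and the decode-time token write these recommendations assume (names and shapes are illustrative):

```cpp
#include <hip/hip_runtime.h>

// HSD layout: [head][seq][dim], with dim contiguous in memory.
__device__ __forceinline__ size_t kv_index_hsd(int head, int seq, int dim,
                                               int max_seq, int head_dim) {
    return (size_t(head) * max_seq + seq) * head_dim + dim;
}

// Decode-time write of one new token's K vector: consecutive threads write
// consecutive dims, so the stores coalesce (and can vectorize to dwordx4).
__global__ void write_new_token_hsd(float* __restrict__ k_cache,
                                    const float* __restrict__ k_new,
                                    int seq_pos, int max_seq, int head_dim) {
    int head = blockIdx.x;        // one block per head
    int dim  = threadIdx.x;       // head_dim threads per block
    k_cache[kv_index_hsd(head, seq_pos, dim, max_seq, head_dim)] =
        k_new[head * head_dim + dim];
}
```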

Practical implication

For general-purpose LLM kernels, optimizing around HSD gives strong read+write balance. HDS is a specialized alternative, not a universal default.


FP32 vs Quant-Dequant + Dot on gfx906 (Measured)

Date: 2026-02-21
Hardware: AMD Instinct MI50/MI60 (gfx906), 60 CUs, 1725 MHz
Host: fox@192.168.1.28 (ROCm installed)

Question:

  • If activations start in FP32, is it worth quantizing/dequantizing them on gfx906 to compute with dot4/dot2, or is pure FP32 better?

Experiment Setup

Three paths were benchmarked on-device with HIP:

  1. pure_fp32
  • FP32 values stay FP32.
  • Compute via FP32 FMA only.
  2. qdq_int8_dot4
  • In the kernel hot loop: FP32 activation -> INT8 quantize (pack) -> __builtin_amdgcn_sdot4 -> dequantize.
  3. qdq_fp16_dot2
  • In the kernel hot loop: FP32 activation -> FP16 conversion -> __builtin_amdgcn_fdot2.
All paths were normalized to the same arithmetic payload per loop iteration (8 MACs/thread/iter), and reported as effective TOPS (counting MAC as 2 ops).
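To make the contrast concrete, a sketch of the per-iteration work in the pure FP32 and int8-QDQ paths (simplified: fixed scales, no data reuse; not the benchmark source):

```cpp
#include <hip/hip_runtime.h>

// pure_fp32: one FMA per MAC, no conversions.
__device__ inline float fp32_step(float a, float b, float acc) {
    return fmaf(a, b, acc);
}

// qdq_int8_dot4: quantize + pack both operands, dot, dequantize -- every iteration.
__device__ inline float qdq_int8_step(float4 a, float4 b, float acc,
                                      float scale_a, float scale_b) {
    int pa = 0, pb = 0;
    const float ia = 1.0f / scale_a, ib = 1.0f / scale_b;
    float av[4] = {a.x, a.y, a.z, a.w};
    float bv[4] = {b.x, b.y, b.z, b.w};
    for (int k = 0; k < 4; ++k) {
        pa |= (max(-128, min(127, __float2int_rn(av[k] * ia))) & 0xff) << (8 * k);
        pb |= (max(-128, min(127, __float2int_rn(bv[k] * ib))) & 0xff) << (8 * k);
    }
    int dot = __builtin_amdgcn_sdot4(pa, pb, 0, false);  // 4 int8 MACs
    return acc + float(dot) * scale_a * scale_b;          // dequantize result
}
```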

Core Result (On-the-Fly QDQ in Hot Loop)

Stable best results across cards (after reruns):

  • pure_fp32: ~5.95 TOPS
  • qdq_fp16_dot2: ~4.19 TOPS
  • qdq_int8_dot4: ~2.00 TOPS

Conclusion for this scenario:

  • When activations start as FP32 and conversion is done in the hot loop, pure FP32 wins.
  • dot2 is slower than FP32.
  • dot4 is much slower than FP32.

Amortized Conversion Check (Conversion Once, Reuse Many Times)

A second benchmark converted once outside the hot loop, then reused converted values:

  • fp32_reuse: ~13.0 TOPS
  • dot4_reuse: ~21.7 TOPS
  • dot2_reuse: ~21.9 TOPS

Interpretation:

  • If conversion cost is amortized by reuse (GEMM-like behavior), dot paths can outperform pure FP32.
  • If conversion/deconversion is paid every use, they do not.

Practical Recommendation

  1. For per-use FP32 activations:
  • Use pure FP32 on gfx906.
  2. For high-reuse kernels (where conversion is amortized):
  • Dot paths (dot4/dot2) can be worthwhile.
  • Optimize for reuse depth before deciding.
  3. Do not rely on theoretical dot throughput alone:
  • End-to-end cost is dominated by conversion/packing when done in the hot path.

Instruction Validation Notes

Codegen validation on gfx906:

  • __builtin_amdgcn_sdot4 -> v_dot4_i32_i8
  • __builtin_amdgcn_fdot2 -> v_dot2_f32_f16

Related references:

  • LLVM AMDGPU usage/reference: https://llvm.org/docs/AMDGPUUsage.html
  • LLVM gfx906 syntax: https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX906.html

Contributing

Thank you for considering contributing to the GFX906 Wiki! This guide outlines how you can help improve the documentation.

Easy Contribution

If you want to simply add or edit some markdown files without touching the console, follow this method.

To edit a markdown file on the wiki:

  1. Open it on GitHub and click the "edit" button. You will be prompted to create a fork if you don't already have one; do so.

  2. Make your edits and be sure to keep the markdown clean as outlined in Making Changes.

  3. Commit and repeat this process until you are done.

  4. Submit a pull request to the "main" branch.

To add your own existing markdown files:

  1. Click the Fork button on the top right of the repository page.

  2. Click "Add file" on the main page of your fork. Click "Upload Files". Make sure your files contain clean markdown. See Making Changes.

  3. Upload all files you want to add and click "Commit Changes". See Commit Messages.

  4. Update the SUMMARY.md file following the process above to add links to your files to the sidebar. Please try to respect the existing structure of the sidebar.

  5. Submit a pull request to the "main" branch.

Local deployment

  1. Fork the repository
    Click the Fork button on the top right of the repository page.

  2. Clone your fork

    git clone https://github.com/<your-username>/wiki-gfx906.git
    cd wiki-gfx906
    
  3. Create a feature branch

    git checkout -b <branch-name>
    
  4. Install mdBook (if you haven’t already)

    cargo install mdbook
    
  5. Build and preview locally

    mdbook serve
    

    Open http://localhost:3000 in your browser to see your changes live.

Making Changes

  • Add or edit content in the src/ directory.
  • Update SUMMARY.md to include any new pages you add.
  • Keep markdown clean and consistent:
    • Use headings (#, ##, …) to structure sections.
    • Prefer fenced code blocks with language identifiers.
    • Use relative links for internal navigation.

Commit Messages

Write clear, concise commit messages. Follow this format:

<type>: <short description>

<optional longer description>

Common <type> values:

  • docs: documentation updates
  • fix: typo or small correction
  • feat: new page or major addition

Pull Request Process

  1. Push your branch to your fork:

    git push origin <branch-name>
    
  2. Open a Pull Request against the main branch of the upstream repository.

    • Provide a descriptive title and summary of changes.
    • Link to any relevant issue(s) (e.g., Closes #42).
  3. Review – maintainers will review your PR. Respond to feedback promptly.

Code of Conduct

We expect all contributors to behave respectfully. Harassment and discrimination of any kind will not be tolerated. See the CODE_OF_CONDUCT.md for details.

License

By contributing you agree that your contributions will be licensed under the same license as the project (see LICENSE).


Happy documenting! 🎉

How to install ROCm 7.x on Linux

The steps outlined here are based on this thread. There is also a video guide for Ubuntu 24.04 LTS here.

Approach

AMD has stopped shipping the tensor files for gfx906 with newer ROCm releases, even though the hardware is still compatible. The workaround is simple: add the missing tensor files back yourself.

ROCm Quick install

  1. Go to https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html and copy & paste the outlined commands.

  2. During the installation, you may be prompted to add a key if you have secure boot enabled.

  3. After completing the install, do NOT reboot yet.

Getting the missing tensor files

The missing tensor files can be found in the Arch Linux rocblas package (https://archlinux.org/packages/extra/x86_64/rocblas/). Despite being built for ROCm 6.4, they work:

  1. Download the rocblas package: https://archlinux.org/packages/extra/x86_64/rocblas/download/

  2. Go to the location you downloaded it to and extract it:

cd Downloads/ && tar -xf rocblas-6.4.4-1-x86_64.pkg.tar.zst

  3. There should now be two folders, opt/ and usr/.

  4. Copy all the files containing the string "gfx906" to /opt/rocm/lib/rocblas/library (sudo privileges required):

sudo cp opt/rocm/lib/rocblas/library/*gfx906* /opt/rocm/lib/rocblas/library

  5. Now reboot.

  6. If you enrolled a key for secure boot, you will get a blue screen with some options. Select "Enroll MOK" and type in the password you assigned earlier.

  7. Check if it worked by running sudo update-alternatives --display rocm

Post-install (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/post-install.html)

  1. Configure the system linker by specifying where to find the shared objects (.so files) for ROCm applications:
sudo tee --append /etc/ld.so.conf.d/rocm.conf <<EOF
/opt/rocm/lib
/opt/rocm/lib64
EOF
sudo ldconfig
  2. Add the paths to your ~/.bashrc:
echo 'export PATH=$PATH:/opt/rocm-7.2.0/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/opt/rocm-7.2.0/lib' >> ~/.bashrc
source ~/.bashrc

That's it, enjoy!

How to install ComfyUI on Linux

The following instructions are aimed at Ubuntu 24.04 LTS using ROCm 7.2. If you are using a different distro, I recommend uv and a Python 3.12 virtual environment, i.e. uv venv venv --python 3.12

Install

  1. Clone the repository:
git clone https://github.com/comfyanonymous/ComfyUI.git
  2. Change to the ComfyUI directory and create a Python virtual environment:
cd ComfyUI
python3 -m venv venv
  3. Activate the virtual environment:
source venv/bin/activate
  4. Update pip:
pip install --upgrade pip wheel setuptools
  5. Install the PyTorch wheels; you can experiment with different versions for more stability or newer features:
pip install torch torchvision torchaudio triton --index-url https://download.pytorch.org/whl/rocm7.1
  6. Install the requirements for ComfyUI:
pip install -r requirements.txt

Verify it

  1. Run python3 main.py to check that it installed properly; you can exit afterwards, as there are still a few more steps.

Creating a script to make it easier to run

  1. Use your favourite text editor to create the script, with whatever name and location you like, e.g.
cd ~
nano run-comfyui.sh
  2. Insert the following, adjusting the ROCm environment paths accordingly:
#!/bin/bash
export PATH=$PATH:/opt/rocm-7.2.0/bin
export LD_LIBRARY_PATH=/opt/rocm-7.2.0/lib

cd ComfyUI
source venv/bin/activate
python3 main.py --use-split-cross-attention --disable-smart-memory --front-end-version Comfy-Org/ComfyUI_frontend@latest
  3. Make the script executable:
chmod +x run-comfyui.sh

Note:

These are the parameters that seem to work best for Z-Image Turbo, but more testing is needed, including with other models. You may come across many other environment variables, but I haven't seen any perceptible difference from them on gfx906. Feel free to remove --front-end-version if you experience problems with the latest frontend version.

Fixing missing ROCm environment paths:

If for any reason you get a missing tensor files error in ComfyUI, please check the "installing_ROCm_7.x" guide to obtain them. If you still encounter the error, it means the environment is not set properly. You can also manually add the files to ComfyUI:

sudo cp ~/rocblas7.1/rocblas/*gfx906* ~/ComfyUI/venv/lib/python3.12/site-packages/torch/lib/rocblas/library/
To install the ComfyUI-Manager extension (optional):

cd ComfyUI/custom_nodes
git clone https://github.com/ltdrdata/ComfyUI-Manager comfyui-manager

That's it, now you can run ComfyUI by running the script ./run-comfyui.sh

Nightly ROCm builds with gfx90x support (including gfx906): https://therock-nightly-tarball.s3.amazonaws.com