vllm - 💡(How to fix) Fix [Bug][ROCm] GLM-5 MXFP4 sparse MLA decode crash on MI355x [1 comments, 2 participants]

ChuanLi1101 · 2026-04-03T16:36:05Z

[vllm] GLM-5 GlmMoeDsaForCausalLM with MXFP4 quantization crashes during the decode phase on ROCm MI355x gfx950 with 8 GPUs. The model loads and prefills succe… GLM-5 (GlmMoeDsaForCausalLM) with MXFP4 quantization crashes during the **decode** phase on ROCm MI355x (gfx950) with 8 GPUs. The model loads and prefills successfully, but decode consistently fails with either `ZeroDivisionError` or `Memory access fault`. ## Fix / Workaround Two complementary approaches (both needed): 1. **MLA decode kernel**: Head repeat padding from 8->16 (temporary workaround in PR #36855) until AITER supports nhead < 16 natively 2. **Indexer MQA logits**: Fall back to PyTorch reference implementation when `heads < 16` (implemented in `fix/rocm-glm5-mxfp4-optimizations` branch) ## Summary GLM-5 (GlmMoeDsaForCausalLM) with MXFP4 quantization crashes during the **decode** phase on ROCm MI355x (gfx950) with 8 GPUs. The model loads and prefills successfully, but decode consistently fails with either `ZeroDivisionError` or `Memory access fault`. ## Root Cause GLM-5 has 64 attention heads and 32 sparse indexer heads (`index_n_heads=32`). At TP=8, the MLA decode kernel receives 8 heads per GPU, but AITER's sparse MLA kernels require `num_heads >= 16`: 1. **MLA decode kernel**: `mla_decode_stage1_asm_fwd` only supports `gqa >= 16`. With TP=8, `gqa=8` triggers `RuntimeError: get_heuristic_kernel_mla: cannot get heuristic kernel! gqa:8` 2. **FP8 paged MQA logits (indexer)**: `deepgemm_fp8_paged_mqa_logits_stage1` computes `TileQCount = heads // ChunkQ` with default `ChunkQ=64`. When `heads = 16` requirement for the MLA kernel), a `Memory access fault` persists during decode, suggesting additional issues in the sparse attention indexer's forward_hip path. ## Environment - **GPU**: 8x AMD MI355X (gfx950) - **ROCm**: 7.2.1 - **vLLM**: main branch (latest) - **Model**: GLM-5-MXFP4 (`zai-org/GLM-5-MXFP4` or equivalent Quark checkpoint) - **Config**: `num_attention_heads=64`, `index_n_heads=32`, `kv_lora_rank=512`, `n_routed_experts=256` ## Reproduction ```bash export VLLM_ROCM_USE_AITER=1 export VLLM_ROCM_USE_AITER_LINEAR=1 export VLLM_ROCM_USE_AITER_MOE=1 export VLLM_ROCM_USE_AITER_MLA=1 vllm serve /path/to/GLM-5-MXFP4 \ --tensor-parallel-size 8 \ --block-size 1 \ --gpu-memory-utilization 0.90 \ --enforce-eager ``` Then send any chat completion request - prefill succeeds but decode crashes. ## Error Traces ### ZeroDivisionError (TP=8, indexer path) ``` File "aiter/ops/triton/attention/pa_mqa_logits.py", line 198, in deepgemm_fp8_paged_mqa_logits_stage1 SplitKV = (max(1, TotalCuCount // TileQCount) + 4) // 5 * 5 * WavePerEU ZeroDivisionError: integer division or modulo by zero ``` ### Memory access fault (TP=4 or TP=8, during decode) ``` Memory access fault by GPU node-X (Agent handle: ...) on address ... ``` ## Related - PR #36855 - Fix for head repeat in sparse MLA (addresses the `gqa:8` crash) - https://github.com/ROCm/aiter/issues/726 - Upstream feature request for native nhead=8 support - https://github.com/ROCm/aiter/issues/2563 - Upstream issue for ZeroDivisionError in paged MQA logits - Issue #34553 - GLM-5 FP8 on H200 OOM in sparse_attn_indexer (CUDA, different issue) - Issue #36220 - GLM-5 quantized serving failure (AWQ/INT4, different quantization) ## Proposed Fix Two complementary approaches (both needed): 1. **MLA decode kernel**: Head repeat padding from 8->16 (temporary workaround in PR #36855) until AITER supports nhead < 16 natively 2. **Indexer MQA logits**: Fall back to PyTorch reference implementation when `heads < 16` (implemented in `fix/rocm-glm5-mxfp4-optimizations` branch)

vllm2026-04-03 16:36:05

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#38924•Fetched 2026-04-08 02:44:57

View on GitHub

Comments

Participants

Timeline

Reactions

Author

ChuanLi1101

Participants

ChuanLi1101

github-actions[bot]

Timeline (top)

mentioned ×3subscribed ×3project_v2_item_status_changed ×2added_to_project_v2 ×1

GLM-5 (GlmMoeDsaForCausalLM) with MXFP4 quantization crashes during the decode phase on ROCm MI355x (gfx950) with 8 GPUs. The model loads and prefills successfully, but decode consistently fails with either ZeroDivisionError or Memory access fault.

Error Message

File "aiter/ops/triton/attention/pa_mqa_logits.py", line 198, in deepgemm_fp8_paged_mqa_logits_stage1 SplitKV = (max(1, TotalCuCount // TileQCount) + 4) // 5 * 5 * WavePerEU ZeroDivisionError: integer division or modulo by zero

Root Cause

GLM-5 has 64 attention heads and 32 sparse indexer heads (index_n_heads=32). At TP=8, the MLA decode kernel receives 8 heads per GPU, but AITER's sparse MLA kernels require num_heads >= 16:

MLA decode kernel: mla_decode_stage1_asm_fwd only supports gqa >= 16. With TP=8, gqa=8 triggers RuntimeError: get_heuristic_kernel_mla: cannot get heuristic kernel! gqa:8
FP8 paged MQA logits (indexer): deepgemm_fp8_paged_mqa_logits_stage1 computes TileQCount = heads // ChunkQ with default ChunkQ=64. When heads < 64, TileQCount=0 causes ZeroDivisionError at SplitKV = (max(1, TotalCuCount // TileQCount) + 4) // 5 * 5 * WavePerEU

Even with TP=4 (16 heads per GPU, satisfying the >= 16 requirement for the MLA kernel), a Memory access fault persists during decode, suggesting additional issues in the sparse attention indexer's forward_hip path.

Fix Action

Fix / Workaround

Two complementary approaches (both needed):

MLA decode kernel: Head repeat padding from 8->16 (temporary workaround in PR #36855) until AITER supports nhead < 16 natively
Indexer MQA logits: Fall back to PyTorch reference implementation when heads < 16 (implemented in fix/rocm-glm5-mxfp4-optimizations branch)

Code Example

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_ROCM_USE_AITER_MLA=1

vllm serve /path/to/GLM-5-MXFP4 \
    --tensor-parallel-size 8 \
    --block-size 1 \
    --gpu-memory-utilization 0.90 \
    --enforce-eager

---

File "aiter/ops/triton/attention/pa_mqa_logits.py", line 198, in deepgemm_fp8_paged_mqa_logits_stage1
    SplitKV = (max(1, TotalCuCount // TileQCount) + 4) // 5 * 5 * WavePerEU
ZeroDivisionError: integer division or modulo by zero

---

Memory access fault by GPU node-X (Agent handle: ...) on address ...

RAW_BUFFERClick to expand / collapse

Summary

Root Cause

GLM-5 has 64 attention heads and 32 sparse indexer heads (index_n_heads=32). At TP=8, the MLA decode kernel receives 8 heads per GPU, but AITER's sparse MLA kernels require num_heads >= 16:

MLA decode kernel: mla_decode_stage1_asm_fwd only supports gqa >= 16. With TP=8, gqa=8 triggers RuntimeError: get_heuristic_kernel_mla: cannot get heuristic kernel! gqa:8
FP8 paged MQA logits (indexer): deepgemm_fp8_paged_mqa_logits_stage1 computes TileQCount = heads // ChunkQ with default ChunkQ=64. When heads < 64, TileQCount=0 causes ZeroDivisionError at SplitKV = (max(1, TotalCuCount // TileQCount) + 4) // 5 * 5 * WavePerEU

Environment

GPU: 8x AMD MI355X (gfx950)
ROCm: 7.2.1
vLLM: main branch (latest)
Model: GLM-5-MXFP4 (zai-org/GLM-5-MXFP4 or equivalent Quark checkpoint)
Config: num_attention_heads=64, index_n_heads=32, kv_lora_rank=512, n_routed_experts=256

Reproduction

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_ROCM_USE_AITER_MLA=1

vllm serve /path/to/GLM-5-MXFP4 \
    --tensor-parallel-size 8 \
    --block-size 1 \
    --gpu-memory-utilization 0.90 \
    --enforce-eager

Then send any chat completion request - prefill succeeds but decode crashes.

Error Traces

ZeroDivisionError (TP=8, indexer path)

File "aiter/ops/triton/attention/pa_mqa_logits.py", line 198, in deepgemm_fp8_paged_mqa_logits_stage1
    SplitKV = (max(1, TotalCuCount // TileQCount) + 4) // 5 * 5 * WavePerEU
ZeroDivisionError: integer division or modulo by zero

Memory access fault (TP=4 or TP=8, during decode)

Memory access fault by GPU node-X (Agent handle: ...) on address ...

PR #36855 - Fix for head repeat in sparse MLA (addresses the gqa:8 crash)
https://github.com/ROCm/aiter/issues/726 - Upstream feature request for native nhead=8 support
https://github.com/ROCm/aiter/issues/2563 - Upstream issue for ZeroDivisionError in paged MQA logits
Issue #34553 - GLM-5 FP8 on H200 OOM in sparse_attn_indexer (CUDA, different issue)
Issue #36220 - GLM-5 quantized serving failure (AWQ/INT4, different quantization)

Proposed Fix

Two complementary approaches (both needed):

MLA decode kernel: Head repeat padding from 8->16 (temporary workaround in PR #36855) until AITER supports nhead < 16 natively
Indexer MQA logits: Fall back to PyTorch reference implementation when heads < 16 (implemented in fix/rocm-glm5-mxfp4-optimizations branch)

extent analysis

TL;DR

To fix the crash during the decode phase of GLM-5 with MXFP4 quantization on ROCm MI355x, apply a temporary workaround by padding the MLA decode kernel from 8 to 16 heads and fall back to the PyTorch reference implementation for indexer MQA logits when heads are less than 16.

Guidance

Apply head repeat padding: Implement the workaround from PR #36855 to pad the MLA decode kernel from 8 to 16 heads, ensuring compatibility until native support for fewer heads is available in AITER.
Fallback to PyTorch reference implementation: For indexer MQA logits, use the fallback to the PyTorch reference implementation when the number of heads is less than 16, as implemented in the fix/rocm-glm5-mxfp4-optimizations branch.
Verify compatibility and performance: After applying these fixes, verify that the model loads, prefills, and decodes successfully without crashes, and monitor performance to ensure there are no significant regressions.

Example

No specific code snippet is provided due to the complexity and specificity of the issue, but the fixes mentioned are intended to be applied as described in the referenced PR and branch.

Notes

These fixes are temporary workarounds and fixes. Native support for fewer heads in AITER and further optimizations may be necessary for long-term stability and performance.

Recommendation

Apply the workaround, as it directly addresses the identified issues and provides a path forward until more permanent fixes or native support is available in AITER.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#optimization #task chaining #parallel task #integration issue #index setup

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug][ROCm] GLM-5 MXFP4 sparse MLA decode crash on MI355x [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Code Example

Summary

Root Cause

Environment

Reproduction

Error Traces

ZeroDivisionError (TP=8, indexer path)

Memory access fault (TP=4 or TP=8, during decode)

Related

Proposed Fix

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug][ROCm] GLM-5 MXFP4 sparse MLA decode crash on MI355x [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Code Example

Summary

Root Cause

Environment

Reproduction

Error Traces

ZeroDivisionError (TP=8, indexer path)

Memory access fault (TP=4 or TP=8, during decode)

Related

Proposed Fix

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING