vllm - 💡(How to fix) Fix [Bug][ROCm] GLM-5 MXFP4 sparse MLA decode crash on MI355x [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#38924Fetched 2026-04-08 02:44:57
View on GitHub
Comments
1
Participants
2
Timeline
11
Reactions
0
Timeline (top)
mentioned ×3subscribed ×3project_v2_item_status_changed ×2added_to_project_v2 ×1

GLM-5 (GlmMoeDsaForCausalLM) with MXFP4 quantization crashes during the decode phase on ROCm MI355x (gfx950) with 8 GPUs. The model loads and prefills successfully, but decode consistently fails with either ZeroDivisionError or Memory access fault.

Error Message

File "aiter/ops/triton/attention/pa_mqa_logits.py", line 198, in deepgemm_fp8_paged_mqa_logits_stage1 SplitKV = (max(1, TotalCuCount // TileQCount) + 4) // 5 * 5 * WavePerEU ZeroDivisionError: integer division or modulo by zero

Root Cause

GLM-5 has 64 attention heads and 32 sparse indexer heads (index_n_heads=32). At TP=8, the MLA decode kernel receives 8 heads per GPU, but AITER's sparse MLA kernels require num_heads >= 16:

  1. MLA decode kernel: mla_decode_stage1_asm_fwd only supports gqa >= 16. With TP=8, gqa=8 triggers RuntimeError: get_heuristic_kernel_mla: cannot get heuristic kernel! gqa:8
  2. FP8 paged MQA logits (indexer): deepgemm_fp8_paged_mqa_logits_stage1 computes TileQCount = heads // ChunkQ with default ChunkQ=64. When heads < 64, TileQCount=0 causes ZeroDivisionError at SplitKV = (max(1, TotalCuCount // TileQCount) + 4) // 5 * 5 * WavePerEU

Even with TP=4 (16 heads per GPU, satisfying the >= 16 requirement for the MLA kernel), a Memory access fault persists during decode, suggesting additional issues in the sparse attention indexer's forward_hip path.

Fix Action

Fix / Workaround

Two complementary approaches (both needed):

  1. MLA decode kernel: Head repeat padding from 8->16 (temporary workaround in PR #36855) until AITER supports nhead < 16 natively
  2. Indexer MQA logits: Fall back to PyTorch reference implementation when heads < 16 (implemented in fix/rocm-glm5-mxfp4-optimizations branch)

Code Example

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_ROCM_USE_AITER_MLA=1

vllm serve /path/to/GLM-5-MXFP4 \
    --tensor-parallel-size 8 \
    --block-size 1 \
    --gpu-memory-utilization 0.90 \
    --enforce-eager

---

File "aiter/ops/triton/attention/pa_mqa_logits.py", line 198, in deepgemm_fp8_paged_mqa_logits_stage1
    SplitKV = (max(1, TotalCuCount // TileQCount) + 4) // 5 * 5 * WavePerEU
ZeroDivisionError: integer division or modulo by zero

---

Memory access fault by GPU node-X (Agent handle: ...) on address ...
RAW_BUFFERClick to expand / collapse

Summary

GLM-5 (GlmMoeDsaForCausalLM) with MXFP4 quantization crashes during the decode phase on ROCm MI355x (gfx950) with 8 GPUs. The model loads and prefills successfully, but decode consistently fails with either ZeroDivisionError or Memory access fault.

Root Cause

GLM-5 has 64 attention heads and 32 sparse indexer heads (index_n_heads=32). At TP=8, the MLA decode kernel receives 8 heads per GPU, but AITER's sparse MLA kernels require num_heads >= 16:

  1. MLA decode kernel: mla_decode_stage1_asm_fwd only supports gqa >= 16. With TP=8, gqa=8 triggers RuntimeError: get_heuristic_kernel_mla: cannot get heuristic kernel! gqa:8
  2. FP8 paged MQA logits (indexer): deepgemm_fp8_paged_mqa_logits_stage1 computes TileQCount = heads // ChunkQ with default ChunkQ=64. When heads < 64, TileQCount=0 causes ZeroDivisionError at SplitKV = (max(1, TotalCuCount // TileQCount) + 4) // 5 * 5 * WavePerEU

Even with TP=4 (16 heads per GPU, satisfying the >= 16 requirement for the MLA kernel), a Memory access fault persists during decode, suggesting additional issues in the sparse attention indexer's forward_hip path.

Environment

  • GPU: 8x AMD MI355X (gfx950)
  • ROCm: 7.2.1
  • vLLM: main branch (latest)
  • Model: GLM-5-MXFP4 (zai-org/GLM-5-MXFP4 or equivalent Quark checkpoint)
  • Config: num_attention_heads=64, index_n_heads=32, kv_lora_rank=512, n_routed_experts=256

Reproduction

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_ROCM_USE_AITER_MLA=1

vllm serve /path/to/GLM-5-MXFP4 \
    --tensor-parallel-size 8 \
    --block-size 1 \
    --gpu-memory-utilization 0.90 \
    --enforce-eager

Then send any chat completion request - prefill succeeds but decode crashes.

Error Traces

ZeroDivisionError (TP=8, indexer path)

File "aiter/ops/triton/attention/pa_mqa_logits.py", line 198, in deepgemm_fp8_paged_mqa_logits_stage1
    SplitKV = (max(1, TotalCuCount // TileQCount) + 4) // 5 * 5 * WavePerEU
ZeroDivisionError: integer division or modulo by zero

Memory access fault (TP=4 or TP=8, during decode)

Memory access fault by GPU node-X (Agent handle: ...) on address ...

Related

  • PR #36855 - Fix for head repeat in sparse MLA (addresses the gqa:8 crash)
  • https://github.com/ROCm/aiter/issues/726 - Upstream feature request for native nhead=8 support
  • https://github.com/ROCm/aiter/issues/2563 - Upstream issue for ZeroDivisionError in paged MQA logits
  • Issue #34553 - GLM-5 FP8 on H200 OOM in sparse_attn_indexer (CUDA, different issue)
  • Issue #36220 - GLM-5 quantized serving failure (AWQ/INT4, different quantization)

Proposed Fix

Two complementary approaches (both needed):

  1. MLA decode kernel: Head repeat padding from 8->16 (temporary workaround in PR #36855) until AITER supports nhead < 16 natively
  2. Indexer MQA logits: Fall back to PyTorch reference implementation when heads < 16 (implemented in fix/rocm-glm5-mxfp4-optimizations branch)

extent analysis

TL;DR

To fix the crash during the decode phase of GLM-5 with MXFP4 quantization on ROCm MI355x, apply a temporary workaround by padding the MLA decode kernel from 8 to 16 heads and fall back to the PyTorch reference implementation for indexer MQA logits when heads are less than 16.

Guidance

  1. Apply head repeat padding: Implement the workaround from PR #36855 to pad the MLA decode kernel from 8 to 16 heads, ensuring compatibility until native support for fewer heads is available in AITER.
  2. Fallback to PyTorch reference implementation: For indexer MQA logits, use the fallback to the PyTorch reference implementation when the number of heads is less than 16, as implemented in the fix/rocm-glm5-mxfp4-optimizations branch.
  3. Verify compatibility and performance: After applying these fixes, verify that the model loads, prefills, and decodes successfully without crashes, and monitor performance to ensure there are no significant regressions.

Example

No specific code snippet is provided due to the complexity and specificity of the issue, but the fixes mentioned are intended to be applied as described in the referenced PR and branch.

Notes

These fixes are temporary workarounds and fixes. Native support for fewer heads in AITER and further optimizations may be necessary for long-term stability and performance.

Recommendation

Apply the workaround, as it directly addresses the identified issues and provides a path forward until more permanent fixes or native support is available in AITER.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING