vllm - ✅(Solved) Fix [Bug]: Mistral Small 4 (119B MoE) fails to start on ROCm MI325X - two blocking issues [1 pull requests, 3 comments, 3 participants]

maincodeMax · 2026-04-04T07:31:17Z



vllm-project/vllm#38972•Fetched 2026-04-08 02:44:41


Mistral Small 4 (mistralai/Mistral-Small-4-119B-2603) can't load on AMD MI325X GPUs with the current vLLM ROCm image. Two separate issues are currently preventing startup. I have listed them below with greater detail. Keen to try and work on the solution, unless I am doing something incorrectly? Please let me know if that is the case!

Error Message

File "vllm/v1/attention/backends/mla/rocm_aiter_mla.py", line 214, in __init__
    assert num_heads == 16 or num_heads == 128
AssertionError: Aiter MLA only supports 16 or 128 number of heads.
Provided 32 number of heads.


Fix Action

Fix / Workaround

Issue 1: AITER MLA attention head assertion (fixed with TP=2)

With TP=1 (single GPU, 32 heads), the ROCm aiter MLA backend fails:

File "vllm/v1/attention/backends/mla/rocm_aiter_mla.py", line 214, in __init__
    assert num_heads == 16 or num_heads == 128
AssertionError: Aiter MLA only supports 16 or 128 number of heads.
Provided 32 number of heads.

Workaround: Setting --tensor-parallel-size 2 splits 32 heads into 16 per GPU, passing the assertion. However, I assume this shouldn't be required - the backend should either support 32 heads natively or fall back to a compatible backend.
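The arithmetic behind this workaround can be sketched as follows. This is an illustrative check only: `SUPPORTED_HEADS` mirrors the two values named in the assertion message, and `heads_per_rank` and `passes_assertion` are hypothetical helpers, not vLLM functions. It assumes attention heads are divided evenly across tensor-parallel ranks.

```python
# Head counts accepted by the assertion quoted above (16 or 128).
SUPPORTED_HEADS = {16, 128}


def heads_per_rank(total_heads: int, tp_size: int) -> int:
    """Return the number of attention heads each tensor-parallel rank holds."""
    if total_heads % tp_size != 0:
        raise ValueError(
            f"{total_heads} heads do not divide evenly across TP={tp_size}"
        )
    return total_heads // tp_size


def passes_assertion(total_heads: int, tp_size: int) -> bool:
    """True if the per-rank head count is one the assertion accepts."""
    return heads_per_rank(total_heads, tp_size) in SUPPORTED_HEADS


if __name__ == "__main__":
    # Mistral Small 4 is reported to have 32 attention heads.
    for tp in (1, 2, 4):
        print(f"TP={tp}: {heads_per_rank(32, tp)} heads per rank, "
              f"passes={passes_assertion(32, tp)}")
```

Under these assumptions, only TP=2 yields a per-rank head count that the assertion accepts for a 32-head model, which matches the behaviour reported here.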

Issue 2: MoE JIT kernel compilation timeout (no workaround)

With TP=2, the model gets past the attention head check but fails during ROCm JIT kernel compilation for the 128-expert MoE layer:

[aiter] waiting for baton release at .../lock_module_moe_ck2stages_f8_f8_preshuffle_off_b16_silu_per_tensor_mulWeightStage2_

The compilation takes longer than VLLM_ENGINE_READY_TIMEOUT_S (default 600s). The engine core process times out:

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}


PR fix notes

PR #39830: [ROCm][MLA] validate AITER head counts during selection

Repository: vllm-project/vllm
Author: Bortlesboat
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39830

Description (problem / solution / changelog)

Summary

reject unsupported dense ROCm AITER MLA head counts during backend selection instead of waiting for backend construction to assert
keep supported counts like 32 on the AITER path; the remaining bad case on current main is unsupported dense counts such as 24
add a selector-focused regression that exercises fallback and explicit backend validation without depending on the full ROCm kernel stack

Why this is not duplicating an existing PR

I re-ran the duplicate checks for #38972 and related ROCm MLA keywords before opening this.
The closest open PR is #36855, but that one fixes the sparse MLA num_heads < 16 path by repeating heads at runtime.
This PR is about dense ROCm AITER MLA backend selection rejecting unsupported head counts up front so selection can fall back cleanly.
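The idea of validating at selection time, rather than asserting at construction time, can be illustrated with a small self-contained sketch. Everything here is hypothetical: the `Backend` dataclass, `select_backend`, the backend names, and the supported head-count set are placeholders for illustration and do not reflect the PR's actual code or vLLM's selector API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Backend:
    """Hypothetical backend descriptor; not a vLLM class."""
    name: str
    supported_heads: Optional[frozenset]  # None means "any head count"

    def supports(self, num_heads: int) -> bool:
        return self.supported_heads is None or num_heads in self.supported_heads


def select_backend(candidates: list, num_heads: int,
                   explicit: Optional[str] = None) -> Backend:
    """Pick the first candidate that supports num_heads.

    If a backend was requested explicitly, fail early with a clear error
    instead of falling back silently.
    """
    if explicit is not None:
        chosen = next((b for b in candidates if b.name == explicit), None)
        if chosen is None:
            raise ValueError(f"Unknown backend: {explicit}")
        if not chosen.supports(num_heads):
            raise ValueError(f"{explicit} does not support {num_heads} heads")
        return chosen
    for backend in candidates:
        if backend.supports(num_heads):
            return backend
    raise ValueError(f"No backend supports {num_heads} heads")


# Placeholder priority list; the head-count set is illustrative only.
CANDIDATES = [
    Backend("aiter_mla", frozenset({16, 32, 128})),
    Backend("fallback_mla", None),
]
```

With this shape, an unsupported count such as 24 falls through to the second candidate during automatic selection, while an explicit request for the first backend raises a readable error before any kernel is constructed.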

Testing

uv run --no-project --with pytest --with torch --with numpy --with packaging --with pyyaml --with regex --with pydantic --with typing_extensions --with filelock --with cachetools --with blake3 --with msgspec --with msgpack --with cloudpickle --with psutil --with requests --with tqdm --with cbor2 --with pyzmq --with huggingface_hub --with transformers python -m pytest tests/v1/attention/test_rocm_aiter_mla_head_selection.py -q --noconftest
uv run --no-project python -m py_compile vllm/v1/attention/backend.py vllm/platforms/rocm.py vllm/platforms/cuda.py vllm/v1/attention/backends/mla/rocm_aiter_mla.py tests/v1/attention/test_rocm_aiter_mla_head_selection.py
git diff --check

AI assistance

I used AI assistance for drafting and local implementation, and I reviewed the final diff and test results before opening this PR.

Refs #38972.

Changed files

tests/v1/attention/test_rocm_aiter_mla_head_selection.py (added, +227/-0)
vllm/platforms/cuda.py (modified, +2/-0)
vllm/platforms/rocm.py (modified, +2/-0)
vllm/v1/attention/backend.py (modified, +8/-0)
vllm/v1/attention/backends/mla/rocm_aiter_mla.py (modified, +16/-0)



Your current environment

GPU: 8x AMD Instinct MI325X (256GB HBM3e each)
CPU: AMD EPYC 9575F
vLLM image: vllm/vllm-openai-rocm@sha256:cdea9cf61b3415bfdb4214ff56253f68645b0820d713a1745a21f0addeef4bd9
mistral-common: 1.11.0
ROCm: bundled in image

🐛 Describe the bug

Summary

Environment

GPU: 8x AMD Instinct MI325X (256GB VRAM each)
vLLM image: vllm/vllm-openai-rocm@sha256:cdea9cf61b3415bfdb4214ff56253f68645b0820d713a1745a21f0addeef4bd9
ROCm: (bundled in image)
Model: mistralai/Mistral-Small-4-119B-2603 (119B MoE, 128 experts, 4 active, 32 attention heads)
mistral-common: upgraded to 1.11.0 (required for v15 tokenizer)
Tensor parallelism: TP=2 across 2x MI325X

For context, I used:

python3 -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Small-4-119B-2603 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000 \
  --trust-remote-code \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --swap-space 8 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral

Issues

TLDR:

Hard assertion blocks 32 attention heads: rocm_aiter_mla.py:214 only allows 16 or 128 heads. Mistral Small 4 has 32. TP=2 works around it but shouldn't be required.
MoE kernel compilation exceeds startup timeout: The 128-expert FP8 MoE kernels take 15-20+ minutes to JIT compile on MI325X. vLLM's engine startup times out before compilation finishes, even with VLLM_ENGINE_READY_TIMEOUT_S=1800.

Issue 1: AITER MLA attention head assertion (fixed with TP=2)

With TP=1 (single GPU, 32 heads), the ROCm aiter MLA backend fails:

File "vllm/v1/attention/backends/mla/rocm_aiter_mla.py", line 214, in __init__
    assert num_heads == 16 or num_heads == 128
AssertionError: Aiter MLA only supports 16 or 128 number of heads.
Provided 32 number of heads.

Suggested fix: Expand the assertion to include all valid head counts (16, 32, 64, 128) or remove the hard assertion and let the kernel handle it. Excuse my naivety here on whether there is an existing fix.

Issue 2: MoE JIT kernel compilation timeout (no workaround)

With TP=2, the model gets past the attention head check but fails during ROCm JIT kernel compilation for the 128-expert MoE layer:

[aiter] waiting for baton release at .../lock_module_moe_ck2stages_f8_f8_preshuffle_off_b16_silu_per_tensor_mulWeightStage2_

The compilation takes longer than VLLM_ENGINE_READY_TIMEOUT_S (default 600s). The engine core process times out:

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Reproduce:

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Small-4-119B-2603",
    tensor_parallel_size=2,  # required to work around Issue 1
    max_model_len=32768,
    trust_remote_code=True,
    enable_auto_tool_choice=True,
    tool_call_parser="mistral",
)

# Never reached: the engine crashes during initialization above.
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8))

Setting VLLM_ENGINE_READY_TIMEOUT_S=1800 did not appear to take effect (the container still crashed at the same point). The MoE kernel compilation for 128 experts with FP8 weights on MI325X appears to take 15-20+ minutes, exceeding the engine startup timeout.

Suggested fix: not sure, ensure VLLM_ENGINE_READY_TIMEOUT_S actually propagates to engine core subprocesses? I am limited with ideas here however.
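One way to rule out an environment-propagation problem is to pass the variable explicitly to the process that launches the server. The sketch below only builds the command and environment; `build_launch` is a hypothetical helper, the flags are a subset of those used earlier in this report, and whether vLLM's engine core subprocess honours the value is exactly what this issue leaves open.

```python
import os


def build_launch(timeout_s: int, tp_size: int = 2):
    """Return (command, environment) for starting the API server."""
    env = dict(os.environ)  # copy, so the caller's environment is untouched
    env["VLLM_ENGINE_READY_TIMEOUT_S"] = str(timeout_s)
    cmd = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "mistralai/Mistral-Small-4-119B-2603",
        "--tensor-parallel-size", str(tp_size),
    ]
    return cmd, env
```

The pair can then be handed to `subprocess.run(cmd, env=env, check=True)`. When running inside a container, the variable also has to be set on the container itself (for example with `docker run -e`), since a value exported only on the host is not visible inside it.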

Related Issues

#37553 - Mistral Small 4 fails on 8x RTX 3090 (SM 8.6) with MLA backend
PR #37081 - Mistral tool calling/reasoning parser for post-v15 tokenizers


extent analysis

TL;DR

Two changes are proposed: let the ROCm AITER MLA backend in rocm_aiter_mla.py accept 32 attention heads (or fall back cleanly when it cannot), and make sure the engine startup timeout is both long enough and actually applied, so that it covers MoE kernel compilation.

Guidance

Modify the attention head assertion: In rocm_aiter_mla.py, the line assert num_heads == 16 or num_heads == 128 could be changed to assert num_heads in (16, 32, 64, 128), as the reporter suggests. This is untested and only helps if the AITER kernel actually supports those head counts. PR #39830 takes a different approach: it rejects unsupported head counts during backend selection so that selection can fall back to another backend.
Increase the engine startup timeout: Set VLLM_ENGINE_READY_TIMEOUT_S to a value above the reported 15-20+ minute compilation time (e.g., 3600) and confirm that it reaches the engine core subprocesses. The reporter found that setting it to 1800 did not appear to take effect, so raising the value alone may not be enough.
Verify the fixes: Re-run with --tensor-parallel-size 1 to check whether the attention head assertion is resolved, then with --tensor-parallel-size 2 to check whether startup survives MoE kernel compilation.
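For the verification step, a small polling helper can distinguish "still compiling" from "failed to start" without guessing at a fixed sleep. This is a generic sketch: `wait_until_ready` and `http_probe` are hypothetical helpers, and the health-check URL in the usage note is an assumption about the server, not something stated in this issue.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout_s: float,
                     interval_s: float = 5.0) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # server is not accepting connections yet
        if time.monotonic() >= deadline:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval_s, max(0.0, deadline - time.monotonic())))


def http_probe(url: str) -> Callable[[], bool]:
    """Build a probe that returns True when the URL answers with HTTP 200."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    return probe
```

A call such as `wait_until_ready(http_probe("http://localhost:8000/health"), timeout_s=3600)` would then wait up to an hour for the server on port 8000, assuming it exposes a health endpoint at that path.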

Example

# Set the timeout before vLLM is imported or the engine is started,
# so that any child processes can inherit it.
import os
os.environ["VLLM_ENGINE_READY_TIMEOUT_S"] = "3600"

Notes

The provided solutions assume that the issues are solely related to the attention head assertion and the MoE kernel compilation timeout. However, other factors might be contributing to the problems, and additional debugging may be necessary.

Recommendation

Treat both changes as experiments rather than confirmed fixes. PR #39830 is still open and unmerged, so --tensor-parallel-size 2 remains the only reported workaround for Issue 1, and Issue 2 has no known workaround.
