vllm - ✅ (Solved) Fix [Bug]: ROCM_ATTN produces incorrect output for LiquidAI LFM2 [1 pull request, 1 comment, 2 participants]

GitHub stats
vllm-project/vllm#41472 (fetched 2026-05-02 05:27:58)
Comments: 1 · Participants: 2 · Timeline events: 10 · Reactions: 0
Timeline (top): mentioned ×3, subscribed ×3, added_to_project_v2 ×1, commented ×1

ROCM_ATTN appears to produce incorrect, gibberish continuations for LiquidAI/LFM2-8B-A1B on ROCm. The same prompts produce sane outputs with Hugging Face Transformers in the same environment, while vLLM with the ROCm attention path emits initially correct tokens followed by corrupted tails.

This issue is intended to track the root cause discussed in https://github.com/vllm-project/vllm/pull/41054. That PR works around the problem by avoiding the default ROCm attention backend for this model family, but the underlying ROCM_ATTN correctness issue should be tracked separately.

Root Cause

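The root cause has not been confirmed. The reporter's working hypothesis is that the ROCm attention path mishandles the hybrid cache / LFM cache layout, so the first token(s) are plausible and later tokens degrade. PR https://github.com/vllm-project/vllm/pull/41054 only works around the problem by avoiding the default ROCm attention backend for this model family; the underlying ROCM_ATTN correctness issue is tracked separately here.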

PR fix notes

PR #41054: Fix LFM2 decoding on ROCm

Description (problem / solution / changelog)

Summary

  • Force TRITON_ATTN for both Lfm2ForCausalLM and Lfm2MoeForCausalLM on ROCm when the user has not explicitly selected an attention backend.
  • Add focused config tests for ROCm auto-selection, explicit-backend preservation, and non-ROCm behavior.
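
The selection rule described in the first bullet can be sketched in plain Python. This is not the code from vllm/model_executor/models/config.py: the function name, its parameters, and the "return None to keep auto-selection" convention are hypothetical, chosen only to illustrate the three cases the tests are said to cover (ROCm auto-selection, explicit-backend preservation, non-ROCm behavior).

```python
from typing import Optional

# Architectures named in the PR summary.
LFM2_ARCHITECTURES = {"Lfm2ForCausalLM", "Lfm2MoeForCausalLM"}


def pick_attention_backend(
    architecture: str,
    is_rocm: bool,
    user_backend: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical helper: return the backend to force, or None to
    leave the platform's normal auto-selection untouched."""
    # An explicit user choice is always preserved.
    if user_backend is not None:
        return user_backend
    # Only LFM2-family models on ROCm are redirected to TRITON_ATTN.
    if is_rocm and architecture in LFM2_ARCHITECTURES:
        return "TRITON_ATTN"
    # Every other combination keeps the default behavior.
    return None
```

Under these assumptions, an LFM2 model on ROCm with no user choice gets TRITON_ATTN, an explicit ROCM_ATTN request is passed through unchanged, and the same model on a non-ROCm platform is left alone.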

Why

On AMD GPUs, the default ROCm attention backend for LFM2-family models can produce corrupted decode output after an initially plausible first token. Forcing TRITON_ATTN avoids the decode-time corruption and matches Hugging Face greedy generation on the validation prompts.

As a baseline, the latest public ROCm wheel (vllm==0.20.0+rocm721) with LiquidAI/LFM2-8B-A1B generated:

  • 4..
  • Paris, in the context of Manoto Di Tella Declaration for our discussion:
  • invalid JSON/gibberish continuation

The same baseline wheel with LiquidAI/LFM2.5-350M also selected ROCM_ATTN and diverged from Hugging Face:

  • HF: 4; vLLM: 4 consideration consideration paris ing ...
  • HF: Paris; vLLM: Paris's, amicate_query gain thelm
  • JSON prompt: the vLLM output diverged from the HF prefix to ```jsonx,

Forcing TRITON_ATTN produced exact Hugging Face matches for both tested LFM2 models.

Validation

  • python -m py_compile vllm/model_executor/models/config.py tests/models/test_config.py
  • git diff --check
  • python -m pytest tests/models/test_config.py -q (3 passed)
  • Slurm on AMD GPU compute nodes, torch==2.10.0+git8514f05, HIP 7.2.53211, transformers==5.6.2, vllm==0.20.0+rocm721:
    • LiquidAI/LFM2-8B-A1B HF reference: compare-lfm2-hf-rerun.json
    • LiquidAI/LFM2-8B-A1B public vLLM baseline: compare-lfm2-vllm-550279.out
    • LiquidAI/LFM2-8B-A1B patched vLLM: compare-lfm2-vllm-auto-fix-550430.out
    • LiquidAI/LFM2.5-350M public vLLM baseline: compare-lfm25-350m-557120.out / compare-lfm25-350m.json
    • LiquidAI/LFM2.5-350M forced TRITON_ATTN: compare-lfm25-350m-triton-lowmem-557156.out / compare-lfm25-350m-triton-lowmem.json

Changed files

  • tests/models/test_config.py (added, +43/-0)
  • vllm/model_executor/models/config.py (modified, +17/-0)

Environment

  • Hardware: AMD Instinct MI325X (gfx942)
  • ROCm/HIP: 7.2.53211
  • vLLM wheel: 0.20.0+rocm721
  • Torch: 2.10.0+git8514f05
  • Transformers: 5.6.2
  • Model: LiquidAI/LFM2-8B-A1B
  • dtype: bfloat16
  • max_model_len=2048
  • enforce_eager=True

I reproduced this with the ROCm wheels. I have not yet verified the release Docker image.

Reproduction

Minimal shape of the vLLM run:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model = "LiquidAI/LFM2-8B-A1B"
messages = [
    [{"role": "user", "content": "Answer with a single number only. What is 2 + 2?"}],
    [{"role": "user", "content": "What is the capital of France? Answer in a few words."}],
    [{"role": "user", "content": "Write a valid JSON object with key color and value blue."}],
]

tokenizer = AutoTokenizer.from_pretrained(model)
prompts = [
    tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
    for m in messages
]

llm = LLM(
    model=model,
    tensor_parallel_size=1,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
    max_model_len=2048,
    enforce_eager=True,
)

outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=16))
print([o.outputs[0].text for o in outputs])

The log confirms vLLM selects ROCM_ATTN:

Using ROCM_ATTN backend out of potential backends: ['ROCM_ATTN', 'ROCM_AITER_UNIFIED_ATTN', 'TRITON_ATTN'].
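
When comparing several runs, it can help to pull the selected backend out of the log automatically. The sketch below parses the exact line quoted above; the regular expression and the helper name are assumptions for illustration, and the log wording may differ in other vLLM versions.

```python
import ast
import re

# The log line exactly as quoted in this issue.
LOG_LINE = (
    "Using ROCM_ATTN backend out of potential backends: "
    "['ROCM_ATTN', 'ROCM_AITER_UNIFIED_ATTN', 'TRITON_ATTN']."
)

# Hypothetical pattern: captures the chosen backend and the candidate list.
PATTERN = re.compile(
    r"Using (?P<selected>\w+) backend out of potential backends: "
    r"(?P<candidates>\[.*\])"
)


def parse_backend_line(line):
    """Return (selected, candidates), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # The candidate list is printed as a Python list literal.
    candidates = ast.literal_eval(match.group("candidates"))
    return match.group("selected"), candidates


selected, candidates = parse_backend_line(LOG_LINE)
print(selected, candidates)
```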

Observed output

With vLLM 0.20.0+rocm721:

[
  "4,\u201d",
  "Paris, in the context of Manoto Di Tella Declaration for our discussion:",
  "```json\n\"Description: Short forages for life cycle knowledge, such as"
]

I also tried the AITER path, which still shows corrupted continuations:

[
  "4.",
  "Paris,, when you mean to say it, one steps, and the field",
  "```json\n\"Restrictive\" niche, in principle, a similar concern"
]

With vLLM 0.19.0+rocm721 and transformers==5.5.0 forced into the env, the model loads and generates but the outputs are still degraded:

[
  "4 c.",
  "Paris, which is"
]

Expected output / control

With Hugging Face Transformers on the same model and prompts:

[
  "4",
  "Paris",
  "```json\n{\n  \"color\": \"blue\"\n}\n```"
]
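
To make the "correct start, corrupted tail" pattern concrete, the following sketch computes the shared prefix between each Hugging Face output and the corresponding vLLM 0.20.0+rocm721 output listed above. It compares characters rather than token IDs, so it only approximates where decoding diverges; the helper name is illustrative.

```python
import os

FENCE = "`" * 3  # three backticks, built here to keep this listing readable

# Hugging Face control outputs, copied from the issue.
expected = [
    "4",
    "Paris",
    FENCE + 'json\n{\n  "color": "blue"\n}\n' + FENCE,
]

# vLLM 0.20.0+rocm721 outputs with ROCM_ATTN, copied from the issue.
observed = [
    "4,\u201d",
    "Paris, in the context of Manoto Di Tella Declaration for our discussion:",
    FENCE + 'json\n"Description: Short forages for life cycle knowledge, such as',
]


def shared_prefix(reference, candidate):
    """Longest common leading substring of two strings."""
    return os.path.commonprefix([reference, candidate])


prefixes = [shared_prefix(e, o) for e, o in zip(expected, observed)]
for e, o, p in zip(expected, observed, prefixes):
    print(f"matched {len(p)} of {len(e)} reference chars; tail: {o[len(p):]!r}")
```

For the first two prompts the whole reference answer is a prefix of the vLLM output, and for the JSON prompt the outputs agree through the opening fence line, which is consistent with the reporter's description of plausible first tokens followed by degradation.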

Notes

  • This seems model-family specific. In the same environment, ROCM_ATTN behaves normally for a Qwen model.
  • My current hypothesis is that the ROCm attention path mishandles the hybrid cache / LFM cache layout, causing the first token(s) to be plausible and later tokens to degrade.
  • The behavior is visible even with enforce_eager=True, so this is separate from CUDA graph capture issues.
  • Related PR: https://github.com/vllm-project/vllm/pull/41054

extent analysis

TL;DR

The issue can be worked around by avoiding the default ROCm attention backend (ROCM_ATTN) for LFM2-family models such as LiquidAI/LFM2-8B-A1B; PR #41054 does this by forcing TRITON_ATTN.

Guidance

  • Confirm that the problem is specific to the LFM2 model family by running another model through ROCM_ATTN in the same environment. The reporter saw normal behavior for a Qwen model.
  • Switch to TRITON_ATTN, which matched Hugging Face greedy output for both tested LFM2 models in PR #41054. The AITER path is not a fix: it still produced corrupted continuations in the reporter's test.
  • Investigate how the ROCm attention path handles the hybrid cache / LFM cache layout, which is the reporter's hypothesis for the token degradation.
  • Do not expect a version downgrade to help: output was degraded with both vLLM 0.20.0+rocm721 and vLLM 0.19.0+rocm721 (the latter with transformers==5.5.0).

Example

The script in the Reproduction section is the relevant example. Run it once with the default backend and once with TRITON_ATTN selected, then compare both results against the Hugging Face output.

Notes

The issue seems to be model-family specific and is not related to CUDA graph capture issues. The behavior is visible even with enforce_eager=True.

Recommendation

Apply the workaround from the related PR https://github.com/vllm-project/vllm/pull/41054, which avoids the default ROCm attention backend for LFM2-family models. The underlying cause is still unconfirmed; the reporter's hypothesis is that the ROCm attention path mishandles the hybrid cache / LFM cache layout.
