vllm - ✅(Solved) Fix [Bug]: ROCM_ATTN produces incorrect output for LiquidAI LFM2 [1 pull requests, 1 comments, 2 participants]

vllm · 2026-05-01 18:01:48


GitHub stats

vllm-project/vllm#41472 • Fetched 2026-05-02 05:27:58


Author

tianshu-Michael-yu

Participants

github-actions[bot]

tianshu-Michael-yu

Timeline (top)

mentioned ×3, subscribed ×3, added_to_project_v2 ×1, commented ×1

ROCM_ATTN appears to produce incorrect / gibberish continuations for LiquidAI/LFM2-8B-A1B on ROCm. The same prompts produce sane outputs with Hugging Face Transformers in the same environment, while vLLM with the ROCm attention path produces corrupted tails after initially-correct tokens.

This issue is intended to track the root cause discussed in https://github.com/vllm-project/vllm/pull/41054. That PR works around the problem by avoiding the default ROCm attention backend for this model family, but the underlying ROCM_ATTN correctness issue should be tracked separately.

Root Cause

PR fix notes

PR #41054: Fix LFM2 decoding on ROCm

Repository: vllm-project/vllm
Author: tianshu-Michael-yu
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41054

Description (problem / solution / changelog)

Summary

Force TRITON_ATTN for both Lfm2ForCausalLM and Lfm2MoeForCausalLM on ROCm when the user has not explicitly selected an attention backend.
Add focused config tests for ROCm auto-selection, explicit-backend preservation, and non-ROCm behavior.
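The selection rule in the summary above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the code in vllm/model_executor/models/config.py: the function name `pick_attention_backend`, its parameters, and the use of plain strings for backends are hypothetical. Only the two architecture names and the backend names come from the PR description.

```python
from typing import Optional

# Architectures named in the PR summary.
LFM2_ARCHITECTURES = {"Lfm2ForCausalLM", "Lfm2MoeForCausalLM"}


def pick_attention_backend(
    architecture: str,
    is_rocm: bool,
    user_backend: Optional[str] = None,
) -> Optional[str]:
    """Return the backend to force, or None to leave auto-selection alone.

    Hypothetical helper that mirrors the three cases the PR's tests cover.
    """
    # An explicit user choice is always preserved.
    if user_backend is not None:
        return user_backend
    # Only override the default for LFM2-family models on ROCm.
    if is_rocm and architecture in LFM2_ARCHITECTURES:
        return "TRITON_ATTN"
    # Other platforms and other models keep the platform default.
    return None


# ROCm auto-selection, explicit-backend preservation, non-ROCm behavior.
print(pick_attention_backend("Lfm2MoeForCausalLM", is_rocm=True))
print(pick_attention_backend("Lfm2ForCausalLM", is_rocm=True, user_backend="ROCM_ATTN"))
print(pick_attention_backend("Lfm2ForCausalLM", is_rocm=False))
```

Returning `None` for "no override" keeps the helper from having to know what the platform default is, which is the part the issue reports as broken.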

Why

On AMD GPUs, the default ROCm attention backend for LFM2-family models can produce corrupted decode output after an initially plausible first token. Forcing TRITON_ATTN avoids the decode-time corruption and matches Hugging Face greedy generation on the validation prompts.

Baseline latest public ROCm wheel (vllm==0.20.0+rocm721) with LiquidAI/LFM2-8B-A1B generated:

4..
Paris, in the context of Manoto Di Tella Declaration for our discussion:
invalid JSON/gibberish continuation

Baseline latest public ROCm wheel with LiquidAI/LFM2.5-350M selected ROCM_ATTN and also diverged from Hugging Face:

HF: 4; vLLM: 4 consideration consideration paris ing ...
HF: Paris; vLLM: Paris's, amicate_query gain thelm
HF JSON prefix diverged to ```jsonx,

Forcing TRITON_ATTN produced exact Hugging Face matches for both tested LFM2 models.

Validation

python -m py_compile vllm/model_executor/models/config.py tests/models/test_config.py
git diff --check
python -m pytest tests/models/test_config.py -q (3 passed)
Slurm on AMD GPU compute nodes, torch==2.10.0+git8514f05, HIP 7.2.53211, transformers==5.6.2, vllm==0.20.0+rocm721:
- LiquidAI/LFM2-8B-A1B HF reference: compare-lfm2-hf-rerun.json
- LiquidAI/LFM2-8B-A1B public vLLM baseline: compare-lfm2-vllm-550279.out
- LiquidAI/LFM2-8B-A1B patched vLLM: compare-lfm2-vllm-auto-fix-550430.out
- LiquidAI/LFM2.5-350M public vLLM baseline: compare-lfm25-350m-557120.out / compare-lfm25-350m.json
- LiquidAI/LFM2.5-350M forced TRITON_ATTN: compare-lfm25-350m-triton-lowmem-557156.out / compare-lfm25-350m-triton-lowmem.json

Changed files

tests/models/test_config.py (added, +43/-0)
vllm/model_executor/models/config.py (modified, +17/-0)


Summary

Environment

Hardware: AMD Instinct MI325X (gfx942)
ROCm/HIP: 7.2.53211
vLLM wheel: 0.20.0+rocm721
Torch: 2.10.0+git8514f05
Transformers: 5.6.2
Model: LiquidAI/LFM2-8B-A1B
dtype: bfloat16
max_model_len=2048
enforce_eager=True

I reproduced this with the ROCm wheels. I have not yet verified the release Docker image.

Reproduction

Minimal shape of the vLLM run:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model = "LiquidAI/LFM2-8B-A1B"
messages = [
    [{"role": "user", "content": "Answer with a single number only. What is 2 + 2?"}],
    [{"role": "user", "content": "What is the capital of France? Answer in a few words."}],
    [{"role": "user", "content": "Write a valid JSON object with key color and value blue."}],
]

tokenizer = AutoTokenizer.from_pretrained(model)
prompts = [
    tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
    for m in messages
]

llm = LLM(
    model=model,
    tensor_parallel_size=1,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
    max_model_len=2048,
    enforce_eager=True,
)

# Greedy decoding (temperature=0) so the outputs can be compared with the HF reference.
outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=16))
print([o.outputs[0].text for o in outputs])

The log confirms vLLM selects ROCM_ATTN:

Using ROCM_ATTN backend out of potential backends: ['ROCM_ATTN', 'ROCM_AITER_UNIFIED_ATTN', 'TRITON_ATTN'].
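When comparing runs, it can help to pull the selected backend out of captured logs automatically. The sketch below parses the log line quoted above. The helper name `parse_backend_log` is hypothetical, and the regular expression assumes the message keeps exactly this wording, which may differ between vLLM versions.

```python
import ast
import re
from typing import List, Optional, Tuple

# Pattern written against the single log line quoted in this issue.
LOG_PATTERN = re.compile(
    r"Using (?P<selected>\w+) backend out of potential backends: "
    r"(?P<candidates>\[.*?\])"
)


def parse_backend_log(line: str) -> Optional[Tuple[str, List[str]]]:
    """Return (selected_backend, candidate_backends), or None if no match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    # The candidate list is printed as a Python list literal of strings.
    candidates = ast.literal_eval(match.group("candidates"))
    return match.group("selected"), list(candidates)


line = (
    "Using ROCM_ATTN backend out of potential backends: "
    "['ROCM_ATTN', 'ROCM_AITER_UNIFIED_ATTN', 'TRITON_ATTN']."
)
selected, candidates = parse_backend_log(line)
print(selected)
print(candidates)
```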

Observed output

With vLLM 0.20.0+rocm721:

[
  "4,\u201d",
  "Paris, in the context of Manoto Di Tella Declaration for our discussion:",
  "```json\n\"Description: Short forages for life cycle knowledge, such as"
]

I also tried the AITER path, which still shows corrupted continuations:

[
  "4.",
  "Paris,, when you mean to say it, one steps, and the field",
  "```json\n\"Restrictive\" niche, in principle, a similar concern"
]

With vLLM 0.19.0+rocm721 and transformers==5.5.0 forced into the env, the model loads and generates but the outputs are still degraded:

[
  "4 c.",
  "Paris, which is"
]

Expected output / control

With Hugging Face Transformers on the same model and prompts:

[
  "4",
  "Paris",
  "```json\n{\n  \"color\": \"blue\"\n}\n```"
]
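To make the "correct start, corrupted tail" pattern concrete, the following sketch measures how many leading characters each vLLM 0.20.0+rocm721 output shares with the Hugging Face control. The strings are copied from this issue; the helper name `shared_prefix_len` is hypothetical, and the comparison is character-level only, since the issue does not include token IDs.

```python
from os.path import commonprefix

# Built at runtime so the literal fence does not appear inside this example.
FENCE = "`" * 3

# Hugging Face control outputs, as reported in the issue.
hf_outputs = [
    "4",
    "Paris",
    FENCE + 'json\n{\n  "color": "blue"\n}\n' + FENCE,
]

# vLLM 0.20.0+rocm721 outputs with ROCM_ATTN, as reported in the issue.
vllm_outputs = [
    "4,\u201d",
    "Paris, in the context of Manoto Di Tella Declaration for our discussion:",
    FENCE + 'json\n"Description: Short forages for life cycle knowledge, such as',
]


def shared_prefix_len(reference: str, candidate: str) -> int:
    """Number of leading characters the two strings have in common."""
    return len(commonprefix([reference, candidate]))


for ref, got in zip(hf_outputs, vllm_outputs):
    n = shared_prefix_len(ref, got)
    print(f"shared={n} reference_tail={ref[n:]!r} vllm_tail={got[n:]!r}")
```

For the first two prompts the whole reference is a prefix of the vLLM output, so the divergence shows up as extra trailing text rather than a wrong first answer.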

Notes

This seems model-family specific. In the same environment, ROCM_ATTN behaves normally for a Qwen model.
My current hypothesis is that the ROCm attention path mishandles the hybrid cache / LFM cache layout, causing the first token(s) to be plausible and later tokens to degrade.
The behavior is visible even with enforce_eager=True, so this is separate from CUDA graph capture issues.
Related PR: https://github.com/vllm-project/vllm/pull/41054

extent analysis

TL;DR

The issue can be worked around by forcing TRITON_ATTN instead of the default ROCm attention backend for LFM2-family models. The underlying ROCM_ATTN correctness bug remains open.

Guidance

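Confirm that the issue is specific to the LFM2 model family by running another model, such as Qwen, in the same environment with ROCM_ATTN. The reporter saw normal behavior for a Qwen model.
Switch the attention backend to TRITON_ATTN, which produced exact Hugging Face matches in the PR's validation. The reporter's AITER run still showed corrupted continuations, so ROCM_AITER_UNIFIED_ATTN may not help.
Investigate how the ROCm attention path handles the hybrid cache / LFM cache layout, which is the reporter's current hypothesis for why later tokens degrade.
Test across vLLM and Transformers versions to see whether the issue is version-specific. The reporter already saw degraded output with both vLLM 0.19.0+rocm721 and 0.20.0+rocm721.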

Example

No additional snippet is provided here. The script in the Reproduction section above is the relevant example.

Notes

The issue seems to be model-family specific and is not related to CUDA graph capture issues. The behavior is visible even with enforce_eager=True.

Recommendation

Apply the workaround from the related PR https://github.com/vllm-project/vllm/pull/41054, which avoids the default ROCm attention backend for LFM2-family models. The reporter's working hypothesis, not yet confirmed, is that the ROCm attention path mishandles the hybrid cache / LFM cache layout.
