vllm - 💡(How to fix) Fix [Bug]: Qwen3.5 (Qwen3_5ForConditionalGeneration) FLA linear attention tensor format mismatch causes gibberish output [2 comments, 2 participants]

GitHub stats
vllm-project/vllm#38643 · Fetched 2026-04-08 01:58:48
Comments: 2 · Participants: 2 · Timeline events: 7 · Reactions: 0
Timeline (top): subscribed ×5, commented ×2


Description

When running Qwen3_5ForConditionalGeneration (Qwen3.5-4B-NVFP4) with vLLM nightly, the model produces completely garbled/gibberish output — a mix of random tokens from multiple languages.

Root Cause

The following warning is emitted during every inference call, indicating a tensor format mismatch in the FLA (Flash Linear Attention) ops layer:

vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning:
Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (32).
This may indicate the inputs were passed in head-first format [B, H, T, ...]
when head_first=False was specified. Please verify your input tensor format
matches the expected shape [B, T, H, ...].

Also from torch/_dynamo/eval_frame.py:

UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (16).
This may indicate the inputs were passed in head-first format [B, H, T, ...]
Please verify your input tensor format matches the expected shape [B, T, H, ...].

Qwen3.5 uses a hybrid Transformer + linear attention (Mamba-style) architecture. The linear_attn.conv1d layers are present in multiple decoder layers. The FLA ops kernel appears to receive tensors in [B, H, T, ...] (head-first) format when head_first=False is expected, causing completely incorrect computation and gibberish output.
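The warning text itself says the check compares seq_len with num_heads, so it may help to see what such a check can and cannot tell you. The sketch below is a reconstruction from the warning message only; the function name `looks_head_first` is hypothetical, and the actual code at utils.py:113 may differ.

```python
def looks_head_first(shape):
    """Heuristic reconstructed from the warning text (an assumption, not vLLM code).

    Treats dim 1 as seq_len and dim 2 as num_heads, as head_first=False implies,
    and flags the tensor when seq_len < num_heads.
    """
    seq_len, num_heads = shape[1], shape[2]
    return seq_len < num_heads


# Correct [B, T, H, D] layout, 12-token prompt, 32 heads: still flagged.
correct_short = (1, 12, 32, 128)
# Correct [B, T, H, D] layout, 64-token prompt: not flagged.
correct_long = (1, 64, 32, 128)
# Wrong [B, H, T, D] layout, 64-token prompt: flagged.
swapped_long = (1, 32, 64, 128)
# Wrong [B, H, T, D] layout, 12-token prompt: NOT flagged.
swapped_short = (1, 32, 12, 128)

for name, shape in [
    ("correct_short", correct_short),
    ("correct_long", correct_long),
    ("swapped_long", swapped_long),
    ("swapped_short", swapped_short),
]:
    print(name, shape, "flagged" if looks_head_first(shape) else "ok")
```

Under these assumptions, a 12-token prompt triggers the warning whether or not the layout is wrong, so the warning alone does not establish the root cause. Repeating the request with a prompt longer than the head count would show whether the warning persists.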

Reproduction

# docker-compose.yml
services:
  qwen3-5-4b:
    image: vllm/vllm-openai:nightly
    command: [
      "/models/Qwen3.5-4B-NVFP4",
      "--served-model-name", "Qwen3.5-4B-NVFP4",
      "--gpu-memory-utilization", "0.85",
      "--host", "0.0.0.0",
      "--port", "8000",
      "--dtype", "auto",
      "--max-model-len", "32768",
      "--trust-remote-code",
      "--attention-backend", "flashinfer"
    ]

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-4B-NVFP4",
    "messages": [{"role": "user", "content": "Hello, introduce yourself"}],
    "temperature": 0.6,
    "max_tokens": 100
  }'
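
For scripted reproduction, the same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl payload above; the helper names `build_chat_request` and `send` are hypothetical, and the timeout is added only because the report mentions requests that hang.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, temperature=0.6, max_tokens=100):
    """Build a POST request with the same JSON body as the curl command."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the first choice's text.

    The timeout makes a hung server raise an error instead of blocking forever.
    """
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server from the compose file running, `send(build_chat_request("http://localhost:8000", "Qwen3.5-4B-NVFP4", "Hello, introduce yourself"))` should return the generated text.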

Expected output: Normal coherent text response

Actual output: Gibberish mix of random multilingual tokens, e.g.:

一根ulaHTTPRequestdataIK_and胜利 Ký尔ula树皮 auxili紹ns/welcome生生 televis多喝在水中 ric尔ael注射گز...
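
When testing candidate fixes, a rough automated check for this kind of output can save manual inspection. The sketch below counts how many coarse Unicode script families appear in a response; the helpers `scripts_used` and `looks_garbled` and the threshold of two scripts are assumptions for illustration, and a legitimately multilingual answer would also be flagged.

```python
import unicodedata


def scripts_used(text):
    """Collect a coarse script label (first word of the Unicode name)
    for every alphabetic character in the text."""
    labels = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                labels.add(name.split()[0])
    return labels


def looks_garbled(text, max_scripts=2):
    """Flag text that mixes more than max_scripts script families."""
    return len(scripts_used(text)) > max_scripts


# Sample taken from the actual output reported above.
sample = "一根ulaHTTPRequestdataIK_and胜利 Ký尔ula树皮 auxili紹ns/welcome生生 televis多喝在水中 ric尔ael注射گز"
print(sorted(scripts_used(sample)))
print(looks_garbled(sample))
```

This is only a smoke test: it cannot tell coherent text from fluent-looking nonsense in a single script.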

Environment

  • vLLM version: v0.17.0rc1.dev164+gfff3711a2
  • Model: Qwen/Qwen3.5-4B (NVFP4 quantized checkpoint, Qwen3_5ForConditionalGeneration)
  • GPU: NVIDIA GeForce RTX 5090 (SM 12.0 / Blackwell)
  • CUDA: via Docker vllm/vllm-openai:nightly
  • OS: Ubuntu (WSL2, kernel 6.6.87.2-microsoft-standard-WSL2)

Additional Notes

  • The gibberish output occurs regardless of the --quantization, --dtype, and --kv-cache-dtype settings
  • With --enforce-eager, the request instead hangs indefinitely (no response), which also suggests the linear attention computation is broken
  • The hf_quant_config.json correctly excludes model.language_model.layers.*.linear_attn.conv1d from NVFP4 quantization, so those layers should be in BF16. This points to the FLA kernel's tensor layout rather than quantization
  • Standard Transformer-only models (e.g., Qwen3-4B) work correctly; the issue is specific to Qwen3.5's hybrid linear attention architecture
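
The exclusion claim in the notes above can be checked mechanically. The sketch below matches module names against the wildcard pattern quoted in the report. The config layout (a `quantization` object with an `exclude_modules` list) and the use of glob-style matching are assumptions about how the checkpoint's hf_quant_config.json is structured, and `is_excluded` is a hypothetical helper, not a vLLM function.

```python
import fnmatch


def is_excluded(module_name, exclude_patterns):
    """Return True if the module name matches any glob-style exclusion pattern."""
    return any(fnmatch.fnmatchcase(module_name, p) for p in exclude_patterns)


# Inline stand-in for hf_quant_config.json; key names are assumed.
quant_config = {
    "quantization": {
        "quant_algo": "NVFP4",
        "exclude_modules": ["model.language_model.layers.*.linear_attn.conv1d"],
    }
}

patterns = quant_config["quantization"]["exclude_modules"]
for name in [
    "model.language_model.layers.0.linear_attn.conv1d",
    "model.language_model.layers.0.mlp.down_proj",
]:
    print(name, "excluded" if is_excluded(name, patterns) else "quantized")
```

To run it against a real checkpoint, replace the inline dictionary with `json.load` on the file and confirm the key names first.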

extent analysis

TL;DR

The most likely fix is to ensure the input tensor format matches the expected shape [B, T, H, ...] by adjusting the head_first parameter or modifying the input tensor layout.

Guidance

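  • Verify the layout of the tensors passed into the FLA ops layer: with head_first=False, the kernels expect [B, T, H, ...].
  • Treat the warning as a hint, not proof. It is a heuristic that compares seq_len with num_heads, so a 12-token prompt against 32 (or 16) heads can trigger it even when the layout is correct. Repeat the request with a prompt longer than the head count and check whether the warning persists.
  • If the layout really is head-first, either pass the matching head_first value or transpose the sequence and head dimensions before the kernel call. Both would likely be source-level changes in vLLM rather than server flags.
  • Test the model with a simplified input or a different attention backend to isolate the issue and confirm the fix.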

Example

The issue does not include a fix or a minimal code reproduction of the layout mismatch. Confirming it requires inspecting the tensors that the Qwen3.5 model code passes into the FLA kernels.

Notes

The issue is specific to the Qwen3.5 model's hybrid linear attention architecture and does not occur with standard Transformer-only models. The fix may require modifications to the model implementation or the input data preparation pipeline.

Recommendation

Apply a workaround by adjusting the head_first parameter or modifying the input tensor layout to match the expected format, as the root cause is related to a tensor format mismatch in the FLA ops layer.
