vllm - 💡(How to fix) Fix [Bug]: Qwen3.5 (Qwen3_5ForConditionalGeneration) FLA linear attention tensor format mismatch causes gibberish output [2 comments, 2 participants]

vllm, 2026-03-31 14:47:29

vllm-project/vllm#38643 • Fetched 2026-04-08 01:58:48

Author: BANG404

Participants: BANG404, ZJY0516

When running Qwen3_5ForConditionalGeneration (Qwen3.5-4B-NVFP4) with vLLM nightly, the model produces completely garbled/gibberish output — a mix of random tokens from multiple languages.

Root Cause

The following warning is emitted during every inference call, indicating a tensor format mismatch in the FLA (Flash Linear Attention) ops layer:

vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning:
Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (32).
This may indicate the inputs were passed in head-first format [B, H, T, ...]
when head_first=False was specified. Please verify your input tensor format
matches the expected shape [B, T, H, ...].

Also from torch/_dynamo/eval_frame.py:

UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (16).
This may indicate the inputs were passed in head-first format [B, H, T, ...]
Please verify your input tensor format matches the expected shape [B, T, H, ...].

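Qwen3.5 uses a hybrid Transformer + linear attention (Mamba-style) architecture, and linear_attn.conv1d layers are present in multiple decoder layers. The FLA ops kernel appears to receive tensors in head-first [B, H, T, ...] format while head_first=False (format [B, T, H, ...]) is expected, which would make the linear attention computation incorrect and could explain the gibberish output. Note that the warning is only a shape heuristic: it fires whenever seq_len is smaller than num_heads, which is also true for a correctly laid-out prompt shorter than the head count (here 12 tokens against 32 or 16 heads), so on its own it does not prove that the layout is wrong.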
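The condition quoted in the warning text can be reproduced in a few lines. This is a minimal sketch of that condition only, written from the warning message; the function name `looks_head_first` is hypothetical and this is not vLLM's actual code in `fla/ops/utils.py`.

```python
def looks_head_first(shape, head_first=False):
    """Return True when a shape trips the 'seq_len < num_heads' heuristic.

    `shape` is interpreted as (B, T, H, ...) when head_first is False,
    matching the layout named in the warning. Hypothetical helper, based
    only on the wording of the warning message.
    """
    if head_first or len(shape) < 3:
        return False
    seq_len, num_heads = shape[1], shape[2]
    return seq_len < num_heads


# A 12-token prompt with 32 heads, correctly laid out as [B, T, H, D]:
short_prompt = (1, 12, 32, 128)
# A 64-token prompt with the same head count:
long_prompt = (1, 64, 32, 128)

print(looks_head_first(short_prompt))
print(looks_head_first(long_prompt))
```

Under this reading, a short prompt triggers the warning even when the layout is correct, so checking whether the gibberish persists with a prompt longer than the head count is one way to separate the warning from the actual fault.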

Reproduction

# docker-compose.yml
services:
  qwen3-5-4b:
    image: vllm/vllm-openai:nightly
    command: [
      "/models/Qwen3.5-4B-NVFP4",
      "--served-model-name", "Qwen3.5-4B-NVFP4",
      "--gpu-memory-utilization", "0.85",
      "--host", "0.0.0.0",
      "--port", "8000",
      "--dtype", "auto",
      "--max-model-len", "32768",
      "--trust-remote-code",
      "--attention-backend", "flashinfer"
    ]

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-4B-NVFP4",
    "messages": [{"role": "user", "content": "Hello, introduce yourself"}],
    "temperature": 0.6,
    "max_tokens": 100
  }'

Expected output: Normal coherent text response

Actual output: Gibberish mix of random multilingual tokens, e.g.:

一根ulaHTTPRequestdataIK_and胜利 Ký尔ula树皮 auxili紹ns/welcome生生 televis多喝在水中 ric尔ael注射گز...
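For scripted reproduction, the curl request above can be expressed in Python with only the standard library. This is a sketch under the same assumptions as the report (server on localhost:8000, served model name Qwen3.5-4B-NVFP4); the helper names `build_request` and `ask` are illustrative, and the response parsing assumes the OpenAI-compatible chat completions format.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port from the docker-compose example


def build_request(prompt, model="Qwen3.5-4B-NVFP4", max_tokens=100):
    """Build the same POST request as the curl command in the report."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Hello, introduce yourself")` against the running container should return the same kind of output as the curl command. The explicit timeout is there because the report mentions requests that hang under --enforce-eager.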

Environment

vLLM version: v0.17.0rc1.dev164+gfff3711a2
Model: Qwen/Qwen3.5-4B (NVFP4 quantized checkpoint, Qwen3_5ForConditionalGeneration)
GPU: NVIDIA GeForce RTX 5090 (SM 12.0 / Blackwell)
CUDA: via Docker vllm/vllm-openai:nightly
OS: Ubuntu (WSL2, kernel 6.6.87.2-microsoft-standard-WSL2)

Additional Notes

Changing --quantization, --dtype, or --kv-cache-dtype does not change the behavior.
With --enforce-eager, the request hangs indefinitely (no response) instead of returning gibberish, which also suggests the linear attention computation is broken.
The hf_quant_config.json correctly excludes model.language_model.layers.*.linear_attn.conv1d from NVFP4 quantization, so those layers should be in BF16; the issue therefore appears to lie in the FLA kernel's tensor layout, not in quantization.
Standard Transformer-only models (e.g., Qwen3-4B) work correctly; the issue is specific to Qwen3.5's hybrid linear attention architecture

extent analysis

TL;DR

The most likely fix is to ensure the input tensor format matches the expected shape [B, T, H, ...] by adjusting the head_first parameter or modifying the input tensor layout.

Guidance

Verify the input tensor format in the FLA ops layer to ensure it matches the expected shape [B, T, H, ...].
Check the head_first parameter in the FLA ops layer and adjust it if necessary to match the input tensor format.
Consider modifying the input tensor layout to match the expected format, potentially by transposing or rearranging the tensor dimensions.
Test the model with a simplified input or a different attention backend to isolate the issue and confirm the fix.

Example

The issue does not include a code-level fix. Any real fix depends on how the Qwen3.5 model code calls the FLA kernels, which the report does not show.
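Purely as an illustration of the layout conversion the guidance refers to, the sketch below swaps the head and sequence axes of a nested list, turning [B, H, T, ...] into [B, T, H, ...]. It uses plain Python lists so it runs without dependencies; on a PyTorch tensor the equivalent operation would be `x.transpose(1, 2)`. The helper names are hypothetical, and nothing here is taken from vLLM's code.

```python
def head_first_to_seq_first(x):
    """Swap axes 1 and 2 of a nested list: [B, H, T, ...] -> [B, T, H, ...]."""
    # zip(*batch) walks the T axis and collects one entry per head at each step.
    return [[list(step) for step in zip(*batch)] for batch in x]


def shape(x):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(x, list) and x:
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


# B=1, H=2, T=3: each inner list holds one head's values over time.
x = [[[1, 2, 3], [4, 5, 6]]]
y = head_first_to_seq_first(x)

print(shape(x), "->", shape(y))
print(y)
```

Such a conversion only helps if the tensors really are head-first at the call site; applying it to data that is already [B, T, H, ...] would corrupt it, so the layout should be confirmed before changing anything.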

Notes

The issue is specific to the Qwen3.5 model's hybrid linear attention architecture and does not occur with standard Transformer-only models. The fix may require modifications to the model implementation or the input data preparation pipeline.

Recommendation

Apply a workaround by adjusting the head_first parameter or modifying the input tensor layout to match the expected format, as the root cause is related to a tensor format mismatch in the FLA ops layer.
