vllm - ✅(Solved) Fix [Bug]: GLM-5.1-FP8 produces gibberish with RunAI streamer after ac3dac545 [2 pull requests, 1 comment, 1 participant]

GitHub stats
vllm-project/vllm#43163, fetched 2026-05-20 03:39:38
Comments: 1 · Participants: 1 · Timeline events: 7 · Reactions: 0
Timeline (top): mentioned ×3, subscribed ×3, commented ×1

Error Message

With unpatched v0.20.0, the stream produces abnormal text instead of a normal greeting. A sample from the minimized repro is quoted under "Observed behavior".

Linked PR notes

PR #38928: [Bugfix][Perf] Indexer upcast WK to BF16 for fusion (the issue body identifies this PR as the one that introduced the regression)

Description (problem / solution / changelog)

Purpose

Alternative fix to https://github.com/vllm-project/vllm/pull/38870/ which maintains the fusion.

Performance:

  • B200 TP8 FP8 BS1 8k/1k
  • Compares this PR (top), multi-stream (#35968), and baseline (#38870)
Upcast+Fused WK: (Decode): 11.90 ms
Upcast+Fused WK: (TTFT):   375.0 ms

Multi-Stream:    (Decode): 12.74 ms
Multi-Stream:    (TTFT):   376.5 ms

Separate:        (Decode): 12.59 ms
Separate:        (TTFT):   378.2 ms
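
For a sense of scale, the quoted latencies can be turned into relative differences. The sketch below is plain arithmetic on the numbers from the PR description; the dictionary layout and the `reduction_pct` helper are illustrative names, not part of vLLM.

```python
# Latencies quoted in the PR description (B200, TP8, FP8, BS1, 8k/1k), in ms.
LATENCY_MS = {
    "Upcast+Fused WK": {"decode": 11.90, "ttft": 375.0},
    "Multi-Stream": {"decode": 12.74, "ttft": 376.5},
    "Separate": {"decode": 12.59, "ttft": 378.2},
}


def reduction_pct(candidate: float, reference: float) -> float:
    """Percentage by which `candidate` is lower than `reference`."""
    return (reference - candidate) / reference * 100.0


fused = LATENCY_MS["Upcast+Fused WK"]
for name in ("Multi-Stream", "Separate"):
    other = LATENCY_MS[name]
    decode = reduction_pct(fused["decode"], other["decode"])
    ttft = reduction_pct(fused["ttft"], other["ttft"])
    print(f"vs {name}: decode {decode:.1f}% lower, TTFT {ttft:.1f}% lower")
```

On these figures the fused variant is roughly 5 to 7 percent faster on decode and under 1 percent faster on TTFT. This is a single configuration, so it says nothing about variance.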

Testing

My setup is broken: GSM8k is giving 0.00 for me even with #38870. I will try to fix my setup and rerun, but I have moderate confidence in this fix. It would be handy if someone else could try running this in the meantime.

I'm okay with merging #38870 if we need to fix ASAP this weekend.

Changed files

  • vllm/model_executor/models/deepseek_mtp.py (modified, +14/-11)
  • vllm/model_executor/models/deepseek_v2.py (modified, +70/-53)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
OS: Ubuntu 24.04.4 LTS (x86_64)
Python: 3.12.3
PyTorch: 2.11.0+cu130
CUDA used to build PyTorch: 13.0
NVIDIA driver: 580.126.20
GPUs: 8x NVIDIA H200
vLLM versions tested: 0.20.0 and source bisect builds
transformers: 5.5.0
runai-model-streamer: 0.15.8
runai-model-streamer-s3: 0.15.8
runai-model-streamer-gcs: 0.15.8
runai-model-streamer-azure: 0.15.8
flashinfer-python: 0.6.8.post1
triton: 3.6.0

System details from collect_env.py:

OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Libc version                 : glibc-2.39
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0]
Is CUDA available            : True
GPU models and configuration : 8x NVIDIA H200
Nvidia driver version        : 580.126.20
CPU                          : AMD EPYC 9655 96-Core Processor, 192 CPUs
vLLM Version                 : 0.20.0

Relevant pip packages:
flashinfer-python==0.6.8.post1
numpy==2.2.6
torch==2.11.0+cu130
torchaudio==2.11.0+cu130
torchvision==0.26.0+cu130
transformers==5.5.0
triton==3.6.0
runai-model-streamer==0.15.8
runai-model-streamer-s3==0.15.8
runai-model-streamer-gcs==0.15.8
runai-model-streamer-azure==0.15.8
</details>

🐛 Describe the bug

When serving zai-org/GLM-5.1-FP8 with vLLM using --load-format=runai_streamer, tool-calling chat requests can produce corrupted/gibberish output. The model loads and serves successfully, but streamed output becomes abnormal and often looks like fragments of tool schemas or unrelated text instead of a coherent answer.

This reproduces on v0.20.0 and on the Apr 16 nightly. It does not reproduce on the Apr 15 nightly in the same test setup.

The exact first bad commit from source bisection is:

ac3dac545b28ea6cf847e0044859e58f33d4f8b9
[Bugfix][Perf] Indexer upcast WK to BF16 for fusion (#38928)

PR that caused the issue: https://github.com/vllm-project/vllm/pull/38928

Commit: https://github.com/vllm-project/vllm/commit/ac3dac545b28ea6cf847e0044859e58f33d4f8b9

Immediate bisect boundary:

| Commit | Result |
| --- | --- |
| 39ac640490ee2e8f951d343ae1707dd9bdacaf70 | good |
| ac3dac545b28ea6cf847e0044859e58f33d4f8b9 | bad |
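
The boundary above is the end state of a binary search over commit history. As a minimal sketch of that search (not the reporter's actual tooling), the function below assumes an oldest-to-newest list whose first entry is known good and whose last entry is known bad. The `first_bad` name, the `is_bad` callback, and every list entry other than the two abbreviated hashes from the report are hypothetical.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in an oldest-to-newest list.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history; only the two hashes at the boundary come from the report.
history = ["older-good-1", "older-good-2", "39ac640490ee", "ac3dac545b28", "newer-bad-1"]
bad_from = history.index("ac3dac545b28")

# Stand-in oracle. In a real bisect, is_bad would build vLLM at that commit
# and check whether the repro request returns gibberish.
print(first_bad(history, lambda commit: history.index(commit) >= bad_from))
```

Each probe halves the remaining range, so the number of builds grows with the logarithm of the commit count.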

Minimal reproduction

An end-to-end repro script is available here:

https://raw.githubusercontent.com/bbartels/vllm_runai_reproduction/main/run-docker-and-repro.sh

Run on an 8x H200 host with Docker/NVIDIA runtime available:

wget https://raw.githubusercontent.com/bbartels/vllm_runai_reproduction/main/run-docker-and-repro.sh
chmod +x run-docker-and-repro.sh
./run-docker-and-repro.sh

The script starts:

vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404

with these vLLM args:

zai-org/GLM-5.1-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --chat-template-content-format=string \
  --tensor-parallel-size 8 \
  --tool-call-parser glm47 \
  --enable-auto-tool-choice \
  --reasoning-parser glm45 \
  --load-format=runai_streamer \
  --model-loader-extra-config='{"concurrency": 140, "distributed": true}' \
  --enforce-eager

The request is intentionally minimal:

{
  "model": "zai-org/GLM-5.1-FP8",
  "max_tokens": 32000,
  "messages": [
    {
      "role": "user",
      "content": "hi"
    }
  ],
  "tools": [
    "bash tool schema",
    "edit tool schema",
    "glob tool schema",
    "grep tool schema",
    "question tool schema"
  ],
  "tool_choice": "auto",
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}

The full JSON request body is embedded in the script linked above.
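
For readers who want to replay a request of this shape without the script, here is a hedged sketch using only the Python standard library. It assumes the server exposes the OpenAI-compatible `/v1/chat/completions` route and emits `data:` lines ending with `data: [DONE]`. The tool schemas are hypothetical placeholders (the real ones are in the repro script), and all function names are illustrative.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def placeholder_tool(name: str) -> dict:
    """Hypothetical stand-in schema; the real schemas live in the repro script."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": f"placeholder for the {name} tool",
            "parameters": {"type": "object", "properties": {}},
        },
    }


def build_payload(tools: list[dict]) -> dict:
    """Request body mirroring the fields shown in the report."""
    return {
        "model": "zai-org/GLM-5.1-FP8",
        "max_tokens": 32000,
        "messages": [{"role": "user", "content": "hi"}],
        "tools": tools,
        "tool_choice": "auto",
        "stream": True,
        "stream_options": {"include_usage": True},
    }


def iter_deltas(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the `delta` objects from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        # The final usage chunk has an empty `choices` list and yields nothing.
        for choice in chunk.get("choices", []):
            yield choice.get("delta", {})


def stream_chat(base_url: str, payload: dict) -> Iterator[dict]:
    """POST the payload and stream deltas back. Needs a live server."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        yield from iter_deltas(raw.decode("utf-8") for raw in response)
```

A replay would then look like `for delta in stream_chat("http://localhost:8000", build_payload(tools)): print(delta)`, where the host and port are assumptions taken from the `--host` and `--port` arguments above.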

Observed behavior

With unpatched v0.20.0, the stream produces abnormal text instead of a normal greeting. Example from the minimized repro:

it is considered BEST PRACTICE to to use hi read like recommend that option to use first read the file to establish the in the using context of the tool instructions are read the a file on the workspace. You is the tool mentioned in the instructions, If Read tool instructions say hi there is a test file named "get_pipeline"...

Other direct replays produced fragments such as:

reasoning: 子
reasoning: --------------------------------
reasoning: 救
reasoning: Probability
reasoning: 連
reasoning: editable

Expected behavior

The model should produce a coherent response to hi, for example:

The user just said "hi". I'll respond with a friendly greeting.
Hi there! How can I help you today?

This is what happens on the last good commit in the bisection.

Prompt/request reduction

The issue does not require OpenCode at runtime. OpenCode was only used to capture the original request; direct replay against vLLM reproduces the issue.

Reduction results:

| Payload | Result |
| --- | --- |
| full original system prompt, all tools | bad |
| shortened system prompt, all tools | bad |
| full system prompt, no tools | good |
| no system prompt, all tools | bad |
| bash only | good |
| bash, edit, glob | good |
| grep, question, read | good |
| bash, edit, glob, grep | good |
| bash, edit, glob, grep, question | bad |

So the smallest currently known failing payload is:

user message: "hi"
tools: bash, edit, glob, grep, question
tool_choice: auto
stream: true
max_tokens: 32000
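
A reduction like this can be automated with a greedy one-at-a-time pass: drop a tool, replay, and keep the smaller payload whenever the failure survives. The sketch below illustrates that idea and is not the reporter's procedure. The `shrink` helper, the starting tool list, and the stand-in oracle are all assumptions; the oracle simply encodes the table above (bad whenever bash, edit, glob, grep, and question are all present) rather than calling a server.

```python
from typing import Callable, Sequence


def shrink(items: Sequence[str], fails: Callable[[list[str]], bool]) -> list[str]:
    """Greedy reduction: remove one item at a time while the failure persists."""
    current = list(items)
    if not fails(current):
        raise ValueError("the starting payload must reproduce the failure")
    changed = True
    while changed:
        changed = False
        for item in list(current):
            trial = [x for x in current if x != item]
            if fails(trial):
                current = trial  # item was not needed to trigger the bug
                changed = True
    return current


# Stand-in oracle (assumption): consistent with the reduction table, not a real replay.
FAILING_SET = {"bash", "edit", "glob", "grep", "question"}


def fake_oracle(tools: list[str]) -> bool:
    return FAILING_SET.issubset(tools)


# Hypothetical starting list; "read" is included because the table mentions it.
print(shrink(["bash", "edit", "glob", "grep", "question", "read"], fake_oracle))
```

The result is minimal only in the sense that removing any single remaining tool makes the failure disappear. A real oracle would also need to tolerate nondeterministic output, for example by replaying each candidate several times.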
