vllm - ✅(Solved) Fix [Bug]: GLM-5.1-FP8 produces gibberish with RunAI streamer after ac3dac545 [2 pull requests, 1 comments, 1 participants]

Q: Expected behavior

The model should produce a coherent response to `hi`, for example:

```text
The user just said "hi". I'll respond with a friendly greeting.
Hi there! How can I help you today?
```

This is what happens on the last good commit in the bisection.

vllm · 2026-05-19 23:55:11


GitHub stats

vllm-project/vllm#43163 • Fetched 2026-05-20 03:39:38

Author: bbartels

Participants: bbartels

Timeline (top): mentioned ×3, subscribed ×3, commented ×1

Error Message

With unpatched v0.20.0, the stream produces abnormal text instead of a normal greeting. Examples from the minimized repro are listed under "Observed behavior".


PR fix notes

PR #38928: [Bugfix][Perf] Indexer upcast WK to BF16 for fusion (the PR that the bisection in the issue body identifies as introducing this regression)

Repository: vllm-project/vllm
Author: benchislett
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/38928

Description (problem / solution / changelog)

Purpose

Alternative fix to https://github.com/vllm-project/vllm/pull/38870/ which maintains the fusion.

Performance:

B200 TP8 FP8 BS1 8k/1k
Compares this PR (top), multi-stream (#35968), and baseline (#38870)

Upcast+Fused WK: (Decode): 11.90 ms
Upcast+Fused WK: (TTFT):   375.0 ms

Multi-Stream:    (Decode): 12.74 ms
Multi-Stream:    (TTFT):   376.5 ms

Separate:        (Decode): 12.59 ms
Separate:        (TTFT):   378.2 ms
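
To put the numbers above in relative terms, here is a minimal Python sketch that computes how much lower the "Upcast+Fused WK" latencies are than the other two configurations. The values are copied from the PR description; the function names are illustrative and not part of vLLM.

```python
# Decode latency and TTFT in milliseconds, copied from the PR description.
RESULTS_MS = {
    "Upcast+Fused WK": {"decode": 11.90, "ttft": 375.0},
    "Multi-Stream": {"decode": 12.74, "ttft": 376.5},
    "Separate": {"decode": 12.59, "ttft": 378.2},
}


def relative_reduction(baseline_ms: float, candidate_ms: float) -> float:
    """Return how much lower candidate is than baseline, as a percentage."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


def compare(candidate: str = "Upcast+Fused WK") -> dict:
    """Compare one configuration against every other one, per metric."""
    out = {}
    for name, metrics in RESULTS_MS.items():
        if name == candidate:
            continue
        out[name] = {
            metric: round(relative_reduction(value, RESULTS_MS[candidate][metric]), 2)
            for metric, value in metrics.items()
        }
    return out


if __name__ == "__main__":
    for name, deltas in compare().items():
        print(f"vs {name}: decode {deltas['decode']}% lower, TTFT {deltas['ttft']}% lower")
```

These are single-run figures from one B200 setup, so small differences (especially in TTFT) may be within noise.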

Testing

My setup is broken: GSM8k is giving 0.00 for me even with #38870. I will try to fix my setup and rerun, but I have moderate confidence in this fix. It would be handy if someone else could try running this in the meantime.

I'm okay with merging #38870 if we need to fix ASAP this weekend.

Changed files

vllm/model_executor/models/deepseek_mtp.py (modified, +14/-11)
vllm/model_executor/models/deepseek_v2.py (modified, +70/-53)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

OS: Ubuntu 24.04.4 LTS (x86_64)
Python: 3.12.3
PyTorch: 2.11.0+cu130
CUDA used to build PyTorch: 13.0
NVIDIA driver: 580.126.20
GPUs: 8x NVIDIA H200
vLLM versions tested: 0.20.0 and source bisect builds
transformers: 5.5.0
runai-model-streamer: 0.15.8
runai-model-streamer-s3: 0.15.8
runai-model-streamer-gcs: 0.15.8
runai-model-streamer-azure: 0.15.8
flashinfer-python: 0.6.8.post1
triton: 3.6.0

System details from collect_env.py:

OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Libc version                 : glibc-2.39
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0]
Is CUDA available            : True
GPU models and configuration : 8x NVIDIA H200
Nvidia driver version        : 580.126.20
CPU                          : AMD EPYC 9655 96-Core Processor, 192 CPUs
vLLM Version                 : 0.20.0

Relevant pip packages:
flashinfer-python==0.6.8.post1
numpy==2.2.6
torch==2.11.0+cu130
torchaudio==2.11.0+cu130
torchvision==0.26.0+cu130
transformers==5.5.0
triton==3.6.0
runai-model-streamer==0.15.8
runai-model-streamer-s3==0.15.8
runai-model-streamer-gcs==0.15.8
runai-model-streamer-azure==0.15.8

</details>

🐛 Describe the bug

When serving zai-org/GLM-5.1-FP8 with vLLM using --load-format=runai_streamer, tool-calling chat requests can produce corrupted/gibberish output. The model loads and serves successfully, but streamed output becomes abnormal and often looks like fragments of tool schemas or unrelated text instead of a coherent answer.

This reproduces on v0.20.0 and on the Apr 16 nightly. It does not reproduce on the Apr 15 nightly in the same test setup.

The exact first bad commit from source bisection is:

ac3dac545b28ea6cf847e0044859e58f33d4f8b9
[Bugfix][Perf] Indexer upcast WK to BF16 for fusion (#38928)

PR that caused the issue: https://github.com/vllm-project/vllm/pull/38928

Commit: https://github.com/vllm-project/vllm/commit/ac3dac545b28ea6cf847e0044859e58f33d4f8b9

Immediate bisect boundary:

| Commit | Result |
| --- | --- |
| `39ac640490ee2e8f951d343ae1707dd9bdacaf70` | good |
| `ac3dac545b28ea6cf847e0044859e58f33d4f8b9` | bad |
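
For readers unfamiliar with how a bisection arrives at such a boundary, the following is a minimal sketch of the search itself. It assumes an ordered commit list whose first entry is known good and whose last entry is known bad, and a caller-supplied `is_bad` check (hypothetical here; in the report this step was a build plus a replay of the request). It is not the reporter's actual tooling.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search an ordered commit list for the first bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and every commit after
    the first bad one is also bad (the usual bisect assumption).
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


if __name__ == "__main__":
    # Only the two middle hashes come from the report; the rest are placeholders.
    history = ["good-0", "good-1", "39ac640490ee", "ac3dac545b28", "bad-1", "bad-2"]
    bad_from = history.index("ac3dac545b28")
    print(first_bad_commit(history, lambda c: history.index(c) >= bad_from))
```

The search ends when the good and bad markers are adjacent, which is exactly the two-row boundary shown in the table.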

Minimal reproduction

An end-to-end repro script is available here:

https://raw.githubusercontent.com/bbartels/vllm_runai_reproduction/main/run-docker-and-repro.sh

Run on an 8x H200 host with Docker/NVIDIA runtime available:

wget https://raw.githubusercontent.com/bbartels/vllm_runai_reproduction/main/run-docker-and-repro.sh
chmod +x run-docker-and-repro.sh
./run-docker-and-repro.sh

The script starts:

vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404

with these vLLM args:

zai-org/GLM-5.1-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --chat-template-content-format=string \
  --tensor-parallel-size 8 \
  --tool-call-parser glm47 \
  --enable-auto-tool-choice \
  --reasoning-parser glm45 \
  --load-format=runai_streamer \
  --model-loader-extra-config='{"concurrency": 140, "distributed": true}' \
  --enforce-eager
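
When re-running this by hand, it can help to keep the arguments in a list so the JSON passed to `--model-loader-extra-config` is always well-formed. The sketch below only assembles the arguments listed above. The helper that drops the streamer flags is a suggested A/B comparison against the default loader and is not something the report itself ran.

```python
import json
import shlex

# Loader options from the report, kept as a dict so the JSON stays valid.
LOADER_EXTRA_CONFIG = {"concurrency": 140, "distributed": True}


def build_server_args(model: str = "zai-org/GLM-5.1-FP8") -> list:
    """Return the server arguments from the report as an argv-style list."""
    return [
        model,
        "--host", "0.0.0.0",
        "--port", "8000",
        "--trust-remote-code",
        "--chat-template-content-format=string",
        "--tensor-parallel-size", "8",
        "--tool-call-parser", "glm47",
        "--enable-auto-tool-choice",
        "--reasoning-parser", "glm45",
        "--load-format=runai_streamer",
        "--model-loader-extra-config=" + json.dumps(LOADER_EXTRA_CONFIG),
        "--enforce-eager",
    ]


def without_runai_streamer(args: list) -> list:
    """Drop the streamer-specific flags (hypothetical A/B variant)."""
    skip = ("--load-format", "--model-loader-extra-config")
    return [a for a in args if not a.startswith(skip)]


if __name__ == "__main__":
    args = build_server_args()
    print(shlex.join(args))                          # shell-quoted, as reported
    print(shlex.join(without_runai_streamer(args)))  # comparison run
```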

The request is intentionally minimal:

{
  "model": "zai-org/GLM-5.1-FP8",
  "max_tokens": 32000,
  "messages": [
    {
      "role": "user",
      "content": "hi"
    }
  ],
  "tools": [
    "bash tool schema",
    "edit tool schema",
    "glob tool schema",
    "grep tool schema",
    "question tool schema"
  ],
  "tool_choice": "auto",
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}

The full JSON request body is embedded in the script linked above.
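
As a rough illustration of replaying such a request directly, the sketch below builds the payload and streams it to the OpenAI-compatible `/v1/chat/completions` endpoint using only the standard library. The tool schemas here are hypothetical placeholders: the real ones are in the linked script, and the placeholders may well not trigger the bug. The `localhost` base URL is also an assumption.

```python
import json
import urllib.request

TOOL_NAMES = ["bash", "edit", "glob", "grep", "question"]


def placeholder_tool(name: str) -> dict:
    """HYPOTHETICAL stand-in schema; the real schemas are in the repro script."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": f"Placeholder for the {name} tool.",
            "parameters": {"type": "object", "properties": {}},
        },
    }


def build_payload(tool_names=TOOL_NAMES) -> dict:
    """Build the minimal request body described in the report."""
    return {
        "model": "zai-org/GLM-5.1-FP8",
        "max_tokens": 32000,
        "messages": [{"role": "user", "content": "hi"}],
        "tools": [placeholder_tool(n) for n in tool_names],
        "tool_choice": "auto",
        "stream": True,
        "stream_options": {"include_usage": True},
    }


def stream_chat(base_url: str = "http://localhost:8000"):
    """Yield non-empty server-sent-event lines from the chat endpoint."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            line = raw.decode("utf-8").strip()
            if line:
                yield line
```

Calling `for line in stream_chat(): print(line)` against a running server prints the raw stream. For a faithful reproduction, substitute the JSON body from the script for `build_payload()`.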

Observed behavior

With unpatched v0.20.0, the stream produces abnormal text instead of a normal greeting. Example from the minimized repro:

it is considered BEST PRACTICE to to use hi read like recommend that option to use first read the file to establish the in the using context of the tool instructions are read the a file on the workspace. You is the tool mentioned in the instructions, If Read tool instructions say hi there is a test file named "get_pipeline"...

Other direct replays produced fragments such as:

reasoning: 子
reasoning: --------------------------------
reasoning: 救
reasoning: Probability
reasoning: 連
reasoning: editable
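
To inspect output like this without a client library, the streamed chunks can be split into reasoning text and answer text. The following is a sketch assuming OpenAI-style `data:` lines. The name of the reasoning field in each delta is an assumption (both `reasoning_content` and `reasoning` are checked), so adjust it to whatever the server actually emits.

```python
import json


def split_stream(sse_lines):
    """Accumulate reasoning and answer text from OpenAI-style SSE lines."""
    reasoning, content = [], []
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # ignore comments and blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        # The final usage chunk has an empty "choices" list, so this loop skips it.
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            # ASSUMPTION: the reasoning text arrives under one of these keys.
            piece = delta.get("reasoning_content") or delta.get("reasoning")
            if piece:
                reasoning.append(piece)
            if delta.get("content"):
                content.append(delta["content"])
    return "".join(reasoning), "".join(content)
```

Comparing the joined reasoning text between a good and a bad build makes the corruption easier to see than reading individual deltas.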

Expected behavior

The model should produce a coherent response to `hi`, for example:

The user just said "hi". I'll respond with a friendly greeting.
Hi there! How can I help you today?

This is what happens on the last good commit in the bisection.

Prompt/request reduction

The issue does not require OpenCode at runtime. OpenCode was only used to capture the original request; direct replay against vLLM reproduces the issue.

Reduction results:

| Payload | Result |
| --- | --- |
| full original system prompt, all tools | bad |
| shortened system prompt, all tools | bad |
| full system prompt, no tools | good |
| no system prompt, all tools | bad |
| `bash` only | good |
| `bash`, `edit`, `glob` | good |
| `grep`, `question`, `read` | good |
| `bash`, `edit`, `glob`, `grep` | good |
| `bash`, `edit`, `glob`, `grep`, `question` | bad |

So the smallest currently known failing payload is:

user message: "hi"
tools: bash, edit, glob, grep, question
tool_choice: auto
stream: true
max_tokens: 32000
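
A reduction like this can be automated with a greedy one-at-a-time removal loop, sketched below. The `is_bad` callback is hypothetical: in practice it would replay the request with the given tools and judge whether the output is corrupted. The stand-in oracle in the example merely treats "all five tools present" as bad, which matches the rows in the table but is not verified for other subsets.

```python
from typing import Callable, List, Sequence


def minimize(items: Sequence[str], is_bad: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop items while the remaining list still fails.

    The result is 1-minimal: removing any single remaining item makes the
    failure disappear. It is not guaranteed to be the globally smallest set.
    """
    current = list(items)
    if not is_bad(current):
        raise ValueError("the starting set must reproduce the failure")
    changed = True
    while changed:
        changed = False
        for item in list(current):  # iterate over a snapshot
            candidate = [x for x in current if x != item]
            if is_bad(candidate):
                current = candidate
                changed = True
    return current


if __name__ == "__main__":
    # STAND-IN oracle, not a claim about the real trigger condition.
    required = {"bash", "edit", "glob", "grep", "question"}
    oracle = lambda tools: required <= set(tools)
    print(minimize(["bash", "edit", "glob", "grep", "question", "read"], oracle))
```

Each oracle call corresponds to one full request replay, so the number of calls grows roughly quadratically with the number of tools in the worst case.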
