vllm - 💡(How to fix) Fix [Bug]: GLM-5.1 tool call parsing fails intermittently when used as backend for Claude Code

vllm2026-05-12 09:11:15

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Error Message

Use Claude Code for an extended session — once the conversation context approaches ~200k tokens, ask Claude Code to enter plan mode and create a debug plan. For example: "Switch to plan mode and create a plan to debug the authentication issue." The error "The model's tool call could not be parsed (retry also failed)" appears.

Root Cause

Known streaming tool call parser bugs (amplified at long context):

#39757: Streaming mode truncates tool names (e.g., get_weather → get).
#36857: Streaming mode returns complete arguments in the final chunk instead of incrementally appending, causing JSON parse failure.

These may be rare at short context but become more frequent near the context limit, where model output quality degrades and streaming chunk boundaries shift.

Code Example

vLLM version: v0.20.1 (vllm-openai:v0.20.1)
Model: GLM-5.1-FP8 (max_model_len: 202752)
Tool call parser: glm47
Reasoning parser: glm45
Speculative decoding: MTP (num_speculative_tokens=3)

---

vllm serve /xxxxx/GLM-5.1-FP8 \
      --trust-remote-code \
      --chat-template-content-format=string \
      --tensor-parallel-size 8 \
      --tool-call-parser glm47 \
      --enable-auto-tool-choice \
      --reasoning-parser glm45 \
      --speculative-config.method mtp \
      --speculative-config.num_speculative_tokens 3 \
      --port 8000

RAW_BUFFERClick to expand / collapse

Your current environment

vLLM version: v0.20.1 (vllm-openai:v0.20.1)
Model: GLM-5.1-FP8 (max_model_len: 202752)
Tool call parser: glm47
Reasoning parser: glm45
Speculative decoding: MTP (num_speculative_tokens=3)

🐛 Describe the bug

When deploying GLM-5.1 with vLLM and connecting it as a custom model provider in Claude Code (Anthropic's CLI agent), tool calls intermittently fail to parse. Claude Code reports:

The model's tool call could not be parsed (retry also failed).

Key observations:

The issue only manifests when the context window is nearly full (~200k tokens, approaching the model's 202752 max_model_len limit). With short-to-moderate contexts, tool calls parse correctly in both streaming and non-streaming modes.
The failure typically occurs when Claude Code enters planning mode — the model must produce a structured Plan tool call with detailed content at long context. This combination of near-max context + complex tool call output appears to be the trigger.

Reproduction

Deploy GLM-5.1 with vLLM:

 vllm serve /xxxxx/GLM-5.1-FP8 \
   --trust-remote-code \
   --chat-template-content-format=string \
   --tensor-parallel-size 8 \
   --tool-call-parser glm47 \
   --enable-auto-tool-choice \
   --reasoning-parser glm45 \
   --speculative-config.method mtp \
   --speculative-config.num_speculative_tokens 3 \
   --port 8000

Configure Claude Code to use this vLLM endpoint via api_base. Claude Code communicates via the Anthropic Messages API (/v1/messages) in streaming mode.
Use Claude Code for an extended session — once the conversation context approaches ~200k tokens, ask Claude Code to enter plan mode and create a debug plan. For example: "Switch to plan mode and create a plan to debug the authentication issue." The error "The model's tool call could not be parsed (retry also failed)" appears.

Root cause analysis

Known streaming tool call parser bugs (amplified at long context):

#39757: Streaming mode truncates tool names (e.g., get_weather → get).
#36857: Streaming mode returns complete arguments in the final chunk instead of incrementally appending, causing JSON parse failure.

These may be rare at short context but become more frequent near the context limit, where model output quality degrades and streaming chunk boundaries shift.

Expected behavior

Tool calls should be reliably parsed even at long contexts approaching max_model_len.

Additional context

Fine-grained tool streaming is off by default when connecting through a custom provider — the issue persists regardless.
Related issues: #39757, #36857

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

Tool calls should be reliably parsed even at long contexts approaching max_model_len.

#api #tool integration #LLM response #prompt template #authentication issue

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: GLM-5.1 tool call parsing fails intermittently when used as backend for Claude Code

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Code Example

Your current environment

🐛 Describe the bug

Reproduction

Root cause analysis

Expected behavior

Additional context

FAQ

Expected behavior

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: GLM-5.1 tool call parsing fails intermittently when used as backend for Claude Code

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Code Example

Your current environment

🐛 Describe the bug

Reproduction

Root cause analysis

Expected behavior

Additional context

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING