vllm - [Bug]: GLM-5.1 tool call parsing fails intermittently when used as backend for Claude Code

Your current environment

vLLM version: v0.20.1 (vllm-openai:v0.20.1)
Model: GLM-5.1-FP8 (max_model_len: 202752)
Tool call parser: glm47
Reasoning parser: glm45
Speculative decoding: MTP (num_speculative_tokens=3)

🐛 Describe the bug

When deploying GLM-5.1 with vLLM and connecting it as a custom model provider in Claude Code (Anthropic's CLI agent), tool calls intermittently fail to parse. Claude Code reports:

The model's tool call could not be parsed (retry also failed).

Key observations:

  1. The issue only manifests when the context window is nearly full (~200k tokens, approaching the model's 202752 max_model_len limit). With short-to-moderate contexts, tool calls parse correctly in both streaming and non-streaming modes.
  2. The failure typically occurs when Claude Code enters plan mode: the model must produce a structured Plan tool call with detailed content at long context. The combination of near-max context and complex tool call output appears to be the trigger.
<img width="1606" height="472" alt="Image" src="https://github.com/user-attachments/assets/af106d70-b473-4996-86fb-760c0fc1b0bd" />
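
To make the "nearly full" condition in the first observation concrete, here is a small arithmetic sketch. The `max_model_len` value comes from the environment reported above; the helper names and the 95% threshold are hypothetical choices for illustration, not part of vLLM or Claude Code.

```python
MAX_MODEL_LEN = 202752  # max_model_len reported in this issue


def context_headroom(prompt_tokens: int, max_model_len: int = MAX_MODEL_LEN) -> int:
    """Tokens left for the model's response once the prompt is counted."""
    return max_model_len - prompt_tokens


def is_near_limit(prompt_tokens: int,
                  threshold: float = 0.95,
                  max_model_len: int = MAX_MODEL_LEN) -> bool:
    """True when the prompt already uses at least `threshold` of the window.

    The 0.95 default is an arbitrary illustrative cutoff.
    """
    return prompt_tokens >= threshold * max_model_len


if __name__ == "__main__":
    # A 200k-token conversation leaves 202752 - 200000 = 2752 tokens.
    print(context_headroom(200_000))
    print(is_near_limit(200_000))
```

At roughly 200k prompt tokens, only about 2,752 tokens remain for the response, including the detailed Plan tool call. Whether that headroom contributes to the failure is not established in this report; logging it alongside failures may help narrow things down.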

Reproduction

  1. Deploy GLM-5.1 with vLLM:

     vllm serve /xxxxx/GLM-5.1-FP8 \
       --trust-remote-code \
       --chat-template-content-format=string \
       --tensor-parallel-size 8 \
       --tool-call-parser glm47 \
       --enable-auto-tool-choice \
       --reasoning-parser glm45 \
       --speculative-config.method mtp \
       --speculative-config.num_speculative_tokens 3 \
       --port 8000
  2. Configure Claude Code to use this vLLM endpoint via api_base. Claude Code communicates via the Anthropic Messages API (/v1/messages) in streaming mode.

  3. Use Claude Code for an extended session — once the conversation context approaches ~200k tokens, ask Claude Code to enter plan mode and create a debug plan. For example: "Switch to plan mode and create a plan to debug the authentication issue." The error "The model's tool call could not be parsed (retry also failed)" appears.
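
When reproducing, it can help to capture the raw streamed tool call so the failing payload can be attached to the issue. The sketch below reassembles `tool_use` blocks from already-decoded Anthropic Messages streaming events (`content_block_start`, `content_block_delta` with `input_json_delta`, `content_block_stop`) and reports whether the accumulated JSON parses. The function name `reassemble_tool_calls` and the result layout are hypothetical; how the events are captured from the vLLM endpoint is left out.

```python
import json


def reassemble_tool_calls(events):
    """Rebuild tool_use blocks from Anthropic-style streaming event dicts.

    Returns one dict per finished tool_use block with the tool name, the raw
    concatenated JSON text, the parsed input (or None), and the parse error
    message (or None).
    """
    open_blocks = {}  # content block index -> {"name": str, "parts": [str]}
    results = []
    for event in events:
        etype = event.get("type")
        index = event.get("index")
        if etype == "content_block_start":
            block = event.get("content_block", {})
            if block.get("type") == "tool_use":
                open_blocks[index] = {"name": block.get("name", ""), "parts": []}
        elif etype == "content_block_delta" and index in open_blocks:
            delta = event.get("delta", {})
            if delta.get("type") == "input_json_delta":
                open_blocks[index]["parts"].append(delta.get("partial_json", ""))
        elif etype == "content_block_stop" and index in open_blocks:
            block = open_blocks.pop(index)
            raw = "".join(block["parts"])
            try:
                # An empty stream of deltas is treated as an empty input object.
                parsed, error = json.loads(raw or "{}"), None
            except json.JSONDecodeError as exc:
                parsed, error = None, str(exc)
            results.append({"name": block["name"], "raw_json": raw,
                            "parsed": parsed, "error": error})
    return results


if __name__ == "__main__":
    # Simulated events; the tool name and input are made up for the demo.
    demo = [
        {"type": "content_block_start", "index": 0,
         "content_block": {"type": "tool_use", "id": "t1", "name": "Plan", "input": {}}},
        {"type": "content_block_delta", "index": 0,
         "delta": {"type": "input_json_delta", "partial_json": '{"plan": "step 1"'}},
        {"type": "content_block_stop", "index": 0},
    ]
    for call in reassemble_tool_calls(demo):
        print(call["name"], call["error"], repr(call["raw_json"]))
```

Logging `raw_json` for every block whose `error` is not `None` gives a concrete artifact to compare against the parser behavior described in the related issues.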

Root cause analysis

Known streaming tool call parser bugs (amplified at long context):

  • #39757: Streaming mode truncates tool names (e.g., get_weather → get).
  • #36857: Streaming mode returns the complete arguments in the final chunk instead of appending them incrementally, causing a JSON parse failure.

These may be rare at short context but become more frequent near the context limit, where model output quality degrades and streaming chunk boundaries shift.
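
The two failure modes above can be illustrated with a small client-side check. This is a sketch under stated assumptions, not vLLM's parser logic: `check_tool_name` and `join_argument_chunks` are hypothetical helpers, and the second one assumes that the #36857 pattern shows up on the client as a final chunk repeating the full argument string after partial deltas were already sent.

```python
import json


def check_tool_name(name, declared_tools):
    """Return None if `name` is a declared tool, else a short diagnosis."""
    if name in declared_tools:
        return None
    # A non-empty strict prefix of a declared tool suggests truncation.
    candidates = [tool for tool in declared_tools if name and tool.startswith(name)]
    if candidates:
        return f"name {name!r} looks truncated (prefix of {candidates[0]!r})"
    return f"name {name!r} is not a declared tool"


def join_argument_chunks(chunks):
    """Parse streamed argument chunks into a dict.

    First tries plain concatenation (the incremental case). If that fails,
    falls back to the last chunk alone, which covers the assumed case where
    the final chunk carries the complete arguments. Raises
    json.JSONDecodeError if neither attempt parses.
    """
    naive = "".join(chunks)
    try:
        return json.loads(naive)
    except json.JSONDecodeError:
        if not chunks:
            raise
        return json.loads(chunks[-1])


if __name__ == "__main__":
    print(check_tool_name("get", ["get_weather"]))
    # Concatenating these three chunks yields two JSON objects back to back,
    # which a naive client cannot parse.
    print(join_argument_chunks(['{"city": ', '"Paris"}', '{"city": "Paris"}']))
```

A fallback like this only masks the symptom on the client; the parsing itself still needs to be fixed on the server side.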

Expected behavior

Tool calls should be reliably parsed even at long contexts approaching max_model_len.

Additional context

  • Fine-grained tool streaming is off by default when connecting through a custom provider — the issue persists regardless.
  • Related issues: #39757, #36857
