vllm - ✅(Solved) Fix [Bug]: The arguments invoked by the tool in the GLM-5 streaming output cannot be parsed into the JSON format. [1 pull requests, 2 comments, 2 participants]

vllm, 2026-03-12 06:46:24

GitHub stats

vllm-project/vllm#36857 • Fetched 2026-04-08 00:34:10

Author

lililcode9527

Participants

lililcode9527

QwertyJack

Timeline (top)

cross-referenced ×3, commented ×2, labeled ×1, subscribed ×1

Root Cause

While streaming the "tool_calls" content, the server returns the complete "tool_calls" arguments in the final chunk instead of only the not-yet-streamed remainder (such as the closing "}"), so the concatenated arguments cannot be parsed as JSON. We found that the relevant logic is in vllm/entrypoints/openai/chat_completion/serving.py: actual_call is not removed from expected_call as intended, mainly because actual_call has no space after the key-value separator, while expected_call does.
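The mismatch can be reproduced with plain strings. The sketch below is not vLLM code: the argument values are made up, and the variables only mirror the roles of expected_call and actual_call in the snippet from the issue.

```python
import json

# Hypothetical tool arguments, as a parser would hold them after json.loads.
args = {"content": "hello", "filePath": "/tmp/a.js"}

# Re-serialized with Python's default separators (", " and ": ").
expected_call = json.dumps(args, ensure_ascii=False)

# Compact JSON already streamed by the model; the closing brace is still pending.
actual_call = '{"content":"hello","filePath":"/tmp/a.js"'

# The compact text is not a substring of the spaced text, so nothing is removed.
remaining_call = expected_call.replace(actual_call, "", 1)
replace_was_noop = remaining_call == expected_call

# A client concatenates every delta it received, including the final one.
client_side = actual_call + remaining_call


def parses_as_json(text: str) -> bool:
    """Return True if text is a complete, valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(replace_was_noop)             # True
print(parses_as_json(client_side))  # False
```

Because the replace is a silent no-op, the whole spaced string is sent after the compact prefix, and the concatenation is no longer a single JSON object.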

Fix Action

Fixed

Fixed by PR: [Bugfix] Fix tool call streaming JSON separator mismatch (https://github.com/vllm-project/vllm/pull/36866)

PR fix notes

PR #36866: [Bugfix] Fix tool call streaming JSON separator mismatch

Repository: vllm-project/vllm
Author: xr843
State: closed | merged: False
Link: https://github.com/vllm-project/vllm/pull/36866

Description (problem / solution / changelog)

Summary

Fixes #36857

When a tool parser stores arguments as a parsed dict (via json.loads), the serving layer re-serializes them with json.dumps() using Python's default separators (', ' and ': '). If the model streamed compact JSON without spaces (e.g. {"key":"value"} instead of {"key": "value"}), the str.replace() call that computes the remaining unstreamed arguments fails silently — the replacement has no effect and the entire arguments string is dumped in the final streaming chunk.

This adds a fallback: when the default-formatted expected_call does not match the actually streamed text (actual_call), retry with compact JSON separators ((',', ':')).

Affects models like GLM-5 that stream tool call arguments without spaces after :
No behavior change for models whose output already matches Python's default json.dumps formatting
The fix is backward-compatible: it only activates when the initial replace() has no effect
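The fallback described above can be sketched as a standalone helper. This is an illustration under stated assumptions, not the PR's actual diff: the function name remaining_args and its signature are hypothetical, and only json.dumps with its separators parameter is a real API.

```python
import json
from typing import Any

COMPACT_SEPARATORS = (",", ":")


def remaining_args(args: Any, actual_call: str) -> str:
    """Return the part of the serialized arguments that was not streamed yet.

    args is either the raw argument string or a parsed dict;
    actual_call is the argument text already sent to the client.
    """
    if isinstance(args, str):
        return args.replace(actual_call, "", 1)

    # First attempt: Python's default separators (", " and ": ").
    expected_call = json.dumps(args, ensure_ascii=False)
    remaining = expected_call.replace(actual_call, "", 1)

    # Fallback: only when something was streamed and the replace had no effect.
    if actual_call and remaining == expected_call:
        compact_call = json.dumps(
            args, ensure_ascii=False, separators=COMPACT_SEPARATORS
        )
        compact_remaining = compact_call.replace(actual_call, "", 1)
        # Keep the compact result only if it actually matched.
        if compact_remaining != compact_call:
            remaining = compact_remaining
    return remaining


args = {"key": "value", "n": 1}
print(remaining_args(args, '{"key":"value","n":1'))     # compact stream
print(remaining_args(args, '{"key": "value", "n": 1'))  # spaced stream
```

Both calls return only the closing brace. Output that already matches the default formatting never reaches the fallback branch, which is what keeps the change backward-compatible.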

Test plan

Verify with GLM-5 model that tool call arguments are correctly streamed incrementally (not batched in final chunk)
Verify existing tool call streaming tests still pass (no regression for models that use spaced JSON)

🤖 Generated with Claude Code

Changed files

vllm/entrypoints/openai/chat_completion/serving.py (modified, +19/-0)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Your output of `python collect_env.py` here

</details>

🐛 Describe the bug

I deployed and tested GLM-5-w4a8-mtp in the Function Call streaming output scenario on vLLM 0.16.0. The relevant configuration and test result are provided at the end.

Streaming output result of the GLM-5 model:

data: {"id":"chatcmpl-","object":"chat.completion.chunk","created":xx,"model":"glm-5","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"iz"}}]},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-","object":"chat.completion.chunk","created":xx,"model":"glm-5","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"hu.js"}}]},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-","object":"chat.completion.chunk","created":xx,"model":"glm-5","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\""}}]},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-","object":"chat.completion.chunk","created":xx,"model":"glm-5","choices":[{"index":0,"delta":{"tool_calls":[{"id":null,"type":null,"index":0,"function":{"name":null,"arguments":"{\"content\": \"//Dou Dizhu\\nlet cards=[...'34567890JQKA2'].flatMap(v=>[v,v,v,v]).concat('X','D');\\nconsole.log('The game of Dou Dizhu has begun！Card group:',cards);\", \"filePath\": \"/home/Code/doudizhu.js\"}"}}]},"logprobs":null,"finish_reason":"tool_calls","stop_reason":154829,"token_ids":null}]}

data: [DONE]

While streaming the "tool_calls" content, the server returns the complete "tool_calls" arguments in the final chunk instead of only the not-yet-streamed remainder (such as the closing "}"), so the concatenated arguments cannot be parsed as JSON. We found that the relevant logic is in vllm/entrypoints/openai/chat_completion/serving.py: actual_call is not removed from expected_call as intended, mainly because actual_call has no space after the key-value separator, while expected_call does.

                            args = tool_parser.prev_tool_call_arr[index].get(
                                "arguments", {}
                            )
                            if isinstance(args, str):
                                expected_call = args
                            else:
                                expected_call = json.dumps(args, ensure_ascii=False)

                            # get what we've streamed so far for arguments
                            # for the current tool
                            actual_call = tool_parser.streamed_args_for_tool[index]
                            if latest_delta_len > 0:
                                actual_call = actual_call[:-latest_delta_len]

                            # check to see if there's anything left to stream
                            remaining_call = expected_call.replace(actual_call, "", 1)
                            # set that as a delta message
                            delta_message = self._create_remaining_args_delta(
                                delta_message, remaining_call, index
                            )

We hope this implementation can be improved to support the correct incremental output of the "tool_calls" content in the final chunk.

extent analysis

Fix Plan

To fix the incremental output of the "tool_calls" content, modify serving.py in the vllm/entrypoints/openai/chat_completion directory. The replacement fails because expected_call is serialized with Python's default json.dumps separators (', ' and ': '), while actual_call holds the compact JSON that the model streamed, so actual_call is never found inside expected_call.

Here are the steps to fix the issue:

Keep generating expected_call with json.dumps when args is a dictionary.
If replacing actual_call in expected_call has no effect, re-serialize args with compact separators ((',', ':')) and retry the replacement.

Example code changes (a sketch of the approach described in PR #36866, not the PR's actual diff):

# ...

if isinstance(args, str):
    expected_call = args
else:
    expected_call = json.dumps(args, ensure_ascii=False)

# get what we've streamed so far for arguments
# for the current tool
actual_call = tool_parser.streamed_args_for_tool[index]
if latest_delta_len > 0:
    actual_call = actual_call[:-latest_delta_len]

# check to see if there's anything left to stream
remaining_call = expected_call.replace(actual_call, "", 1)

# Fallback: if the replace had no effect and args is not a raw string,
# the model probably streamed compact JSON, so retry with compact separators.
if (
    actual_call
    and remaining_call == expected_call
    and not isinstance(args, str)
):
    compact_call = json.dumps(
        args, ensure_ascii=False, separators=(",", ":")
    )
    remaining_call = compact_call.replace(actual_call, "", 1)

# set that as a delta message
delta_message = self._create_remaining_args_delta(
    delta_message, remaining_call, index
)

Verification

To verify the fix, rerun the GLM-5 streaming test and check that the "tool_calls" arguments arrive incrementally, that the final chunk carries only the remaining text, and that the concatenated arguments parse as valid JSON.
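One way to automate this check is to concatenate the streamed argument deltas from a captured response and parse the result. The helper below is a minimal sketch: collect_tool_arguments and make_line are hypothetical names, and the two sample streams are synthetic. The sample lines quoted in this report redact the created field as xx, so they would need real values before json.loads accepts them.

```python
import json


def collect_tool_arguments(sse_lines, tool_index=0):
    """Concatenate the streamed `arguments` deltas for one tool call."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            for call in delta.get("tool_calls") or []:
                if call.get("index") != tool_index:
                    continue
                function = call.get("function") or {}
                parts.append(function.get("arguments") or "")
    return "".join(parts)


def make_line(fragment):
    """Build a synthetic SSE line carrying one arguments fragment."""
    chunk = {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": fragment}}]}}]}
    return "data: " + json.dumps(chunk)


# Expected after the fix: the final chunk carries only the closing brace.
good_stream = [make_line('{"filePath":'), make_line('"/tmp/a.js"'),
               make_line("}"), "data: [DONE]"]

# Behaviour from the bug report: the final chunk repeats the full arguments.
bad_stream = [make_line('{"filePath":'), make_line('"/tmp/a.js"'),
              make_line('{"filePath": "/tmp/a.js"}'), "data: [DONE]"]

print(json.loads(collect_tool_arguments(good_stream)))

try:
    json.loads(collect_tool_arguments(bad_stream))
except json.JSONDecodeError as exc:
    print("invalid arguments:", exc.msg)
```

Running the same helper over output captured before and after the change gives a direct regression check for both compact and spaced JSON.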

Extra Tips

Make sure to test the fix with different input scenarios to ensure that it works correctly in all cases.
Consider adding additional logging or debugging statements to help identify any further issues that may arise.
