vllm - ✅ (Solved) Fix [Bug]: The arguments invoked by the tool in the GLM-5 streaming output cannot be parsed into the JSON format. [1 pull request, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#36857
Fetched: 2026-04-08 00:34:10
Comments: 2
Participants: 2
Timeline events: 7
Reactions: 0
Timeline (top): cross-referenced ×3, commented ×2, labeled ×1, subscribed ×1

Root Cause

When streaming "tool_calls" content, the server returns the complete arguments string in the final chunk instead of only the text that has not been streamed yet (such as the closing "}"). The arguments a client assembles from the chunks therefore cannot be parsed as JSON. We found that the logic is located in vllm/entrypoints/openai/chat_completion/serving.py: actual_call is not removed from expected_call as intended, mainly because actual_call has no space after the ":" that follows each key, while expected_call does.
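The mismatch can be reproduced outside vLLM. The following is a minimal sketch, assuming the tool parser holds the arguments as a parsed dict and the model streamed compact JSON; the argument values and the simplified slicing are illustrative and not taken from the vLLM source.

```python
import json

# Illustrative arguments, as a tool parser might hold them after json.loads.
args = {"content": "hello", "filePath": "/tmp/a.js"}

# json.dumps uses ", " and ": " as separators by default.
expected_call = json.dumps(args, ensure_ascii=False)

# Assumption: the model streamed compact JSON and everything except the
# closing brace has already been sent to the client.
actual_call = json.dumps(args, ensure_ascii=False, separators=(",", ":"))[:-1]

# actual_call is not a substring of expected_call, so nothing is removed.
remaining_call = expected_call.replace(actual_call, "", 1)
print(remaining_call == expected_call)  # True

# The client appends the final chunk to what it already received.
try:
    json.loads(actual_call + remaining_call)
    client_parse_failed = False
except json.JSONDecodeError:
    client_parse_failed = True
print(client_parse_failed)  # True
```

Because str.replace() returns the string unchanged when the substring is absent, the failure raises no error on the server and only surfaces when the client parses the assembled arguments.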

Fix Action

Fixed

PR fix notes

PR #36866: [Bugfix] Fix tool call streaming JSON separator mismatch

Description (problem / solution / changelog)

Summary

Fixes #36857

When a tool parser stores arguments as a parsed dict (via json.loads), the serving layer re-serializes them with json.dumps() using Python's default separators (', ' and ': '). If the model streamed compact JSON without spaces (e.g. {"key":"value"} instead of {"key": "value"}), the str.replace() call that computes the remaining unstreamed arguments fails silently — the replacement has no effect and the entire arguments string is dumped in the final streaming chunk.

This adds a fallback: when the default-formatted expected_call does not match the actually streamed text (actual_call), retry with compact JSON separators ((',', ':')).

  • Affects models like GLM-5 that stream tool call arguments without spaces after :
  • No behavior change for models whose output already matches Python's default json.dumps formatting
  • The fix is backward-compatible: it only activates when the initial replace() has no effect
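The fallback described above can be sketched as a standalone function. This is an illustration based on the PR description, not the merged diff; the function name remaining_args and its signature are hypothetical.

```python
import json


def remaining_args(args, actual_call: str) -> str:
    """Return the part of the serialized arguments not yet streamed.

    Hypothetical helper: `args` is either a string or a parsed dict,
    `actual_call` is the argument text already sent to the client.
    """
    if isinstance(args, str):
        return args.replace(actual_call, "", 1)

    # First attempt: Python's default separators (", " and ": ").
    expected_call = json.dumps(args, ensure_ascii=False)
    remaining = expected_call.replace(actual_call, "", 1)

    # Fallback: only when something was streamed and replace() changed nothing.
    if actual_call and remaining == expected_call:
        compact_call = json.dumps(
            args, ensure_ascii=False, separators=(",", ":")
        )
        compact_remaining = compact_call.replace(actual_call, "", 1)
        if compact_remaining != compact_call:
            remaining = compact_remaining
    return remaining


args = {"k": "v", "n": 1}
print(remaining_args(args, '{"k":"v","n":1'))      # }
print(remaining_args(args, '{"k": "v", "n": 1'))   # }
```

Trying the default format first keeps the behavior unchanged for models whose output already matches json.dumps, which is the backward-compatibility property the PR notes claim.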

Test plan

  • Verify with GLM-5 model that tool call arguments are correctly streamed incrementally (not batched in final chunk)
  • Verify existing tool call streaming tests still pass (no regression for models that use spaced JSON)

🤖 Generated with Claude Code

Changed files

  • vllm/entrypoints/openai/chat_completion/serving.py (modified, +19/-0)



🐛 Describe the bug

I deployed and tested GLM-5-w4a8-mtp in the Function Call streaming output scenario on vLLM 0.16.0. The relevant configuration and test result are provided at the end.

Streaming output result of the GLM-5 model:

data: {"id":"chatcmpl-","object":"chat.completion.chunk","created":xx,"model":"glm-5","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"iz"}}]},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-","object":"chat.completion.chunk","created":xx,"model":"glm-5","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"hu.js"}}]},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-","object":"chat.completion.chunk","created":xx,"model":"glm-5","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\""}}]},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-","object":"chat.completion.chunk","created":xx,"model":"glm-5","choices":[{"index":0,"delta":{"tool_calls":[{"id":null,"type":null,"index":0,"function":{"name":null,"arguments":"{\"content\": \"//Dou Dizhu\\nlet cards=[...'34567890JQKA2'].flatMap(v=>[v,v,v,v]).concat('X','D');\\nconsole.log('The game of Dou Dizhu has begun!Card group:',cards);\", \"filePath\": \"/home/Code/doudizhu.js\"}"}}]},"logprobs":null,"finish_reason":"tool_calls","stop_reason":154829,"token_ids":null}]}

data: [DONE]

In the final chunk, the server returns the complete "tool_calls" arguments instead of only the remaining unstreamed text (such as the closing "}"). Appending that chunk to the arguments already received produces a string that cannot be parsed as JSON. We traced the logic to vllm/entrypoints/openai/chat_completion/serving.py, where actual_call is not replaced inside expected_call as expected, mainly because actual_call is missing the space after the ":" that follows each key.

args = tool_parser.prev_tool_call_arr[index].get(
    "arguments", {}
)
if isinstance(args, str):
    expected_call = args
else:
    expected_call = json.dumps(args, ensure_ascii=False)

# get what we've streamed so far for arguments
# for the current tool
actual_call = tool_parser.streamed_args_for_tool[index]
if latest_delta_len > 0:
    actual_call = actual_call[:-latest_delta_len]

# check to see if there's anything left to stream
remaining_call = expected_call.replace(actual_call, "", 1)
# set that as a delta message
delta_message = self._create_remaining_args_delta(
    delta_message, remaining_call, index
)

We hope this logic can be improved so that the final chunk carries only the remaining "tool_calls" arguments and the output stays incremental.


Extended analysis

Fix Plan

To restore incremental output of "tool_calls" arguments, modify serving.py in the vllm/entrypoints/openai/chat_completion directory. When args is a dict, expected_call is built with json.dumps() and Python's default separators (', ' and ': '). If the model streamed compact JSON, actual_call is not a substring of expected_call and the replacement has no effect.

Here are the steps to fix the issue, following the approach described in PR #36866:

  • Keep the existing replace() against the default-formatted expected_call.
  • If that replace() changed nothing and args is not a string, re-serialize args with compact separators (',', ':') and retry the replacement.

Example code changes (a sketch based on the PR description, not the merged diff):

# ...

if isinstance(args, str):
    expected_call = args
else:
    expected_call = json.dumps(args, ensure_ascii=False)

# get what we've streamed so far for arguments
# for the current tool
actual_call = tool_parser.streamed_args_for_tool[index]
if latest_delta_len > 0:
    actual_call = actual_call[:-latest_delta_len]

# check to see if there's anything left to stream
remaining_call = expected_call.replace(actual_call, "", 1)

# Fallback: the model may have streamed compact JSON (no space
# after ':' or ','), which the default json.dumps output cannot match
if (
    actual_call
    and remaining_call == expected_call
    and not isinstance(args, str)
):
    compact_call = json.dumps(
        args, ensure_ascii=False, separators=(",", ":")
    )
    remaining_call = compact_call.replace(actual_call, "", 1)

# set that as a delta message
delta_message = self._create_remaining_args_delta(
    delta_message, remaining_call, index
)

Verification

To verify the fix, rerun the streaming test against the GLM-5 model and confirm three things: the "tool_calls" arguments arrive incrementally, the final chunk carries only the remaining text, and the concatenated arguments parse as JSON.
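One way to automate this check is to concatenate the argument deltas from a captured stream and try to parse the result. The sketch below assumes the stream was saved as a list of "data: ..." lines with valid JSON payloads; collect_tool_arguments, all_arguments_parse, and make_chunk are hypothetical helper names, and the sample chunks are reduced to the fields the check needs.

```python
import json


def collect_tool_arguments(sse_lines):
    """Concatenate streamed argument fragments per tool-call index."""
    buffers = {}
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            for call in delta.get("tool_calls") or []:
                fragment = (call.get("function") or {}).get("arguments") or ""
                index = call.get("index", 0)
                buffers[index] = buffers.get(index, "") + fragment
    return buffers


def all_arguments_parse(buffers):
    """Return True if every assembled arguments string is valid JSON."""
    for text in buffers.values():
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return False
    return True


def make_chunk(fragment):
    """Build a reduced sample chunk line for demonstration only."""
    call = {"index": 0, "function": {"arguments": fragment}}
    body = {"choices": [{"index": 0, "delta": {"tool_calls": [call]}}]}
    return "data: " + json.dumps(body)


# Incremental stream: the final chunk carries only the closing brace.
fixed = [make_chunk('{"a":'), make_chunk("1"), make_chunk("}"), "data: [DONE]"]
# Stream showing the reported bug: the final chunk repeats everything.
broken = [make_chunk('{"a":'), make_chunk("1"), make_chunk('{"a": 1}'), "data: [DONE]"]

print(all_arguments_parse(collect_tool_arguments(fixed)))   # True
print(all_arguments_parse(collect_tool_arguments(broken)))  # False
```

The chunks quoted in the issue contain a redacted "created":xx field, so they would need real values restored before being fed to json.loads.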

Extra Tips

  • Test with models that stream spaced JSON as well as models that stream compact JSON, to confirm the fallback introduces no regression.
  • Consider logging the case where neither the default nor the compact serialization matches actual_call, since the full arguments string is then still emitted in the final chunk.
