vllm - [Bug]: GLM tool-call streaming final chunks repeat metadata and combine arguments with finish_reason



Your current environment

Current local upstream main:

commit 6bdabbad5 [CI/Build] Enable Step3p7ForConditionalGeneration testing (#43956)

The relevant code path is vllm/entrypoints/openai/chat_completion/serving.py with GLM tool parsers such as glm45 / glm47 used for GLM-4.5 / GLM-5 style tool-call streaming.

🐛 Describe the bug

There are still two GLM tool-call streaming protocol issues on current main.

1. Final remaining-argument chunks can re-emit tool-call metadata

When OpenAIServingChat computes remaining tool arguments at finish time, _create_remaining_args_delta() preserves id, type, and function.name from the original delta:

return DeltaMessage(
    tool_calls=[
        DeltaToolCall(
            index=index,
            id=original_tc.id if original_tc else None,
            type=original_tc.type if original_tc else None,
            function=DeltaFunctionCall(
                name=original_fn.name if original_fn else None,
                arguments=remaining_call,
            ),
        )
    ]
)

For continuation / final remaining-argument chunks, this can send id, type, and function.name again even though those fields were already emitted in the first chunk for that tool-call index. OpenAI-compatible clients generally expect metadata to appear in the first chunk only, while later chunks append only function.arguments fragments.
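
The intended behavior can be sketched with plain dictionaries instead of vLLM's `DeltaMessage` / `DeltaToolCall` classes. The helper name `build_remaining_args_delta` and the `header_sent` flag are hypothetical and not part of vLLM; the sketch only illustrates that metadata is attached when the header for that tool-call index has not been emitted yet.

```python
from typing import Any, Optional


def build_remaining_args_delta(
    index: int,
    remaining_args: str,
    header_sent: bool,
    call_id: Optional[str] = None,
    name: Optional[str] = None,
) -> dict[str, Any]:
    """Build a tool-call delta for the remaining argument text.

    Hypothetical helper (not a vLLM API): id, type, and function.name
    are included only if the header for this index was not sent before.
    """
    function: dict[str, Any] = {"arguments": remaining_args}
    tool_call: dict[str, Any] = {"index": index, "function": function}
    if not header_sent:
        # First chunk for this tool-call index: include the metadata once.
        tool_call["id"] = call_id
        tool_call["type"] = "function"
        function["name"] = name
    return {"tool_calls": [tool_call]}


# Continuation chunk: the header was already streamed, so only arguments remain.
continuation = build_remaining_args_delta(
    0, "]}", header_sent=True, call_id="call_current", name="current_name"
)
print(continuation)
# {'tool_calls': [{'index': 0, 'function': {'arguments': ']}'}}]}
```

How the serving layer would track `header_sent` per tool-call index is left open here; the sketch assumes that state is available at finish time.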

A minimal probe against current main shows the metadata is still preserved:

remaining_delta.model_dump=
{'id': 'call_current', 'type': 'function', 'index': 0,
 'function': {'name': 'current_name', 'arguments': ']}'}}

2. A terminal argument chunk can be combined with finish_reason="tool_calls"

When the final engine output has both a tool argument delta and output.finish_reason is not None, the current stream generator builds one ChatCompletionResponseStreamChoice containing both:

  • delta.tool_calls[*].function.arguments
  • finish_reason="tool_calls"

Minimal serialized example from current main:

{
  "index": 0,
  "delta": {
    "tool_calls": [
      {"index": 0, "function": {"arguments": "\"pong.py\"}"}}
    ]
  },
  "finish_reason": "tool_calls"
}

This is problematic for strict OpenAI-compatible streaming clients: a client that treats any chunk with a non-null finish_reason as the finish chunk may drop or mishandle the last argument bytes carried in that same chunk. The safer protocol shape is:

  1. emit the terminal argument fragment with finish_reason: null;
  2. emit a separate empty-delta finish chunk with finish_reason: "tool_calls";
  3. then emit the usage chunk if stream_options.include_usage is enabled.
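
The first two steps above can be sketched as a small post-processing function over a serialized choice. This is an illustration using plain dictionaries, not the vLLM stream generator; `split_final_choice` is a hypothetical name, and the usage chunk from step 3 is out of scope here.

```python
from typing import Any, Optional


def split_final_choice(choice: dict[str, Any]) -> list[dict[str, Any]]:
    """Split a choice carrying both tool-call arguments and a finish_reason.

    Hypothetical helper: returns the choice unchanged (as a one-item list)
    unless it has a non-empty arguments fragment and a non-null finish_reason.
    """
    finish_reason: Optional[str] = choice.get("finish_reason")
    tool_calls = (choice.get("delta") or {}).get("tool_calls") or []
    has_args = any(
        (tc.get("function") or {}).get("arguments") for tc in tool_calls
    )
    if finish_reason is None or not has_args:
        return [choice]
    # Step 1: the terminal argument fragment, with finish_reason reset to null.
    args_chunk = {**choice, "finish_reason": None}
    # Step 2: a separate empty-delta chunk that carries only the finish_reason.
    finish_chunk = {
        "index": choice["index"],
        "delta": {},
        "finish_reason": finish_reason,
    }
    return [args_chunk, finish_chunk]


combined = {
    "index": 0,
    "delta": {
        "tool_calls": [{"index": 0, "function": {"arguments": '"pong.py"}'}}]
    },
    "finish_reason": "tool_calls",
}
for chunk in split_final_choice(combined):
    print(chunk["finish_reason"], chunk["delta"])
```

Applied to the combined chunk shown earlier, this yields two choices: the argument fragment with `finish_reason` set to `None`, followed by an empty delta with `finish_reason` set to `"tool_calls"`.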

Related upstream / downstream context

This is related to, but not fully covered by, existing issues and PRs:

  • vLLM issue #38603 / PR #39598: related null/empty field and MTP tool-call final chunk behavior, but not the full GLM metadata-repeat + finish-chunk split semantics.
  • vLLM issue #36857 / PR #37845: suffix-alignment / full-argument re-emission issue; closed as fixed on main, but it does not cover these two protocol issues.
  • vLLM PR #39253: fixed GLM parser streaming under MTP / stream interval, but the serving-layer final chunk behavior above still exists.
  • vLLM-Ascend issue vllm-project/vllm-ascend#8327: reports the argument-delta and finish_reason="tool_calls" being combined in one final chunk.
  • vLLM-Ascend PR vllm-project/vllm-ascend#8178: downstream patch avoiding duplicate function metadata in final chunks.

Expected behavior

For a given tool-call index:

  • id, type, and function.name should be emitted only when the tool-call header is first introduced.
  • Continuation/final argument chunks should emit only function.arguments fragments.
  • If the terminal chunk contains a non-empty function.arguments fragment and also ends the tool call, the stream should first send the argument fragment with finish_reason: null, then send a separate empty finish chunk with finish_reason: "tool_calls".
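
These expectations could be checked in a regression test with a small validator over the streamed choices. The sketch below is an assumption about how such a check might look, not existing vLLM test code; `find_protocol_violations` and the sample values `call_1` and `write_file` are illustrative only.

```python
from typing import Any


def find_protocol_violations(choices: list[dict[str, Any]]) -> list[str]:
    """Report repeated tool-call metadata and arguments sent with a finish_reason."""
    violations: list[str] = []
    seen_header: set[int] = set()  # tool-call indexes whose header was already sent
    for pos, choice in enumerate(choices):
        tool_calls = (choice.get("delta") or {}).get("tool_calls") or []
        for tc in tool_calls:
            idx = tc["index"]
            fn = tc.get("function") or {}
            has_meta = any(
                value is not None
                for value in (tc.get("id"), tc.get("type"), fn.get("name"))
            )
            if has_meta and idx in seen_header:
                violations.append(
                    f"chunk {pos}: metadata repeated for tool-call index {idx}"
                )
            if has_meta:
                seen_header.add(idx)
            if fn.get("arguments") and choice.get("finish_reason") is not None:
                violations.append(
                    f"chunk {pos}: arguments combined with finish_reason"
                )
    return violations


expected_stream = [
    {"index": 0, "finish_reason": None, "delta": {"tool_calls": [
        {"index": 0, "id": "call_1", "type": "function",
         "function": {"name": "write_file", "arguments": '{"path": '}}]}},
    {"index": 0, "finish_reason": None, "delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '"pong.py"}'}}]}},
    {"index": 0, "finish_reason": "tool_calls", "delta": {}},
]
print(find_protocol_violations(expected_stream))  # []
```

A stream whose last argument chunk repeats `id` or carries `finish_reason: "tool_calls"` would produce one message per violated expectation.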

Before submitting a new issue...

  • I have searched existing issues and PRs and listed the nearest related ones above.

