vllm - 💡(How to fix) Fix [Bug]: GLM tool-call streaming final chunks repeat metadata and combine arguments with finish_reason

Your current environment

Current local upstream main:

commit 6bdabbad5 [CI/Build] Enable Step3p7ForConditionalGeneration testing (#43956)

The relevant code path is vllm/entrypoints/openai/chat_completion/serving.py with GLM tool parsers such as glm45 / glm47 used for GLM-4.5 / GLM-5 style tool-call streaming.

🐛 Describe the bug

There are still two GLM tool-call streaming protocol issues on current main.

1. Final remaining-argument chunks can re-emit tool-call metadata

When OpenAIServingChat computes remaining tool arguments at finish time, _create_remaining_args_delta() preserves id, type, and function.name from the original delta:

return DeltaMessage(
    tool_calls=[
        DeltaToolCall(
            index=index,
            id=original_tc.id if original_tc else None,
            type=original_tc.type if original_tc else None,
            function=DeltaFunctionCall(
                name=original_fn.name if original_fn else None,
                arguments=remaining_call,
            ),
        )
    ]
)

For continuation / final remaining-argument chunks, this can send id, type, and function.name again even though those fields were already emitted in the first chunk for that tool-call index. OpenAI-compatible clients generally expect metadata to appear in the first chunk only, while later chunks append only function.arguments fragments.

A minimal probe against current main shows the metadata is still preserved:

remaining_delta.model_dump=
{'id': 'call_current', 'type': 'function', 'index': 0,
 'function': {'name': 'current_name', 'arguments': ']}'}}
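One possible shape for the fix is to include the header fields only when the header for that tool-call index has not been streamed yet. The sketch below is illustrative only: it uses plain dictionaries instead of vLLM's `DeltaMessage` / `DeltaToolCall` classes, and the helper name `build_remaining_args_delta` and its `header_already_sent` flag are hypothetical, not part of vLLM.

```python
from typing import Any, Optional


def build_remaining_args_delta(
    index: int,
    remaining_args: str,
    header_already_sent: bool,
    tool_call_id: Optional[str] = None,
    name: Optional[str] = None,
) -> dict[str, Any]:
    """Build a tool-call delta for the arguments left over at finish time.

    Hypothetical helper on plain dicts; not the vLLM implementation.
    """
    function: dict[str, Any] = {"arguments": remaining_args}
    tool_call: dict[str, Any] = {"index": index, "function": function}
    if not header_already_sent:
        # First chunk for this index: emit id, type, and name exactly once.
        tool_call["id"] = tool_call_id
        tool_call["type"] = "function"
        function["name"] = name
    return {"tool_calls": [tool_call]}


# The header for index 0 was already streamed, so only arguments are sent.
continuation = build_remaining_args_delta(
    0, "]}", header_already_sent=True,
    tool_call_id="call_current", name="current_name",
)
print(continuation)
# {'tool_calls': [{'index': 0, 'function': {'arguments': ']}'}}]}
```

In the real serving code the equivalent decision would need to know whether the header for that index was already sent, which this sketch simply takes as a parameter.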

2. A terminal argument chunk can be combined with `finish_reason="tool_calls"`

When the final engine output carries a tool argument delta and output.finish_reason is not None, the current stream generator builds one ChatCompletionResponseStreamChoice containing both:

delta.tool_calls[*].function.arguments
finish_reason="tool_calls"

Minimal serialized example from current main:

{
  "index": 0,
  "delta": {
    "tool_calls": [
      {"index": 0, "function": {"arguments": "\"pong.py\"}"}}
    ]
  },
  "finish_reason": "tool_calls"
}

This is problematic for strict OpenAI-compatible streaming clients because the last argument bytes can be associated with the finish chunk and dropped or mishandled. The safer protocol shape is:

- emit the terminal argument fragment with `finish_reason: null`;
- emit a separate empty-delta finish chunk with `finish_reason: "tool_calls"`;
- then emit the usage chunk if `stream_options.include_usage` is enabled.
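The ordering above can be sketched as a small post-processing step on a single choice. This is a minimal illustration on plain dictionaries, not the vLLM stream generator; `split_terminal_choice` is a hypothetical name, and the usage chunk is modelled the way the OpenAI streaming API sends it, with an empty `choices` list.

```python
import json
from typing import Any, Optional


def split_terminal_choice(
    choice: dict[str, Any],
    usage: Optional[dict[str, int]] = None,
) -> list[dict[str, Any]]:
    """Return the chunks to stream, in order, for one terminal choice.

    Hypothetical helper: splits a choice that carries both argument bytes
    and a finish reason into an argument chunk and an empty finish chunk.
    """
    tool_calls = (choice.get("delta") or {}).get("tool_calls") or []
    has_args = any(
        (tc.get("function") or {}).get("arguments") for tc in tool_calls
    )
    finish_reason = choice.get("finish_reason")

    if finish_reason is None or not has_args:
        # Nothing to split: pass the choice through unchanged.
        chunks = [{"choices": [choice]}]
    else:
        # Shallow copy so the caller's dict is not mutated.
        args_choice = {**choice, "finish_reason": None}
        finish_choice = {
            "index": choice["index"],
            "delta": {},
            "finish_reason": finish_reason,
        }
        chunks = [{"choices": [args_choice]}, {"choices": [finish_choice]}]

    if usage is not None and finish_reason is not None:
        # The usage chunk goes last and has no choices.
        chunks.append({"choices": [], "usage": usage})
    return chunks


combined = {
    "index": 0,
    "delta": {
        "tool_calls": [{"index": 0, "function": {"arguments": "\"pong.py\"}"}}]
    },
    "finish_reason": "tool_calls",
}
for chunk in split_terminal_choice(combined):
    print(json.dumps(chunk))
```

Applied to the combined chunk from the example above, this yields two chunks: the argument fragment with a null finish reason, followed by an empty delta that carries `"tool_calls"`.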

Related upstream / downstream context

This is related to, but not fully covered by, existing issues and PRs:

- vLLM issue #38603 / PR #39598: related null/empty field and MTP tool-call final chunk behavior, but not the full GLM metadata-repeat + finish-chunk split semantics.
- vLLM issue #36857 / PR #37845: suffix-alignment / full-argument re-emission issue; closed as fixed on main, but it does not cover these two protocol issues.
- vLLM PR #39253: fixed GLM parser streaming under MTP / stream interval, but the serving-layer final chunk behavior above still exists.
- vLLM-Ascend issue vllm-project/vllm-ascend#8327: reports the argument-delta and `finish_reason="tool_calls"` being combined in one final chunk.
- vLLM-Ascend PR vllm-project/vllm-ascend#8178: downstream patch avoiding duplicate function metadata in final chunks.

Expected behavior

For a given tool-call index:

- `id`, `type`, and `function.name` should be emitted only when the tool-call header is first introduced.
- Continuation/final argument chunks should emit only `function.arguments` fragments.
- If the terminal chunk contains a non-empty `function.arguments` fragment and also ends the tool call, the stream should first send the argument fragment with `finish_reason: null`, then send a separate empty finish chunk with `finish_reason: "tool_calls"`.
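These rules can be turned into a small regression check over a captured stream. The following is a sketch under stated assumptions: each element is one streamed choice as a plain dictionary, and `find_protocol_violations` is a hypothetical helper, not an existing vLLM test utility.

```python
from typing import Any


def find_protocol_violations(choices: list[dict[str, Any]]) -> list[str]:
    """Check streamed choices against the expected tool-call rules."""
    violations: list[str] = []
    seen_index: set[int] = set()  # tool-call indexes already introduced
    for pos, choice in enumerate(choices):
        tool_calls = (choice.get("delta") or {}).get("tool_calls") or []
        for tc in tool_calls:
            idx = tc["index"]
            function = tc.get("function") or {}
            has_header = bool(
                tc.get("id") or tc.get("type") or function.get("name")
            )
            if has_header and idx in seen_index:
                violations.append(
                    f"chunk {pos}: metadata after first chunk for tool call {idx}"
                )
            seen_index.add(idx)
            if function.get("arguments") and choice.get("finish_reason") is not None:
                violations.append(
                    f"chunk {pos}: arguments combined with finish_reason"
                )
    return violations


header = {"id": "call_current", "type": "function"}
stream = [
    {"index": 0, "finish_reason": None, "delta": {"tool_calls": [
        {"index": 0, **header,
         "function": {"name": "current_name", "arguments": ""}}]}},
    {"index": 0, "finish_reason": "tool_calls", "delta": {"tool_calls": [
        {"index": 0, **header,
         "function": {"name": "current_name", "arguments": "]}"}}]}},
]
for problem in find_protocol_violations(stream):
    print(problem)
```

The second chunk in this sample exhibits both issues described in this report, so the checker flags it twice; a stream that follows the expected behavior produces an empty list.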

Before submitting a new issue...

I have searched existing issues and PRs and listed the nearest related ones above.
