vllm - [Bug]: DeepSeek V4 DSML tool calls fake-stream arguments instead of incremental deltas


Root Cause

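The DeepSeek V4 DSML tool parser buffers the entire <|DSML|invoke ...>...</|DSML|invoke> block and emits function.arguments only after the closing invoke tag arrives. The streaming code path is inherited from DeepSeekV32ToolParser, which documents this buffer-until-complete-invoke strategy.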

Fix Action

Fixed


Your current environment

Reproduced at parser level on upstream main commit 0fa888465e5a30b797bdf2cdcd0f57fc77541cef.

This does not require model weights or a GPU to reproduce because it is in the DeepSeek DSML tool parser streaming path.

🐛 Describe the bug

DeepSeek V4 DSML tool-call streaming currently buffers the entire <|DSML|invoke ...>...</|DSML|invoke> block and emits function.arguments only once the closing invoke tag is present.

That means stream=true does not actually stream tool-call arguments incrementally. For long tool-call arguments, clients receive no argument deltas until the full invoke is complete, then receive one large function.arguments payload.

The current code path is inherited from DeepSeekV32ToolParser and explicitly documents the behavior as a buffer-until-complete-invoke strategy.

Expected behavior:

  1. Emit the tool-call id/type/name once the invoke start tag is recognized.
  2. Emit valid OpenAI-compatible function.arguments fragments incrementally as DSML parameter content arrives.
  3. Reassembling all argument fragments should produce the same JSON object as non-streaming parsing.
  4. Preserve the existing string="true|false" semantics from #41801.
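
A minimal sketch of expectations 2 to 4, written as standalone Python with no vLLM imports. It is not the parser's implementation or the fix from the linked PR. ArgumentStreamer is a hypothetical helper that receives only the body of one invoke (the text after the invoke start tag) and returns JSON fragments as parameter content arrives. It assumes, based on the reproduction in this issue, that string="true" marks a value to be emitted as a JSON string and string="false" marks a value that is already a JSON literal.

```python
import json
import re

PARAM_END = "</|DSML|parameter>"
INV_END = "</|DSML|invoke>"
# Assumed header shape, taken from the reproduction text in this issue.
HEADER_RE = re.compile(r'<\|DSML\|parameter name="([^"]+)" string="(true|false)">')


class ArgumentStreamer:
    """Hypothetical helper: turns the body of one DSML invoke into JSON fragments."""

    def __init__(self):
        self.buf = ""           # text received but not yet emitted
        self.in_value = False   # True while inside a parameter's content
        self.is_string = False  # string="true" for the current parameter
        self.count = 0          # parameters started so far
        self.done = False       # True once the closing invoke tag was seen

    def _encode(self, text):
        # String content is JSON-escaped piece by piece; literals pass through.
        if self.is_string:
            return json.dumps(text, ensure_ascii=False)[1:-1]
        return text

    def feed(self, chunk):
        """Consume one text chunk and return the JSON fragment it unlocks."""
        self.buf += chunk
        out = []
        while not self.done:
            if self.in_value:
                end = self.buf.find(PARAM_END)
                if end >= 0:
                    # Closing tag seen: flush the rest of the value.
                    out.append(self._encode(self.buf[:end]))
                    if self.is_string:
                        out.append('"')
                    self.buf = self.buf[end + len(PARAM_END):]
                    self.in_value = False
                    continue
                # Hold back a tail that could be the start of the closing tag.
                keep = 0
                for n in range(min(len(PARAM_END) - 1, len(self.buf)), 0, -1):
                    if PARAM_END.startswith(self.buf[-n:]):
                        keep = n
                        break
                safe = self.buf[:len(self.buf) - keep]
                self.buf = self.buf[len(safe):]
                out.append(self._encode(safe))
                break
            self.buf = self.buf.lstrip()
            if self.buf.startswith(INV_END):
                out.append("}" if self.count else "{}")
                self.done = True
                break
            match = HEADER_RE.match(self.buf)
            if match is None:
                break  # header still incomplete, wait for more text
            name, flag = match.groups()
            self.is_string = flag == "true"
            out.append(("{" if self.count == 0 else ", ") + json.dumps(name) + ": ")
            if self.is_string:
                out.append('"')
            self.count += 1
            self.buf = self.buf[match.end():]
            self.in_value = True
        return "".join(out)


body = (
    '\n<|DSML|parameter name="days" string="false">3</|DSML|parameter>\n'
    '<|DSML|parameter name="notes" string="true">window "seat"</|DSML|parameter>\n'
    '</|DSML|invoke>'
)
streamer = ArgumentStreamer()
fragments = []
for start in range(0, len(body), 4):
    fragment = streamer.feed(body[start:start + 4])
    if fragment:
        fragments.append(fragment)

print(len(fragments))
print(json.loads("".join(fragments)))
```

The sketch holds back only a trailing run of characters that might be the beginning of the closing parameter tag, so every other character can be forwarded as soon as it arrives. Concatenating the fragments yields one JSON object, which is the property expectation 3 asks for.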

Actual behavior:

A chunked DSML tool call produces a single argument delta after the entire invoke is complete.

Parser-level reproduction:

import json
from unittest.mock import MagicMock

from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest, ChatCompletionToolsParam
from vllm.tool_parsers.deepseekv4_tool_parser import DeepSeekV4ToolParser

mock_tokenizer = MagicMock()
mock_tokenizer.get_vocab.return_value = {}

TC_START = "<|DSML|tool_calls>"
TC_END = "</|DSML|tool_calls>"
INV_START = '<|DSML|invoke name="'
INV_END = "</|DSML|invoke>"
PARAM_START = '<|DSML|parameter name="'
PARAM_END = "</|DSML|parameter>"

tool = ChatCompletionToolsParam(
    type="function",
    function={
        "name": "plan_trip",
        "parameters": {
            "type": "object",
            "properties": {
                "days": {"type": "integer"},
                "flexible": {"type": "boolean"},
                "cities": {"type": "array", "items": {"type": "string"}},
                "notes": {"type": "string"},
            },
            "required": ["days", "flexible", "cities", "notes"],
        },
    },
)

full_text = (
    f'{TC_START}\n{INV_START}plan_trip">\n'
    f'{PARAM_START}days" string="false">3{PARAM_END}\n'
    f'{PARAM_START}flexible" string="false">false{PARAM_END}\n'
    f'{PARAM_START}cities" string="false">["Beijing","Shanghai","Tokyo","New York"]{PARAM_END}\n'
    f'{PARAM_START}notes" string="true">靠窗座位{PARAM_END}\n'
    f'{INV_END}\n{TC_END}'
)

parser = DeepSeekV4ToolParser(mock_tokenizer, tools=[tool])
request = ChatCompletionRequest(model="m", messages=[], tools=[tool])

# Feed the DSML text to the streaming parser four characters at a time.
prev = ""
deltas = []
for start in range(0, len(full_text), 4):
    delta_text = full_text[start:start + 4]
    curr = prev + delta_text
    # The token-id arguments are dummy values ([], [], [1]) in this reproduction.
    delta = parser.extract_tool_calls_streaming(prev, curr, delta_text, [], [], [1], request)
    prev = curr
    if delta is not None:
        deltas.append(delta)

# Collect every function.arguments fragment emitted across all deltas.
arg_chunks = [
    tc.function.arguments
    for d in deltas
    for tc in (d.tool_calls or [])
    if tc.function and tc.function.arguments is not None
]

# Print the number of argument chunks and their lengths, then the reassembled object.
print(len(arg_chunks), [len(c) for c in arg_chunks])
print(json.loads("".join(arg_chunks)))

Current output on main:

1 [103]
{'days': 3, 'flexible': False, 'cities': ['Beijing', 'Shanghai', 'Tokyo', 'New York'], 'notes': '靠窗座位'}

The final reconstructed object is correct, but streaming is fake because there is only one large arguments chunk.
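
The distinction between a correct final object and actually incremental streaming can be turned into a small check. The sketch below is standalone Python; check_incremental and its min_chunks threshold are hypothetical names chosen for illustration, not part of vLLM or its test suite. It takes a list of argument chunks such as arg_chunks from the reproduction and reports both failure modes separately.

```python
import json


def check_incremental(arg_chunks, expected, min_chunks=2):
    """Return a list of problems; an empty list means the stream looks incremental."""
    problems = []
    non_empty = [chunk for chunk in arg_chunks if chunk]
    if len(non_empty) < min_chunks:
        problems.append(
            f"only {len(non_empty)} non-empty argument chunk(s), expected at least {min_chunks}"
        )
    try:
        rebuilt = json.loads("".join(arg_chunks))
    except json.JSONDecodeError as exc:
        problems.append(f"reassembled arguments are not valid JSON: {exc}")
    else:
        if rebuilt != expected:
            problems.append("reassembled arguments differ from the expected object")
    return problems


expected = {
    "days": 3,
    "flexible": False,
    "cities": ["Beijing", "Shanghai", "Tokyo", "New York"],
    "notes": "靠窗座位",
}

# One large chunk, as in the output reported above: correct object, not incremental.
single_chunk = [json.dumps(expected, ensure_ascii=False)]
print(check_incremental(single_chunk, expected))
```

With the single-chunk output shown above, only the chunk-count problem is reported, which matches the description of the bug: the content is right, the delivery is not.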

Before submitting a new issue...

  • I searched existing issues and PRs. #40801 is about DSML marker leakage; #40959 and #41531 are nested-arguments normalization PRs, which are separate from this incremental streaming behavior and largely superseded by #41801.
