vllm - [Feature]: Add nemotron_json as built-in tool parser (NVIDIA Nemotron-Nano-9B-v2 plugin breaks against v0.20.x module reorg)


nvidia/NVIDIA-Nemotron-Nano-9B-v2 ships an out-of-tree tool-call parser plugin (nemotron_toolcall_parser_no_streaming.py) that NVIDIA's own vLLM cookbook tells users to load via:

--enable-auto-tool-choice
--tool-parser-plugin "<repo>/nemotron_toolcall_parser_no_streaming.py"
--tool-call-parser nemotron_json

The cookbook pins vLLM to commit 75531a6c… (2025-08-15). The plugin file in NVIDIA's HF model repo has not been updated since.

Error Message

KeyError: 'invalid tool call parser: nemotron_json'

Raised at server start, because the upstream plugin cannot be imported against v0.20.x. After the imports are patched, the first request that carries tools=[…] raises:

TypeError: NemotronJSONToolParser.__init__() takes 2 positional arguments but 3 were given

Root Cause

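The upstream plugin imports from module paths that were moved in the v0.20.x reorg, and ToolParser.__init__ is now called with a second argument (request.tools) that the plugin's __init__ does not accept. The current state is silently broken for anyone following NVIDIA's official cookbook against current vLLM.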

Fix Action

Fix / Workaround

Patched plugin (works against v0.20.1)

Replace the upstream plugin file with the patched version listed under Code Example. It updates the import paths, renames AnyTokenizer to TokenizerLike, and lets __init__ accept the tools argument; the parsing logic is identical to NVIDIA's upstream.

Code Example

# SPDX-License-Identifier: Apache-2.0

import json
import re
from typing import Union

from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
from vllm.entrypoints.openai.engine.protocol import (
    DeltaMessage,
    ExtractedToolCallInformation,
    FunctionCall,
    ToolCall,
)
from vllm.tool_parsers.abstract_tool_parser import ToolParser, ToolParserManager
from vllm.logger import init_logger
from vllm.tokenizers.protocol import TokenizerLike

logger = init_logger(__name__)


@ToolParserManager.register_module("nemotron_json")
class NemotronJSONToolParser(ToolParser):
    def __init__(self, tokenizer: TokenizerLike, tools=None):
        super().__init__(tokenizer, tools)
        self.tool_call_start_token = "<TOOLCALL>"
        self.tool_call_end_token = "</TOOLCALL>"
        self.tool_call_regex = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)

    def extract_tool_calls(
        self, model_output: str, request: ChatCompletionRequest
    ) -> ExtractedToolCallInformation:
        if self.tool_call_start_token not in model_output:
            return ExtractedToolCallInformation(
                tools_called=False, tool_calls=[], content=model_output
            )
        try:
            str_calls = self.tool_call_regex.findall(model_output)[0].strip()
            if not str_calls.startswith("["):
                str_calls = "[" + str_calls
            if not str_calls.endswith("]"):
                str_calls = str_calls + "]"
            tool_calls = []
            for tc in json.loads(str_calls):
                try:
                    args = tc["arguments"]
                    tool_calls.append(ToolCall(
                        type="function",
                        function=FunctionCall(
                            name=tc["name"],
                            arguments=json.dumps(args, ensure_ascii=False)
                                if isinstance(args, dict) else args,
                        ),
                    ))
                except Exception:
                    continue
            content = model_output[:model_output.rfind(self.tool_call_start_token)]
            return ExtractedToolCallInformation(
                tools_called=True, tool_calls=tool_calls,
                content=content if content else None,
            )
        except Exception:
            logger.exception("Error extracting tool call from: %s", model_output)
            return ExtractedToolCallInformation(
                tools_called=False, tool_calls=[], content=model_output
            )

    def extract_tool_calls_streaming(self, *_args, **_kwargs) -> Union[DeltaMessage, None]:
        raise NotImplementedError("Streaming not supported")

🚀 The feature, motivation and pitch

What breaks on v0.20.x

Three import paths in the plugin no longer resolve, plus the ToolParser.__init__ calling convention changed:

| Symbol / surface | Old (Aug-2025 vLLM) | v0.20.1 |
|---|---|---|
| ChatCompletionRequest | vllm.entrypoints.openai.protocol | vllm.entrypoints.openai.chat_completion.protocol |
| FunctionCall, ToolCall, DeltaFunctionCall, DeltaToolCall, DeltaMessage, ExtractedToolCallInformation | vllm.entrypoints.openai.protocol | vllm.entrypoints.openai.engine.protocol |
| ToolParser, ToolParserManager | vllm.entrypoints.openai.tool_parsers.abstract_tool_parser | vllm.tool_parsers.abstract_tool_parser |
| AnyTokenizer | vllm.transformers_utils.tokenizer | renamed to TokenizerLike in vllm.tokenizers.protocol |
| ToolParser.__init__(tokenizer) | one positional arg | now called as tool_parser(tokenizer, request.tools) (see vllm/entrypoints/serve/render/serving.py); subclasses must accept the second arg |

Result against current vLLM: server fails to start with KeyError: 'invalid tool call parser: nemotron_json' (plugin can't be imported), and even after fixing imports the parser raises TypeError: __init__() takes 2 positional arguments but 3 were given on the first request that carries tools=[…].
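For a plugin that has to load on both sides of the reorg, one option is to try the new module path first and fall back to the old one. The sketch below is only an illustration: import_first is a hypothetical helper (not part of vLLM), and the candidate paths are copied from the table above rather than checked against every release.

```python
import importlib
from typing import Any, Iterable, Tuple


def import_first(candidates: Iterable[Tuple[str, str]]) -> Any:
    """Return the first attribute that resolves among (module_path, attr_name) pairs.

    Hypothetical helper for illustration; it is not a vLLM API.
    """
    errors = []
    for module_path, attr_name in candidates:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, attr_name)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{module_path}.{attr_name}: {exc}")
    raise ImportError("none of the candidates resolved:\n" + "\n".join(errors))


# Candidate locations taken from the table above: new path first, old path second.
TOOL_PARSER_PATHS = [
    ("vllm.tool_parsers.abstract_tool_parser", "ToolParser"),
    ("vllm.entrypoints.openai.tool_parsers.abstract_tool_parser", "ToolParser"),
]
TOKENIZER_TYPE_PATHS = [
    ("vllm.tokenizers.protocol", "TokenizerLike"),
    ("vllm.transformers_utils.tokenizer", "AnyTokenizer"),
]
```

Inside a plugin, ToolParser = import_first(TOOL_PARSER_PATHS) would stand in for the direct import. This only covers the moved modules; the __init__ calling-convention change still needs the tools=None parameter.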

Patched plugin (works against v0.20.1)

Only imports + AnyTokenizer -> TokenizerLike rename + __init__ accepts tools; parsing logic is identical to NVIDIA's upstream.

The patched file (nemotron_parser.py) is the one listed in full under Code Example above.

Proposal

Either

  • accept this as a built-in nemotron_json parser under vllm/tool_parsers/ (the format <TOOLCALL>[{"name": ..., "arguments": ...}, ...]</TOOLCALL> is baked into the model's chat template, so it's a stable target), or
  • coordinate with NVIDIA to refresh the plugin in their HF model repo.

Happy with whichever. Flagging because the current state is silently broken for anyone following NVIDIA's official cookbook against current vLLM.
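For reference, the <TOOLCALL>[...]</TOOLCALL> format named in the proposal can be exercised without vLLM. The sketch below mirrors the plugin's non-streaming parsing steps using plain dicts instead of vLLM's ToolCall / FunctionCall types; parse_toolcalls and the sample string are hypothetical, and unlike the plugin it lets JSON errors propagate instead of logging them.

```python
import json
import re

TOOLCALL_START = "<TOOLCALL>"
TOOLCALL_RE = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)


def parse_toolcalls(model_output: str):
    """Split model output into (content, tool_calls).

    Standalone illustration of the plugin's logic, not the plugin itself.
    """
    match = TOOLCALL_RE.search(model_output)
    if match is None:
        # No complete <TOOLCALL>...</TOOLCALL> block: treat everything as content.
        return model_output, []

    payload = match.group(1).strip()
    # Tolerate a payload that omits the surrounding list brackets.
    if not payload.startswith("["):
        payload = "[" + payload
    if not payload.endswith("]"):
        payload = payload + "]"

    calls = []
    for item in json.loads(payload):
        args = item["arguments"]
        calls.append({
            "name": item["name"],
            # Arguments are passed on as a JSON string, as the plugin does.
            "arguments": json.dumps(args, ensure_ascii=False)
            if isinstance(args, dict) else args,
        })

    # Content is whatever precedes the last start tag; empty becomes None.
    content = model_output[: model_output.rfind(TOOLCALL_START)]
    return (content or None), calls


sample = (
    'Checking.<TOOLCALL>[{"name": "get_weather", '
    '"arguments": {"city": "Paris"}}]</TOOLCALL>'
)
content, calls = parse_toolcalls(sample)
```

Here content is the text before the tag and calls holds one entry whose arguments field is a JSON string, which is the shape an OpenAI-style tool_calls response expects.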

Reproduction

Environment: vLLM 0.20.1.

Command: vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 --enable-auto-tool-choice --tool-parser-plugin <upstream-plugin> --tool-call-parser nemotron_json

With the upstream plugin file: ImportError chain ending in KeyError: 'invalid tool call parser: nemotron_json'.

After patching imports only: the first request with tools=[…] raises TypeError: NemotronJSONToolParser.__init__() takes 2 positional arguments but 3 were given.
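The TypeError half of the failure can be shown without a vLLM install. This is a minimal sketch with stand-in classes: UpstreamStyleParser, PatchedStyleParser and instantiate are hypothetical names, and instantiate only mimics the tool_parser(tokenizer, request.tools) call shape described in this issue.

```python
class UpstreamStyleParser:
    """Stand-in for the Aug-2025 plugin: __init__ takes the tokenizer only."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer


class PatchedStyleParser:
    """Stand-in for the patched plugin: __init__ also accepts tools."""

    def __init__(self, tokenizer, tools=None):
        self.tokenizer = tokenizer
        self.tools = tools


def instantiate(parser_cls, tokenizer, tools):
    """Mimic the v0.20.x call shape: parser_cls(tokenizer, tools)."""
    return parser_cls(tokenizer, tools)


def try_instantiate(parser_cls) -> str:
    """Return "ok" or the TypeError text; "tok" is a placeholder tokenizer."""
    try:
        instantiate(parser_cls, tokenizer="tok", tools=[])
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


upstream_result = try_instantiate(UpstreamStyleParser)
patched_result = try_instantiate(PatchedStyleParser)
```

Defaulting tools to None keeps the patched signature callable with a single argument as well, so the same class still works where the parser is constructed with only a tokenizer.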

