vllm - 💡(How to fix) Fix [Feature]: Add nemotron_json as built-in tool parser (NVIDIA Nemotron-Nano-9B-v2 plugin breaks against v0.20.x module reorg)

vllm2026-05-08 13:21:17

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

nvidia/NVIDIA-Nemotron-Nano-9B-v2 ships an out-of-tree tool-call parser plugin (nemotron_toolcall_parser_no_streaming.py) that NVIDIA's own vLLM cookbook tells users to load via:

--enable-auto-tool-choice
--tool-parser-plugin "<repo>/nemotron_toolcall_parser_no_streaming.py"
--tool-call-parser nemotron_json

The cookbook pins vLLM to commit 75531a6c… (2025-08-15). The plugin file in NVIDIA's HF model repo has not been updated since.

Error Message

SPDX-License-Identifier: Apache-2.0

import json import re from typing import Union

from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest from vllm.entrypoints.openai.engine.protocol import ( DeltaMessage, ExtractedToolCallInformation, FunctionCall, ToolCall, ) from vllm.tool_parsers.abstract_tool_parser import ToolParser, ToolParserManager from vllm.logger import init_logger from vllm.tokenizers.protocol import TokenizerLike

logger = init_logger(name)

@ToolParserManager.register_module("nemotron_json") class NemotronJSONToolParser(ToolParser): def init(self, tokenizer: TokenizerLike, tools=None): super().init(tokenizer, tools) self.tool_call_start_token = "<TOOLCALL>" self.tool_call_end_token = "</TOOLCALL>" self.tool_call_regex = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)

def extract_tool_calls(
    self, model_output: str, request: ChatCompletionRequest
) -> ExtractedToolCallInformation:
    if self.tool_call_start_token not in model_output:
        return ExtractedToolCallInformation(
            tools_called=False, tool_calls=[], content=model_output
        )
    try:
        str_calls = self.tool_call_regex.findall(model_output)[0].strip()
        if not str_calls.startswith("["):
            str_calls = "[" + str_calls
        if not str_calls.endswith("]"):
            str_calls = str_calls + "]"
        tool_calls = []
        for tc in json.loads(str_calls):
            try:
                args = tc["arguments"]
                tool_calls.append(ToolCall(
                    type="function",
                    function=FunctionCall(
                        name=tc["name"],
                        arguments=json.dumps(args, ensure_ascii=False)
                            if isinstance(args, dict) else args,
                    ),
                ))
            except Exception:
                continue
        content = model_output[:model_output.rfind(self.tool_call_start_token)]
        return ExtractedToolCallInformation(
            tools_called=True, tool_calls=tool_calls,
            content=content if content else None,
        )
    except Exception:
        logger.exception("Error extracting tool call from: %s", model_output)
        return ExtractedToolCallInformation(
            tools_called=False, tool_calls=[], content=model_output
        )

def extract_tool_calls_streaming(self, *_args, **_kwargs) -> Union[DeltaMessage, None]:
    raise NotImplementedError("Streaming not supported")

Root Cause

Happy with whichever. Flagging because the current state is silently broken for anyone following NVIDIA's official cookbook against current vLLM.

Fix Action

Fix / Workaround

Patched plugin (works against v0.20.1)

vLLM 0.20.1 + vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 --enable-auto-tool-choice --tool-parser-plugin <upstream-plugin> --tool-call-parser nemotron_json with the upstream plugin file → ImportError chain ending in KeyError: 'invalid tool call parser: nemotron_json'. After patching imports, first request with tools=[…] raises TypeError: NemotronJSONToolParser.__init__() takes 2 positional arguments but 3 were given.

Code Example

# SPDX-License-Identifier: Apache-2.0

import json
import re
from typing import Union

from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
from vllm.entrypoints.openai.engine.protocol import (
    DeltaMessage,
    ExtractedToolCallInformation,
    FunctionCall,
    ToolCall,
)
from vllm.tool_parsers.abstract_tool_parser import ToolParser, ToolParserManager
from vllm.logger import init_logger
from vllm.tokenizers.protocol import TokenizerLike

logger = init_logger(__name__)


@ToolParserManager.register_module("nemotron_json")
class NemotronJSONToolParser(ToolParser):
    def __init__(self, tokenizer: TokenizerLike, tools=None):
        super().__init__(tokenizer, tools)
        self.tool_call_start_token = "<TOOLCALL>"
        self.tool_call_end_token = "</TOOLCALL>"
        self.tool_call_regex = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)

    def extract_tool_calls(
        self, model_output: str, request: ChatCompletionRequest
    ) -> ExtractedToolCallInformation:
        if self.tool_call_start_token not in model_output:
            return ExtractedToolCallInformation(
                tools_called=False, tool_calls=[], content=model_output
            )
        try:
            str_calls = self.tool_call_regex.findall(model_output)[0].strip()
            if not str_calls.startswith("["):
                str_calls = "[" + str_calls
            if not str_calls.endswith("]"):
                str_calls = str_calls + "]"
            tool_calls = []
            for tc in json.loads(str_calls):
                try:
                    args = tc["arguments"]
                    tool_calls.append(ToolCall(
                        type="function",
                        function=FunctionCall(
                            name=tc["name"],
                            arguments=json.dumps(args, ensure_ascii=False)
                                if isinstance(args, dict) else args,
                        ),
                    ))
                except Exception:
                    continue
            content = model_output[:model_output.rfind(self.tool_call_start_token)]
            return ExtractedToolCallInformation(
                tools_called=True, tool_calls=tool_calls,
                content=content if content else None,
            )
        except Exception:
            logger.exception("Error extracting tool call from: %s", model_output)
            return ExtractedToolCallInformation(
                tools_called=False, tool_calls=[], content=model_output
            )

    def extract_tool_calls_streaming(self, *_args, **_kwargs) -> Union[DeltaMessage, None]:
        raise NotImplementedError("Streaming not supported")

RAW_BUFFERClick to expand / collapse

🚀 The feature, motivation and pitch

Context

nvidia/NVIDIA-Nemotron-Nano-9B-v2 ships an out-of-tree tool-call parser plugin (nemotron_toolcall_parser_no_streaming.py) that NVIDIA's own vLLM cookbook tells users to load via:

--enable-auto-tool-choice
--tool-parser-plugin "<repo>/nemotron_toolcall_parser_no_streaming.py"
--tool-call-parser nemotron_json

The cookbook pins vLLM to commit 75531a6c… (2025-08-15). The plugin file in NVIDIA's HF model repo has not been updated since.

What breaks on v0.20.x

Three import paths in the plugin no longer resolve, plus the ToolParser.__init__ calling convention changed:

Symbol / surface	Old (Aug-2025 vLLM)	v0.20.1
`ChatCompletionRequest`	`vllm.entrypoints.openai.protocol`	`vllm.entrypoints.openai.chat_completion.protocol`
`FunctionCall, ToolCall, DeltaFunctionCall, DeltaToolCall, DeltaMessage, ExtractedToolCallInformation`	`vllm.entrypoints.openai.protocol`	`vllm.entrypoints.openai.engine.protocol`
`ToolParser, ToolParserManager`	`vllm.entrypoints.openai.tool_parsers.abstract_tool_parser`	`vllm.tool_parsers.abstract_tool_parser`
`AnyTokenizer`	`vllm.transformers_utils.tokenizer`	renamed to `TokenizerLike` in `vllm.tokenizers.protocol`
`ToolParser.__init__(tokenizer)`	one positional arg	now called as `tool_parser(tokenizer, request.tools)` (see `vllm/entrypoints/serve/render/serving.py`) — subclasses must accept the second arg

Result against current vLLM: server fails to start with KeyError: 'invalid tool call parser: nemotron_json' (plugin can't be imported), and even after fixing imports the parser raises TypeError: __init__() takes 2 positional arguments but 3 were given on the first request that carries tools=[…].

Patched plugin (works against v0.20.1)

Only imports + AnyTokenizer -> TokenizerLike rename + __init__ accepts tools; parsing logic is identical to NVIDIA's upstream.

<details> <summary>nemotron_parser.py</summary>

# SPDX-License-Identifier: Apache-2.0

import json
import re
from typing import Union

from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
from vllm.entrypoints.openai.engine.protocol import (
    DeltaMessage,
    ExtractedToolCallInformation,
    FunctionCall,
    ToolCall,
)
from vllm.tool_parsers.abstract_tool_parser import ToolParser, ToolParserManager
from vllm.logger import init_logger
from vllm.tokenizers.protocol import TokenizerLike

logger = init_logger(__name__)


@ToolParserManager.register_module("nemotron_json")
class NemotronJSONToolParser(ToolParser):
    def __init__(self, tokenizer: TokenizerLike, tools=None):
        super().__init__(tokenizer, tools)
        self.tool_call_start_token = "<TOOLCALL>"
        self.tool_call_end_token = "</TOOLCALL>"
        self.tool_call_regex = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)

    def extract_tool_calls(
        self, model_output: str, request: ChatCompletionRequest
    ) -> ExtractedToolCallInformation:
        if self.tool_call_start_token not in model_output:
            return ExtractedToolCallInformation(
                tools_called=False, tool_calls=[], content=model_output
            )
        try:
            str_calls = self.tool_call_regex.findall(model_output)[0].strip()
            if not str_calls.startswith("["):
                str_calls = "[" + str_calls
            if not str_calls.endswith("]"):
                str_calls = str_calls + "]"
            tool_calls = []
            for tc in json.loads(str_calls):
                try:
                    args = tc["arguments"]
                    tool_calls.append(ToolCall(
                        type="function",
                        function=FunctionCall(
                            name=tc["name"],
                            arguments=json.dumps(args, ensure_ascii=False)
                                if isinstance(args, dict) else args,
                        ),
                    ))
                except Exception:
                    continue
            content = model_output[:model_output.rfind(self.tool_call_start_token)]
            return ExtractedToolCallInformation(
                tools_called=True, tool_calls=tool_calls,
                content=content if content else None,
            )
        except Exception:
            logger.exception("Error extracting tool call from: %s", model_output)
            return ExtractedToolCallInformation(
                tools_called=False, tool_calls=[], content=model_output
            )

    def extract_tool_calls_streaming(self, *_args, **_kwargs) -> Union[DeltaMessage, None]:
        raise NotImplementedError("Streaming not supported")

</details>

Proposal

Either

accept this as a built-in nemotron_json parser under vllm/tool_parsers/ (the format <TOOLCALL>[{"name": ..., "arguments": ...}, ...]</TOOLCALL> is baked into the model's chat template, so it's a stable target), or
coordinate with NVIDIA to refresh the plugin in their HF model repo.

Happy with whichever. Flagging because the current state is silently broken for anyone following NVIDIA's official cookbook against current vLLM.

Reproduction

Alternatives

No response

Additional context

No response

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#container setup #orchestration issue #cache issue #memory leak #API versioning

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Feature]: Add nemotron_json as built-in tool parser (NVIDIA Nemotron-Nano-9B-v2 plugin breaks against v0.20.x module reorg)

Recommended Tools

GitHub issue graph ai analysis

Error Message

SPDX-License-Identifier: Apache-2.0

Root Cause

Fix Action

Fix / Workaround

Patched plugin (works against v0.20.1)

Code Example

🚀 The feature, motivation and pitch

Context

What breaks on v0.20.x

Patched plugin (works against v0.20.1)

Proposal

Reproduction

Alternatives

Additional context

Before submitting a new issue...

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Feature]: Add nemotron_json as built-in tool parser (NVIDIA Nemotron-Nano-9B-v2 plugin breaks against v0.20.x module reorg)

Recommended Tools

GitHub issue graph ai analysis

Error Message

SPDX-License-Identifier: Apache-2.0

Root Cause

Fix Action

Fix / Workaround

Patched plugin (works against v0.20.1)

Code Example

🚀 The feature, motivation and pitch

Context

What breaks on v0.20.x

Patched plugin (works against v0.20.1)

Proposal

Reproduction

Alternatives

Additional context

Before submitting a new issue...

Still need to ship something?

RELATED_DISCOVERY

TRENDING