vllm - ✅(Solved) Fix [RFC]: Unifying Tool Calling via Region-Scoped Guided Decoding, Tool-Aware Grammars, and Related Parsers [1 pull requests, 1 participants]

bitbottrap · 2026-04-15T00:46:22Z

vllm-project/vllm#39848•Fetched 2026-04-16 06:36:15

Root Cause

When tool_choice="required" is specified, the system avoids these parsing challenges by enforcing a global JSON schema via structured_outputs.json. However, this introduces a different set of problems. Because the schema is applied from the first token and defined as an array, decoding is forced to begin with [ and remain within JSON for the entire output. This bypasses model-native tool-call formats and suppresses reasoning output entirely.
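To make the "starts with [" point concrete, here is a minimal sketch of what a global array schema for tool_choice="required" could look like. The helper name required_array_schema and the exact schema layout are illustrative assumptions, not vLLM's actual code; the only property taken from the text above is that the top-level type is an array.

```python
import json


def required_array_schema(tools):
    """Hypothetical helper: one global schema, a non-empty array of calls."""
    per_tool = [
        {
            "type": "object",
            "properties": {
                "name": {"const": tool["name"]},
                "parameters": tool["parameters"],
            },
            "required": ["name", "parameters"],
        }
        for tool in tools
    ]
    return {"type": "array", "minItems": 1, "items": {"anyOf": per_tool}}


tools = [
    {
        "name": "search",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }
]
schema = required_array_schema(tools)

# Any output that conforms to an array schema serializes with a leading "[",
# so there is no room for reasoning text or model-native tool-call tokens.
conforming = json.dumps([{"name": "search", "parameters": {"query": "vllm"}}])
print(schema["type"], conforming[0])
```

Because the constraint applies from the first generated token, the model never gets a chance to emit free text before the array opens.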

Fix Action

Proposed (the linked PR is open and not yet merged)

Fixed by PR: [Tool Parser] Kimi K2: guided decoding for tool_choice="auto" — 75% → 100% schema accuracy (https://github.com/vllm-project/vllm/pull/36891)

PR fix notes

PR #36891: [Tool Parser] Kimi K2: guided decoding for tool_choice="auto" — 75% → 100% schema accuracy

Repository: vllm-project/vllm
Author: ZhanqiuHu
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36891

Description (problem / solution / changelog)

Co-authored with @Yzong-rh

Purpose

The Kimi K2 tool parser currently relies on post-hoc parsing for tool_choice="auto" — the model generates freely and vLLM extracts tool calls afterward. This works most of the time, but the model can hallucinate tool names not in the user's schema (e.g., calling img_gen when only search is available), causing schema validation failures.

This PR adds generation-time enforcement via xgrammar's structural tag mechanism, ensuring that once the model decides to make a tool call, it can only produce tool names and arguments that conform to the provided schema. This is the first tool parser in vLLM to use guided decoding for tool_choice="auto".

For background on Kimi K2 tool calling on vLLM, see: Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2's Tool-Calling on vLLM (https://blog.vllm.ai/2025/10/28/Kimi-K2-Accuracy.html).

Key benefits:

100% schema accuracy on the K2-Vendor-Verifier benchmark (https://github.com/MoonshotAI/K2-Vendor-Verifier), up from 75.4%, eliminating all tool name hallucination
Zero overhead for non-tool-call tokens — the grammar only activates after the <|tool_call_begin|> trigger, so free-text generation is unconstrained
Composable with existing behavior — tool_choice="required" and forced function still use the base class JSON schema path; this only fills the gap for "auto"
Generalizable pattern — the same TriggeredTagsFormat approach can be applied to other tool parsers (hermes, jamba, etc.) that suffer from similar hallucination issues

Summary

This is the first tool parser in vLLM to apply guided decoding for tool_choice="auto", and the approach generalizes to other parsers
Add xgrammar structural tag guided decoding to the Kimi K2 tool parser when tool_choice is "auto" or unset
Eliminates tool name hallucination (e.g., model calling img_gen when only search/urls_fetch_tool are available) by constraining generation at the token level
No change to tool_choice="required" or forced function behavior (handled by base class)

Approach

Override adjust_request() in KimiK2ToolParser to build a TriggeredTagsFormat structural tag from the request's tool definitions:

Trigger: <|tool_call_begin|> — free text allowed until this token
Per-tool tag: <|tool_call_begin|>{name}:\d+<|tool_call_argument_begin|>{json}<|tool_call_end|>
Composable content: sequence of regex (call ID) + const_string (argument marker) + json_schema (parameters)
Supports multiple tool calls per response (stop_after_first=False)
Respects existing structured_outputs if already set (e.g., by tool_choice="required")
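The list above can be sketched as a plain dictionary. This is a rough illustration, not the PR's diff: the function name build_structural_tag is hypothetical, the field names (triggered_tags, sequence, regex, const_string, json_schema, stop_after_first) follow the terms used in the description, and at_least_one is an assumed option that does not appear in the description. None of it has been checked against xgrammar or the changed file.

```python
import json

TRIGGER = "<|tool_call_begin|>"
ARG_MARKER = "<|tool_call_argument_begin|>"
END = "<|tool_call_end|>"


def build_structural_tag(tools):
    """Hypothetical helper: one tag per tool, all sharing a single trigger."""
    tags = []
    for tool in tools:
        fn = tool["function"]
        tags.append(
            {
                "type": "tag",
                # The tool name is baked into the tag's opening string.
                "begin": f"{TRIGGER}{fn['name']}:",
                "content": {
                    "type": "sequence",
                    "elements": [
                        {"type": "regex", "pattern": r"\d+"},  # call ID
                        {"type": "const_string", "value": ARG_MARKER},
                        {
                            "type": "json_schema",
                            "json_schema": fn.get("parameters", {"type": "object"}),
                        },
                    ],
                },
                "end": END,
            }
        )
    return {
        "type": "structural_tag",
        "format": {
            "type": "triggered_tags",
            "triggers": [TRIGGER],
            "tags": tags,
            "at_least_one": False,  # assumed option: "auto" may skip tools
            "stop_after_first": False,  # allow several calls per response
        },
    }


tools = [
    {
        "type": "function",
        "function": {
            "name": "search",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]
structural_tag = build_structural_tag(tools)
print(json.dumps(structural_tag, indent=2))
```

Since only the names in the request ever appear in a tag's opening string, a name such as img_gen has no valid continuation after the trigger.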

Evaluation

K2-Vendor-Verifier benchmark, 2000 samples, moonshotai/Kimi-K2-Instruct-0905 (revision 94a4053eb8863059dd8afc00937f054e1365abbd):

| Configuration | Tool Calls | Schema Errors | Accuracy |
|---|---|---|---|
| Baseline (no guided decoding) | 678 | 167 | 75.4% |
| This PR | 677 | 0 | 100% |

The dominant failure mode in the baseline was tool name hallucination — the model generating calls to tools not in the provided schema (e.g., img_gen). With structural tag enforcement, the grammar only allows tokens that match valid tool names after the <|tool_call_begin|> trigger.
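As a quick check on the reported numbers, the sketch below assumes accuracy is the share of tool calls without a schema error. That definition is an assumption (the description does not state the formula), but it reproduces both reported figures.

```python
def schema_accuracy(tool_calls, schema_errors):
    """Percentage of tool calls that passed schema validation (assumed formula)."""
    return 100.0 * (tool_calls - schema_errors) / tool_calls


baseline = schema_accuracy(678, 167)  # 511 of 678 calls valid
guided = schema_accuracy(677, 0)      # all 677 calls valid
print(f"{baseline:.1f}% -> {guided:.1f}%")
```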

Reproduction:

# Server
# Note: changing --revision might result in a regression (still being verified).
vllm serve moonshotai/Kimi-K2-Instruct-0905 \
  --revision 94a4053eb8863059dd8afc00937f054e1365abbd \
  --tensor-parallel-size 8 --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser kimi_k2

# Eval (using K2-Vendor-Verifier)
python tool_calls_eval.py downloads/tool-calls/samples.jsonl \
  --model moonshotai/Kimi-K2-Instruct-0905 \
  --base-url http://localhost:8000/v1 --api-key dummy \
  --concurrency 8 --temperature 0.6 --max-tokens 64000 \
  --output results.jsonl --summary summary.json

Caveats/Limitations

Performance not benchmarked — throughput/latency overhead of structural tag guided decoding has not been measured. The grammar only constrains tokens inside tool calls (not free text), so overhead should be minimal, but this needs validation.

Future work

Integrate per-function strict parameter for argument schema guidance (add strict to FunctionDefinition in vLLM's protocol layer first).
Generalize this approach to other tool parsers (hermes, jamba, etc.) that suffer from similar hallucination in tool_choice="auto"
Validate tool_choice='required' path.
Benchmark throughput/latency overhead of structural tag guided decoding vs. unconstrained generation

Changed files

vllm/tool_parsers/kimi_k2_tool_parser.py (modified, +63/-0)

Motivation.

This proposal outlines a unified approach to tool calling in vLLM that improves robustness, reduces parser complexity, and preserves model-native behavior. It achieves this through region-scoped guided decoding with tool-aware grammars and explicit boundary tracking during generation.

Tool calling in vLLM is currently implemented via two structurally different mechanisms that lead to inconsistent behavior, reduced robustness, and increasing parser complexity.

When tool_choice="auto" is used, generation is unconstrained and tool calls must be inferred after the fact. Parsers are responsible for detecting tool calls, identifying their boundaries, and reconstructing structured arguments. This introduces several well-known failure modes:

ambiguous or missed tool-call boundaries
incorrectly identifying free-form text as a tool call
malformed or partially emitted tool arguments
hallucinated tool names or incorrect tool selection
difficulty handling streaming or partial outputs
divergence across model-specific parsing implementations

These issues are not theoretical; they have been observed in practice and have driven repeated efforts to improve parser logic. The system has been described as fragmented and requiring redesign (see vllm issue #22918). Ongoing work on parser consolidation and flexibility (see vllm issue #27661) further highlights both the importance of this area and the growing complexity of the current approach.

This creates a fundamental tradeoff:

auto preserves model-native behavior but is prone to parsing errors
required enforces structure but overrides model-native formats

As a result, vLLM does not currently have a robust and unified approach to tool calling.

Recent work demonstrates that this tradeoff is not fundamental. The concept of trigger-based, region-scoped constrained decoding is described in vllm issue #32142. Early implementations exist in vllm pull request #32202 and vllm pull request #32232, and a concrete realization is demonstrated in vllm pull request #36891.

PR #36891 is particularly significant because it applies grammar-guided decoding only within the tool-call region, while preserving free-form generation outside it. By constraining tool names and arguments at generation time, it reduces failure modes such as hallucinated tool calls and malformed arguments. This demonstrates that generation-time enforcement can improve robustness without sacrificing model-native behavior.

At the same time, the current trajectory places significant emphasis on improving parser flexibility. While valuable, this approach treats parsing as the primary problem. In practice, parser complexity is a symptom of a deeper issue: the system does not explicitly represent structured regions during decoding and must infer them afterward.

This leads to duplication across parsers, inconsistent behavior across model families, and increasing maintenance cost. Even with improved parsers, heuristic detection remains inherently failure-prone.

Taken together, these observations suggest that:

robustness is currently achieved only by overriding model behavior
parser complexity continues to grow without eliminating failure modes
emerging work demonstrates a viable alternative based on region-scoped constraint

This RFC proposes adopting that alternative as a unified architectural model. By making structured regions explicit during decoding and enforcing both syntactic and semantic correctness within those regions, vLLM can improve robustness, reduce parser complexity, and provide a consistent tool-calling model across diverse model formats.

Proposed Change.

This proposal suggests adopting region-scoped guided decoding with tool-aware grammars and explicit boundary tracking as the primary architectural model for tool calling in vLLM.

The central goal is to improve robustness by shifting correctness guarantees from post-processing to generation time, while preserving model-native behavior.

Region-Scoped Guided Decoding

Instead of applying constraints globally or relying on post-hoc parsing, tool calls are treated as structured regions within the decoding process.

Generation proceeds as follows:

generation begins unconstrained
a model-specific trigger token is emitted
decoding transitions into a constrained mode
a grammar is enforced within the structured region
an end token is emitted
decoding returns to unconstrained mode

This aligns with the approach described in vllm issue #32142 and demonstrated in vllm pull request #36891.
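The generation steps listed above can be illustrated with a toy decoding loop. Everything here is a simplification for illustration: tokens are whole strings, the "model" is a list of ranked candidates per step, and toy_grammar stands in for a real grammar backend. The names decode, toy_grammar, and ALLOWED_TOOLS are hypothetical.

```python
from enum import Enum

TRIGGER = "<|tool_call_begin|>"
END = "<|tool_call_end|>"
ALLOWED_TOOLS = {"search"}


class Mode(Enum):
    FREE = "free"
    CONSTRAINED = "constrained"


def toy_grammar(region, candidate):
    """Toy stand-in for a grammar: tool name, then "{}", then the end token."""
    if len(region) == 0:
        return candidate in ALLOWED_TOOLS
    if len(region) == 1:
        return candidate == "{}"
    return candidate == END


def decode(proposals, grammar_allows):
    """Pick one token per step; consult the grammar only inside a region."""
    mode, region, out = Mode.FREE, [], []
    for candidates in proposals:
        if mode is Mode.FREE:
            token = candidates[0]  # unconstrained: take the top candidate
            if token == TRIGGER:
                mode, region = Mode.CONSTRAINED, []
        else:
            # Constrained: take the best candidate the grammar accepts.
            # (Raises StopIteration if none is accepted; fine for a sketch.)
            token = next(c for c in candidates if grammar_allows(region, c))
            if token == END:
                mode = Mode.FREE
            else:
                region.append(token)
        out.append(token)
    return out


proposals = [
    ["Let"], ["me"], [TRIGGER],
    ["img_gen", "search"],  # the model prefers a hallucinated tool name
    ["{}"], [END], ["done"],
]
print(decode(proposals, toy_grammar))
```

In this toy run the preferred but unknown name img_gen is skipped inside the region, while the tokens before the trigger and after the end token are taken as proposed.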

This model improves robustness by:

ensuring structured output is valid at generation time
preventing malformed or partially emitted tool calls
eliminating ambiguity around where structured regions begin and end
preserving reasoning and model-native formatting outside the constrained region

Explicit Tool Call Boundaries as a Runtime Primitive

Tool-call boundaries should be treated as a first-class runtime concept.

The system should explicitly track:

entry into a structured region
position within the region
exit from the region

This improves robustness by:

eliminating incorrect identification of text as tool calls
removing ambiguity in boundary detection
enabling deterministic parsing
supporting reliable streaming and partial outputs

These boundaries should be:

exposed to parsers
available during streaming
consistent across model families

This ensures that parsing operates on known structured regions, not inferred ones.
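A minimal sketch of boundary tracking over streamed text follows. It is string-based for readability, whereas a runtime primitive would more likely work on token IDs. The class name BoundaryTracker and its event format are hypothetical. For simplicity the sketch holds text back until the next marker arrives; a production version would flush text that cannot be the start of a marker.

```python
class BoundaryTracker:
    """Tracks entry into and exit from a structured region across chunks."""

    def __init__(self, begin, end):
        self.begin, self.end = begin, end
        self.buffer = ""        # text not yet assigned to an event
        self.in_region = False  # True between begin and end markers
        self.events = []        # list of ("text" | "region", content)

    def feed(self, chunk):
        self.buffer += chunk
        while True:
            marker = self.end if self.in_region else self.begin
            idx = self.buffer.find(marker)
            if idx == -1:
                break  # marker absent or still incomplete; wait for more
            content = self.buffer[:idx]
            if content or self.in_region:
                kind = "region" if self.in_region else "text"
                self.events.append((kind, content))
            self.buffer = self.buffer[idx + len(marker):]
            self.in_region = not self.in_region


tracker = BoundaryTracker("<|tool_call_begin|>", "<|tool_call_end|>")
# The begin marker is deliberately split across two chunks.
for chunk in ["Hello <|tool_", "call_begin|>search:0", "<|tool_call_end|> bye"]:
    tracker.feed(chunk)
print(tracker.events)
```

Because the buffer accumulates across calls, a marker split between chunks is still detected, which is the streaming case the list above calls out.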

Tool-Aware Grammar Generation

Grammars should be dynamically generated from the tools exposed in the request.

Rather than enforcing only syntactic structure, the grammar enforces semantic correctness by:

restricting tool names to the provided set
enforcing argument schemas per tool
encoding required and optional parameters
preventing semantically invalid tool calls

This directly addresses common failure modes such as:

hallucinated tool names
invalid argument structures
incorrect tool selection

This capability is partially demonstrated in vllm pull request #36891 and should be generalized across all tool-calling paths.

Grammars can be constructed using model-specific templates (e.g., JSON, XML, token-based formats), ensuring compatibility with diverse model conventions while maintaining correctness guarantees.
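The constraints a tool-aware grammar would encode can be written out as a plain check. This sketch validates after the fact rather than constraining generation, so it only illustrates which calls the grammar would admit. The names compile_tool_constraints and is_valid_call are hypothetical, and rejecting undeclared arguments is an assumption (equivalent to additionalProperties being false).

```python
import re


def compile_tool_constraints(tools):
    """Derive a name pattern and per-tool parameter sets from the request."""
    names = sorted(tool["name"] for tool in tools)
    # "(?!)" never matches, so an empty tool list admits no name at all.
    pattern = "|".join(re.escape(n) for n in names) or r"(?!)"
    specs = {}
    for tool in tools:
        params = tool.get("parameters", {})
        specs[tool["name"]] = (
            set(params.get("required", [])),   # must be present
            set(params.get("properties", {})),  # may be present
        )
    return re.compile(pattern), specs


def is_valid_call(name, args, name_pattern, specs):
    if not name_pattern.fullmatch(name):
        return False  # tool name not in the provided set
    required, allowed = specs[name]
    keys = set(args)
    return required <= keys <= allowed


tools = [
    {
        "name": "search",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}, "limit": {"type": "integer"}},
            "required": ["query"],
        },
    },
    {"name": "urls_fetch_tool", "parameters": {"type": "object", "properties": {}}},
]
name_pattern, specs = compile_tool_constraints(tools)
print(is_valid_call("search", {"query": "vllm"}, name_pattern, specs))
print(is_valid_call("img_gen", {}, name_pattern, specs))
```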

Parser Alignment

Parsers should be aligned with decoding rather than reconstructing structure afterward.

Under this model, parsers would:

define trigger, begin, and end tokens
define grammars for structured regions
extract structured data from bounded regions

This reduces parser complexity by:

eliminating heuristic detection
removing full-text scanning
enabling reuse across model families
ensuring consistent behavior

Parsing becomes a deterministic extraction step, rather than a probabilistic reconstruction process.
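Under this model a parser reduces to a small format description plus an extraction function. The sketch below assumes the region text has already been isolated by the runtime and follows the Kimi K2 layout quoted earlier ({name}:{index}, argument marker, JSON). RegionFormat, ToolCall, and extract_tool_call are hypothetical names, not vLLM classes.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionFormat:
    """Model-specific markers a parser would declare."""
    begin: str
    arg_marker: str
    end: str


@dataclass
class ToolCall:
    name: str
    index: int
    arguments: dict


def extract_tool_call(region, fmt):
    """Parse text that lies strictly between fmt.begin and fmt.end."""
    header, _, payload = region.partition(fmt.arg_marker)
    name, _, index = header.rpartition(":")
    # json.loads cannot fail here if the region was grammar-constrained.
    return ToolCall(name=name, index=int(index), arguments=json.loads(payload))


KIMI_K2 = RegionFormat(
    begin="<|tool_call_begin|>",
    arg_marker="<|tool_call_argument_begin|>",
    end="<|tool_call_end|>",
)
call = extract_tool_call(
    'search:0<|tool_call_argument_begin|>{"query": "vllm"}', KIMI_K2
)
print(call)
```

There is no scanning of surrounding text and no heuristic recovery; the function relies on the region having been produced under the grammar.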

Tool Choice Semantics

This proposal suggests refining the meaning of tool_choice to reflect semantic intent rather than output format.

Under this model:

tool_choice="auto" allows optional entry into structured regions
tool_choice="required" ensures that at least one structured region is entered

Neither mode requires global JSON constraint.

This removes the current tradeoff between robustness and model alignment.
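One way to read these semantics is as a mapping from tool_choice to region policy flags rather than to an output format. The sketch is illustrative only: region_policy is a hypothetical function, stop_after_first is borrowed from the PR description, and at_least_one is an assumed flag name for "at least one structured region must be entered".

```python
def region_policy(tool_choice):
    """Map tool_choice to flags for a trigger-based, region-scoped grammar."""
    if tool_choice in (None, "auto"):
        # Entering a tool-call region is optional.
        return {"at_least_one": False, "stop_after_first": False}
    if tool_choice == "required":
        # Same grammar; the only change is that one region must occur.
        return {"at_least_one": True, "stop_after_first": False}
    raise ValueError(f"unsupported tool_choice in this sketch: {tool_choice!r}")


print(region_policy("auto"))
print(region_policy("required"))
```

Both modes share the same per-tool grammar, so text outside the region stays unconstrained in either case.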

Relationship to Existing Work

This proposal builds on and unifies several ongoing efforts:

vllm issue #22918
vllm issue #27661
vllm issue #32142
vllm pull request #32202
vllm pull request #36891

It does not replace these efforts; it provides a unifying architectural direction.

Implementation Pathway

A phased approach can be used:

Phase 1: Enable structural-tag constraint for required

replace global JSON constraint with region-scoped constraint

Phase 2: Unify auto and required

both use region-scoped decoding

Phase 3: Introduce boundary-aware runtime

track structured regions during decoding

Phase 4: Generalize grammar system

support multiple formats and templates

Phase 5: Deprecate legacy paths

remove global JSON constraint
reduce reliance on heuristic parsing

Benefits

This approach improves:

robustness of tool calling
correctness of generated tool calls
consistency across models
parser simplicity and reuse
streaming reliability
alignment with model-native formats

Non-Goals / Tradeoffs

This proposal does not attempt to:

eliminate all parser logic; parsers remain necessary for extraction and model-specific configuration
mandate a single tool-call format; multiple formats (JSON, XML, token-based) are supported
fully replace existing implementations immediately; a phased migration is expected

Potential tradeoffs include:

increased complexity in decoding logic due to dynamic constraint switching
need for careful integration with batching and scheduling
additional work to standardize grammar generation across backends
possible performance considerations when activating grammars dynamically

However, these tradeoffs are balanced by improved robustness, reduced parser complexity, and a more consistent architecture.

Feedback Period.

Feedback period: TBD (maintainer guidance requested)

CC List.

Please add if relevant.

Any Other Things.

This proposal is intended to align with ongoing development rather than replace it. Recent work suggests that vLLM is already moving toward region-scoped guided decoding in specific contexts. This RFC proposes making that direction explicit and consistent across the system.

Further feedback on boundary tracking, grammar abstraction, and integration strategy would be especially valuable.

extent analysis

TL;DR

Adopt region-scoped guided decoding with tool-aware grammars and explicit boundary tracking to improve robustness and reduce parser complexity in vLLM tool calling.

Guidance

Implement region-scoped guided decoding to enforce syntactic and semantic correctness within structured regions, as demonstrated in vllm pull request #36891.
Introduce explicit tool call boundaries as a runtime primitive to eliminate ambiguity and improve parsing reliability.
Develop a dynamic grammar generation system that restricts tool names and enforces argument schemas to prevent common failure modes.
Align parsers with the new decoding model to reduce complexity and enable deterministic extraction of structured data.
Refine the meaning of tool_choice to reflect semantic intent rather than output format, allowing for more flexible and robust tool calling.

Example

No specific code example is provided, as the proposal focuses on architectural changes and high-level design. However, the mentioned pull requests (e.g., #36891) may contain relevant implementation details.

Notes

The proposed changes aim to address the tradeoff between robustness and model-native behavior in vLLM tool calling. While the new approach may introduce additional complexity in decoding logic, it is expected to improve overall robustness and reduce parser complexity.

Recommendation

Adopt region-scoped guided decoding with tool-aware grammars and explicit boundary tracking. This is an architectural direction rather than a workaround, and it offers a more robust and consistent approach to tool calling in vLLM. It has been partially demonstrated in existing pull requests (notably #36891, which is still open) and aligns with ongoing development efforts.
