vllm - ✅(Solved) Fix [RFC]: Unifying Tool Calling via Region-Scoped Guided Decoding, Tool-Aware Grammars, and Related Parsers [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#39848 · Fetched 2026-04-16 06:36:15
Comments: 0 · Participants: 1 · Timeline events: 4 (cross-referenced ×3, labeled ×1) · Reactions: 0

Root Cause

When tool_choice="required" is specified, the system avoids these parsing challenges by enforcing a global JSON schema via structured_outputs.json. However, this introduces a different set of problems. Because the schema is applied from the first token and defined as an array, decoding is forced to begin with [ and remain within JSON for the entire output. This bypasses model-native tool-call formats and suppresses reasoning output entirely.

Fix Action

Fixed

PR fix notes

PR #36891: [Tool Parser] Kimi K2: guided decoding for tool_choice="auto" — 75% → 100% schema accuracy

Description (problem / solution / changelog)

Co-authored with @Yzong-rh

Purpose

The Kimi K2 tool parser currently relies on post-hoc parsing for tool_choice="auto" — the model generates freely and vLLM extracts tool calls afterward. This works most of the time, but the model can hallucinate tool names not in the user's schema (e.g., calling img_gen when only search is available), causing schema validation failures.

This PR adds generation-time enforcement via xgrammar's structural tag mechanism, ensuring that once the model decides to make a tool call, it can only produce tool names and arguments that conform to the provided schema. This is the first tool parser in vLLM to use guided decoding for tool_choice="auto".

For background on Kimi K2 tool calling on vLLM, see: Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2's Tool-Calling on vLLM.

Key benefits:

  • 100% schema accuracy on the K2-Vendor-Verifier benchmark (up from 75.4%), eliminating all tool name hallucination
  • Zero overhead for non-tool-call tokens — the grammar only activates after the <|tool_call_begin|> trigger, so free-text generation is unconstrained
  • Composable with existing behavior: tool_choice="required" and forced function still use the base class JSON schema path; this only fills the gap for "auto"
  • Generalizable pattern — the same TriggeredTagsFormat approach can be applied to other tool parsers (hermes, jamba, etc.) that suffer from similar hallucination issues

Summary

  • This is the first tool parser in vLLM to apply guided decoding for tool_choice="auto", and the approach generalizes to other parsers
  • Add xgrammar structural tag guided decoding to the Kimi K2 tool parser when tool_choice is "auto" or unset
  • Eliminates tool name hallucination (e.g., model calling img_gen when only search/urls_fetch_tool are available) by constraining generation at the token level
  • No change to tool_choice="required" or forced function behavior (handled by base class)

Approach

Override adjust_request() in KimiK2ToolParser to build a TriggeredTagsFormat structural tag from the request's tool definitions:

  • Trigger: <|tool_call_begin|> — free text allowed until this token
  • Per-tool tag: <|tool_call_begin|>{name}:\d+<|tool_call_argument_begin|>{json}<|tool_call_end|>
  • Composable content: sequence of regex (call ID) + const_string (argument marker) + json_schema (parameters)
  • Supports multiple tool calls per response (stop_after_first=False)
  • Respects existing structured_outputs if already set (e.g., by tool_choice="required")
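The bullets above can be sketched as a plain dictionary builder. This is a minimal illustration, not the PR's code: the helper name `build_triggered_tags` and the exact field names (`triggers`, `tags`, `elements`, `pattern`, `value`) are assumptions modeled on the description above, and should be checked against the xgrammar structural tag documentation before use.

```python
from typing import Any

# Special tokens named in the PR description.
TOOL_CALL_BEGIN = "<|tool_call_begin|>"
TOOL_CALL_ARG_BEGIN = "<|tool_call_argument_begin|>"
TOOL_CALL_END = "<|tool_call_end|>"


def build_triggered_tags(tools: list[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper: one tag per tool, activated by a shared trigger."""
    tags = []
    for tool in tools:
        fn = tool["function"]
        tags.append({
            # The tool name is baked into the tag prefix, so only names from
            # the request can follow the trigger.
            "begin": f"{TOOL_CALL_BEGIN}{fn['name']}:",
            "content": {
                "type": "sequence",
                "elements": [
                    {"type": "regex", "pattern": r"\d+"},  # call ID
                    {"type": "const_string", "value": TOOL_CALL_ARG_BEGIN},
                    {"type": "json_schema",
                     "json_schema": fn.get("parameters", {"type": "object"})},
                ],
            },
            "end": TOOL_CALL_END,
        })
    return {
        "type": "triggered_tags",
        "triggers": [TOOL_CALL_BEGIN],  # free text is allowed until this token
        "tags": tags,
        "stop_after_first": False,  # allow multiple tool calls per response
    }


example_tools = [{
    "type": "function",
    "function": {
        "name": "search",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]
structural_tag = build_triggered_tags(example_tools)
```

Because each tool gets its own tag prefix, a name such as img_gen that is absent from the request has no tag and therefore no valid continuation after the trigger.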

Evaluation

K2-Vendor-Verifier benchmark, 2000 samples, moonshotai/Kimi-K2-Instruct-0905 (revision 94a4053eb8863059dd8afc00937f054e1365abbd):

| Configuration | Tool Calls | Schema Errors | Accuracy |
|---|---|---|---|
| Baseline (no guided decoding) | 678 | 167 | 75.4% |
| This PR | 677 | 0 | 100% |

The dominant failure mode in the baseline was tool name hallucination — the model generating calls to tools not in the provided schema (e.g., img_gen). With structural tag enforcement, the grammar only allows tokens that match valid tool names after the <|tool_call_begin|> trigger.
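The reported percentages are consistent with defining accuracy as the share of tool calls that have no schema error. That definition is an assumption (the PR text does not spell it out), and `schema_accuracy` is a hypothetical helper used only to check the arithmetic:

```python
def schema_accuracy(tool_calls: int, schema_errors: int) -> float:
    """Percentage of tool calls without a schema error (assumed definition)."""
    return 100.0 * (tool_calls - schema_errors) / tool_calls


baseline = schema_accuracy(678, 167)  # baseline counts from the evaluation
guided = schema_accuracy(677, 0)      # counts reported for this PR
print(f"baseline: {baseline:.1f}%, guided: {guided:.1f}%")
```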

Reproduction:

# Server
# Note: changing --revision might result in a regression (still being verified).
vllm serve moonshotai/Kimi-K2-Instruct-0905 \
  --revision 94a4053eb8863059dd8afc00937f054e1365abbd \
  --tensor-parallel-size 8 --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser kimi_k2

# Eval (using K2-Vendor-Verifier)
python tool_calls_eval.py downloads/tool-calls/samples.jsonl \
  --model moonshotai/Kimi-K2-Instruct-0905 \
  --base-url http://localhost:8000/v1 --api-key dummy \
  --concurrency 8 --temperature 0.6 --max-tokens 64000 \
  --output results.jsonl --summary summary.json

Caveats/Limitations

  • Performance not benchmarked — throughput/latency overhead of structural tag guided decoding has not been measured. The grammar only constrains tokens inside tool calls (not free text), so overhead should be minimal, but this needs validation.

Future work

  • Integrate per-function strict parameter for argument schema guidance (add strict to FunctionDefinition in vLLM's protocol layer first).
  • Generalize this approach to other tool parsers (hermes, jamba, etc.) that suffer from similar hallucination in tool_choice="auto"
  • Validate tool_choice='required' path.
  • Benchmark throughput/latency overhead of structural tag guided decoding vs. unconstrained generation


Changed files

  • vllm/tool_parsers/kimi_k2_tool_parser.py (modified, +63/-0)

Motivation.

This proposal outlines a unified approach to tool calling in vLLM that improves robustness, reduces parser complexity, and preserves model-native behavior. It achieves this through region-scoped guided decoding with tool-aware grammars and explicit boundary tracking during generation.

Tool calling in vLLM is currently implemented via two structurally different mechanisms that lead to inconsistent behavior, reduced robustness, and increasing parser complexity.

When tool_choice="auto" is used, generation is unconstrained and tool calls must be inferred after the fact. Parsers are responsible for detecting tool calls, identifying their boundaries, and reconstructing structured arguments. This introduces several well-known failure modes:

  • ambiguous or missed tool-call boundaries
  • incorrectly identifying free-form text as a tool call
  • malformed or partially emitted tool arguments
  • hallucinated tool names or incorrect tool selection
  • difficulty handling streaming or partial outputs
  • divergence across model-specific parsing implementations

These issues are not theoretical; they have been observed in practice and have driven repeated efforts to improve parser logic. The system has been described as fragmented and requiring redesign (see vllm issue #22918). Ongoing work on parser consolidation and flexibility (see vllm issue #27661) further highlights both the importance of this area and the growing complexity of the current approach.

When tool_choice="required" is specified, the system avoids these parsing challenges by enforcing a global JSON schema via structured_outputs.json. However, this introduces a different set of problems. Because the schema is applied from the first token and defined as an array, decoding is forced to begin with [ and remain within JSON for the entire output. This bypasses model-native tool-call formats and suppresses reasoning output entirely.
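To see why an array-typed global schema forces the output to start with [, consider the following sketch. The schema shape and the helper `required_tools_schema` are illustrative assumptions, not a copy of vLLM's actual schema; the point is only that every instance of an array schema serializes with a leading bracket, leaving no room for reasoning text or model-native tool-call tokens.

```python
import json
from typing import Any


def required_tools_schema(tools: list[dict[str, Any]]) -> dict[str, Any]:
    """Illustrative global schema: a non-empty array of {name, parameters}."""
    def call_schema(tool: dict[str, Any]) -> dict[str, Any]:
        fn = tool["function"]
        return {
            "type": "object",
            "properties": {
                "name": {"const": fn["name"]},
                "parameters": fn.get("parameters", {"type": "object"}),
            },
            "required": ["name", "parameters"],
        }

    return {
        "type": "array",
        "minItems": 1,
        "items": {"anyOf": [call_schema(t) for t in tools]},
    }


demo_tools = [{"type": "function", "function": {"name": "search"}}]
schema = required_tools_schema(demo_tools)

# Any output that satisfies an array schema is a JSON array, so its first
# character is always "[" when the constraint applies from the first token.
valid_output = json.dumps([{"name": "search", "parameters": {"query": "vllm"}}])
first_char = valid_output[0]
```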

This creates a fundamental tradeoff:

  • auto preserves model-native behavior but is prone to parsing errors
  • required enforces structure but overrides model-native formats

As a result, vLLM does not currently have a robust and unified approach to tool calling.

Recent work demonstrates that this tradeoff is not fundamental. The concept of trigger-based, region-scoped constrained decoding is described in vllm issue #32142. Early implementations exist in vllm pull request #32202 and vllm pull request #32232, and a concrete realization is demonstrated in vllm pull request #36891.

PR #36891 is particularly significant because it applies grammar-guided decoding only within the tool-call region, while preserving free-form generation outside it. By constraining tool names and arguments at generation time, it reduces failure modes such as hallucinated tool calls and malformed arguments. This demonstrates that generation-time enforcement can improve robustness without sacrificing model-native behavior.

At the same time, the current trajectory places significant emphasis on improving parser flexibility. While valuable, this approach treats parsing as the primary problem. In practice, parser complexity is a symptom of a deeper issue: the system does not explicitly represent structured regions during decoding and must infer them afterward.

This leads to duplication across parsers, inconsistent behavior across model families, and increasing maintenance cost. Even with improved parsers, heuristic detection remains inherently failure-prone.

Taken together, these observations suggest that:

  • robustness is currently achieved only by overriding model behavior
  • parser complexity continues to grow without eliminating failure modes
  • emerging work demonstrates a viable alternative based on region-scoped constraint

This RFC proposes adopting that alternative as a unified architectural model. By making structured regions explicit during decoding and enforcing both syntactic and semantic correctness within those regions, vLLM can improve robustness, reduce parser complexity, and provide a consistent tool-calling model across diverse model formats.

Proposed Change.

This proposal suggests adopting region-scoped guided decoding with tool-aware grammars and explicit boundary tracking as the primary architectural model for tool calling in vLLM.

The central goal is to improve robustness by shifting correctness guarantees from post-processing to generation time, while preserving model-native behavior.


Region-Scoped Guided Decoding

Instead of applying constraints globally or relying on post-hoc parsing, tool calls are treated as structured regions within the decoding process.

Generation proceeds as follows:

  • generation begins unconstrained
  • a model-specific trigger token is emitted
  • decoding transitions into a constrained mode
  • a grammar is enforced within the structured region
  • an end token is emitted
  • decoding returns to unconstrained mode

This aligns with the approach described in vllm issue #32142 and demonstrated in vllm pull request #36891.
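The generation steps listed above amount to a two-state machine. The sketch below simulates only the mode switching over an already-tokenized stream; the class and method names are hypothetical, and a real engine would apply a grammar mask to the logits while in the constrained state rather than merely labeling tokens.

```python
from enum import Enum


class Mode(Enum):
    FREE = "free"
    CONSTRAINED = "constrained"


class RegionScopedDecoder:
    """Hypothetical mode tracker: free text until a trigger, grammar until the end token."""

    def __init__(self, trigger: str, end: str) -> None:
        self.trigger = trigger
        self.end = end
        self.mode = Mode.FREE

    def step(self, token: str) -> Mode:
        """Consume one emitted token and return the mode it was sampled under."""
        if self.mode is Mode.FREE:
            if token == self.trigger:
                # The trigger itself is sampled freely; the tokens after it are not.
                self.mode = Mode.CONSTRAINED
            return Mode.FREE
        # Constrained: a real engine would mask logits with the grammar here.
        if token == self.end:
            self.mode = Mode.FREE
        return Mode.CONSTRAINED


decoder = RegionScopedDecoder("<|tool_call_begin|>", "<|tool_call_end|>")
stream = ["Let", "me", "search", "<|tool_call_begin|>",
          "search:0", "{}", "<|tool_call_end|>", "done"]
modes = [decoder.step(token).value for token in stream]
```

Text before the trigger and after the end token is labeled free, which is what preserves reasoning and model-native formatting outside the tool call.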

This model improves robustness by:

  • ensuring structured output is valid at generation time
  • preventing malformed or partially emitted tool calls
  • eliminating ambiguity around where structured regions begin and end
  • preserving reasoning and model-native formatting outside the constrained region

Explicit Tool Call Boundaries as a Runtime Primitive

Tool-call boundaries should be treated as a first-class runtime concept.

The system should explicitly track:

  • entry into a structured region
  • position within the region
  • exit from the region

This improves robustness by:

  • eliminating incorrect identification of text as tool calls
  • removing ambiguity in boundary detection
  • enabling deterministic parsing
  • supporting reliable streaming and partial outputs

These boundaries should be:

  • exposed to parsers
  • available during streaming
  • consistent across model families

This ensures that parsing operates on known structured regions, not inferred ones.
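A boundary tracker of this kind could look roughly as follows. This is a sketch under assumptions: `BoundaryTracker` and its fields are hypothetical names, and the tracker works on token strings for readability, whereas a runtime would more likely track token IDs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BoundaryTracker:
    """Hypothetical runtime primitive recording entry, position, and exit."""
    begin_token: str
    end_token: str
    position: int = 0                 # index of the next token to observe
    open_start: Optional[int] = None  # start index of the region in progress
    regions: list[tuple[int, int]] = field(default_factory=list)

    def observe(self, token: str) -> Optional[str]:
        """Record one streamed token; return "enter", "exit", or None."""
        event = None
        if self.open_start is None and token == self.begin_token:
            self.open_start = self.position
            event = "enter"
        elif self.open_start is not None and token == self.end_token:
            # Store inclusive (start, end) token indices of the closed region.
            self.regions.append((self.open_start, self.position))
            self.open_start = None
            event = "exit"
        self.position += 1
        return event

    @property
    def in_region(self) -> bool:
        return self.open_start is not None


tracker = BoundaryTracker("<|tool_call_begin|>", "<|tool_call_end|>")
events = [tracker.observe(tok) for tok in
          ["hi", "<|tool_call_begin|>", "search:0", "{}", "<|tool_call_end|>", "bye"]]
```

A streaming parser that receives these events no longer has to scan the accumulated text to decide whether it is inside a tool call.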


Tool-Aware Grammar Generation

Grammars should be dynamically generated from the tools exposed in the request.

Rather than enforcing only syntactic structure, the grammar enforces semantic correctness by:

  • restricting tool names to the provided set
  • enforcing argument schemas per tool
  • encoding required and optional parameters
  • preventing semantically invalid tool calls

This directly addresses common failure modes such as:

  • hallucinated tool names
  • invalid argument structures
  • incorrect tool selection

This capability is partially demonstrated in vllm pull request #36891 and should be generalized across all tool-calling paths.

Grammars can be constructed using model-specific templates (e.g., JSON, XML, token-based formats), ensuring compatibility with diverse model conventions while maintaining correctness guarantees.
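As a small illustration of deriving constraints from the request, the sketch below builds a regular expression that admits only the provided tool names and splits each tool's parameters into required and optional sets. Both helper names are hypothetical, and a regex stands in for whatever grammar formalism a backend actually uses.

```python
import re


def tool_name_pattern(tools: list[dict]) -> re.Pattern:
    """Regex that matches exactly the tool names present in the request."""
    names = [tool["function"]["name"] for tool in tools]
    # re.escape keeps names containing regex metacharacters literal.
    return re.compile("|".join(re.escape(name) for name in names))


def split_parameters(tool: dict) -> tuple[set[str], set[str]]:
    """Return (required, optional) parameter names from a tool's JSON schema."""
    schema = tool["function"].get("parameters", {})
    properties = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    return required, properties - required


request_tools = [{
    "type": "function",
    "function": {
        "name": "search",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"},
                           "limit": {"type": "integer"}},
            "required": ["query"],
        },
    },
}]
pattern = tool_name_pattern(request_tools)
required_params, optional_params = split_parameters(request_tools[0])
```

With this pattern, `pattern.fullmatch("img_gen")` returns `None`, which mirrors how a tool-aware grammar leaves a hallucinated name with no valid token path.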


Parser Alignment

Parsers should be aligned with decoding rather than reconstructing structure afterward.

Under this model, parsers would:

  • define trigger, begin, and end tokens
  • define grammars for structured regions
  • extract structured data from bounded regions

This reduces parser complexity by:

  • eliminating heuristic detection
  • removing full-text scanning
  • enabling reuse across model families
  • ensuring consistent behavior

Parsing becomes a deterministic extraction step, rather than a probabilistic reconstruction process.
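Once a region is known to be bounded and grammar-valid, extraction reduces to slicing. The sketch below assumes the Kimi K2 layout quoted in the PR notes (name, colon, numeric call ID, argument marker, JSON); `RegionSpec` and `extract_tool_call` are hypothetical names, not existing vLLM interfaces.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionSpec:
    """Hypothetical per-model configuration a parser would declare."""
    begin: str
    arg_marker: str
    end: str


def extract_tool_call(region_text: str, spec: RegionSpec) -> dict:
    """Extract name, call ID, and arguments from one bounded region."""
    if not (region_text.startswith(spec.begin) and region_text.endswith(spec.end)):
        raise ValueError("text is not a complete bounded region")
    body = region_text[len(spec.begin):len(region_text) - len(spec.end)]
    header, _, raw_args = body.partition(spec.arg_marker)
    name, _, call_id = header.rpartition(":")
    return {"name": name, "id": int(call_id), "arguments": json.loads(raw_args)}


kimi_spec = RegionSpec(
    begin="<|tool_call_begin|>",
    arg_marker="<|tool_call_argument_begin|>",
    end="<|tool_call_end|>",
)
region = ('<|tool_call_begin|>search:0<|tool_call_argument_begin|>'
          '{"query": "vllm"}<|tool_call_end|>')
call = extract_tool_call(region, kimi_spec)
```

There is no scanning of free text and no heuristic boundary detection here; supporting another model family would mean supplying a different `RegionSpec` rather than a new parsing routine.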


Tool Choice Semantics

This proposal suggests refining the meaning of tool_choice to reflect semantic intent rather than output format.

Under this model:

  • tool_choice="auto" allows optional entry into structured regions
  • tool_choice="required" ensures that at least one structured region is entered

Neither mode requires global JSON constraint.

This removes the current tradeoff between robustness and model alignment.
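Under this reading, the two modes differ only in whether entering a region is mandatory. The mapping below is a sketch: the function `region_policy` and the flag name `at_least_one` are assumptions for illustration, and other `tool_choice` values are deliberately left out because the proposal does not discuss them.

```python
def region_policy(tool_choice: str) -> dict[str, bool]:
    """Map tool_choice to a hypothetical region-entry policy."""
    if tool_choice == "auto":
        # The model may enter a structured region, or answer in free text.
        return {"at_least_one": False, "global_json": False}
    if tool_choice == "required":
        # At least one structured region must be entered; text around it stays free.
        return {"at_least_one": True, "global_json": False}
    raise ValueError(f"tool_choice not covered by this sketch: {tool_choice!r}")


auto_policy = region_policy("auto")
required_policy = region_policy("required")
```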


Relationship to Existing Work

This proposal builds on and unifies several ongoing efforts:

  • vllm issue #22918
  • vllm issue #27661
  • vllm issue #32142
  • vllm pull request #32202
  • vllm pull request #36891

It does not replace these efforts; it provides a unifying architectural direction.


Implementation Pathway

A phased approach can be used:

Phase 1: Enable structural-tag constraint for required

  • replace global JSON constraint with region-scoped constraint

Phase 2: Unify auto and required

  • both use region-scoped decoding

Phase 3: Introduce boundary-aware runtime

  • track structured regions during decoding

Phase 4: Generalize grammar system

  • support multiple formats and templates

Phase 5: Deprecate legacy paths

  • remove global JSON constraint
  • reduce reliance on heuristic parsing

Benefits

This approach improves:

  • robustness of tool calling
  • correctness of generated tool calls
  • consistency across models
  • parser simplicity and reuse
  • streaming reliability
  • alignment with model-native formats

Non-Goals / Tradeoffs

This proposal does not attempt to:

  • eliminate all parser logic; parsers remain necessary for extraction and model-specific configuration
  • mandate a single tool-call format; multiple formats (JSON, XML, token-based) are supported
  • fully replace existing implementations immediately; a phased migration is expected

Potential tradeoffs include:

  • increased complexity in decoding logic due to dynamic constraint switching
  • need for careful integration with batching and scheduling
  • additional work to standardize grammar generation across backends
  • possible performance considerations when activating grammars dynamically

However, these tradeoffs are balanced by improved robustness, reduced parser complexity, and a more consistent architecture.

Feedback Period.

Feedback period: TBD (maintainer guidance requested)

CC List.

Please add if relevant.

Any Other Things.

This proposal is intended to align with ongoing development rather than replace it. Recent work suggests that vLLM is already moving toward region-scoped guided decoding in specific contexts. This RFC proposes making that direction explicit and consistent across the system.

Further feedback on boundary tracking, grammar abstraction, and integration strategy would be especially valuable.


Extended analysis

TL;DR

Adopt region-scoped guided decoding with tool-aware grammars and explicit boundary tracking to improve robustness and reduce parser complexity in vLLM tool calling.

Guidance

  • Implement region-scoped guided decoding to enforce syntactic and semantic correctness within structured regions, as demonstrated in vllm pull request #36891.
  • Introduce explicit tool call boundaries as a runtime primitive to eliminate ambiguity and improve parsing reliability.
  • Develop a dynamic grammar generation system that restricts tool names and enforces argument schemas to prevent common failure modes.
  • Align parsers with the new decoding model to reduce complexity and enable deterministic extraction of structured data.
  • Refine the meaning of tool_choice to reflect semantic intent rather than output format, allowing for more flexible and robust tool calling.

Example

No specific code example is provided, as the proposal focuses on architectural changes and high-level design. However, the mentioned pull requests (e.g., #36891) may contain relevant implementation details.

Notes

The proposed changes aim to address the tradeoff between robustness and model-native behavior in vLLM tool calling. While the new approach may introduce additional complexity in decoding logic, it is expected to improve overall robustness and reduce parser complexity.

Recommendation

Adopt region-scoped guided decoding with tool-aware grammars and explicit boundary tracking. It offers a more robust and consistent approach to tool calling in vLLM than the current split between post-hoc parsing for "auto" and a global JSON schema for "required". The approach is partially demonstrated in existing pull requests (e.g., #36891) and aligns with ongoing development efforts.
