vllm - ✅(Solved) Fix [Feature]: Allow user selection of structured output (xgrammar) backend for bitmask application [2 pull requests, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#37650 (fetched 2026-04-08 01:04:20)
Comments: 2
Participants: 2
Timeline: 8
Reactions: 4
Timeline (top): commented ×2, cross-referenced ×2, assigned ×1, labeled ×1

PR fix notes

PR #37654: [Feature] Expose xgrammar bitmask backend selection in StructuredOutputsConfig

Description (problem / solution / changelog)

Fix https://github.com/vllm-project/vllm/issues/37650

Changed files

  • vllm/config/structured_outputs.py (modified, +8/-0)
  • vllm/v1/structured_output/utils.py (modified, +10/-2)
  • vllm/v1/worker/gpu_model_runner.py (modified, +6/-1)

PR #1: feat(structured-outputs): expose xgrammar bitmask backend selection via platform interface

Description (problem / solution / changelog)

xgr.apply_token_bitmask_inplace supports multiple backends ("auto", "cpu", "cuda", "triton", "torch_compile", "torch_native") but the backend was hardcoded to "auto" with no user control. This adds a platform-aware backend selection mechanism following the existing get_vit_attn_backend pattern.

Changes

  • vllm/config/structured_outputs.py: Add BitmaskBackend type alias and bitmask_backend: BitmaskBackend = "auto" field to StructuredOutputsConfig

  • vllm/platforms/interface.py: Add get_supported_bitmask_backends() and get_bitmask_backend() to Platform base class; base implementation supports ["auto", "cpu"] and returns "auto" unchanged

  • Platform overrides:

    • cuda.py: supports all backends; "auto" delegates to xgrammar
    • cpu.py: supports ["auto", "cpu"]; resolves "auto" → "cpu"
    • rocm.py: supports ["auto", "cpu", "cuda", "torch_compile", "torch_native"]; "auto" delegates to xgrammar
    • xpu.py: supports ["auto", "cpu"]; resolves "auto" → "cpu"
  • vllm/v1/structured_output/utils.py: Add bitmask_backend: str = "auto" param to apply_grammar_bitmask(), thread it into both xgr.apply_token_bitmask_inplace() calls

  • vllm/v1/worker/gpu_model_runner.py: Resolve backend once at init via current_platform.get_bitmask_backend(config.bitmask_backend), cache as self._bitmask_backend, pass on each call

Usage

# Via Python API
from vllm import LLM, SamplingParams
from vllm.config import StructuredOutputsConfig

llm = LLM(model="...", structured_outputs_config=StructuredOutputsConfig(bitmask_backend="triton"))

# Via CLI
vllm serve ... --structured-outputs-config.bitmask_backend=triton

Note: This only affects the legacy GPU model runner path that calls xgr.apply_token_bitmask_inplace. The new GPU model runner (vllm/v1/worker/gpu/model_runner.py) uses a custom Triton kernel directly and is unaffected.

<!-- START COPILOT ORIGINAL PROMPT --> <details> <summary>Original prompt</summary>

Feature: Allow user selection of structured output (xgrammar) backend for bitmask application

Implements: https://github.com/vllm-project/vllm/issues/37650

Background

In vLLM, when structured output tasks are enabled, the function xgr.apply_token_bitmask_inplace from xgrammar supports different backends ("auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"), but currently the backend is hard-coded as "auto" and not exposed to the user.

Design

Following @shen-shanshan's suggestion in the issue comments (https://github.com/vllm-project/vllm/issues/37650#issuecomment-4096210982), the implementation adds a Platform interface for various hardware platforms to customize their xgrammar backend selection logic, similar to the existing get_vit_attn_backend pattern in vllm/platforms/interface.py#L243-L278.

Implementation Plan

1. Add bitmask_backend field to StructuredOutputsConfig (vllm/config/structured_outputs.py)

Add a new type alias and field:

BitmaskBackend = Literal[
    "auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"
]

Add to the StructuredOutputsConfig class:

bitmask_backend: BitmaskBackend = "auto"
"""Select the backend for applying the structured output token bitmask
via xgrammar's apply_token_bitmask_inplace. Options: "auto", "cpu",
"cuda", "triton", "torch_compile", "torch_native".
Default "auto" lets the platform decide the best backend."""

This exposes it via CLI as --structured-outputs-config.bitmask_backend=triton and via Python API as StructuredOutputsConfig(bitmask_backend="triton").
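
The Literal alias above can also serve as a runtime allow-list. The following is a minimal sketch of that idea, assuming plain standard-library typing; validate_bitmask_backend is a hypothetical helper used only for illustration and is not part of vLLM, whose config machinery may validate the field differently.

```python
from typing import Literal, get_args

# Same alias as in the implementation plan.
BitmaskBackend = Literal[
    "auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"
]


def validate_bitmask_backend(value: str) -> str:
    """Hypothetical helper: reject strings that are not options of the Literal."""
    allowed = get_args(BitmaskBackend)  # tuple of the six option strings
    if value not in allowed:
        raise ValueError(
            f"Unknown bitmask backend {value!r}; expected one of {allowed}"
        )
    return value


print(validate_bitmask_backend("triton"))  # triton
```

Deriving the allowed values from the alias keeps the type hint and the runtime check from drifting apart when a backend is added.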

2. Add Platform interface methods (vllm/platforms/interface.py)

Add two new classmethods to the Platform base class, following the get_supported_vit_attn_backends / get_vit_attn_backend pattern:

@classmethod
def get_supported_bitmask_backends(cls) -> list[str]:
    """Return the list of supported bitmask backends on this platform."""
    return ["auto", "cpu"]

@classmethod
def get_bitmask_backend(cls, backend: str = "auto") -> str:
    """
    Get the bitmask backend for structured output on this platform.
    If user specifies a backend explicitly, validate it's supported and use it.
    If "auto", let the platform choose the best default.
    """
    if backend != "auto":
        supported = cls.get_supported_bitmask_backends()
        if backend not in supported:
            raise ValueError(
                f"Bitmask backend '{backend}' is not supported on "
                f"{cls.device_name}. Supported: {supported}"
            )
        return backend
    return "auto"

3. Override in platform subclasses

  • CudaPlatform (vllm/platforms/cuda.py): Support all backends ["auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"]. Default "auto" lets xgrammar choose.
  • CpuPlatform (vllm/platforms/cpu.py): Support ["auto", "cpu"]. When "auto", resolve to "cpu" since only CPU kernels are available.
  • RocmPlatform (vllm/platforms/rocm.py): Support ["auto", "cpu", "cuda", "torch_compile", "torch_native"]. Default "auto".
  • XPUPlatform (vllm/platforms/xpu.py): Support ["auto", "cpu"]. When "auto", resolve to "cpu".
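
The overrides described above could look roughly like the sketch below. The Platform base class is copied from step 2 in simplified form; the CpuPlatform and RocmPlatform bodies are assumptions written from the bullet descriptions, not the code merged in the PR.

```python
class Platform:
    """Simplified copy of the base-class behavior from step 2."""

    device_name = "unspecified"

    @classmethod
    def get_supported_bitmask_backends(cls) -> list[str]:
        return ["auto", "cpu"]

    @classmethod
    def get_bitmask_backend(cls, backend: str = "auto") -> str:
        if backend != "auto":
            supported = cls.get_supported_bitmask_backends()
            if backend not in supported:
                raise ValueError(
                    f"Bitmask backend '{backend}' is not supported on "
                    f"{cls.device_name}. Supported: {supported}"
                )
            return backend
        return "auto"


class CpuPlatform(Platform):
    device_name = "cpu"

    @classmethod
    def get_bitmask_backend(cls, backend: str = "auto") -> str:
        # Reuse the base-class validation, then pin "auto" to "cpu".
        resolved = super().get_bitmask_backend(backend)
        return "cpu" if resolved == "auto" else resolved


class RocmPlatform(Platform):
    device_name = "rocm"

    @classmethod
    def get_supported_bitmask_backends(cls) -> list[str]:
        # No "triton" entry, matching the bullet above.
        return ["auto", "cpu", "cuda", "torch_compile", "torch_native"]


print(CpuPlatform.get_bitmask_backend("auto"))   # cpu
print(RocmPlatform.get_bitmask_backend("auto"))  # auto
```

With this split, a platform that only needs a different allow-list overrides one method, while a platform that must pin "auto" to a concrete backend overrides the other.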

4. Modify apply_grammar_bitmask in vllm/v1/structured_output/utils.py

Add a bitmask_backend: str = "auto" parameter and pass it through to xgr.apply_token_bitmask_inplace:

def apply_grammar_bitmask(
    scheduler_output: SchedulerOutput,
    grammar_output: GrammarOutput,
    input_batch: InputBatch,
    logits: torch.Tensor,
    bitmask_backend: str = "auto",  # NEW
) -> None:
    # ... existing code ...
    
    if logits.device.type == "cpu" and logits.dtype != torch.float32:
        logits_float32 = logits.to(torch.float32)
        xgr.apply_token_bitmask_inplace(
            logits_float32, grammar_bitmask, indices=index_tensor,
            backend=bitmask_backend,  # NEW
        )
        logits.copy_(logits_float32.to(logits.dtype))
    else:
        xgr.apply_token_bitmask_inplace(
            logits, grammar_bitmask, indices=index_tensor,
            backend=bitmask_backend,  # NEW
        )

5. Modify callers to resolve and pass the backend

In vllm/v1/worker/gpu_model_runner.py (the legacy GPU model runner that calls apply_grammar_bitmask from utils.py):

  • During initialization, resolve the bitmask backend using the platform interface:
    from vllm.platforms import current_platform
    self._bitmask_backend = current_platform.get_bitmask_backend(
        self.vllm_config.structured_outputs_config.bitmask_backend
    )
  • Pass it when calling:
    if grammar_output is not None:
        apply_grammar_bitmask(
            scheduler_output, grammar_output, self.input_batch, logits,
            bitmask_backend=self._bitmask_backend,
        )
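
The resolve-once, pass-every-call pattern in step 5 can be sketched in isolation. RunnerSketch, resolve, and apply_bitmask below are hypothetical stand-ins, not vLLM classes or functions; they only illustrate that resolution and validation run at construction time rather than on every decoding step.

```python
from typing import Callable, List


class RunnerSketch:
    """Hypothetical stand-in for the model runner; not vLLM code."""

    def __init__(self, requested: str, resolve: Callable[[str], str]) -> None:
        # Validation and "auto" resolution happen once, at construction time.
        self._bitmask_backend = resolve(requested)

    def step(self, apply_bitmask: Callable[..., None]) -> None:
        # Each decoding step reuses the cached value.
        apply_bitmask(bitmask_backend=self._bitmask_backend)


resolve_calls: List[str] = []
seen_backends: List[str] = []


def resolve(requested: str) -> str:
    """Stand-in for the platform lookup; records how often it is called."""
    resolve_calls.append(requested)
    return "cpu" if requested == "auto" else requested


def apply_bitmask(*, bitmask_backend: str) -> None:
    """Stand-in for apply_grammar_bitmask; records the backend it received."""
    seen_backends.append(bitmask_backend)


runner = RunnerSketch("auto", resolve)
for _ in range(3):
    runner.step(apply_bitmask)

print(resolve_calls)  # ['auto']
print(seen_backends)  # ['cpu', 'cpu', 'cpu']
```

One consequence of this design is that an unsupported backend would surface as an error at startup instead of in the middle of serving a request.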

Note: The new GPU model runner path (vllm/v1/worker/gpu/model_runner.py) uses its own...

</details> <!-- START COPILOT CODING AGENT SUFFIX -->

This pull request was created from Copilot chat.

Changed files

  • vllm/config/structured_outputs.py (modified, +9/-0)
  • vllm/platforms/cpu.py (modified, +20/-0)
  • vllm/platforms/cuda.py (modified, +19/-0)
  • vllm/platforms/interface.py (modified, +35/-0)
  • vllm/platforms/rocm.py (modified, +19/-0)
  • vllm/platforms/xpu.py (modified, +20/-0)
  • vllm/v1/structured_output/utils.py (modified, +12/-2)
  • vllm/v1/worker/gpu_model_runner.py (modified, +7/-1)

Code Example

xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=index_tensor, backend=user_selected_backend)

---

def apply_token_bitmask_inplace(
    logits: torch.Tensor,
    bitmask: torch.Tensor,
    *,
    vocab_size: Optional[int] = None,
    indices: Optional[List[int]] = None,
    backend: Literal["auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"] = "auto",
) -> None:

---

xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=index_tensor)

The feature, motivation and pitch

Background

In vLLM, when structured output is enabled (for example with a JSON schema, grammar, or regex constraint), the backend that applies the structured output token bitmask affects both correctness and performance. The function xgr.apply_token_bitmask_inplace from xgrammar supports several backends ("auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"), but vLLM currently hard-codes the backend as "auto" and does not expose it to the user.
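
For readers unfamiliar with the operation itself, here is a pure-Python sketch of what applying a token bitmask means. The packed layout (32 tokens per mask word, a set bit meaning the token is allowed) is an assumption about xgrammar's convention that the issue does not state, and pack_allowed and apply_bitmask_inplace are illustrative names, not xgrammar functions. The real backends perform the same masking on tensors.

```python
import math

BITS_PER_WORD = 32  # assumption: 32 tokens are packed into each mask word


def pack_allowed(allowed: set[int], vocab_size: int) -> list[int]:
    """Pack allowed token ids: token i sets bit i % 32 of word i // 32."""
    words = [0] * math.ceil(vocab_size / BITS_PER_WORD)
    for token_id in allowed:
        words[token_id // BITS_PER_WORD] |= 1 << (token_id % BITS_PER_WORD)
    return words


def apply_bitmask_inplace(logits: list[float], words: list[int]) -> None:
    """Set the logit of every token whose mask bit is 0 to -inf."""
    for token_id in range(len(logits)):
        word = words[token_id // BITS_PER_WORD]
        if not (word >> (token_id % BITS_PER_WORD)) & 1:
            logits[token_id] = -math.inf


logits = [0.5, 2.0, -1.0, 3.0]
apply_bitmask_inplace(logits, pack_allowed({1, 2}, vocab_size=len(logits)))
print(logits)  # [-inf, 2.0, -1.0, -inf]
```

After masking, sampling can only pick tokens the grammar allows, which is why a faulty backend shows up as a correctness problem and a slow one as per-step latency.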

Motivation

  • Users may want to explicitly choose a backend to maximize performance for their hardware or to debug incompatibility issues, e.g., forcing cpu or cuda mode if they hit issues with the triton kernel or auto selection.
  • Exposing this option provides more transparency and flexibility for power users.

Pitch

  • Allow structured output backend selection for bitmask application by exposing a command-line flag, config file option, or Python API parameter (e.g., --structured-outputs-config.bitmask_backend or StructuredOutputsConfig(bitmask_backend=...)).
  • This value should be plumbed so it is finally passed as the backend argument to xgr.apply_token_bitmask_inplace, impacting calls such as the following:
xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=index_tensor, backend=user_selected_backend)
  • The option should be documented and behave like other StructuredOutputsConfig choices. Default behavior should remain "auto" for compatibility.

Alternatives

  • Relying only on auto backend selection, hoping libraries do the right thing.
  • Forcing users to patch the code manually if they require a specific backend.
  • Exposing the option only in Python and not via the CLI/config.

Additional context

Relevant context: The interface for xgrammar is

def apply_token_bitmask_inplace(
    logits: torch.Tensor,
    bitmask: torch.Tensor,
    *,
    vocab_size: Optional[int] = None,
    indices: Optional[List[int]] = None,
    backend: Literal["auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"] = "auto",
) -> None:

Currently, vllm/v1/structured_output/utils.py uses it as:

xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=index_tensor)

However, vLLM does not pass a configurable backend through to this call. The StructuredOutputsConfig class already exists for related configuration.

Let me know if additional context or example code is needed.

extent analysis

Fix Plan

To expose the structured output backend selection, follow these steps:

  • Add a bitmask_backend parameter to the StructuredOutputsConfig class:
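from dataclasses import dataclass
from typing import Literal

BitmaskBackend = Literal["auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"]

@dataclass
class StructuredOutputsConfig:
    # Existing fields omitted; a plain dataclass is used here as a simplified stand-in.
    bitmask_backend: BitmaskBackend = "auto"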
  • Update the apply_token_bitmask_inplace call to use the bitmask_backend from the config:
xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=index_tensor, backend=config.bitmask_backend)
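  • Expose the option on the command line and in config files. Per the PR notes, the new StructuredOutputsConfig field is reachable through the dotted flag, so a hand-written parser.add_argument call should not be needed:
vllm serve ... --structured-outputs-config.bitmask_backend=triton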
  • Document the new option and its behavior.

Verification

To verify the fix, test the following scenarios:

  • Set the bitmask_backend to a specific value (e.g., "cpu") and verify that the apply_token_bitmask_inplace call uses the correct backend.
  • Test with different backends (e.g., "cuda", "triton") to ensure compatibility.
  • Verify that the default behavior remains "auto" when no bitmask_backend is specified.
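
The first and third checks above can be sketched with unittest.mock. apply_grammar_bitmask_sketch is a hypothetical, simplified stand-in for the vLLM wrapper that takes the xgrammar function as an argument. In a real test one would more likely patch xgr.apply_token_bitmask_inplace in place, which is not shown here.

```python
from unittest import mock


def apply_grammar_bitmask_sketch(apply_fn, logits, bitmask, bitmask_backend="auto"):
    """Hypothetical stand-in for the wrapper: forwards the backend keyword."""
    apply_fn(logits, bitmask, backend=bitmask_backend)


# A Mock records the arguments it was called with.
fake_apply = mock.Mock()

# An explicit backend must reach the underlying call unchanged.
apply_grammar_bitmask_sketch(fake_apply, [0.0], [1], bitmask_backend="cpu")
explicit_backend = fake_apply.call_args.kwargs["backend"]

# With no backend given, the default must remain "auto".
fake_apply.reset_mock()
apply_grammar_bitmask_sketch(fake_apply, [0.0], [1])
default_backend = fake_apply.call_args.kwargs["backend"]

print(explicit_backend, default_backend)  # cpu auto
```

This only verifies the plumbing. Checking that "cuda" or "triton" actually work still requires running on the corresponding hardware.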

Extra Tips

  • Make sure to update the documentation to reflect the new bitmask_backend option.
  • Consider adding error handling for invalid bitmask_backend values.
  • Test the fix on different hardware configurations to ensure performance and correctness.
