vllm - ✅(Solved) Fix [Feature]: Allow user selection of structured output (xgrammar) backend for bitmask application [2 pull requests, 2 comments, 2 participants]

vllm · 2026-03-20 06:56:35

GitHub stats

vllm-project/vllm#37650•Fetched 2026-04-08 01:04:20

Timeline: commented ×2, cross-referenced ×2, assigned ×1, labeled ×1

Fix Action

Fix / Workaround

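The issue lists the following as the alternatives to adding the option; none of them gives users a supported way to select the backend:

Relying only on auto backend selection and hoping the libraries do the right thing.
Forcing users to patch the code manually if they require a specific backend.
Exposing the option only in Python and not via the CLI/config.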

PR fix notes

PR #37654: [Feature] Expose xgrammar bitmask backend selection in StructuredOutputsConfig

Repository: vllm-project/vllm
Author: chaunceyjiang
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37654

Description (problem / solution / changelog)

Fix https://github.com/vllm-project/vllm/issues/37650

Changed files

vllm/config/structured_outputs.py (modified, +8/-0)
vllm/v1/structured_output/utils.py (modified, +10/-2)
vllm/v1/worker/gpu_model_runner.py (modified, +6/-1)

PR #1: feat(structured-outputs): expose xgrammar bitmask backend selection via platform interface

Repository: xyDong0223/vllm
Author: Copilot
State: closed | merged: True
Link: https://github.com/xyDong0223/vllm/pull/1

Description (problem / solution / changelog)

xgr.apply_token_bitmask_inplace supports multiple backends ("auto", "cpu", "cuda", "triton", "torch_compile", "torch_native") but the backend was hardcoded to "auto" with no user control. This adds a platform-aware backend selection mechanism following the existing get_vit_attn_backend pattern.

Changes

vllm/config/structured_outputs.py: Add BitmaskBackend type alias and bitmask_backend: BitmaskBackend = "auto" field to StructuredOutputsConfig
vllm/platforms/interface.py: Add get_supported_bitmask_backends() and get_bitmask_backend() to Platform base class; base implementation supports ["auto", "cpu"] and returns "auto" unchanged
Platform overrides:
- cuda.py: supports all backends; "auto" delegates to xgrammar
- cpu.py: supports ["auto", "cpu"]; resolves "auto" → "cpu"
- rocm.py: supports ["auto", "cpu", "cuda", "torch_compile", "torch_native"]; "auto" delegates to xgrammar
- xpu.py: supports ["auto", "cpu"]; resolves "auto" → "cpu"
vllm/v1/structured_output/utils.py: Add bitmask_backend: str = "auto" param to apply_grammar_bitmask(), thread it into both xgr.apply_token_bitmask_inplace() calls
vllm/v1/worker/gpu_model_runner.py: Resolve backend once at init via current_platform.get_bitmask_backend(config.bitmask_backend), cache as self._bitmask_backend, pass on each call

Usage

# Via Python API
from vllm import LLM, SamplingParams
from vllm.config import StructuredOutputsConfig

llm = LLM(model="...", structured_outputs_config=StructuredOutputsConfig(bitmask_backend="triton"))

# Via CLI
vllm serve ... --structured-outputs-config.bitmask_backend=triton

Note: This only affects the legacy GPU model runner path that calls xgr.apply_token_bitmask_inplace. The new GPU model runner (vllm/v1/worker/gpu/model_runner.py) uses a custom Triton kernel directly and is unaffected.

<details> <summary>Original prompt</summary>

Feature: Allow user selection of structured output (xgrammar) backend for bitmask application

Implements: https://github.com/vllm-project/vllm/issues/37650

Background

In vLLM, when structured output tasks are enabled, the function xgr.apply_token_bitmask_inplace from xgrammar supports different backends ("auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"), but currently the backend is hard-coded as "auto" and not exposed to the user.

Design

Following @shen-shanshan's suggestion in the issue comments (https://github.com/vllm-project/vllm/issues/37650#issuecomment-4096210982), the implementation adds a Platform interface for various hardware platforms to customize their xgrammar backend selection logic, similar to the existing get_vit_attn_backend pattern in vllm/platforms/interface.py#L243-L278.

Implementation Plan

1. Add `bitmask_backend` field to `StructuredOutputsConfig` (`vllm/config/structured_outputs.py`)

Add a new type alias and field:

BitmaskBackend = Literal[
    "auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"
]

Add to the StructuredOutputsConfig class:

bitmask_backend: BitmaskBackend = "auto"
"""Select the backend for applying the structured output token bitmask
via xgrammar's apply_token_bitmask_inplace. Options: "auto", "cpu",
"cuda", "triton", "torch_compile", "torch_native".
Default "auto" lets the platform decide the best backend."""

This exposes it via CLI as --structured-outputs-config.bitmask_backend=triton and via Python API as StructuredOutputsConfig(bitmask_backend="triton").
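A minimal, standalone sketch of the field described in this step. It uses a plain dataclass as a stand-in, because vLLM's real StructuredOutputsConfig is built with its own config machinery that is not shown on this page. The class name StructuredOutputsConfigSketch and the __post_init__ check are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Literal, get_args

BitmaskBackend = Literal[
    "auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"
]


@dataclass
class StructuredOutputsConfigSketch:
    """Hypothetical stand-in for vLLM's StructuredOutputsConfig."""

    # Default "auto" keeps the behavior from before the option existed.
    bitmask_backend: BitmaskBackend = "auto"

    def __post_init__(self) -> None:
        # Literal types are not enforced at runtime, so check explicitly.
        allowed = get_args(BitmaskBackend)
        if self.bitmask_backend not in allowed:
            raise ValueError(
                f"Unknown bitmask backend {self.bitmask_backend!r}; "
                f"expected one of {allowed}"
            )


config = StructuredOutputsConfigSketch(bitmask_backend="triton")
```

Deriving the allowed values from the Literal alias keeps the type hint and the runtime check from drifting apart.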

2. Add Platform interface methods (`vllm/platforms/interface.py`)

Add two new classmethods to the Platform base class, following the get_supported_vit_attn_backends / get_vit_attn_backend pattern:

@classmethod
def get_supported_bitmask_backends(cls) -> list[str]:
    """Return the list of supported bitmask backends on this platform."""
    return ["auto", "cpu"]

@classmethod
def get_bitmask_backend(cls, backend: str = "auto") -> str:
    """
    Get the bitmask backend for structured output on this platform.
    If user specifies a backend explicitly, validate it's supported and use it.
    If "auto", let the platform choose the best default.
    """
    if backend != "auto":
        supported = cls.get_supported_bitmask_backends()
        if backend not in supported:
            raise ValueError(
                f"Bitmask backend '{backend}' is not supported on "
                f"{cls.device_name}. Supported: {supported}"
            )
        return backend
    return "auto"

3. Override in platform subclasses

CudaPlatform (vllm/platforms/cuda.py): Support all backends ["auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"]. Default "auto" lets xgrammar choose.
CpuPlatform (vllm/platforms/cpu.py): Support ["auto", "cpu"]. When "auto", resolve to "cpu" since only CPU kernels are available.
RocmPlatform (vllm/platforms/rocm.py): Support ["auto", "cpu", "cuda", "torch_compile", "torch_native"]. Default "auto".
XPUPlatform (vllm/platforms/xpu.py): Support ["auto", "cpu"]. When "auto", resolve to "cpu".
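The override pattern in steps 2 and 3 can be sketched without importing vLLM. The class names below (PlatformSketch, CpuPlatformSketch, RocmPlatformSketch) are hypothetical stand-ins, and the base class validates every value including "auto", which is a simplification of the snippet in step 2 rather than a copy of the merged code.

```python
class PlatformSketch:
    """Hypothetical stand-in for the Platform base class."""

    device_name = "generic"

    @classmethod
    def get_supported_bitmask_backends(cls) -> list[str]:
        return ["auto", "cpu"]

    @classmethod
    def get_bitmask_backend(cls, backend: str = "auto") -> str:
        # Validate against the list of whichever subclass this is called on.
        supported = cls.get_supported_bitmask_backends()
        if backend not in supported:
            raise ValueError(
                f"Bitmask backend '{backend}' is not supported on "
                f"{cls.device_name}. Supported: {supported}"
            )
        return backend


class CpuPlatformSketch(PlatformSketch):
    device_name = "cpu"

    @classmethod
    def get_bitmask_backend(cls, backend: str = "auto") -> str:
        resolved = super().get_bitmask_backend(backend)
        # Only CPU kernels exist here, so "auto" is pinned to "cpu".
        return "cpu" if resolved == "auto" else resolved


class RocmPlatformSketch(PlatformSketch):
    device_name = "rocm"

    @classmethod
    def get_supported_bitmask_backends(cls) -> list[str]:
        # Mirrors the list in step 3: every backend except "triton".
        return ["auto", "cpu", "cuda", "torch_compile", "torch_native"]
```

With this shape, a platform that only needs a different supported list overrides one method, while a platform that must pin "auto" to a concrete backend overrides the other.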

4. Modify `apply_grammar_bitmask` in `vllm/v1/structured_output/utils.py`

Add a bitmask_backend: str = "auto" parameter and pass it through to xgr.apply_token_bitmask_inplace:

def apply_grammar_bitmask(
    scheduler_output: SchedulerOutput,
    grammar_output: GrammarOutput,
    input_batch: InputBatch,
    logits: torch.Tensor,
    bitmask_backend: str = "auto",  # NEW
) -> None:
    # ... existing code ...
    
    if logits.device.type == "cpu" and logits.dtype != torch.float32:
        logits_float32 = logits.to(torch.float32)
        xgr.apply_token_bitmask_inplace(
            logits_float32, grammar_bitmask, indices=index_tensor,
            backend=bitmask_backend,  # NEW
        )
        logits.copy_(logits_float32.to(logits.dtype))
    else:
        xgr.apply_token_bitmask_inplace(
            logits, grammar_bitmask, indices=index_tensor,
            backend=bitmask_backend,  # NEW
        )

5. Modify callers to resolve and pass the backend

In vllm/v1/worker/gpu_model_runner.py (the legacy GPU model runner that calls apply_grammar_bitmask from utils.py):

During initialization, resolve the bitmask backend using the platform interface:

from vllm.platforms import current_platform
self._bitmask_backend = current_platform.get_bitmask_backend(
    self.vllm_config.structured_outputs_config.bitmask_backend
)

Pass it when calling:

if grammar_output is not None:
    apply_grammar_bitmask(
        scheduler_output, grammar_output, self.input_batch, logits,
        bitmask_backend=self._bitmask_backend,
    )
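The "resolve once, pass on every call" idea in this step can be illustrated with a small stand-alone sketch. ModelRunnerSketch, fake_resolve and fake_apply are hypothetical names; they stand in for the model runner, current_platform.get_bitmask_backend and apply_grammar_bitmask respectively, and none of them is vLLM code.

```python
from typing import Callable


class ModelRunnerSketch:
    """Hypothetical runner: resolves the backend once, reuses it per step."""

    def __init__(self, configured_backend: str,
                 resolve: Callable[[str], str]) -> None:
        # Resolution (and any validation error) happens at construction time.
        self._bitmask_backend = resolve(configured_backend)

    def step(self, apply_bitmask: Callable[..., None],
             logits: list[float]) -> None:
        # The cached value is forwarded unchanged on every call.
        apply_bitmask(logits, backend=self._bitmask_backend)


resolve_calls: list[str] = []
seen_backends: list[str] = []


def fake_resolve(backend: str) -> str:
    resolve_calls.append(backend)
    return "cpu" if backend == "auto" else backend


def fake_apply(logits: list[float], *, backend: str) -> None:
    seen_backends.append(backend)


runner = ModelRunnerSketch("auto", fake_resolve)
for _ in range(3):
    runner.step(fake_apply, [0.0])
```

Because the platform lookup can raise for an unsupported value, doing it at initialization would surface a bad setting at startup instead of during generation.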

Note: The new GPU model runner path (vllm/v1/worker/gpu/model_runner.py) uses its own...

</details>

This pull request was created from Copilot chat.

Changed files

vllm/config/structured_outputs.py (modified, +9/-0)
vllm/platforms/cpu.py (modified, +20/-0)
vllm/platforms/cuda.py (modified, +19/-0)
vllm/platforms/interface.py (modified, +35/-0)
vllm/platforms/rocm.py (modified, +19/-0)
vllm/platforms/xpu.py (modified, +20/-0)
vllm/v1/structured_output/utils.py (modified, +12/-2)
vllm/v1/worker/gpu_model_runner.py (modified, +7/-1)

Code Example

Proposed call, with the backend chosen by the user:

xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=index_tensor, backend=user_selected_backend)

---

xgrammar signature, as quoted in the issue:

def apply_token_bitmask_inplace(
    logits: torch.Tensor,
    bitmask: torch.Tensor,
    *,
    vocab_size: Optional[int] = None,
    indices: Optional[List[int]] = None,
    backend: Literal["auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"] = "auto",
) -> None:
    ...  # body not quoted in the issue

---

Current call in vllm/v1/structured_output/utils.py, which passes no backend:

xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=index_tensor)

The feature, motivation and pitch

Background

In vLLM, when structured output is enabled (for example with a JSON schema, grammar, or regex constraint), the backend used to apply the token bitmask affects both correctness and performance. The function xgr.apply_token_bitmask_inplace from xgrammar supports several backends ("auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"), but vLLM currently hard-codes the backend as "auto" and does not expose it to the user.
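For readers new to the term, what "applying a token bitmask" does to logits can be sketched in pure Python. This sketch assumes a packed layout in which each 32-bit word covers 32 consecutive token ids and a set bit means the token is allowed; that layout and the function name apply_bitmask_sketch are assumptions for illustration, not a copy of any xgrammar backend.

```python
import math


def apply_bitmask_sketch(logits: list[float], bitmask_words: list[int]) -> None:
    """Set logits[i] to -inf in place when bit i of the packed mask is 0."""
    for token_id in range(len(logits)):
        word = bitmask_words[token_id // 32]
        allowed = (word >> (token_id % 32)) & 1
        if not allowed:
            logits[token_id] = -math.inf


logits = [0.5, 1.0, -0.3, 2.0]
# Binary 0b0101: tokens 0 and 2 are allowed, tokens 1 and 3 are masked out.
apply_bitmask_sketch(logits, [0b0101])
```

Every backend is expected to produce this same masking effect; they differ in where and how the loop runs, which is why the choice matters for performance and for working around a misbehaving kernel.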

Motivation

Users may want to explicitly choose a backend to maximize performance for their hardware or to debug incompatibility issues, e.g., forcing cpu or cuda mode if they hit issues with the triton kernel or auto selection.
Exposing this option provides more transparency and flexibility for power users.

Pitch

Allow structured output backend selection for bitmask application by exposing a command-line flag, config file option, or Python API parameter (e.g., --structured-outputs-config.bitmask_backend or StructuredOutputsConfig(bitmask_backend=...)).
This value should be plumbed through so that it is ultimately passed as the backend argument to xgr.apply_token_bitmask_inplace, affecting calls such as the following:

xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=index_tensor, backend=user_selected_backend)

The option should be documented and behave like other StructuredOutputsConfig choices. Default behavior should remain "auto" for compatibility.

Alternatives

Relying only on auto backend selection, hoping libraries do the right thing.
Forcing users to patch the code manually if they require a specific backend.
Exposing the option only in Python and not via the CLI/config.

Additional context

Relevant context: The interface for xgrammar is

def apply_token_bitmask_inplace(
    logits: torch.Tensor,
    bitmask: torch.Tensor,
    *,
    vocab_size: Optional[int] = None,
    indices: Optional[List[int]] = None,
    backend: Literal["auto", "cpu", "cuda", "triton", "torch_compile", "torch_native"] = "auto",
) -> None:
    ...  # body not quoted in the issue

Currently, vllm/v1/structured_output/utils.py uses it as:

xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=index_tensor)

This call does not pass a configurable backend through. The StructuredOutputsConfig class already exists for related configuration.

Let me know if additional context or example code is needed.

extent analysis

Fix Plan

To expose the structured output backend selection, follow these steps:

Add a bitmask_backend field to the StructuredOutputsConfig class, typed with the BitmaskBackend alias from the implementation plan:

class StructuredOutputsConfig:
    # ... existing fields ...
    bitmask_backend: BitmaskBackend = "auto"

Update the apply_token_bitmask_inplace call to use the bitmask_backend from the config:

xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=index_tensor, backend=config.bitmask_backend)

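Expose the option on the command line and in config files. Per the implementation plan in PR #1, adding the field is what makes the flag available, so no hand-written parser.add_argument call is described there:

vllm serve ... --structured-outputs-config.bitmask_backend=triton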

Document the new option and its behavior.

Verification

To verify the fix, test the following scenarios:

Set the bitmask_backend to a specific value (e.g., "cpu") and verify that the apply_token_bitmask_inplace call uses the correct backend.
Test with different backends (e.g., "cuda", "triton") to ensure compatibility.
Verify that the default behavior remains "auto" when no bitmask_backend is specified.
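The first and last checks can be sketched with a mock in place of xgrammar, so no GPU or real kernel is needed. The wrapper apply_with_backend is a hypothetical stand-in that only mirrors how the backend is threaded through; it is not the vLLM function, and the string arguments stand in for tensors.

```python
from unittest.mock import MagicMock

# Stand-in for the xgrammar module; no real xgrammar call is made.
xgr = MagicMock()


def apply_with_backend(logits, grammar_bitmask, index_tensor,
                       bitmask_backend="auto"):
    """Hypothetical wrapper that forwards the backend as a keyword."""
    xgr.apply_token_bitmask_inplace(
        logits, grammar_bitmask, indices=index_tensor, backend=bitmask_backend
    )


# Default path: nothing specified, so "auto" should be forwarded.
apply_with_backend("logits", "mask", [0])
default_kwargs = xgr.apply_token_bitmask_inplace.call_args.kwargs

# Explicit path: the chosen value should reach the call unchanged.
apply_with_backend("logits", "mask", [0], bitmask_backend="cpu")
explicit_kwargs = xgr.apply_token_bitmask_inplace.call_args.kwargs
```

A check like this only confirms the plumbing; compatibility of "cuda" or "triton" still has to be tested on the matching hardware.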

Extra Tips

Make sure to update the documentation to reflect the new bitmask_backend option.
Consider adding error handling for invalid bitmask_backend values.
Test the fix on different hardware configurations to ensure performance and correctness.
