vllm - ✅ (Solved) Fix [Bug]: Structured output crashes on CPU with pin_memory=True in apply_grammar_bitmask() [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#37705 · Fetched 2026-04-08 01:08:44
Comments: 0
Participants: 1
Timeline: 3 (closed ×1, cross-referenced ×1, referenced ×1)
Reactions: 0

Error Message

RuntimeError: pin_memory=True requires a CUDA or other accelerator backend; no pinned memory allocator is available on this system.

Root Cause

Two issues in apply_grammar_bitmask():

  1. pin_memory=True is hardcoded: torch.tensor(out_indices, dtype=torch.int32, device="cpu", pin_memory=True) requires a CUDA or other accelerator backend, so it fails on CPU-only systems.
  2. The xgrammar CPU kernel expects Sequence[int], not torch.Tensor: even after fixing the first bug, the index_tensor is passed to xgr.apply_token_bitmask_inplace(), which dispatches to apply_token_bitmask_inplace_cpu(), and that function only accepts a Python list for the indices argument, not a tensor.

Note: there is already a CPU-specific workaround below this code (dtype conversion for float32, added in #31901), but it can never be reached because the pin_memory=True call crashes first.

Fix / Workaround

On CPU, pass out_indices as a plain Python list directly instead of converting it to a pinned tensor. The GPU path with pinned memory is preserved.

PR fix notes

PR #37706: [Bugfix] Fix structured output crash on CPU due to pin_memory=True

Description (problem / solution / changelog)

Essential Checks

  • PR title follows the pattern [Tag] Short description
  • I have searched for related issues and checked existing PRs
  • I have run linting/formatting locally

Purpose

Fix RuntimeError: pin_memory=True requires a CUDA or other accelerator backend crash when using structured output (guided decoding) on CPU-only deployments.

Fixes #37705

Problem

apply_grammar_bitmask() in vllm/v1/structured_output/utils.py crashes on CPU when handling mixed batches (concurrent structured + non-structured requests):

  1. pin_memory=True is hardcoded: torch.tensor(out_indices, ..., pin_memory=True) requires CUDA; fails on CPU-only systems.
  2. xgrammar CPU kernel expects Sequence[int], not torch.Tensor: apply_token_bitmask_inplace_cpu() only accepts a Python list for the indices argument.

Note: the existing CPU float32 workaround (added in #31901) was never reachable because the pin_memory=True crash occurs first.

Fix

On CPU, pass out_indices as a plain Python list directly instead of converting to a pinned tensor. The GPU path with pinned memory is preserved.
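
The device-dependent choice described above can be sketched as a small standalone helper. This is only an illustration: select_indices is a hypothetical name, not a function in vLLM, and it takes the device type as a string instead of reading logits.device. torch is imported lazily so that the CPU branch can be exercised on a machine without an accelerator.

```python
from __future__ import annotations

from typing import Any


def select_indices(
    out_indices: list[int],
    device_type: str,
    skip_out_indices: bool = False,
) -> Any:
    """Pick the `indices` argument for the bitmask kernel (hypothetical helper)."""
    if skip_out_indices:
        # No per-row selection is needed.
        return None
    if device_type == "cpu":
        # CPU path: hand over the plain Python list; no pinned memory involved.
        return out_indices
    # Accelerator path: stage in pinned host memory, then copy asynchronously.
    import torch  # imported lazily; only needed off-CPU

    staged = torch.tensor(
        out_indices, dtype=torch.int32, device="cpu", pin_memory=True
    )
    return staged.to(device_type, non_blocking=True)


print(select_indices([0, 2], "cpu"))  # prints [0, 2]
```

Keeping the pinned-memory staging inside the non-CPU branch means a CPU-only process never asks PyTorch for a pinned allocator at all.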

Test Plan

Tested by starting vLLM on CPU with ibm-granite/granite-3.2-2b-instruct, then sending concurrent plain + structured output (response_format: json_schema) requests. Without the fix, both requests return 500 and the EngineCore dies. With the fix, both succeed and the server stays healthy.

import concurrent.futures
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
MODEL = "ibm-granite/granite-3.2-2b-instruct"

def plain_request():
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Tell me a story"}],
        max_tokens=200,
    )

def structured_request():
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "What is the capital of France?"}],
        max_tokens=50,
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "resp", "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"capital": {"type": "string"}},
                    "required": ["capital"],
                    "additionalProperties": False,
                },
            },
        },
    )

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    f1 = executor.submit(plain_request)
    f2 = executor.submit(structured_request)
    print(f1.result())
    print(f2.result())

Changed files

  • vllm/v1/structured_output/utils.py (modified, +22/-16)

Original issue text

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Red Hat Enterprise Linux 9.6 (Plow) (x86_64)
GCC version                  : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.34

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cpu
Is debug build               : False
CUDA used to build PyTorch   : None
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.12 (main, Dec  9 2025, 19:02:36) [Clang 21.1.4 ] (64-bit runtime)
Python platform              : Linux-5.14.0-570.12.1.el9_6.x86_64-x86_64-with-glibc2.34

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               256
On-line CPU(s) list:                  0-255
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7763 64-Core Processor
CPU family:                           25
Model:                                1
Thread(s) per core:                   2
Core(s) per socket:                   64
Socket(s):                            2
Stepping:                             1
BogoMIPS:                             4890.62

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.4.3
[pip3] torch==2.10.0+cpu
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.1.dev1+g37aadf623 (git sha: 37aadf623)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
</details>

🐛 Describe the bug

Structured output (guided decoding) requests crash the vLLM EngineCore process on CPU-only deployments when there is a mixed batch of structured and non-structured requests (concurrent requests where at least one uses response_format with json_schema).

The crash occurs in apply_grammar_bitmask() in vllm/v1/structured_output/utils.py:

RuntimeError: pin_memory=True requires a CUDA or other accelerator backend;
no pinned memory allocator is available on this system.

Root cause: Two issues in apply_grammar_bitmask():

  1. pin_memory=True is hardcoded: torch.tensor(out_indices, dtype=torch.int32, device="cpu", pin_memory=True) requires CUDA; fails on CPU-only systems.
  2. xgrammar CPU kernel expects Sequence[int], not torch.Tensor: even after fixing bug 1, the index_tensor is passed to xgr.apply_token_bitmask_inplace() which dispatches to apply_token_bitmask_inplace_cpu(), and that function only accepts a Python list for the indices argument, not a tensor.

Note: there is already a CPU-specific workaround below this code (dtype conversion for float32, added in #31901), but it can never be reached because the pin_memory=True crashes first.
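
To make the role of the indices argument in a mixed batch concrete, here is a toy pure-Python model of applying a token bitmask to selected rows only. It is not xgrammar's implementation. The function name apply_bitmask_rows and the layout (one 32-bit word per 32 token ids, a set bit meaning the token is allowed, and the same row index used for logits and bitmask) are assumptions made for this illustration.

```python
import math

BITS_PER_WORD = 32  # assumed layout: one 32-bit word covers 32 token ids


def apply_bitmask_rows(logits, bitmask, indices=None):
    """Set disallowed tokens to -inf in place, only for the selected rows."""
    rows = range(len(logits)) if indices is None else indices
    for row in rows:
        for token_id in range(len(logits[row])):
            word = bitmask[row][token_id // BITS_PER_WORD]
            allowed = (word >> (token_id % BITS_PER_WORD)) & 1
            if not allowed:
                logits[row][token_id] = -math.inf


# Mixed batch: row 0 is a plain request, row 1 is a structured request.
logits = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]]
bitmask = [[0b1111], [0b0101]]  # row 1 allows only tokens 0 and 2

# A plain Python list of row indices, as the CPU kernel is said to expect.
apply_bitmask_rows(logits, bitmask, indices=[1])
print(logits[0])  # row 0 is untouched
print(logits[1])  # tokens 1 and 3 are now -inf
```

In this model, only the rows listed in indices are masked, which is why a batch containing both structured and plain requests needs the index list in the first place.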

Reproduction:

Start vLLM on CPU with any instruction-tuned model, then send concurrent plain + structured output requests:

import concurrent.futures
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
MODEL = "ibm-granite/granite-3.2-2b-instruct"  # or any chat model

def plain_request():
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Tell me a story"}],
        max_tokens=200,
    )

def structured_request():
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "What is the capital of France?"}],
        max_tokens=50,
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "resp",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"capital": {"type": "string"}},
                    "required": ["capital"],
                    "additionalProperties": False,
                },
            },
        },
    )

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    f1 = executor.submit(plain_request)
    f2 = executor.submit(structured_request)
    print(f1.result())  # Both fail with 500
    print(f2.result())

Both requests return 500 and the EngineCore process dies. The server does not recover.

Proposed fix: On CPU, pass out_indices as a plain Python list directly instead of converting to a pinned tensor:

indices: torch.Tensor | list[int] | None = None
if not skip_out_indices:
    if logits.device.type == "cpu":
        indices = out_indices
    else:
        indices = torch.tensor(
            out_indices, dtype=torch.int32, device="cpu", pin_memory=True,
        )
        indices = indices.to(logits.device, non_blocking=True)

xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=indices)

I'll open a PR with this fix.

extent analysis

Fix Plan

To resolve the issue, we need to modify the apply_grammar_bitmask() function in vllm/v1/structured_output/utils.py. The proposed fix involves passing out_indices as a plain Python list directly instead of converting to a pinned tensor when running on CPU.

Step-by-Step Solution:

  1. Modify the apply_grammar_bitmask() function to handle CPU and non-CPU devices differently.
  2. Remove the hardcoded pin_memory=True for CPU devices, as it requires a CUDA or other accelerator backend.
  3. Pass out_indices as a Python list when the device is CPU.

Example code snippet:

indices: torch.Tensor | list[int] | None = None
if not skip_out_indices:
    if logits.device.type == "cpu":
        indices = out_indices  # Pass as a Python list for CPU
    else:
        indices = torch.tensor(
            out_indices, dtype=torch.int32, device="cpu", pin_memory=True,
        )
        indices = indices.to(logits.device, non_blocking=True)

xgr.apply_token_bitmask_inplace(logits, grammar_bitmask, indices=indices)

Verification

To verify that the fix worked:

  1. Run the reproduction script against a CPU-only server built with the modified apply_grammar_bitmask() function.
  2. Confirm that neither the plain nor the structured request returns 500 and that the EngineCore process stays alive.
  3. Check that the structured response is valid JSON matching the schema (an object with a string "capital" field) and that the plain response completes normally.
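
Because the original failure killed the EngineCore, it can also help to confirm that the server is still responsive after the requests finish. The sketch below polls the server's /health endpoint using only the standard library. The helper name server_is_healthy and the injectable opener parameter are illustrative choices, and the host and port are assumed to match the reproduction script.

```python
import urllib.error
import urllib.request


def server_is_healthy(base_url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with opener(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


# Usage after running the reproduction script (assumed host and port):
#     server_is_healthy("http://localhost:8000")
```

The opener parameter exists only so the function can be exercised without a running server; in normal use the default is sufficient.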

Extra Tips

  • Branch on the device type (for example logits.device.type) before allocating tensors whose options depend on the backend.
  • pin_memory=True requires a CUDA or other accelerator backend; on CPU-only systems it raises the RuntimeError shown above, so use it only on the accelerator path.
  • Test mixed batches (concurrent structured and plain requests) on both CPU and GPU deployments to guard against regressions.
