vllm - [Bug]: xgrammar bitmask lets <end_of_turn> escape during structured outputs, terminating generation mid-JSON (Gemma 4)

Error Message

ERROR scheduler.py:1421] Unexpected: grammar rejected tokens [106] for request <id>. Terminating request.
ERROR serving.py:388]   Request <id> failed with an internal error during generation
INFO  POST /v1/chat/completions HTTP/1.1  500 Internal Server Error

Current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.2 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : 18.1.3 (1ubuntu1)
CMake version                : Could not collect
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, May 26 2025, 18:50:19) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.17.0-19-generic-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : Could not collect (diagnostic context; vLLM at runtime runs on RTX 5090, Blackwell sm_120)
Nvidia driver version        : 575.57.08 (from CUDA_DRIVER_VERSION env)
cuDNN version                : 9.10.2
HIP runtime version          : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:           x86_64
CPU(s):                 96
Vendor ID:              GenuineIntel
Model name:             Intel(R) Xeon(R) 6527P
Core(s) per socket:     24
Socket(s):              2

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.3.5
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-ml-py==12.575.51
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.11.0
[pip3] torchaudio==2.11.0
[pip3] torchvision==0.26.0
[pip3] transformers==5.8.0
[pip3] triton==3.6.0

==============================
         vLLM Info
==============================
vLLM Version                 : 0.20.1
vLLM Build Flags             : CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
Container                    : nvcr.io/nvidia/tritonserver:25.06-vllm-python-py3

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
CUDA_ARCH_LIST=7.5 8.0 8.6 9.0 10.0 12.0
CUDA_VERSION=12.9.1.010
CUDA_DRIVER_VERSION=575.57.08
CUDNN_VERSION=9.10.2.21
NCCL_VERSION=2.27.3
VLLM_CACHE_ROOT=/usr/src/models
</details>

🐛 Describe the bug

When using structured_outputs (JSON schema) with Gemma 4, the model occasionally samples <end_of_turn> (token id 106) while the grammar FSM is still inside a JSON string value, which terminates generation mid-string and produces an unterminated JSON response. xgrammar's bitmask should forbid <end_of_turn> until the grammar reaches its terminal state, but in v0.20.1 it does not.

Symptom

Sometimes Gemma 4 wanders into garbage tokens like @{!} mid-string, and then samples <end_of_turn>. vLLM honors it as a stop signal. A representative truncated response:

{"field_a": "...", "field_b": "...", "field_c": "@{!}

Returned with:

finish_reason: "stop"
stop_reason:    106        ← <end_of_turn>

The grammar FSM is clearly not at a terminal state here (open string, missing closing ", missing fields, missing }), yet <end_of_turn> was sampled successfully.
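A caller can flag this symptom on the client side. The following is a minimal sketch, assuming the response choice is available as a plain dict with finish_reason, stop_reason and message.content keys; the helper name looks_truncated is hypothetical and is not part of vLLM.

```python
import json

END_OF_TURN_ID = 106  # <end_of_turn> token id reported in this issue


def looks_truncated(choice: dict) -> bool:
    """True when a choice stopped on <end_of_turn> but its content is not valid JSON."""
    stopped_on_eot = (
        choice.get("finish_reason") == "stop"
        and choice.get("stop_reason") == END_OF_TURN_ID
    )
    try:
        json.loads(choice["message"]["content"])
        parses = True
    except (json.JSONDecodeError, KeyError, TypeError):
        parses = False
    return stopped_on_eot and not parses


# The truncated response shown above, wrapped in an assumed choice layout.
bad = {
    "finish_reason": "stop",
    "stop_reason": 106,
    "message": {"content": '{"field_a": "...", "field_b": "...", "field_c": "@{!}'},
}
print(looks_truncated(bad))
```

This only detects the failure after the fact; it does not prevent <end_of_turn> from being sampled.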

Trigger conditions

The bug reproduces at a high rate (~20–40% across runs of 10–20 samples) under this combination:

  • Gemma 4 31B (dense), specifically the NVFP4 quantized variant. The MoE 26B-A4B is less susceptible (likely because per-token compute uses fewer weights, so quantization noise per sampling step is smaller), but the same bug class affects it.
  • A structured_outputs JSON schema with multiple required string fields.
  • vLLM running without --reasoning-parser gemma4 and with enable_thinking=false in chat template kwargs (non-thinking mode). This is distinct from #39130 / #39138, which sit in the reasoning-parser code path.

Happy to provide a minimal Python reproducer on request.
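In the meantime, a request body matching the conditions above might look roughly like the sketch below. This is not the reproducer mentioned above. The model name and prompt are hypothetical placeholders, the schema reuses the field names from the truncated response, and the structured_outputs request shape is an assumption that should be checked against the vLLM version in use.

```python
import json

# Hypothetical placeholder; substitute the name the server was launched with.
MODEL = "gemma-4-31b-nvfp4"

# Multiple required string fields, as in the trigger conditions.
SCHEMA = {
    "type": "object",
    "properties": {
        "field_a": {"type": "string"},
        "field_b": {"type": "string"},
        "field_c": {"type": "string"},
    },
    "required": ["field_a", "field_b", "field_c"],
    "additionalProperties": False,
}


def build_payload(prompt: str, ignore_eos: bool = False) -> dict:
    """Build a /v1/chat/completions body for non-thinking mode with a JSON schema."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "structured_outputs": {"json": SCHEMA},  # assumed request shape
        "chat_template_kwargs": {"enable_thinking": False},
    }
    if ignore_eos:
        payload["ignore_eos"] = True
    return payload


print(json.dumps(build_payload("Fill in the three fields."), indent=2))
```

Sending this body 10–20 times and counting responses that fail to parse would give a rough failure rate for comparison.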

Confirmation that xgrammar is aware EOT should be rejected

If you add "ignore_eos": true to the request, vLLM crashes the request with a 500. The vLLM scheduler logs show:

ERROR scheduler.py:1421] Unexpected: grammar rejected tokens [106] for request <id>. Terminating request.
ERROR serving.py:388]   Request <id> failed with an internal error during generation
INFO  POST /v1/chat/completions HTTP/1.1  500 Internal Server Error

This is the interesting part: when ignore_eos=true forces the model to keep sampling past EOS, xgrammar does reject token 106 (presumably because the grammar FSM is non-terminal), but the scheduler escalates the rejection to a 500 instead of resampling. Both behaviors point to the same underlying issue: special EOS-class tokens are not masked by the grammar bitmask in non-terminal FSM states. Without ignore_eos, EOS slips through and stops generation; with ignore_eos, EOS is rejected at the scheduler and the request fails.

Hypothesis

In vllm/v1/structured_output/, the grammar bitmask is built per-step from the FSM's current state. For Gemma-style models, <end_of_turn> (id 106) is the chat-template EOS marker, not part of the grammar vocabulary. It looks like the bitmask construction treats EOS-class tokens as always-allowed regardless of FSM state. Two plausible fixes (likely the same fix, two angles):

  1. When the FSM is in a non-terminal state, set bitmask = 0 for all EOS / stop tokens.
  2. When ignore_eos=true is combined with structured outputs and the grammar rejects EOS, resample instead of escalating to Unexpected and 500-ing the request.
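Fix 1 can be illustrated on a toy packed bitmask. The sketch below is not vLLM or xgrammar code: it assumes a mask stored as 32-bit words in which a set bit means the token is allowed, and the function names and the second EOS id are hypothetical.

```python
# 106 is <end_of_turn> per this issue; 1 is a hypothetical <eos> id for illustration.
EOS_CLASS_IDS = [1, 106]


def mask_eos_when_non_terminal(bitmask_row, eos_ids, is_terminal):
    """Clear the bits for eos_ids unless the grammar is allowed to terminate."""
    out = list(bitmask_row)
    if is_terminal:
        return out
    for tok in eos_ids:
        word, bit = divmod(tok, 32)
        out[word] &= ~(1 << bit) & 0xFFFFFFFF
    return out


def is_allowed(bitmask_row, tok):
    """Read a single token's bit from the packed mask."""
    word, bit = divmod(tok, 32)
    return bool((bitmask_row[word] >> bit) & 1)


vocab_size = 256
row = [0xFFFFFFFF] * (vocab_size // 32)  # start with every token allowed
masked = mask_eos_when_non_terminal(row, EOS_CLASS_IDS, is_terminal=False)
print(is_allowed(masked, 106), is_allowed(masked, 105))
```

Only the EOS-class bits change, so whatever the grammar already decided for ordinary tokens is left untouched.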

Why this is impactful

Without the fix, any model that occasionally emits EOS mid-string under grammar constraints will produce invalid JSON that callers cannot parse. The grammar's central guarantee — "the output will match the schema" — silently fails.

Related issues / PRs

The following issues describe adjacent symptoms or different mechanisms in the same area. To my reading, none of them currently track the EOS-mid-string bitmask issue specifically:

  • #40080 — Gemma 4 infinite repetition with structured output (closed; documents the symptom from a repetition angle)
  • #40097 / #40099 — proposed auto-enabling repetition detection for structured outputs (open; partial mitigation, doesn't fix the EOS leak)
  • #40911 — tool call leaks into content (different leak mechanism)
  • #39130 / #39138 — --reasoning-parser gemma4 silently disables xgrammar (different code path)
  • #29632 — RFC: force EOS at grammar terminal (closed, not planned; would fix the opposite direction — mask non-EOS after termination)
  • #27210, #29379 — FSM advancement / rollback bugs in adjacent paths

Expected behavior

While the grammar FSM is in a non-terminal state, EOS-class special tokens (<end_of_turn>, <eos>, etc.) should be masked to zero in the bitmask, so they are never sampled. The model should only be able to terminate after the JSON has been closed (FSM reaches the accept state).
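The effect of that masking on sampling can be sketched with a toy four-token vocabulary. Here index 3 stands in for <end_of_turn> (it is not the real id 106), the logits are made up, and masked_probs is a hypothetical helper rather than anything in vLLM.

```python
import math
import random


def masked_probs(logits, allowed):
    """Softmax over logits with disallowed tokens forced to probability zero."""
    if not any(allowed):
        raise ValueError("the mask must allow at least one token")
    masked = [l if ok else -math.inf for l, ok in zip(logits, allowed)]
    peak = max(masked)
    exps = [math.exp(l - peak) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]


TOY_EOT = 3                      # stand-in index for <end_of_turn>
logits = [0.5, 1.0, 0.2, 4.0]    # the model strongly prefers the stand-in EOT
allowed = [True, True, True, False]  # non-terminal state: EOT masked out

probs = masked_probs(logits, allowed)
rng = random.Random(0)
samples = rng.choices(range(len(probs)), weights=probs, k=1000)
print(probs[TOY_EOT], TOY_EOT in samples)
```

Even though the stand-in EOT has the largest logit, its probability after masking is zero, so it cannot be drawn until the mask allows it.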
