vllm - [Bug] PR #36138 grammar-mask spec-decode fix doesn't handle multi-token reasoning boundaries (gpt-oss/openai_gptoss still bleeds; Qwen3 fixed)



Symptom

PR #36138 ("Grammar was ignored when reasoning ended within speculated tokens") fixes the spec-decode × structured-output bypass for reasoning parsers whose end marker is a single token (e.g., the qwen3 parser's </think> — tested by the PR with Qwen/Qwen3-8B and confirmed by me with Qwen/Qwen3.6-35B-A3B-FP8). It does not fix parsers whose end marker is multi-token — most notably gpt-oss (openai_gptoss), where the boundary is <|channel|>final<|message|> (3-4 tokens after tokenization).

Effect in practice: a vLLM build at PR #36138's HEAD SHA (94f4dc2e98d63dc1a3ff7cca2b35a9235df667e1) serving openai/gpt-oss-120b with EAGLE3 + response_format: {type: "json_schema", strict: true} still produces ~27% of responses with garbage-prefix-before-JSON (e.g. **Best{...}, (no{...}, chunk_0{...}) — identical pattern to pre-patch behavior. By contrast, Qwen3.6-35B-A3B-FP8 (using the qwen3 reasoning parser) + MTP + the same patched vLLM shows 0% prefix-bleed.

Reproducer (gpt-oss, fails)

vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 2 \
  --reasoning-parser openai_gptoss \
  --max-model-len 131072 \
  --enable-prefix-caching \
  --speculative-config '{"model":"<eagle3-draft>","num_speculative_tokens":3,"method":"eagle3","draft_tensor_parallel_size":1}'

Send a chat completion with response_format: {type: "json_schema", json_schema: {strict: true, schema: {...}}}. ~27% of responses come back with non-JSON prefixes glued to otherwise-valid JSON: (no{"relevantChunkIds":[]}, **Best{"results":[...]}, etc.
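
One way to count affected responses is to check whether each returned message body parses as JSON on its own, and if not, whether it parses once a leading non-JSON fragment is dropped. The following is a minimal sketch under stated assumptions: `has_prefix_bleed` is a hypothetical helper name (not part of vLLM), and the schema's top-level value is assumed to be an object or array.

```python
import json


def has_prefix_bleed(content: str) -> bool:
    """Return True if otherwise-valid JSON is preceded by non-JSON text.

    Hypothetical helper, not part of vLLM. Assumes the schema's top-level
    value is a JSON object or array.
    """
    text = content.strip()
    try:
        json.loads(text)
        return False  # the whole body parses: no bleed
    except json.JSONDecodeError:
        pass
    # Find the first '{' or '[' that is not at position 0.
    starts = [p for p in (text.find("{"), text.find("[")) if p > 0]
    if not starts:
        return False  # malformed in some other way, not a prefix bleed
    try:
        json.loads(text[min(starts):])
    except json.JSONDecodeError:
        return False
    return True
```

Applied to the examples above, `has_prefix_bleed('(no{"relevantChunkIds":[]}')` is True, while a clean body such as `'{"relevantChunkIds":[]}'` gives False.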

Counter-test (Qwen3.6 with qwen3 parser, works)

Same vLLM build, same response_format request, with Qwen3.6-35B-A3B-FP8 weights and the qwen3 reasoning parser:

vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

0% prefix-bleed across 150 production prompts. Confirms the PR fix engages and works correctly for single-token boundaries.

Root cause analysis

The PR's transition detector in vllm/v1/structured_output/__init__.py::_find_reasoning_end_in_tokens scans for the boundary only within spec_token_ids (the 2-5 token speculative batch):

def _find_reasoning_end_in_tokens(self, token_ids: list[int]) -> int | None:
    if self.reasoner is None or self.enable_in_reasoning:
        return None
    for i, token in enumerate(token_ids):
        prefix = token_ids[: i + 1]
        if self.reasoner.is_reasoning_end_streaming(prefix, [token]):
            return i
    return None
  • For the qwen3 parser (</think> = 1 token; covers Qwen3.x family including 3.6): the boundary token appears in spec_token_ids whenever the model crosses it. Detection works.
  • For openai_gptoss (<|channel|>final<|message|> = 3-4 tokens after tokenization): the full sequence is almost never fully contained in a 2-5 token spec batch. is_reasoning_end_streaming(small_prefix) falls back to the base is_reasoning_end(small_prefix) which scans the prefix in isolation — never finds the boundary.

The boundary would be detectable by is_reasoning_end_streaming if the detector passed it prior context that includes the analysis-channel tokens preceding the spec batch. Currently it only sees spec_token_ids.
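
The failure mode can be reproduced with a toy model of the scan. This sketch is not vLLM code: the token IDs are made up, and `contains_marker` assumes the fallback check reduces to a subsequence search over whatever IDs it is handed, as described above.

```python
from __future__ import annotations

# Made-up token IDs standing in for <|channel|>final<|message|>.
END_MARKER = [200005, 17196, 200008]


def contains_marker(token_ids: list[int], marker: list[int]) -> bool:
    """Assumed fallback behaviour: search for the marker in the given IDs only."""
    n = len(marker)
    return any(token_ids[i:i + n] == marker for i in range(len(token_ids) - n + 1))


def scan_batch_only(spec_token_ids: list[int]) -> int | None:
    """Mirror of the detector's loop: each prefix is cut from the spec batch alone."""
    for i in range(len(spec_token_ids)):
        if contains_marker(spec_token_ids[: i + 1], END_MARKER):
            return i
    return None


# The marker begins in an already-accepted token and completes inside the batch.
accepted = [11, 12, 200005]
spec_batch = [17196, 200008, 90]

missed = scan_batch_only(spec_batch)           # None: first marker token is out of view
contained = scan_batch_only([7] + END_MARKER)  # 3: marker lies wholly inside the batch
visible_with_context = contains_marker(accepted + spec_batch, END_MARKER)  # True
```

The same tokens are found as soon as the already-accepted tokens are part of the searched sequence, which is the gap the detector currently has.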

What I tried that doesn't help

I overrode reasoning_start_str / reasoning_end_str on the GptossReasoningParser to silence the auto-init warning and populate ReasoningConfig._reasoning_end_token_ids. This doesn't change the detector's behavior because _find_reasoning_end_in_tokens doesn't consume those token IDs — it calls is_reasoning_end_streaming directly on the small spec batch.

Proposed fix direction

identify_constrained_draft_tokens already has access to request (so it has request.all_token_ids). The detector could be passed the prior context:

def _find_reasoning_end_in_tokens(
    self,
    spec_token_ids: list[int],
    prior_token_ids: list[int] | None = None,  # tokens already accepted before the spec batch
) -> int | None:
    if self.reasoner is None or self.enable_in_reasoning:
        return None
    prior = list(prior_token_ids or [])
    for i, token in enumerate(spec_token_ids):
        # Pass the cumulative input so multi-token boundaries can be detected
        # when they straddle the (prior, spec_batch) boundary.
        full_prefix = prior + spec_token_ids[: i + 1]
        if self.reasoner.is_reasoning_end_streaming(full_prefix, [token]):
            return i
    return None

Then identify_constrained_draft_tokens passes request.all_token_ids as prior_token_ids (the tokens accepted up to but excluding the spec batch).

This generalizes the PR cleanly to multi-token boundaries without changing single-token behavior.
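
A standalone sketch of the proposed loop, exercised against toy parsers, illustrates both claims. Everything here is an assumption for illustration: `MarkerReasoner` and `find_reasoning_end` are hypothetical names, the token IDs are made up, and the real parsers may implement `is_reasoning_end_streaming` differently.

```python
from __future__ import annotations


class MarkerReasoner:
    """Toy parser: reasoning ends once `marker` appears in input_ids."""

    def __init__(self, marker: list[int]) -> None:
        self.marker = marker

    def is_reasoning_end_streaming(self, input_ids: list[int], delta_ids: list[int]) -> bool:
        n = len(self.marker)
        return any(input_ids[i:i + n] == self.marker
                   for i in range(len(input_ids) - n + 1))


def find_reasoning_end(
    reasoner: MarkerReasoner,
    spec_token_ids: list[int],
    prior_token_ids: list[int] | None = None,
) -> int | None:
    """Standalone version of the proposed detector loop."""
    prior = list(prior_token_ids or [])
    for i, token in enumerate(spec_token_ids):
        # Cumulative prefix: accepted tokens plus the spec tokens up to index i.
        if reasoner.is_reasoning_end_streaming(prior + spec_token_ids[: i + 1], [token]):
            return i
    return None


multi = MarkerReasoner([900, 901, 902])  # made-up 3-token end marker
single = MarkerReasoner([777])           # made-up 1-token end marker

# Marker starts in the accepted tokens and completes at spec index 1.
straddling = find_reasoning_end(multi, [901, 902, 5], prior_token_ids=[1, 2, 900])
# Single-token marker: same index with or without prior context.
single_with_prior = find_reasoning_end(single, [4, 777, 5], prior_token_ids=[1, 2])
single_without_prior = find_reasoning_end(single, [4, 777, 5])
```

In this toy setup `straddling` is 1 and both single-token results are 1. The sketch assumes the detector is only called while reasoning is still in progress, so the prior tokens do not already contain a complete end marker.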

Test data

Run | Variant | Prefix-bled rate
--- | --- | ---
N | gpt-oss + EAGLE3 + json_schema (pre-PR) | 51% (156/300)
A | gpt-oss + EAGLE3 + PR coherent build | 27% (81/300)
B | gpt-oss + EAGLE3 + PR + reasoning_start/end_str port | 26% (79/300)
C | Qwen3.6-35B-A3B-FP8 (--reasoning-parser qwen3) + MTP + PR coherent build | 0% (0/150)

(N is the pre-PR baseline; A, B, and C use a from-source build of PR head SHA 94f4dc2e.)

Build notes

For anyone needing to reproduce: vllm/docker/Dockerfile builds cleanly at the PR SHA with --build-arg RUN_WHEEL_CHECK=false --build-arg torch_cuda_arch_list=9.0a --build-arg max_jobs=24 --build-arg nvcc_threads=2 on a 48-vCPU / 192 GB box in ~40 min. The default vllm-openai target wheel is 630 MB, which exceeds the 500 MB CI sanity threshold, hence RUN_WHEEL_CHECK=false.

cc @sfbemerk @njhill — original PR author and reviewer.
