vllm - ✅(Solved) Fix [Bug]: ngram speculative decoding default prompt_lookup_min=2 causes tool-call output corruption on Qwen3-class models with structured output (config-only fix: prompt_lookup_min=8) [6 pull requests, 1 participant]

GitHub stats
vllm-project/vllm#40875 (fetched 2026-04-26 05:06:16)
Comments: 0 · Participants: 1 · Timeline: 20 · Reactions: 1
Timeline (top): mentioned ×9, subscribed ×9, cross-referenced ×2

When using --speculative-config '{"method":"ngram","prompt_lookup_min":2,...}' (the default prompt_lookup_min) with a Qwen3-class model AND tool_choice=auto requests AND a tools array in the system prompt, tool-call output is corrupted in ~50% of requests even on a stack with all known related fixes applied (#40738 Phase 1+2, #36138, #40783, #39055).

The corruption manifests as wrong-token cascades like <tool_call>\n<<tool_code>\nprint(...), parameter=parameter=name, <<argname>...<argvalue> instead of the correct chat-template format <tool_call>\n<<function=NAME>\n<parameter=KEY>\nVALUE\n</parameter>....

Config-only workaround (no code changes): set prompt_lookup_min=8. Achieves 100% clean tool-call rate (n=30 single-query, 96% n=25 multi-query) on the same hardware/model where default prompt_lookup_min=2 gave ~50%.

This is a separate bug class from PR #40738 (GDN state corruption, which we backported and confirmed gives +30-40% improvement). The remaining ~50% residual that the four upstream PRs don't close is the rejection-sampling artifact described here.

Root Cause

The chat template defines the tool-call format using XML markers that include <<function=, <parameter=, </parameter>, </function>. These tokens appear MULTIPLE TIMES in the system prompt's tool definitions section.

When ngram with prompt_lookup_min=2 is asked "what comes after [27, 27] (two <)?", the algorithm searches for any 2-token suffix match in the entire context. It finds matches in:

  1. The actual <<function=...> template definition (correct match — would draft function)
  2. The <tool_call> opening tag (matches [27, ...] partially → can draft tool)
  3. Any literal << markers in user content
  4. Multi-tool definitions (each with its own <<function=NAME> line)

The KMP-style longest-prefix-suffix matching in vllm/v1/spec_decode/ngram_proposer.py returns ONE of these matches (the longest, but ties are broken by position) and drafts the next k tokens from there.
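To make the matching step concrete, here is a minimal sketch of prompt-lookup drafting. It is a plain linear scan, not the KMP-based code in vllm/v1/spec_decode/ngram_proposer.py. The function name propose_ngram, the earliest-match tie-break, the filler token 11, and the layout of the example context are all assumptions for illustration; only the token IDs 27, 1628, 13766, 4000, 29, 198, and 248058 come from the traces in this report.

```python
from typing import List, Optional


def propose_ngram(context: List[int], min_n: int, max_n: int, k: int) -> Optional[List[int]]:
    """Draft up to k tokens by matching the context's suffix against earlier text.

    Tries the longest suffix first (max_n down to min_n). Among equally long
    matches the earliest occurrence wins, which is an assumption of this sketch.
    """
    for n in range(min(max_n, len(context) - 1), min_n - 1, -1):
        suffix = context[-n:]
        # Scan every start position except the suffix's own position.
        for start in range(len(context) - n):
            if context[start:start + n] == suffix:
                # At least one token always follows an earlier match.
                return context[start + n:start + n + k]
    return None


LT, FUNCTION, TOOL = 27, 1628, 13766
context = [
    LT, LT, TOOL, 4000, 29,                 # spurious "<<tool_code>" fragment (hypothetical placement)
    11,                                     # hypothetical filler token
    LT, LT, FUNCTION, 27362, 67017, 29,     # real "<<function=get_weather>" line
    248058, 198, LT, LT,                    # generated so far: "<tool_call>\n<<"
]

draft_min2 = propose_ngram(context, min_n=2, max_n=4, k=3)
draft_min8 = propose_ngram(context, min_n=8, max_n=10, k=3)
print("min=2 draft:", draft_min2)
print("min=8 draft:", draft_min8)
```

With a 2-token minimum the sketch drafts from the spurious fragment (tool, _code, >), while an 8-token minimum finds no match and drafts nothing.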

The rejection sampler then verifies each drafted token against the target model's logits. For the function vs tool decision at position [4], the target model has function at probability ~0.7 and tool at probability ~0.2 (both are valid template-context tokens). The rejection sampler accepts a draft token when target_prob / draft_prob >= random(). Since ngram doesn't provide draft probabilities, the comparison effectively reduces to "is target_prob above some threshold". For per-token accept rate ≈ 0.8, the structural ceiling for clean rate becomes:

clean_rate ≈ (per_token_accept_rate)^num_speculative_tokens
0.8^3 ≈ 0.51 ← matches our 53% empirical measurement with default config
0.8^1 = 0.80 ← matches our 65% with num_spec=1 (statistical variance)
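The arithmetic above can be checked with a few lines of Python. This sketch only evaluates the report's own model, which assumes each speculative position is handled cleanly with the same independent probability; the function name clean_rate_ceiling and the extra k=5 row are illustrative additions.

```python
def clean_rate_ceiling(per_token_accept_rate: float, num_speculative_tokens: int) -> float:
    """Probability that all k independent speculative positions come out clean."""
    return per_token_accept_rate ** num_speculative_tokens


for k in (1, 3, 5):
    print(f"num_speculative_tokens={k}: ceiling={clean_rate_ceiling(0.8, k):.3f}")
```

Under this model the ceiling falls quickly as num_speculative_tokens grows, which is consistent with the report's observation that fewer speculative tokens gave a higher clean rate.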

Why prompt_lookup_min=8 fixes it

Setting prompt_lookup_min=8 requires ngram to find an 8-token suffix match before drafting. For tool-call output, the 8-token suffix [27, 27, 1628, 27362, 67017, 29, 198, 27] (<<function=get_weather>\n<) is unique enough that it almost never matches spurious system-prompt fragments — it only matches the FIRST tool definition (where the model is currently reproducing the template) OR doesn't match at all.

Result: ngram drafts almost nothing on tool-call requests, producing close to no-spec correctness. Natural-language repetitions (where 8+ token matches do occur, e.g. repeated phrases in long text) still get speculative speedup.
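As a convenience, the workaround can be assembled programmatically so the JSON quoting stays correct. This is a small sketch using only the standard library; the values are the ones this report uses for the workaround, and the min/max sanity check is an addition of the sketch.

```python
import json
import shlex

spec_config = {
    "method": "ngram",
    "num_speculative_tokens": 3,
    "prompt_lookup_max": 10,
    "prompt_lookup_min": 8,
}

# Sanity check for this sketch: the minimum match length cannot exceed the maximum.
assert spec_config["prompt_lookup_min"] <= spec_config["prompt_lookup_max"]

# Compact JSON, shell-quoted so it can be pasted after `vllm serve <model>`.
flag = "--speculative-config " + shlex.quote(json.dumps(spec_config, separators=(",", ":")))
print(flag)
```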

Fix Action

Fix / Workaround

Config-only workaround (no code changes): set prompt_lookup_min=8. Achieves 100% clean tool-call rate (n=30 single-query, 96% n=25 multi-query) on the same hardware/model where default prompt_lookup_min=2 gave ~50%.

Full integrated patch tree + investigation reports: Sandermage/genesis-vllm-patches @ 852b73f. Specific files relevant to this bug report:

PR fix notes

PR #40738: [Bugfix] Fix GDN conv + SSM state corruption with ngram spec decode

Description (problem / solution / changelog)

Summary

Fix output corruption when using ngram speculative decoding with hybrid GDN models (e.g., Qwen3.5) in mamba_cache_mode="none".

After a spec decode step accepts N tokens, the next non-spec decode step must read SSM state from block N-1 and conv state from an offset position. Two bugs prevented this:

  1. num_accepted_tokens was not passed to SSM metadata builders on non-spec steps
  2. causal_conv1d_fn had no mechanism to offset conv state reads based on accepted tokens

Changes

  • gdn_attn.py: Compute spec_decode_src_indices for SSM state correction; pad num_accepted_tokens with 1s for prefill sequences in mixed batches
  • gdn_linear_attn.py: Pre-copy SSM state from accepted block to block 0; pass num_accepted_tokens to conv kernels gated on whether correction is needed
  • causal_conv1d.py: Add IS_SPEC_DECODING path to _causal_conv1d_fwd_kernel that offsets conv state reads/writes by num_accepted_tokens - 1
  • gpu_model_runner.py: Pass num_accepted_tokens to GDN/Mamba2 builders on non-spec steps
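The "read state from block N-1" rule can be illustrated with a toy recurrence. This is a sketch under strong simplifications: a scalar integer recurrence stands in for the SSM state, the conv-state offset is not modeled, and the helper names scan_states and state_after_acceptance are hypothetical rather than taken from the PR.

```python
from typing import List


def scan_states(initial: int, tokens: List[int]) -> List[int]:
    """Toy recurrent scan: state_t = 2 * state_{t-1} + token_t; returns the state after each token."""
    states, state = [], initial
    for token in tokens:
        state = 2 * state + token
        states.append(state)
    return states


def state_after_acceptance(block_states: List[int], num_accepted_tokens: int) -> int:
    """Select the state written after the last accepted token (block N-1)."""
    if not 1 <= num_accepted_tokens <= len(block_states):
        raise ValueError("num_accepted_tokens out of range")
    return block_states[num_accepted_tokens - 1]


initial = 1
step_tokens = [5, 7, 9, 11]          # tokens processed in one spec step (hypothetical values)
block_states = scan_states(initial, step_tokens)
num_accepted = 2                     # only the first two tokens survive verification

resumed = state_after_acceptance(block_states, num_accepted)
baseline = scan_states(initial, step_tokens[:num_accepted])[-1]
stale = block_states[-1]             # last block: includes the rejected tokens
print(resumed, baseline, stale)
```

Resuming from block N-1 reproduces the state a non-speculative run would have reached, whereas reading a block that includes rejected tokens gives a different state.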

Test plan

  • Single-prompt: baseline vs ngram match token-for-token (Qwen3.5-0.8B)
  • Mixed-batch: short prompt matches baseline; long prompt generates coherent output
  • Kernel fix verified necessary: disabling conv offset causes regression
  • Existing GDN + spec decode CI tests
<details> <summary>Reproducers</summary>

Single-prompt:

from vllm import LLM, SamplingParams
MODEL, PROMPT = "Qwen/Qwen3.5-0.8B", "<code>\nclass Calculator:\n    def add(self, a, b):\n        return a + b\n</code>\n<update>\nAdd subtract and multiply methods\n</update>"
ARGS = dict(model=MODEL, trust_remote_code=True, enforce_eager=True, enable_chunked_prefill=True, max_model_len=4096)
SPEC = {"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 10, "prompt_lookup_min": 2}
S = SamplingParams(max_tokens=200, temperature=0)
b = list(LLM(**ARGS).generate([PROMPT], S)[0].outputs[0].token_ids)
n = list(LLM(**ARGS, speculative_config=SPEC).generate([PROMPT], S)[0].outputs[0].token_ids)
print("PASS" if b == n else "FAIL")

Mixed-batch (short + long prompt, max_num_batched_tokens=64): see reproduce_gdn_ngram_mixed.py in the branch.

</details>

Fixes #39273

AI-assisted: Yes (Claude). Not duplicating any existing PR.

Changed files

  • vllm/model_executor/layers/mamba/gdn_linear_attn.py (modified, +27/-0)
  • vllm/model_executor/layers/mamba/ops/causal_conv1d.py (modified, +20/-2)
  • vllm/v1/attention/backends/gdn_attn.py (modified, +32/-2)
  • vllm/v1/worker/gpu_model_runner.py (modified, +11/-0)

PR #36138: [Bugfix] Grammar was ignored when reasoning ended within speculated tokens

Description (problem / solution / changelog)

Purpose

This PR attempts to fix a bug (#31858, #34650) when Speculative Decoding (such as MTP), Reasoning, and Structured Output / Grammar are used in combination: typically, grammar is not enabled during reasoning but only for the final answer. However, when the reasoning end token is generated, any subsequent draft tokens are not validated against the grammar, leading to an invalid final answer.
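The intended behavior can be sketched as follows: once the reasoning end token appears inside a batch of draft tokens, every later draft token is checked against the grammar and the draft is cut at the first violation. This is an illustrative sketch, not the scheduler code from the PR. Tokens are plain strings, and validate_drafts and the grammar_accepts callback are hypothetical stand-ins for a real grammar matcher.

```python
from typing import Callable, List


def validate_drafts(
    draft: List[str],
    reasoning_end: str,
    grammar_accepts: Callable[[List[str]], bool],
) -> List[str]:
    """Keep the longest draft prefix whose post-reasoning tokens satisfy the grammar."""
    kept: List[str] = []
    answer: List[str] = []
    in_answer = False
    for token in draft:
        if in_answer:
            # Tokens after the reasoning end must extend a grammar-valid answer.
            if not grammar_accepts(answer + [token]):
                break
            answer.append(token)
        elif token == reasoning_end:
            in_answer = True
        kept.append(token)
    return kept


def starts_with_brace(tokens: List[str]) -> bool:
    """Toy grammar: the answer must begin with an opening brace."""
    return tokens[0] == "{"


fenced = validate_drafts(["done", "</think>", "```json", "{"], "</think>", starts_with_brace)
plain = validate_drafts(["done", "</think>", "{", '"name"'], "</think>", starts_with_brace)
print(fenced)
print(plain)
```

In the first call the markdown fence is rejected and the draft is truncated right after the reasoning end; in the second call all draft tokens are kept.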

Test Plan

In general, the bug seems to be independent of the specific SpecDecode method; originally I had observed it with DeepSeek models and MTP, but for testing I recommend a smaller model like Qwen3-8B and using the same model as draft model. This way, we have high acceptance rates for our tests and a high likelihood that the original bug appears.

vllm serve "Qwen/Qwen3-8B" \
  --max-model-len 40960 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"draft_model","model":"Qwen/Qwen3-8B","num_speculative_tokens":5}'

The test request should have response_format=json_schema and a prompt that lures the model into generating output that is not pure JSON, e.g.

<details> <summary>example payload</summary> <code>
{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {
      "role": "user",
      "content": "Imagine a Fantasy hero (10). Return valid json, wrapped in markdown fences: ```json\n[...]\n```"
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "hero",
      "schema": {
        "$defs": {
          "CharacterRole": {"enum": ["mage", "warrior", "healer"], "title": "CharacterRole", "type": "string"}
        },
        "properties": {
          "name": {"description": "Character name", "title": "Name", "type": "string"},
          "age": {"description": "Character age", "title": "Age", "type": "integer"},
          "role": {"allOf": [{"$ref": "#/$defs/CharacterRole"}], "description": "Character class"}
        },
        "required": ["name", "age", "role"],
        "title": "Character",
        "type": "object"
      }
    }
  }
}
</code> </details>

The original bug can also be reproduced with Model Runner V2, and this bugfix works there as well. For testing, choose a different speculative method (since draft_model is not supported yet):

VLLM_USE_V2_MODEL_RUNNER=1 vllm serve "Qwen/Qwen3-8B" \
  --max-model-len 40960 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"eagle3","model":"RedHatAI/Qwen3-8B-speculator.eagle3","num_speculative_tokens":5}'

The original bug is still present in vllm v0.17.1.

Test Result

Without the bugfix, the content field contains invalid JSON, e.g. because of markdown fences:

"content": "```json\n{\n\n\"name\": \"Eldrin the Flameheart\",\n\"age\": 32,\n\"role\": \"warrior\"\n}```"

With the bugfix, the content field contains valid JSON that satisfies the requested grammar:

"content": "{\n\n\"name\": \"Eldrin the Flameheart\",\n\"age\": 32,\n\"role\": \"warrior\"\n}"

I am happy to receive feedback and suggestions on how to improve the PR: the interplay of spec decode, grammar, reasoning, and async scheduling seems to be quite complex.

Related

There had been several attempts to fix this bug before: my first attempt in #34241 would reject all speculated tokens in the step where reasoning_end was detected, which was working fine, but was suboptimal. #34978 started with a better approach that would validate all speculative tokens following reasoning_end, but contained some bugs in the end and was discontinued.

Changed files

  • tests/v1/structured_output/test_reasoning_structured_output.py (modified, +243/-39)
  • vllm/v1/core/sched/scheduler.py (modified, +39/-22)
  • vllm/v1/structured_output/__init__.py (modified, +121/-34)

PR #40783: [Bugfix] Fix Qwen3 reasoning parser: raw text tags, transition loss, end detection, token counting, withhold recovery

Description (problem / solution / changelog)

To be used with https://github.com/vllm-project/vllm/pull/40861

Purpose

This PR fixes several critical edge cases in the Qwen3 reasoning parser where reasoning text was lost or incorrectly classified.

A key issue addressed is that Qwen3, while following system prompt instructions during reasoning, often outputs tool call tags as raw text fragments (e.g., < + tool + _call + >) instead of a single special token. Since these fragments arrive as multiple regular text tokens across different streaming deltas, the previous ID-based detection would fail or corrupt the output.

Key Changes

  • Raw Text & Fragmented Tag Support: The parser now tracks the literal string <tool_call> across deltas. It correctly identifies the reasoning-to-content transition even when the tag is "built" over several steps from regular text tokens, preventing the parser from missing the start of a tool call.
  • Transition Data Integrity: Fixed a bug in vllm/parser/abstract_parser.py where the final fragment of reasoning was silently overwritten by tool call content when both occurred in the same delta.
  • Speculative Decoding Fix: Added is_reasoning_end_streaming, a delta-only variant used by parse_delta instead of is_reasoning_end. The paired-token guard in is_reasoning_end is intentionally preserved (it prevents system-prompt tool-call examples from triggering an early reasoning end); only its comment was clarified. The new streaming check inspects delta_ids without the guard, so a complete <tool_call>…</tool_call> delivered in a single delta (MTP / speculative decoding) correctly signals the end of reasoning.
  • Preserve Multiple Tool Calls: Fixed extract_content_ids to search for the first occurrence of a tool call instead of the last, ensuring no tool calls are dropped when reasoning ends implicitly.
  • Reasoning/Content Boundary: Ensured that any content emitted after an explicit </think> tag is never swallowed back into the reasoning field, even if a tool call follows shortly after.
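The fragmented-tag handling described above can be sketched with a small streaming splitter. This is not the vLLM parser: the class ReasoningStream, its methods, and the run helper are hypothetical, and the sketch only covers the literal-string tracking and the withholding of a possible tag prefix (including re-emitting it when the continuation turns out to be something else).

```python
from typing import List, Tuple

TAG = "<tool_call>"


class ReasoningStream:
    """Split streamed text into reasoning and content, tolerating a tag built across deltas."""

    def __init__(self) -> None:
        self.pending = ""        # withheld text that might be the start of TAG
        self.in_content = False

    def feed(self, delta: str) -> Tuple[str, str]:
        """Return (reasoning_text, content_text) to emit for this delta."""
        if self.in_content:
            return "", delta
        text = self.pending + delta
        idx = text.find(TAG)
        if idx != -1:
            self.pending = ""
            self.in_content = True
            return text[:idx], text[idx:]
        # Withhold the longest trailing piece that is a proper prefix of TAG.
        hold = 0
        for n in range(min(len(TAG) - 1, len(text)), 0, -1):
            if TAG.startswith(text[-n:]):
                hold = n
                break
        cut = len(text) - hold
        self.pending = text[cut:]
        return text[:cut], ""

    def flush(self) -> str:
        """At end of stream, release anything still withheld as reasoning."""
        leftover, self.pending = self.pending, ""
        return leftover


def run(deltas: List[str]) -> Tuple[str, str]:
    stream = ReasoningStream()
    reasoning, content = "", ""
    for delta in deltas:
        r, c = stream.feed(delta)
        reasoning += r
        content += c
    return reasoning + stream.flush(), content


fragmented = run(["thinking ", "<", "tool", "_call", ">", "\n<<function=get_weather>"])
false_positive = run(["use the <tool", "_belt now"])
print(fragmented)
print(false_positive)
```

In the first run the tag assembled from four deltas still triggers the transition; in the second run the withheld "<tool" is released as reasoning once "_belt" arrives.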

Additional fixes (commit d3be2250f)

  • count_reasoning_tokens override for the Qwen3.5+ template: the inherited depth-counter starts at depth=0 and is never incremented (because <think> lives in the prompt, not the model output), so it returned 0 reasoning tokens for every Qwen3.5+ generation. This silently broke usage.completion_tokens_details.reasoning_tokens reporting in API responses. The override treats the start of the output as the reasoning region and stops at the first </think> (or, for the Qwen3.5 implicit-end case, the first <tool_call>).

  • Recover withheld <tool prefix when the continuation is a false positive: the partial-overlap guard withholds the trailing bytes of current_text when they look like the start of <tool_call>. Previously, if the model emitted an unrelated continuation (e.g. <tool_use>, or just <tool_belt), those bytes were silently dropped. The parser now re-emits them as reasoning.
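The counting rule for the Qwen3.5+ template can be sketched as a scan over output token IDs. The function below is an illustration, not the override from the PR: the ID 999001 for </think> is a made-up placeholder, 248058 for <tool_call> is taken from the traces in this report, and treating an unterminated output as entirely reasoning is an assumption of the sketch.

```python
from typing import List

THINK_END_ID = 999001   # hypothetical placeholder ID for </think>
TOOL_CALL_ID = 248058   # <tool_call>, as seen in the token traces


def count_reasoning_tokens(output_ids: List[int]) -> int:
    """Count tokens from the start of the output up to the first terminator.

    The output is assumed to begin inside the reasoning region because
    <think> lives in the prompt, not in the generated tokens.
    """
    for index, token_id in enumerate(output_ids):
        if token_id in (THINK_END_ID, TOOL_CALL_ID):
            return index
    return len(output_ids)


explicit_end = count_reasoning_tokens([10, 11, 12, THINK_END_ID, 20, 21])
implicit_end = count_reasoning_tokens([10, 11, TOOL_CALL_ID, 198, 27])
print(explicit_end, implicit_end)
```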

Test Result

All new tests pass, and existing reasoning tests remain green.

Changed files

  • tests/reasoning/test_qwen3_reasoning_parser.py (modified, +557/-0)
  • vllm/parser/abstract_parser.py (modified, +25/-2)
  • vllm/reasoning/qwen3_reasoning_parser.py (modified, +120/-32)

PR #39055: Fix Qwen3 reasoning tool calls embedded inside think

Description (problem / solution / changelog)

Summary

This PR fixes a Qwen3/Qwen3.5 non-streaming compatibility issue when using:

  • --reasoning-parser qwen3
  • --tool-call-parser qwen3_coder

Qwen models can emit XML tool calls inside <think> ... </think>. The current non-streaming pipeline extracts reasoning first and only parses tool calls from content, so valid XML tool calls embedded in reasoning are lost.

This patch updates qwen3_reasoning_parser to promote valid XML tool-call blocks out of reasoning into content, allowing the existing qwen3_coder tool parser to recover them without changing the generic serving stack.
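A minimal sketch of the promotion step, assuming reasoning and content are already split into two strings. It is not the code from the PR: promote_tool_calls is a hypothetical name, and the regular expression only checks for paired <tool_call> tags rather than validating the XML inside them.

```python
import re
from typing import Tuple

TOOL_CALL_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)


def promote_tool_calls(reasoning: str, content: str) -> Tuple[str, str]:
    """Move complete tool-call blocks out of reasoning and prepend them to content."""
    blocks = TOOL_CALL_RE.findall(reasoning)
    if not blocks:
        return reasoning, content
    cleaned = TOOL_CALL_RE.sub("", reasoning).strip()
    promoted = "\n".join(blocks)
    new_content = promoted if not content else promoted + "\n" + content
    return cleaned, new_content


block = "<tool_call>\n<<function=get_weather>\n<parameter=city>\nTokyo\n</parameter>\n</function>\n</tool_call>"
reasoning_out, content_out = promote_tool_calls("I should call the tool.\n" + block + "\n", "")
print(repr(reasoning_out))
print(repr(content_out))
```

After promotion the block sits in content, where a downstream tool parser that only inspects content can find it.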

Why this scope

This PR fixes parser recovery, not model generation behavior. It does not try to prevent Qwen3.5 from emitting tool calls inside <think>; it makes vLLM robust when that output pattern appears.

Tests

Added tests cover:

  • unchanged behavior for normal reasoning extraction
  • embedded tool call promotion from reasoning to content
  • successful parsing by qwen3_coder
  • truncated reasoning recovery without </think>
  • preservation of post-</think> content

Limitation

This change fixes the non-streaming path. Streaming recovery would require additional serving-layer changes and is intentionally left out of this minimal patch.

Changed files

  • docs/design/qwen3_reasoning_tool_call_recovery.md (added, +88/-0)
  • tests/reasoning/test_qwen3_reasoning_parser.py (modified, +114/-0)
  • vllm/reasoning/qwen3_reasoning_parser.py (modified, +49/-2)

PR #40768: [Bugfix] Fix CUDA crash caused by stale async placeholder tokens in speculative decoding

Description (problem / solution / changelog)

Fixes #37159

Summary

This PR fixes a CUDA device-side assert (vectorized_gather_kernel: ind >= 0) triggered by speculative decoding with async scheduling.

The root cause is that -1 placeholder tokens could leak into input_ids, leading to invalid GPU embedding lookups.

Reproduced on Tesla V100-PCIE-16GB with zai-org/GLM-OCR + MTP speculative decoding under concurrent multimodal requests (both vLLM v0.17.1 and vLLM v0.19.1).


Background

Async speculative decoding uses -1 as a placeholder token to indicate:

"This token will be filled on GPU later."

However, under certain scheduling conditions, these placeholders were not properly overwritten, and eventually reached the embedding kernel.

Failure scenario

This occurs when:

  • A request carries async placeholder tokens (-1)
  • It was not present in the previous worker batch (e.g., newly scheduled, or re-added after preemption)

In this case, worker-side overwrite logic is skipped:

# gpu_model_runner.py: _prepare_input_ids()
for cur_index in range(num_reqs):
    prev_index = prev_positions[cur_index]
    if prev_index < 0:
        continue  # overwrite skipped, -1 survives

As a result, -1 propagates through the pipeline:

token_ids_cpu
  → input_ids.copy_to_gpu()
    → embedding lookup
      → vectorized_gather_kernel
        → CUDA assert (index -1 out of bounds)

Why V100 is particularly affected

V100 (16 GB) forces aggressive chunked prefill, which creates mixed batches (prefill + decode) more frequently than larger-VRAM GPUs. Mixed batches increase the probability that a decode request with async placeholder intent has no previous worker slot, triggering the stale -1 path.


Root Cause

The core issue is that spec_token_ids was overloaded with two different meanings:

  1. Real speculative draft tokens — valid token IDs from the proposer
  2. Async placeholder markers: -1 sentinels awaiting GPU-side overwrite

Because of this coupling:

  • Scheduler sometimes emits placeholders as if they were real tokens
  • Worker-side overwrite is not guaranteed for requests absent from the previous batch
  • -1 can reach the GPU execution path

Fix

Decouple token semantics from placeholder state.

Key point: -1 must never reach GPU unless overwrite is guaranteed.

1. Separate real tokens from placeholder state

# request.py
self.spec_token_ids: list[int] = []           # ONLY real draft tokens
self.num_pending_async_spec_placeholders = 0  # placeholder count only

2. Stop injecting -1 into spec_token_ids

Before (old code):

request.spec_token_ids = self._spec_token_placeholders  # writes [-1, -1, ...] directly

After (new code):

request.num_pending_async_spec_placeholders = self.num_spec_tokens  # just a count

spec_token_ids is no longer touched here; it stays clean for real draft tokens only.

3. Scheduler-level safe materialization

New method _consume_spec_decode_tokens_for_step() replaces the inline logic in schedule() with three clear branches:

if request.spec_token_ids:
    # Real draft tokens from proposer — use directly, no gating needed.
    spec_token_ids = request.spec_token_ids
elif (
    self.scheduler_config.async_scheduling
    and pending_placeholders > 0
    and request.request_id in self.prev_step_scheduled_req_ids
):
    # Request was in previous worker batch → GPU overwrite is guaranteed.
    spec_token_ids = [-1] * pending_placeholders
else:
    # No previous slot → drop placeholders. -1 never reaches worker.
    return None

Guarantee:

  • If GPU overwrite will NOT happen → never emit -1
  • If GPU overwrite WILL happen → -1 is safe
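The three branches can be exercised in isolation with a standalone sketch. It mirrors the excerpt above but is not the scheduler method itself: the Request dataclass is a simplified stand-in, and consume_spec_tokens takes the scheduler state as explicit arguments for the sake of a self-contained example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Request:
    """Simplified stand-in holding only the fields the gating logic reads."""
    request_id: str
    spec_token_ids: List[int] = field(default_factory=list)
    num_pending_async_spec_placeholders: int = 0


def consume_spec_tokens(
    request: Request,
    async_scheduling: bool,
    prev_step_scheduled_req_ids: Set[str],
) -> Optional[List[int]]:
    if request.spec_token_ids:
        # Real draft tokens: use directly, no gating needed.
        return request.spec_token_ids
    pending = request.num_pending_async_spec_placeholders
    if async_scheduling and pending > 0 and request.request_id in prev_step_scheduled_req_ids:
        # The request was in the previous batch, so the worker will overwrite the -1 values.
        return [-1] * pending
    # No previous slot: drop the placeholders so -1 never reaches the worker.
    return None


prev = {"a"}
real = consume_spec_tokens(Request("x", spec_token_ids=[5, 6]), True, prev)
safe = consume_spec_tokens(Request("a", num_pending_async_spec_placeholders=2), True, prev)
dropped = consume_spec_tokens(Request("b", num_pending_async_spec_placeholders=2), True, prev)
print(real, safe, dropped)
```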

4. Reset pending state on lifecycle transitions

Clear num_pending_async_spec_placeholders = 0 on:

  • _preempt_request() — request evicted from running batch
  • update_draft_token_ids() prefill chunk branch — draft tokens ignored during chunked prefill
  • update_draft_token_ids() real draft branch — real draft tokens replace any pending placeholder intent

This prevents stale placeholder state from leaking across steps.

Why this is sufficient

The scheduler and worker invariants now align:

  • Scheduler: only emits -1 if the request existed in the previous batch
  • Worker: only overwrites -1 if the request existed in the previous batch

These conditions match exactly, so -1 is always safe by construction. The worker-side code requires no changes.


Files changed

  • vllm/v1/request.py: Add num_pending_async_spec_placeholders field
  • vllm/v1/core/sched/scheduler.py: Extract _consume_spec_decode_tokens_for_step() with three-branch gating; clear pending state on preempt and draft update paths
  • vllm/v1/core/sched/async_scheduler.py: Write placeholder count to new field instead of spec_token_ids; remove _spec_token_placeholders list
  • tests/v1/core/sched/test_async_scheduler.py: Add 4 regression tests + test helper
  • tests/v1/core/sched/utils.py: Use ngram_gpu method when async scheduling + spec decode are both enabled

Test Plan

Unit tests

Four regression tests covering the full state space of the fix:

pytest tests/v1/core/sched/test_async_scheduler.py -v -k "test_consume_async_spec"
Each test and what it verifies:

  • test_consume_async_spec_placeholders_requires_prev_step_membership: Stale placeholders are dropped when request has no previous worker slot
  • test_consume_async_spec_placeholders_materializes_for_prev_step_member: Legitimate placeholders still flow when previous slot exists
  • test_consume_async_spec_prefers_real_spec_tokens_over_placeholders: Real draft token IDs are never blocked by placeholder gating
  • test_consume_async_spec_clears_pending_when_no_spec_budget: Pending state does not leak across zero-budget steps

End-to-end validation

Reproduced the original crash on Tesla V100-PCIE-16GB (vLLM v0.19.1) with zai-org/GLM-OCR + MTP speculative decoding under concurrent multimodal requests. After applying this fix, the same stress test (blast.py, 12 concurrent image OCR requests × 4 rounds) runs to completion with all requests returning 200.

<details> <summary>Reproduction commands and test script</summary>

Server launch:

CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 \
vllm serve zai-org/GLM-OCR \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --served-model-name glm-ocr \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 131072 \
  --limit-mm-per-prompt '{"image": 1}' \
  --tensor-parallel-size 1

Test (blast.py):

import asyncio, aiohttp, base64, random, os

IMAGE_DIR = "/workspace/test"
URL = "http://localhost:8000/v1/chat/completions"
CONCURRENCY = 12
ROUNDS = 4

images = [f for f in os.listdir(IMAGE_DIR)
          if f.endswith(('.jpg', '.png', '.jpeg'))]

def get_payload():
    img_path = os.path.join(IMAGE_DIR, random.choice(images))
    with open(img_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    return {
        "model": "glm-ocr",
        "messages": [{"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "text", "text": "请识别图片中的所有文字"}
        ]}],
        "max_tokens": 1024
    }

async def send(session, req_id):
    try:
        async with session.post(URL, json=get_payload(),
                timeout=aiohttp.ClientTimeout(total=3600)) as r:
            body = await r.json()
            if r.status != 200:
                print(f"[{req_id}] ERROR {r.status}: {body}")
            else:
                text = body["choices"][0]["message"]["content"]
                print(f"[{req_id}] OK: {text[:60]}...")
    except Exception as e:
        print(f"[{req_id}] EXCEPTION: {type(e).__name__}: {e}")

async def main():
    connector = aiohttp.TCPConnector(limit=CONCURRENCY)
    async with aiohttp.ClientSession(connector=connector) as session:
        for r in range(ROUNDS):
            print(f"\n=== Round {r+1}/{ROUNDS} ===")
            await asyncio.gather(
                *[send(session, f"{r}-{i}") for i in range(CONCURRENCY)])
            await asyncio.sleep(0.2)
    print("\n=== Done. Server survived all rounds. ===")

asyncio.run(main())
</details>

Test Results

Unit tests — all 13 tests in test_async_scheduler.py pass (including 4 new regression tests):

[Screenshot: unit_tests_passed (https://github.com/user-attachments/assets/32e04edc-3b65-430c-a594-552a19197317)]

End-to-end — blast.py stress test on V100, all requests return 200 OK with no crash:

[Screenshot: e2e_validation (https://github.com/user-attachments/assets/fa758ef0-aa6f-4a38-96b6-151155d760c4)]

Changed files

  • tests/v1/core/test_async_scheduler.py (modified, +105/-1)
  • tests/v1/core/utils.py (modified, +13/-3)
  • vllm/v1/core/sched/async_scheduler.py (modified, +9/-8)
  • vllm/v1/core/sched/scheduler.py (modified, +66/-11)
  • vllm/v1/request.py (modified, +9/-1)

Code Example

--speculative-config '{"method":"ngram","num_speculative_tokens":3,"prompt_lookup_max":10,"prompt_lookup_min":8}'
Environment

  • vLLM: 0.19.2rc1.dev205+g07351e088 (latest nightly tested)
  • Model: Qwen3.6-35B-A3B-FP8 (hybrid linear-attention + MoE + full-attention layers)
  • Hardware: 2× NVIDIA RTX A5000 (Ampere SM 8.6), TP=2
  • KV cache: turboquant_k8v4
  • Speculative config: {"method":"ngram","num_speculative_tokens":3,"prompt_lookup_max":4,"prompt_lookup_min":2} (defaults)
  • Reasoning parser: qwen3
  • Tool call parser: qwen3_coder
  • chat_template_kwargs.enable_thinking: false
  • --enable-chunked-prefill, --enable-prefix-caching

Reproducer

import json
import requests

PROMPT = {
    "model": "Qwen/Qwen3.6-35B-A3B-FP8",
    "messages": [{"role": "user", "content": "Get weather for Tokyo"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string"},
                },
            },
        },
    }],
    "tool_choice": "auto",
    "max_tokens": 150,
    "chat_template_kwargs": {"enable_thinking": False},
}

# Run 30 times, count clean (parsed JSON without `<<` or `parameter=` artifacts)
clean = 0
for _ in range(30):
    r = requests.post("http://localhost:8000/v1/chat/completions", json=PROMPT).json()
    tc = r["choices"][0]["message"]["tool_calls"]
    args = tc[0]["function"]["arguments"] if tc else ""
    if "<<" not in args and "parameter=" not in args and args.startswith("{"):
        clean += 1
print(f"clean: {clean}/30")

Expected with default prompt_lookup_min=2: ~15-16 clean (50-53%). Expected with prompt_lookup_min=8: 30 clean (100%).

How we found it (investigation methodology)

This bug was located after backporting four existing upstream PRs failed to fully close the corruption rate. The investigation timeline:

  1. First hypothesis: TurboQuant KV cache + cudagraph capture/replay (per noonghunna's six-probe ladder at #40807 / #40831). Disproved by Probe 6 (cudagraph_mode=NONE works) and Probe 9 (kernel determinism via bit-equality test).

  2. Second pass: backported #40768 (z1ying, async-scheduler placeholder fix). No measurable improvement — vLLM auto-disables async_scheduling for ngram method (vllm.py:815) so the buggy code path doesn't execute on our config.

  3. Third pass: backported #39055 (ZenoAFfectionate, Qwen3 reasoning embedded tool_call). +20% improvement (20% → 40%) — confirmed parser-level fixes were on the right track.

  4. Fourth pass: backported #40738 (tdoublep, GDN+ngram SSM/conv state corruption). +30-40% improvement (43-70% clean). This is the largest single fix and addresses the GDN state corruption originally reported by @bhaktatejas922 in #39273.

  5. Fifth pass: backported #36138 (sfbemerk, structured-output spec-decode reasoning-end timing) and #40783 (ExtReMLapin, Qwen3 multi-tool first-occurrence + streaming overlap guard). Combined +3-6% additional improvement (53-56% clean). Both are correct fixes for their respective bug classes; just not our dominant residual.

  6. Token-level tracing pass (where this bug was discovered): ran the reproducer with return_token_ids: true + logprobs: true, captured 30 responses, decoded raw token IDs via the Qwen3 tokenizer.

What the token traces showed

CLEAN response — parsed as {"city": "Tokyo", "unit": "Celsius"}:

[ 0] 248058 = '<tool_call>'
[ 1]    198 = '\n'
[ 2]     27 = '<'    ← FIRST `<`
[ 3]     27 = '<'    ← SECOND `<` (the qwen3_coder template uses `<<function=>`)
[ 4]   1628 = 'function'   ← CORRECT word
[ 5]  27362 = '=get'
[ 6]  67017 = '_weather'
...

BROKEN response — parsed as <tool_call>\n<<tool_code>\nprint(get_weather(city='Tokyo')):

[ 0] 248058 = '<tool_call>'
[ 1]    198 = '\n'
[ 2]     27 = '<'
[ 3]     27 = '<'    ← also two `<` (correct per template)
[ 4]  13766 = 'tool'   ← WRONG token here. Should be 'function' (1628)
[ 5]   4000 = '_code'
[ 6]     29 = '>'
[ 7]    198 = '\n'
[ 8]   1302 = 'print'   ← Python code follows because of wrong [4] token

The model picks the wrong token at exactly position [4] in the broken cases. Position [4] follows the [27, 27] << prefix, and the wrong token is one of: tool (token 13766, from the <tool_call> system-prompt definition), parameter (15704), argname, argkey, etc.
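The comparison between the two traces can be reproduced mechanically. A minimal sketch, using only the token IDs listed in the traces above; the helper name first_divergence is illustrative and not part of vLLM:

```python
from typing import List, Optional


def first_divergence(a: List[int], b: List[int]) -> Optional[int]:
    """Return the first index where two token-ID traces differ.

    Returns None when the traces agree over their shared length.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


# Token IDs copied from the CLEAN and BROKEN traces above.
CLEAN = [248058, 198, 27, 27, 1628, 27362, 67017]
BROKEN = [248058, 198, 27, 27, 13766, 4000, 29, 198, 1302]

idx = first_divergence(CLEAN, BROKEN)
print(idx, CLEAN[idx], BROKEN[idx])  # 4 1628 13766
```

Running the same check over every captured response is a quick way to confirm that all broken cases diverge at the same position.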

Root cause analysis

The chat template defines the tool-call format using XML markers that include <<function=, <parameter=, </parameter>, </function>. These tokens appear MULTIPLE TIMES in the system prompt's tool definitions section.

When ngram with prompt_lookup_min=2 is asked "what comes after [27, 27] (two <)?", the algorithm searches for any 2-token suffix match in the entire context. It finds matches in:

  1. The actual <<function=...> template definition (correct match — would draft function)
  2. The <tool_call> opening tag (matches [27, ...] partially → can draft tool)
  3. Any literal << markers in user content
  4. Multi-tool definitions (each with its own <<function=NAME> line)

The KMP-style longest-prefix-suffix matching in vllm/v1/spec_decode/ngram_proposer.py returns ONE of these matches (the longest, but ties are broken by position) and drafts the next k tokens from there.
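The effect of the minimum match length can be illustrated with a toy prompt-lookup proposer. This is a simplified sketch, not the code in ngram_proposer.py: it uses a naive scan instead of KMP, it assumes the earliest match wins on ties, and the toy context layout (including the filler IDs 101, 102, 103) is hypothetical. Only the token IDs for `<`, `function`, `tool` and so on are taken from the traces above.

```python
from typing import List


def propose_ngram_draft(context: List[int], min_n: int, max_n: int, k: int) -> List[int]:
    """Toy prompt lookup: find the longest suffix n-gram (max_n down to min_n)
    that also occurs earlier in the context, and draft the k tokens after it."""
    for n in range(min(max_n, len(context) - 1), min_n - 1, -1):
        suffix = context[-n:]
        # Only earlier positions are scanned; the suffix itself is excluded.
        for start in range(len(context) - n):
            if context[start:start + n] == suffix:
                return context[start + n:start + n + k]
    return []  # no match of at least min_n tokens: draft nothing


LT, GT, NL = 27, 29, 198
TOOL, CODE, FUNCTION = 13766, 4000, 1628
GET, WEATHER, TOOL_CALL = 27362, 67017, 248058

# Hypothetical prompt: a "<<tool_code>" fragment appears before "<<function=get_weather>".
prompt = [101, LT, LT, TOOL, CODE, GT, NL, 102,
          LT, LT, FUNCTION, GET, WEATHER, GT, NL, 103]
generated = [TOOL_CALL, NL, LT, LT]  # the model has just emitted "<tool_call>\n<<"
context = prompt + generated

print(propose_ngram_draft(context, min_n=2, max_n=10, k=3))  # [13766, 4000, 29]
print(propose_ngram_draft(context, min_n=8, max_n=10, k=3))  # []
```

With min_n=2 the two-token suffix `[27, 27]` is enough to match the spurious fragment, so the draft starts with `tool`. With min_n=8 nothing in this toy prompt matches, so no draft is proposed and the target model decodes position [4] on its own.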

The rejection sampler then verifies each drafted token against the target model's logits. For the function vs tool decision at position [4], the target model has function at probability ~0.7 and tool at probability ~0.2 (both are valid template-context tokens). The rejection sampler accepts a draft token when target_prob / draft_prob >= random(). Since ngram doesn't provide draft probabilities, the comparison effectively reduces to "is target_prob above some threshold". For per-token accept rate ≈ 0.8, the structural ceiling for clean rate becomes:

clean_rate ≈ (per_token_accept_rate)^num_speculative_tokens
0.8^3 ≈ 0.51 ← matches our 53% empirical measurement with default config
0.8^1 = 0.80 ← matches our 65% with num_spec=1 (statistical variance)
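The arithmetic above can be checked directly. The sketch below computes the ceiling formula and adds a rough binomial check on the num_spec=1 measurement (13 clean out of 20); it assumes independent trials with a true clean rate of 0.8, which is a simplification introduced here for illustration.

```python
from math import comb


def clean_rate_ceiling(per_token_accept: float, num_spec_tokens: int) -> float:
    """Structural ceiling from the text: accept_rate ** num_speculative_tokens."""
    return per_token_accept ** num_spec_tokens


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


print(round(clean_rate_ceiling(0.8, 3), 3))  # 0.512
print(round(clean_rate_ceiling(0.8, 1), 3))  # 0.8

# Probability of seeing 13 or fewer clean runs out of 20 if the true rate were 0.8.
print(round(binom_cdf(13, 20, 0.8), 3))
```

Under those assumptions the tail probability comes out a little under 10%, so the 65% figure is compatible with sampling noise at n=20 but sits on the low side. A larger n would be needed to tell the model's prediction apart from a lower true rate.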

Why prompt_lookup_min=8 fixes it

Setting prompt_lookup_min=8 requires ngram to find an 8-token suffix match before drafting. For tool-call output, the 8-token suffix [27, 27, 1628, 27362, 67017, 29, 198, 27] (<<function=get_weather>\n<) is unique enough that it almost never matches spurious system-prompt fragments — it only matches the FIRST tool definition (where the model is currently reproducing the template) OR doesn't match at all.

Result: ngram drafts almost nothing on tool-call requests, producing close to no-spec correctness. Natural-language repetitions (where 8+ token matches do occur, e.g. repeated phrases in long text) still get speculative speedup.

Empirical proof (clean-rate progression)

All measurements on Qwen3.6-35B-A3B-FP8, 2× A5000, vLLM dev205+g07351e088, with all four upstream backports (#40738 Phase 1+2, #36138, #40783, #39055) applied. n=20-30 reproducer runs per row.

| Config | Clean rate |
| --- | --- |
| prompt_lookup_min=2 (default) | 53% (16/30) |
| prompt_lookup_min=2, num_speculative_tokens=1 | 65% (13/20) |
| prompt_lookup_min=8 (single-query reproducer) | 100% (30/30) |
| prompt_lookup_min=8 (multi-query reproducer, 5 different queries × 5 trials) | 96% (24/25) |
| no spec-decode (reference) | 100% |

Latency for 300-token responses: ~1.0s with prompt_lookup_min=8 vs ~0.9s with default. Tradeoff acceptable for tool-call workloads.

Working production configuration

--speculative-config '{"method":"ngram","num_speculative_tokens":3,"prompt_lookup_max":10,"prompt_lookup_min":8}'
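When the flag is generated by a launcher script, building the JSON programmatically avoids quoting mistakes. A minimal sketch using only the Python standard library; the keys and values are the ones from the command line above, and the sanity check is an addition for illustration:

```python
import json
import shlex

spec_config = {
    "method": "ngram",
    "num_speculative_tokens": 3,
    "prompt_lookup_max": 10,
    "prompt_lookup_min": 8,
}

# Sanity check added for this sketch: the minimum n-gram length cannot exceed the maximum.
assert spec_config["prompt_lookup_min"] <= spec_config["prompt_lookup_max"]

# Compact separators reproduce the exact string used in the command line above.
flag_value = json.dumps(spec_config, separators=(",", ":"))
print("--speculative-config " + shlex.quote(flag_value))
```

shlex.quote wraps the JSON in single quotes so the shell passes it to vLLM as a single argument.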

Proposed upstream changes

Option A — change prompt_lookup_min default

Change the default value of SpeculativeConfig.prompt_lookup_min from 2 (current) to 4 or higher. Document that low values trade correctness for speedup on workloads where the prompt contains repetitive XML/JSON template fragments (tool definitions, structured output schemas, few-shot examples).

Option B — add per-request opt-out

Add a speculative_config_override field to chat completion requests so callers can disable spec decoding for individual requests where correctness is paramount. For example, tool-call requests could set enable_speculative=false (a proposed flag, not an existing one) while general text requests keep speculative decoding enabled.

Option C — context-aware ngram

Modify _find_longest_matched_ngram_and_propose_tokens to skip matches that would land inside a structured-output template fragment (heuristic: prompt section identified by chat template markers). This is more invasive but addresses the root cause directly.
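One possible shape for Option C, as a toy sketch only: the proposer receives a list of blocked index ranges and skips any match whose continuation would start inside one. The function name, the blocked parameter and the toy context are hypothetical; how vLLM would actually identify template fragments is left open, and the real _find_longest_matched_ngram_and_propose_tokens is not reproduced here.

```python
from typing import List, Sequence


def propose_skipping_spans(context: List[int], min_n: int, max_n: int, k: int,
                           blocked: Sequence[range]) -> List[int]:
    """Toy prompt lookup that ignores matches whose continuation starts
    inside a blocked span (e.g. a tool-definition region of the prompt)."""
    for n in range(min(max_n, len(context) - 1), min_n - 1, -1):
        suffix = context[-n:]
        for start in range(len(context) - n):
            if context[start:start + n] != suffix:
                continue
            cont = start + n  # index of the first token that would be drafted
            if any(cont in span for span in blocked):
                continue  # match lands inside a template fragment: skip it
            return context[cont:cont + k]
    return []


# Hypothetical context: indices 1..14 hold template fragments, then "<tool_call>\n<<".
context = [101, 27, 27, 13766, 4000, 29, 198, 102,
           27, 27, 1628, 27362, 67017, 29, 198, 103,
           248058, 198, 27, 27]

print(propose_skipping_spans(context, 2, 10, 3, blocked=[]))              # [13766, 4000, 29]
print(propose_skipping_spans(context, 2, 10, 3, blocked=[range(1, 15)]))  # []
```

The design keeps prompt_lookup_min at its default and pushes the safety decision into span detection, which is where the invasive part of Option C would lie.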

Option D — documentation

At minimum, document in docs/features/speculative_decoding.md that prompt_lookup_min=2 is unsuitable for tool-calling workloads and recommend prompt_lookup_min=4-8 for those cases.

What we still need

  1. Confirmation on other Qwen3 variants — we tested only Qwen3.6-35B-A3B-FP8. The same template format (<<function=) is used by Qwen3-Coder family, so the bug should reproduce there too. Independent confirmation welcomed.

  2. Confirmation on other tool-call parsers — we tested only qwen3_coder. Hermes / Llama tool-call formats may have different vulnerable positions. Worth testing.

  3. Sweet-spot exploration — we tested prompt_lookup_min ∈ {2, 8}. Values 4, 5, 6 may give better speed/correctness tradeoff. We picked 8 as obviously safe; 4-6 might be sufficient with similar speedup.

  4. Reasoning-on test — we tested with enable_thinking=false. With enable_thinking=true the model emits <think>...</think> blocks where ngram has even MORE template fragments to match against. Likely worse without strict prompt_lookup_min.

  5. Default-change PR reception — open question whether the maintainers prefer Option A (change default), B (per-request opt-out), C (context-aware ngram), or D (documentation only). Happy to write the actual PR for whichever option is preferred.

References + acknowledgements

This bug class was located only after backporting four existing fixes that addressed adjacent bug classes. Credit to the authors of those PRs whose work narrowed our search:

  • @tdoublep + @bhaktatejas922 — vllm#40738 / vllm#39273 (GDN+ngram SSM/conv state corruption — gave +30-40% improvement)
  • @sfbemerk + @cicirori — vllm#36138 / vllm#34650 (structured-output spec-decode reasoning-end timing)
  • @ExtReMLapin — vllm#40783 (Qwen3 reasoning parser fragmented tags + multi-tool)
  • @ZenoAFfectionate — vllm#39055 (Qwen3 reasoning embedded tool_call XML extraction)
  • @z1ying + @sweihub — vllm#40768 / vllm#37159 (async-scheduler placeholder leakage)
  • @noonghunna — vllm#40807 / vllm#40831 (six-probe ladder isolation methodology that made cross-rig data sharing possible)

Full integrated patch tree + investigation reports: Sandermage/genesis-vllm-patches @ 852b73f. Specific files relevant to this bug report:

extent analysis

TL;DR

Setting prompt_lookup_min to 8 in the speculative config fixes the tool-call output corruption issue by ensuring a unique 8-token suffix match, reducing incorrect token drafts.

Guidance

  • Identify if your workload involves tool-calling with repetitive XML/JSON template fragments, as these are prone to corruption with low prompt_lookup_min values.
  • Consider increasing prompt_lookup_min to 4 or higher for a better speed/correctness tradeoff, especially if you're using Qwen3 variants or similar tool-call parsers.
  • Test the prompt_lookup_min=8 fix in your specific environment to confirm its effectiveness, as the issue may manifest differently with other models or configurations.
  • Explore values between 4 and 8 for prompt_lookup_min to find the optimal balance between speed and correctness for your use case.

Example

To apply the fix, update your speculative config to include "prompt_lookup_min": 8, like so:

--speculative-config '{"method":"ngram","num_speculative_tokens":3,"prompt_lookup_max":10,"prompt_lookup_min":8}'

Notes

The provided fix assumes that the corruption issue is primarily caused by the ngram algorithm's tendency to draft incorrect tokens due to repetitive template fragments. However, other factors, such as model variants or specific tool-call parsers, might influence the effectiveness of this fix. Further testing and exploration of prompt_lookup_min values may be necessary to achieve optimal results.

Recommendation

Apply the workaround by setting prompt_lookup_min=8 in your speculative config, as it has been shown to achieve a 100% clean tool-call rate in the provided test cases, with an acceptable latency tradeoff.
