vllm - 💡(How to fix) Fix [Bug]: Responses API + `text.format` json_schema is not grammar-constrained when a reasoning parser is enabled — output escapes into the unconstrained `reasoning` channel

vllm · 2026-05-29 18:02:52

On a server started with --reasoning-parser, structured output via the Responses API (/v1/responses, text.format.type="json_schema") is not enforced by the structured-outputs backend. The model's text is emitted into the unconstrained reasoning item instead of a grammar-constrained message, so xgrammar never binds. With Qwen it intermittently degenerates into bracket-spam (}]}]}]}...) and returns invalid JSON in roughly half of runs.

The Chat Completions API (/v1/chat/completions, response_format.type="json_schema") on the same server, with the same model, the same schema, and thinking disabled, keeps the output in the grammar-constrained content channel and is valid 100% of the time.

This looks like the same failure family as #34650 (</think> detection failure → json_schema not enforced after the thinking phase), but here it is reproducible on the plain Responses path — no speculative decoding required — and is triggered by the reasoning parser being active at all (especially with a custom reasoning_end_str).

Environment

vLLM v0.21.0 (vllm/vllm-openai:v0.21.0)
Model Qwen/Qwen3.6-35B-A3B-FP8 (MoE, FP8), 1× A100 80 GB
Relevant serve args:

--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--structured-outputs-config '{"backend": "xgrammar", "disable_any_whitespace": true}'
--reasoning-parser qwen3
--reasoning-config '{"reasoning_start_str": "<think>", "reasoning_end_str": "I have to give the solution based on the reasoning directly now.</think>"}'
--enable-prefix-caching
--enable-chunked-prefill
--max-num-batched-tokens 32768 --max-num-seqs 32
--override-generation-config '{"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5}'

Note the custom reasoning_end_str — not a bare </think>.

Reproduction

Self-contained script (httpx only) attached as repro.py. It fires N requests on each API path with thinking disabled and reports, per request, the classification (valid / invalid-json / empty / schema-mismatch) and the channel the text landed in (message/content = grammar-constrained, reasoning = unconstrained).

export VLLM_BASE_URL=https://your-vllm/v1
export VLLM_API_KEY=...
export VLLM_MODEL=Qwen/Qwen3.6-35B-A3B-FP8
python repro.py 20

Thinking is disabled on both paths:

Responses: reasoning omitted (on stock vLLM use the equivalent effort-disabling form for your build).
Chat: chat_template_kwargs={"enable_thinking": false}.

Observed (vLLM v0.21.0, 20 runs per path, thinking disabled on both)

== responses: {'valid': 20} channels={'reasoning': 20}
== chat:      {'valid': 20} channels={'content': 20}

/v1/responses + text.format json_schema: the answer is emitted only as a reasoning item — there is no message item and output_text is null. The structured-outputs grammar is therefore never applied to the generation. This is 100% reproducible (20/20). Minimal raw output:
```
{"output": [{"type": "reasoning", "...": "the JSON answer lives here"}],
 "output_text": null}
```
Because generation is unconstrained, real (longer, messier) extraction transcripts intermittently run away into bracket-spam (}]}]}...) and return invalid JSON in roughly half of production calls. On the clean synthetic prompt in repro.py the unconstrained output happened to parse 20/20, so the deterministic defect is the missing grammar binding (the answer lands in the wrong channel); the invalid JSON is an intermittent downstream symptom.
/v1/chat/completions + response_format json_schema: answer in the grammar-constrained content channel, 20/20 valid. No runaway in production.
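Note that "valid" in the counts above only means the text parsed as JSON and contained an offerings key, which is what classify() in repro.py checks. A stricter check against the repro schema could look like the sketch below. The name strict_classify is hypothetical and not part of the attached script, and the checks are hand-written for this one schema rather than a general JSON Schema validator.

```python
import json

MAX_ITEMS = 8  # mirrors "maxItems": 8 in the repro schema
REQUIRED_KEYS = {"name", "category", "description"}


def strict_classify(text: str) -> str:
    """Return valid | empty | invalid-json | schema-mismatch for the repro schema."""
    text = (text or "").strip()
    if not text:
        return "empty"
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return "invalid-json"
    # Top level: an object with exactly one key, "offerings".
    if not isinstance(obj, dict) or set(obj) != {"offerings"}:
        return "schema-mismatch"
    items = obj["offerings"]
    if not isinstance(items, list) or len(items) > MAX_ITEMS:
        return "schema-mismatch"
    for item in items:
        # Each item: exactly the three required fields, all strings.
        if not isinstance(item, dict) or set(item) != REQUIRED_KEYS:
            return "schema-mismatch"
        if not all(isinstance(value, str) for value in item.values()):
            return "schema-mismatch"
    return "valid"
```

Swapping this in for classify() would make unconstrained output that parses but violates the schema (extra keys, missing fields, more than 8 items) show up as schema-mismatch instead of valid.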

Expected

text.format.type="json_schema" on the Responses API should bind the structured-outputs grammar to the emitted answer regardless of whether a reasoning parser is configured (and regardless of whether thinking is on/off), exactly as response_format does on Chat Completions.
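The expectation can be turned into a rough client-side check: the answer should appear in a message item (or in output_text), not only in a reasoning item. The sketch below is a proxy under stated assumptions, since finding the text in the message channel does not by itself prove that the grammar was applied. The helper names are hypothetical, the observed payload is the minimal raw output from this report with its elided key kept as-is, and the expected payload shape is illustrative only.

```python
def message_text(data: dict) -> str:
    """Collect answer text from output_text or from message items."""
    parts = []
    for item in data.get("output", []):
        if item.get("type") != "message":
            continue
        for part in item.get("content") or []:
            if isinstance(part, dict):
                parts.append(part.get("text") or "")
    return data.get("output_text") or "".join(parts)


def answer_in_message_channel(data: dict) -> bool:
    """True if any non-empty answer text landed outside the reasoning item."""
    return bool(message_text(data).strip())


# Minimal raw output reported for /v1/responses (elided key left untouched).
observed = {
    "output": [{"type": "reasoning", "...": "the JSON answer lives here"}],
    "output_text": None,
}

# Hypothetical shape of a response that meets the expectation.
expected = {
    "output": [
        {"type": "message", "content": [{"text": '{"offerings": []}'}]},
    ],
    "output_text": '{"offerings": []}',
}

print(answer_in_message_channel(observed))  # False
print(answer_in_message_channel(expected))  # True
```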

Hypothesis / pointer

With a reasoning parser active, the json_schema grammar is only applied after the parser flips reasoning_ended. On the Responses path that flag is not flipped before the answer is generated (the answer is itself classified as reasoning content), so the grammar never engages. A custom reasoning_end_str makes this worse: the end marker is essentially never emitted, so the parser treats the entire generation as reasoning. Same root shape as #34650, but on the non-speculative Responses path.
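The hypothesis can be illustrated with a toy model. This is not vLLM's parser code: split_channels is a hypothetical function that treats everything before the configured end marker as reasoning and everything after it as grammar-constrained content, and the generated strings are made up for illustration.

```python
def split_channels(generated: str, reasoning_end_str: str) -> tuple[str, str]:
    """Toy split of generated text into (reasoning, content).

    Text before the end marker counts as reasoning, text after it as content.
    If the marker never appears, the whole generation counts as reasoning,
    so in this model no text is ever grammar-constrained.
    """
    before, marker, after = generated.partition(reasoning_end_str)
    if not marker:
        return generated, ""
    return before, after


CUSTOM_END = "I have to give the solution based on the reasoning directly now.</think>"
answer = '{"offerings": []}'

# Bare marker configured and emitted: the answer reaches the content channel.
print(split_channels("thinking...</think>" + answer, "</think>"))

# Custom marker configured, but only a bare </think> is emitted: no match,
# so the answer stays in the reasoning channel.
print(split_channels("thinking...</think>" + answer, CUSTOM_END))

# Thinking disabled, no marker emitted at all: same outcome.
print(split_channels(answer, CUSTOM_END))
```

In the second and third cases the content half is empty, which matches the observed responses where the JSON appears only in the reasoning item.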

Workaround

Use Chat Completions with response_format json_schema and chat_template_kwargs={"enable_thinking": false}. This keeps output in the grammar-constrained channel and fully fixes the runaway for us.
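For callers that already build Responses-style request bodies, the workaround amounts to a field mapping. The sketch below is a hypothetical helper (responses_to_chat_body is not part of vLLM or repro.py) that handles only the fields used in repro.py and assumes input is a list of role/content messages and text.format is a json_schema format.

```python
import copy


def responses_to_chat_body(body: dict) -> dict:
    """Map a /v1/responses request body to a /v1/chat/completions body.

    Sketch only: handles model, input, text.format and max_output_tokens,
    as used in repro.py. Other Responses fields are ignored.
    """
    fmt = body["text"]["format"]
    chat = {
        "model": body["model"],
        "messages": copy.deepcopy(body["input"]),
        "response_format": {
            "type": "json_schema",
            # Copy name/schema/strict only when present in the source format.
            "json_schema": {
                key: fmt[key] for key in ("name", "schema", "strict") if key in fmt
            },
        },
        # The chat path needs thinking disabled explicitly.
        "chat_template_kwargs": {"enable_thinking": False},
    }
    if "max_output_tokens" in body:
        chat["max_tokens"] = body["max_output_tokens"]
    return chat
```

The result can be posted to /v1/chat/completions in the same way run_chat() does in repro.py.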

Related

#34650: MTP speculative decoding causes </think> detection failure in structured output + reasoning mode (json_schema not enforced after thinking).
#38245: Responses API text.format.type="json_schema" serialization/streaming bug (separate issue, but same surface).
#15670: poor quality with reasoning models + structured output (discussion).

<details> <summary><code>repro.py</code> (httpx only)</summary>

#!/usr/bin/env python3
"""Minimal repro: Responses API + json_schema structured output is NOT grammar-
constrained on a server with --reasoning-parser, while Chat Completions IS.

Same model, same schema, same prompt, thinking disabled on both paths.
- /v1/responses + text.format=json_schema   -> output lands in an unconstrained
  `reasoning` item; xgrammar never binds; Qwen intermittently runs away
  (bracket-spam `}]}]}...` -> invalid JSON).
- /v1/chat/completions + response_format=json_schema -> output stays in the
  grammar-constrained `content` channel; always valid.

Run:
    export VLLM_BASE_URL=https://your-vllm/v1
    export VLLM_API_KEY=...           # if your server requires it
    export VLLM_MODEL=Qwen/Qwen3.6-35B-A3B-FP8
    python repro.py [N]               # N iterations per path, default 20

Only dependency: httpx.
"""

import json
import os
import sys

import httpx

BASE_URL = os.environ["VLLM_BASE_URL"].rstrip("/")
API_KEY = os.environ.get("VLLM_API_KEY", "")
MODEL = os.environ.get("VLLM_MODEL", "Qwen/Qwen3.6-35B-A3B-FP8")
N = int(sys.argv[1]) if len(sys.argv) > 1 else 20

HEADERS = {"Content-Type": "application/json"}
if API_KEY:
    HEADERS["Authorization"] = f"Bearer {API_KEY}"

# A faithful-but-small extraction schema (list of objects), like the real
# workload that triggered the runaway. Bounded list, free-text strings.
SCHEMA = {
    "type": "object",
    "properties": {
        "offerings": {
            "type": "array",
            "maxItems": 8,
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "category": {"type": "string"},
                    "description": {"type": "string"},
                },
                "required": ["name", "category", "description"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["offerings"],
    "additionalProperties": False,
}

SYSTEM = (
    "You extract a company's product/service offerings from the text and return "
    "them as structured data per the provided JSON schema. Be exhaustive."
)
# Long-ish, real-world-shaped input. The runaway is far more likely on a faithful
# prompt than on a toy one ("answer/score") — keep this representative.
USER = (
    "Company profile:\n"
    "Acme Logistics GmbH is a mid-size 3PL provider. We operate temperature-"
    "controlled warehousing across three sites, run a national same-day courier "
    "fleet, and offer customs brokerage for EU/non-EU shipments. Our digital arm "
    "ships a SaaS track-and-trace portal with a REST API, plus a managed EDI "
    "onboarding service for retail partners. We also do reverse-logistics / "
    "returns handling and ad-hoc project freight (oversized, hazmat). Recently we "
    "added a carbon-reporting dashboard for shippers.\n\n"
    "Extract the structured data per the schema."
)

TIMEOUT = httpx.Timeout(120.0)


def classify(text: str) -> str:
    """valid | empty | invalid-json | schema-mismatch"""
    text = (text or "").strip()
    if not text:
        return "empty"
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return "invalid-json"
    if not isinstance(obj, dict) or "offerings" not in obj:
        return "schema-mismatch"
    return "valid"


def run_responses(client: httpx.Client) -> tuple[str, str]:
    """Returns (classification, channel). channel = where the JSON text landed:
    'message' (grammar-constrained) vs 'reasoning' (unconstrained)."""
    body = {
        "model": MODEL,
        "input": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": USER},
        ],
        # reasoning omitted on purpose: our vLLM thinking-patch maps that to
        # enable_thinking=False. On stock vLLM, add the equivalent below.
        "text": {
            "format": {
                "type": "json_schema",
                "name": "offerings",
                "schema": SCHEMA,
                "strict": True,
            }
        },
        "max_output_tokens": 4096,
    }
    r = client.post(f"{BASE_URL}/responses", headers=HEADERS, json=body)
    r.raise_for_status()
    data = r.json()

    # Find where the text actually came out.
    msg_text, reasoning_text = "", ""
    for item in data.get("output", []):
        itype = item.get("type")
        parts = item.get("content") or []
        text = "".join(
            p.get("text", "") for p in parts if isinstance(p, dict)
        )
        if itype == "message":
            msg_text += text
        elif itype == "reasoning":
            # Reasoning text may sit in content parts or in a top-level "text"
            # field; tolerate a missing or null "text".
            reasoning_text += text + (item.get("text") or "")
    # vLLM also exposes top-level output_text aggregating message items.
    output_text = data.get("output_text") or msg_text
    if output_text.strip():
        return classify(output_text), "message"
    if reasoning_text.strip():
        return classify(reasoning_text), "reasoning"
    return "empty", "none"


def run_chat(client: httpx.Client) -> tuple[str, str]:
    body = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": USER},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "offerings", "schema": SCHEMA, "strict": True},
        },
        "max_tokens": 4096,
        # thinking patch does NOT hook the chat path -> disable explicitly.
        "chat_template_kwargs": {"enable_thinking": False},
    }
    r = client.post(f"{BASE_URL}/chat/completions", headers=HEADERS, json=body)
    r.raise_for_status()
    data = r.json()
    content = data["choices"][0]["message"].get("content") or ""
    return classify(content), "content"


def main() -> None:
    print(f"model={MODEL} base={BASE_URL} N={N}\n")
    with httpx.Client(timeout=TIMEOUT) as client:
        for label, fn in (("responses", run_responses), ("chat", run_chat)):
            counts: dict[str, int] = {}
            channels: dict[str, int] = {}
            for i in range(N):
                try:
                    cls, ch = fn(client)
                except Exception as e:  # noqa: BLE001 - repro, surface everything
                    cls, ch = f"http-error:{type(e).__name__}", "none"
                counts[cls] = counts.get(cls, 0) + 1
                channels[ch] = channels.get(ch, 0) + 1
                print(f"  {label:9} {i + 1:>3}: {cls:16} channel={ch}")
            print(f"== {label}: {dict(counts)} channels={dict(channels)}\n")


if __name__ == "__main__":
    main()

</details>
