vllm [Bug]: V1 structured outputs: a malformed grammar request after a valid one crashes EngineCore

Error Message

HTTP 500: {"error":{"message":"EngineCore encountered an issue. See stack trace (above) for the root cause.",...}}

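Root Cause

The V1 StructuredOutputManager latches its backend on the first grammar request. Once XgrammarBackend is cached, XgrammarBackend.compile_grammar runs with no exception handling, so a later grammar that xgrammar rejects raises RuntimeError out of the background grammar-compile future and kills EngineCore.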

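Fix / Workaround

The report lists no merged fix. The reporter suggests either wrapping XgrammarBackend.compile_grammar in try/except so that a rejected grammar fails only that request, or dropping the single-backend-per-engine constraint. Open PRs #37642 and #30346 target this class of bug.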


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
PyTorch version: 2.4+ (any)
Is debug build: False

OS: Linux (any)
Python version: 
.12

vLLM Version: confirmed on v0.20.0, v0.20.1, v0.21.0
xgrammar version: 0.2.0
Structured outputs backend (CLI): default `auto`
Engine: V1 (default)
</details>

🐛 Describe the bug

The V1 StructuredOutputManager initializes its backend (XgrammarBackend / GuidanceBackend / etc.) exactly once per engine process, on the first grammar request that reaches the scheduler, based on that request's _backend field. See the explicit comment in vllm/v1/structured_output/__init__.py:

# NOTE: We only support a single backend. We do NOT support different
# backends on a per-request basis in V1 (for now, anyway...).
# _backend is set in Processor._validate_structured_output
if self.backend is None:
    backend = request.sampling_params.structured_outputs._backend
    if backend == "xgrammar":
        self.backend = XgrammarBackend(...)
    elif backend == "guidance":
        self.backend = GuidanceBackend(...)
    ...

The per-request auto-fallback in SamplingParams._validate_structured_outputs runs only on submission of that request. If the first grammar request has a grammar xgrammar accepts, the fallback chooses xgrammar and XgrammarBackend is cached for the lifetime of the engine. All subsequent grammar requests then go through XgrammarBackend.compile_grammar regardless of what their own auto-fallback would have chosen.

Combined with the fact that XgrammarBackend.compile_grammar (vllm/v1/structured_output/backend_xgrammar.py:89) has no exception handling around self.compiler.compile_grammar(...), this means: once xgrammar is latched, any later request whose grammar xgrammar rejects at compile time will raise out of the background grammar-compile future, propagate through _try_promote_blocked_waiting_request → EngineCore.run_busy_loop → EngineDeadError, and kill the engine process. All TP workers shut down; the pod restarts.

The TODO comment in _create_grammar is aware of this gap:

"we still need to handle xgrammar compilation failures, though it should be unlikely as we test that up front as well."

The "tested up front" claim holds only on the first request through a fresh engine; once the backend is latched, subsequent requests bypass that protection.

Reproducer

Server (any V1 + chat model works; small public one used here so the full repro is ~2 minutes including model download and engine init):

vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 --enforce-eager
# default --structured-outputs-config backend=auto

Step 1 — valid grammar (latches XgrammarBackend into the engine):

curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "x"}],
    "max_tokens": 4,
    "stream": false,
    "structured_outputs": {"grammar": "root ::= \"a\" | \"b\""}
  }'
# expected: HTTP 200

Step 2 — malformed EBNF (xgrammar rejects at compile time, kills the engine):

curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "x"}],
    "max_tokens": 4,
    "stream": false,
    "structured_outputs": {"grammar": "root ::= \"a\"\nroot ::= \"b\""}
  }'

Observed:

HTTP 500: {"error":{"message":"EngineCore encountered an issue. See stack trace (above) for the root cause.",...}}

Server log:

(EngineCore pid=...) ERROR [core.py] EngineCore encountered a fatal error.
  ...
  File ".../vllm/v1/structured_output/__init__.py", line 183, in _create_grammar
    return self.backend.compile_grammar(request_type, grammar_spec)
  File ".../vllm/v1/structured_output/backend_xgrammar.py", line 89, in compile_grammar
    ctx = self.compiler.compile_grammar(grammar_spec)
  File ".../xgrammar/compiler.py", line 320, in compile_grammar
    self._handle.compile_grammar_from_strings(grammar, root_rule_name)
RuntimeError: [...] /project/cpp/grammar_parser.cc:564: EBNF parser error at line 2, column 1: Rule "root" is defined multiple times
EngineDeadError: EngineCore encountered an issue.

Any EBNF that xgrammar rejects works; duplicate-root is the shortest. Pod readiness drops and the engine process exits.
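For scripted runs, the two curl steps can be sketched in Python with only the standard library. The URL, model name, and payload fields are copied from the reproducer above; the helper names (`build_payload`, `post_chat`, `reproduce`) are made up for this sketch, and nothing is sent until `reproduce()` is called.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # matches --port 8000 in the serve command
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"

VALID_GRAMMAR = 'root ::= "a" | "b"'
MALFORMED_GRAMMAR = 'root ::= "a"\nroot ::= "b"'  # duplicate "root" rule


def build_payload(grammar: str) -> dict:
    """Build the same request body as the curl commands in the reproducer."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": "x"}],
        "max_tokens": 4,
        "stream": False,
        "structured_outputs": {"grammar": grammar},
    }


def post_chat(grammar: str, timeout: float = 60.0) -> int:
    """Send one chat completion request and return the HTTP status code."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(grammar)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as HTTPError; the status code is enough here.
        return err.code


def reproduce() -> tuple:
    """Step 1 (valid grammar) followed by Step 2 (malformed grammar)."""
    return post_chat(VALID_GRAMMAR), post_chat(MALFORMED_GRAMMAR)
```

Against a freshly started server, the behavior described in this report corresponds to `reproduce()` returning 200 for the first request and 500 for the second. The order of the two calls is the important part.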

Ordering matters

If both requests in the reproducer are the malformed one (i.e. you skip Step 1), the bug does not fire — the first request's auto-fallback raises ValueError in xgrammar's pre-compile check, the fallback latches _backend="guidance", GuidanceBackend is cached for the engine lifetime, and the malformed EBNF compiles leniently afterwards. The bug requires (1) a first grammar that xgrammar accepts, then (2) a later grammar that xgrammar rejects.
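The ordering dependence can be illustrated with a toy model of the latch. This is a simplified sketch, not vLLM code: `ToyXgrammar`, `ToyGuidance`, `pick_backend`, and `ToyManager` are hypothetical stand-ins, and the duplicate-rule check only imitates the one rejection shown in the trace.

```python
class ToyXgrammar:
    """Strict stand-in: rejects a grammar that defines a rule twice."""
    name = "xgrammar"

    @staticmethod
    def compile(grammar: str) -> str:
        seen = set()
        for line in grammar.splitlines():
            rule = line.split("::=")[0].strip()
            if rule in seen:
                raise RuntimeError(f'Rule "{rule}" is defined multiple times')
            seen.add(rule)
        return "xgrammar:ok"


class ToyGuidance:
    """Lenient stand-in: accepts any grammar."""
    name = "guidance"

    @staticmethod
    def compile(grammar: str) -> str:
        return "guidance:ok"


def pick_backend(grammar: str):
    """Per-request auto-fallback: prefer the strict backend if it accepts."""
    try:
        ToyXgrammar.compile(grammar)
        return ToyXgrammar
    except RuntimeError:
        return ToyGuidance


class ToyManager:
    def __init__(self):
        self.backend = None  # latched by the first grammar request

    def handle(self, grammar: str) -> str:
        requested = pick_backend(grammar)
        if self.backend is None:
            self.backend = requested
        # The request's own choice is ignored once a backend is latched.
        return self.backend.compile(grammar)


def run(grammars):
    """Feed grammars to a fresh manager; stop at the first uncaught failure."""
    manager = ToyManager()
    outcomes = []
    for grammar in grammars:
        try:
            outcomes.append(manager.handle(grammar))
        except RuntimeError as exc:
            outcomes.append(f"engine dead: {exc}")
            break
    return manager.backend.name, outcomes


VALID = 'root ::= "a" | "b"'
MALFORMED = 'root ::= "a"\nroot ::= "b"'

crash_order = run([VALID, MALFORMED])
safe_order = run([MALFORMED, MALFORMED])
```

In this model, `crash_order` latches the strict backend and ends with an "engine dead" entry, while `safe_order` latches the lenient backend and both requests compile.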

This ordering subtlety is likely why this failure mode has been intermittent in production-traffic-only reports and difficult to reproduce locally with a single curl.

Both the deprecated guided_grammar top-level field (where still accepted) and the current structured_outputs.grammar field converge on the same engine path, so the surface used to submit the grammar doesn't matter.

Affected versions

Reproduced cleanly on fresh single-instance stock vllm/vllm-openai builds:

  • v0.20.0
  • v0.20.1
  • v0.21.0

The relevant source lines in vllm/v1/structured_output/__init__.py (the backend lock-in) and vllm/v1/structured_output/backend_xgrammar.py:89 (the unguarded compile_grammar call) are identical across these versions; xgrammar is 0.2.0 in all three.

Related work

Existing open PRs that target this exact class of bug:

  • #37642: [Bugfix] Fix engine crash when structured output grammar compilation fails with auto backend. Catches the exception in _try_promote_blocked_waiting_request and introduces a FinishReason.VALIDATION to surface the failure as a 400. Status: needs-rebase. One concern: the PR catches only ValueError, but the xgrammar compile_grammar_from_strings binding raises RuntimeError (verified with xgr.Grammar.from_ebnf("root ::= \"a\"\nroot ::= \"b\"") and in the trace above). The unwrapped Future.result() re-raises that RuntimeError unchanged, so except ValueError would not catch the EBNF case shown here. Broadening to except Exception (or at least except (ValueError, RuntimeError)) would make it cover the full bug class.

  • #30346: [Core] Major fix catch backend grammar exceptions (xgrammar, outlines, etc.) in scheduler. Broader scope, also addresses outlines. Code review flagged correctness issues with how aborted-request info is propagated.
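The concern about the exception class in #37642 can be checked in isolation with the standard library. In this sketch, `compile_grammar` is a stand-in that raises RuntimeError the way the trace above shows, and `promote` is a hypothetical simplification of the promotion step, not the PR's actual code.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_grammar(spec: str):
    # Stand-in for the strict compiler: raises RuntimeError, as in the trace.
    raise RuntimeError('EBNF parser error: Rule "root" is defined multiple times')


def promote(future, caught):
    """Wait for a background compile and handle only the `caught` exceptions."""
    try:
        future.result()  # re-raises the worker's exception unchanged
        return "compiled"
    except caught as exc:
        return f"rejected request: {type(exc).__name__}"


with ThreadPoolExecutor(max_workers=1) as pool:
    narrow_future = pool.submit(compile_grammar, "...")
    broad_future = pool.submit(compile_grammar, "...")

    try:
        narrow = promote(narrow_future, ValueError)
    except RuntimeError as exc:
        # `except ValueError` did not match, so the error escaped promote().
        narrow = f"escaped: {type(exc).__name__}"

    broad = promote(broad_future, (ValueError, RuntimeError))
```

Here `narrow` records that the RuntimeError escaped the ValueError-only handler, while `broad` records a per-request rejection. In the engine, an escape at that point corresponds to the crash in this report.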

Earlier closely-related fixes that addressed adjacent failure modes but didn't close the latched-backend hole:

  • #28209: engine crash when structured_outputs.grammar is empty string
  • #19270: guided_regex parsing error crashes the server
  • #17313: clients can crash the openai server with invalid regex
  • #20584: Handle grammar compilation exceptions gracefully

Suggested fix surface

Either:

  1. vllm/v1/structured_output/backend_xgrammar.py:77-122 — wrap compile_grammar in try/except Exception and either return a sentinel grammar that fails the single request as 400, or fall back to the guidance backend just for that request. This is the most localized fix.

  2. vllm/v1/structured_output/__init__.py — drop the single-backend-per-engine constraint and instantiate the requested backend per request, so per-request auto-fallback isn't bypassed. Larger architectural change.
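Option 1 could look roughly like the following. This is a minimal sketch under stated assumptions: `CompileResult`, `safe_compile`, and `to_http_status` are hypothetical names, and `strict_compile` is a toy compiler standing in for xgrammar. How a failed result would be wired into the scheduler and surfaced as a 400 is left open here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CompileResult:
    """Outcome of one grammar compilation: a grammar or an error message."""
    grammar: Optional[object] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def safe_compile(compile_fn: Callable[[str], object], spec: str) -> CompileResult:
    """Run compile_fn and turn any exception into a per-request failure."""
    try:
        return CompileResult(grammar=compile_fn(spec))
    except Exception as exc:  # broad on purpose: the trace shows RuntimeError
        return CompileResult(error=f"invalid grammar: {exc}")


def to_http_status(result: CompileResult) -> int:
    """Map a compile outcome to the status the client would see."""
    return 200 if result.ok else 400


def strict_compile(spec: str) -> str:
    """Toy compiler (assumption): rejects duplicate rule definitions."""
    names = [line.split("::=")[0].strip() for line in spec.splitlines()]
    if len(names) != len(set(names)):
        raise RuntimeError("rule defined multiple times")
    return "compiled"


good = safe_compile(strict_compile, 'root ::= "a" | "b"')
bad = safe_compile(strict_compile, 'root ::= "a"\nroot ::= "b"')
```

The point of returning a result object instead of raising is that the failure stays attached to a single request, so the engine loop never sees the exception.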

Happy to test any patch against the reproducer above.
