vllm - 💡(How to fix) Fix `--reasoning-config` breaks Nemotron v3 reasoning parser (content always null, thinking unbounded) [1 comments, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#39103Fetched 2026-04-08 03:01:58
View on GitHub
Comments
1
Participants
1
Timeline
1
Reactions
2
Author
Participants
Timeline (top)
commented ×1

Root Cause

For Nemotron-3-Super (and similar "thinking" models):

  • We need bounded thinking for latency and cost, especially with small max_tokens.
  • Right now we must choose between:
    • Working parser but no thinking budget (no --reasoning-config), causing small max_tokens calls to be consumed entirely by thinking; or
    • Enabled reasoning-config that allows budgets, but breaks content completely (content: null).

Code Example

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --reasoning-parser nemotron_v3 \
  # variant A (works, but no thinking cap)
  # [no --reasoning-config]

  # variant B (broken; used for repro)
  --reasoning-config '{"reasoning_start_str":"<think>","reasoning_end_str":"</think>"}'
RAW_BUFFERClick to expand / collapse

Summary

When running Nemotron-3-Super with --reasoning-parser nemotron_v3, adding --reasoning-config causes all responses to have content: null, while all generated tokens go into the reasoning trace. Without --reasoning-config, the parser works, but there's no way to cap thinking tokens, so short max_tokens calls can be completely consumed by thinking.

This looks like an interaction/compatibility bug between --reasoning-config and the nemotron_v3 reasoning parser.


Environment

  • vLLM version: 0.19.1rc1.dev35+g968ed02ac (cu130-nightly)
  • Backend: NVIDIA GB10 (DGX Spark, SM121)
  • Model: NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
  • Launch flags (repro case):
python -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --reasoning-parser nemotron_v3 \
  # variant A (works, but no thinking cap)
  # [no --reasoning-config]

  # variant B (broken; used for repro)
  --reasoning-config '{"reasoning_start_str":"<think>","reasoning_end_str":"</think>"}'

Client: OpenAI-compatible /v1/chat/completions (via LiteLLM proxy, but behavior confirmed directly against vLLM as well).


Expected behavior

  • With --reasoning-parser nemotron_v3 and --reasoning-config, the server should:
    • Accept a thinking budget parameter (e.g. thinking_token_budget or equivalent).
    • Route tokens before the Nemotron thinking delimiter into reasoning_content.
    • Route the final answer after the delimiter into content.
    • Enforce a cap on reasoning tokens, especially when max_tokens is small.

Actual behavior

  1. Without --reasoning-config:

    • Requests with large max_tokens (e.g. 32768): responses are correct — message.content has the answer, message.reasoning_content has the thinking trace.
    • Requests with small max_tokens (e.g. 512): model spends all tokens on thinking, message.content is null.
    • There is no way to cap reasoning tokens from the client; thinking_token_budget is rejected with HTTP 400: "thinking_token_budget is set but reasoning_config is not configured."
  2. With --reasoning-config enabled:

    • All responses return:
      • message.content = null
      • message.reasoning_content = non-empty string
      • finish_reason = "stop"
      • usage.completion_tokens > 0
    • This happens even when not passing any thinking/budget parameters from the client.
    • Removing --reasoning-config and restarting restores normal behavior immediately.

What I've tried

  1. Server-side reasoning config--reasoning-config '{"reasoning_start_str":"<think>","reasoning_end_str":"</think>"}'. Result: content always null.

  2. Client-side thinking budgetthinking_token_budget: 4096 directly to vLLM. Worked once before enabling --reasoning-config. After enabling, all requests broken regardless of budget params.

  3. LiteLLM proxy — Verified not the root cause; behavior reproduced calling vLLM directly.


Hypothesis

There is a double-parsing / conflicting logic between:

  • The per-model reasoning parser nemotron_v3, which already knows how to split thinking vs answer for Nemotron 3 Super, and
  • The global reasoning configuration applied via --reasoning-config.

When both are active, the model produces tokens, but either the answer segment is never seen as final content, or the reasoning-config logic treats the entire output as reasoning.


Why this matters

For Nemotron-3-Super (and similar "thinking" models):

  • We need bounded thinking for latency and cost, especially with small max_tokens.
  • Right now we must choose between:
    • Working parser but no thinking budget (no --reasoning-config), causing small max_tokens calls to be consumed entirely by thinking; or
    • Enabled reasoning-config that allows budgets, but breaks content completely (content: null).

Requested

  • Confirm whether --reasoning-config is expected to work together with --reasoning-parser nemotron_v3.
  • If yes, identify the correct shape/name of the thinking budget parameter and any configuration needed to avoid double parsing.
  • If this is a bug, fix the interaction so Nemotron v3 reasoning parser can be used with --reasoning-config, and client-side thinking budgets can be enforced without losing content.

Happy to provide full logs and a minimal docker-compose repro if helpful.

extent analysis

TL;DR

The issue can be resolved by correctly configuring the --reasoning-config to work with the nemotron_v3 reasoning parser, potentially by adjusting the reasoning start and end strings or the thinking budget parameter.

Guidance

  • Verify that the --reasoning-config is correctly formatted and compatible with the nemotron_v3 parser, checking the documentation for any specific requirements or restrictions.
  • Test different combinations of --reasoning-config and thinking_token_budget to identify the correct configuration that allows for bounded thinking and non-null content.
  • Check the server logs for any error messages or warnings related to the --reasoning-config or nemotron_v3 parser to gain insight into the double-parsing issue.
  • Consider reaching out to the vLLM community or support team for further guidance on configuring the --reasoning-config with the nemotron_v3 parser.

Example

No code snippet is provided as the issue is related to configuration and compatibility rather than code.

Notes

The root cause of the issue appears to be a conflict between the nemotron_v3 reasoning parser and the global reasoning configuration applied via --reasoning-config. Resolving this conflict will likely require adjusting the configuration to ensure compatibility between the two.

Recommendation

Apply a workaround by adjusting the --reasoning-config to correctly work with the nemotron_v3 parser, allowing for bounded thinking and non-null content. This may involve modifying the reasoning start and end strings or the thinking budget parameter.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

  • With --reasoning-parser nemotron_v3 and --reasoning-config, the server should:
    • Accept a thinking budget parameter (e.g. thinking_token_budget or equivalent).
    • Route tokens before the Nemotron thinking delimiter into reasoning_content.
    • Route the final answer after the delimiter into content.
    • Enforce a cap on reasoning tokens, especially when max_tokens is small.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix `--reasoning-config` breaks Nemotron v3 reasoning parser (content always null, thinking unbounded) [1 comments, 1 participants]