- With `--reasoning-parser nemotron_v3` and `--reasoning-config`, the server should: - Accept a thinking budget parameter (e.g. `thinking_token_budget` or equivalent). - Route tokens before the Nemotron thinking delimiter into `reasoning_content`. - Route the final answer after the delimiter into `content`. - Enforce a cap on reasoning tokens, especially when `max_tokens` is small. ---

vllm - 💡(How to fix) Fix `--reasoning-config` breaks Nemotron v3 reasoning parser (content always null, thinking unbounded) [1 comments, 1 participants]

vllm2026-04-06 19:06:38

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#39103•Fetched 2026-04-08 03:01:58

View on GitHub

Comments

Participants

Timeline

Reactions

Author

redhelix

Participants

redhelix

Timeline (top)

commented ×1

Root Cause

For Nemotron-3-Super (and similar "thinking" models):

We need bounded thinking for latency and cost, especially with small max_tokens.
Right now we must choose between:
- Working parser but no thinking budget (no --reasoning-config), causing small max_tokens calls to be consumed entirely by thinking; or
- Enabled reasoning-config that allows budgets, but breaks content completely (content: null).

Code Example

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --reasoning-parser nemotron_v3 \
  # variant A (works, but no thinking cap)
  # [no --reasoning-config]

  # variant B (broken; used for repro)
  --reasoning-config '{"reasoning_start_str":"<think>","reasoning_end_str":"</think>"}'

RAW_BUFFERClick to expand / collapse

Summary

When running Nemotron-3-Super with --reasoning-parser nemotron_v3, adding --reasoning-config causes all responses to have content: null, while all generated tokens go into the reasoning trace. Without --reasoning-config, the parser works, but there's no way to cap thinking tokens, so short max_tokens calls can be completely consumed by thinking.

This looks like an interaction/compatibility bug between --reasoning-config and the nemotron_v3 reasoning parser.

Environment

vLLM version: 0.19.1rc1.dev35+g968ed02ac (cu130-nightly)
Backend: NVIDIA GB10 (DGX Spark, SM121)
Model: NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
Launch flags (repro case):

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --reasoning-parser nemotron_v3 \
  # variant A (works, but no thinking cap)
  # [no --reasoning-config]

  # variant B (broken; used for repro)
  --reasoning-config '{"reasoning_start_str":"<think>","reasoning_end_str":"</think>"}'

Client: OpenAI-compatible /v1/chat/completions (via LiteLLM proxy, but behavior confirmed directly against vLLM as well).

Expected behavior

With --reasoning-parser nemotron_v3 and --reasoning-config, the server should:
- Accept a thinking budget parameter (e.g. thinking_token_budget or equivalent).
- Route tokens before the Nemotron thinking delimiter into reasoning_content.
- Route the final answer after the delimiter into content.
- Enforce a cap on reasoning tokens, especially when max_tokens is small.

Actual behavior

Without --reasoning-config:
- Requests with large max_tokens (e.g. 32768): responses are correct — message.content has the answer, message.reasoning_content has the thinking trace.
- Requests with small max_tokens (e.g. 512): model spends all tokens on thinking, message.content is null.
- There is no way to cap reasoning tokens from the client; thinking_token_budget is rejected with HTTP 400: "thinking_token_budget is set but reasoning_config is not configured."
With --reasoning-config enabled:
- All responses return:
  - message.content = null
  - message.reasoning_content = non-empty string
  - finish_reason = "stop"
  - usage.completion_tokens > 0
- This happens even when not passing any thinking/budget parameters from the client.
- Removing --reasoning-config and restarting restores normal behavior immediately.

What I've tried

Server-side reasoning config — --reasoning-config '{"reasoning_start_str":"<think>","reasoning_end_str":"</think>"}'. Result: content always null.
Client-side thinking budget — thinking_token_budget: 4096 directly to vLLM. Worked once before enabling --reasoning-config. After enabling, all requests broken regardless of budget params.
LiteLLM proxy — Verified not the root cause; behavior reproduced calling vLLM directly.

Hypothesis

There is a double-parsing / conflicting logic between:

The per-model reasoning parser nemotron_v3, which already knows how to split thinking vs answer for Nemotron 3 Super, and
The global reasoning configuration applied via --reasoning-config.

When both are active, the model produces tokens, but either the answer segment is never seen as final content, or the reasoning-config logic treats the entire output as reasoning.

Why this matters

For Nemotron-3-Super (and similar "thinking" models):

We need bounded thinking for latency and cost, especially with small max_tokens.
Right now we must choose between:
- Working parser but no thinking budget (no --reasoning-config), causing small max_tokens calls to be consumed entirely by thinking; or
- Enabled reasoning-config that allows budgets, but breaks content completely (content: null).

Requested

Confirm whether --reasoning-config is expected to work together with --reasoning-parser nemotron_v3.
If yes, identify the correct shape/name of the thinking budget parameter and any configuration needed to avoid double parsing.
If this is a bug, fix the interaction so Nemotron v3 reasoning parser can be used with --reasoning-config, and client-side thinking budgets can be enforced without losing content.

Happy to provide full logs and a minimal docker-compose repro if helpful.

extent analysis

TL;DR

The issue can be resolved by correctly configuring the --reasoning-config to work with the nemotron_v3 reasoning parser, potentially by adjusting the reasoning start and end strings or the thinking budget parameter.

Guidance

Verify that the --reasoning-config is correctly formatted and compatible with the nemotron_v3 parser, checking the documentation for any specific requirements or restrictions.
Test different combinations of --reasoning-config and thinking_token_budget to identify the correct configuration that allows for bounded thinking and non-null content.
Check the server logs for any error messages or warnings related to the --reasoning-config or nemotron_v3 parser to gain insight into the double-parsing issue.
Consider reaching out to the vLLM community or support team for further guidance on configuring the --reasoning-config with the nemotron_v3 parser.

Example

No code snippet is provided as the issue is related to configuration and compatibility rather than code.

Notes

The root cause of the issue appears to be a conflict between the nemotron_v3 reasoning parser and the global reasoning configuration applied via --reasoning-config. Resolving this conflict will likely require adjusting the configuration to ensure compatibility between the two.

Recommendation

Apply a workaround by adjusting the --reasoning-config to correctly work with the nemotron_v3 parser, allowing for bounded thinking and non-null content. This may involve modifying the reasoning start and end strings or the thinking budget parameter.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

With --reasoning-parser nemotron_v3 and --reasoning-config, the server should:
- Accept a thinking budget parameter (e.g. thinking_token_budget or equivalent).
- Route tokens before the Nemotron thinking delimiter into reasoning_content.
- Route the final answer after the delimiter into content.
- Enforce a cap on reasoning tokens, especially when max_tokens is small.

#api #search optimization #API routing #API middleware #SSR setup

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix `--reasoning-config` breaks Nemotron v3 reasoning parser (content always null, thinking unbounded) [1 comments, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Code Example

Environment

Expected behavior

Actual behavior

What I've tried

Hypothesis

Why this matters

Requested

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix `--reasoning-config` breaks Nemotron v3 reasoning parser (content always null, thinking unbounded) [1 comments, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Code Example

Environment

Expected behavior

Actual behavior

What I've tried

Hypothesis

Why this matters

Requested

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING