vllm - ✅(Solved) Fix [Bug]: `reasoning_effort` is silently ignored by nemotron_v3 reasoning parser, and `reasoning_effort: "none"` produces deceptive "hidden cost" output on Nemotron-H [2 pull requests, 2 comments, 3 participants]

GitHub stats
vllm-project/vllm#39581 (fetched 2026-04-12 13:24:38)
Comments: 2 · Participants: 3 · Timeline events: 7 · Reactions: 0
Timeline (top): commented ×2, cross-referenced ×2, referenced ×2, labeled ×1

Root Cause

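The nemotron_v3 reasoning parser never translates reasoning_effort into the low_effort and enable_thinking kwargs that the Nemotron chat template actually reads. For reasoning_effort: "none", vLLM only sets include_reasoning = False, which hides the reasoning in the response while the model still generates it. The issue author calls this second case arguably worse than the first, because it gives the client the false impression that the reasoning knob worked.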

Fix Action

Fixed

PR fix notes

PR #39587: fix(reasoning): apply reasoning_effort parameter in nemotron_v3 parser

Description (problem / solution / changelog)

Problem

The nemotron_v3 reasoning parser silently ignores the OpenAI-standard reasoning_effort field in chat completion requests. The Nemotron chat template reads low_effort and enable_thinking kwargs to control generation behavior, but nothing translated reasoning_effort into those native parameters.

Sending reasoning_effort: "low" produces identical output to no flag at all (full chain-of-thought, ~370 tokens). Sending reasoning_effort: "none" hides the reasoning from the response but still generates it (~315 tokens), giving the false impression that reasoning was skipped.

Fix

Add adjust_request to NemotronV3ReasoningParser that maps:

  • reasoning_effort="low" → chat_template_kwargs["low_effort"] = True
  • reasoning_effort="none" → chat_template_kwargs["enable_thinking"] = False

Existing user-provided chat_template_kwargs are preserved (not overwritten when already set).
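The mapping and the "do not overwrite" rule described above can be sketched on plain dictionaries. This is a minimal illustration, not the code from the PR: the helper name translate_reasoning_effort and the EFFORT_TO_TEMPLATE_KWARG table are hypothetical, and only the two key/value pairs come from the PR description.

```python
from typing import Any, Optional

# Mapping taken from the PR description; every other value is a no-op.
EFFORT_TO_TEMPLATE_KWARG = {
    "low": ("low_effort", True),
    "none": ("enable_thinking", False),
}


def translate_reasoning_effort(
    effort: Optional[str],
    user_kwargs: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Return chat-template kwargs with reasoning_effort translated.

    Hypothetical helper for illustration; it is not the code from the PR.
    """
    kwargs = dict(user_kwargs or {})  # copy so the caller's dict is untouched
    mapped = EFFORT_TO_TEMPLATE_KWARG.get(effort)
    if mapped is not None:
        key, value = mapped
        kwargs.setdefault(key, value)  # a value already set by the user wins
    return kwargs


print(translate_reasoning_effort("low"))
print(translate_reasoning_effort("none", {"enable_thinking": True}))
print(translate_reasoning_effort("high"))
```

Using setdefault rather than plain assignment is what keeps an explicit user choice, such as enable_thinking=True, from being overridden by reasoning_effort.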

Verification

Logic verified with unit tests covering all reasoning_effort values ("low", "none", "medium", "high", None), kwargs preservation, and merge behavior.

Fixes #39581

Changed files

  • vllm/reasoning/nemotron_v3_reasoning_parser.py (modified, +18/-0)

PR #39597: fix(reasoning): translate reasoning_effort in nemotron_v3 parser

Description (problem / solution / changelog)

Problem

NemotronV3ReasoningParser silently ignores the OpenAI-standard reasoning_effort field. The Nemotron chat template reads the low_effort and enable_thinking kwargs, not reasoning_effort, so nothing was bridging the two.

  • reasoning_effort='low' → full thinking (same as no flag)

  • reasoning_effort='none' → model still generates full CoT, response just hides it (reasoning: null), giving a false impression that thinking was skipped

Fixes #39581.

Fix

Add adjust_request to NemotronV3ReasoningParser that maps:

| reasoning_effort | Effect |
|---|---|
| 'low' | chat_template_kwargs['low_effort'] = True |
| 'none' | chat_template_kwargs['enable_thinking'] = False |
| 'medium' / 'high' / None | no-op (full thinking is the default) |

Key details:

  • Uses getattr for safe access on ResponsesRequest (which lacks chat_template_kwargs)
  • Uses elif, since the two branches are mutually exclusive
  • Never overwrites existing user-provided kwargs

Tests

Added 6 unit tests covering:

  • All effort values (low, none, medium, high)
  • None effort is a no-op
  • User-provided kwargs are not overwritten (e.g. low_effort=False stays False)
  • None chat_template_kwargs is initialised correctly
  • reasoning_effort='none' + preexisting enable_thinking=True → thinking stays on
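The listed cases could be exercised along the following lines. This is a sketch under stated assumptions, not the test file from the PR: adjust_request here is a stand-in function, requests are modelled with SimpleNamespace rather than vLLM's request classes, and treating a request without chat_template_kwargs as a no-op is only one plausible reading of the getattr note.

```python
import unittest
from types import SimpleNamespace


def adjust_request(request):
    """Stand-in for the parser hook; not the code from the PR."""
    effort = getattr(request, "reasoning_effort", None)
    if effort not in ("low", "none"):
        return request  # 'medium', 'high' and None keep default thinking
    if not hasattr(request, "chat_template_kwargs"):
        return request  # assumption: nothing to set on such a request
    kwargs = dict(request.chat_template_kwargs or {})
    if effort == "low":
        kwargs.setdefault("low_effort", True)
    elif effort == "none":
        kwargs.setdefault("enable_thinking", False)
    request.chat_template_kwargs = kwargs
    return request


def make_request(effort, kwargs=None):
    """Build a minimal fake chat request (hypothetical test helper)."""
    return SimpleNamespace(reasoning_effort=effort, chat_template_kwargs=kwargs)


class AdjustRequestTests(unittest.TestCase):
    def test_low_sets_low_effort(self):
        req = adjust_request(make_request("low"))
        self.assertEqual(req.chat_template_kwargs, {"low_effort": True})

    def test_none_disables_thinking(self):
        req = adjust_request(make_request("none"))
        self.assertEqual(req.chat_template_kwargs, {"enable_thinking": False})

    def test_medium_high_and_unset_are_noops(self):
        for effort in ("medium", "high", None):
            req = adjust_request(make_request(effort))
            self.assertIsNone(req.chat_template_kwargs)

    def test_user_value_is_kept(self):
        req = adjust_request(make_request("low", {"low_effort": False}))
        self.assertEqual(req.chat_template_kwargs, {"low_effort": False})

    def test_none_effort_with_thinking_forced_on(self):
        req = adjust_request(make_request("none", {"enable_thinking": True}))
        self.assertEqual(req.chat_template_kwargs, {"enable_thinking": True})

    def test_request_without_kwargs_field(self):
        req = SimpleNamespace(reasoning_effort="low")
        adjust_request(req)
        self.assertFalse(hasattr(req, "chat_template_kwargs"))
```

Saved as a module, the cases run with python -m unittest.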

Changed files

  • tests/reasoning/test_nemotron_v3_reasoning_parser.py (modified, +106/-0)
  • vllm/reasoning/nemotron_v3_reasoning_parser.py (modified, +25/-0)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu129
CUDA used to build PyTorch   : 12.9

==============================
      Python Environment
==============================
Python version               : 3.12.13 (64-bit runtime)
Python platform              : Linux-6.8.0-1047-oracle-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
Nvidia driver version        : 575.57.08

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.6
[pip3] numpy==2.2.6
[pip3] torch==2.10.0+cu129
[pip3] transformers==4.57.6
[pip3] triton==3.6.0

==============================
         vLLM Info
==============================
vLLM Version                 : 0.19.0
vLLM Build Flags:
  CUDA Archs: 7.0 7.5 8.0 8.9 9.0 10.0 12.0; ROCm: Disabled
GPU Topology:
  GPU0  GPU1  CPU Affinity    NUMA Affinity
GPU0   X    NV18  0-55,112-167    0
GPU1  NV18   X    0-55,112-167    0

Running inside the official image vllm/vllm-openai:v0.19.0 (production-docker-image).
</details>

🐛 Describe the bug

vLLM v0.19.0, serving nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 with --reasoning-parser nemotron_v3, does not honor the OpenAI-standard reasoning_effort field in chat completion requests. There are actually two related bugs:

Bug 1 — reasoning_effort: "low" is silently ignored

ChatCompletionRequest declares reasoning_effort as a first-class field at vllm/entrypoints/openai/chat_completion/protocol.py:182:

reasoning_effort: Literal["none", "low", "medium", "high"] | None = None

…and auto-injects it into chat_template_kwargs at protocol.py:374:

chat_template_kwargs=merge_kwargs(
    self.chat_template_kwargs,
    dict(
        add_generation_prompt=self.add_generation_prompt,
        continue_final_message=self.continue_final_message,
        documents=self.documents,
        reasoning_effort=self.reasoning_effort,   # <-- forwarded
    ),
),

However:

  1. The Nemotron-3-Super chat template (chat_template.jinja in the HF repo) never reads the reasoning_effort Jinja variable. It only reads its own low_effort and enable_thinking kwargs.
  2. vllm/reasoning/nemotron_v3_reasoning_parser.py does not translate request.reasoning_effort into chat_template_kwargs["low_effort"] either. The full parser is 33 lines and only overrides extract_reasoning for output post-processing — it never touches request preprocessing.

Net effect: a client that sends reasoning_effort: "low" (the OpenAI-standard way to ask for brief reasoning) gets full thinking, identical to sending no flag at all, with no warning or indication.
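Why a forwarded but unread kwarg changes nothing can be shown with a toy model of the template's decision. The function below is a hypothetical stand-in, not the real chat_template.jinja: it assumes only what the issue states, namely that the template consults low_effort and enable_thinking, and that full thinking is the default when neither is set.

```python
def thinking_mode(**template_kwargs):
    """Toy stand-in for the template's branch logic (hypothetical)."""
    # Assumed defaults, consistent with "no flag" producing full thinking.
    enable_thinking = template_kwargs.get("enable_thinking", True)
    low_effort = template_kwargs.get("low_effort", False)
    if not enable_thinking:
        return "off"
    return "brief" if low_effort else "full"


# vLLM forwards reasoning_effort, but nothing in this logic reads it.
print(thinking_mode(add_generation_prompt=True, reasoning_effort="low"))
print(thinking_mode(add_generation_prompt=True, low_effort=True))
print(thinking_mode(add_generation_prompt=True, enable_thinking=False))
```

The first call yields the same mode as a request with no flag at all, which mirrors the silent fallback to full thinking described above.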

Bug 2 — reasoning_effort: "none" is worse: it creates deceptive output with full hidden cost

chat_completion/protocol.py:786-790 handles reasoning_effort == "none" this way:

@model_validator(mode="before")
@classmethod
def set_include_reasoning_for_none_effort(cls, data: Any) -> Any:
    if data.get("reasoning_effort") == "none":
        data["include_reasoning"] = False
    return data

This only flips include_reasoning — a response-formatting switch that hides the reasoning field from the response body. The model still generates the full chain of thought. On Nemotron-3-Super this means:

  • The client sees "reasoning": null in the response → looks like the model skipped thinking → looks fast and cheap
  • In reality, completion_tokens is ~360 (matches full-thinking mode) and wall time is ~2.7 s (matches full-thinking mode)
  • The cost is hidden, not eliminated

This is arguably worse than bug 1, because it gives the client the false impression that the reasoning knob worked.
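A client can look for this hidden cost by comparing completion_tokens against the visible content. The check below is a rough heuristic sketch, not a vLLM feature: the function name, the 4-characters-per-token estimate and the slack factor are all assumptions chosen for illustration, and the response shape follows the fields used by the reproducer in this issue.

```python
def hidden_reasoning_suspected(response, chars_per_token=4.0, slack=2.0):
    """Flag responses whose token count far exceeds the visible content.

    Heuristic only: chars_per_token and slack are illustrative assumptions.
    """
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning") or ""
    content = message.get("content") or ""
    tokens = response["usage"]["completion_tokens"]
    if reasoning:
        return False  # reasoning is visible, so nothing is hidden
    expected_tokens = max(len(content) / chars_per_token, 1.0)
    return tokens > slack * expected_tokens


def fake_response(reasoning, content_chars, tokens):
    """Build a response-shaped dict for demonstration purposes."""
    return {
        "choices": [{"message": {"reasoning": reasoning, "content": "x" * content_chars}}],
        "usage": {"completion_tokens": tokens},
    }


# Shapes loosely modelled on the medians reported for rows 3 and 5.
print(hidden_reasoning_suspected(fake_response(None, 373, 315)))
print(hidden_reasoning_suspected(fake_response(None, 36, 9)))
```

A positive result is only a hint that generation was longer than the returned text explains; wall time is a second signal worth checking alongside it.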

Minimal reproducer

Serve Nemotron-3-Super with the standard config:

docker run -d --name nemotron-repro \
  --gpus '"device=0,1"' --ipc=host --shm-size=16g \
  -p 8095:8095 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.19.0 \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --host 0.0.0.0 --port 8095 \
  --tensor-parallel-size 2 --max-model-len 65536 \
  --kv-cache-dtype fp8 --mamba-ssm-cache-dtype float32 \
  --enable-chunked-prefill --enable-prefix-caching \
  --gpu-memory-utilization 0.9 --async-scheduling \
  --reasoning-parser nemotron_v3 \
  --trust-remote-code

Then send five requests that differ only in how reasoning effort is controlled:

import json, urllib.request

def ask(payload, label):
    req = urllib.request.Request(
        "http://localhost:8095/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        obj = json.load(resp)
    m = obj['choices'][0]['message']
    r = m.get('reasoning') or ''
    c = m.get('content') or ''
    u = obj['usage']
    print(f"[{label}]")
    print(f"  reasoning chars: {len(r):>5}  content chars: {len(c):>4}  "
          f"completion_tokens: {u['completion_tokens']:>4}")

q = "What is the capital city of France?"
base = {"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
        "messages": [{"role": "user", "content": q}],
        "max_tokens": 1024, "temperature": 0}

ask(base,                                                       "1. no flag")
ask({**base, "reasoning_effort": "low"},                        "2. reasoning_effort=low")
ask({**base, "reasoning_effort": "none"},                       "3. reasoning_effort=none")
ask({**base, "chat_template_kwargs": {"low_effort": True}},     "4. kwargs.low_effort=true")
ask({**base, "chat_template_kwargs": {"enable_thinking": False}}, "5. kwargs.enable_thinking=false")

Observed output

Each mode run 3 times at temperature=0; values below are medians.

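| Mode | reasoning_chars | content_chars | completion_tokens | wall_ms |
|---|---|---|---|---|
| 1. no flag | 1146 | 368 | 372 | 3084 |
| 2. reasoning_effort="low" | 985 | 350 | 310 | 2576 |
| 3. reasoning_effort="none" | 0 | 373 | 315 | 2620 |
| 4. kwargs.low_effort=true | 45 | 37 | 22 | 217 |
| 5. kwargs.enable_thinking=false | 0 | 36 | 9 | 108 |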

What to notice

  • Row 2 vs row 1: completion_tokens 310 vs 372 — within run-to-run noise. Reasoning text is still a full chain of thought (~1000 chars). Wall time is unchanged (~2.6 s vs ~3.1 s). reasoning_effort: "low" has no meaningful effect on Nemotron-H.
  • Row 3 vs row 1: reasoning: null in the response, but completion_tokens 315 (≈ row 1's 372) and wall time 2620 ms (≈ row 1's 3084 ms). The model generated a full CoT and vLLM hid it from the response body. The client sees reasoning: null and thinks the model skipped thinking — but paid full thinking cost. Compare to row 5 where reasoning: null correctly correlates with 9 tokens / 108 ms.
  • Rows 4 and 5: The only modes that actually shorten generation. Row 4 uses 22 tokens (~17× fewer). Row 5 uses 9 tokens (~41× fewer, bit-identical across runs). These work because chat_template.jinja reads low_effort (line 180-181) and enable_thinking (line 206-208). Nothing in the template reads reasoning_effort.
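The reduction factors quoted above follow directly from the median completion_tokens values. A short script, using only numbers from the observed output, makes the arithmetic explicit:

```python
# Median completion_tokens per mode, copied from the observed output.
completion_tokens = {
    "1. no flag": 372,
    "2. reasoning_effort=low": 310,
    "3. reasoning_effort=none": 315,
    "4. kwargs.low_effort=true": 22,
    "5. kwargs.enable_thinking=false": 9,
}

baseline = completion_tokens["1. no flag"]
# Ratio of default token count to each mode's token count.
reduction = {mode: baseline / tokens for mode, tokens in completion_tokens.items()}

for mode, factor in reduction.items():
    print(f"{mode:<34} {completion_tokens[mode]:>4} tokens  {factor:5.1f}x fewer than default")
```

Rows 2 and 3 come out close to 1.2×, while rows 4 and 5 land near 17× and 41×, matching the figures in the notes.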

Expected behavior

  1. reasoning_effort: "low" should produce output similar to row 4 (brief reasoning, ~20× fewer tokens than default).
  2. reasoning_effort: "none" should produce output similar to row 5 (no reasoning generated, ~40× fewer tokens).
  3. If the parser cannot honor the field, vLLM should either log a warning or raise 400 Bad Request, rather than silently degrading to full thinking.
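The third expectation, warn or reject instead of degrading silently, could look roughly like this. Everything here is hypothetical: vLLM has no check_reasoning_effort function that this issue mentions, and the set of honored values is passed in by the caller purely for illustration.

```python
import logging

logger = logging.getLogger("reasoning_effort_check")


def check_reasoning_effort(effort, honored, strict=False):
    """Return True if the effort value can be honored (hypothetical helper).

    In strict mode an unsupported value raises ValueError, which a server
    could translate into a 400 Bad Request; otherwise a warning is logged.
    """
    if effort is None or effort in honored:
        return True
    message = f"reasoning_effort={effort!r} is not honored by this reasoning parser"
    if strict:
        raise ValueError(message)
    logger.warning(message)
    return False


# A parser that honors nothing, as described for v0.19.0 in this issue.
print(check_reasoning_effort("low", honored=set()))
# A parser that honors 'low' and 'none'.
print(check_reasoning_effort("none", honored={"low", "none"}))
```

Whether a warning or a hard error is preferable depends on how much existing client traffic already sends the field; the issue leaves that choice open.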


extent analysis

TL;DR

To fix the issue where reasoning_effort is silently ignored or deceptively handled, modify the nemotron_v3_reasoning_parser.py to properly translate request.reasoning_effort into chat_template_kwargs["low_effort"] or chat_template_kwargs["enable_thinking"].

Guidance

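  • Open vllm/reasoning/nemotron_v3_reasoning_parser.py. Per the issue, the parser only overrides extract_reasoning and does no request preprocessing, so there is no existing reasoning_effort handling to locate; both PRs add it as a new adjust_request method.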
  • Modify the parser to translate request.reasoning_effort into chat_template_kwargs["low_effort"] or chat_template_kwargs["enable_thinking"] based on the reasoning_effort value.
  • Test the modified parser with different reasoning_effort values to ensure it produces the expected output.
  • Consider adding logging or error handling to notify the client if the reasoning_effort field cannot be honored.

Example

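# nemotron_v3_reasoning_parser.py (fragment; assumes `request` is the incoming chat request)
chat_template_kwargs = dict(request.chat_template_kwargs or {})
if request.reasoning_effort == "low":
    chat_template_kwargs.setdefault("low_effort", True)  # keep a user-supplied value
elif request.reasoning_effort == "none":
    chat_template_kwargs.setdefault("enable_thinking", False)
request.chat_template_kwargs = chat_template_kwargs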

Notes

The provided solution assumes that the nemotron_v3_reasoning_parser.py file is the correct location to modify the reasoning_effort handling. Additionally, the example code snippet is a simplified representation of the necessary changes and may require further modifications to work correctly.

Recommendation

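Move to a vLLM build that includes the fix once one is available. Until then, either patch nemotron_v3_reasoning_parser.py as described above, or pass chat_template_kwargs (low_effort: true or enable_thinking: false) directly in the request, which the reproducer shows already takes effect on v0.19.0 (rows 4 and 5).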
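A client-side alternative, suggested by rows 4 and 5 of the reproducer, is to translate reasoning_effort into the template's native kwargs before sending the request. The sketch below is an assumption-laden illustration: with_native_effort_kwargs is a hypothetical helper, and it only rewrites the JSON payload without contacting a server.

```python
import copy
import json


def with_native_effort_kwargs(payload):
    """Return a copy of payload with reasoning_effort mirrored into
    chat_template_kwargs (hypothetical client-side helper)."""
    out = copy.deepcopy(payload)  # never mutate the caller's payload
    effort = out.get("reasoning_effort")
    kwargs = dict(out.get("chat_template_kwargs") or {})
    if effort == "low":
        kwargs.setdefault("low_effort", True)
    elif effort == "none":
        kwargs.setdefault("enable_thinking", False)
    if kwargs:
        out["chat_template_kwargs"] = kwargs
    return out


base = {
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    "messages": [{"role": "user", "content": "What is the capital city of France?"}],
    "max_tokens": 1024,
    "temperature": 0,
}

print(json.dumps(with_native_effort_kwargs({**base, "reasoning_effort": "low"}), indent=2))
```

The rewritten payload can be passed to the ask helper from the reproducer unchanged; reasoning_effort is left in place so the request still reads correctly against a server that already has the fix.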
