[Bug]: openclaw openai-compat /v1/chat/completions strips chat_template_kwargs entirely on vLLM/Nemotron, causing a reasoning-only death spiral



Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

The POST /v1/chat/completions openai-compat HTTP endpoint strips chat_template_kwargs entirely from outbound requests to vLLM-served Nemotron-3-Super, despite full config wiring (agents.list[].thinkingDefault: off, models.providers.vllm.models[].params.qwenThinkingFormat: chat-template, model reasoning: true). The Nemotron-detection wrapper at dist/stream-*.js::setNemotronThinkingOffChatTemplateKwargs, which injects {enable_thinking: false, force_nonempty_content: true}, fires on the embedded/CLI agent path but is bypassed on the openai-compat HTTP path. As a result, Nemotron defaults to thinking-on without force_nonempty_content and returns empty visible content. That triggers the "reasoning-only assistant turn detected" retry loop, and once the retries are exhausted the gateway surfaces FailoverError: incomplete_result as HTTP 400. Reproduces identically on 2026.5.20, 2026.5.22 (latest), and 2026.5.24-beta.2 (beta).

Related closed: #82768 (enable_thinking defaulting to true — wrong value), #71891 (vLLM/Nemotron stored as thinking-only). This report shows the more severe variant where the kwargs object is entirely absent on the openai-compat path, not merely populated with the wrong value.

Steps to reproduce

Tested on Ubuntu 24.04 (aarch64, NVIDIA DGX Spark / GB10), OpenClaw 2026.5.22 from npm.

  1. Run vLLM with Nemotron-3-Super-120B on 127.0.0.1:8000 (NVIDIA's official Spark recipe): https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

  2. Configure OpenClaw vllm provider in ~/.openclaw/openclaw.json:

    • models.providers.vllm.baseUrl: http://127.0.0.1:8000/v1
    • models.providers.vllm.models[0].id: nemotron-3-super:120b
    • models.providers.vllm.models[0].reasoning: true
    • models.providers.vllm.models[0].params.qwenThinkingFormat: chat-template
    • agents.list[main].model: vllm/nemotron-3-super:120b
    • agents.list[main].thinkingDefault: off
    • gateway.http.endpoints.chatCompletions.enabled: true
  3. Start a tiny MITM proxy on :8003 that forwards to :8000 and logs request bodies (Python stdlib only):

#!/usr/bin/env python3
# Minimal logging proxy: listens on :8003, logs each request body, forwards to vLLM on :8000.
import http.server, json, sys, urllib.request

class H(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the full request body exactly as the gateway sent it.
        n = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(n)
        try:
            p = json.loads(body)
            # Prints "null" when chat_template_kwargs is absent from the body.
            print(f"[POST {self.path}] chat_template_kwargs={json.dumps(p.get('chat_template_kwargs'))}", file=sys.stderr, flush=True)
            print(f"  body keys: {sorted(p.keys())}", file=sys.stderr, flush=True)
        except Exception as e:
            print(f"[err] {e}", file=sys.stderr, flush=True)
        # Forward the unmodified body upstream, dropping only the Host header.
        req = urllib.request.Request(f"http://127.0.0.1:8000{self.path}", data=body, method="POST",
            headers={k: v for k, v in self.headers.items() if k.lower() != "host"})
        with urllib.request.urlopen(req, timeout=180) as r:
            # Relay the upstream status, headers, and fully buffered response body.
            self.send_response(r.status)
            for k, v in r.headers.items():
                if k.lower() not in ("transfer-encoding", "content-encoding"):
                    self.send_header(k, v)
            self.end_headers()
            self.wfile.write(r.read())

    def log_message(self, *a):
        pass  # silence the default per-request access log

if __name__ == "__main__":
    http.server.ThreadingHTTPServer(("127.0.0.1", 8003), H).serve_forever()
  4. Redirect OpenClaw to the proxy and restart:
sed -i 's|http://127.0.0.1:8000/v1|http://127.0.0.1:8003/v1|' ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
  5. POST a chat completion to the openai-compat endpoint:
TOKEN=$(grep ^OPENCLAW_GATEWAY_TOKEN= ~/.config/openclaw-tokens.env | cut -d= -f2-)
curl -X POST http://localhost:18789/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model":"openclaw/main","messages":[{"role":"user","content":"hi"}],"stream":false,"max_tokens":10}'
  6. Read the MITM log and observe chat_template_kwargs=null on all forwarded requests.
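
Reading the proxy log by eye gets tedious when the same test is repeated across several versions. The following is a minimal sketch of a checker, assuming the log lines keep the exact format printed by the proxy script above. The function name and the sample lines are illustrative and not part of OpenClaw.

```python
import json
import re

# Matches the first line the proxy prints per request, e.g.
#   [POST /v1/chat/completions] chat_template_kwargs=null
LINE_RE = re.compile(r"^\s*\[POST (?P<path>\S+)\] chat_template_kwargs=(?P<kwargs>.*)$")


def summarize_mitm_log(lines):
    """Return (total_requests, requests_without_kwargs) for the proxy log."""
    total = 0
    missing = 0
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skips the "body keys" lines and anything unrelated
        total += 1
        # The proxy prints json.dumps(None) == "null" when the key is absent
        # (an explicit JSON null in the body would look the same).
        if json.loads(m.group("kwargs")) is None:
            missing += 1
    return total, missing


if __name__ == "__main__":
    sample = [
        "[POST /v1/chat/completions] chat_template_kwargs=null",
        "  body keys: ['max_completion_tokens', 'messages', 'model']",
        '[POST /v1/chat/completions] chat_template_kwargs={"enable_thinking": false}',
    ]
    print(summarize_mitm_log(sample))  # (2, 1)
```

A result where both numbers are equal and non-zero corresponds to the failure described in this report: every forwarded request lacked the kwargs.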

Expected behavior

The openai-compat HTTP path should forward chat_template_kwargs to vLLM, matching the embedded/CLI path's behavior. Specifically, when the active agent's resolved thinkingLevel === "off" and the model matches the Nemotron-3 family detector, setNemotronThinkingOffChatTemplateKwargs should run and inject {enable_thinking: false, force_nonempty_content: true} into the outbound request body.

The CLI/embedded path (openclaw agent --agent main --thinking off) already does this correctly: MITM proxy capture confirms it sends chat_template_kwargs={enable_thinking: false, ...} and Nemotron returns a clean response.
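
The expected behavior can be sketched as a small body transform. This is a Python illustration only: the real wrapper lives in OpenClaw's JavaScript bundle and has not been read line-by-line, so the function names, the substring-based detector, and the "on" placeholder level below are all assumptions. Only the two kwargs and the "off" level come from this report.

```python
import copy

# The kwargs named in this report.
NEMOTRON_THINKING_OFF_KWARGS = {"enable_thinking": False, "force_nonempty_content": True}


def looks_like_nemotron3(model_id: str) -> bool:
    # Assumption: a substring match stands in for the real family detector.
    return "nemotron-3" in model_id.lower()


def apply_thinking_off_kwargs(body: dict, thinking_level: str) -> dict:
    """Return a copy of the outbound body with the thinking-off kwargs merged in."""
    out = copy.deepcopy(body)
    if thinking_level == "off" and looks_like_nemotron3(out.get("model", "")):
        # Preserve any kwargs already present, then force the two values.
        kwargs = dict(out.get("chat_template_kwargs") or {})
        kwargs.update(NEMOTRON_THINKING_OFF_KWARGS)
        out["chat_template_kwargs"] = kwargs
    return out


if __name__ == "__main__":
    body = {"model": "nemotron-3-super:120b", "messages": [{"role": "user", "content": "hi"}]}
    print(apply_thinking_off_kwargs(body, "off")["chat_template_kwargs"])
    # "on" is a placeholder for any level other than "off".
    print("chat_template_kwargs" in apply_thinking_off_kwargs(body, "on"))
```

The point of the sketch is that the transform depends only on the resolved thinking level and the model id, so it should produce the same outbound body whichever entry point built the request.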

Actual behavior

On all 3 versions tested (2026.5.20, 2026.5.22, 2026.5.24-beta.2), MITM proxy captures show OpenClaw's HTTP path sends chat_template_kwargs=null on every retry, with body keys: ['max_completion_tokens', 'messages', 'model', 'stream', 'stream_options', 'tool_choice', 'tools'] — no chat_template_kwargs, no extra_body, nothing.

Nemotron defaults to thinking-on, burns the token budget on reasoning, and returns empty content. OpenClaw classifies the turn as reasoning-only (log message: "reasoning-only assistant turn detected"), retries 2 more times with the same stripped request, exhausts its retries, and returns:

HTTP 400
{"error":{"message":"vllm/nemotron-3-super:120b ended with an incomplete terminal response","type":"invalid_request_error","code":"incomplete_result"}}

Total request time is ~50 seconds across the 3 attempts (the initial request plus 2 retries).

Control proves vLLM/Nemotron honor the kwargs when they arrive — same prompt direct to vLLM with the kwargs returns clean content in <1 second:

curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"nemotron-3-super:120b","messages":[{"role":"user","content":"hi"}],"max_tokens":10,"chat_template_kwargs":{"enable_thinking":false,"force_nonempty_content":true}}'
# {"choices":[{"message":{"content":"ok","reasoning":null}, ...}], "usage":{"completion_tokens":2, ...}}
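
The control response above carries non-empty content and reasoning: null. A minimal sketch of a client-side check that tells that shape apart from a reasoning-only turn follows. The field names are taken from the control output. The function is hypothetical and is not OpenClaw's own classifier, and the second sample is made up for illustration because the report does not include the raw failing response.

```python
def is_reasoning_only(response: dict) -> bool:
    """True when the first choice has reasoning text but no visible content."""
    choices = response.get("choices") or []
    if not choices:
        return False
    message = choices[0].get("message") or {}
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning") or "").strip()
    return not content and bool(reasoning)


if __name__ == "__main__":
    # Shape of the control response shown above.
    control = {"choices": [{"message": {"content": "ok", "reasoning": None}}]}
    # Made-up example of a turn that spent its budget on reasoning.
    stalled = {"choices": [{"message": {"content": "", "reasoning": "The user said hi"}}]}
    print(is_reasoning_only(control))  # False
    print(is_reasoning_only(stalled))  # True
```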

OpenClaw version

2026.5.22 (also reproduced on 2026.5.20 and 2026.5.24-beta.2)

Operating system

Ubuntu 24.04.4 LTS, aarch64 (NVIDIA DGX Spark / GB10)

Install method

npm install -g openclaw@<version> into nvm node v22.22.2 prefix

Model

nemotron-3-super:120b (served-model-name from vllm-serve); checkpoint = nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

Provider / routing chain

gateway.http openai-compat → agents.list[main] → models.providers.vllm → http://127.0.0.1:8000/v1 (vLLM container nvcr.io/nvidia/vllm:26.04-py3, served on Spark loopback)

Additional provider/model setup details

vllm provider config:

models.providers.vllm:
  baseUrl: http://127.0.0.1:8000/v1
  apiKey: { source: env, id: OPENCLAW_VLLM_API_KEY }
  api: openai-completions
  timeoutSeconds: 600
  models:
    - id: nemotron-3-super:120b
      reasoning: true
      contextWindow: 65536
      maxTokens: 8192
      api: openai-completions
      params:
        qwenThinkingFormat: chat-template

agents.list[main]:
  model: vllm/nemotron-3-super:120b
  thinkingDefault: off

gateway.http.endpoints.chatCompletions.enabled: true

vLLM serve flags: --quantization fp4 --kv-cache-dtype fp8 --reasoning-parser super_v3 --tool-call-parser qwen3_coder --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' --max-model-len 131072

Logs, screenshots, and evidence

MITM capture (3 requests per call: the initial attempt plus 2 retries; all identical on each of the 3 versions tested):

  [POST /v1/chat/completions] chat_template_kwargs=null
    body keys: ['max_completion_tokens', 'messages', 'model', 'stream', 'stream_options', 'tool_choice', 'tools']
  [POST /v1/chat/completions] chat_template_kwargs=null
    body keys: ['max_completion_tokens', 'messages', 'model', 'stream', 'stream_options', 'tool_choice', 'tools']
  [POST /v1/chat/completions] chat_template_kwargs=null
    body keys: ['max_completion_tokens', 'messages', 'model', 'stream', 'stream_options', 'tool_choice', 'tools']

Gateway journal at startup confirms config is parsed correctly:
  [gateway] agent model: vllm/nemotron-3-super:120b (thinking=off, fast=off)

Gateway journal during the failed request:
  [agent/embedded] reasoning-only assistant turn detected: ... — retrying 1/2 with visible-answer continuation
  [agent/embedded] reasoning-only assistant turn detected: ... — retrying 2/2 with visible-answer continuation
  [agent/embedded] reasoning-only retries exhausted: ... attempts=2/2 — surfacing incomplete-turn error
  [model-fallback/decision] decision=candidate_failed requested=vllm/nemotron-3-super:120b reason=format
  [openai-compat] chat completion failed: FailoverError: vllm/nemotron-3-super:120b ended with an incomplete terminal response

Same prompt via `openclaw agent --agent main --thinking off` (CLI/embedded path) returns clean response: "ready" in 12s, no death-spiral. So the wrapper IS reachable on that path.

Impact and severity

Any caller using the openai-compat HTTP endpoint (POST /v1/chat/completions) with a vLLM-served Nemotron-3 model waits roughly 50 seconds and then receives HTTP 400 incomplete_result on every request. This includes any external client speaking the standard OpenAI API to the gateway. CLI users (openclaw agent) are unaffected because the wrapper fires on that path.

Workaround for affected users: use the CLI/embedded path with explicit --thinking off instead of the openai-compat HTTP endpoint, until this is fixed.

Additional information

Source-level hypothesis (not verified by reading minified dist source line-by-line, only by behavior + MITM capture): dist/openai-http-*.js (the openai-compat handler) dispatches via dist/agent-command-*.js, but the request-body construction strips chat_template_kwargs before forwarding to vLLM. The Nemotron wrapper at dist/stream-*.js::wrapVllmProviderStream + setNemotronThinkingOffChatTemplateKwargs only fires on the embedded agent runner path, not on the openai-compat REST handler.

Likely fix sites:

  • Apply wrapVllmProviderStream to the openai-compat handler's outbound transport before forwarding
  • OR ensure the request-body construction in the openai-compat handler preserves chat_template_kwargs from the model-provider config
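
Whichever fix site is chosen, a regression check on the captured outbound body could separate this report's failure (kwargs absent) from the #82768 variant (kwargs present with the wrong value). The sketch below is a hypothetical helper, not an existing OpenClaw test. It assumes the body has already been parsed from JSON, for example by the MITM proxy.

```python
# The kwargs the thinking-off wrapper is expected to inject, per this report.
REQUIRED_KWARGS = {"enable_thinking": False, "force_nonempty_content": True}


def check_outbound_body(body: dict) -> list:
    """Return a list of problems; an empty list means the kwargs are correct."""
    kwargs = body.get("chat_template_kwargs")
    if not isinstance(kwargs, dict):
        return ["chat_template_kwargs is missing"]
    problems = []
    for key, expected in REQUIRED_KWARGS.items():
        if key not in kwargs:
            problems.append(f"{key} is missing")
        elif kwargs[key] != expected:
            problems.append(f"{key} is {kwargs[key]!r}, expected {expected!r}")
    return problems


if __name__ == "__main__":
    # Keys as captured on the openai-compat path (values elided).
    http_path = {"max_completion_tokens": 10, "messages": [], "model": "nemotron-3-super:120b"}
    # Illustrative body for the wrong-value variant.
    wrong_value = {"model": "nemotron-3-super:120b", "chat_template_kwargs": {"enable_thinking": True}}
    print(check_outbound_body(http_path))
    print(check_outbound_body(wrong_value))
```

Running the same check against captures from both the embedded path and the openai-compat path would confirm whether a fix brings the two into parity.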

Related closed issues with similar but less severe symptoms:

  • #82768 — chat_template_kwargs.enable_thinking defaulting to true regardless of session thinkingLevel (wrong value, not missing entirely)
  • #71891 — vLLM/Nemotron response stored as thinking-only; assistantTexts empty (user-visible symptom, different root cause)
  • #71847 — Discord variant of #71891
  • #54983 — original vLLM thinking toggle feature add
