hermes - [Bug]: gateway api_server streaming bypasses server-side tool-call loop when chat_template_kwargs.enable_thinking=false (model emits tool name as plain text)

Bug Description

When chat_template_kwargs.enable_thinking: false is set in the request body sent to the api_server platform (via custom_providers[].extra_body or the new platform_request_overrides from PR #34007; the effect is the same either way), the streaming code path (stream: true) returns the model's raw plain-text output instead of running the server-side agent tool-call loop. The model, typically one with a hybrid-thinking chat template such as Qwen3, GLM-4.6, or Hunyuan, emits the tool name as content (e.g. "GetDateTime") instead of a proper tool_calls array, and that text streams straight through to the client.

The non-streaming code path (stream: false), with the same gateway, model, request body, and enable_thinking: false, works correctly: it runs the agent loop, executes the tool, loops, and returns the resolved final answer.

Net effect: streaming clients (Pipecat voice loops, Open WebUI, anything using OpenAI's default streaming) cannot benefit from the substantial latency win of disabling Qwen3 hybrid thinking, because the moment a turn needs a tool, the client gets the tool name as spoken/displayed text instead of the tool result.
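
To make the symptom concrete from a client's point of view, here is a minimal sketch of how an OpenAI-compatible streaming consumer folds `data:` chunks into a final message. The helper name accumulate_stream and both sample streams are illustrative assumptions, not output captured from the gateway. They only show the difference between a tool name arriving as content and a tool call arriving as tool_calls deltas.

```python
import json


def accumulate_stream(sse_lines):
    """Fold OpenAI-style `data: {...}` SSE lines into (content, tool_calls)."""
    content_parts = []
    tool_calls = {}  # delta index -> {"name": str, "arguments": str}
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                content_parts.append(delta["content"])
            for call in delta.get("tool_calls") or []:
                slot = tool_calls.setdefault(
                    call.get("index", 0), {"name": "", "arguments": ""}
                )
                fn = call.get("function", {})
                slot["name"] += fn.get("name") or ""
                slot["arguments"] += fn.get("arguments") or ""
    return "".join(content_parts), [tool_calls[i] for i in sorted(tool_calls)]


# Illustrative stream for the reported symptom: the tool name is plain content.
BROKEN_STREAM = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "GetDate"}}]}',
    'data: {"choices": [{"delta": {"content": "Time"}}]}',
    "data: [DONE]",
]

# Illustrative stream for a well-formed tool call, for contrast.
PROPER_STREAM = [
    'data: {"choices": [{"delta": {"tool_calls": [{"index": 0, "id": "call_1", '
    '"type": "function", "function": {"name": "GetDateTime", "arguments": ""}}]}}]}',
    'data: {"choices": [{"delta": {"tool_calls": [{"index": 0, '
    '"function": {"arguments": "{}"}}]}}]}',
    "data: [DONE]",
]
```

With these samples, the first stream yields the text "GetDateTime" and an empty tool-call list, which is what a voice loop would speak aloud. The second yields empty text and one GetDateTime call.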

Steps to Reproduce

  1. Run a local llama.cpp serving a Qwen3-derived model (or any hybrid-thinking model whose chat template honors chat_template_kwargs.enable_thinking). Expose it on http://127.0.0.1:8080/v1.
  2. In ~/.hermes/config.yaml, set either:
    custom_providers:
      - name: llama-local
        base_url: http://127.0.0.1:8080/v1
        extra_body:
          chat_template_kwargs:
            enable_thinking: false
    or, with PR #34007 applied:
    platform_request_overrides:
      api_server:
        extra_body:
          chat_template_kwargs:
            enable_thinking: false
  3. Configure a tool the agent must call to answer the prompt — e.g. wire mcp-homeassistant with GetDateTime exposed (any tool will reproduce).
  4. Start the gateway with the api_server platform enabled.
  5. Send a request requiring tool use, streaming:
    curl -sN -X POST http://127.0.0.1:8642/v1/chat/completions \
      -H "Authorization: Bearer $KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "hermes-agent",
        "stream": true,
        "messages": [
          {"role": "system", "content": "Use your tools when needed. Keep replies short."},
          {"role": "user", "content": "What time is it?"}
        ]
      }'
  6. Compare to the same request with "stream": false.
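
Steps 5 and 6 can also be scripted so that the two modes are compared back to back. The sketch below uses only the Python standard library and mirrors the curl request above. The function names and the 120-second timeout are assumptions for illustration, and the key is whatever bearer token your gateway expects.

```python
import json
import urllib.request

GATEWAY_URL = "http://127.0.0.1:8642/v1/chat/completions"


def build_payload(stream):
    """Same body as the curl repro; only the stream flag varies."""
    return {
        "model": "hermes-agent",
        "stream": stream,
        "messages": [
            {"role": "system",
             "content": "Use your tools when needed. Keep replies short."},
            {"role": "user", "content": "What time is it?"},
        ],
    }


def build_request(stream, api_key):
    """Build the POST request without sending it."""
    body = json.dumps(build_payload(stream)).encode("utf-8")
    return urllib.request.Request(
        GATEWAY_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def compare(api_key, timeout=120):
    """Send the request in both modes and print the raw response bodies."""
    for stream in (False, True):
        request = build_request(stream, api_key)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            raw = response.read().decode("utf-8")
        print(f"--- stream={stream} ---")
        print(raw)
```

Calling compare(os.environ["KEY"]) against a running gateway prints the JSON response for stream: false followed by the raw SSE body for stream: true, so the two answers can be read side by side.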

Expected Behavior

Streaming and non-streaming code paths should both run the server-side agent tool-call loop and return the resolved final answer. With the above repro, both should stream/return text along the lines of "It's 1:02 PM CDT on Thursday, May 28th.".

Actual Behavior

Non-streaming (works correctly):

ROUND-TRIP 1.53s
reply: It's currently Thursday, May 28, 2026 at 10:42 AM.
tool_calls: 0
usage: {prompt_tokens: 17833, completion_tokens: 25, total_tokens: 17858}

The gateway ran the loop, called GetDateTime, and returned the resolved final answer.

Streaming (broken — 3 fresh runs, same prompt and config):

run 1: "I don't have access to a clock tool in this session, but based on the conversation
       start time, it's Thursday, May 28, 2026 in your Round Rock, TX timezone (CDT, UTC-5),
       so approximately midday if we..."
run 2: "7:42 PM Central Time — Thursday, May 28, 2026. How can I help?"
run 3: "It's 9:47 PM Central Time."

(Actual wall-clock time at the moment of the runs was ~12:53 PM CDT — the model is either refusing, hallucinating a time, or emitting "GetDateTime" as plain text content. None of the streaming runs invoke the tool.)

With the same prompt and config, switching only stream: true -> stream: false makes the bug disappear. Switching only chat_template_kwargs.enable_thinking: false -> true (the default) also makes it disappear: streaming and non-streaming modes are then both correct but slow (measured ~10.1s round-trip).

The combination of all three (streaming + tool use + thinking off) is the trigger.

Affected Component

Gateway (Telegram/Discord/Slack/WhatsApp), Agent Core (conversation loop, context compression, memory)

(Specifically: gateway/platforms/api_server.py streaming code path + agent/conversation_loop.py tool-call extraction. The api_server platform is the most likely entry point because that's where streaming OpenAI-compat clients connect.)

Messaging Platform (if gateway-related)

N/A (CLI only)

(More accurately: api_server — not in the dropdown options. The same code path would affect any OpenAI-compatible streaming consumer of the gateway.)

Debug Report

Not attached — local config has personal API keys and memory contents I'd rather not paste publicly. Happy to share specific sanitized files on request. Environment: macOS 15.5 (Darwin 25.5), Python 3.11.15, hermes-agent on main at commit 1e71b7180 (with PR #34007 applied locally for the per-platform-override reproduction; bug also reproduces using only the existing custom_providers[].extra_body mechanism).

Related work

  • PR #34007 (open): feat(agent): per-platform request_overrides. It adds the per-platform extra_body plumbing. The thinking-off measurement in that PR (1.5s round-trip) was taken with stream: false. The PR itself is correct; this bug is in the streaming code path, which exists independently and would manifest with the existing global custom_providers[].extra_body knob too.
  • PR #12427 (open): chat_template_kwargs.enable_thinking=false for llama.cpp / vLLM. This is a companion to this report: if #12427 lands without fixing the streaming agent-loop short-circuit, every user who flips that flag will hit this bug on streaming clients.

Suggested fix direction (not prescriptive)

Either:

  1. Make the streaming api_server code path run the same agent tool-call loop as the non-streaming path, streaming chunks only after each tool result is resolved (or only for the final answer). This preserves streaming semantics from the client's perspective.
  2. Make conversation_loop.py's tool-call parser tolerate Qwen3-style plain-text tool-name emission as a fallback when no tool_calls array is present. This is more invasive: it touches the agent core rather than just the streaming adapter.
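
As a rough illustration of what option 2 might look like, the sketch below recovers a tool call only when the whole assistant reply is exactly one registered tool name. The function name recover_plain_text_tool_call and the message shape (OpenAI-style dicts) are assumptions; this is not the actual conversation_loop.py parser, and it only covers tools that take no arguments.

```python
def recover_plain_text_tool_call(message, known_tools):
    """Return a synthetic tool call if content is exactly a known tool name.

    Hypothetical fallback: it deliberately ignores replies that merely
    mention a tool name inside a longer sentence, to avoid false positives.
    """
    if message.get("tool_calls"):
        return None  # a proper tool_calls array always takes precedence
    text = (message.get("content") or "").strip()
    if text in known_tools:
        return {
            "type": "function",
            "function": {"name": text, "arguments": "{}"},
        }
    return None
```

For example, a reply whose content is "GetDateTime" maps to a GetDateTime call with empty arguments, while "It's 9:47 PM Central Time." returns None and is passed through as ordinary text. The exact-match rule is the conservative choice here. Anything looser risks executing tools the model only talked about.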

(1) feels lower-risk because it's a localized api_server change and doesn't change semantics for any model that DOES emit proper tool_calls.
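
A minimal sketch of option 1, assuming the streaming adapter can call the same blocking completion and tool-execution helpers the non-streaming path uses. Here they are injected as the hypothetical callables complete and run_tool. Messages follow the OpenAI chat format. The loop resolves every tool call first and only then emits the final answer in chunks.

```python
def stream_resolved_answer(messages, complete, run_tool,
                           max_rounds=8, chunk_size=16):
    """Run the tool-call loop to completion, then yield the answer in chunks.

    complete(history) -> assistant message dict (assumed, non-streaming)
    run_tool(call)    -> tool result as a string (assumed)
    """
    history = list(messages)  # do not mutate the caller's list
    for _ in range(max_rounds):
        reply = complete(history)
        calls = reply.get("tool_calls") or []
        if not calls:
            # Final answer: only now does anything reach the client.
            text = reply.get("content") or ""
            for start in range(0, len(text), chunk_size):
                yield text[start:start + chunk_size]
            return
        history.append(reply)
        for call in calls:
            history.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": run_tool(call),
            })
    raise RuntimeError("tool-call loop did not finish within max_rounds")
```

This variant trades first-token latency for correctness, since the client sees nothing until the loop finishes. It also relies on the upstream call returning proper tool_calls in non-streaming mode, which matches the non-streaming behavior reported above.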
