hermes - [Bug]: gateway api_server streaming bypasses server-side tool-call loop when chat_template_kwargs.enable_thinking=false (model emits tool name as plain text)

Root Cause

Net effect: streaming clients (Pipecat voice loops, Open WebUI, any OpenAI-compatible client that requests stream: true) cannot benefit from the latency win of disabling Qwen3 hybrid thinking (roughly 1.5s versus ~10.1s round-trip in the measurements reported here), because as soon as a turn needs a tool, the client receives the tool name as spoken or displayed text instead of the tool result.

Bug Description

When chat_template_kwargs.enable_thinking: false is set in the request body sent to the api_server platform (via custom_providers[].extra_body or the new platform_request_overrides from PR #34007 — same effect either way), the streaming code path (stream: true) returns the model's raw plain-text output instead of running the server-side agent tool-call loop. The model — typically a hybrid-thinking template like Qwen3 / GLM-4.6 / Hunyuan — emits the tool name as content (e.g. "GetDateTime") instead of a proper tool_calls array, and that text streams straight through to the client.

The non-streaming code path (stream: false) on the same gateway, same model, same request body, same enable_thinking: false works correctly — it runs the agent loop, executes the tool, loops, and returns the resolved final answer.

Steps to Reproduce

Run a local llama.cpp serving a Qwen3-derived model (or any hybrid-thinking model whose chat template honors chat_template_kwargs.enable_thinking). Expose it on http://127.0.0.1:8080/v1.

In ~/.hermes/config.yaml, set either:

custom_providers:
  - name: llama-local
    base_url: http://127.0.0.1:8080/v1
    extra_body:
      chat_template_kwargs:
        enable_thinking: false

or, with PR #34007 applied:

platform_request_overrides:
  api_server:
    extra_body:
      chat_template_kwargs:
        enable_thinking: false

Configure a tool the agent must call to answer the prompt — e.g. wire mcp-homeassistant with GetDateTime exposed (any tool will reproduce).
Start the gateway with the api_server platform enabled.

Send a request requiring tool use, streaming:

curl -sN -X POST http://127.0.0.1:8642/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hermes-agent",
    "stream": true,
    "messages": [
      {"role": "system", "content": "Use your tools when needed. Keep replies short."},
      {"role": "user", "content": "What time is it?"}
    ]
  }'

Compare to the same request with "stream": false.
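To make the comparison in the last step less error-prone, the streamed response can be folded back into a single reply and a tool-call count, which is then directly comparable to the non-streaming JSON body. The following is a minimal sketch, assuming the gateway emits OpenAI-style server-sent events (`data: {...}` lines terminated by `data: [DONE]`); the function name `assemble_stream` and the sample chunks are illustrative and not taken from the hermes codebase.

```python
import json
from typing import Iterable


def assemble_stream(lines: Iterable[str]) -> dict:
    """Fold OpenAI-style SSE 'data:' lines into final text and a tool-call count."""
    content_parts = []
    tool_call_indexes = set()
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                content_parts.append(delta["content"])
            # Tool-call deltas arrive in fragments that share an index.
            for call in delta.get("tool_calls") or []:
                tool_call_indexes.add(call.get("index", 0))
    return {"content": "".join(content_parts), "tool_calls": len(tool_call_indexes)}


# Made-up chunks that mimic the symptom: the tool name arrives as plain content.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Get"}}]}',
    'data: {"choices":[{"delta":{"content":"DateTime"}}]}',
    "data: [DONE]",
]
result = assemble_stream(sample)
print(result)  # {'content': 'GetDateTime', 'tool_calls': 0}
```

In practice the lines would come from the output of the curl command above (for example by iterating over `sys.stdin`), and the resulting `content` can be set side by side with the `reply` field of the `"stream": false` response.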

Expected Behavior

Streaming and non-streaming code paths should both run the server-side agent tool-call loop and return the resolved final answer. With the above repro, both should stream or return text along the lines of "It's 1:02 PM CDT on Thursday, May 28th."

Actual Behavior

Non-streaming (works correctly):

ROUND-TRIP 1.53s
reply: It's currently Thursday, May 28, 2026 at 10:42 AM.
tool_calls: 0
usage: {prompt_tokens: 17833, completion_tokens: 25, total_tokens: 17858}

Gateway ran the loop, called GetDateTime, returned the resolved final answer.

Streaming (broken — 3 fresh runs, same prompt and config):

run 1: "I don't have access to a clock tool in this session, but based on the conversation
       start time, it's Thursday, May 28, 2026 in your Round Rock, TX timezone (CDT, UTC-5),
       so approximately midday if we..."
run 2: "7:42 PM Central Time — Thursday, May 28, 2026. How can I help?"
run 3: "It's 9:47 PM Central Time."

(Actual wall-clock time at the moment of the runs was ~12:53 PM CDT — the model is either refusing, hallucinating a time, or emitting "GetDateTime" as plain text content. None of the streaming runs invoke the tool.)

Same prompt, same config, switching only stream: true -> stream: false makes the bug disappear. Switching only chat_template_kwargs.enable_thinking: false -> true (default) also makes the bug disappear (slow but correct in both streaming and non-streaming modes — measured ~10.1s round-trip).

The combination of all three (streaming + tool use + thinking off) is the trigger.

Affected Component

Gateway (Telegram/Discord/Slack/WhatsApp), Agent Core (conversation loop, context compression, memory)

(Specifically: gateway/platforms/api_server.py streaming code path + agent/conversation_loop.py tool-call extraction. The api_server platform is the most likely entry point because that's where streaming OpenAI-compat clients connect.)

Messaging Platform (if gateway-related)

N/A (CLI only)

(More accurately: api_server — not in the dropdown options. The same code path would affect any OpenAI-compatible streaming consumer of the gateway.)

Debug Report

Not attached — local config has personal API keys and memory contents I'd rather not paste publicly. Happy to share specific sanitized files on request. Environment: macOS 15.5 (Darwin 25.5), Python 3.11.15, hermes-agent on main at commit 1e71b7180 (with PR #34007 applied locally for the per-platform-override reproduction; bug also reproduces using only the existing custom_providers[].extra_body mechanism).

Related work

PR #34007 (open) — feat(agent): per-platform request_overrides adds the per-platform extra_body plumbing. The thinking-off measurement in that PR (1.5s round-trip) was taken with stream: false. The PR itself is correct; this bug is in the streaming code path that exists independently and would manifest with the existing global custom_providers[].extra_body knob too.
PR #12427 (open) — chat_template_kwargs.enable_thinking=false for llama.cpp / vLLM. Companion to this report — if #12427 lands without fixing the streaming agent-loop short-circuit, every user who flips that flag will hit this bug on streaming clients.

Suggested fix direction (not prescriptive)

Either:

1. Make the streaming api_server code path run the same agent tool-call loop the non-streaming path does, with the chunks streamed only after each tool result is resolved (or for the final answer only). This preserves streaming semantics from the client's perspective.
2. Make conversation_loop.py's tool-call parser tolerate Qwen3-style plain-text tool-name emission as a fallback when no tool_calls array is present. This is more invasive because it touches the agent core rather than just the streaming adapter.

(1) feels lower-risk because it's a localized api_server change and doesn't change semantics for any model that DOES emit proper tool_calls.
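The first option can be sketched roughly as follows: resolve every tool call with ordinary non-streamed model turns, then emit only the final answer as SSE chunks. This is an illustration under stated assumptions, not the hermes implementation. `run_tool_loop`, `stream_final_answer`, `fake_model`, and the `tools` mapping are all hypothetical names, and the message and chunk shapes simply follow the OpenAI chat-completions convention.

```python
import json
from typing import Callable, Iterator


def run_tool_loop(call_model: Callable[[list], dict],
                  tools: dict[str, Callable[..., str]],
                  messages: list,
                  max_rounds: int = 8) -> str:
    """Call the model until it stops requesting tools, then return the final text."""
    for _ in range(max_rounds):
        reply = call_model(messages)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content") or ""
        messages.append(reply)  # keep the assistant turn that requested the tools
        for call in calls:
            fn = call["function"]
            args = json.loads(fn.get("arguments") or "{}")
            result = tools[fn["name"]](**args)
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": result})
    raise RuntimeError("tool loop did not converge")


def stream_final_answer(text: str, chunk_size: int = 16) -> Iterator[str]:
    """Yield the already-resolved answer as OpenAI-style SSE chunks."""
    for i in range(0, len(text), chunk_size):
        delta = {"choices": [{"index": 0, "delta": {"content": text[i:i + chunk_size]}}]}
        yield f"data: {json.dumps(delta)}\n\n"
    yield "data: [DONE]\n\n"


# Stand-in for the real model call: requests the tool once, then answers.
def fake_model(messages: list) -> dict:
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": "It's " + messages[-1]["content"] + "."}
    return {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "GetDateTime", "arguments": "{}"}}]}


tools = {"GetDateTime": lambda: "1:02 PM CDT"}
answer = run_tool_loop(fake_model, tools, [{"role": "user", "content": "What time is it?"}])
chunks = list(stream_final_answer(answer, chunk_size=8))
```

The trade-off in this shape is that the client sees no tokens until the loop has finished, so time-to-first-token grows with the number of tool rounds; streaming intermediate turns as well would reduce that at the cost of a more involved adapter.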
