hermes - 💡(How to fix) Fix Matrix gateway: no in-band channel to drive per-message LLM orchestration in a downstream dispatcher

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

We wrote a local Hermès plugin (hermes-sovepro) that attempts to handle a related case: when a transient cloud error happens, store the user's last question (15 min TTL, indexed by Matrix room), and on the next /model X from the user, rewrite event.text to /model X\n<stored question> so the gateway re-runs the question on the new model in one shot. Tests passed in synthetic mode (8/8 pytest), but live testing exposed two blockers:

Root Cause

We didn't dig further because the request below makes the internal mechanism less critical: the request is for a different, cleaner mechanism that doesn't rely on the in-process override map at all.

Fix Action

Fix / Workaround

Behind Hermès, we run a custom OpenAI-compatible LLM dispatcher that mediates every chat completions call the agent makes. The dispatcher's purpose is to enable concurrent use of local and cloud LLMs, routed automatically per request according to rules we define: admission policy, queue priorities, context-size thresholds, safety margins, fallbacks. Default routing is local Ollama (fast, free, available); Ollama Cloud Pro is selected when the rules say so (e.g., harder questions, larger context, local saturated, low-latency requirement on the room).

The dispatcher logs every served call, including the model field that Hermès actually requested. This is what gives us hard ground-truth about which model Hermès chose for any given user message.

In this setup, /model <provider/model> typed by a user in the Matrix room is the natural in-band lever for per-message override on top of those automatic dispatcher rules — typically: "for this specific, harder question, force a more capable cloud model, then revert to the normal automatic routing afterwards". It's a chat-driven UX: the team should be able to nudge the routing without leaving the Matrix room, without editing config files, without restarting anything.

Code Example

ts=...10:48:55.886Z  model=glm-4.7-flash:q4_K_M  status=200  latency=6816ms
ts=...10:48:56.363Z  model=glm-4.7-flash:q4_K_M  status=200  latency=452ms

---

{
  "model": "ollama-cloud/glm-5.1",
  "messages": [...],
  "metadata": {
    "hermes_room_id": "!IzxRltiFyqGSkTtDKP:chat.sovepro.eu",
    "hermes_user_id": "@a1:chat.sovepro.eu",
    "hermes_session_id": "matrix-room-abc-...",
    "hermes_user_intent": "force-cloud",
    "hermes_command_origin": "/model"
  }
}
RAW_BUFFERClick to expand / collapse

Context — how we use Hermès

Our deployment runs Hermès Agent as a long-lived profile (hermes profile create) with the Matrix gateway enabled. The profile is connected to a private Matrix homeserver (Synapse 1.151.0). An operations team interacts with Hermès in a single dedicated Matrix room via Element clients — that room is the team's main interface to the agent.

Behind Hermès, we run a custom OpenAI-compatible LLM dispatcher that mediates every chat completions call the agent makes. The dispatcher's purpose is to enable concurrent use of local and cloud LLMs, routed automatically per request according to rules we define: admission policy, queue priorities, context-size thresholds, safety margins, fallbacks. Default routing is local Ollama (fast, free, available); Ollama Cloud Pro is selected when the rules say so (e.g., harder questions, larger context, local saturated, low-latency requirement on the room).

The dispatcher logs every served call, including the model field that Hermès actually requested. This is what gives us hard ground-truth about which model Hermès chose for any given user message.

In this setup, /model <provider/model> typed by a user in the Matrix room is the natural in-band lever for per-message override on top of those automatic dispatcher rules — typically: "for this specific, harder question, force a more capable cloud model, then revert to the normal automatic routing afterwards". It's a chat-driven UX: the team should be able to nudge the routing without leaving the Matrix room, without editing config files, without restarting anything.

That's the use case we want to support. Today we cannot.

/model in v2026.5.7 — two documented forms

Per gateway/run.py:_handle_model_command (line 7871 in v2026.5.7):

  • /model X (default, session-only): in-memory override stored in _session_model_overrides[session_key]. No file write. Reverts at session end.
  • /model X --global: writes the new default to the profile's config.yaml (model.default). Persistent.

Both forms reply in chat with Model switched to <X>, ... (session only — add --global to persist) or its --global equivalent.

What we observe

Neither form actually changes the model used by subsequent LLM calls as seen by our dispatcher. The chat confirms the switch; the next call goes out with the previously configured model.default from config.yaml. The user has no in-chat signal that the switch had no effect — the chat says one thing, the dispatcher logs say another.

Reproduced on tag v2026.5.7 (latest stable as of 2026-05-09) and on main HEAD (with 1187 commits beyond v2026.4.30). Not a regression — pre-existing on multiple released versions.

Concretely, after a user types /model ollama-cloud/glm-5.1 and sends a follow-up question, the dispatcher logs show:

ts=...10:48:55.886Z  model=glm-4.7-flash:q4_K_M  status=200  latency=6816ms
ts=...10:48:56.363Z  model=glm-4.7-flash:q4_K_M  status=200  latency=452ms

glm-4.7-flash:q4_K_M is model.default for this profile. ollama-cloud/glm-5.1 is what the user requested. Same observation with --global.

Why we don't pinpoint the exact internal cause

Reading gateway/run.py, the code looks correct on paper:

  • _handle_model_command populates _session_model_overrides[session_key] = {model, provider, api_key, base_url, api_mode} and calls _evict_cached_agent(session_key) (line 8019) to force a rebuild.
  • _resolve_model_runtime (line 1630) reads _session_model_overrides.get(resolved_session_key) and is documented to honor session-scoped overrides.
  • The 4 _session_model_overrides.pop() we found are all attached to legitimate session boundaries (auto-reset on compression exhaustion, explicit /new, etc.) — not silent erasure on every message.

Without verbose internal logs of the gateway, we can't tell whether the rupture is:

  • (a) a session_key mismatch between store-time and lookup-time (different normalisation paths);
  • (b) the agent rebuild path bypasses _resolve_model_runtime;
  • (c) the override is resolved correctly but later overwritten in the payload assembly;
  • (d) something else entirely.

We didn't dig further because the request below makes the internal mechanism less critical: the request is for a different, cleaner mechanism that doesn't rely on the in-process override map at all.

What we'd like — an in-band channel in the OpenAI payload

Today, the only signal Hermès passes to anything in front of the LLM is the model field of the OpenAI chat completions request. Everything else (user identity, room, intent, priority, override hints) is invisible to a downstream dispatcher.

A clean solution would be: when Hermès assembles its outbound chat completions call, it includes a documented set of orchestration hints in a standard pass-through field. For example, using OpenAI's standard metadata object:

{
  "model": "ollama-cloud/glm-5.1",
  "messages": [...],
  "metadata": {
    "hermes_room_id": "!IzxRltiFyqGSkTtDKP:chat.sovepro.eu",
    "hermes_user_id": "@a1:chat.sovepro.eu",
    "hermes_session_id": "matrix-room-abc-...",
    "hermes_user_intent": "force-cloud",
    "hermes_command_origin": "/model"
  }
}

(Or via extra_body.hermesextra_body is already used by Hermès for provider, thinking, reasoning in agent/transports/chat_completions.py. The exact transport is your call.)

The dispatcher in front then reads these hints and applies its routing policy. Hermès no longer needs to maintain an internal override map for chat-driven model selection — it propagates the user choice into the payload, and the boundary that actually serves the call decides what to do with it.

Why this benefits the wider Hermès ecosystem (not just our deployment)

This isn't a Sovepro-specific need. Many self-hosted Hermès users have something between Hermès and the LLM:

  • LiteLLM proxy (very common): today routes on model only. With hints, it can route on user, intent, priority, conversation.
  • Cost-tracking proxies (Helicone, Langfuse, Portkey, ...): today see the model and the messages, with no way to attribute calls to a Matrix room or a Hermès user beyond inspecting message content. With hints, attribution is structural and reliable.
  • Observability stacks (Langfuse, LangSmith, Datadog, OpenTelemetry exporters): same point — proper trace metadata becomes possible.
  • Multi-tenant Hermès deployments (single process, multiple teams/rooms): today routing must be uniform; with hints, per-tenant rules become possible without forking the gateway.
  • Cost optimization patterns ("default cheap, override cloud on hard questions"): exactly our use case, but a generic pattern any cost-conscious self-hoster will want.
  • A/B testing models (metadata.experiment_arm: A): impossible today without hacks.
  • Plugin orchestration: a Hermès plugin currently has no clean way to influence the outgoing call. With hints, a plugin can append metadata that downstream proxies act on, without patching core.
  • Sticky session by conversation_id: the metadata.conversation_id field is already a known OpenAI pattern for this; making Hermès propagate it would address several long-standing requests at once.

In short: today's design implicitly couples "the agent picks the model" with "the call sent out is opaque to anything in front". Decoupling those — even just by documenting a metadata pass-through convention — unlocks an entire layer of self-hosted patterns that are currently impossible or hacky.

Relationship to commit 06a05a996

We noticed 06a05a996 on branch origin/hermes/hermes-ae6184d3, which removes /model entirely from the CLI and gateway, redirecting users to hermes model (out-of-band CLI) or direct config.yaml editing. We understand the motivation — the in-process override map is awkward.

But removing /model without offering an in-band alternative leaves chat-driven users (Matrix, Discord, Slack, Telegram) without any way to influence routing per-message. The fix this issue requests is complementary to 06a05a996, not opposed: if the in-process override map is too messy, fine, retire it — and provide a documented pass-through to a downstream layer instead. That preserves the user-facing capability via a cleaner mechanism.

What we tried locally

We wrote a local Hermès plugin (hermes-sovepro) that attempts to handle a related case: when a transient cloud error happens, store the user's last question (15 min TTL, indexed by Matrix room), and on the next /model X from the user, rewrite event.text to /model X\n<stored question> so the gateway re-runs the question on the new model in one shot. Tests passed in synthetic mode (8/8 pytest), but live testing exposed two blockers:

  1. The rewrite produces a multi-line message that the gateway's slash parser rejects (see related Issue #22716).
  2. Even if the parser accepted it, our observation here is that /model X has no visible effect on the next dispatcher call anyway.

So we have the full chain coded locally, and the chain doesn't work because the two upstream issues prevent it. The plugin is currently disabled in our runtime, waiting on a path forward upstream.

Reproduction (full)

In a Matrix room where a Hermès Matrix gateway is connected:

  1. Message 1 (user): /model ollama-cloud/glm-5.1
  2. Gateway reply: Model switched to ollama-cloud/glm-5.1, Provider: <provider>, Context: 65,536 tokens (session only — add --global to persist)
  3. Message 2 (user) a few seconds later: Bonjour test, repond OK
  4. Gateway reply: OK
  5. Inspect the LLM access log on the dispatcher / proxy in front of Hermès for the call(s) Hermès made to answer Message 2.

Expected: model=ollama-cloud/glm-5.1 in the served call. Actual: model=<config.yaml model.default>.

Same outcome with --global form (/model ollama-cloud/glm-5.1 --global).

Environment

  • hermes-agent at tag v2026.5.7 (also reproduced on main HEAD as of 2026-05-09 with 1187 commits beyond v2026.4.30).
  • Matrix gateway profile, deployed via hermes profile create + custom config.yaml.
  • Synapse 1.151.0 backend, mautrix==0.21.0.
  • Custom OpenAI-compatible dispatcher in front of Ollama local + Ollama Cloud Pro (logs every served call).

Related issue

  • Issue #22716: in the same Matrix gateway, slash command + question on a new line in one message is rejected. Independently meaningful and a blocker for any future in-band pattern.

Happy to provide more dispatcher logs, gateway internal logs, or test from a fresh profile if that helps reproduce or design on your side.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - 💡(How to fix) Fix Matrix gateway: no in-band channel to drive per-message LLM orchestration in a downstream dispatcher