hermes: Possible interrupt/auxiliary-compression reliability issue causing stuck sessions and misleading partial outage state

GitHub stats: NousResearch/hermes-agent#14722, fetched 2026-04-24 06:15:01
Comments: 0 · Participants: 1 · Timeline events: 4 (labeled ×4) · Reactions: 0


Summary

I ran into a Hermes failure mode where the system appeared to “stop,” but the outage was only partial:

  • UI surfaces were still up
    • Hermes dashboard on port 8080
    • Hermes Workspace on port 3000
  • The actual Hermes gateway/chat service on port 8642 was down

This made Hermes look completely dead from the user side even though the UI was still reachable.

At the same time, logs suggested a possible upstream reliability issue involving:

  • interrupt handling
  • auxiliary compression
  • model fallback behavior
  • stale/hung CLI session cleanup

Environment

  • Hermes version: v0.10.0 (2026.4.16)
  • Install type: git-installed
  • Source path: ~/.hermes/hermes-agent
  • Model: gpt-5.4 via OpenAI Codex
  • Host: Ubuntu 24.04 on Oracle ARM/KVM
  • Gateway managed as: systemd --user service
  • Workspace also in use on the same machine

What happened

Hermes appeared to stop responding to chat, but investigation showed:

  • hermes-gateway.service was inactive
  • the dashboard/UI processes were still running
  • the actual chat/gateway path on port 8642 was down
  • there was also a stale long-running Hermes CLI process tied to an old session

This created a confusing “partial outage” condition:

  • 3000 and 8080 looked healthy
  • 8642 was the real failed component
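The partial-outage condition above can be detected by probing each port separately instead of trusting whichever surface happens to load. The following is a minimal sketch, not part of Hermes: the port numbers come from this report, while the function names and service labels are hypothetical.

```python
import socket
from typing import Dict, Optional

# Ports taken from the report; the labels are only descriptive.
SERVICES: Dict[str, int] = {"dashboard": 8080, "workspace": 3000, "gateway": 8642}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def probe_services(
    host: str = "127.0.0.1", services: Optional[Dict[str, int]] = None
) -> Dict[str, bool]:
    """Probe every service port and report which ones accept connections."""
    services = SERVICES if services is None else services
    return {name: port_is_open(host, port) for name, port in services.items()}
```

On a host in the state described here, `probe_services()` would be expected to report the dashboard and Workspace as reachable and the gateway as not. An open port only shows that something is listening, so it is a weaker signal than an application-level health check.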

Relevant observations from logs / behavior

I saw signs of auxiliary compression/fallback trouble around the same time:

  • compression was attempted with provider nous and model moonshotai/kimi-k2.6
  • the request timed out
  • the fallback to Gemini via OpenRouter returned HTTP 400
  • one run logged:
    • Agent thread still alive after interrupt

There was also a stale stuck CLI session that had to be cleaned up manually afterward.
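One plausible way to end up with a log line like "Agent thread still alive after interrupt" is a worker that only checks its stop flag between blocking calls. The sketch below illustrates that general mechanism with the standard `threading` module; it is an assumption about the class of bug, not a reproduction of Hermes code, and `run_agent` and `interrupt` here are hypothetical names.

```python
import threading
from typing import Callable


def run_agent(stop: threading.Event, blocking_call: Callable[[], object]) -> None:
    """Hypothetical agent loop: the stop flag is only checked between calls."""
    while not stop.is_set():
        blocking_call()  # e.g. a compression request that may hang


def interrupt(thread: threading.Thread, stop: threading.Event, grace: float) -> bool:
    """Request a stop, wait up to `grace` seconds, and report whether the thread exited."""
    stop.set()
    thread.join(timeout=grace)
    # join() returning does not mean the thread finished; is_alive() must be checked.
    return not thread.is_alive()
```

If `blocking_call` is something like `time.sleep(180)` or an HTTP request without a short timeout, `interrupt` returns False because the thread cannot observe the flag until the call returns. If the wait is written as `stop.wait(180)` instead, setting the event wakes the thread immediately.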

Local factors discovered during diagnosis

One local issue complicated debugging but does not appear to be the root Hermes bug:

  • Hermes Workspace had a bad local patch in:
    • ~/hermes-workspace/src/server/gateway-capabilities.ts
  • that patch forced sessions, skills, config, and jobs capabilities to true
  • this made Workspace behave as if backend support existed even when the gateway on 8642 was not actually serving what Workspace expected

That was corrected locally by restoring real capability probing.

I’m including this only as context because it made diagnosis noisier; I do not think this specific patch is a Hermes core bug.
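The difference between forcing capabilities to true and probing them can be shown with a short sketch. The real file is TypeScript and its contents are not shown in this report, so the following Python is only an illustration of the idea: `probe_capabilities`, the `fetch_status` callback, and any route mapping passed to it are hypothetical.

```python
from typing import Callable, Dict


def probe_capabilities(
    routes: Dict[str, str], fetch_status: Callable[[str], int]
) -> Dict[str, bool]:
    """Mark a capability True only if its probe route answers with HTTP 2xx.

    `routes` maps a capability name to the backend route used to test it.
    `fetch_status` performs the request and returns the HTTP status code.
    """
    capabilities: Dict[str, bool] = {}
    for name, route in routes.items():
        try:
            status = fetch_status(route)
        except OSError:
            # Connection refused, timeout, and similar failures count as unsupported.
            capabilities[name] = False
            continue
        capabilities[name] = 200 <= status < 300
    return capabilities
```

With probing in place, a gateway that is down yields all-false capabilities, so the UI can show that the backend is unavailable instead of behaving as if support existed.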

Fixes / mitigation applied locally

  1. Restarted hermes-gateway.service
    • verified /health on 8642 returned OK
  2. Corrected the local Workspace capability patch
    • restored real capability probing
    • rebuilt Workspace successfully
  3. Verified Workspace API routes after rebuild
    • /api/sessions
    • /api/skills
    • /api/config
  4. Mitigated auxiliary compression instability with this config:

auxiliary:
  compression:
    provider: nous
    model: openai/gpt-5.4
    timeout: 180
  5. Restarted gateway again after config change
  6. Cleaned up stale stuck Hermes CLI process
    • old PID was manually terminated
    • stuck session ended cleanly afterward
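The /health verification from the first step can be scripted so it is easy to repeat after each restart. This is a minimal sketch using only the standard library. The port and the /health path come from this report; the response body format is not documented here, so the sketch checks only for HTTP 200, and `gateway_healthy` is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def gateway_healthy(base_url: str = "http://127.0.0.1:8642", timeout: float = 3.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    # An empty ProxyHandler bypasses any proxy configured in the environment,
    # which is usually what is wanted for a localhost check.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx responses raise HTTPError (a URLError); refused connections
        # and timeouts surface as URLError or OSError.
        return False
```

Running this check after each gateway restart, and again after the config change, confirms that the chat path itself is serving and not just the UI ports.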

Current status after mitigation

  • port 8642 healthy
  • Workspace on 3000 healthy
  • dashboard on 8080 healthy
  • old stuck session no longer hanging
  • new sessions appear healthier with the safer auxiliary compression config

Suspected upstream issue

The part that seems worth investigating upstream is:

  • auxiliary compression timeout/fallback may interact badly with interrupt handling
  • this may leave an agent thread/session in a stuck or partially interrupted state
  • the resulting system state can be confusing because UI surfaces remain up while the actual chat path is unhealthy
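To make the suspected interaction concrete, here is a sketch of a fallback chain with a per-provider timeout and a cleanup step that runs on every exit path. It shows the general pattern only and is not Hermes code: `compress_with_fallback`, the provider callables, and the `cleanup` callback are all hypothetical.

```python
import concurrent.futures as cf
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]


def compress_with_fallback(
    text: str,
    providers: Sequence[Provider],
    timeout: float,
    cleanup: Callable[[], None],
) -> str:
    """Try each provider in order; always run cleanup, even if all of them fail."""
    errors: List[str] = []
    # One worker per provider so a hung call cannot block the next attempt.
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(providers)))
    try:
        for name, call in providers:
            future = pool.submit(call, text)
            try:
                return future.result(timeout=timeout)
            except cf.TimeoutError:
                # The timed-out call keeps running in its worker thread;
                # result(timeout=...) only stops waiting for it.
                errors.append(f"{name}: timed out after {timeout}s")
            except Exception as exc:
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all compression providers failed: " + "; ".join(errors))
    finally:
        pool.shutdown(wait=False)
        cleanup()  # release session/thread state regardless of outcome
```

The comment in the timeout branch is the important part: a timeout on the waiting side does not stop the work that timed out. If interrupt handling assumes the request has ended once the timeout fires, a live thread can be left behind, which would be consistent with the log line quoted earlier.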

Suggested investigation areas

Likely code paths to inspect:

  • cli.py
    • interrupt path / cleanup behavior
  • agent/context_compressor.py
    • timeout handling
    • failure cleanup
    • auxiliary compression recovery
  • run_agent.py
    • auxiliary model selection
    • fallback behavior
    • post-failure session/thread cleanup

Developer-facing questions

  1. Should an auxiliary compression timeout ever be able to leave a live agent thread behind after interrupt?
  2. Is fallback after compression failure expected to cross providers in this way, and if so, is the cleanup path guaranteed?
  3. Could Hermes expose clearer degraded-state reporting when:
    • UI is alive
    • but gateway/chat on 8642 is down?
  4. Should the gateway or CLI be more aggressive about cleaning up stale sessions after:
    • interrupt
    • compression timeout
    • fallback failure?
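For question 3, one simple shape for degraded-state reporting is a function that folds per-component checks into a single state plus the list of failed components. This is a sketch under assumptions: the component names and the three state labels are hypothetical and not an existing Hermes API.

```python
from typing import Dict, List, Union


def summarize_health(components: Dict[str, bool]) -> Dict[str, Union[str, List[str]]]:
    """Fold per-component up/down flags into 'ok', 'degraded', or 'down'."""
    failed = sorted(name for name, is_up in components.items() if not is_up)
    if not failed:
        state = "ok"
    elif len(failed) == len(components):
        state = "down"
    else:
        state = "degraded"
    return {"state": state, "failed": failed}
```

For the situation in this report, `summarize_health({"dashboard": True, "workspace": True, "gateway": False})` returns `{"state": "degraded", "failed": ["gateway"]}`, which a UI could surface directly instead of appearing healthy.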

Why this matters

From the user perspective, this looks like:

  • “Hermes randomly stopped”

But operationally it may actually be:

  • gateway/chat path failed
  • UI stayed up
  • auxiliary compression/interrupt logic may have left stale state behind

That combination is hard to diagnose quickly and could be improved either through:

  • a bugfix in the cleanup/interrupt path
  • clearer health reporting
  • or both

If useful, I can also provide:

  • the stuck session ID
  • the stale CLI PID that was cleaned up
  • the exact local config mitigation that reduced recurrence

Extent analysis

TL;DR

A likely fix is to harden interrupt handling and the auxiliary compression timeout/fallback path in Hermes so that neither can leave stale agent threads or sessions behind. The root cause has not been confirmed.

Guidance

  • Investigate the cli.py, agent/context_compressor.py, and run_agent.py code paths to identify potential issues with interrupt handling, timeout handling, and fallback behavior.
  • Consider implementing clearer degraded-state reporting when the UI is alive but the gateway/chat on port 8642 is down.
  • Review the auxiliary compression configuration to ensure it is properly set up to handle timeouts and fallbacks without leaving stale state behind.
  • Verify that the gateway or CLI is aggressively cleaning up stale sessions after interrupt, compression timeout, or fallback failure.
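The stale-session cleanup in the last point could start from a selection rule like the one sketched below. How Hermes tracks sessions is not described in this report, so the `SessionInfo` record, its fields, and the staleness rule (interrupted and idle beyond a threshold) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SessionInfo:
    """Hypothetical session record; field names are illustrative only."""
    session_id: str
    pid: int
    last_activity: float  # Unix timestamp of the last recorded activity
    interrupted: bool     # True once an interrupt has been requested


def find_stale_sessions(
    sessions: Iterable[SessionInfo], now: float, max_idle: float
) -> List[SessionInfo]:
    """Return sessions that were interrupted and have been idle longer than max_idle seconds."""
    return [
        s for s in sessions
        if s.interrupted and (now - s.last_activity) > max_idle
    ]
```

A reaper would then end each returned session and terminate its process, which is what was done by hand in this case. Keeping the selection rule separate from the termination step makes it possible to log or dry-run the candidates first.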

Example

The issue body includes one configuration block, which the reporter applied as a local mitigation:

auxiliary:
  compression:
    provider: nous
    model: openai/gpt-5.4
    timeout: 180

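This configuration pins auxiliary compression to provider nous with model openai/gpt-5.4 (the failing run had used moonshotai/kimi-k2.6) and sets timeout to 180, presumably seconds. The reporter says it reduced recurrence, but it does not address the underlying interrupt/cleanup behavior.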

Notes

The issue is complex and involves multiple components, including the Hermes gateway, UI, and auxiliary compression. Further investigation is needed to determine the root cause and develop a comprehensive fix.

Recommendation

Apply the workaround by adjusting the auxiliary compression configuration and restarting the hermes-gateway.service to mitigate the issue. This may help reduce the occurrence of stale agent threads and sessions, but a more permanent fix will likely require changes to the Hermes codebase.
