hermes - 💡(How to fix) Fix Possible interrupt/auxiliary-compression reliability issue causing stuck sessions and misleading partial outage state [1 participants]

hermes · 2026-04-23 20:34:25

GitHub stats

NousResearch/hermes-agent#14722•Fetched 2026-04-24 06:15:01

Author

QuantumQuill77-glitch

Summary

I ran into a Hermes failure mode where the system appeared to “stop,” but the outage was only partial:

- UI surfaces were still up:
  - Hermes dashboard on port 8080
  - Hermes Workspace on port 3000
- The actual Hermes gateway/chat service on port 8642 was down

This made Hermes look completely dead from the user side even though the UI was still reachable.

At the same time, logs suggested a possible upstream reliability issue involving:

- interrupt handling
- auxiliary compression
- model fallback behavior
- stale/hung CLI session cleanup

Environment

Hermes version: v0.10.0 (2026.4.16)
Install type: git-installed
Source path: ~/.hermes/hermes-agent
Model: gpt-5.4 via OpenAI Codex
Host: Ubuntu 24.04 on Oracle ARM/KVM
Gateway managed as: systemd --user service
Workspace also in use on the same machine

What happened

Hermes appeared to stop responding to chat, but investigation showed:

- hermes-gateway.service was inactive
- the dashboard/UI processes were still running
- the actual chat/gateway path on port 8642 was down
- there was also a stale long-running Hermes CLI process tied to an old session

This created a confusing “partial outage” condition:

- 3000 and 8080 looked healthy
- 8642 was the real failed component
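
A quick way to tell this partial outage apart from a full one is to probe all three ports at once. The following is a minimal sketch, not Hermes code: the port numbers come from this report, while the surface labels and the `port_open` and `classify` helpers are hypothetical names introduced for illustration.

```python
import socket

# Port numbers are taken from the report; the labels are illustrative only.
SURFACES = {"dashboard": 8080, "workspace": 3000, "gateway": 8642}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(status: dict[str, bool]) -> str:
    """Summarize probe results; 'gateway' is treated as the critical path."""
    if all(status.values()):
        return "healthy"
    if not any(status.values()):
        return "down"
    if not status.get("gateway", False):
        return "degraded: UI reachable but gateway/chat is down"
    return "degraded: gateway up but a UI surface is down"


def probe(host: str = "127.0.0.1") -> dict[str, bool]:
    """Probe every known surface and report which ones accept connections."""
    return {name: port_open(host, port) for name, port in SURFACES.items()}
```

Calling `classify(probe())` on the affected machine would have reported the "UI reachable but gateway/chat is down" state described above. A TCP connect only shows that something is listening, so it is a coarse first check rather than a real health signal.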

Relevant observations from logs / behavior

I saw signs of auxiliary compression/fallback trouble around the same time:

- compression attempted on nous / moonshotai/kimi-k2.6
- request timed out
- fallback to OpenRouter Gemini returned HTTP 400
- one run logged:
  - Agent thread still alive after interrupt

There was also a stale stuck CLI session that had to be cleaned up manually afterward.
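
For readers unfamiliar with how a log line like "Agent thread still alive after interrupt" can arise, here is a generic Python illustration. It is not taken from the Hermes source; `run_worker` and `alive_after_interrupt` are hypothetical names. It shows that a worker which never polls the interrupt flag during a long call (for example a slow auxiliary request) outlives the interrupt, while a cooperative worker exits promptly.

```python
import threading
import time


def run_worker(check_interrupt: bool, interrupt: threading.Event, work_seconds: float) -> None:
    """Simulate a long-running call, optionally polling the interrupt flag."""
    deadline = time.monotonic() + work_seconds
    while time.monotonic() < deadline:
        if check_interrupt and interrupt.is_set():
            return  # cooperative exit as soon as the interrupt is seen
        time.sleep(0.01)


def alive_after_interrupt(check_interrupt: bool, work_seconds: float = 0.5, grace: float = 0.1) -> bool:
    """Start a worker, interrupt it, wait `grace` seconds, report if it is still running."""
    interrupt = threading.Event()
    worker = threading.Thread(
        target=run_worker,
        args=(check_interrupt, interrupt, work_seconds),
        daemon=True,
    )
    worker.start()
    interrupt.set()
    worker.join(timeout=grace)
    return worker.is_alive()
```

With these defaults, `alive_after_interrupt(False)` returns `True` (the thread ignores the flag and is still running after the grace period), and `alive_after_interrupt(True)` returns `False`. Whether Hermes hits this exact pattern is not established by the report; it is one plausible mechanism.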

Local factors discovered during diagnosis

One local issue complicated debugging but does not appear to be the root Hermes bug:

- Hermes Workspace had a bad local patch in:
  - ~/hermes-workspace/src/server/gateway-capabilities.ts
- that patch forced sessions, skills, config, and jobs capabilities to true
- this made Workspace behave as if backend support existed even when the gateway on 8642 was not actually serving what Workspace expected

That was corrected locally by restoring real capability probing.

I’m including this only as context because it made diagnosis noisier; I do not think this specific patch is a Hermes core bug.

Fixes / mitigation applied locally

- Restarted hermes-gateway.service
  - verified /health on 8642 returned OK
- Corrected the local Workspace capability patch
  - restored real capability probing
  - rebuilt Workspace successfully
- Verified Workspace API routes after rebuild
  - /api/sessions
  - /api/skills
  - /api/config
- Mitigated auxiliary compression instability with this config:

    auxiliary:
      compression:
        provider: nous
        model: openai/gpt-5.4
        timeout: 180

- Restarted gateway again after config change
- Cleaned up stale stuck Hermes CLI process
  - old PID was manually terminated
  - stuck session ended cleanly afterward
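
The "/health on 8642 returned OK" check from the first step can be scripted so it is easy to repeat after each restart. This is a minimal sketch using only the standard library. It assumes the gateway answers `GET /health` with HTTP 200 when healthy (the report only says it "returned OK"), and `gateway_healthy` is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def gateway_healthy(base_url: str = "http://127.0.0.1:8642", timeout: float = 3.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200.

    Assumption: a 200 status means healthy; the response body is not inspected.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response all count as unhealthy.
        return False
```

Running `gateway_healthy()` right after restarting the gateway, and again after the config change, gives a repeatable signal that does not depend on the UI on 3000 or 8080.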

Current status after mitigation

- port 8642 healthy
- Workspace on 3000 healthy
- dashboard on 8080 healthy
- old stuck session no longer hanging
- new sessions appear healthier with the safer auxiliary compression config

Suspected upstream issue

The part that seems worth investigating upstream is:

- auxiliary compression timeout/fallback may interact badly with interrupt handling
- this may leave an agent thread/session in a stuck or partially interrupted state
- the resulting system state can be confusing because UI surfaces remain up while the actual chat path is unhealthy
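
One way to picture the desired behavior is a fallback chain whose cleanup runs no matter how the attempts end. The sketch below is not the Hermes implementation: `compress_with_fallback`, the provider callables, and the `cleanup` hook are hypothetical stand-ins used to show the `try`/`finally` shape.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def compress_with_fallback(
    providers: Sequence[Callable[[str], str]],
    text: str,
    cleanup: Callable[[], None],
) -> Tuple[Optional[str], List[Exception]]:
    """Try each provider in order and always run cleanup, even if all fail.

    Returns (result, errors); result is None when every provider failed.
    """
    errors: List[Exception] = []
    try:
        for provider in providers:
            try:
                return provider(text), errors
            except Exception as exc:  # e.g. a timeout or an HTTP 400 from a fallback
                errors.append(exc)
        return None, errors
    finally:
        # Runs on success, on total failure, and if the caller is interrupted mid-loop.
        cleanup()
```

In the sequence from the logs (primary times out, fallback returns HTTP 400), this shape would return `None` with both errors recorded, and the cleanup hook would still have run, so no session state is left half-released.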

Suggested investigation areas

Likely code paths to inspect:

- cli.py
  - interrupt path / cleanup behavior
- agent/context_compressor.py
  - timeout handling
  - failure cleanup
  - auxiliary compression recovery
- run_agent.py
  - auxiliary model selection
  - fallback behavior
  - post-failure session/thread cleanup

Developer-facing questions

- Should an auxiliary compression timeout ever be able to leave a live agent thread behind after interrupt?
- Is fallback after compression failure expected to cross providers in this way, and if so, is the cleanup path guaranteed?
- Could Hermes expose clearer degraded-state reporting when:
  - UI is alive
  - but gateway/chat on 8642 is down?
- Should the gateway or CLI be more aggressive about cleaning up stale sessions after:
  - interrupt
  - compression timeout
  - fallback failure?
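
To make the last question concrete, here is a hypothetical sketch of what "more aggressive" stale-session detection could look like. Nothing here reflects Hermes internals: the `SessionInfo` record, its fields, and both thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SessionInfo:
    """Hypothetical session record; Hermes may track sessions differently."""
    session_id: str
    pid: int
    last_activity: float      # timestamp in seconds
    interrupted: bool = False


def find_stale(
    sessions: List[SessionInfo],
    now: float,
    max_idle: float = 900.0,        # assumed idle limit for normal sessions
    interrupt_grace: float = 30.0,  # assumed shorter limit once interrupted
) -> List[SessionInfo]:
    """Return sessions idle past their limit; interrupted ones get a shorter limit."""
    stale = []
    for s in sessions:
        idle = now - s.last_activity
        limit = interrupt_grace if s.interrupted else max_idle
        if idle > limit:
            stale.append(s)
    return stale
```

The design choice is that an interrupted session is given a short grace period rather than the full idle limit, since an interrupt signals that the user no longer expects it to make progress. What to do with the returned sessions (log, signal the PID, end the session) is left open.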

Why this matters

From the user perspective, this looks like:

“Hermes randomly stopped”

But operationally it may actually be:

- gateway/chat path failed
- UI stayed up
- auxiliary compression/interrupt logic may have left stale state behind

That combination is hard to diagnose quickly and could be improved either through:

- a bugfix in the cleanup/interrupt path
- clearer health reporting
- or both

If useful, I can also provide:

- the stuck session ID
- the stale CLI PID that was cleaned up
- the exact local config mitigation that reduced recurrence

extent analysis

TL;DR

The likely fix involves hardening interrupt handling and the auxiliary compression timeout/fallback cleanup in Hermes so that stale agent threads and sessions are not left behind. The root cause has not been confirmed.

Guidance

- Investigate the cli.py, agent/context_compressor.py, and run_agent.py code paths to identify potential issues with interrupt handling, timeout handling, and fallback behavior.
- Consider implementing clearer degraded-state reporting when the UI is alive but the gateway/chat on port 8642 is down.
- Review the auxiliary compression configuration to ensure it is properly set up to handle timeouts and fallbacks without leaving stale state behind.
- Verify that the gateway or CLI is aggressively cleaning up stale sessions after interrupt, compression timeout, or fallback failure.

Example

No specific code example is provided, but the auxiliary configuration block in the issue body shows a potential mitigation:

auxiliary:
  compression:
    provider: nous
    model: openai/gpt-5.4
    timeout: 180

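This configuration pins the auxiliary compression provider and model and sets a timeout of 180 (presumably seconds). The reporter found that it reduced recurrence, but it is a mitigation and does not address the underlying cleanup path.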

Notes

The issue is complex and involves multiple components, including the Hermes gateway, UI, and auxiliary compression. Further investigation is needed to determine the root cause and develop a comprehensive fix.

Recommendation

Apply the workaround by adjusting the auxiliary compression configuration and restarting the hermes-gateway.service to mitigate the issue. This may help reduce the occurrence of stale agent threads and sessions, but a more permanent fix will likely require changes to the Hermes codebase.
