hermes: [Important] Architecture quality audit: restart continuity, gateway sprawl, and tool-policy coverage [1 comment, 2 participants]

GitHub stats
NousResearch/hermes-agent#17154 · Fetched 2026-04-29 06:37:02
Comments: 1 · Participants: 2 · Timeline events: 5 (labeled ×4, commented ×1) · Reactions: 0


I'm building hermescheck, an open-source tool for inspecting and validating AI agent workflows. I ran a fresh architecture-quality scan against the current NousResearch/hermes-agent main branch and wanted to share the results as a maintainer-facing engineering issue, not as a security disclosure.

Target revision scanned: a830f25f716190168dd7db6819c0b48848049002 (fix(tui): surface gateway stderr tail in start_timeout activity (#17112)).

Scan command:

hermescheck audit /tmp/nous-hermes-agent-audit \
  --profile enterprise_production \
  --self-review /tmp/hermes-self-review.json \
  --output /tmp/hermes-audit-self.json \
  --report /tmp/hermes-audit-self.md \
  --sarif /tmp/hermes-audit-self.sarif \
  --fail-on none

The scanner reported 133 findings (35 critical, 20 high, 78 medium), an overall severity of critical, and an architecture-era score of 30/100 under the strict enterprise-production profile.

Important caveat: several raw critical findings are expected false positives or low-priority in context because they land in tests, comments, or intentionally powerful skill/sandbox code. I am not presenting this as "Hermes has 35 vulnerabilities." The more serious signal is architectural: Hermes has become a very capable persistent runtime, but major responsibilities are concentrated in a few large surfaces where restart continuity, tool-policy coverage, memory provenance, and cross-channel state can drift.

Architecture self-review used for this scan

Hermes Agent appears to be a persistent, self-hosted multi-channel agent runtime. The core execution path is centered on:

  • run_agent.py for model execution, tool loops, prompt assembly, context compression, provider routing, fallback, persistence, and session handling.
  • gateway/run.py for messaging-platform orchestration, session routing, platform delivery, progress streaming, commands, gateway lifecycle, restart/replace, and cached agent reuse.
  • hermes_state.py / ~/.hermes/state.db for durable sessions and message history.
  • tools/ for privileged capabilities, especially terminal execution, browser/tool environments, approvals, memory, and session search.
  • skills/ and optional-skills/ for procedural knowledge and reusable workflows.
  • cron/ / scheduler surfaces for scheduled work.

Self-identified architecture risks from the review:

  • Restart and replace flows can become a state-loss boundary if gateway caches, active session maps, and state.db persistence diverge.
  • session_search exists, but continuity still depends heavily on the model remembering to call it rather than the runtime always injecting recent-session context after cold starts or channel remaps.
  • gateway/run.py has enough responsibilities to behave like a monolith during incidents.
  • Self-modifying or self-restarting agents are risky unless restart is deferred until user-visible delivery and session persistence are complete.
  • Memory, skills, session search, summaries, and transcript recall can blur without explicit source, timestamp, freshness, and confidence metadata.

Major issues worth treating as high priority

1. Gateway orchestration has become a high-risk monolith

Evidence:

  • gateway/run.py is about 12,118 lines in this scan.
  • It combines lifecycle/restart behavior, platform adapters, session routing, progress streaming, user commands, background tasks, cached agent management, and delivery behavior.

Why this is serious:

When one file owns this many runtime concerns, gateway incidents become hard to reason about. A change intended for Feishu/Discord streaming can accidentally affect restart behavior, session mapping, cached agent reuse, or active-turn cleanup. This is especially risky for a persistent messaging gateway where users expect continuity across restarts.

Recommended direction:

Split or hard-bound the responsibilities into smaller contracts:

  • lifecycle/restart/replace
  • session routing and channel identity
  • progress streaming/delivery
  • command handling
  • cached-agent lifecycle
  • platform adapter coordination

At minimum, add a "gateway lifecycle contract" doc and tests that pin the behavior around replace/restart, active agents, platform reconnect, and duplicate-process prevention.

2. run_agent.py is carrying too much of the core runtime

Evidence:

  • run_agent.py is about 13,601 lines.
  • It contains provider setup, prompt caching, tool execution loop, context compression, memory/session persistence, fallback logic, image handling, streaming, retry behavior, and platform/session metadata handling.

Why this is serious:

This makes it difficult to prove that a tool-loop change does not affect persistence, fallback, prompt cache stability, or session replay. It also increases the blast radius of small fixes and makes review harder for new contributors.

Recommended direction:

Extract stable boundaries around:

  • model transport/provider routing
  • prompt/system-context assembly
  • tool loop and tool-result persistence
  • session DB writes and replay
  • fallback/retry policy
  • compression and memory hooks

3. Restart continuity should be a first-class invariant, not a best-effort behavior

Evidence:

  • hermes_state.py has durable session APIs such as list_sessions_rich and get_messages_as_conversation.
  • tools/session_search_tool.py can retrieve past sessions.
  • Gateway and agent code maintain in-memory caches and active session state across a long-lived process.

Why this is serious:

A persistent agent that serves chat channels must survive gateway replace/restart without losing what it was just doing. If recent-session context is only available when the model chooses to call session_search, then cold-start recovery is probabilistic. This becomes especially painful when the agent restarts during long-running work.

Recommended direction:

Add a mandatory restart-recall acceptance test:

  1. Create or update a session with recent messages.
  2. Replace/restart the gateway.
  3. Verify platform reconnect.
  4. Verify no duplicate gateway process remains.
  5. Verify the next turn receives a bounded "recent session recall" context from state.db.
  6. Verify current user input remains authoritative over recalled context.

4. Tool-policy and loop-safety checks appear to be attached to partial paths

hermescheck found:

  • Loop detector does not observe all tool-call paths
  • Permission policy is not enforced on all dispatch paths

The scanner is heuristic, so the exact counts should be manually reviewed. But the architectural risk is real for a runtime with many execution modes: sequential tool calls, concurrent/delegated calls, scheduled jobs, gateway commands, CLI paths, MCP-like tools, and environment-specific tools can easily drift.

Why this is serious:

Hermes can look guarded because one path has approval, Tirith, loop detection, or command policy, while another path bypasses the same protection. For a high-agency agent, the policy must live at the shared dispatch boundary or be proven equivalent across all dispatch boundaries.

Recommended direction:

Create one tool-dispatch policy contract that records, for every tool invocation:

  • tool name
  • normalized arguments
  • session/channel identity
  • approval policy decision
  • dangerous-command decision
  • loop-budget decision
  • result/error status
  • whether the call was sequential, delegated, scheduled, or backgrounded

Then test representative calls from every execution path.
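
To make the contract concrete, here is a minimal sketch of a single dispatch boundary that writes one record per invocation with the fields listed above. `DispatchRecord`, `ToolDispatcher`, the substring-based dangerous-command check, and the repeat-count loop budget are all hypothetical illustrations, not existing Hermes code; a real version would call the existing approval, Tirith, and loop-detection logic instead.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class DispatchRecord:
    """One audit entry per tool invocation (hypothetical shape)."""
    tool_name: str
    normalized_args: str
    session_id: str
    mode: str  # "sequential", "delegated", "scheduled", or "background"
    approval: str = "pending"
    dangerous: bool = False
    loop_budget_ok: bool = True
    status: str = "pending"
    result: Any = None


class ToolDispatcher:
    """Every execution path calls dispatch(); nothing invokes a tool directly."""

    def __init__(self, tools: Dict[str, Callable[..., Any]],
                 needs_approval=frozenset(),
                 dangerous_markers=("rm -rf",),
                 max_repeats: int = 3):
        self.tools = tools
        self.needs_approval = set(needs_approval)
        self.dangerous_markers = tuple(dangerous_markers)
        self.max_repeats = max_repeats
        self.audit_log: List[DispatchRecord] = []
        self._seen: Dict[Tuple[str, str, str], int] = {}

    def dispatch(self, tool_name: str, args: Dict[str, Any], session_id: str,
                 mode: str, approved: bool = False) -> DispatchRecord:
        # Normalize arguments so identical calls compare equal.
        normalized = json.dumps(args, sort_keys=True, default=str)
        rec = DispatchRecord(tool_name, normalized, session_id, mode)
        self.audit_log.append(rec)  # recorded even if the call is blocked

        # Dangerous-command decision (placeholder: substring match).
        rec.dangerous = any(m in normalized for m in self.dangerous_markers)

        # Loop-budget decision: count identical calls per session.
        key = (session_id, tool_name, normalized)
        self._seen[key] = self._seen.get(key, 0) + 1
        rec.loop_budget_ok = self._seen[key] <= self.max_repeats

        # Approval decision.
        if tool_name in self.needs_approval or rec.dangerous:
            rec.approval = "approved" if approved else "denied"
        else:
            rec.approval = "not_required"

        if rec.approval == "denied" or not rec.loop_budget_ok:
            rec.status = "blocked"
            return rec
        try:
            rec.result = self.tools[tool_name](**args)
            rec.status = "ok"
        except Exception as exc:  # result/error status is part of the record
            rec.status = f"error: {exc}"
        return rec
```

Because the record is appended before any policy decision, a blocked or failed call still leaves an audit entry, which is what lets a test assert that sequential, delegated, scheduled, and background paths all pass through the same checks.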

5. Self-restart/self-upgrade flows need explicit deferred semantics

Hermes has powerful terminal and process-control capabilities. That is useful, but any agent path that restarts or replaces its own gateway can interrupt the current turn, lose delivery, or persist an incomplete transcript.

Recommended direction:

Treat direct in-turn self-restart as an architecture smell. The safe policy should be:

  1. Detect restart/kill/replace commands that target the active Hermes process or gateway.
  2. Write a deferred restart marker.
  3. Finish the current turn.
  4. Persist the session.
  5. Deliver the user-visible response.
  6. Wait for active platform sessions to drain.
  7. Restart/replace the gateway.
  8. Verify reconnect and recent-session recall.

This should become a regression test, not just a convention.
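
The ordering above can be pinned in a small state machine. The sketch below is illustrative only: `DeferredRestartController`, the regex used to spot restart commands, and the event names are assumptions, and the `persist` and `deliver` callables stand in for whatever Hermes actually uses to write the session and send the reply.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

# Placeholder heuristic for "this command targets the running agent".
RESTART_PATTERN = re.compile(
    r"\b(restart|kill|pkill|replace)\b.*\b(hermes|gateway)\b", re.IGNORECASE
)


@dataclass
class DeferredRestartController:
    active_sessions: int = 0
    restart_pending: bool = False
    events: List[str] = field(default_factory=list)

    def intercept_command(self, command: str) -> bool:
        """Steps 1-2: detect a self-restart and write a marker instead of running it."""
        if RESTART_PATTERN.search(command):
            self.restart_pending = True
            self.events.append("marker_written")
            return True
        return False

    def finish_turn(self, persist: Callable[[], None],
                    deliver: Callable[[], None]) -> None:
        """Steps 3-5: finish the turn, persist the session, then deliver."""
        self.events.append("turn_finished")
        persist()
        self.events.append("session_persisted")
        deliver()
        self.events.append("response_delivered")

    def maybe_restart(self) -> bool:
        """Steps 6-7: restart only once a marker exists and sessions have drained."""
        if self.restart_pending and self.active_sessions == 0:
            self.events.append("gateway_restarted")
            self.restart_pending = False
            return True
        return False
```

A regression test can then assert on the exact order of `events`, so a change that restarts before delivery fails loudly rather than silently dropping a reply. Step 8 (reconnect and recall verification) is left to the gateway harness.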

6. Session search exists, but runtime-level recent recall is still needed

Evidence:

  • tools/session_search_tool.py exposes session search.
  • agent/prompt_builder.py tells the agent to use session_search when relevant.

Why this is serious:

Prompt guidance is weaker than runtime policy. If the agent is recovering from a restart, it may not know what it forgot. The runtime should provide a small, bounded recent-session packet automatically when the gateway has just restarted, when a channel maps to a session with no in-memory agent, or when cached agent state was evicted.

Recommended direction:

Use state.db as the source of truth and inject a bounded, clearly fenced recent-session context with:

  • source layer: recent transcript, not durable fact
  • session id / channel
  • timestamp
  • title/preview
  • max message count and max char budget
  • instruction that current user input wins on conflict
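
A rough sketch of such a packet builder follows. The `messages(session_id, role, content, ts)` table is a made-up schema for illustration; the real state.db layout is not described in this issue, so the query would need adapting. The fence markers and the `build_recent_recall` name are likewise hypothetical.

```python
import sqlite3


def build_recent_recall(conn: sqlite3.Connection, session_id: str, channel: str,
                        max_messages: int = 6, max_chars: int = 800) -> str:
    """Return a fenced, bounded recall packet, or "" if the session has no history."""
    rows = conn.execute(
        "SELECT role, content, ts FROM messages "
        "WHERE session_id = ? ORDER BY ts DESC, rowid DESC LIMIT ?",
        (session_id, max_messages),
    ).fetchall()
    if not rows:
        return ""
    rows.reverse()  # oldest first, so the packet reads chronologically

    body = "\n".join(f"[{ts}] {role}: {content}" for role, content, ts in rows)
    if len(body) > max_chars:
        body = body[-max_chars:]  # keep the newest text when over budget

    header = (
        "<<<RECENT SESSION RECALL>>>\n"
        "source_layer: recent transcript, not durable fact\n"
        f"session_id: {session_id}\n"
        f"channel: {channel}\n"
        f"latest_timestamp: {rows[-1][2]}\n"
        f"limits: max_messages={max_messages}, max_chars={max_chars}\n"
        "rule: current user input wins on conflict\n"
    )
    return header + body + "\n<<<END RECENT SESSION RECALL>>>"
```

The runtime would call this only on the three triggers named above (fresh gateway start, channel mapped to a session with no in-memory agent, evicted cache), so warm sessions pay no extra prompt cost.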

7. Memory provenance and freshness are not visible enough

hermescheck reported the finding "Memory freshness / generation confusion detected" across session, memory, plugin, README, and other memory-like surfaces.

Why this is serious:

Hermes has multiple memory concepts: durable user facts, procedural skills, session transcripts, summaries, plugin memory, context files, and search results. Without visible provenance, agents may treat stale transcript summaries like current facts or confuse a skill/procedure with user preference.

Recommended direction:

Every injected memory block should identify:

  • layer: fact / procedure / transcript / summary / context file
  • source path or session id
  • timestamp / freshness
  • confidence
  • whether it is allowed to override current user input
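
One possible way to carry those fields is a small value type that refuses to render without them. `MemoryBlock`, the layer names, and the bracketed header format below are assumptions made for this sketch, not an existing Hermes format.

```python
from dataclasses import dataclass
from datetime import datetime

LAYERS = {"fact", "procedure", "transcript", "summary", "context_file"}


@dataclass(frozen=True)
class MemoryBlock:
    layer: str
    source: str             # source path or session id
    timestamp: datetime     # when the content was written
    confidence: float       # 0.0 to 1.0
    may_override_user: bool
    text: str

    def __post_init__(self):
        # Reject blocks whose provenance is missing or malformed.
        if self.layer not in LAYERS:
            raise ValueError(f"unknown memory layer: {self.layer}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")

    def render(self, now: datetime) -> str:
        """Render the block with a visible provenance header for prompt injection."""
        age_days = (now - self.timestamp).days
        return (
            f"[memory layer={self.layer} source={self.source} "
            f"timestamp={self.timestamp.isoformat()} age_days={age_days} "
            f"confidence={self.confidence:.2f} "
            f"may_override_user={str(self.may_override_user).lower()}]\n"
            f"{self.text}\n[/memory]"
        )
```

Computing `age_days` at render time rather than at write time means the same stored summary looks progressively staler in the prompt, which is the freshness signal the finding asks for.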

8. Startup and runtime surface sprawl make operations harder than they need to be

hermescheck reported:

  • Startup surface sprawl detected
  • Runtime surface sprawl detected

Why this is serious:

Hermes supports CLI, gateway, TUI, web, adapters, cron, MCP-like tools, containers, environments, and plugins. That breadth is a strength, but the canonical operational paths need to be very clear. Otherwise users and contributors debug the wrong runner, wrong service manager, or wrong state file.

Recommended direction:

Document and test two canonical paths:

  • development foreground path
  • production/background gateway path

Everything else should be explicitly marked optional, legacy, experimental, or adapter-specific.

9. Raw scanner criticals should be triaged, not ignored

The raw scan flagged unsafe execution patterns and hardcoded secret-like strings. Some are clearly tests, comments, fixtures, or intentionally powerful red-team/sandbox skills. Still, I would not discard the category completely.

Recommended direction:

Add scanner-aware annotations or a repo hygiene policy that separates:

  • test fixtures
  • comments/docs
  • intentionally sandboxed skill code
  • actual runtime execution paths

The goal is not to remove all exec/eval strings from a system that intentionally runs tools. The goal is to make it obvious which execution surfaces are policy-protected and which are only examples/tests.
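
As a starting point for that triage, a path-and-line classifier along these lines could bucket raw hits before a human looks at them. The `skills/` and `optional-skills/` directory names come from the self-review above; the `tests` directory convention, the category labels, and `classify_finding` itself are assumptions for this sketch.

```python
from pathlib import PurePosixPath


def classify_finding(path: str, line: str) -> str:
    """Bucket a scanner hit by where it lives; anything unmatched is a runtime path."""
    p = PurePosixPath(path)
    if "tests" in p.parts or p.name.startswith("test_"):
        return "test_fixture"
    if p.suffix in (".md", ".rst") or line.lstrip().startswith("#"):
        return "comment_or_doc"
    if p.parts and p.parts[0] in ("skills", "optional-skills"):
        return "sandboxed_skill"
    return "runtime_path"
```

Defaulting to `runtime_path` keeps the classifier conservative: a hit is only downgraded when it clearly matches one of the low-risk buckets.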

10. Output mutation and rendering transformations need a clearer audit trail

The scan found many output transformation sites. Many are probably legitimate platform rendering behavior, but in a multi-channel agent runtime this matters.

Why this is serious:

If raw model output differs from user-visible output, incident analysis needs to know whether a platform adapter, renderer, sanitizer, media pipeline, or response wrapper changed it.

Recommended direction:

For high-risk channels and tool results, log:

  • raw assistant output
  • transformation stage
  • final delivered payload
  • platform/channel
  • truncation or media conversion decisions
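
A minimal sketch of such an audit trail is shown below, assuming delivery can be modeled as an ordered list of named text transforms followed by a length cap. `DeliveryAudit` and `deliver_with_audit` are hypothetical names; real adapters would also need to cover media conversion, which this example does not model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class DeliveryAudit:
    platform: str
    channel: str
    raw_output: str
    # (stage name, text after that stage), recorded only when a stage changed the text
    stages: List[Tuple[str, str]] = field(default_factory=list)
    final_payload: str = ""
    truncated: bool = False


def deliver_with_audit(raw: str, platform: str, channel: str,
                       transforms: List[Tuple[str, Callable[[str], str]]],
                       max_len: int) -> DeliveryAudit:
    """Apply each transform in order and record every stage that mutated the output."""
    audit = DeliveryAudit(platform, channel, raw)
    text = raw
    for name, fn in transforms:
        new_text = fn(text)
        if new_text != text:
            audit.stages.append((name, new_text))
        text = new_text
    if len(text) > max_len:
        text = text[:max_len]  # platform length cap
        audit.truncated = True
    audit.final_payload = text
    return audit
```

With a record like this, incident analysis can diff `raw_output` against `final_payload` and name the stage responsible for any change.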

Suggested fix order

  1. Define and test the restart-continuity contract.
  2. Add runtime-level recent-session recall from state.db for cold starts and channel remaps.
  3. Move tool policy/loop observation to a common dispatch contract or prove equivalent coverage across dispatch paths.
  4. Split or hard-bound gateway/run.py lifecycle/session/delivery responsibilities.
  5. Add provenance-rich memory rendering.
  6. Triage raw exec/eval/secret scanner hits into fixture/comment/intentional-runtime categories.
  7. Document canonical startup/runtime paths.
  8. Add output-transformation audit logs for multi-channel delivery.

Why I think this is important

Hermes already has many strong ingredients: persistent sessions, memory, skills, scheduler, multi-channel gateway, provider flexibility, tool execution, and session search. The risk is not that Hermes lacks capabilities. The risk is that the capability surface is now large enough that reliability depends on architectural contracts rather than local fixes.

In other words: the next quality jump is not "add one more feature"; it is making restart continuity, tool policy coverage, memory provenance, and gateway lifecycle behavior impossible to accidentally bypass.

extent analysis

TL;DR

The hermescheck findings point to architectural risks rather than individual bugs. The most promising fixes are to define and test a restart-continuity contract, add runtime-level recent-session recall, and move tool policy/loop observation to a common dispatch contract.

Guidance

  • Define and test a restart-continuity contract to ensure that the gateway can survive restarts without losing session context.
  • Add runtime-level recent-session recall from state.db to provide a bounded recent-session context after cold starts or channel remaps.
  • Move tool policy/loop observation to a common dispatch contract to ensure that all tool invocations are recorded and protected by a unified policy.
  • Split or hard-bound gateway/run.py lifecycle, session, and delivery responsibilities to reduce the risk of incidents and make the code more maintainable.
  • Add provenance-rich memory rendering to ensure that memory blocks are properly identified and sourced.

Example

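# Sketch of a restart-continuity contract test.
# `gateway` is a hypothetical pytest fixture wrapping a test gateway; none of
# its methods or attributes are confirmed Hermes APIs.
def test_restart_continuity(gateway):
    # 1. Create a session with recent messages
    session = gateway.create_session(messages=["deploy window is 17:00"])
    # 2. Replace/restart the gateway
    gateway.restart()
    # 3. Verify platform reconnect
    assert gateway.platform_reconnected()
    # 4. Verify no duplicate gateway process remains
    assert gateway.process_count() == 1
    # 5. Verify the next turn receives a bounded "recent session recall" context from `state.db`
    recall = gateway.recent_session_recall(session)
    assert recall and len(recall) <= gateway.max_recall_chars
    # 6. Checking that current user input stays authoritative over recalled
    #    context depends on the real prompt builder and is left to that harness.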

Notes

The suggested fix order provided in the issue body should be followed to address the architectural risks. The fixes should be implemented and tested incrementally to ensure that each change does not introduce new issues.

Recommendation

Apply the suggested fixes in the order provided, starting with defining and testing the restart-continuity contract, to address the architectural risks and improve the reliability of the Hermes agent.
