hermes - 💡(How to fix) Fix [Important] Architecture quality audit: restart continuity, gateway sprawl, and tool-policy coverage [1 comments, 2 participants]

huangrichao2020 · 2026-04-28T21:37:55Z



NousResearch/hermes-agent#17154•Fetched 2026-04-29 06:37:02


Author: huangrichao2020

Participants: alt-glitch, huangrichao2020

Timeline (top): labeled ×4, commented ×1


Root Cause

Important caveat: several raw critical findings are expected false positives or low-priority in context because they land in tests, comments, or intentionally powerful skill/sandbox code. I am not presenting this as "Hermes has 35 vulnerabilities." The more serious signal is architectural: Hermes has become a very capable persistent runtime, but major responsibilities are concentrated in a few large surfaces where restart continuity, tool-policy coverage, memory provenance, and cross-channel state can drift.

Fix Action

Fix / Workaround

Loop detector does not observe all tool-call paths
Permission policy is not enforced on all dispatch paths

Hermes can look guarded because one path has approval, Tirith, loop detection, or command policy, while another path bypasses the same protection. For a high-agency agent, the policy must live at the shared dispatch boundary or be proven equivalent across all dispatch boundaries.

Create one tool-dispatch policy contract that records every tool invocation. The fields to record are listed under issue 4 in the full issue text below.

Code Example

hermescheck audit /tmp/nous-hermes-agent-audit \
  --profile enterprise_production \
  --self-review /tmp/hermes-self-review.json \
  --output /tmp/hermes-audit-self.json \
  --report /tmp/hermes-audit-self.md \
  --sarif /tmp/hermes-audit-self.sarif \
  --fail-on none

RAW_BUFFER

I'm building hermescheck, an open-source tool for inspecting and validating AI agent workflows. I ran a fresh architecture-quality scan against the current NousResearch/hermes-agent main branch and wanted to share the results as a maintainer-facing engineering issue, not as a security disclosure.

Target revision scanned: a830f25f716190168dd7db6819c0b48848049002 (fix(tui): surface gateway stderr tail in start_timeout activity (#17112)).

Scan command:

hermescheck audit /tmp/nous-hermes-agent-audit \
  --profile enterprise_production \
  --self-review /tmp/hermes-self-review.json \
  --output /tmp/hermes-audit-self.json \
  --report /tmp/hermes-audit-self.md \
  --sarif /tmp/hermes-audit-self.sarif \
  --fail-on none

The scanner reported 133 findings: 35 critical, 20 high, 78 medium, overall critical, architecture-era score 30/100 under the strict enterprise-production profile.

Architecture self-review used for this scan

Hermes Agent appears to be a persistent, self-hosted multi-channel agent runtime. The core execution path is centered on:

run_agent.py for model execution, tool loops, prompt assembly, context compression, provider routing, fallback, persistence, and session handling.
gateway/run.py for messaging-platform orchestration, session routing, platform delivery, progress streaming, commands, gateway lifecycle, restart/replace, and cached agent reuse.
hermes_state.py / ~/.hermes/state.db for durable sessions and message history.
tools/ for privileged capabilities, especially terminal execution, browser/tool environments, approvals, memory, and session search.
skills/ and optional-skills/ for procedural knowledge and reusable workflows.
cron/ / scheduler surfaces for scheduled work.

Self-identified architecture risks from the review:

Restart and replace flows can become a state-loss boundary if gateway caches, active session maps, and state.db persistence diverge.
session_search exists, but continuity still depends heavily on the model remembering to call it rather than the runtime always injecting recent-session context after cold starts or channel remaps.
gateway/run.py has enough responsibilities to behave like a monolith during incidents.
Self-modifying or self-restarting agents are risky unless restart is deferred until user-visible delivery and session persistence are complete.
Memory, skills, session search, summaries, and transcript recall can blur without explicit source, timestamp, freshness, and confidence metadata.

Major issues worth treating as high priority

1. Gateway orchestration has become a high-risk monolith

Evidence:

gateway/run.py is about 12,118 lines in this scan.
It combines lifecycle/restart behavior, platform adapters, session routing, progress streaming, user commands, background tasks, cached agent management, and delivery behavior.

Why this is serious:

When one file owns this many runtime concerns, gateway incidents become hard to reason about. A change intended for Feishu/Discord streaming can accidentally affect restart behavior, session mapping, cached agent reuse, or active-turn cleanup. This is especially risky for a persistent messaging gateway where users expect continuity across restarts.

Recommended direction:

Split or hard-bound the responsibilities into smaller contracts:

lifecycle/restart/replace
session routing and channel identity
progress streaming/delivery
command handling
cached-agent lifecycle
platform adapter coordination

At minimum, add a "gateway lifecycle contract" doc and tests that pin the behavior around replace/restart, active agents, platform reconnect, and duplicate-process prevention.

2. `run_agent.py` is carrying too much of the core runtime

Evidence:

run_agent.py is about 13,601 lines.
It contains provider setup, prompt caching, tool execution loop, context compression, memory/session persistence, fallback logic, image handling, streaming, retry behavior, and platform/session metadata handling.

Why this is serious:

This makes it difficult to prove that a tool-loop change does not affect persistence, fallback, prompt cache stability, or session replay. It also increases the blast radius of small fixes and makes review harder for new contributors.

Recommended direction:

Extract stable boundaries around:

model transport/provider routing
prompt/system-context assembly
tool loop and tool-result persistence
session DB writes and replay
fallback/retry policy
compression and memory hooks

3. Restart continuity should be a first-class invariant, not a best-effort behavior

Evidence:

hermes_state.py has durable session APIs such as list_sessions_rich and get_messages_as_conversation.
tools/session_search_tool.py can retrieve past sessions.
Gateway and agent code maintain in-memory caches and active session state across a long-lived process.

Why this is serious:

A persistent agent that serves chat channels must survive gateway replace/restart without losing what it was just doing. If recent-session context is only available when the model chooses to call session_search, then cold-start recovery is probabilistic. This becomes especially painful when the agent restarts during long-running work.

Recommended direction:

Add a mandatory restart-recall acceptance test:

1. Create or update a session with recent messages.
2. Replace/restart the gateway.
3. Verify platform reconnect.
4. Verify no duplicate gateway process remains.
5. Verify the next turn receives a bounded "recent session recall" context from state.db.
6. Verify current user input remains authoritative over recalled context.
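
A minimal sketch of the recall part of this test, assuming a toy gateway rather than the real one. `ToyGateway`, its `messages` table, and `run_restart_recall_check` are hypothetical names made up for illustration; the actual schema of state.db and the gateway restart path are not shown in this issue. The sketch only demonstrates the invariant that recent messages come back from durable storage after the in-memory cache is lost.

```python
import sqlite3
import tempfile
from pathlib import Path


class ToyGateway:
    """Hypothetical stand-in for the gateway: durable rows plus a volatile cache."""

    def __init__(self, db_path: Path) -> None:
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS messages "
            "(id INTEGER PRIMARY KEY, session_id TEXT, role TEXT, content TEXT)"
        )
        self.cache: dict[str, list[str]] = {}  # lost on every restart

    def record(self, session_id: str, role: str, content: str) -> None:
        self.db.execute(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            (session_id, role, content),
        )
        self.db.commit()
        self.cache.setdefault(session_id, []).append(content)

    def recall(self, session_id: str, limit: int = 5) -> list[str]:
        """Return at most `limit` of the most recent messages, oldest first."""
        rows = self.db.execute(
            "SELECT content FROM messages WHERE session_id = ? "
            "ORDER BY id DESC LIMIT ?",
            (session_id, limit),
        ).fetchall()
        return [content for (content,) in reversed(rows)]

    def close(self) -> None:
        self.db.close()


def run_restart_recall_check() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "state.db"
        gateway = ToyGateway(db_path)
        for i in range(8):
            gateway.record("chat-1", "user", f"message {i}")
        gateway.close()  # simulate replace/restart: the process state is dropped

        restarted = ToyGateway(db_path)
        assert restarted.cache == {}  # nothing survives in memory
        recalled = restarted.recall("chat-1", limit=5)
        restarted.close()
        return recalled
```

A real acceptance test would additionally cover platform reconnect, duplicate-process detection, and the rule that current user input wins over recalled context, none of which this toy model exercises.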

4. Tool-policy and loop-safety checks appear to be attached to partial paths

hermescheck found:

Loop detector does not observe all tool-call paths
Permission policy is not enforced on all dispatch paths

The scanner is heuristic, so the exact counts should be manually reviewed. But the architectural risk is real for a runtime with many execution modes: sequential tool calls, concurrent/delegated calls, scheduled jobs, gateway commands, CLI paths, MCP-like tools, and environment-specific tools can easily drift.

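Why this is serious:

Hermes can look guarded because one path has approval, Tirith, loop detection, or command policy, while another path bypasses the same protection. For a high-agency agent, the policy must live at the shared dispatch boundary or be proven equivalent across all dispatch boundaries.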

Recommended direction:

Create one tool-dispatch policy contract that records, for every tool invocation:

tool name
normalized arguments
session/channel identity
approval policy decision
dangerous-command decision
loop-budget decision
result/error status
whether the call was sequential, delegated, scheduled, or backgrounded

Then test representative calls from every execution path.
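
One way to picture such a contract is a single dispatch function that every execution mode must call, so the record and the policy checks cannot be skipped. The sketch below is an assumption-laden illustration: `DispatchRecord`, `Dispatcher`, the deny-list, and the per-session call budget are hypothetical and do not describe Hermes' current approval or loop-detection code. The dangerous-command decision is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class DispatchRecord:
    tool_name: str
    arguments: dict[str, Any]
    session_id: str
    mode: str                 # "sequential", "delegated", "scheduled", "background"
    approval: str             # "allowed" or "denied"
    loop_budget_ok: bool
    status: str = "pending"   # becomes "ok", "error", or "blocked"


@dataclass
class Dispatcher:
    tools: dict[str, Callable[..., Any]]
    denied_tools: set[str] = field(default_factory=set)
    max_calls_per_session: int = 3
    audit_log: list[DispatchRecord] = field(default_factory=list)
    _calls: dict[str, int] = field(default_factory=dict)

    def dispatch(self, tool_name: str, arguments: dict[str, Any],
                 session_id: str, mode: str) -> Any:
        used = self._calls.get(session_id, 0)
        record = DispatchRecord(
            tool_name=tool_name,
            arguments=dict(sorted(arguments.items())),  # normalized key order
            session_id=session_id,
            mode=mode,
            approval="denied" if tool_name in self.denied_tools else "allowed",
            loop_budget_ok=used < self.max_calls_per_session,
        )
        self.audit_log.append(record)  # recorded before any policy outcome
        if record.approval == "denied" or not record.loop_budget_ok:
            record.status = "blocked"
            return None
        self._calls[session_id] = used + 1
        try:
            result = self.tools[tool_name](**arguments)
            record.status = "ok"
            return result
        except Exception:
            record.status = "error"
            raise
```

Because the record is appended before the policy decision takes effect, blocked calls show up in the audit log too, which is what makes coverage across modes testable.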

5. Self-restart/self-upgrade flows need explicit deferred semantics

Hermes has powerful terminal and process-control capabilities. That is useful, but any agent path that restarts or replaces its own gateway can interrupt the current turn, lose delivery, or persist an incomplete transcript.

Recommended direction:

Treat direct in-turn self-restart as an architecture smell. The safe policy should be:

1. Detect restart/kill/replace commands that target the active Hermes process or gateway.
2. Write a deferred restart marker.
3. Finish the current turn.
4. Persist the session.
5. Deliver the user-visible response.
6. Wait for active platform sessions to drain.
7. Restart/replace the gateway.
8. Verify reconnect and recent-session recall.

This should become a regression test, not just a convention.
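
The ordering above could be pinned with something like the following sketch. Everything here is hypothetical: the `SELF_RESTART` pattern is a deliberately crude guess at what a self-targeting command looks like, `TurnRunner` is not a Hermes class, and a boolean stands in for a durable marker file. The point is only that the restart is recorded and postponed until after persistence and delivery.

```python
import re
from dataclasses import dataclass, field

# Assumed pattern for commands that would restart or kill the gateway itself.
SELF_RESTART = re.compile(r"\b(systemctl\s+restart|pkill|kill)\b.*\bhermes\b")


@dataclass
class TurnRunner:
    restart_pending: bool = False   # stands in for a durable restart marker
    events: list[str] = field(default_factory=list)

    def run_command(self, command: str) -> None:
        if SELF_RESTART.search(command):
            self.restart_pending = True  # defer instead of executing in-turn
            self.events.append("restart deferred")
        else:
            self.events.append(f"ran: {command}")

    def finish_turn(self, reply: str) -> None:
        # Persistence and delivery always happen before any restart.
        self.events.append("session persisted")
        self.events.append(f"delivered: {reply}")
        if self.restart_pending:
            self.events.append("sessions drained")
            self.events.append("gateway restarted")
            self.restart_pending = False
```

A regression test can then assert on the order of `events`, which fails as soon as a restart sneaks in ahead of delivery.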

6. Session search exists, but runtime-level recent recall is still needed

Evidence:

tools/session_search_tool.py exposes session search.
agent/prompt_builder.py tells the agent to use session_search when relevant.

Why this is serious:

Prompt guidance is weaker than runtime policy. If the agent is recovering from a restart, it may not know what it forgot. The runtime should provide a small, bounded recent-session packet automatically when the gateway has just restarted, when a channel maps to a session with no in-memory agent, or when cached agent state was evicted.

Recommended direction:

Use state.db as the source of truth and inject a bounded, clearly fenced recent-session context with:

source layer: recent transcript, not durable fact
session id / channel
timestamp
title/preview
max message count and max char budget
instruction that current user input wins on conflict
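
As a rough sketch of what such a packet might look like, the function below renders the newest messages inside a labelled fence while respecting both a message-count and a character budget. `Message`, `build_recall_packet`, and the `<recent-session-recall>` tag are illustrative assumptions, not existing Hermes names or formats; title/preview is omitted.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str
    timestamp: str  # ISO 8601 string, as it might be stored in state.db


def build_recall_packet(session_id: str, channel: str, messages: list[Message],
                        max_messages: int = 4, max_chars: int = 400) -> str:
    """Render the newest messages inside a fence, within both budgets."""
    recent = messages[-max_messages:] if max_messages > 0 else []
    selected: list[str] = []
    used = 0
    for msg in reversed(recent):  # newest first, so the budget favors recency
        line = f"[{msg.timestamp}] {msg.role}: {msg.content}"
        if used + len(line) > max_chars:
            break
        selected.append(line)
        used += len(line)
    selected.reverse()  # restore chronological order for the model
    header = (
        '<recent-session-recall layer="recent transcript, not durable fact" '
        f'session="{session_id}" channel="{channel}">'
    )
    rule = "If this conflicts with the current user input, the current input wins."
    return "\n".join([header, rule, *selected, "</recent-session-recall>"])
```

Filling the budget from the newest message backwards means that when the character limit bites, the oldest context is dropped first.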

7. Memory provenance and freshness are not visible enough

hermescheck found `Memory freshness / generation confusion detected` across session, memory, plugin, README, and other memory-like surfaces.

Why this is serious:

Hermes has multiple memory concepts: durable user facts, procedural skills, session transcripts, summaries, plugin memory, context files, and search results. Without visible provenance, agents may treat stale transcript summaries like current facts or confuse a skill/procedure with user preference.

Recommended direction:

Every injected memory block should identify:

layer: fact / procedure / transcript / summary / context file
source path or session id
timestamp / freshness
confidence
whether it is allowed to override current user input
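
A small sketch of how those fields might travel with each block, assuming a simple dataclass and a one-line text header. `MemoryBlock`, the `LAYERS` set, and the bracketed header format are hypothetical; the issue does not prescribe a concrete representation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

LAYERS = {"fact", "procedure", "transcript", "summary", "context_file"}


@dataclass(frozen=True)
class MemoryBlock:
    layer: str
    source: str              # source path or session id
    recorded_at: datetime
    confidence: float
    may_override_user: bool
    text: str

    def __post_init__(self) -> None:
        if self.layer not in LAYERS:
            raise ValueError(f"unknown memory layer: {self.layer}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")

    def render(self, now: datetime) -> str:
        """Prefix the text with a provenance header the model can read."""
        age_days = (now - self.recorded_at).days
        return (
            f"[memory layer={self.layer} source={self.source} "
            f"age_days={age_days} confidence={self.confidence:.2f} "
            f"may_override_user={str(self.may_override_user).lower()}]\n"
            f"{self.text}"
        )
```

Validating the layer at construction time keeps an unlabelled or mislabelled block from ever reaching the prompt.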

8. Startup and runtime surface sprawl make operations harder than they need to be

hermescheck reported:

Startup surface sprawl detected
Runtime surface sprawl detected

Why this is serious:

Hermes supports CLI, gateway, TUI, web, adapters, cron, MCP-like tools, containers, environments, and plugins. That breadth is a strength, but the canonical operational paths need to be very clear. Otherwise users and contributors debug the wrong runner, wrong service manager, or wrong state file.

Recommended direction:

Document and test two canonical paths:

development foreground path
production/background gateway path

Everything else should be explicitly marked optional, legacy, experimental, or adapter-specific.

9. Raw scanner criticals should be triaged, not ignored

The raw scan flagged unsafe execution patterns and hardcoded secret-like strings. Some are clearly tests, comments, fixtures, or intentionally powerful red-team/sandbox skills. Still, I would not discard the category completely.

Recommended direction:

Add scanner-aware annotations or a repo hygiene policy that separates:

test fixtures
comments/docs
intentionally sandboxed skill code
actual runtime execution paths

The goal is not to remove all exec/eval strings from a system that intentionally runs tools. The goal is to make it obvious which execution surfaces are policy-protected and which are only examples/tests.
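
The separation could start as a simple path-based classifier applied to each scanner hit, sketched below. The bucket names and the directory conventions (`tests`, `test_` prefixes, `skills/`, `optional-skills/`) are assumptions for illustration; `classify_finding` is not part of hermescheck, and a real policy would likely rely on explicit annotations rather than paths alone.

```python
from pathlib import PurePosixPath


def classify_finding(path: str, in_comment: bool = False) -> str:
    """Bucket a scanner hit; the directory conventions here are assumptions."""
    parts = PurePosixPath(path).parts
    name = parts[-1] if parts else ""
    if in_comment or name.endswith((".md", ".rst")):
        return "comment_or_doc"
    if "tests" in parts or name.startswith("test_"):
        return "test_fixture"
    if parts and parts[0] in {"skills", "optional-skills"}:
        return "sandboxed_skill"
    return "runtime_path"  # the only bucket that needs policy review by default
```

Anything that falls through to `runtime_path` is what a reviewer would then check against the dispatch policy.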

10. Output mutation and rendering transformations need a clearer audit trail

The scan found many output transformation sites. Many are probably legitimate platform rendering behavior, but in a multi-channel agent runtime this matters.

Why this is serious:

If raw model output differs from user-visible output, incident analysis needs to know whether a platform adapter, renderer, sanitizer, media pipeline, or response wrapper changed it.

Recommended direction:

For high-risk channels and tool results, log:

raw assistant output
transformation stage
final delivered payload
platform/channel
truncation or media conversion decisions
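
A sketch of such an audit trail, assuming delivery is modelled as an ordered list of named text transforms. `DeliveryTrace` and `deliver` are hypothetical names; real adapters also handle media and structured payloads, which this string-only example ignores.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeliveryTrace:
    channel: str
    raw_output: str
    stages: list[dict] = field(default_factory=list)
    final_payload: str = ""


def deliver(raw: str, channel: str,
            transforms: list[tuple[str, Callable[[str], str]]]) -> DeliveryTrace:
    """Apply each named transform in order and record the ones that changed the text."""
    trace = DeliveryTrace(channel=channel, raw_output=raw)
    current = raw
    for name, transform in transforms:
        updated = transform(current)
        if updated != current:  # no-op stages are not recorded
            trace.stages.append(
                {"stage": name, "chars_before": len(current), "chars_after": len(updated)}
            )
        current = updated
    trace.final_payload = current
    return trace
```

With a trace like this, an incident review can tell whether a sanitizer, truncation step, or adapter changed what the user saw.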

Suggested fix order

1. Define and test the restart-continuity contract.
2. Add runtime-level recent-session recall from state.db for cold starts and channel remaps.
3. Move tool policy/loop observation to a common dispatch contract or prove equivalent coverage across dispatch paths.
4. Split or hard-bound gateway/run.py lifecycle/session/delivery responsibilities.
5. Add provenance-rich memory rendering.
6. Triage raw exec/eval/secret scanner hits into fixture/comment/intentional-runtime categories.
7. Document canonical startup/runtime paths.
8. Add output-transformation audit logs for multi-channel delivery.

Why I think this is important

Hermes already has many strong ingredients: persistent sessions, memory, skills, scheduler, multi-channel gateway, provider flexibility, tool execution, and session search. The risk is not that Hermes lacks capabilities. The risk is that the capability surface is now large enough that reliability depends on architectural contracts rather than local fixes.

In other words: the next quality jump is not "add one more feature"; it is making restart continuity, tool policy coverage, memory provenance, and gateway lifecycle behavior impossible to accidentally bypass.

extent analysis

TL;DR

The most likely fix is architectural rather than local: define and test a restart-continuity contract, add runtime-level recent-session recall, and move tool-policy and loop observation to a common dispatch contract.

Guidance

Define and test a restart-continuity contract to ensure that the gateway can survive restarts without losing session context.
Add runtime-level recent-session recall from state.db to provide a bounded recent-session context after cold starts or channel remaps.
Move tool policy/loop observation to a common dispatch contract to ensure that all tool invocations are recorded and protected by a unified policy.
Split or hard-bound gateway/run.py lifecycle, session, and delivery responsibilities to reduce the risk of incidents and make the code more maintainable.
Add provenance-rich memory rendering to ensure that memory blocks are properly identified and sourced.

Example

# Example of a restart-continuity contract test.
# create_session, restart_gateway, platform_reconnect, duplicate_gateway_process,
# and recent_session_recall are hypothetical test helpers that the test suite
# would have to provide; they are not existing Hermes APIs.
def test_restart_continuity():
    # Create a session with recent messages
    session = create_session()
    # Replace/restart the gateway
    restart_gateway()
    # Verify platform reconnect
    assert platform_reconnect()
    # Verify no duplicate gateway process remains
    assert not duplicate_gateway_process()
    # Verify the next turn receives a bounded "recent session recall" context from state.db
    assert recent_session_recall(session)

Notes

The suggested fix order provided in the issue body should be followed to address the architectural risks. The fixes should be implemented and tested incrementally to ensure that each change does not introduce new issues.

Recommendation

Apply the suggested fixes in the order provided, starting with defining and testing the restart-continuity contract, to address the architectural risks and improve the reliability of the Hermes agent.
