hermes - ✅(Solved) Fix [Bug]: Background self-improvement review's fresh AIAgent generates a system prompt that bytes-differs from the parent, busting prompt cache + preprocessing tree-dedup [1 pull requests, 1 comments, 2 participants]

hermes 2026-05-14 00:47:48


NousResearch/hermes-agent#25322•Fetched 2026-05-14 03:47:16


Author

simpolism

Participants

alt-glitch

simpolism

Timeline (top)

labeled ×4, commented ×1, cross-referenced ×1

Root Cause

The narrow-toolset design of _spawn_background_review is intentional — it traces to the safety motivation in #15204 (don't let the review accidentally invoke terminal/send_message/delegate_task side-effects). But narrow-toolset = different stable-tier system prompt = no cache hit. The cache-bypass is a side effect of the safety mechanism, not an independent bug.

Fix Action

Fix / Workaround

Construct two AIAgent instances simulating the parent → background-review spawn. Patch hermes_time.now so we don't have to actually wait 60s for the timestamp to tick:

# Patch the clock BEFORE importing run_agent so the first call inside
# _build_system_prompt_parts sees the patched version.
_fake_now = [dt.datetime(2026, 5, 14, 14, 30, 15)]
hermes_time.now = lambda: _fake_now[0]

The tree-dedup fragmentation (problem 2 in the bug description) is real for users who post-process their Hermes traces for analytics or training. It's separate from the cache problem and is fixed by any solution that keeps the system bytes static across the parent → review boundary — even Option C might help here if it gets the prefix far enough into the system string for the dedup chain to extend.

PR fix notes

PR #25330: fix: reuse parent system prompt for background review to preserve cache bytes

Repository: NousResearch/hermes-agent
Author: NeroNarada
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/25330

Description (problem / solution / changelog)

Summary

Background self-improvement review currently builds a new AIAgent with a freshly generated system prompt, which can differ from the parent agent and break prefix cache / preprocessing tree-dedup behavior. This change reuses the parent’s already-built system prompt when spawning the review agent, while keeping the review toolset restricted to memory/skills.

Changes

- Add optional cached_system_prompt: Optional[str] = None to AIAgent.__init__.
- Seed self._cached_system_prompt from that value during initialization.
- In _spawn_background_review, pass:
  - session_id=self.session_id
  - pass_session_id=self.pass_session_id
  - cached_system_prompt (from parent cache, or built lazily if absent)
- Keep enabled_toolsets=["memory", "skills"] unchanged.
- Add regression coverage in background review tests:
  - review fork receives inherited system prompt and session settings,
  - fallback to a parent-built prompt when cache is absent.
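
The change list above can be sketched in isolation. This is a minimal illustration, not the PR's code: SketchAgent, system_prompt, and spawn_review are hypothetical stand-ins for AIAgent and _spawn_background_review, and the prompt text is a toy. Only the cached_system_prompt parameter and the ["memory", "skills"] restriction come from the PR description.

```python
from typing import List, Optional


class SketchAgent:
    """Hypothetical stand-in for AIAgent; only the prompt-caching path is modeled."""

    def __init__(
        self,
        enabled_toolsets: List[str],
        session_id: str,
        cached_system_prompt: Optional[str] = None,
    ) -> None:
        self.enabled_toolsets = enabled_toolsets
        self.session_id = session_id
        # Seed the cache from the caller when a prompt is supplied.
        self._cached_system_prompt = cached_system_prompt

    def system_prompt(self) -> str:
        # Build lazily only when nothing was inherited or built before.
        if self._cached_system_prompt is None:
            tools = ", ".join(sorted(self.enabled_toolsets))
            self._cached_system_prompt = (
                f"Tools: {tools}\nSession ID: {self.session_id}"
            )
        return self._cached_system_prompt

    def spawn_review(self) -> "SketchAgent":
        # The narrow toolset stays in force; only the prompt text is inherited.
        return SketchAgent(
            enabled_toolsets=["memory", "skills"],
            session_id=self.session_id,
            cached_system_prompt=self.system_prompt(),
        )


parent = SketchAgent(["memory", "skills", "terminal"], "20260513_204511_d02a5d")
review = parent.spawn_review()
print(review.system_prompt() == parent.system_prompt())
print(review.enabled_toolsets)
```

In this sketch the review's prompt still mentions the terminal toolset even though its own toolset list excludes it, which mirrors the trade-off the issue describes for Option A.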

Motivation

This keeps the system prompt byte-identical across the parent/review boundary without relaxing the tool-restriction safety model from #15204.

Issue

Fixes #25322

Changed files

agent/auxiliary_client.py (modified, +37/-2)
cron/jobs.py (modified, +58/-12)
cron/scheduler.py (modified, +2/-2)
run_agent.py (modified, +7/-1)
tests/agent/test_auxiliary_client.py (modified, +35/-3)
tests/cron/test_cron_runtime_profile_paths.py (added, +38/-0)
tests/run_agent/test_background_review.py (modified, +36/-0)
tests/run_agent/test_background_review_toolset_restriction.py (modified, +8/-0)


Bug Description

Background self-improvement review (_spawn_background_review in run_agent.py) creates a fresh AIAgent for the review pass. That fresh agent regenerates its system prompt from scratch, which produces a system message whose bytes differ from the parent's at several points — even when the user has only one logical conversation going.

That byte difference triggers two distinct downstream problems:

1. Anthropic prompt cache miss. The parent's system bytes are cached at the model edge after the parent's first turn. The review pass's HTTP request to the same model carries a different system string, so the cache lookup fails. Every skill/memory nudge that triggers a review pays the full uncached system-prompt cost (large, since Hermes' system prompt is multi-kB after SOUL/skills/guidance blocks).
2. Preprocessing tree-dedup fork. Hermes Agent's preprocessing pipeline (LiteLLM-side, on the training-data path I work with) dedupes conversations by walking a hash chain of canonical messages. A system-prompt fork at position 0 produces two separate "conversation leaves" for what is logically one user session: the review pass becomes its own leaf instead of collapsing into the parent's, fragmenting the training-data signal.
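
To make the second problem concrete, here is a rough sketch of hash-chain dedup. It is an assumption about how such a pipeline could work, not the actual preprocessing code: chain_hashes, count_leaves, and the (role, content) message shape are all hypothetical.

```python
import hashlib
from typing import List, Set, Tuple

Message = Tuple[str, str]  # (role, content)


def chain_hashes(messages: List[Message]) -> List[str]:
    """Return the running hash after each message; each hash covers the whole prefix."""
    hashes: List[str] = []
    running = b""
    for role, content in messages:
        running = hashlib.sha256(running + f"{role}\x00{content}".encode()).digest()
        hashes.append(running.hex())
    return hashes


def count_leaves(conversations: List[List[Message]]) -> int:
    """A conversation is a leaf when no other conversation extends its hash chain."""
    chains = [chain_hashes(c) for c in conversations]
    extended: Set[str] = set()
    for chain in chains:
        extended.update(chain[:-1])  # every proper prefix has been extended
    return len({chain[-1] for chain in chains if chain[-1] not in extended})


parent = [("system", "prompt A"), ("user", "hi")]
# Same system bytes: the review extends the parent's chain.
review_same = parent + [("user", "review this session")]
# Different system bytes: the chain forks at position 0.
review_forked = [("system", "prompt B"), ("user", "hi"), ("user", "review this session")]

print(count_leaves([parent, review_same]))
print(count_leaves([parent, review_forked]))
```

Because each hash folds in everything before it, a single changed byte in the system message changes every later hash, so the two requests can never share a chain.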

Recent PR #24778 (fix(cache): kill long-lived prefix layout — system prompt is now byte-static within a session) fixed the intra-session cache-bypass mode (the long-lived layout's volatile tier ticking per turn). The bug filed here is adjacent but distinct — it's the cross-AIAgent-instance divergence, not intra-instance mutation. Both bugs prevent cached_tokens from rising; this one survives #24778.

Three causes of the bytes-difference (all on current `main` after #24778)

In _build_system_prompt_parts (run_agent.py:5810–6020):

1. Conversation started: timestamp line. Built fresh per instance from _hermes_now() at the moment the system prompt is first assembled. The parent assembled this ~N minutes ago; the review's fresh AIAgent assembles it now. Bytes differ unless the two calls happen to fall in the same printed minute (the format is "%A, %B %d, %Y %I:%M %p").
2. Session ID: line. self.session_id is freshly generated in __init__ (f"{timestamp_str}_{short_uuid}", run_agent.py:1894). The parent and the review fork have different session_ids unless one is explicitly passed.
3. skills_prompt available-skills list. _build_system_prompt_parts calls build_skills_system_prompt(...), which enumerates the skills available to load; that availability is gated by which toolsets are enabled (e.g. a skill that needs the terminal toolset doesn't appear when enabled_toolsets=["memory","skills"]). So the parent's skills_prompt lists more skills than the review's. The tool-aware guidance blocks (MEMORY_GUIDANCE, SESSION_SEARCH_GUIDANCE, SKILLS_GUIDANCE, etc. in run_agent.py:5849–5866) are gated on self.valid_tool_names, so they can also differ if the review's narrower toolset drops one, but in practice memory + skills already pull in their respective guidance blocks, so this contributes less than the skills-list divergence does.

(The model: / provider: lines below the timestamp do match in practice because the review inherits parent's model + provider, so they don't contribute to the divergence.)
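
The first two causes can be reproduced without Hermes at all. The sketch below uses the timestamp format quoted above; started_line and new_session_id are hypothetical helpers, and the "%Y%m%d_%H%M%S" shape of timestamp_str and the 6-hex-digit suffix are assumptions inferred from IDs such as 20260513_204511_d02a5d.

```python
import datetime as dt
import uuid

STAMP_FORMAT = "%A, %B %d, %Y %I:%M %p"  # format quoted in the issue


def started_line(now: dt.datetime) -> str:
    """Render the 'Conversation started:' line at minute resolution."""
    return "Conversation started: " + now.strftime(STAMP_FORMAT)


def new_session_id(now: dt.datetime) -> str:
    """Assumed ID shape: timestamp string plus a short random hex suffix."""
    return f"{now.strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:6]}"


t0 = dt.datetime(2026, 5, 14, 14, 30, 15)
same_minute = t0 + dt.timedelta(seconds=30)
next_minute = t0 + dt.timedelta(seconds=90)

# Seconds are not printed, so only a minute rollover changes the bytes.
print(started_line(t0) == started_line(same_minute))
print(started_line(t0) == started_line(next_minute))

# The random suffix makes two IDs differ even for the same instant.
print(new_session_id(t0), new_session_id(t0))
```

The timestamp line is therefore only occasionally identical, while a freshly generated session ID is effectively never identical, which is why passing the parent's session_id explicitly matters.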

Steps to Reproduce

Construct two AIAgent instances simulating the parent → background-review spawn. Patch hermes_time.now so we don't have to actually wait 60s for the timestamp to tick:

import sys, hashlib, difflib
import datetime as dt
import hermes_time

# Patch the clock BEFORE importing run_agent so the first call inside
# _build_system_prompt_parts sees the patched version.
_fake_now = [dt.datetime(2026, 5, 14, 14, 30, 15)]
hermes_time.now = lambda: _fake_now[0]

from run_agent import AIAgent

# Parent — typical CLI session
parent = AIAgent(
    model="anthropic/claude-opus-4.6",
    platform="cli",
    enabled_toolsets=["memory", "skills", "terminal", "file", "web"],
    quiet_mode=True, skip_context_files=True, skip_memory=True,
    api_key="fake-for-test", base_url="http://localhost",
    pass_session_id=True,
)
parent_sys = parent._build_system_prompt(system_message=None)

# Advance the simulated clock 90s so the printed minute changes
_fake_now[0] = dt.datetime(2026, 5, 14, 14, 31, 45)

# Review — same constructor pattern as _spawn_background_review on current main
review = AIAgent(
    model=parent.model, platform=parent.platform, provider=parent.provider,
    enabled_toolsets=["memory", "skills"],
    parent_session_id=parent.session_id,
    quiet_mode=True, skip_context_files=True, skip_memory=True,
    api_key="fake-for-test", base_url="http://localhost",
    pass_session_id=True,
)
review_sys = review._build_system_prompt(system_message=None)

print(f"parent sha256: {hashlib.sha256(parent_sys.encode()).hexdigest()[:16]}")
print(f"review sha256: {hashlib.sha256(review_sys.encode()).hexdigest()[:16]}")
print(f"bytes equal:   {parent_sys == review_sys}")
print()
for line in difflib.unified_diff(parent_sys.splitlines(), review_sys.splitlines(),
                                  lineterm="", n=1):
    print(line)

Actual output (verified on current main at e2b2d4861):

parent sha256: fbaa5dd90fafad27
review sha256: 59a080afc0518dca
bytes equal:   False

---
+++
@@ -153,3 +153,2 @@
     - linear: Linear: manage issues, projects, teams via GraphQL + curl.
-    - maps: Geocode, POIs, routes, timezones via OpenStreetMap/OSRM.
     - nano-pdf: Edit PDF text/typos/titles via nano-pdf CLI (NL prompts).
@@ -201,4 +200,4 @@

-Conversation started: Thursday, May 14, 2026 02:30 PM
-Session ID: 20260513_204511_d02a5d
+Conversation started: Thursday, May 14, 2026 02:31 PM
+Session ID: 20260513_204511_9e319e
 Model: anthropic/claude-opus-4.6

That diff shows all three causes: the maps skill (and other terminal/web-gated skills) is in parent's skills_prompt but not review's; timestamp differs by one minute; session_id is different. The cache prefix-match fails at the FIRST bytes-difference, which is the maps line in this run — but in practice any of the three is enough to bust the cache.
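
A quick way to see where a prefix match would stop is to locate the first differing byte. The helper below is a hypothetical diagnostic, not part of Hermes, and the two strings are toy stand-ins for the real prompts.

```python
def first_divergence(a: str, b: str) -> int:
    """Return the byte offset where the UTF-8 encodings of a and b first differ.

    Returns -1 when the two encodings are identical.
    """
    a_bytes, b_bytes = a.encode("utf-8"), b.encode("utf-8")
    for offset, (x, y) in enumerate(zip(a_bytes, b_bytes)):
        if x != y:
            return offset
    if len(a_bytes) == len(b_bytes):
        return -1
    # One string is a strict prefix of the other.
    return min(len(a_bytes), len(b_bytes))


parent_toy = "skills:\n- linear\n- maps\n- nano-pdf\nSession ID: d02a5d"
review_toy = "skills:\n- linear\n- nano-pdf\nSession ID: 9e319e"

offset = first_divergence(parent_toy, review_toy)
print(offset, repr(parent_toy[:offset]))
```

Running the same helper on parent_sys and review_sys from the repro script should point at the skills list rather than at the timestamp or session lines, since the skills list appears earlier in the prompt.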

Expected Behavior

When the background review fires for an active session, its outbound HTTP request to the model carries the same system bytes as the parent's outbound requests. The Anthropic cache (and OpenRouter / Nous Portal equivalents) sees the cached prefix and returns cached_tokens > 0. The preprocessing tree-dedup collapses the review into the parent's leaf rather than forking at position 0.

Actual Behavior

cached_tokens == 0 on every background-review HTTP call (the byte-mismatch above means the cache prefix can never extend into the review's request — confirmed by the diff, since Anthropic's caching is by exact-prefix-match). On the training-data side, review-spawned LiteLLM calls show up as separate is_leaf conversations rather than as extensions of the parent's prefix chain, because the preprocessing tree-dedup walks the same hash chain that the cache lookup uses.

Affected Component

Core agent (run_agent.py, prompt builder)

Operating System

Ubuntu 24.04

Hermes Version

Reproduced against current main (commit at time of filing: e2b2d4861 fix(cli): preserve startup banner on terminal resize, which already includes #24778's cache refactor).

Root Cause Analysis

Possible Fixes (three options, each with a different trade-off)

I have a local fix that ships within my fork — happy to upstream whichever option the maintainers prefer, or pass on writing the PR if maintainers want to design this themselves given the safety implications. The options:

Option A: Inherit _cached_system_prompt from parent verbatim; keep enabled_toolsets=["memory","skills"] on the review.

The review's outbound system string is bit-identical to parent's. Cache hits, tree-dedup collapses. Safety is preserved mechanically: the review agent's actual tool_schemas (sent in the API call's tools field) still come from its narrow toolset, so the model literally cannot call terminal/send_message/etc. The system prompt would "describe" tools the review can't call, which is mildly leaky but no worse than describing tools the model chose not to use. Pair with an explicit prompt-layer instruction ("Only use memory/skill tools") for belt-and-suspenders.

Lowest risk. Question: does anything elsewhere in the codebase assume _cached_system_prompt content matches tool_schemas exactly? I haven't audited that fully.

Option B: Move the safety boundary entirely to the prompt layer.

Spawn with enabled_toolsets=self.enabled_toolsets (inherit parent's full set), inherit parent's _cached_system_prompt, and rely on a hardcoded prompt-layer instruction to keep the review from invoking dangerous tools. This is what my local fork does. Maximum cache hit. Requires trusting prompt instructions for safety, which is a real relaxation of the model in #15204. Upstream test test_background_review_agent_uses_restricted_toolsets would need to be inverted.

Higher risk on safety, cleanest on cache.

Option C: Inherit only session_start + session_id; rebuild the rest based on review's narrow toolset.

Eliminates the easy bytes-difference sources (timestamp + UUID) without touching the safety story at all. Tool-guidance + skills_prompt divergence remains, so the cache hit is partial — the volatile tier and most of the stable tier match, but the tool-guidance block in the middle still differs, which I believe is enough to bust the cache prefix anyway (Anthropic's cache is by-prefix, not by-overall-similarity).

Smallest change, smallest safety implication, but I think it doesn't actually solve the cache-hit problem unless I'm wrong about prefix semantics.
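
The concern about Option C can be checked on a toy prompt. The sketch assumes pure prefix matching, as described above, and assumes the skills list precedes the session lines, which is the order the diff line numbers suggest. build_prompt and shared_prefix_len are hypothetical helpers.

```python
import os
from typing import List


def build_prompt(skills: List[str], started: str, session_id: str) -> str:
    """Toy layout: skills list first, then the session lines."""
    skill_lines = "\n".join(f"    - {name}" for name in skills)
    return (
        f"Available skills:\n{skill_lines}\n\n"
        f"Conversation started: {started}\n"
        f"Session ID: {session_id}\n"
    )


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix, in characters."""
    return len(os.path.commonprefix([a, b]))


parent = build_prompt(["linear", "maps", "nano-pdf"], "02:30 PM", "d02a5d")
# Current behavior: narrow skills list, fresh timestamp and session ID.
review_now = build_prompt(["linear", "nano-pdf"], "02:31 PM", "9e319e")
# Option C: inherit timestamp and session ID, keep the narrow skills list.
review_c = build_prompt(["linear", "nano-pdf"], "02:30 PM", "d02a5d")

print(shared_prefix_len(parent, review_now))
print(shared_prefix_len(parent, review_c))
print(len(parent))
```

Under these assumptions both review variants share exactly the same prefix with the parent, because the skills list diverges before the inherited lines are reached. That is consistent with the suspicion that Option C alone does not recover the cache hit.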

Note on training-data signal

Proposed Fix

(See "Possible Fixes" above — three options, trade-offs differ on safety vs cache-hit.) Filing this as an issue rather than going straight to a PR because the safety question merits maintainer judgment; happy to write whichever option maintainers prefer.

Contribution

I'd like to fix this myself and submit a PR

(once we've agreed on the right option)
