hermes - 💡(How to fix) Fix [Bug]: Honcho auto-injected context rebuilds cached system prompt every N turns, invalidating KV cache on every prefix-caching backend [3 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#13631Fetched 2026-04-22 08:05:11
View on GitHub
Comments
3
Participants
2
Timeline
6
Reactions
0
Timeline (top)
commented ×3labeled ×2cross-referenced ×1

Root Cause

`agent/prompt_builder.py` emits the Honcho static block inside `_build_system_prompt()`, whose result is cached on `self._cached_system_prompt`. The Honcho memory provider's `prefetch_all()` result populates this block and is deliberately refreshed on cadence, contradicting the cached-prompt contract.

The correct injection site already exists: `_ext_prefetch_cache` in `run_agent.py` appends per-turn context to the current user message (same file, ~L6737–6740 — also referenced in #5719). Only a subset of Honcho output currently goes through that path.


Fix Action

Fix / Workaround

  • #5719 — Honcho context block leaking into vision auto-analysis output
  • #5102 — Thread-safety race in `HonchoSessionManager._cache`
  • #4751 — Sequential chat dispatch / session-key / peer-card issues
  • #2150 — Feature request for cached startup context in `recallMode=tools`
RAW_BUFFERClick to expand / collapse

Bug Description

AGENTS.md states a hard policy on prompt caching:

Prompt Caching Must Not Break

Hermes-Agent ensures caching remains valid throughout a conversation. Do NOT implement changes that would:

  • Alter past context mid-conversation
  • Change toolsets mid-conversation
  • Reload memories or rebuild system prompts mid-conversation

Cache-breaking forces dramatically higher costs. The ONLY time we alter context is during context compression.

The Honcho memory provider violates this directly.

Per the Prompt Assembly architecture doc, Layer 3 of the cached system prompt is the "Honcho static block (when active)". Per the Honcho Memory integration doc, that same block is:

  • Base context — session summary, user representation, user peer card, AI self-representation, AI identity card. "Refreshed on contextCadence."
  • Dialectic supplement — LLM-synthesized reasoning about the user's current state. "Refreshed on dialecticCadence."

Every refresh rewrites Layer 3 of the cached prompt mid-conversation, invalidating the entire KV cache on every backend that does prefix caching:

  • Local: llama.cpp, vLLM --enable-prefix-caching, MLX with PagedAttention
  • Cloud: Anthropic / OpenAI / Bedrock with explicit cache control

Causal attention means mid-prefix mutations invalidate everything after the first divergence — every token downstream of the Honcho block must be reprocessed from scratch. On a ~20K-token session that is a multi-minute reprocess on every cadence tick.

The same Prompt Assembly doc already classifies "later-turn Honcho recall injected into the current-turn user message" as an API-call-time-only layer (explicitly not persisted to the cached system prompt) — exactly the correct pattern. The auto-injected base / dialectic block just hasn't been moved over.


Steps to Reproduce

  1. Configure Hermes with recallMode: \"hybrid\" (or \"context\") and any finite contextCadence / dialecticCadence.
  2. Point Hermes at a backend with prefix caching enabled (e.g. local vllm --enable-prefix-caching, llama.cpp, MLX via vLLM-MLX).
  3. Send the first user message — first-turn full prefill is expected.
  4. Send N−1 more messages where N = min(contextCadence, dialecticCadence). Prefix cache hits as expected.
  5. Send one more message. Observe: full prefill from scratch, not just the new tail tokens.

Cross-check: run the same backend + model under a client that does not mutate the system prompt mid-session (e.g. opencode). Prefix cache hits on every turn, confirming the backend is not at fault.


Expected Behavior

Per the stated policy: the cached system prompt is byte-identical for the duration of a session (compression excepted).

Auto-injected Honcho context rides on the current user message, same path as the existing `_ext_prefetch_cache` injection (`run_agent.py` ~L6737–6740), and is therefore ephemeral and cache-safe.


Actual Behavior

Every `contextCadence` / `dialecticCadence` tick rewrites Layer 3 of the cached system prompt.

Observed on Gemma 4 31B 8-bit (MLX) served via a `vllm-mlx` fork with PagedAttention prefix caching:

`contextCadence`Cache-miss ratePer-turn latency on miss
`1`100 % of turns~5 min (≈20K tokens full prefill)
`4`~25 % of turns~5 min on every 4th turn

With `contextCadence: 1` every user turn is a full prefill. Bumping to 4 reduces the rate to one in four, but the underlying design is still wrong — on any multi-turn session, some fraction of turns will always incur a full prefill regardless of how it is tuned.


Root Cause

`agent/prompt_builder.py` emits the Honcho static block inside `_build_system_prompt()`, whose result is cached on `self._cached_system_prompt`. The Honcho memory provider's `prefetch_all()` result populates this block and is deliberately refreshed on cadence, contradicting the cached-prompt contract.

The correct injection site already exists: `_ext_prefetch_cache` in `run_agent.py` appends per-turn context to the current user message (same file, ~L6737–6740 — also referenced in #5719). Only a subset of Honcho output currently goes through that path.


Proposed Fix

  1. Remove the Honcho static block from Layer 3 of `_build_system_prompt()` in `agent/prompt_builder.py`.
  2. Extend `_ext_prefetch_cache` (or an equivalent per-turn slot) in `run_agent.py` to carry the formatted `## Honcho Context` block for the current turn only.
  3. Document and enforce the invariant: the cached system prompt is byte-stable for the session; any Honcho content that varies turn-to-turn must ride on the current user message.

This is the same architectural move already proposed in #3353 for runtime metadata, and enforces the policy stated in `AGENTS.md`. It subsumes the `contextCadence` / `dialecticCadence` knobs under the same correctness guarantee instead of requiring users to manually trade off memory freshness against prefill cost.


Environment

ComponentValue
Hermes Agentv0.10.0 (2026.4.16) — 34 commits behind `main`
Python3.11.15
OpenAI SDK2.31.0
OSmacOS 26.4.1 (Darwin 25.4.0, `arm64 T6020` / M2 Pro)
Client host`sparta.local`
Inference backendSeparate M2 Ultra (128 GB) running a custom `mlx-hosting` glue in front of a `vllm-mlx` fork with PagedAttention prefix caching; model Gemma 4 31B 8-bit MLX
HonchoSelf-hosted, `baseUrl: http://127.0.0.1:8000\`

Relevant Honcho config (`~/.hermes/honcho.json`):

```json { "recallMode": "hybrid", "contextCadence": 4, "dialecticCadence": 4, "dialecticDepth": 1, "dialecticReasoningLevel": "medium", "observationMode": "directional", "writeFrequency": "async", "sessionStrategy": "per-directory" } ```


Related Issues

Same architectural class — mid-session mutation of cached prompt state:

  • #3353 — `perf(ttft): move runtime metadata out of the cached system prompt` — same principle, different varying field; this bug is a much larger, more frequently varying instance of the same anti-pattern.
  • #8687 — `perf: system prompt timestamp changes after compression, invalidating prefix cache on local LLM backends` — same class, timestamp specifically.
  • #4319 — `[Performance] KV cache invalidation on compression hurts local MoE models` — same class, compression trigger.
  • #4555 — `[Bug]: KV cache invalidation on new user message due to message format differences between agentic loop and session reload` — byte-stability principle for messages on reload.
  • #13442 — `[Performance] Missing global conversation history state causes 314x slowdown on llama.cpp with Qwen3.5-27B` — downstream symptom of the same invariant being violated.

Adjacent Honcho issues (for completeness, not duplicates of this bug):

  • #5719 — Honcho context block leaking into vision auto-analysis output
  • #5102 — Thread-safety race in `HonchoSessionManager._cache`
  • #4751 — Sequential chat dispatch / session-key / peer-card issues
  • #2150 — Feature request for cached startup context in `recallMode=tools`

extent analysis

TL;DR

Move the Honcho static block from the cached system prompt to a per-turn injection, ensuring the cached system prompt remains byte-stable throughout a session.

Guidance

  1. Remove the Honcho static block from Layer 3 of _build_system_prompt() in agent/prompt_builder.py to prevent mid-conversation cache invalidation.
  2. Extend _ext_prefetch_cache in run_agent.py to carry the formatted ## Honcho Context block for the current turn only, keeping it out of the cached system prompt.
  3. Verify the fix by checking that the cached system prompt does not change between turns, using tools like debugging logs or cache inspection, and confirm that prefix cache hits are consistent.
  4. Test with varying contextCadence and dialecticCadence to ensure the fix works across different refresh rates and does not introduce new issues.

Example

No code snippet is provided as the issue description already outlines the necessary changes, focusing on moving the Honcho context block to a per-turn injection.

Notes

This fix addresses a specific instance of a broader architectural issue related to mid-session mutations of cached prompt state, as seen in related issues like #3353. Ensuring the cached system prompt's byte-stability is crucial for maintaining performance, especially with prefix caching enabled.

Recommendation

Apply the proposed workaround by moving the Honcho static block to a per-turn injection, as this directly addresses the root cause of the cache invalidation issue and aligns with the stated policy in AGENTS.md.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - 💡(How to fix) Fix [Bug]: Honcho auto-injected context rebuilds cached system prompt every N turns, invalidating KV cache on every prefix-caching backend [3 comments, 2 participants]