hermes - 💡(How to fix) Fix [Bug]: Prompt Cache / KV Cache Invalidation on Follow-Up Messages Due to Dynamic Tool Shuffling [1 pull requests]



Fix Action

Fixed



Bug Description

When running the agent framework with an OpenAI-compatible backend server (tested with llama-server / ik_llama.cpp), the prompt cache works correctly during internal agent tool loops. However, as soon as the agent loop completes and the user submits a follow-up prompt, the server is forced to reprocess almost the entire prompt from scratch.

Server logs confirm that this is caused by the framework changing the order of tool definitions in the system instructions when it assembles the history payload for the next user turn. For position-dependent architectures and models with recurrent/hybrid memory (like Qwen 3.6), this small structural change invalidates the KV cache from the point of divergence onward and erases every context checkpoint after it.

Steps to Reproduce:

1. Start a session using a model requiring strict context sequence adherence (e.g., Qwen 3.6).

2. Issue a request that requires the agent to call multiple tools.

3. Observe that the internal tool loops execute quickly with functional caching.

4. Once control returns to the user, type a new follow-up prompt and send it.

5. Check the server backend logs. Note the prompt similarity drop and the subsequent mass erasure of context checkpoints.
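
To confirm the reordering from the client side as well as from the server logs, one option is to capture the tool list sent on two consecutive turns and locate the first entry that differs. The sketch below is only illustrative: `first_tool_divergence` and the sample payloads are hypothetical, and the flat `{"name": ...}` shape is assumed from the log excerpt rather than taken from the framework's code.

```python
import json
from typing import Optional


def first_tool_divergence(prev_tools: list[dict], next_tools: list[dict]) -> Optional[int]:
    """Return the index of the first tool whose serialized form differs, or None.

    Tools are compared via json.dumps without sort_keys, so a change in key
    order inside a tool definition is reported too (it also changes the bytes
    the server sees).
    """
    for i, (a, b) in enumerate(zip(prev_tools, next_tools)):
        if json.dumps(a) != json.dumps(b):
            return i
    if len(prev_tools) != len(next_tools):
        # One list is a strict prefix of the other.
        return min(len(prev_tools), len(next_tools))
    return None


# Hypothetical payloads mirroring the two tool names seen in the log.
turn_1 = [{"name": "memory"}, {"name": "feishu_doc_read"}]
turn_2 = [{"name": "feishu_doc_read"}, {"name": "memory"}]

idx = first_tool_divergence(turn_1, turn_2)
if idx is not None:
    print(f"tool order diverges at index {idx}: "
          f"{turn_1[idx]['name']} -> {turn_2[idx]['name']}")
```

A result of `None` for every pair of consecutive turns would suggest the tool list is stable and the cache miss has another cause.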

Expected Behavior

The framework should pass a byte-identical, sequentially static history payload to the inference server across turns. Tool definitions inside the system prompt array should remain locked in a static, predictable order to preserve the backend's prompt cache.
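
One way to get a stable order is to canonicalize the tool list before every request. The following is a minimal sketch under stated assumptions, not the fix that was merged: `canonical_tools` and `serialize_tools` are hypothetical helpers, and sorting by `name` is just one possible policy (freezing the order once at session start would work equally well).

```python
import json


def canonical_tools(tools: list[dict]) -> list[dict]:
    """Return the tools sorted by name, independent of registration order."""
    return sorted(tools, key=lambda tool: tool["name"])


def serialize_tools(tools: list[dict]) -> str:
    """Serialize tools deterministically: fixed tool order, fixed key order,
    fixed separators, so the same set of tools always yields the same bytes."""
    return json.dumps(
        canonical_tools(tools),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )


# Hypothetical tool definitions; the descriptions are placeholders.
registered_a = [
    {"name": "memory", "description": "placeholder"},
    {"name": "feishu_doc_read", "description": "placeholder"},
]
registered_b = list(reversed(registered_a))

# Both registration orders produce an identical prompt fragment.
print(serialize_tools(registered_a) == serialize_tools(registered_b))
```

Whatever ordering policy is chosen, the important property is that it depends only on the set of tools and not on when or how each tool was loaded.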

Actual Behavior / Log Evidence

The server detects a text mismatch early in the prompt history, calculating a critical drop in prompt similarity (sim = 0.159 vs default threshold 0.50). This forces the server to evict tens of thousands of cached tokens and context checkpoints, inducing a heavy time-to-first-token (TTFT) delay.

Relevant Server Log Snippet:


======== Prompt cache: cache size: 43396, n_keep: 0 ... cache_ram_similarity: 0.50
- looking for better prompt, base f_keep = 0.164, sim = 0.159, n_keep = 0
- prompt 0x72ee2024d700:   43396 tokens, checkpoints: 43

Common part does not match fully
cache :  {"name": "memory", "description": "Save durable information to persistent memory...
prompt:  {"name": "feishu_doc_read", "description": "Read the full content of a Feishu/Lark...

slot apply_checkp: id  0 | restored context checkpoint took  24.69 ms (pos_max = 6143)
slot apply_checkp: id  0 | erased invalidated context checkpoint (pos_min = 8191...)
slot apply_checkp: id  0 | erased invalidated context checkpoint (pos_min = 10239...)
...
slot apply_checkp: id  0 | erased invalidated context checkpoint (pos_min = 36628...)
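
The `f_keep` and `sim` figures in the log can be read as ratios of the shared prefix to the cached and incoming prompts. The sketch below assumes those definitions (common prefix length divided by cached length and by new prompt length respectively); the server's exact formulas are not confirmed by this report, so treat the function as an illustration of the idea only.

```python
def common_prefix_len(cached: list[int], new: list[int]) -> int:
    """Count how many leading tokens the two sequences share."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def prefix_stats(cached: list[int], new: list[int]) -> tuple[float, float]:
    """Return (f_keep, sim) under the ASSUMED definitions:
    f_keep = shared prefix / cached length, sim = shared prefix / new length."""
    lcp = common_prefix_len(cached, new)
    f_keep = lcp / len(cached) if cached else 0.0
    sim = lcp / len(new) if new else 0.0
    return f_keep, sim


# Toy token IDs: the prompts agree on the first 16 tokens, then diverge.
cached_tokens = list(range(100))
new_tokens = list(range(16)) + [999] * 84

f_keep, sim = prefix_stats(cached_tokens, new_tokens)
print(f"f_keep = {f_keep:.3f}, sim = {sim:.3f}")
```

Under this reading, a reordered tool near the start of the prompt caps both ratios at a small value no matter how much of the later conversation is unchanged.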

Technical Analysis & Impact

The Root Cause: The log explicitly displays a payload divergence at token index ~6,866. In the cached session state, the "memory" tool was declared. When rebuilding the payload for the follow-up prompt, the framework dynamically pushed the "feishu_doc_read" tool into that position instead.

The Fallout: While the server successfully rolls back to a safe checkpoint at token 6,144, it is forced to clear out all subsequent checkpoints up to token 36,628. The system is then penalized with a full re-prefill of over 37,000 tokens entirely due to a structural mismatch in the tool definitions array.
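
The arithmetic behind the fallout can be sketched as follows. This is a simplified model, not the server's implementation: each checkpoint is reduced to a single position, the positions are the few values visible in the log excerpt, and the cached prompt length (43,396) stands in for the length of the new prompt.

```python
from typing import Optional


def plan_rollback(
    checkpoint_positions: list[int], divergence: int, prompt_len: int
) -> tuple[Optional[int], list[int], int]:
    """Model a rollback after a prefix mismatch.

    A checkpoint at position p is treated as usable only if p < divergence,
    i.e. every token it covers still matches the new prompt.
    Returns (restored position, erased positions, tokens to re-prefill).
    """
    usable = [p for p in checkpoint_positions if p < divergence]
    erased = [p for p in checkpoint_positions if p >= divergence]
    restored = max(usable) if usable else None
    resume_at = restored + 1 if restored is not None else 0
    return restored, erased, prompt_len - resume_at


# Values from the report: divergence near token 6,866, 43,396 cached tokens.
restored, erased, reprefill = plan_rollback(
    checkpoint_positions=[6143, 8191, 10239, 36628],
    divergence=6866,
    prompt_len=43396,
)
print(restored, erased, reprefill)
```

With these inputs the model restores the checkpoint at 6143 and leaves 43,396 − 6,144 = 37,252 tokens to process again, which matches the "over 37,000 tokens" figure above.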

Environment Context

Backend Engine: ik_llama.cpp / llama.cpp server (GGUF deployment)

Model: Qwen 3.6 (27B MoE / Dense variants leveraging hybrid/recurrent SWA memory)


Affected Component

CLI (interactive chat)

Messaging Platform (if gateway-related)

N/A (CLI only)

Debug Report

Debug report uploaded:
  Report       https://paste.rs/jSNNa
  agent.log    https://paste.rs/btUg4
  gateway.log  https://paste.rs/lpGMJ

Operating System

Ubuntu 24.04

Python Version

No response

Hermes Version

No response


Proposed Fix (optional)

No response

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR
