hermes - [Bug]: Prompt Cache / KV Cache Invalidation on Follow-Up Messages Due to Dynamic Tool Shuffling

Root Cause

Server logs confirm that the framework mutates or shuffles the order of tool definitions in the system instructions when it assembles the history payload for a subsequent user turn. For position-dependent architectures and models with recurrent/hybrid memory (like Qwen 3.6), this small structural change invalidates the KV cache from the point of divergence onward and causes the server to erase every downstream context checkpoint.

Code Example

======== Prompt cache: cache size: 43396, n_keep: 0 ... cache_ram_similarity: 0.50
- looking for better prompt, base f_keep = 0.164, sim = 0.159, n_keep = 0
- prompt 0x72ee2024d700:   43396 tokens, checkpoints: 43

Common part does not match fully
cache :  {"name": "memory", "description": "Save durable information to persistent memory...
prompt:  {"name": "feishu_doc_read", "description": "Read the full content of a Feishu/Lark...

slot apply_checkp: id  0 | restored context checkpoint took  24.69 ms (pos_max = 6143)
slot apply_checkp: id  0 | erased invalidated context checkpoint (pos_min = 8191...)
slot apply_checkp: id  0 | erased invalidated context checkpoint (pos_min = 10239...)
...
slot apply_checkp: id  0 | erased invalidated context checkpoint (pos_min = 36628...)


Bug Description

When running the agent framework against an OpenAI-compatible backend server (tested with llama-server / ik_llama.cpp), the prompt cache works properly during internal agent tool loops. However, as soon as the agent loop completes and the user submits a follow-up prompt, the payload the framework sends no longer matches the cached prompt, and the server re-processes almost the entire context from scratch.

Steps to Reproduce:

1. Start a session using a model requiring strict context sequence adherence (e.g., Qwen 3.6).

2. Issue a request that requires the agent to call multiple tools.

3. Observe that the internal tool loops execute quickly with functional caching.

4. Once control returns to the user, type a new follow-up prompt and send it.

5. Check the server backend logs. Note the prompt similarity drop and the subsequent mass erasure of context checkpoints.

Expected Behavior

The framework should pass a byte-identical, sequentially static history payload to the inference server across turns. Tool definitions inside the system prompt array should remain locked in a static, predictable order to preserve the backend's prompt cache.
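One way to get a stable order, sketched under assumptions: sort the tool definitions by name before every request and serialize them with a fixed key order. The helpers `tool_name`, `canonical_tools` and `serialize_tools` are hypothetical and not part of Hermes. The tool dicts assume the OpenAI-style nested shape, the `terminal` tool and all descriptions are placeholders, and it is an assumption that the backend's chat template renders tools in the order they are sent.

```python
import json
import random


def tool_name(tool: dict) -> str:
    """Return the name of a flat or an OpenAI-style nested tool definition."""
    return tool.get("function", tool)["name"]


def canonical_tools(tools: list[dict]) -> list[dict]:
    """Sort tools by name so the order never depends on registration order."""
    return sorted(tools, key=tool_name)


def serialize_tools(tools: list[dict]) -> str:
    """Serialize with sorted keys and fixed separators so equal inputs give equal bytes."""
    return json.dumps(canonical_tools(tools), sort_keys=True, separators=(",", ":"))


# Placeholder definitions; only the names "memory" and "feishu_doc_read" come from the log.
tools = [
    {"type": "function", "function": {"name": "memory", "description": "placeholder"}},
    {"type": "function", "function": {"name": "feishu_doc_read", "description": "placeholder"}},
    {"type": "function", "function": {"name": "terminal", "description": "placeholder"}},
]

shuffled = tools[:]
random.shuffle(shuffled)

# The serialized form is identical no matter how the input list was ordered.
print(serialize_tools(tools) == serialize_tools(shuffled))  # True
```

Sorting by name is only one possible policy. Freezing the list once per session and reusing it for every turn would also keep the prefix stable, and would avoid reordering when a tool is added mid-session.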

Actual Behavior / Log Evidence

The server detects a text mismatch early in the prompt history, calculating a critical drop in prompt similarity (sim = 0.159 vs default threshold 0.50). This forces the server to evict tens of thousands of cached tokens and context checkpoints, inducing a heavy time-to-first-token (TTFT) delay.
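To illustrate why an early mismatch is so costly, the sketch below measures how much of a new prompt can be reused as a shared prefix of the cached one. It is an illustration only: `common_prefix_len` and `prefix_similarity` are hypothetical helpers, the token lists are made up, and the server's actual `sim` and `f_keep` calculations may differ.

```python
def common_prefix_len(cached: list[int], prompt: list[int]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def prefix_similarity(cached: list[int], prompt: list[int]) -> float:
    """Fraction of the new prompt covered by the shared prefix."""
    if not prompt:
        return 0.0
    return common_prefix_len(cached, prompt) / len(prompt)


cached = list(range(100))  # stand-in for the cached token sequence

# Case 1: the follow-up turn only appends tokens, so the whole cache is reusable.
appended = cached + [100, 101, 102, 103, 104]
print(common_prefix_len(cached, appended))  # 100

# Case 2: the same tokens, but two early blocks (e.g. two tool definitions) are swapped.
reordered = cached[:15] + cached[40:60] + cached[15:40] + cached[60:]
print(common_prefix_len(cached, reordered))  # 15
print(prefix_similarity(cached, reordered))  # 0.15
```

In the second case every token is still present, yet only the first 15 can be reused, which mirrors the situation in the log where a reordered tool block near the start of the prompt discards most of the cache.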


Technical Analysis & Impact

The Root Cause: The log explicitly displays a payload divergence at token index ~6,866. In the cached session state, the "memory" tool was declared. When rebuilding the payload for the follow-up prompt, the framework dynamically pushed the "feishu_doc_read" tool into that position instead.
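A divergence like this can be located on the client side by comparing the tool names of two consecutive request payloads. The following is a minimal sketch: `first_divergent_tool` is a hypothetical helper, the name lists and their positions are illustrative (only `memory` and `feishu_doc_read` appear in the log), and how the payloads are captured from Hermes is left open.

```python
from itertools import zip_longest


def first_divergent_tool(cached_names, new_names):
    """Return (index, cached_name, new_name) for the first mismatch, or None if identical."""
    for i, (old, new) in enumerate(zip_longest(cached_names, new_names)):
        if old != new:
            return i, old, new
    return None


# Tool order in the previous request versus the follow-up request (illustrative).
cached_names = ["terminal", "memory", "feishu_doc_read"]
new_names = ["terminal", "feishu_doc_read", "memory"]

print(first_divergent_tool(cached_names, new_names))     # (1, 'memory', 'feishu_doc_read')
print(first_divergent_tool(cached_names, cached_names))  # None
```

`zip_longest` is used so that a tool added to or dropped from the end of the list is also reported, with `None` standing in for the missing side.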

The Fallout: While the server successfully rolls back to a safe checkpoint at token 6,144, it is forced to clear out all subsequent checkpoints up to token 36,628. The system is then penalized with a full re-prefill of over 37,000 tokens entirely due to a structural mismatch in the tool definitions array.
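The size of the re-prefill follows directly from the numbers in the log. The short calculation below assumes that checkpoint positions are zero-based, so `pos_max = 6143` corresponds to 6,144 retained tokens, and it ignores the extra tokens of the new user message, which makes the result a lower bound.

```python
cache_tokens = 43396     # "cache size: 43396" in the server log
restored_pos_max = 6143  # "restored context checkpoint ... (pos_max = 6143)"

tokens_kept = restored_pos_max + 1  # zero-based positions (assumption)
tokens_to_reprocess = cache_tokens - tokens_kept
fraction = tokens_to_reprocess / cache_tokens

print(tokens_kept)          # 6144
print(tokens_to_reprocess)  # 37252
print(f"{fraction:.1%}")    # 85.8%
```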

Environment Context

Backend Engine: ik_llama.cpp / llama.cpp server (GGUF deployment)

Model: Qwen 3.6 (27B MoE / Dense variants leveraging hybrid/recurrent SWA memory)

Affected Component

CLI (interactive chat)

Messaging Platform (if gateway-related)

N/A (CLI only)

Debug Report

Debug report uploaded:
  Report       https://paste.rs/jSNNa
  agent.log    https://paste.rs/btUg4
  gateway.log  https://paste.rs/lpGMJ

Operating System

Ubuntu 24.04

Python Version

No response

Hermes Version

No response
