hermes: [Feature] Reuse KV cache during compression

hermes · 2026-05-21 13:50:48


Problem or Use Case

Suggestion

Instead of making compression a separate, context-free request, append the summarization prompt as a user message directly in the ongoing conversation. Let the main model produce the summary as its next reply, using all the existing context (system prompt, tool schemas, previous turns) that's already in the window.

For example, when compression is needed, inject a message such as:

"The conversation has grown long. Please produce a compact handoff summary of all messages before this one, using the structured template below. Then continue with the user's most recent request."

The model replies with the summary text. That summary replaces the earlier turns, and the conversation continues from where it left off.
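A minimal sketch of the inline flow described above, not the project's actual implementation. The names `compress_inline`, `complete`, `keep_last`, and `SUMMARY_PROMPT` are hypothetical, `complete` stands in for whatever function sends a message list to the main model and returns its reply text, and storing the summary as an assistant message is an assumption. Role-alternation handling is left out.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# Hypothetical prompt text, taken from the example wording in this issue.
SUMMARY_PROMPT = (
    "The conversation has grown long. Please produce a compact handoff "
    "summary of all messages before this one, using the structured template "
    "below. Then continue with the user's most recent request."
)


def compress_inline(
    messages: List[Message],
    complete: Callable[[List[Message]], str],
    keep_last: int = 1,
) -> List[Message]:
    """Ask the main model for a summary inside the ongoing conversation.

    The request is the unchanged history plus one appended user message,
    so a provider-side prefix cache for the history can still be hit.
    """
    request = list(messages) + [{"role": "user", "content": SUMMARY_PROMPT}]
    summary = complete(request)

    # Keep system messages, replace older turns with the summary, and keep
    # the most recent `keep_last` non-system messages (an assumption).
    head = [m for m in messages if m["role"] == "system"]
    body = [m for m in messages if m["role"] != "system"]
    tail = body[-keep_last:] if keep_last > 0 else []
    return head + [{"role": "assistant", "content": summary}] + tail
```

The important property is that the request sent to the model starts with exactly the same messages as the previous request, so only the appended prompt is new to the cache.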

Why this might be better

KV cache fully reused. The existing conversation context is already cached on the server side. Appending a new message extends that cache incrementally, so the provider does not have to recompute the prefix. On providers like DeepSeek, cached tokens cost about 1/10 the price of non-cached ones, sometimes as low as 1/100. The current separate-request approach throws that away entirely: the summarizer call pays full price for every token, even though the content it summarizes is already sitting in the cached conversation.

Cheaper. Compression today effectively doubles the inference cost of the turn that triggers it (the main request plus a separate summarizer request, both at full price). Inline summarization adds only the cost of extending the cache.

Faster. No separate round trip to the model provider. The summarization happens as part of the normal response flow.
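The cost argument can be made concrete with a small back-of-the-envelope model. This is a sketch under stated assumptions: it counts input tokens only (output tokens are ignored), the function names are hypothetical, and the prices and token counts are made up for illustration. Only the roughly 1/10 cached-to-uncached ratio comes from the text above.

```python
def separate_summarizer_cost(slice_tokens: int, price_uncached: float) -> float:
    """Input cost of the independent summarizer call.

    The serialized slice is sent as a brand-new prompt, so every token is
    billed at the full (cache-miss) price.
    """
    return slice_tokens * price_uncached


def inline_summary_cost(
    context_tokens: int,
    prompt_tokens: int,
    price_uncached: float,
    cached_ratio: float,
) -> float:
    """Input cost of asking for the summary inside the main conversation.

    The existing context is assumed to be a full cache hit, billed at
    `cached_ratio` times the uncached price. Only the appended
    summarization prompt is billed at the full price.
    """
    cached_part = context_tokens * price_uncached * cached_ratio
    fresh_part = prompt_tokens * price_uncached
    return cached_part + fresh_part


# Illustrative numbers only: 100k tokens of context, of which an 80k-token
# slice would be sent to a separate summarizer, and a 200-token prompt.
separate = separate_summarizer_cost(80_000, price_uncached=1.0)
inline = inline_summary_cost(100_000, 200, price_uncached=1.0, cached_ratio=0.1)
```

With these made-up inputs the separate call costs roughly 80,000 price units against roughly 10,200 for the inline request. The gap shrinks if the cache has expired or if the slice is a small fraction of the full context.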

How it works today (for context)

Currently, when compression fires:

1. A slice of middle turns is selected.
2. Those turns are serialized into plain text and wrapped into a single {"role": "user", "content": prompt} message.
3. That message is sent as a completely independent LLM call, with no connection to the main conversation.
4. The returned summary text is then patched back into the message list, with role-alternation logic deciding where to insert it.
5. The main model's next request starts fresh with the modified list, having lost any cache context from before the compression.

The two paths (summarizer request and main request) don't share any state beyond the plain text that was copied out and back in.
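For comparison, the current flow can be sketched roughly as follows. This is an illustration of the steps listed above, not the actual hermes code: `compress_separately`, `summarize`, `start`, and `end` are hypothetical names, the instruction text is made up, the summary's role is an assumption, and the role-alternation logic is omitted.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def compress_separately(
    messages: List[Message],
    summarize: Callable[[List[Message]], str],
    start: int,
    end: int,
) -> List[Message]:
    """Summarize messages[start:end] through an independent LLM call."""
    # Step 1: select a slice of middle turns.
    middle = messages[start:end]

    # Step 2: serialize the slice into plain text and wrap it in one message.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in middle)
    prompt = "Summarize the following conversation turns:\n\n" + transcript

    # Step 3: independent call. It carries no system prompt, tool schemas,
    # or earlier turns, so it shares no cached prefix with the main request.
    summary = summarize([{"role": "user", "content": prompt}])

    # Step 4: patch the summary back in place of the slice.
    patched = messages[:start] + [{"role": "user", "content": summary}] + messages[end:]

    # Step 5: the caller sends `patched` as the next main request. Its prefix
    # differs from the previous request from index `start` onward.
    return patched
```

Because the message list changes at index `start`, any prefix cache built for the old list is only valid for the messages before that index, which is the loss described in step 5.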

Proposed Solution

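As described under Suggestion: run summarization as a turn in the main conversation rather than as a separate request, so that the existing KV cache is fully reused.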

Alternatives Considered

No response

Feature Type

Performance / reliability

Scope

None

Contribution

I'd like to implement this myself and submit a PR

Debug Report (optional)
