hermes - 💡(How to fix) Fix Proposal: progressive tool-result compression to reduce token waste in long conversations [1 participants]

GitHub stats
NousResearch/hermes-agent#14948 (fetched 2026-04-24 10:44:06)
Comments: 0 · Participants: 1 · Timeline events: 3 (labeled ×3) · Reactions: 0


Problem

In long conversations (40+ turns), old tool results — file contents, command outputs, search results — consume thousands of tokens that the model no longer needs verbatim. The model only needs to remember: WHAT tool was called, and the OUTCOME (success/failure + key result).

Current behavior

The existing ContextCompressor (threshold-triggered LLM summarization) handles this, but only after a 413/overflow event — it's a reactive, heavyweight mechanism that permanently mutates self.messages.

Before that trigger point, every API call sends the full verbatim content of all historical tool results. In a 50-turn session with 30 tool calls, this can easily be 50K+ tokens of stale tool output that the model has already acted upon.

Why it matters

  1. Token waste — Each API call pays for tokens the model doesn't need. In a coding session with many read_file and terminal calls, old outputs are pure waste after the model has moved on.
  2. Earlier 413 triggers — Stale tool results push the context toward the threshold faster, causing more frequent full compression events (which are expensive, irreversible, and disruptive).
  3. Degraded reasoning — More tokens in context = more noise for the model to sift through. Compact summaries of old results can actually improve focus.

Proposed solution: Progressive tool-result compression

An ephemeral, per-API-call optimization that compresses old tool results to one-line summaries before each LLM call — complementing (not replacing) the existing ContextCompressor.

How it works

Before each API call:
  1. Identify all role="tool" messages in api_messages
  2. Keep the last N tool results intact (recent context)
  3. For older tool results, replace content with a compact summary:
     [read_file] OK (12847 chars) | import os → from pathlib import Path
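
The three steps above can be sketched in Python roughly as follows. This is only an illustration, not the author's implementation: the function names, the `name` key on tool messages, the crude ERROR check, and the reading of the arrow as "first two non-empty lines" are all assumptions. The parameter names mirror the configuration keys proposed in this issue.

```python
import re


def summarize_tool_result(name: str, content: str, max_len: int = 300) -> str:
    """Build a one-line summary: tool name, status, size, short preview."""
    lines = [ln.strip() for ln in content.splitlines() if ln.strip()]
    # Assumption: a result whose first line starts with "error" or
    # "traceback" is treated as a failure; anything else counts as OK.
    failed = bool(lines) and re.match(r"(error|traceback)", lines[0], re.I)
    head = f"[{name}] {'ERROR' if failed else 'OK'} ({len(content)} chars)"
    if not lines:
        return head
    # Assumption: the preview joins the first two non-empty lines.
    preview = " → ".join(lines[:2])
    return f"{head} | {preview}"[:max_len]


def compress_old_tool_results(api_messages, recent_tool_keep=20,
                              min_messages=16, max_compressed_len=300):
    """Return a new list with old tool results summarized.

    The input list and its message dicts are never modified.
    """
    if len(api_messages) < min_messages:
        return api_messages  # short conversation: no-op
    tool_idx = [i for i, m in enumerate(api_messages) if m.get("role") == "tool"]
    # Everything except the last `recent_tool_keep` tool results is "old".
    old = tool_idx[:-recent_tool_keep] if recent_tool_keep > 0 else tool_idx
    if not old:
        return api_messages
    out = list(api_messages)  # new list; untouched messages are shared
    for i in old:
        msg = api_messages[i]
        content = msg.get("content")
        # Skip non-text results and results that are already short.
        if not isinstance(content, str) or len(content) <= max_compressed_len:
            continue
        summary = summarize_tool_result(msg.get("name", "tool"), content,
                                        max_compressed_len)
        out[i] = {**msg, "content": summary}  # replace with a modified copy
    return out
```

Because compressed entries are fresh dicts in a fresh list, the caller's history keeps its full tool output, which is what makes the optimization ephemeral.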

Key design decisions

| Decision | Rationale |
| --- | --- |
| Operates on api_messages copy only | self.messages is never touched, so the change is fully reversible |
| Regex-based summary, not LLM | Zero latency, zero cost per call |
| Respects compression.enabled | Users who opt out of compression aren't silently opted in |
| recent_tool_keep defaults to protect_last_n | Consistent with existing "how much recent context to preserve" intent |
| All thresholds configurable via compression.progressive.* | No hardcoded behavior |

Relationship to existing compression

| | ContextCompressor | Progressive tool-result |
| --- | --- | --- |
| Trigger | After 413/overflow | Before every API call |
| Persistence | Permanent (mutates history) | Ephemeral (API copy only) |
| Method | LLM summarization | Regex one-line summary |
| Cost | API call per compression | Zero |
| Reversible | No | Yes |

The two are orthogonal: progressive compression reduces token waste on every call, which delays the need for a full ContextCompressor trigger.

Configuration

agent:
  compression:
    progressive:
      enabled: true           # defaults to compression.enabled
      recent_tool_keep: 20    # defaults to compression.protect_last_n
      min_messages: 16        # only activate in long conversations
      max_compressed_len: 300 # skip results shorter than this
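
The fallback rules in the comments above could be resolved along these lines. This is a hedged sketch, not the project's config loader: `resolve_progressive_config` is a hypothetical helper, it assumes the YAML has already been parsed into nested dicts, and it assumes `compression.enabled` and `compression.protect_last_n` live under `agent.compression`. The final fallback values (`False`, `20`) are placeholders, since the opt-in versus opt-out default is still an open question in this issue.

```python
def resolve_progressive_config(config: dict) -> dict:
    """Resolve progressive-compression settings with the proposed fallbacks."""
    compression = config.get("agent", {}).get("compression", {})
    progressive = compression.get("progressive", {})
    return {
        # Falls back to compression.enabled; False here is a placeholder.
        "enabled": progressive.get("enabled", compression.get("enabled", False)),
        # Falls back to compression.protect_last_n; 20 is a placeholder.
        "recent_tool_keep": progressive.get(
            "recent_tool_keep", compression.get("protect_last_n", 20)
        ),
        "min_messages": progressive.get("min_messages", 16),
        "max_compressed_len": progressive.get("max_compressed_len", 300),
    }
```

An explicit value under `progressive` always wins in this sketch; whether `progressive.enabled: true` should be allowed to override `compression.enabled: false` is a design choice the issue leaves to the maintainers.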

Evidence

Token savings (benchmark)

Simulated conversations with read_file tool calls (the most common token-heavy pattern):

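| Scenario | Tool calls | Avg result size | Before (tokens) | After (tokens) | Saved | Reduction |
| --- | --- | --- | --- | --- | --- | --- |
| Small | 10 | 2KB | 3,253 | 2,673 | 580 | 17.8% |
| Medium | 20 | 5KB | 16,138 | 6,835 | 9,303 | 57.6% |
| Large | 30 | 8KB | 39,048 | 11,196 | 27,852 | 71.3% |
| XLarge | 50 | 10KB | 81,493 | 14,370 | 67,123 | 82.4% |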

Per-result compression ratio in the Large scenario: 64.7x (5,144 chars → 80 chars per compressed result).

For typical coding sessions (20-30 tool calls), the benchmark above shows roughly 9K to 28K fewer tokens per API call (the Medium and Large scenarios), which translates directly into lower cost and later 413 triggers.
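
As a quick consistency check, the Saved and Reduction columns of the benchmark table can be recomputed from the Before and After token counts. The snippet only re-derives the figures already reported in this issue; `ROWS` and `savings` are names chosen for this illustration.

```python
# (scenario, tokens before, tokens after), copied from the benchmark table
ROWS = [
    ("Small", 3253, 2673),
    ("Medium", 16138, 6835),
    ("Large", 39048, 11196),
    ("XLarge", 81493, 14370),
]


def savings(before: int, after: int) -> tuple[int, float]:
    """Return (tokens saved, percentage reduction rounded to one decimal)."""
    saved = before - after
    return saved, round(100 * saved / before, 1)


for name, before, after in ROWS:
    saved, pct = savings(before, after)
    print(f"{name}: saved {saved} tokens ({pct}%)")
```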

Implementation status

I have a working implementation with 32 unit tests covering all branches (no-op paths, boundary conditions, immutability, edge cases). Happy to submit a PR if there's interest.

Questions for maintainers

  1. Is this direction something you'd want in the core?
  2. Any preference on the summary format or the default thresholds?
  3. Should this be opt-in (default false) or opt-out (default true, respecting compression.enabled)?

extent analysis

TL;DR

The issue proposes summarizing old tool results before each API call. In the author's simulated read_file benchmark this cut tokens per call by 17.8% to 82.4%, and the author argues the smaller context may also help the model stay focused.

Guidance

  • Review the proposed solution's design decisions, such as operating on a copy of api_messages and using regex-based summaries, to ensure they align with the project's requirements.
  • Consider the configuration options, like recent_tool_keep and max_compressed_len, to determine the optimal settings for the project.
  • Evaluate the trade-offs between the existing ContextCompressor and the proposed progressive tool-result compression, including their triggers, persistence, and costs.
  • Assess the potential impact of this feature on the project's performance, cost, and user experience, using the provided benchmark results as a reference.

Example

agent:
  compression:
    progressive:
      enabled: true
      recent_tool_keep: 20
      min_messages: 16
      max_compressed_len: 300

This example configuration enables progressive compression, keeps the last 20 tool results intact, activates only in conversations with at least 16 messages, and skips tool results shorter than the max_compressed_len threshold of 300.

Notes

The proposed solution has a working implementation with 32 unit tests, but it's essential to review and discuss the design decisions, configuration options, and potential impact before integrating it into the core project.

Recommendation

Treat this as a feature proposal rather than a shipped fix: the issue has no linked PR or maintainer response yet. The approach looks worth pursuing because it reduces token usage on every call and is orthogonal to the existing ContextCompressor, but the summary format and default thresholds still need maintainer input.
