hermes - 💡(How to fix) Fix Proposal: progressive tool-result compression to reduce token waste in long conversations [1 participants]

zons-zhaozhy · 2026-04-24T06:36:37Z


Problem

In long conversations (40+ turns), old tool results — file contents, command outputs, search results — consume thousands of tokens that the model no longer needs verbatim. The model only needs to remember: WHAT tool was called, and the OUTCOME (success/failure + key result).

Current behavior

The existing ContextCompressor (threshold-triggered LLM summarization) handles this, but only after a 413/overflow event — it's a reactive, heavyweight mechanism that permanently mutates self.messages.

Before that trigger point, every API call sends the full verbatim content of all historical tool results. In a 50-turn session with 30 tool calls, this can easily be 50K+ tokens of stale tool output that the model has already acted upon.

Why it matters

1. Token waste: each API call pays for tokens the model doesn't need. In a coding session with many `read_file` and `terminal` calls, old outputs are pure waste once the model has moved on.
2. Earlier 413 triggers: stale tool results push the context toward the threshold faster, causing more frequent full compression events (which are expensive, irreversible, and disruptive).
3. Degraded reasoning: more tokens in context means more noise for the model to sift through. Compact summaries of old results can actually improve focus.

Proposed solution: Progressive tool-result compression

An ephemeral, per-API-call optimization that compresses old tool results to one-line summaries before each LLM call — complementing (not replacing) the existing ContextCompressor.

How it works

Before each API call:
  1. Identify all role="tool" messages in api_messages
  2. Keep the last N tool results intact (recent context)
  3. For older tool results, replace content with a compact summary:
     [read_file] OK (12847 chars) | import os → from pathlib import Path
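
The steps above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation, which is not shown in the issue. The function names, the `name` key on tool messages, the status heuristic, and the "first line → last line" reading of the summary format are all assumptions.

```python
import re

_WS = re.compile(r"\s+")  # used to collapse runs of whitespace in previews


def summarize_tool_result(name: str, content: str, preview_len: int = 60) -> str:
    """One-line summary: [tool] STATUS (N chars) | first line → last line."""
    lines = [_WS.sub(" ", ln).strip() for ln in content.splitlines() if ln.strip()]
    # Crude status heuristic (an assumption): output starting with "error" is a failure.
    status = "ERROR" if content.lstrip().lower().startswith("error") else "OK"
    summary = f"[{name}] {status} ({len(content)} chars)"
    if lines:
        first, last = lines[0][:preview_len], lines[-1][:preview_len]
        summary += f" | {first}" if len(lines) == 1 else f" | {first} → {last}"
    return summary


def compress_old_tool_results(api_messages, recent_tool_keep=20,
                              min_messages=16, max_compressed_len=300):
    """Return a new list in which older tool results are replaced by summaries."""
    if len(api_messages) < min_messages:
        return list(api_messages)  # short conversation: nothing to compress
    # Step 1: locate every tool message.
    tool_idx = [i for i, m in enumerate(api_messages) if m.get("role") == "tool"]
    # Step 2: the last N tool results stay verbatim.
    keep = set(tool_idx[-recent_tool_keep:]) if recent_tool_keep > 0 else set()
    out = []
    for i, msg in enumerate(api_messages):
        content = msg.get("content")
        # Step 3: summarize older results; short ones are skipped, following the
        # "skip results shorter than this" comment in the proposed config.
        if (msg.get("role") == "tool" and i not in keep
                and isinstance(content, str) and len(content) >= max_compressed_len):
            # A new dict is built so the caller's message objects are never mutated.
            msg = {**msg, "content": summarize_tool_result(msg.get("name", "tool"), content)}
        out.append(msg)
    return out
```

Because the function returns a new list and builds new dicts for compressed entries, the list passed in (standing in for `self.messages`) is left untouched, which matches the "ephemeral" intent of the proposal.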

Key design decisions

| Decision | Rationale |
|---|---|
| Operates on `api_messages` copy only | `self.messages` is never touched, so the change is fully reversible |
| Regex-based summary, not LLM | Zero latency, zero cost per call |
| Respects `compression.enabled` | Users who opt out of compression aren't silently opted in |
| `recent_tool_keep` defaults to `protect_last_n` | Consistent with existing "how much recent context to preserve" intent |
| All thresholds configurable via `compression.progressive.*` | No hardcoded behavior |

Relationship to existing compression

| | ContextCompressor | Progressive tool-result |
|---|---|---|
| Trigger | After 413/overflow | Before every API call |
| Persistence | Permanent (mutates history) | Ephemeral (API copy only) |
| Method | LLM summarization | Regex one-line summary |
| Cost | API call per compression | Zero |
| Reversible | No | Yes |

The two are complementary: progressive compression reduces token waste on every call, which delays the need for a full ContextCompressor trigger.

Configuration

agent:
  compression:
    progressive:
      enabled: true           # defaults to compression.enabled
      recent_tool_keep: 20    # defaults to compression.protect_last_n
      min_messages: 16        # only activate in long conversations
      max_compressed_len: 300 # skip results shorter than this
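
The default chain described in the config comments could be resolved roughly as below. This is a sketch under assumptions: `resolve_progressive_config` is a hypothetical helper, the config is taken to be an already-parsed dict for the `agent` section, and the last-resort fallbacks (`False` and `20`) are illustrative values that the issue does not specify.

```python
def resolve_progressive_config(agent_cfg: dict) -> dict:
    """Resolve compression.progressive.* settings with the documented defaults."""
    compression = agent_cfg.get("compression", {})
    progressive = compression.get("progressive", {})
    return {
        # Defaults to compression.enabled (assumed False when that is absent too).
        "enabled": progressive.get("enabled", compression.get("enabled", False)),
        # Defaults to compression.protect_last_n (assumed 20 when that is absent too).
        "recent_tool_keep": progressive.get(
            "recent_tool_keep", compression.get("protect_last_n", 20)
        ),
        "min_messages": progressive.get("min_messages", 16),
        "max_compressed_len": progressive.get("max_compressed_len", 300),
    }
```

Here an explicit `progressive.enabled` overrides `compression.enabled`. Whether a disabled parent should instead force the feature off is one reading of "respects `compression.enabled`" and is left open by the issue.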

Evidence

Token savings (benchmark)

Simulated conversations with read_file tool calls (the most common token-heavy pattern):

| Scenario | Tool calls | Avg result size | Before (tokens) | After (tokens) | Saved | Reduction |
|---|---|---|---|---|---|---|
| Small | 10 | 2KB | 3,253 | 2,673 | 580 | 17.8% |
| Medium | 20 | 5KB | 16,138 | 6,835 | 9,303 | 57.6% |
| Large | 30 | 8KB | 39,048 | 11,196 | 27,852 | 71.3% |
| XLarge | 50 | 10KB | 81,493 | 14,370 | 67,123 | 82.4% |
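
The Saved and Reduction columns follow directly from the Before and After token counts. The short check below recomputes them; the figures are copied from the benchmark table, and the `savings` helper is only illustrative.

```python
# (before, after) token counts copied from the benchmark table.
BENCHMARK = {
    "Small": (3253, 2673),
    "Medium": (16138, 6835),
    "Large": (39048, 11196),
    "XLarge": (81493, 14370),
}


def savings(before: int, after: int) -> tuple[int, float]:
    """Return (tokens saved, percentage reduction rounded to one decimal)."""
    saved = before - after
    return saved, round(100 * saved / before, 1)


for scenario, (before, after) in BENCHMARK.items():
    saved, pct = savings(before, after)
    print(f"{scenario}: saved {saved:,} tokens ({pct}%)")
```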

Per-result compression ratio in the Large scenario: 64.7x (5,144 chars → about 80 chars per compressed result).

For typical coding sessions (20-30 tool calls), the benchmark above shows roughly 9K-28K fewer tokens per API call (a 58-71% reduction), which translates directly into lower cost and later 413 triggers.

Implementation status

I have a working implementation with 32 unit tests covering all branches (no-op paths, boundary conditions, immutability, edge cases). Happy to submit a PR if there's interest.

Questions for maintainers

1. Is this direction something you'd want in the core?
2. Any preference on the summary format or the default thresholds?
3. Should this be opt-in (default `false`) or opt-out (default `true`, respecting `compression.enabled`)?

extent analysis

TL;DR

Progressive tool-result compression replaces old tool results with one-line summaries before each API call. In the author's simulated benchmark it reduced tokens per call by 17.8% to 82.4%, depending on conversation size.

Guidance

- Review the proposed solution's design decisions, such as operating on a copy of `api_messages` and using regex-based summaries, to ensure they align with the project's requirements.
- Consider the configuration options, like `recent_tool_keep` and `max_compressed_len`, to determine the optimal settings for the project.
- Evaluate the trade-offs between the existing ContextCompressor and the proposed progressive tool-result compression, including their triggers, persistence, and costs.
- Assess the potential impact of this feature on the project's performance, cost, and user experience, using the provided benchmark results as a reference.

Example

agent:
  compression:
    progressive:
      enabled: true
      recent_tool_keep: 20
      min_messages: 16
      max_compressed_len: 300

This example configuration enables progressive compression, keeps the last 20 tool results intact, activates only in conversations with at least 16 messages, and skips tool results shorter than the `max_compressed_len` threshold of 300.

Notes

The proposed solution has a working implementation with 32 unit tests, but it's essential to review and discuss the design decisions, configuration options, and potential impact before integrating it into the core project.

Recommendation

Consider adopting the proposed progressive tool-result compression once the author's PR is available. It is a proposed feature rather than a workaround: in the author's benchmark it substantially reduces tokens per call, and it complements rather than replaces the existing ContextCompressor.
