hermes - feat(compression): integrate headroom-ai for tool output compression

Problem or Use Case

Hermes Agent's current context compression system (context_compressor.py, conversation_compression.py) works at the conversation level — it summarizes the entire context window via LLM calls when the session approaches its token limit. This approach has several known issues:

Premature compression triggers due to token estimation inaccuracies (#23902, #14690)
Compression can increase prompt size instead of reducing it (#23767)
Silent data loss when summary generation fails (#25585, #10719)
Anti-thrashing protection permanently disables compression with no recovery (#14690)
Preflight guard bypasses token threshold for sessions with few but huge messages (#27405)

These are fundamentally hard to solve at the conversation-summary level because the compressor operates on already-assembled context.

Headroom-ai (headroom-ai on PyPI, github.com/chopratejas/headroom, 13K+ stars) takes a different approach: it compresses individual tool outputs before they enter the context, using specialized compressors per content type (logs, grep results, JSON, code). This is complementary to the existing conversation-level compression — it reduces the rate at which context grows in the first place.

Proposed Solution

Integrate headroom-ai as an optional tool-output compression layer that sits between tool execution and context insertion. The integration would:

Intercept tool outputs after execution but before they are appended to the message list
Route each output to the appropriate headroom compressor based on tool name / content type
Replace the original output with the compressed version (in optimize mode) or log metrics only (in audit mode)
Fall back gracefully if headroom is not installed or fails
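
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the actual integration: the `Compressor` callable signature, the `estimate_tokens` heuristic, and the convention of passing `None` when headroom is not installed are all assumptions made for the example.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("headroom_integration")

# Hypothetical signature: (tool_name, raw_output) -> compressed text.
Compressor = Callable[[str, str], str]


def estimate_tokens(text: str) -> int:
    """Crude stand-in estimator (about 4 characters per token); an assumption."""
    return len(text) // 4


def compress_tool_output(
    tool_name: str,
    output: str,
    compressor: Optional[Compressor],  # None models "headroom not installed"
    mode: str = "audit",
    threshold: int = 300,
) -> str:
    """Return the text that should be appended to the message list."""
    # Skip small outputs and the "not installed" case entirely.
    if compressor is None or estimate_tokens(output) < threshold:
        return output
    try:
        compressed = compressor(tool_name, output)
    except Exception:
        # Fall back to the original output on any compressor failure.
        log.exception("compression failed for %s", tool_name)
        return output
    log.info(
        "%s: %d -> %d tokens (estimated)",
        tool_name,
        estimate_tokens(output),
        estimate_tokens(compressed),
    )
    # Audit mode only records metrics; optimize mode swaps in the result.
    return compressed if mode == "optimize" else output
```

In this sketch the original output is returned on every failure path, so enabling the feature can never lose a tool result.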

Architecture

Tool Execution → headroom compress_tool_output() → Compressed Output → Message List → LLM Context
                     ↓
              ContentRouter detects type:
              - terminal → LogCompressor (~93% reduction)
              - search_files → SearchCompressor (~87% reduction)
              - web_search → SmartCrusher (~2-5% reduction)
              - read_file → CodeAwareCompressor (tree-sitter, currently buggy)
              - browser_snapshot → noop (plain text, not supported yet)
              - web_extract → noop (markdown, not supported yet)
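
The routing step in the diagram can be illustrated with a plain dispatch table keyed by tool name. The compressor functions below are placeholders invented for this example and are not headroom's classes; headroom's own ContentRouter may also inspect the content itself, which this sketch does not attempt.

```python
from typing import Callable, Dict


def compress_log(text: str) -> str:
    """Placeholder compressor: drops exact duplicate lines, keeps first-seen order."""
    return "\n".join(dict.fromkeys(text.splitlines()))


def noop(text: str) -> str:
    """Used for content types that are not supported yet."""
    return text


# Hypothetical routing table; unsupported tools map to noop explicitly.
ROUTES: Dict[str, Callable[[str], str]] = {
    "terminal": compress_log,
    "browser_snapshot": noop,
    "web_extract": noop,
}


def route(tool_name: str, output: str) -> str:
    """Send output to the compressor for its tool; unknown tools pass through."""
    return ROUTES.get(tool_name, noop)(output)
```

Defaulting unknown tool names to `noop` keeps newly added tools safe until a compressor is chosen for them.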

Configuration

New optional config section in config.yaml:

headroom:
  enabled: false  # opt-in, default off
  mode: audit     # audit (log only) | optimize (compress)
  threshold: 300  # minimum tokens to trigger compression
  tools:
    - terminal
    - search_files
    - web_search
    - read_file
    - browser_snapshot
    - web_extract
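
One way the section could be loaded and validated is sketched below. It assumes config.yaml has already been parsed into a dict (for example with a YAML loader); the `HeadroomConfig` class and its method names are hypothetical and only mirror the keys shown above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

VALID_MODES = ("audit", "optimize")


@dataclass
class HeadroomConfig:
    enabled: bool = False          # opt-in, default off
    mode: str = "audit"            # audit (log only) | optimize (compress)
    threshold: int = 300           # minimum tokens to trigger compression
    tools: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "HeadroomConfig":
        """Build a config from the parsed file; a missing section means disabled."""
        section = raw.get("headroom") or {}
        cfg = cls(
            enabled=bool(section.get("enabled", False)),
            mode=str(section.get("mode", "audit")),
            threshold=int(section.get("threshold", 300)),
            tools=list(section.get("tools", [])),
        )
        if cfg.mode not in VALID_MODES:
            raise ValueError(f"headroom.mode must be one of {VALID_MODES}")
        if cfg.threshold < 0:
            raise ValueError("headroom.threshold must be non-negative")
        return cfg

    def applies_to(self, tool_name: str) -> bool:
        """True only when the feature is enabled and the tool is listed."""
        return self.enabled and tool_name in self.tools
```

Because every field has a default, existing config files without a `headroom` section load as disabled.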

Proof of Concept

I've been running a working integration in production for several days. The implementation consists of:

headroom_compressor.py — wrapper module that routes tool outputs to headroom's compressors
sitecustomize.py — auto-patches tool_executor.py on startup via Python's sitecustomize mechanism
patch_tool_executor.py — idempotent patch that injects the compression call into the tool execution pipeline
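
For readers unfamiliar with the mechanism: Python's `site` module tries to import a module named `sitecustomize` at interpreter startup, which is what lets the patch run before the agent loads. The idempotency part can be sketched as below. This is an illustration of the general technique, not the contents of patch_tool_executor.py; the `MARKER` sentinel and the `apply_patch` function are hypothetical names.

```python
MARKER = "# headroom-patch: applied"  # hypothetical sentinel comment


def apply_patch(source: str, anchor: str, injected: str) -> str:
    """Insert `injected` before the first line containing `anchor`, at most once."""
    if MARKER in source:
        return source  # already patched, so a second run changes nothing
    lines = source.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if anchor in line:
            # Reuse the anchor line's indentation for the injected block.
            indent = line[: len(line) - len(line.lstrip())]
            block = [f"{indent}{MARKER}\n"]
            block += [f"{indent}{new_line}\n" for new_line in injected.splitlines()]
            return "".join(lines[:i] + block + lines[i:])
    raise ValueError(f"anchor {anchor!r} not found; refusing to patch")
```

Raising when the anchor is missing, rather than silently doing nothing, makes it visible when an upstream refactor has moved the injection site. Source patching of this kind is inherently fragile, which is part of the motivation for a native integration.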

Measured results (headroom-ai v0.23.0, optimize mode):

| Tool | Compressor | Reduction | Notes |
|---|---|---|---|
| `terminal` | LogCompressor | ~93% | Preserves [WARN], [ERROR], [FAIL] lines; removes repetitive [INFO] |
| `search_files` | SearchCompressor | ~87% | Preserves matching lines; deduplicates context |
| `web_search` | SmartCrusher | ~2-5% | Light JSON array compression |
| `read_file` | CodeAwareCompressor | ~0% | tree-sitter bug in v0.23.0, falls back to generic |
| `browser_snapshot` | noop | 0% | plain text not supported by headroom yet |
| `web_extract` | noop | 0% | markdown not supported by headroom yet |
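
To give a feel for why log output compresses so well, here is a toy version of the behavior described in the `terminal` row. It is not headroom's LogCompressor algorithm, only an assumption-laden illustration: runs of [INFO] lines are collapsed into a count, and every other line is kept verbatim.

```python
import re
from typing import List

# Severity tags that must always survive compression.
KEEP = re.compile(r"\[(WARN|ERROR|FAIL)\]")


def squeeze_log(text: str) -> str:
    """Collapse consecutive [INFO] lines into a one-line placeholder."""
    out: List[str] = []
    run = 0  # length of the current run of droppable [INFO] lines

    def flush() -> None:
        nonlocal run
        if run:
            out.append(f"[{run} INFO lines omitted]")
            run = 0

    for line in text.splitlines():
        if "[INFO]" in line and not KEEP.search(line):
            run += 1
        else:
            flush()
            out.append(line)
    flush()
    return "\n".join(out)
```

Build and test logs are typically dominated by informational lines, so even this naive rule removes most of the volume while leaving warnings and failures intact.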

Token savings example: a session with 50 tool calls averaging 2,000 tokens each produces about 100K tokens of raw tool output. Compression would save approximately 30-40K of those tokens, depending on the tool mix, which significantly delays the need for conversation-level compression.
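
The estimate can be reproduced with a few lines of arithmetic. The reduction rates come from the table above (using 3.5% as the midpoint of the 2-5% range for `web_search`); the tool mix is a hypothetical example, not measured data.

```python
from typing import Dict, Tuple

TOKENS_PER_CALL = 2000

# Measured reduction rates from the table above.
REDUCTION: Dict[str, float] = {
    "terminal": 0.93,
    "search_files": 0.87,
    "web_search": 0.035,
    "read_file": 0.0,
}

# Hypothetical mix of 50 tool calls, chosen only for illustration.
MIX: Dict[str, int] = {
    "terminal": 15,
    "search_files": 5,
    "web_search": 10,
    "read_file": 20,
}


def estimate_savings(mix: Dict[str, int]) -> Tuple[int, int]:
    """Return (raw tokens, tokens saved) for a given number of calls per tool."""
    total = sum(mix.values()) * TOKENS_PER_CALL
    saved = sum(calls * TOKENS_PER_CALL * REDUCTION[tool] for tool, calls in mix.items())
    return total, round(saved)


total, saved = estimate_savings(MIX)
print(f"raw: {total} tokens, saved: {saved} tokens")
```

With this mix the saving is about 37K of 100K tokens. A session dominated by `read_file` calls would save far less, which is why the range depends so strongly on the tool mix.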

Integration Points

The cleanest integration point is in agent/tool_executor.py, in the _execute_tool_calls_sequential and _execute_tool_calls_concurrent functions, right before make_tool_result_message() is called:

# Pseudocode for the integration point
function_result = execute_tool(name, args)

# NEW: Compress tool output before adding to context
if headroom_config.enabled:
    compressed = headroom_compress(name, function_result)
    if compressed is not None:
        function_result = compressed

messages.append(make_tool_result_message(name, function_result, tc.id))
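
A runnable variant of the pseudocode is sketched below, showing one shared helper used by both a sequential and a concurrent path. Everything here is a stand-in: `execute_tool`, `headroom_compress`, the `ToolCall` tuple, and the message dict shape are stubs for illustration and do not reflect the real signatures in agent/tool_executor.py.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple

ToolCall = Tuple[str, str, str]  # (call id, tool name, argument); simplified stand-in


def execute_tool(name: str, arg: str) -> str:
    """Stub tool that produces a long, repetitive output."""
    return f"{name}({arg}) " * 50


def headroom_compress(name: str, result: str) -> Optional[str]:
    """Stub compressor: returns None when it has nothing to offer."""
    return result[:40] if name == "terminal" else None


def finish_call(call: ToolCall) -> Dict[str, str]:
    """Execute one tool call and compress before building the message."""
    call_id, name, arg = call
    result = execute_tool(name, arg)
    compressed = headroom_compress(name, result)
    if compressed is not None:
        result = compressed
    return {"role": "tool", "tool_call_id": call_id, "name": name, "content": result}


def run_sequential(calls: List[ToolCall]) -> List[Dict[str, str]]:
    return [finish_call(call) for call in calls]


def run_concurrent(calls: List[ToolCall]) -> List[Dict[str, str]]:
    with ThreadPoolExecutor() as pool:
        # Executor.map yields results in input order, matching the sequential path.
        return list(pool.map(finish_call, calls))
```

Putting the compression call in a helper shared by both paths avoids the two functions drifting apart over time.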

Advantages Over Current Approach

Complementary: Works before context compression, reducing its frequency and improving its effectiveness
No LLM calls: Unlike conversation-level compression, headroom uses deterministic algorithms, so there are no API costs, no LLM round-trip latency, and no summary-quality issues
Reversible (CCR): Headroom's Context Compression & Retrieval system stores originals locally; the LLM can retrieve them on demand
Content-aware: Different compressors for different content types, vs. one-size-fits-all LLM summarization
Opt-in: Zero impact on existing users who don't enable it
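
The reversibility idea can be illustrated in general terms as follows. This is not how headroom's CCR is implemented; it is a toy, in-memory sketch under the assumption that the compressed text carries an identifier the model can later use to request the original. The `OriginalStore` class and the reference format are invented for the example.

```python
import hashlib
from typing import Dict, Tuple


class OriginalStore:
    """Toy in-memory store; a real system would persist originals locally."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def put(self, original: str) -> str:
        # A short content hash serves as a stable retrieval key.
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:12]
        self._items[key] = original
        return key

    def get(self, key: str) -> str:
        return self._items[key]


def compress_with_reference(
    original: str, store: OriginalStore, keep_chars: int = 80
) -> Tuple[str, str]:
    """Truncate the text and append a reference to the stored original."""
    key = store.put(original)
    compressed = f"{original[:keep_chars]}\n[truncated; original id={key}]"
    return compressed, key
```

A retrieval tool exposed to the model could then call `store.get(key)` whenever the truncated view turns out to be insufficient.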

Dependencies

headroom-ai package (Apache 2.0 license, Python >= 3.10)
Optional: tree-sitter for code compression (already a transitive dependency of headroom-ai)

Alternatives Considered

Fix the existing compression system — addresses symptoms at the conversation level but doesn't reduce context growth rate. The two approaches are complementary.
Use headroom as a proxy — headroom supports proxy mode (headroom proxy --port 8787), but this intercepts all LLM traffic and is a heavier integration. Library mode is more targeted.
Build custom compressors — headroom already provides well-tested, content-specific compressors. Reinventing them would duplicate effort.

Scope

Medium — new optional feature, no breaking changes, ~200-300 lines of integration code plus config schema changes.

Related Issues

#23902 — premature compression trigger (headroom reduces context growth rate, making this less frequent)
#23767 — compression can increase prompt size (headroom's deterministic compressors don't have this problem)
#25585 — failed summaries discard context (headroom doesn't use LLM summarization)
#14690 — anti-thrashing permanently disables compression (headroom operates per-tool, not per-session)
#27405 — preflight guard bypasses token threshold (headroom compresses before messages reach the guard)
#14695 — post-compression token estimate excludes tools schema (headroom reduces tool output size, making estimates more accurate)
