hermes - RFC: Pluggable type-aware output-compressor pipeline for tool results

Summary

I'd like to gauge interest in upstreaming a pluggable output-compressor pipeline that detects the type of a tool result (pytest output, git diff, grep matches, docker ps, cargo test, npm install, etc.) and applies a type-specific compression strategy — preserving signal lines (tracebacks, hunk headers, error messages) while trimming noise (passing tests, unchanged lines, repetitive headers).

I've been running this in production for ~4 weeks across multiple profiles. It is backed by ~50 golden-fixture regression files, so a compressor change can't silently destroy a class of output.

This is runtime tool-result compression, distinct from the trajectory-compression mentioned in the README (which is for training-data export).

Why

Tool results are the largest non-skill contributor to turn context in my deployments. Typical observed sizes:

| Tool | Raw output | After type-aware compression |
| --- | --- | --- |
| pytest -v (1000 tests, 3 fail) | ~80 KB | ~3 KB (keep failures + summary, drop pass lines) |
| git diff (medium PR) | ~25 KB | ~6 KB (keep hunk headers + changed lines, drop unchanged context beyond N) |
| rg <pattern> (200 matches) | ~40 KB | ~5 KB (keep first/last N matches + count, dedupe near-identical) |
| docker ps -a (50 containers) | ~12 KB | ~2 KB (tabulate, drop verbose mount lines) |
| cargo test (large workspace) | ~150 KB | ~8 KB (keep failed test detail, drop progress) |
| pip install (deep dep tree) | ~30 KB | ~1 KB (keep summary + errors) |
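To make one of the rows above concrete, here is a minimal sketch of the git diff strategy (keep headers and changed lines, drop unchanged context beyond N). The function name `compress_diff`, the `context` parameter, and the omission-marker text are illustrative assumptions, not the code I run locally.

```python
def compress_diff(diff_text: str, context: int = 1) -> str:
    """Keep file/hunk headers and changed lines of a unified diff, plus at
    most `context` unchanged lines on each side of every change."""
    header_prefixes = ("diff --git", "index ", "--- ", "+++ ", "@@")
    lines = diff_text.splitlines()
    keep = [False] * len(lines)

    for i, line in enumerate(lines):
        if line.startswith(header_prefixes):
            keep[i] = True
        elif line.startswith(("+", "-")):
            # Mark the changed line and its surrounding context window.
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            for j in range(lo, hi):
                keep[j] = True

    out = []
    skipped = 0
    for line, kept in zip(lines, keep):
        if kept:
            if skipped:
                # Hypothetical marker format so the reader knows lines were dropped.
                out.append(f"... [{skipped} unchanged lines omitted]")
                skipped = 0
            out.append(line)
        else:
            skipped += 1
    if skipped:
        out.append(f"... [{skipped} unchanged lines omitted]")
    return "\n".join(out)
```

One known limitation of this sketch: a removed line whose content itself starts with `-- ` renders as `--- ...` and is treated as a header, so it is still kept but does not pull in surrounding context.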

Across a long working session this compounds: without compression, the same useful information costs 5-10× more tokens. Compression also improves the prompt-cache hit rate, because the noise it strips is the part that varies from run to run.

Design sketch

tool result
    │
    ▼
┌────────────────────────┐
│  pattern_detector()    │  ← regex + content heuristics; identifies "this looks like pytest output"
└────────────────────────┘
    │
    ▼  detected_type = "pytest"
┌────────────────────────┐
│  compressor_registry   │  ← registry of {type → compressor_fn}
└────────────────────────┘
    │
    ▼
┌────────────────────────┐
│  pytest_compressor()   │  ← keeps FAIL lines + traceback + summary; drops PASS lines
└────────────────────────┘
    │
    ▼
compressed result + metric {input_bytes, output_bytes, ratio, type}
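The flow above could be sketched roughly as follows. This is a simplified illustration, not the production code: the detection regexes, the `Metric` dataclass, the `compress_tool_result` entry point, and the choice of `ratio = output_bytes / input_bytes` are all assumptions made for the example, and the pytest compressor here only drops PASSED lines.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Metric:
    type: str
    input_bytes: int
    output_bytes: int
    ratio: float  # assumed here to mean output_bytes / input_bytes


# Hypothetical detection patterns; real heuristics would be broader.
DETECTORS = [
    ("pytest", re.compile(r"^=+ test session starts =+$", re.MULTILINE)),
    ("git_diff", re.compile(r"^diff --git ", re.MULTILINE)),
]

# Matches verbose pytest lines such as "tests/t.py::test_x PASSED [ 50%]".
_PASSED_LINE = re.compile(r"\sPASSED(\s+\[\s*\d+%\])?\s*$")


def pattern_detector(text: str) -> Optional[str]:
    """Return the first matching type name, or None for unknown output."""
    for name, pattern in DETECTORS:
        if pattern.search(text):
            return name
    return None


def pytest_compressor(text: str) -> str:
    """Drop per-test PASSED lines; keep failures, tracebacks, and summary."""
    return "\n".join(
        line for line in text.splitlines() if not _PASSED_LINE.search(line)
    )


compressor_registry: Dict[str, Callable[[str], str]] = {
    "pytest": pytest_compressor,
}


def compress_tool_result(text: str) -> Tuple[str, Metric]:
    detected = pattern_detector(text)
    compressor = compressor_registry.get(detected) if detected else None
    # Unknown or unregistered types pass through unchanged.
    result = compressor(text) if compressor else text
    in_bytes = len(text.encode("utf-8"))
    out_bytes = len(result.encode("utf-8"))
    ratio = out_bytes / in_bytes if in_bytes else 1.0
    return result, Metric(detected or "unknown", in_bytes, out_bytes, ratio)
```

Note that a detected type with no registered compressor (`git_diff` in this sketch) also passes through unchanged, which matches the pass-through behavior for unknown types.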

Plug points:

  • New compressors are stand-alone functions registered via decorator; no core changes needed
  • Per-type config (max_lines_kept, signal_patterns, dedup_threshold) in cli-config.yaml
  • Per-tool override possible (e.g., "for pytest from project X, use a different compressor")
  • Lossless mode toggle (env var or per-turn flag) disables all compression for debugging — full raw output passes through

Fixture-driven testing:

Each compressor ships with golden fixtures under fixtures/output_compressor/<type>/<scenario>.txt (input) and <type>/<scenario>.expected.txt (compressed output). A pytest suite compares each compressor's output against the expected file, so any regression fails the build; adding a new compressor requires a fixture pair.

I currently have ~50 fixtures covering: pytest (pass/fail/error), git (status/diff/log), grep/rg variants, ls, docker ps, cargo (pass/fail), npm/pip install, curl JSON, ESLint, mypy, coverage, PowerShell native commands, etc.
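A fixture check along these lines might look like the sketch below. The helper names `fixture_pairs` and `find_regressions` are made up for illustration, and the compressors are assumed to be plain `str -> str` callables keyed by the fixture directory name.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple

EXPECTED_SUFFIX = ".expected.txt"


def fixture_pairs(root: Path) -> List[Tuple[str, Path, Path]]:
    """Find (type, input_path, expected_path) triples under root/<type>/."""
    pairs = []
    # Iterate over the expected files, since "*.txt" would match both halves.
    for expected in sorted(root.glob("*/*" + EXPECTED_SUFFIX)):
        scenario = expected.name[: -len(EXPECTED_SUFFIX)]
        input_path = expected.with_name(scenario + ".txt")
        if not input_path.exists():
            raise FileNotFoundError(f"missing input fixture for {expected}")
        pairs.append((expected.parent.name, input_path, expected))
    return pairs


def find_regressions(
    root: Path, compressors: Dict[str, Callable[[str], str]]
) -> List[str]:
    """Return "<type>/<scenario>" for every fixture whose output changed."""
    failures = []
    for type_name, input_path, expected_path in fixture_pairs(root):
        actual = compressors[type_name](input_path.read_text(encoding="utf-8"))
        if actual != expected_path.read_text(encoding="utf-8"):
            failures.append(f"{type_name}/{input_path.stem}")
    return failures
```

Under pytest, the list returned by `fixture_pairs` could be passed to `pytest.mark.parametrize` so each scenario reports as its own test case rather than one aggregate failure.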

What this is NOT

  • Not training-data trajectory compression (different layer — that's post-hoc, this is per-turn)
  • Not an LLM-based summarizer (pure regex / structured rules; ~0 added latency, ~0 added cost)
  • Not lossy by default for unknown types — when no compressor matches, output passes through unchanged
  • Not opinionated about which model / transport / profile — purely a turn-result transform

Why this isn't covered by existing features

  • The agent's existing context window management is line-count / token-count based, not type-aware (it truncates blindly)
  • Anthropic's automatic context compaction kicks in only when the window fills; this slows how quickly the window fills in the first place
  • LLM-based summarization on every tool result would add latency + cost; this is sub-millisecond per call

Questions before I open a PR

  1. In scope? Pluggable pipeline under agent/ or tools/? Or better as an optional bundled skill that wraps tool calls?
  2. Compressor registration UX — decorator + entry-point discovery, or explicit registry list in cli-config?
  3. Fixture format — keep as raw .txt pairs (current local form) or move to a structured YAML?
  4. Scope split — would you prefer (a) the framework + 3-5 common compressors (pytest / git / grep / ls / docker) as Phase 1, deferring the rest? Or (b) framework only, with compressors added separately by the community?

Not opening a PR yet. Related batch: #31385 (bridge), #31387 (drift hook, withdrawn), #31388 (multi-profile memory), #31392 (task relay), and a parallel SKILL-scheduling proposal I'm filing alongside this.

Thanks!
