hermes - Context compression fires prematurely: estimated token count overwrites precise API value [1 pull request]

Fix Action

Fixed


Problem

Hermes has two compression buffers:

  1. User-configured threshold (default 50%): threshold_tokens = context_length × 0.50, so compression fires at 100K tokens on a 200K context
  2. Post-compression estimation overwrite: after every compression, last_prompt_tokens is unconditionally overwritten with the result of estimate_request_tokens_rough(), which is consistently 15-30K tokens higher than the value the API actually returns

The estimation error compounds the problem:

  • The last API response returned prompt_tokens = 80K (precise)
  • estimate_request_tokens_rough() for the current context reports ~101K (overcounted by ~20K due to tool schemas being included in the estimate but not in the lagged API value)
  • Compression fires because the estimated 101K > threshold 100K
  • Actual context may only be ~90K
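
The numbers in the list above can be checked with a few lines of arithmetic. This is a minimal sketch rather than Hermes code: should_compress is a hypothetical helper, and the token counts are the illustrative figures from this issue, not values read from a real session.

```python
# Illustrative figures from the issue; not taken from a live Hermes session.
CONTEXT_LENGTH = 200_000
THRESHOLD_RATIO = 0.50  # corresponds to compression.threshold

threshold_tokens = int(CONTEXT_LENGTH * THRESHOLD_RATIO)


def should_compress(token_count: int, threshold: int) -> bool:
    """Hypothetical helper: compress once the count exceeds the threshold."""
    return token_count > threshold


precise_prompt_tokens = 80_000  # prompt_tokens from the last API response
rough_estimate = 101_000        # what estimate_request_tokens_rough() reports

print(threshold_tokens)                                          # 100000
print(should_compress(precise_prompt_tokens, threshold_tokens))  # False
print(should_compress(rough_estimate, threshold_tokens))         # True
```

The same threshold gives opposite answers depending on which count is used, which is why the source of the token count matters as much as the threshold itself.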

Expected behavior

Users set compression.threshold: 0.50 to mean "compress at 50% of context." Hermes should honor that exactly. Users who want more headroom can lower the threshold themselves.

Proposed fix

Remove the post-compression estimation overwrite entirely. After compression, last_prompt_tokens should retain the last API-returned value (or 0 if unavailable). The estimate_request_tokens_rough() path is already used as a fallback when last_prompt_tokens == 0 (lines 14554-14568 in run_agent.py).

The compression decision logic already correctly prioritizes the precise API value when available:

if _compressor.last_prompt_tokens > 0:
    _real_tokens = _compressor.last_prompt_tokens  # precise value
else:
    _real_tokens = estimate_request_tokens_rough(...)  # fallback only when unavailable

The issue is that _compress_context() overwrites last_prompt_tokens with an estimate after every compression, so the next compression decision always uses the estimated (inflated) value instead of waiting for the next API response.
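
The interaction between the overwrite and the decision logic can be modeled with a toy example. Everything here is an assumption for illustration: ToyCompressor, pick_token_count, and compress_with_overwrite are hypothetical stand-ins, and the 40K/60K figures are invented to show an inflation inside the reported 15-30K range.

```python
from dataclasses import dataclass


@dataclass
class ToyCompressor:
    """Stand-in for the real compressor; tracks only the field discussed here."""
    last_prompt_tokens: int = 0  # 0 means "no value stored"


def pick_token_count(compressor: ToyCompressor, rough_estimate: int) -> int:
    """Mirrors the decision logic: prefer the stored value, else estimate."""
    if compressor.last_prompt_tokens > 0:
        return compressor.last_prompt_tokens
    return rough_estimate


def compress_with_overwrite(compressor: ToyCompressor, compressed_est: int) -> None:
    """Models the reported behavior: the stored value becomes an estimate."""
    compressor.last_prompt_tokens = compressed_est


ACTUAL_AFTER_COMPRESSION = 40_000  # hypothetical true size after compression
INFLATED_ESTIMATE = 60_000         # hypothetical estimate, 20K too high

comp = ToyCompressor(last_prompt_tokens=80_000)
compress_with_overwrite(comp, INFLATED_ESTIMATE)

used = pick_token_count(comp, rough_estimate=INFLATED_ESTIMATE)
print(used)                             # 60000
print(used - ACTUAL_AFTER_COMPRESSION)  # 20000
```

The branch that is meant to return the precise value returns the estimate instead, because the overwrite made the stored field non-zero.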

Fix: remove the last_prompt_tokens = _compressed_est line in _compress_context() (run_agent.py:10177), or make it optional. This way:

  • After compression, last_prompt_tokens stays at the last precise value
  • On the next user message, the API call updates it to the new precise value
  • Compression fires exactly when prompt_tokens (precise) > threshold, not when the estimate does

This puts the only buffer under user control via compression.threshold.
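
One way the "make optional" variant might look is sketched below, under stated assumptions: ToyCompressor, compress_context, record_api_usage, and the overwrite_with_estimate flag are all hypothetical names, and the token counts are invented. The real _compress_context() in run_agent.py may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class ToyCompressor:
    """Stand-in for the real compressor; 0 means no API value is available."""
    last_prompt_tokens: int = 0


def compress_context(
    compressor: ToyCompressor,
    compressed_est: int,
    overwrite_with_estimate: bool = False,  # hypothetical opt-in flag
) -> None:
    """Hypothetical: the estimate replaces the stored value only on request."""
    if overwrite_with_estimate:
        compressor.last_prompt_tokens = compressed_est


def record_api_usage(compressor: ToyCompressor, prompt_tokens: int) -> None:
    """Hypothetical: store the precise count from the next API response."""
    compressor.last_prompt_tokens = prompt_tokens


comp = ToyCompressor(last_prompt_tokens=80_000)

compress_context(comp, compressed_est=60_000)
after_compression = comp.last_prompt_tokens  # unchanged: last API value

record_api_usage(comp, prompt_tokens=42_000)
after_next_call = comp.last_prompt_tokens    # refreshed by the API response

print(after_compression, after_next_call)    # 80000 42000
```

Leaving the flag off by default corresponds to the "remove" option; passing overwrite_with_estimate=True reproduces the current behavior for anyone who depends on it.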

Related issues

  • #7133: Compression causes incoherent responses on MiniMax-M2.7 (open)
  • #14695: Tools schema tokens not counted in estimation (fixed but issue not closed)
  • #1091, #683: Real-time context visibility requests (open)
