hermes - 💡(How to fix) Fix Context compression fires prematurely — estimated token count overwrites precise API value [1 pull requests]

Q: Expected behavior

Users set `compression.threshold: 0.50` to mean "compress at 50% of context." Hermes should honor that exactly. Users who want more headroom can lower the threshold themselves.

hermes · 2026-05-11 16:54:44

Fix Action

Fixed

Fixed by PR: fix(compression): separate provider-exact vs projected token state (https://github.com/NousResearch/hermes-agent/pull/23934)

Problem

Hermes has two compression buffers:

1. User-configured threshold (default 50%): threshold_tokens = context_length × 0.50, which fires at 100K on a 200K context.
2. Post-compression estimation overwrite: after every compression, last_prompt_tokens is unconditionally overwritten with the result of estimate_request_tokens_rough(), which is consistently 15-30K higher than the actual API-returned value.
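To make the first buffer concrete, here is a minimal sketch of the threshold arithmetic. The function name and the range check are illustrative assumptions, not Hermes code; only the formula (context length multiplied by the configured threshold) comes from the description above.

```python
def threshold_tokens(context_length: int, threshold: float = 0.50) -> int:
    """Token count above which compression is expected to fire."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    return int(context_length * threshold)


print(threshold_tokens(200_000))        # 100000
print(threshold_tokens(200_000, 0.25))  # 50000
```

A user who wants more headroom would lower the threshold value rather than rely on any hidden margin.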

The estimation error compounds the problem:

1. The last API response returned prompt_tokens = 80K (precise).
2. estimate_request_tokens_rough() for the current context reports ~101K (overcounted by ~20K due to tool schemas being included in the estimate but not in the lagged API value).
3. Compression fires because the estimated 101K > threshold 100K.
4. Actual context may only be ~90K.
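The numbers in this scenario can be checked with a few lines of arithmetic. This is only a worked example using the figures quoted above; the variable names are illustrative.

```python
THRESHOLD_TOKENS = 100_000   # 50% of a 200K context
api_prompt_tokens = 80_000   # precise value from the last API response
rough_estimate = 101_000     # figure attributed to the rough estimator

overcount = rough_estimate - api_prompt_tokens

print(api_prompt_tokens > THRESHOLD_TOKENS)  # False: the precise value would not trigger compression
print(rough_estimate > THRESHOLD_TOKENS)     # True: the estimate does
print(overcount)                             # 21000
```

The two comparisons disagree, so which value feeds the decision determines whether compression fires.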

Expected behavior

Users set compression.threshold: 0.50 to mean "compress at 50% of context." Hermes should honor that exactly. Users who want more headroom can lower the threshold themselves.

Proposed fix

Remove the post-compression estimation overwrite entirely. After compression, last_prompt_tokens should retain the last API-returned value (or 0 if unavailable). The estimate_request_tokens_rough() path is already used as a fallback when last_prompt_tokens == 0 (lines 14554-14568 in run_agent.py).

The compression decision logic already correctly prioritizes the precise API value when available:

if _compressor.last_prompt_tokens > 0:
    _real_tokens = _compressor.last_prompt_tokens  # precise value
else:
    _real_tokens = estimate_request_tokens_rough(...)  # fallback only when unavailable

The issue is that _compress_context() overwrites last_prompt_tokens with an estimate after every compression, so the next compression decision always uses the estimated (inflated) value instead of waiting for the next API response.
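The interaction described above can be reproduced in isolation. The sketch below is a simplified stand-in, not the code in run_agent.py: select_token_count and the rough lambda are hypothetical names, and the estimator is replaced by a constant.

```python
from typing import Callable


def select_token_count(last_prompt_tokens: int, estimate: Callable[[], int]) -> int:
    """Prefer the provider-reported count; use the estimate only when it is missing."""
    if last_prompt_tokens > 0:
        return last_prompt_tokens
    return estimate()


# Stand-in for estimate_request_tokens_rough(...); the real arguments are not shown here.
rough = lambda: 101_000

# Precise value available: the estimator is never consulted.
print(select_token_count(80_000, rough))   # 80000
# No API value yet: the estimate is the fallback.
print(select_token_count(0, rough))        # 101000
# After the overwrite, the "precise" slot already holds an estimate,
# so the preference for the API value no longer helps.
print(select_token_count(rough(), rough))  # 101000
```

The selection logic itself is sound; the inflated value gets in because the field it trusts has been filled with an estimate.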

Fix: remove or make optional the last_prompt_tokens = _compressed_est line in _compress_context() (run_agent.py:10177). This way:

1. After compression, last_prompt_tokens stays at the last precise value.
2. On the next user message, the API call updates it to the new precise value.
3. Compression fires exactly when prompt_tokens (precise) > threshold, not when the estimate does.

This puts the only buffer under user control via compression.threshold.
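A toy model of the proposed behavior might look like the following. Everything here is an assumption made for illustration: ToyCompressor, on_api_response, and the overwrite_after_compress flag are hypothetical and do not reflect the actual Hermes classes or the merged PR.

```python
from dataclasses import dataclass


@dataclass
class ToyCompressor:
    threshold_tokens: int
    last_prompt_tokens: int = 0             # provider-reported value, 0 if unknown
    overwrite_after_compress: bool = False  # hypothetical opt-in for the old behavior

    def on_api_response(self, prompt_tokens: int) -> None:
        # Only a real API response updates the precise value.
        self.last_prompt_tokens = prompt_tokens

    def compress(self, compressed_estimate: int) -> None:
        # Message summarization would happen here; omitted in this sketch.
        if self.overwrite_after_compress:
            self.last_prompt_tokens = compressed_estimate

    def should_compress(self) -> bool:
        return self.last_prompt_tokens > self.threshold_tokens


c = ToyCompressor(threshold_tokens=100_000)
c.on_api_response(105_000)
print(c.should_compress())    # True
c.compress(compressed_estimate=62_000)
print(c.last_prompt_tokens)   # 105000 (estimate ignored)
c.on_api_response(58_000)
print(c.should_compress())    # False
```

In this model the stored value still describes the pre-compression prompt until the next API response arrives, so a caller would need to wait for that response before re-checking the threshold.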

Related issues

#7133: Compression causes incoherent responses on MiniMax-M2.7 (open)
#14695: Tools schema tokens not counted in estimation (fixed but issue not closed)
#1091, #683: Real-time context visibility requests (open)
