hermes - [Bug]: Compression token savings ignored when message count is unchanged, causing false context exhaustion [2 pull requests]


Fix Action

Fixed


Summary

Auto/preflight context compression can materially reduce token/request size while leaving the message count unchanged. The conversation loop nevertheless treats len(messages) >= _orig_len as "cannot compress further" and returns a context-exhaustion failure. In gateway sessions this can auto-reset an otherwise viable long-context session even when the post-compression token count is far below the model context window.

Real observed failure

Main/default Telegram profile using GPT-5.5 with explicit 1M context:

  • model.default: gpt-5.5
  • model.provider: openai-codex
  • model.context_length: 1000000
  • compression.threshold: 0.35
  • compression.hygiene_hard_message_limit: 220 (at the time of the incident)

Logs:

2026-06-04 23:39:57 context compression started: session=20260604_131128_cd6624 messages=220 tokens=~288,028 model=gpt-5.5 focus=None
2026-06-04 23:41:38 context compression done: session=20260604_234138_2f8759 messages=220->220 tokens=~183,180
2026-06-04 23:41:38 Context length exceeded: 288,028 tokens. Cannot compress further.
2026-06-04 23:41:38 Auto-resetting session 20260604_131128_cd6624 after compression exhaustion.

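The model context window was 1,000,000 tokens. The compressed request estimate was ~183,180 tokens, well below the configured threshold and far below the model limit, but the session still failed because the message count did not decrease (220->220). Note that the error line reports the pre-compression figure (288,028 tokens), not the post-compression estimate.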
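The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the token counts and settings come from the logs and profile above, while the reading of compression.threshold as a fraction of model.context_length is an assumption, not something the issue confirms. Under that assumption even the pre-compression estimate (288,028) sits below the threshold, which would be consistent with the message-count limit rather than token pressure having triggered compression.

```python
# Figures taken from the log lines and profile settings quoted above.
context_length = 1_000_000      # model.context_length
threshold_fraction = 0.35       # compression.threshold
tokens_before = 288_028         # "context compression started" estimate
tokens_after = 183_180          # "context compression done" estimate

# ASSUMPTION: the threshold is a fraction of the context window.
threshold_tokens = round(context_length * threshold_fraction)

saved = tokens_before - tokens_after
saved_pct = 100 * saved / tokens_before

print(f"threshold: {threshold_tokens:,} tokens")
print(f"saved: {saved:,} tokens ({saved_pct:.1f}%)")
print(f"below threshold before compression: {tokens_before < threshold_tokens}")
print(f"below threshold after compression:  {tokens_after < threshold_tokens}")
print(f"share of context window used: {100 * tokens_after / context_length:.1f}%")
```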

Why this is wrong

Compression can succeed by reducing content/tool-result size without reducing the number of message objects. In this case it saved roughly 105k tokens (~36%, from ~288,028 to ~183,180) but preserved all 220 message rows.

The conversation loop appears to use message-count reduction as the success criterion:

_orig_len = len(messages)
messages, active_system_prompt = agent._compress_context(...)
if len(messages) >= _orig_len:
    break  # Cannot compress further

That conflates two different conditions:

  1. No-op compression: transcript materially unchanged.
  2. Effective token compression: same number of rows, much smaller request.

Only (1) should be treated as compression exhaustion.
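One way to tell the two conditions apart is to look at both row count and estimated tokens. The sketch below is illustrative only: classify_compression, Outcome, and the 5% min_savings cutoff are hypothetical names and values, not part of Hermes.

```python
from enum import Enum


class Outcome(Enum):
    NO_OP = "no_op"          # condition 1: transcript materially unchanged
    EFFECTIVE = "effective"  # condition 2: request got materially smaller


def classify_compression(old_count, new_count, old_tokens, new_tokens,
                         min_savings=0.05):
    """Classify a compression pass from before/after counts.

    min_savings is a hypothetical cutoff for "material" token savings.
    """
    rows_reduced = new_count < old_count
    savings = (old_tokens - new_tokens) / old_tokens if old_tokens > 0 else 0.0
    if rows_reduced or savings >= min_savings:
        return Outcome.EFFECTIVE
    return Outcome.NO_OP


# The incident from the logs: same row count, about 36% fewer tokens.
print(classify_compression(220, 220, 288_028, 183_180))
```

With this kind of check, the incident above would count as effective compression, and only a pass that changes neither rows nor tokens would count as exhaustion.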

Expected behavior

Compression success should be evaluated by material request-size reduction and the active trigger reason, not only by message count.

Examples:

  • If compression reduced estimated request tokens below threshold, continue.
  • If compression reduced estimated request tokens materially but not enough, allow another pass or report token pressure accurately.
  • If the trigger was message-count hygiene, either run a mode that actually reduces effective message rows or emit a specific message-count-hygiene no-op reason.
  • Do not report Context length exceeded when the model context is 1M and post-compression estimate is ~183k.
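The expected behaviors listed above could be expressed as a small decision function that picks the next step and an accurate message. This is a sketch under stated assumptions: describe_result, the action strings, and min_savings are hypothetical, and only the trigger-reason strings are taken from the issue. It also assumes the hygiene limit fires when the row count is at or above the limit, as the 220/220 log line suggests.

```python
def describe_result(trigger_reason, old_tokens, new_tokens, threshold_tokens,
                    new_message_count, message_limit, min_savings=0.05):
    """Return (action, text) for a finished compression pass.

    All names are hypothetical; this is not the Hermes implementation.
    """
    savings = (old_tokens - new_tokens) / old_tokens if old_tokens > 0 else 0.0

    if trigger_reason == "message_hygiene":
        # ASSUMPTION: the hygiene limit fires at count >= limit.
        if new_message_count < message_limit:
            return "continue", "message count is back under the hygiene limit"
        return "report_hygiene_noop", (
            f"Message-count hygiene made no progress: {new_message_count} rows "
            f"(limit {message_limit}); request is ~{new_tokens:,} tokens."
        )

    if new_tokens < threshold_tokens:
        return "continue", f"request is ~{new_tokens:,} tokens, below threshold"
    if savings >= min_savings:
        return "retry", f"saved {savings:.0%} but still ~{new_tokens:,} tokens"
    return "report_token_pressure", (
        f"Token pressure: ~{new_tokens:,} tokens after compression "
        f"(threshold {threshold_tokens:,})."
    )


action, text = describe_result("message_hygiene", 288_028, 183_180,
                               350_000, 220, 220)
print(action, "-", text)
```

The point of the design is that a hygiene no-op and a genuine token overflow produce different messages, so neither is reported as "Context length exceeded" when the request fits comfortably.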

Actual behavior

The loop treats unchanged message count as compression failure/exhaustion even when token pressure was substantially improved. The gateway then auto-resets the session.

Related issues

This is adjacent to, but not fully covered by:

  • #6202 — /compress can report success even when transcript is unchanged
  • #15195 — gateway hygiene hard message cap counts tool rows in tool-heavy Telegram sessions
  • #12626 — gateway auto-compacts below token pressure due to message count
  • #35809 — compression exhaustion / auto-reset loop

This issue is specifically about the success criterion after compression: same message count does not imply no compression.

Suggested fix direction

Return structured compression outcome metadata, e.g.:

  • changed_messages
  • old_message_count, new_message_count
  • old_request_tokens, new_request_tokens
  • token_savings_pct
  • trigger_reason (token_threshold, message_hygiene, manual, provider_413)
  • no_op_reason when applicable
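As a rough illustration, the fields above might be grouped into a single result object. The field names follow the list; the class name, the types, and the choice to derive token_savings_pct from the two token counts are assumptions. changed_messages is left optional because the logs do not report it.

```python
from dataclasses import dataclass
from typing import Optional

# Trigger reasons named in the issue.
TRIGGER_REASONS = {"token_threshold", "message_hygiene", "manual", "provider_413"}


@dataclass(frozen=True)
class CompressionOutcome:
    """Hypothetical structured result of one compression pass."""
    old_message_count: int
    new_message_count: int
    old_request_tokens: int
    new_request_tokens: int
    trigger_reason: str
    changed_messages: Optional[int] = None  # assumed to be a count
    no_op_reason: Optional[str] = None

    def __post_init__(self):
        if self.trigger_reason not in TRIGGER_REASONS:
            raise ValueError(f"unknown trigger_reason: {self.trigger_reason!r}")

    @property
    def token_savings_pct(self) -> float:
        """Percentage of estimated request tokens removed by this pass."""
        if self.old_request_tokens <= 0:
            return 0.0
        saved = self.old_request_tokens - self.new_request_tokens
        return 100.0 * saved / self.old_request_tokens


# Values from the incident logs; the trigger reason is an assumption.
outcome = CompressionOutcome(220, 220, 288_028, 183_180, "message_hygiene")
print(f"{outcome.token_savings_pct:.1f}% saved, rows "
      f"{outcome.old_message_count}->{outcome.new_message_count}")
```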

Minimum viable fix:

  1. Re-estimate request tokens immediately after _compress_context(...).
  2. Treat compression as successful if request tokens decreased materially and/or fell below threshold, even if len(messages) is unchanged.
  3. Only set compression_exhausted=True when both message count and request size are materially unchanged, or when post-compression request size still exceeds provider/model limits after max passes.
  4. Update the error text to distinguish token overflow from message-count hygiene exhaustion.
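Steps 1 to 3 can be sketched as a bounded loop that re-estimates tokens after every pass. Everything here is a stand-in: estimate_tokens uses a crude 4-characters-per-token guess, halve_contents is a toy compressor, and compress_until_fits, limit_tokens, max_passes, and min_savings are hypothetical names. The real loop calls agent._compress_context(...), whose arguments are not shown in the issue.

```python
def estimate_tokens(messages):
    """Crude stand-in estimator: roughly 4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4


def halve_contents(messages):
    """Toy compressor: shrinks each message but keeps the row count."""
    return [{**m, "content": m["content"][: len(m["content"]) // 2]}
            for m in messages]


def compress_until_fits(messages, compress, limit_tokens,
                        max_passes=3, min_savings=0.05):
    """Return (messages, compression_exhausted)."""
    for _ in range(max_passes):
        old_tokens = estimate_tokens(messages)
        if old_tokens < limit_tokens:
            return messages, False
        old_count = len(messages)
        messages = compress(messages)
        new_tokens = estimate_tokens(messages)      # step 1: re-estimate
        if new_tokens < limit_tokens:               # step 2: success by tokens
            return messages, False
        savings = (old_tokens - new_tokens) / old_tokens if old_tokens else 0.0
        # Step 3: exhausted only if neither rows nor tokens moved materially.
        if len(messages) >= old_count and savings < min_savings:
            return messages, True
    # Still over the limit after max_passes counts as exhausted too.
    return messages, estimate_tokens(messages) >= limit_tokens


msgs = [{"role": "tool", "content": "x" * 400} for _ in range(10)]
out, exhausted = compress_until_fits(msgs, halve_contents, limit_tokens=600)
print(len(out), estimate_tokens(out), exhausted)
```

In this toy run the row count stays at 10 while the estimate drops below the limit, so the loop reports success instead of exhaustion. An identity compressor would be flagged as exhausted on the first pass.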

Impact

  • Long-context models with 1M windows can be reset at roughly 200-400 raw transcript rows even when token usage is far below the context window.
  • Tool-heavy Telegram sessions are especially vulnerable because tool calls/results inflate the raw row count.
  • The current error message misleads users into thinking GPT-5.5 cannot handle ~288k tokens, when the failure actually comes from Hermes' compression bookkeeping.
