hermes - 💡(How to fix) Fix Preflight compression guard bypasses token threshold, causing silent context overflow on sessions with few messages but large total tokens

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

The preflight compression check in run_agent.py uses len(messages) > protect_first_n + protect_last_n + 1 (default: 24) as the gatekeeper condition. When this message-count guard fails, the token estimation and compression logic that follows it is never reached — even if the session's total token count far exceeds the configured compression.threshold (default 500K tokens for a 1M-context model).

This causes silent context overflow in a specific but common scenario: sessions that have very large single messages (e.g., reading big files, failed compression attempts producing 1M+ token messages) but have fewer than 24 total messages. The result is an unrecoverable session that can only be fixed by manual /compress or /new.

In API Server / Open WebUI mode, each request creates a fresh AIAgent that runs preflight once. If the client sends back a compact history (few messages, huge content), the guard blocks compression permanently.


Error Message

  1. Next API call reaches the model's context limit and may fail with 413 or a non-retryable error.

Root Cause

Root Cause

Fix Action

Fix / Workaround

This causes silent context overflow in a specific but common scenario: sessions that have very large single messages (e.g., reading big files, failed compression attempts producing 1M+ token messages) but have fewer than 24 total messages. The result is an unrecoverable session that can only be fixed by manual /compress or /new.

Code Example

if (
    self.compression_enabled
    and len(messages) > self.context_compressor.protect_first_n
                        + self.context_compressor.protect_last_n + 1
                        # 3 + 20 + 1 = 24
):
    _preflight_tokens = estimate_request_tokens_rough(
        messages,
        system_prompt=active_system_prompt or "",
        tools=self.tools or None,
    )
    if _preflight_tokens >= self.context_compressor.threshold_tokens:
        # ... compress

---

'messages': 8 messages
  [0] role=system   content= 19,556 chars
  [1] role=user     content=  8,928 chars
  [4] role=user     content=  4,781 chars  (compression summary marker)
  [7] role=user     content=5,334,419 chars  ← failed compression result

Total chars: 5,368,491 (~1.34M tokens at 4 chars/token)

---

_approx_chars = sum(len(str(m.get('content', '')) or '') for m in messages)
_approx_tokens = _approx_chars // 4

if (
    self.compression_enabled
    and (len(messages) > 24 or _approx_tokens > self.context_compressor.threshold_tokens)
):
    _preflight_tokens = estimate_request_tokens_rough(...)
    if _preflight_tokens >= self.context_compressor.threshold_tokens:
        self._compress_context(...)
RAW_BUFFERClick to expand / collapse

Summary

The preflight compression check in run_agent.py uses len(messages) > protect_first_n + protect_last_n + 1 (default: 24) as the gatekeeper condition. When this message-count guard fails, the token estimation and compression logic that follows it is never reached — even if the session's total token count far exceeds the configured compression.threshold (default 500K tokens for a 1M-context model).

This causes silent context overflow in a specific but common scenario: sessions that have very large single messages (e.g., reading big files, failed compression attempts producing 1M+ token messages) but have fewer than 24 total messages. The result is an unrecoverable session that can only be fixed by manual /compress or /new.

In API Server / Open WebUI mode, each request creates a fresh AIAgent that runs preflight once. If the client sends back a compact history (few messages, huge content), the guard blocks compression permanently.


Root Cause

Primary bug: len(messages) guard preempts token-based triggering

At run_agent.py line 11220:

if (
    self.compression_enabled
    and len(messages) > self.context_compressor.protect_first_n
                        + self.context_compressor.protect_last_n + 1
                        # 3 + 20 + 1 = 24
):
    _preflight_tokens = estimate_request_tokens_rough(
        messages,
        system_prompt=active_system_prompt or "",
        tools=self.tools or None,
    )
    if _preflight_tokens >= self.context_compressor.threshold_tokens:
        # ... compress

The len(messages) > 24 guard is a gate, not a hint. When it evaluates to False, the entire token-check block — and therefore compression — is skipped. It fails when a conversation has few (e.g. 8) messages that are each very large (e.g. 1.3M tokens in a single message).

Secondary issue: update_from_response data is never used for compression triggering

The ContextCompressor.update_from_response() method (line 488 of context_compressor.py) already receives actual prompt_tokens from each API response. should_compress() is implemented but never called from anywhere in the main loop (run_agent.py run_conversation()). Searching the entire run_agent.py file for any call to should_compress returns zero results.

Tertiary: estimate_request_tokens_rough() may underestimate

This function uses _CHARS_PER_TOKEN = 4, which underestimates CJK text (Chinese averages ~1.5-2 chars/token), code, JSON, and base64 content. Actual token counts can be 1.5-2x the estimate.


Observed Evidence

From agent.log

  • Zero occurrences of "Preflight compression" — the preflight path never reached the token-check branch.
  • Zero occurrences of "context compression started"_compress_context() was never called.
  • 24 occurrences of "Auxiliary compression: using auto" — these are just auxiliary model initialization messages, not actual compression events.

From request_dump analysis (latest dump, 6.75MB)

'messages': 8 messages
  [0] role=system   content= 19,556 chars
  [1] role=user     content=  8,928 chars
  [4] role=user     content=  4,781 chars  (compression summary marker)
  [7] role=user     content=5,334,419 chars  ← failed compression result

Total chars: 5,368,491 (~1.34M tokens at 4 chars/token)

The session had only 8 messages (well below the 24-message gate) but 1.34M tokens (well above the 500K threshold). Preflight silently skipped compression because 8 > 24 == False.


Reproduction Steps

  1. Use Hermes in API Server mode (or any mode where the gateway passes conversation_history).
  2. Create a session with fewer than 24 messages but very large content per message (e.g., read a 200K-token file, or have a failed compression attempt produce a 1M+ token summary).
  3. Observe that preflight compression never triggers, even though token count exceeds threshold_tokens.
  4. Next API call reaches the model's context limit and may fail with 413 or a non-retryable error.
  5. Check agent.log — no "Preflight compression" or "context compression started" entries.

Environment

  • Hermes Agent version: observed on main commit as of 2026-05-17
  • Model: deepseek-v4-flash (1M context, threshold=500K)
  • Platform: API Server (hermes-web-ui frontend)
  • Config: defaults (compression.enabled: true, threshold: 0.5, protect_last_n: 20)

Proposed Fix

Fix 1 (minimal, high impact) ⭐ Recommended

Add a token-based OR condition to the preflight gate:

_approx_chars = sum(len(str(m.get('content', '')) or '') for m in messages)
_approx_tokens = _approx_chars // 4

if (
    self.compression_enabled
    and (len(messages) > 24 or _approx_tokens > self.context_compressor.threshold_tokens)
):
    _preflight_tokens = estimate_request_tokens_rough(...)
    if _preflight_tokens >= self.context_compressor.threshold_tokens:
        self._compress_context(...)

Fix 2 (medium)

Use the actual prompt_tokens returned by the LLM API as a secondary compression trigger after each API response.

Fix 3 (optional, completeness)

Add a mid-loop check: after tool results are appended but before the next API call, if the token estimate exceeds threshold, compress before sending.


Related Issues

  • #6202: /compress reports success even when unchanged — discusses the same 24-msg guard but for manual /compress path, not preflight auto-compression
  • #22871: Fix preflight compression pass budget — related to preflight but different issue (pass count, not gate condition)
  • #25921: Gateway reuses parent-sized history after compression split
  • #12213: Feature request for compress as native tool

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - 💡(How to fix) Fix Preflight compression guard bypasses token threshold, causing silent context overflow on sessions with few messages but large total tokens