hermes - 💡(How to fix) Fix Preflight compression guard bypasses token threshold, causing silent context overflow on sessions with few messages but large total tokens

StepCodex · 2026-05-17T11:06:57Z

[hermes] The preflight compression check in run agent.py uses len messages protect first n + protect last n + 1 default: 24 as the gatekeeper condition. When t… The preflight compression check in `run_agent.py` uses `len(messages) > protect_first_n + protect_last_n + 1` (default: 24) as the gatekeeper condition. When this message-count guard fails, the token estimation and compression logic that follows it is **never reached** — even if the session's total token count far exceeds the configured `compression.threshold` (default 500K tokens for a 1M-context model). This causes silent context overflow in a specific but common scenario: sessions that have very large single messages (e.g., reading big files, failed compression attempts producing 1M+ token messages) but have fewer than 24 total messages. The result is an unrecoverable session that can only be fixed by manual `/compress` or `/new`. In API Server / Open WebUI mode, each request creates a fresh `AIAgent` that runs preflight once. If the client sends back a compact history (few messages, huge content), the guard blocks compression permanently. --- ## Fix / Workaround This causes silent context overflow in a specific but common scenario: sessions that have very large single messages (e.g., reading big files, failed compression attempts producing 1M+ token messages) but have fewer than 24 total messages. The result is an unrecoverable session that can only be fixed by manual `/compress` or `/new`. ## Summary The preflight compression check in `run_agent.py` uses `len(messages) > protect_first_n + protect_last_n + 1` (default: 24) as the gatekeeper condition. When this message-count guard fails, the token estimation and compression logic that follows it is **never reached** — even if the session's total token count far exceeds the configured `compression.threshold` (default 500K tokens for a 1M-context model). This causes silent context overflow in a specific but common scenario: sessions that have very large single messages (e.g., reading big files, failed compression attempts producing 1M+ token messages) but have fewer than 24 total messages. The result is an unrecoverable session that can only be fixed by manual `/compress` or `/new`. In API Server / Open WebUI mode, each request creates a fresh `AIAgent` that runs preflight once. If the client sends back a compact history (few messages, huge content), the guard blocks compression permanently. --- ## Root Cause ### Primary bug: `len(messages)` guard preempts token-based triggering At `run_agent.py` line 11220: ```python if ( self.compression_enabled and len(messages) > self.context_compressor.protect_first_n + self.context_compressor.protect_last_n + 1 # 3 + 20 + 1 = 24 ): _preflight_tokens = estimate_request_tokens_rough( messages, system_prompt=active_system_prompt or "", tools=self.tools or None, ) if _preflight_tokens >= self.context_compressor.threshold_tokens: # ... compress ``` The `len(messages) > 24` guard is a **gate**, not a **hint**. When it evaluates to False, the entire token-check block — and therefore compression — is skipped. It fails when a conversation has few (e.g. 8) messages that are each very large (e.g. 1.3M tokens in a single message). ### Secondary issue: `update_from_response` data is never used for compression triggering The `ContextCompressor.update_from_response()` method (line 488 of `context_compressor.py`) already receives actual `prompt_tokens` from each API response. `should_compress()` is implemented but **never called from anywhere in the main loop** (`run_agent.py` `run_conversation()`). Searching the entire `run_agent.py` file for any call to `should_compress` returns zero results. ### Tertiary: `estimate_request_tokens_rough()` may underestimate This function uses `_CHARS_PER_TOKEN = 4`, which underestimates CJK text (Chinese averages ~1.5-2 chars/token), code, JSON, and base64 content. Actual token counts can be 1.5-2x the estimate. --- ## Observed Evidence ### From agent.log - Zero occurrences of `"Preflight compression"` — the preflight path never reached the token-check branch. - Zero occurrences of `"context compression started"` — `_compress_context()` was never called. - 24 occurrences of `"Auxiliary compression: using auto"` — these are just auxiliary model initialization messages, not actual compression events. ### From request_dump analysis (latest dump, 6.75MB) ``` 'messages': 8 messages [0] role=system content= 19,556 chars [1] role=user content= 8,928 chars [4] role=user content= 4,781 chars (compression summary marker) [7] role=user content=5,334,419 chars ← failed compression result Total chars: 5,368,491 (~1.34M tokens at 4 chars/token) ``` The session had only **8 messages** (well below the 24-message gate) but **1.34M tokens** (well above the 500K threshold). Preflight silently skipped compression because `8 > 24 == False`. --- ## Reproduction Steps 1. Use Hermes in API Server mode (or any mode where the gateway passes `conversation_history`)

The preflight compression check in run_agent.py uses len(messages) > protect_first_n + protect_last_n + 1 (default: 24) as the gatekeeper condition. When this message-count guard fails, the token estimation and compression logic that follows it is never reached — even if the session's total token count far exceeds the configured compression.threshold (default 500K tokens for a 1M-context model).

This causes silent context overflow in a specific but common scenario: sessions that have very large single messages (e.g., reading big files, failed compression attempts producing 1M+ token messages) but have fewer than 24 total messages. The result is an unrecoverable session that can only be fixed by manual /compress or /new.

In API Server / Open WebUI mode, each request creates a fresh AIAgent that runs preflight once. If the client sends back a compact history (few messages, huge content), the guard blocks compression permanently.

Fix Action

Fix / Workaround

Code Example

if (
    self.compression_enabled
    and len(messages) > self.context_compressor.protect_first_n
                        + self.context_compressor.protect_last_n + 1
                        # 3 + 20 + 1 = 24
):
    _preflight_tokens = estimate_request_tokens_rough(
        messages,
        system_prompt=active_system_prompt or "",
        tools=self.tools or None,
    )
    if _preflight_tokens >= self.context_compressor.threshold_tokens:
        # ... compress

---

'messages': 8 messages
  [0] role=system   content= 19,556 chars
  [1] role=user     content=  8,928 chars
  [4] role=user     content=  4,781 chars  (compression summary marker)
  [7] role=user     content=5,334,419 chars  ← failed compression result

Total chars: 5,368,491 (~1.34M tokens at 4 chars/token)

---

_approx_chars = sum(len(str(m.get('content', '')) or '') for m in messages)
_approx_tokens = _approx_chars // 4

if (
    self.compression_enabled
    and (len(messages) > 24 or _approx_tokens > self.context_compressor.threshold_tokens)
):
    _preflight_tokens = estimate_request_tokens_rough(...)
    if _preflight_tokens >= self.context_compressor.threshold_tokens:
        self._compress_context(...)

Summary

Root Cause

Primary bug: `len(messages)` guard preempts token-based triggering

At run_agent.py line 11220:

if (
    self.compression_enabled
    and len(messages) > self.context_compressor.protect_first_n
                        + self.context_compressor.protect_last_n + 1
                        # 3 + 20 + 1 = 24
):
    _preflight_tokens = estimate_request_tokens_rough(
        messages,
        system_prompt=active_system_prompt or "",
        tools=self.tools or None,
    )
    if _preflight_tokens >= self.context_compressor.threshold_tokens:
        # ... compress

The len(messages) > 24 guard is a gate, not a hint. When it evaluates to False, the entire token-check block — and therefore compression — is skipped. It fails when a conversation has few (e.g. 8) messages that are each very large (e.g. 1.3M tokens in a single message).

Secondary issue: `update_from_response` data is never used for compression triggering

The ContextCompressor.update_from_response() method (line 488 of context_compressor.py) already receives actual prompt_tokens from each API response. should_compress() is implemented but never called from anywhere in the main loop (run_agent.py run_conversation()). Searching the entire run_agent.py file for any call to should_compress returns zero results.

Tertiary: `estimate_request_tokens_rough()` may underestimate

This function uses _CHARS_PER_TOKEN = 4, which underestimates CJK text (Chinese averages ~1.5-2 chars/token), code, JSON, and base64 content. Actual token counts can be 1.5-2x the estimate.

Observed Evidence

From agent.log

Zero occurrences of "Preflight compression" — the preflight path never reached the token-check branch.
Zero occurrences of "context compression started" — _compress_context() was never called.
24 occurrences of "Auxiliary compression: using auto" — these are just auxiliary model initialization messages, not actual compression events.

From request_dump analysis (latest dump, 6.75MB)

'messages': 8 messages
  [0] role=system   content= 19,556 chars
  [1] role=user     content=  8,928 chars
  [4] role=user     content=  4,781 chars  (compression summary marker)
  [7] role=user     content=5,334,419 chars  ← failed compression result

Total chars: 5,368,491 (~1.34M tokens at 4 chars/token)

The session had only 8 messages (well below the 24-message gate) but 1.34M tokens (well above the 500K threshold). Preflight silently skipped compression because 8 > 24 == False.

Reproduction Steps

Use Hermes in API Server mode (or any mode where the gateway passes conversation_history).
Create a session with fewer than 24 messages but very large content per message (e.g., read a 200K-token file, or have a failed compression attempt produce a 1M+ token summary).
Observe that preflight compression never triggers, even though token count exceeds threshold_tokens.
Next API call reaches the model's context limit and may fail with 413 or a non-retryable error.
Check agent.log — no "Preflight compression" or "context compression started" entries.

Environment

Hermes Agent version: observed on main commit as of 2026-05-17
Model: deepseek-v4-flash (1M context, threshold=500K)
Platform: API Server (hermes-web-ui frontend)
Config: defaults (compression.enabled: true, threshold: 0.5, protect_last_n: 20)

Proposed Fix

Fix 1 (minimal, high impact) ⭐ Recommended

Add a token-based OR condition to the preflight gate:

_approx_chars = sum(len(str(m.get('content', '')) or '') for m in messages)
_approx_tokens = _approx_chars // 4

if (
    self.compression_enabled
    and (len(messages) > 24 or _approx_tokens > self.context_compressor.threshold_tokens)
):
    _preflight_tokens = estimate_request_tokens_rough(...)
    if _preflight_tokens >= self.context_compressor.threshold_tokens:
        self._compress_context(...)

Fix 2 (medium)

Use the actual prompt_tokens returned by the LLM API as a secondary compression trigger after each API response.

Fix 3 (optional, completeness)

Add a mid-loop check: after tool results are appended but before the next API call, if the token estimate exceeds threshold, compress before sending.

Related Issues

#6202: /compress reports success even when unchanged — discusses the same 24-msg guard but for manual /compress path, not preflight auto-compression
#22871: Fix preflight compression pass budget — related to preflight but different issue (pass count, not gate condition)
#25921: Gateway reuses parent-sized history after compression split
#12213: Feature request for compress as native tool

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

hermes - 💡(How to fix) Fix Preflight compression guard bypasses token threshold, causing silent context overflow on sessions with few messages but large total tokens

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Root Cause

Fix Action

Fix / Workaround

Code Example

Summary

Root Cause

Primary bug: `len(messages)` guard preempts token-based triggering

Secondary issue: `update_from_response` data is never used for compression triggering

Tertiary: `estimate_request_tokens_rough()` may underestimate

Observed Evidence

From agent.log

From request_dump analysis (latest dump, 6.75MB)

Reproduction Steps

Environment

Proposed Fix

Fix 1 (minimal, high impact) ⭐ Recommended

Fix 2 (medium)

Fix 3 (optional, completeness)

Related Issues

Still need to ship something?

TRENDING

hermes - 💡(How to fix) Fix Preflight compression guard bypasses token threshold, causing silent context overflow on sessions with few messages but large total tokens

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Root Cause

Fix Action

Fix / Workaround

Code Example

Summary

Root Cause

Primary bug: len(messages) guard preempts token-based triggering

Secondary issue: update_from_response data is never used for compression triggering

Tertiary: estimate_request_tokens_rough() may underestimate

Observed Evidence

From agent.log

From request_dump analysis (latest dump, 6.75MB)

Reproduction Steps

Environment

Proposed Fix

Fix 1 (minimal, high impact) ⭐ Recommended

Fix 2 (medium)

Fix 3 (optional, completeness)

Related Issues

Still need to ship something?

RELATED_DISCOVERY

TRENDING

Primary bug: `len(messages)` guard preempts token-based triggering

Secondary issue: `update_from_response` data is never used for compression triggering

Tertiary: `estimate_request_tokens_rough()` may underestimate