hermes - [Bug] CLI: context compression runs but result discarded on next turn (session rotation not synced to conversation_history)

hermes · 2026-05-21 16:54:30

In CLI mode (not gateway), automatic context compression fires correctly (the log shows messages=162->10, saving ~115K tokens), but the compressed message list is not carried forward to the next user turn. The subsequent API call loads the pre-compression history, making the compression effectively a no-op. The VRAM and time spent on the auxiliary compression model are wasted.

This is the CLI counterpart of the gateway-side issue being addressed in PR #29140 and PR #29505 — those fix the gateway path (gateway/run.py) but the CLI path (cli.py) remains broken.

Environment

- Platform: WSL2 (Windows Subsystem for Linux) on Windows 11
  - Linux DESKTOP-4216AAM 6.6.114.1-microsoft-standard-WSL2 #1 SMP PREEMPT_DYNAMIC Mon Dec 1 20:46:23 UTC 2025 x86_64 GNU/Linux
- Hardware: NVIDIA GeForce RTX 2060 SUPER (8 GB VRAM), driver 581.80
- Hermes: v2026.5.16-590-g5e743559e (HEAD 5e743559e, 2026-05-21)
- Python: 3.12.3
- Docker: 29.5.0
- Ollama: 0.24.0 (running in Docker container with --gpus all)

Configuration

- Mode: CLI (hermes in terminal, interactive session)
- Main model: deepseek-v4-flash via DeepSeek API (1,000,000 context)
- Auxiliary compression model: llama3.2:3b via Ollama (local, http://localhost:11434/v1)
- Compression enabled: true
- Threshold: 0.128 (12.8% × 1M = 128K trigger)
- Target ratio: 0.15 (compress to 15% of threshold ≈ 19K tokens)
- Protect first N: 3 | Protect last N: 20
- Aux compression context length: 131,072 (auto-detected for llama3.2:3b)
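
The trigger and target sizes quoted in the configuration can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the target is computed as the ratio times the trigger size, as the configuration list describes; `compression_budget` is a hypothetical helper name, not a Hermes function.

```python
# Values taken from the configuration listed above.
CONTEXT_WINDOW = 1_000_000   # main model context length, in tokens
THRESHOLD = 0.128            # fraction of the context window that triggers compression
TARGET_RATIO = 0.15          # fraction of the trigger size to compress down to


def compression_budget(context_window: int, threshold: float, target_ratio: float) -> tuple[int, int]:
    """Return (trigger_tokens, target_tokens) for the given settings."""
    trigger = round(context_window * threshold)
    target = round(trigger * target_ratio)
    return trigger, target


trigger_tokens, target_tokens = compression_budget(CONTEXT_WINDOW, THRESHOLD, TARGET_RATIO)
print(f"trigger={trigger_tokens:,} target={target_tokens:,}")
```

With these settings the trigger is 128,000 tokens and the target is 19,200 tokens, which matches the figures quoted above.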

Steps to Reproduce

1. Run hermes in CLI mode.
2. Have a long conversation that exceeds the compression threshold (128K tokens in our case).
3. Observe the first auto-compression works correctly (subsequent API calls show reduced token count).
4. Continue the conversation past the threshold again.
5. Observe the second (and any subsequent) auto-compression — the log says compression done: messages=N->M tokens=~X, but the very next API call still shows in=~134K (the pre-compression size).

Evidence from Logs


Session: 20260521_222812_a56325

22:37:08  Compression #1 started:  messages=142 tokens=~129,390
22:37:25  Compression #1 done:     messages=142->19 tokens=~30,228
22:39:xx  API call #62:            in=29,812     ← Correctly compressed ✓

23:00:31  Compression #2 started:  messages=162 tokens=~139,028
23:00:48  Compression #2 done:     messages=162->10 tokens=~23,825
23:03:29  API call #75:            in=134,191    ← NOT compressed ✗
23:03:38  API call #76:            in=134,248    ← Still not compressed ✗


Notice the pattern: Compression #1 works, but Compression #2 (and any subsequent one) does not. The compressed message list (~24K tokens) never reaches the API client.
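
The same comparison can be made mechanically. The sketch below uses only the numbers from the log excerpt above; the `CompressionEvent` class and the midpoint heuristic are assumptions introduced for illustration and are not part of Hermes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompressionEvent:
    label: str
    tokens_before: int   # estimate logged when compression started
    tokens_after: int    # estimate logged when compression finished
    next_api_input: int  # "in=" value of the next API call shown in the log

    @property
    def saved(self) -> int:
        return self.tokens_before - self.tokens_after

    @property
    def applied(self) -> bool:
        # Heuristic (an assumption): the compressed list was used if the next
        # request is closer in size to the compressed estimate than to the original.
        midpoint = (self.tokens_before + self.tokens_after) / 2
        return self.next_api_input < midpoint


events = [
    CompressionEvent("Compression #1", 129_390, 30_228, 29_812),
    CompressionEvent("Compression #2", 139_028, 23_825, 134_191),
]

for e in events:
    status = "applied" if e.applied else "discarded"
    print(f"{e.label}: saved ~{e.saved:,} tokens, next call in={e.next_api_input:,} -> {status}")
```

Under this heuristic, Compression #1 is classified as applied and Compression #2 as discarded, which agrees with the manual reading of the log.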

Root Cause Analysis

The issue lives at the boundary between run_conversation() (in agent/conversation_loop.py) and the CLI's history tracking (cli.py), with a possible contribution from the session DB reload path.

What goes wrong:

1. The CLI stores conversation history in self.conversation_history (cli.py).
2. Each turn, it calls:
   ```python
   self.agent.run_conversation(
       conversation_history=self.conversation_history[:-1],
       ...
   )
   ```

   (line 11227 of cli.py)
3. Inside run_conversation(), the preflight compression at conversation_loop.py:459 compresses messages (updates the local variable) and rotates agent.session_id — the old session is ended in SQLite, a new one is created (conversation_compression.py:383).
4. The compressed messages list is used for the current turn's API calls. By the time the turn ends, messages has grown back with the current turn's tool output and assistant responses.
5. After run_conversation() returns, the CLI syncs its history at line 11369:
   ```python
   self.conversation_history = result.get("messages", self.conversation_history)
   ```

6. The problem: result["messages"] is the inflated post-turn list, not the compressed baseline. The CLI's conversation_history now contains the pre-compression messages + the current turn's growth.
7. On the next user turn, self.conversation_history[:-1] (still 130K+ tokens) is passed to run_conversation again. The preflight check fires compression again — creating a loop of wasted compressions.
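
The sequence above can be reproduced with a toy model. The sketch below is not Hermes code: `ToyAgent`, `ToyCLI`, the `user_message` parameter, and the use of message counts as a stand-in for tokens are all assumptions. It models the behaviour described in steps 6 and 7, where the list returned to the caller does not reflect the compression.

```python
class ToyAgent:
    """Stand-in for the agent. Illustrative only, not Hermes code."""

    def __init__(self, threshold: int, keep_last: int) -> None:
        self.threshold = threshold    # compress when the message count exceeds this
        self.keep_last = keep_last    # messages kept verbatim after the summary
        self.session_id = 0
        self.compressions = 0

    def run_conversation(self, user_message: str, conversation_history: list[str]) -> dict:
        messages = list(conversation_history) + [user_message]
        working = messages
        if len(messages) > self.threshold:
            # Compression only affects the list used for this turn's requests.
            working = ["<summary>"] + messages[-self.keep_last:]
            self.session_id += 1      # session rotation
            self.compressions += 1
        reply = f"reply to {user_message}"
        # Assumption being modelled: the returned list is built from the
        # uncompressed history, so the caller never sees `working`.
        return {"messages": messages + [reply], "request_size": len(working)}


class ToyCLI:
    def __init__(self, agent: ToyAgent) -> None:
        self.agent = agent
        self.conversation_history: list[str] = []

    def turn(self, text: str) -> dict:
        self.conversation_history.append(text)
        result = self.agent.run_conversation(
            user_message=text,
            conversation_history=self.conversation_history[:-1],
        )
        # Same sync pattern as step 5 above.
        self.conversation_history = result.get("messages", self.conversation_history)
        return result


agent = ToyAgent(threshold=6, keep_last=2)
cli = ToyCLI(agent)
for n in range(1, 7):
    cli.turn(f"user message {n}")
    print(f"turn {n}: history={len(cli.conversation_history)} compressions={agent.compressions}")
```

With these toy parameters, compression runs again on each of turns 4 through 6 while the stored history keeps growing, which is the loop of wasted compressions described in step 7.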

Additional concern — session DB reload: Depending on whether the CLI's session persistence layer reloads from SQLite/JSONL between turns, the old session's pre-compression transcript may be re-read, bypassing the in-memory conversation_history entirely and compounding the issue. This is worth investigating further.

The gateway path (PR #29140) fixes this by eagerly persisting the rotated session ID and rewriting the transcript on disk. The CLI needs analogous treatment.

Related Issues / PRs

- PR #29140 — fix(gateway): eagerly persist mid-run session rotation (gateway only, not merged)
- PR #29505 — fix(agent+gateway): break infinite context compression loop (gateway + should_compress guard, not merged)
- Neither PR addresses the CLI path (cli.py lines 11227 / 11369 and the conversation_history sync)

Suggested Fix

In cli.py, after run_conversation() returns, detect whether compression rotated the session and replace the history accordingly:

```python
old_sid = getattr(self.agent, "session_id", None)
result = self.agent.run_conversation(...)
new_sid = getattr(self.agent, "session_id", None)

if new_sid and old_sid and new_sid != old_sid:
    # Compression rotated the session: use the agent's
    # internal compressed message store instead of the
    # inflated post-turn result["messages"].
    self.conversation_history = list(
        getattr(self.agent, "_session_messages", [])
        or result.get("messages", self.conversation_history)
    )
else:
    self.conversation_history = result.get(
        "messages", self.conversation_history
    )
```
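
One way to make this check testable without a live agent is to pull it into a small helper and exercise it with stub objects. This is a sketch under stated assumptions: `sync_history` is a hypothetical helper name, and `_session_messages` is the attribute named in the suggestion above, whose existence and contents in Hermes are not confirmed here.

```python
from types import SimpleNamespace


def sync_history(agent, result: dict, old_sid, current_history: list) -> list:
    """Choose the history to carry into the next turn (hypothetical helper)."""
    new_sid = getattr(agent, "session_id", None)
    if new_sid and old_sid and new_sid != old_sid:
        # Session rotated: prefer the agent's compressed store when it is non-empty.
        compressed = getattr(agent, "_session_messages", None)
        if compressed:
            return list(compressed)
    return list(result.get("messages", current_history))


inflated = {"messages": ["msg"] * 160}

# Rotation with a compressed store: the compressed list wins.
rotated = SimpleNamespace(session_id="s2", _session_messages=["<summary>", "user", "assistant"])
after_rotation = sync_history(rotated, inflated, old_sid="s1", current_history=[])

# No rotation: behaviour is unchanged from the current sync.
stable = SimpleNamespace(session_id="s1", _session_messages=["stale"])
without_rotation = sync_history(stable, inflated, old_sid="s1", current_history=[])

# Rotation but no store on the agent: fall back to result["messages"].
bare = SimpleNamespace(session_id="s2")
fallback = sync_history(bare, inflated, old_sid="s1", current_history=[])

print(len(after_rotation), len(without_rotation), len(fallback))
```

The three cases cover the branches of the suggested check. Only the first one replaces the inflated list, so the change would be a no-op on turns where no rotation occurred.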


A simpler interim workaround for users: lower the compression threshold (e.g. threshold: 0.100) so compression fires earlier, giving more headroom before the context grows back past the threshold.

Reported By

- Laputa Sunny — a rookie from China who noticed the pattern while stress-testing local auxiliary compression
- Hermes — the AI agent that dug through the source code and correlated the log evidence

This is the third report of this class of bug (after the gateway-side reports behind PR #29140 and PR #29505), but the first to specifically identify the CLI mode path.
