hermes - ✅ (Solved) Fix [Bug]: update_model() leaves tail_token_budget and max_summary_tokens stale after model switch [1 pull request, 2 comments, 3 participants]

nftpoetrist · 2026-04-25T08:11:59Z

GitHub stats

NousResearch/hermes-agent#15558•Fetched 2026-04-26 05:26:42

Timeline (top)

labeled ×3, commented ×2, cross-referenced ×1

Root Cause

update_model() in agent/context_compressor.py:

self.context_length = context_length
self.threshold_tokens = max(
    int(context_length * self.threshold_percent),
    MINIMUM_CONTEXT_LENGTH,
)
# tail_token_budget  ← NOT updated
# max_summary_tokens ← NOT updated

__init__ computes both correctly:

target_tokens = int(self.threshold_tokens * self.summary_target_ratio)
self.tail_token_budget = target_tokens
self.max_summary_tokens = min(int(self.context_length * 0.05), _SUMMARY_TOKENS_CEILING)
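As a worked example of the derivation, the sketch below recomputes the four fields as a pure function and compares what a fresh instance would hold with what survives a model switch. `derive()` and `after_buggy_switch()` are hypothetical helpers, and `threshold_percent=0.5`, `summary_target_ratio=0.2` and `MINIMUM_CONTEXT_LENGTH = 4_000` are assumed values; only the 12_000 ceiling appears in the PR's test.

```python
# Assumed constants; only the 12_000 ceiling is taken from the PR's test.
MINIMUM_CONTEXT_LENGTH = 4_000
_SUMMARY_TOKENS_CEILING = 12_000


def derive(context_length, threshold_percent=0.5, summary_target_ratio=0.2):
    """Compute all four fields the way __init__ is described as doing."""
    threshold_tokens = max(
        int(context_length * threshold_percent), MINIMUM_CONTEXT_LENGTH
    )
    return {
        "context_length": context_length,
        "threshold_tokens": threshold_tokens,
        "tail_token_budget": int(threshold_tokens * summary_target_ratio),
        "max_summary_tokens": min(
            int(context_length * 0.05), _SUMMARY_TOKENS_CEILING
        ),
    }


def after_buggy_switch(old_context_length, new_context_length):
    """State left behind when only the first two fields are refreshed."""
    state = derive(old_context_length)
    fresh = derive(new_context_length)
    state["context_length"] = fresh["context_length"]
    state["threshold_tokens"] = fresh["threshold_tokens"]
    return state


# Both directions described in the report: large -> small and small -> large.
for old, new in [(1_000_000, 8_000), (8_000, 200_000)]:
    stale = after_buggy_switch(old, new)
    fresh = derive(new)
    for key in ("tail_token_budget", "max_summary_tokens"):
        print(f"{old} -> {new} {key}: stale={stale[key]} expected={fresh[key]}")
```

Under these assumed ratios the stale `tail_token_budget` comes out at 100,000 after a 1M → 8K switch and 800 after an 8K → 200K switch, which lines up with the approximate figures quoted in the report.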

Fix Action

Fixed

Fixed by PR: fix(compressor): re-derive tail_token_budget and max_summary_tokens in update_model() (https://github.com/NousResearch/hermes-agent/pull/15560)

PR fix notes

PR #15560: fix(compressor): re-derive tail_token_budget and max_summary_tokens in update_model()

Repository: NousResearch/hermes-agent
Author: nftpoetrist
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/15560

Description (problem / solution / changelog)

What does this PR do?

ContextCompressor.update_model() correctly updates context_length and threshold_tokens when the active model changes, but does not re-derive tail_token_budget and max_summary_tokens. Both are computed from those same values in __init__ — omitting the re-derivation leaves them stale after a model switch to one with a significantly different context window.

Large → small context switch (e.g. gemini 1M → local 8K): tail_token_budget stays at ~100K. _find_tail_cut_by_tokens() treats the entire conversation as "tail" (the soft ceiling is never reached), leaving almost nothing in the middle region to summarize. Compression silently does nothing.

Small → large context switch (e.g. local 8K → claude-sonnet 200K): tail_token_budget stays at ~800 tokens. The tail is drastically under-protected; recent turns and the active task are aggressively summarized away.
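To see why an oversized budget leaves nothing to summarize, consider a toy version of a token-budgeted tail cut. This is not the real `_find_tail_cut_by_tokens()`, whose implementation is not shown here; the function name `toy_tail_cut`, the backward-walk strategy and the per-message token counts are all assumptions made for illustration.

```python
def toy_tail_cut(message_tokens, tail_token_budget):
    """Return the index where the protected tail starts.

    Walks backward from the newest message, adding messages to the tail
    until including the next one would exceed the budget. Messages before
    the returned index are candidates for summarization.
    """
    used = 0
    cut = len(message_tokens)
    for i in range(len(message_tokens) - 1, -1, -1):
        if used + message_tokens[i] > tail_token_budget:
            break
        used += message_tokens[i]
        cut = i
    return cut


# Twenty hypothetical messages of 400 tokens each (8K total).
history = [400] * 20

# Budget left over from a much larger model.
stale_cut = toy_tail_cut(history, tail_token_budget=100_000)
# Budget sized for a small model.
fresh_cut = toy_tail_cut(history, tail_token_budget=800)

print(stale_cut)  # 0: every message is tail, nothing is left to summarize
print(fresh_cut)  # 18: only the last two messages are protected
```

The opposite direction follows from the same logic: a budget of a few hundred tokens applied to a 200K conversation protects only the last message or two, so nearly everything else becomes eligible for summarization.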

Fix: mirror the two __init__ assignments at the end of update_model() using the already-updated self.threshold_tokens and self.context_length. Symmetric with __init__ lines 365–369. No behavior change when the context length is unchanged.
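One way to keep the two code paths from drifting apart again is to move the derivation into a single helper that both `__init__` and `update_model()` call. The PR itself mirrors the two assignments inline; the helper `_derive_budgets()` and the stand-in class `MiniCompressor` below are hypothetical, and the constants and default ratios are assumed values.

```python
MINIMUM_CONTEXT_LENGTH = 4_000      # assumed value, not shown in the issue
_SUMMARY_TOKENS_CEILING = 12_000    # literal taken from the PR's test


class MiniCompressor:
    """Hypothetical stand-in, not the real ContextCompressor."""

    def __init__(self, context_length, threshold_percent=0.5,
                 summary_target_ratio=0.2):
        self.threshold_percent = threshold_percent
        self.summary_target_ratio = summary_target_ratio
        self._derive_budgets(context_length)

    def update_model(self, context_length):
        # Same derivation as __init__, so no field can be left behind.
        self._derive_budgets(context_length)

    def _derive_budgets(self, context_length):
        self.context_length = context_length
        self.threshold_tokens = max(
            int(context_length * self.threshold_percent),
            MINIMUM_CONTEXT_LENGTH,
        )
        self.tail_token_budget = int(
            self.threshold_tokens * self.summary_target_ratio
        )
        self.max_summary_tokens = min(
            int(context_length * 0.05), _SUMMARY_TOKENS_CEILING
        )


c = MiniCompressor(context_length=1_000_000)
c.update_model(context_length=8_000)
fresh = MiniCompressor(context_length=8_000)

# After the switch, the instance matches one built directly for the new model.
print(vars(c) == vars(fresh))  # True
```

The trade-off is a slightly larger diff than the inline mirror, in exchange for a single place to edit if another derived field is added later.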

Related Issue

Fixes #15558

Type of Change

Bug fix

Changes Made

agent/context_compressor.py: add tail_token_budget and max_summary_tokens re-derivation to update_model() (+9 lines)

How to Test

from unittest.mock import patch

from agent.context_compressor import ContextCompressor

# Build the compressor against a mocked 1M-token context window.
with patch("agent.context_compressor.get_model_context_length", return_value=1_000_000):
    c = ContextCompressor(model="big/model", quiet_mode=True)

assert c.tail_token_budget == int(c.threshold_tokens * c.summary_target_ratio)

# Switch to an 8K model; both derived values must follow the new context length.
c.update_model(model="small/model", context_length=8_000)
assert c.tail_token_budget == int(c.threshold_tokens * c.summary_target_ratio)
assert c.max_summary_tokens == min(int(8_000 * 0.05), 12_000)

Checklist

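Code

Contributing Guide read
Conventional Commits followed
No duplicate PR
Only this fix included
pytest run
Platform: macOS

Documentation & Housekeeping

Docs updated: N/A
cli-config.yaml.example: N/A
CONTRIBUTING.md/AGENTS.md: N/A
Cross-platform impact: N/A
Tool descriptions: N/A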

Changed files

agent/context_compressor.py (modified, +9/-0)

Bug Description

ContextCompressor.update_model() updates context_length and threshold_tokens when the active model changes, but does not re-derive tail_token_budget and max_summary_tokens. Both values are computed from threshold_tokens and context_length in __init__ — without the same re-derivation in update_model(), they go stale whenever the user switches to a model with a significantly different context window.

Impact

Large → small context switch (e.g. gemini-3-flash 1M → local model 8K): tail_token_budget stays at ~100K. _find_tail_cut_by_tokens() tries to protect 100K tokens of "tail" in an 8K context — the entire conversation is treated as tail, leaving nothing to summarize. Compression silently does nothing.

Steps to Reproduce

1. Start a session with a large-context model (e.g. OpenRouter gemini or claude-sonnet).
2. Fill context past the compression threshold.
3. Switch to a model with a much smaller context window via /model.
4. Continue chatting until the new threshold is reached.
5. Observe: compression triggers but summarizes only 1–2 messages (tail budget exceeds total context).

Are you willing to submit a PR?

Yes, fix is ready.

extent analysis

TL;DR

Update update_model() to re-derive tail_token_budget and max_summary_tokens after changing the active model.

Guidance

In update_model(), recompute tail_token_budget and max_summary_tokens using the updated threshold_tokens and context_length values.
Verify the fix by switching between models with different context windows and checking that compression behaves as expected.
To mitigate the issue, ensure that update_model() updates all dependent variables when the model changes.
Consider adding tests to cover model switching scenarios and verify that tail_token_budget and max_summary_tokens are correctly updated.
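A regression test along those lines could assert the derivation invariants directly rather than hard-coding numbers. The helper below is a sketch: `budgets_consistent` is a hypothetical name, the `ceiling` default of 12_000 is taken from the PR's test, and the `SimpleNamespace` objects stand in for real `ContextCompressor` instances.

```python
from types import SimpleNamespace


def budgets_consistent(compressor, ceiling=12_000):
    """Return True if the derived fields agree with the current base fields."""
    expected_tail = int(
        compressor.threshold_tokens * compressor.summary_target_ratio
    )
    expected_summary = min(int(compressor.context_length * 0.05), ceiling)
    return (
        compressor.tail_token_budget == expected_tail
        and compressor.max_summary_tokens == expected_summary
    )


# Stand-in for a compressor whose fields were all derived for an 8K model.
consistent = SimpleNamespace(
    context_length=8_000,
    threshold_tokens=4_000,
    summary_target_ratio=0.2,
    tail_token_budget=800,
    max_summary_tokens=400,
)

# Stand-in for the buggy state: base fields refreshed, derived fields stale.
stale = SimpleNamespace(
    context_length=8_000,
    threshold_tokens=4_000,
    summary_target_ratio=0.2,
    tail_token_budget=100_000,
    max_summary_tokens=12_000,
)

print(budgets_consistent(consistent), budgets_consistent(stale))  # True False
```

In a real test suite the same check could be run after every `update_model()` call, in both the large-to-small and small-to-large directions.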

Example

def update_model(self, model, context_length, **kwargs):
    # Signature assumed from the call in "How to Test"; other parameters are elided in the source.
    self.context_length = context_length
    self.threshold_tokens = max(
        int(context_length * self.threshold_percent),
        MINIMUM_CONTEXT_LENGTH,
    )
    # Re-derive the two dependent values, mirroring __init__.
    target_tokens = int(self.threshold_tokens * self.summary_target_ratio)
    self.tail_token_budget = target_tokens
    self.max_summary_tokens = min(int(self.context_length * 0.05), _SUMMARY_TOKENS_CEILING)

Notes

The provided code snippet assumes that the necessary variables and methods are already defined and accessible within the update_model() method.

Recommendation

Apply the fix: have update_model() re-derive tail_token_budget and max_summary_tokens so that compression behaves correctly when switching between models.
