hermes - 💡(How to fix) Fix [Bug]: Large tool results consume entire tail token budget — conversation messages lost to summary on compression

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Root Cause

  • #12131 — Context lost when summary generation fails (different root cause, similar symptom)
  • #11588 — Preserve-on-failure principle (broader compression reliability)
  • #10896 — Last user message lost to compression (partial fix via _ensure_last_user_message_in_tail, but doesn't cover the multi-message case described here)

Code Example

# Derived budgets (128K context model example):
tail_token_budget = threshold_tokens × summary_target_ratio
                 = 64000 × 0.20
                 = ~12,800 tokens

# protect_last_n = 20 (hard minimum floor)

---

messages[40]: user: "Now run the test suite and fix any failures"     ← pushed to summary
messages[41]: assistant: "Running tests..."                            ← pushed to summary  
messages[42]: tool: [terminal] npm test → 5000 lines of output        ← in tail (8K tokens)
messages[43]: assistant: "3 tests failed, fixing..."                   ← pushed to summary
messages[44]: tool: [terminal] npm test → 5000 lines of output        ← in tail (6K tokens)
messages[45]: user: "Also check the lint warnings"in tail (barely)

---

# In _find_tail_cut_by_tokens(), after the backward token walk:
# Count conversation messages (user + assistant without tool_calls) in the tail
conv_msgs_in_tail = sum(
    1 for m in messages[cut_idx:]
    if m["role"] in ("user", "assistant") and not m.get("tool_calls")
)
# If fewer than CONVERSATION_FLOOR, expand the tail backward
CONVERSATION_FLOOR = 6  # guarantee at least 6 conversational turns
while conv_msgs_in_tail < CONVERSATION_FLOOR and cut_idx > head_end + 1:
    cut_idx -= 1
    m = messages[cut_idx]
    if m["role"] in ("user", "assistant") and not m.get("tool_calls"):
        conv_msgs_in_tail += 1

---

# Before _find_tail_cut_by_tokens():
MAX_TOOL_RESULT_TAIL_TOKENS = 2000  # per tool result

# Create a temporary view where tool results are truncated
# Use this truncated view for tail boundary calculation
# Then apply the boundary to the ORIGINAL messages (keeping full tool results in the tail)
RAW_BUFFERClick to expand / collapse

Bug Description

When context compression triggers during a session that contains large tool results (terminal output, git logs, file diffs, search results), the tail protection mechanism correctly preserves recent messages by token budget — but the budget is almost entirely consumed by the large tool outputs. The user's actual conversation messages (questions, instructions, task context) get pushed out of the tail and into the compressed summary region.

The result feels like "the conversation disappeared": after compression, the agent sees tool call history but loses the conversational context of what was being discussed. The user's most recent messages are buried in the LLM-generated summary rather than being in the active context window.

Technical Details

How the tail budget works

In agent/context_compressor.py, _find_tail_cut_by_tokens() walks backward from the end of the message list, accumulating tokens until tail_token_budget is reached:

# Derived budgets (128K context model example):
tail_token_budget = threshold_tokens × summary_target_ratio
                 = 64000 × 0.20
                 = ~12,800 tokens

# protect_last_n = 20 (hard minimum floor)

The problem

  1. A single large terminal tool result (e.g. npm test output, build log, git diff) can easily be 3,000-8,000+ tokens
  2. The backward walk accumulates these large tool results first (they're at the end)
  3. After 2-3 large tool results, the entire ~12.8K token budget is exhausted
  4. The boundary (cut_idx) is placed such that user/assistant conversation messages just before those tool results fall into the "middle" region — which gets summarized
  5. _ensure_last_user_message_in_tail() only protects the single most recent user message; earlier but still-recent user messages and assistant responses are lost to the summary

Concrete scenario

messages[40]: user: "Now run the test suite and fix any failures"     ← pushed to summary
messages[41]: assistant: "Running tests..."                            ← pushed to summary  
messages[42]: tool: [terminal] npm test → 5000 lines of output        ← in tail (8K tokens)
messages[43]: assistant: "3 tests failed, fixing..."                   ← pushed to summary
messages[44]: tool: [terminal] npm test → 5000 lines of output        ← in tail (6K tokens)
messages[45]: user: "Also check the lint warnings"                     ← in tail (barely)

After compression, the agent sees ~14K tokens of test output but has lost the conversational thread about why tests were being run and what was being fixed.

Code References

  • agent/context_compressor.py_find_tail_cut_by_tokens() (~line 420): backward walk with token budget
  • agent/context_compressor.py_prune_old_tool_results(): pre-pass pruning only affects messages outside the tail boundary
  • agent/context_compressor.py_ensure_last_user_message_in_tail(): only anchors the last user message
  • tail_token_budget derived in __init__(): int(threshold_tokens * summary_target_ratio)

Proposed Solution: Pre-truncate Tool Results in Tail Before Budget Calculation

Option A — Conversation message floor (complementary): Add a guarantee that the last N user/assistant text messages (excluding tool results) are always preserved in the tail, regardless of tool result sizes. This acts as a safety net:

# In _find_tail_cut_by_tokens(), after the backward token walk:
# Count conversation messages (user + assistant without tool_calls) in the tail
conv_msgs_in_tail = sum(
    1 for m in messages[cut_idx:]
    if m["role"] in ("user", "assistant") and not m.get("tool_calls")
)
# If fewer than CONVERSATION_FLOOR, expand the tail backward
CONVERSATION_FLOOR = 6  # guarantee at least 6 conversational turns
while conv_msgs_in_tail < CONVERSATION_FLOOR and cut_idx > head_end + 1:
    cut_idx -= 1
    m = messages[cut_idx]
    if m["role"] in ("user", "assistant") and not m.get("tool_calls"):
        conv_msgs_in_tail += 1

Option B — Truncate tool results in tail before budget calculation (primary fix): Before calculating the tail boundary, cap tool results in the tail region to a reasonable size. This ensures the budget is spent on a mix of conversation + tool context:

# Before _find_tail_cut_by_tokens():
MAX_TOOL_RESULT_TAIL_TOKENS = 2000  # per tool result

# Create a temporary view where tool results are truncated
# Use this truncated view for tail boundary calculation
# Then apply the boundary to the ORIGINAL messages (keeping full tool results in the tail)

This way:

  • Full tool results are still sent to the model (they're in the tail)
  • But the boundary calculation isn't skewed by oversized outputs
  • Conversation messages are more likely to be included in the tail

Recommended: Implement both — Option B as the primary fix, Option A as a safety net.

Impact

  • Severity: High — causes task amnesia in long sessions with heavy tool use
  • Frequency: Common during SWE/coding workflows (build-fix-test loops, git operations, large file reads)
  • User impact: Agent appears to "forget" what it was doing and needs task re-explanation after every compression cycle

Environment

  • Any model with context compression enabled
  • Most noticeable on 128K context models where tail_token_budget ≈ 12.8K tokens
  • Exacerbated by tools that produce large outputs (terminal, search_files, read_file on large files)

Related Issues

  • #12131 — Context lost when summary generation fails (different root cause, similar symptom)
  • #11588 — Preserve-on-failure principle (broader compression reliability)
  • #10896 — Last user message lost to compression (partial fix via _ensure_last_user_message_in_tail, but doesn't cover the multi-message case described here)

extent analysis

TL;DR

Implementing a combination of truncating tool results in the tail before budget calculation and ensuring a conversation message floor can help preserve conversational context during context compression.

Guidance

  1. Truncate tool results: Before calculating the tail boundary, cap tool results in the tail region to a reasonable size (e.g., 2000 tokens per tool result) to prevent them from consuming the entire token budget.
  2. Conversation message floor: Implement a guarantee that the last N user/assistant text messages (excluding tool results) are always preserved in the tail, regardless of tool result sizes, to act as a safety net.
  3. Apply boundary to original messages: After calculating the tail boundary using the truncated tool results, apply this boundary to the original messages to keep full tool results in the tail while preserving conversational context.
  4. Test and refine: Test these changes with various tool output sizes and conversation scenarios to refine the token limits and conversation floor values for optimal performance.

Example

# Truncate tool results before budget calculation
MAX_TOOL_RESULT_TAIL_TOKENS = 2000
truncated_messages = []
for message in messages:
    if message.get("tool_calls"):
        # Truncate tool result to MAX_TOOL_RESULT_TAIL_TOKENS
        truncated_message = {
            **message,
            "text": message["text"][:MAX_TOOL_RESULT_TAIL_TOKENS]
        }
        truncated_messages.append(truncated_message)
    else:
        truncated_messages.append(message)

# Calculate tail boundary using truncated messages
# ...

# Apply boundary to original messages

Notes

  • The proposed solution requires careful tuning of the MAX_TOOL_RESULT_TAIL_TOKENS and CONVERSATION_FLOOR values to balance between preserving conversational context and maintaining useful tool output.
  • This solution may not completely eliminate the issue but should significantly improve the preservation of conversational context during context compression.

Recommendation

Apply the proposed solution by implementing both the truncation of tool results and the conversation message floor as a primary fix and safety net, respectively, to address the high-severity issue of task amnesia in long sessions with heavy tool use.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - 💡(How to fix) Fix [Bug]: Large tool results consume entire tail token budget — conversation messages lost to summary on compression