openclaw - ✅(Solved) Fix [Bug]: Tool calls silently hang after compaction in extended sessions (compaction.mode=safeguard) [2 pull requests, 2 comments, 2 participants]

GitHub issue: openclaw/openclaw#51031 (fetched 2026-04-08 01:05:19)
PR fix notes

PR #51262: clear pending tool results after compaction to prevent stale id corruption

Description (problem / solution / changelog)

Problem

After compaction in extended sessions with compaction.mode=safeguard, tool calls silently hang. New sessions work fine.

Root cause

guardSessionManager in attempt.ts creates a pendingState that tracks in-flight tool call IDs. When compaction fires and rewrites the JSONL file, those tool call IDs are removed from the transcript. But clearPendingToolResults() is never called on the attempt's session manager after compaction completes.

After compaction, when the next assistant message arrives with new tool calls, shouldFlushBeforeNewToolCalls() sees stale pending IDs (pending.size > 0) and calls flushPendingToolResults(). This inserts synthetic tool results with IDs that no longer exist in the transcript, corrupting subsequent tool calls.

The compaction handler (handleAutoCompactionEnd) has its own separate guardSessionManager instance. Its clearPendingToolResults only clears its own pending state, not the run attempt's.

Fix

Call sessionManager.clearPendingToolResults?.() in attempt.ts after compaction completes (getCompactionCount() > 0). This clears stale pending IDs before they can corrupt the transcript.

One new line of runtime code, plus a regression test confirming that clearing pending state prevents synthetic result insertion for stale IDs.
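
The effect of the missing clear can be sketched with a toy model. This is a minimal illustration, not OpenClaw's implementation: `PendingToolCallState`, `runAttempt`, and the `call_old`/`call_new` IDs are hypothetical, and only the method names echo those quoted in this PR.

```typescript
// Toy model of the pending-tool-call bookkeeping; names are illustrative only.
type ToolCall = { id: string; name: string };
type SyntheticResult = { toolCallId: string; synthetic: true };

class PendingToolCallState {
  private pending = new Map<string, string>();

  trackToolCalls(calls: ToolCall[]): void {
    for (const call of calls) this.pending.set(call.id, call.name);
  }

  // Emits a synthetic result for every call that never received a real one.
  flushPendingToolResults(): SyntheticResult[] {
    const flushed: SyntheticResult[] = Array.from(this.pending.keys()).map(
      (id) => ({ toolCallId: id, synthetic: true as const }),
    );
    this.pending.clear();
    return flushed;
  }

  clearPendingToolResults(): void {
    this.pending.clear();
  }

  get size(): number {
    return this.pending.size;
  }
}

// Simulates one attempt in which compaction runs while a tool call is pending.
function runAttempt(clearAfterCompaction: boolean): string[] {
  const attemptState = new PendingToolCallState();
  attemptState.trackToolCalls([{ id: "call_old", name: "read_file" }]);

  // Assumption: compaction ran once and dropped "call_old" from the transcript.
  const compactionCount = 1;
  if (clearAfterCompaction && compactionCount > 0) {
    attemptState.clearPendingToolResults(); // the step this PR adds
  }

  // The next assistant message carries new tool calls; stale entries flush first.
  const synthetic =
    attemptState.size > 0 ? attemptState.flushPendingToolResults() : [];
  attemptState.trackToolCalls([{ id: "call_new", name: "list_dir" }]);
  return synthetic.map((result) => result.toolCallId);
}

const withoutFix = runAttempt(false); // ["call_old"]: a result for a removed ID
const withFix = runAttempt(true); // []: nothing stale left to flush
```

Without the clear, the flush produces a result for an ID the compacted transcript no longer contains, which is the corruption described above.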

Changes

  • src/agents/pi-embedded-runner/run/attempt.ts: clear pending tool results when compaction occurred
  • src/agents/session-tool-result-guard.test.ts: regression test for post-compaction stale ID scenario

Closes #51031

Changed files

  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +9/-0)
  • src/agents/session-tool-result-guard.test.ts (modified, +42/-0)

PR #2749: fix: cap in-memory fileEntries array to prevent unbounded heap growth

Description (problem / solution / changelog)

Problem

SessionManager.fileEntries is a private FileEntry[] that mirrors every JSONL session entry in memory. It grows via push() in _appendEntry() and is never pruned. In long-running gateway sessions (9+ hours of normal use), this array silently accumulates thousands of entries containing full message bodies, tool results, and metadata, causing heap growth of 1GB+ with no upper bound.

The on-disk JSONL transcript is append-only and that's correct. But the in-memory mirror has no reason to retain the full history. Most read paths either use the byId Map for tree traversal or only need recent entries. Compaction summarizes older context for the LLM but never touches fileEntries. Two layers of session state, only one with a ceiling.

Heap snapshot analysis of a production session showed 99.5% of 250K+ retained message objects tracing back through fileEntries -> SessionManager -> AgentSession.

Full investigation with V8 heap snapshots, retainer analysis, and related issue survey: openclaw/openclaw#58802

Fix

Add a configurable sliding window cap (maxFileEntries, default 1000) on the in-memory fileEntries array. After each _appendEntry() and after setSessionFile() bulk loads, if the array exceeds the limit, the oldest entries (after the header at index 0) are spliced out and evicted from byId, labelsById, and labelTimestampsById.

The on-disk JSONL file is append-only and unaffected. Full session history is preserved on disk. Only the in-memory representation is bounded.
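
A minimal sketch of the sliding-window idea follows. It is not the PR's code: `BoundedEntries` and its methods are hypothetical, the label maps are omitted, and it assumes the cap counts non-header entries, which the description does not state.

```typescript
// Hypothetical, simplified stand-in for the in-memory transcript mirror.
type FileEntry = { id: string; type: "header" | "message" };

class BoundedEntries {
  private fileEntries: FileEntry[] = [];
  private byId = new Map<string, FileEntry>();
  private maxFileEntries: number;

  constructor(header: FileEntry, maxFileEntries: number = 1000) {
    this.fileEntries.push(header); // the header always stays at index 0
    this.maxFileEntries = maxFileEntries;
  }

  append(entry: FileEntry): void {
    this.fileEntries.push(entry);
    this.byId.set(entry.id, entry);
    this.pruneIfNeeded();
  }

  // Drops the oldest non-header entries and evicts them from the lookup map.
  private pruneIfNeeded(): void {
    const overflow = this.fileEntries.length - 1 - this.maxFileEntries;
    if (overflow <= 0) return;
    const evicted = this.fileEntries.splice(1, overflow);
    for (const entry of evicted) this.byId.delete(entry.id);
  }

  getEntry(id: string): FileEntry | undefined {
    return this.byId.get(id);
  }

  ids(): string[] {
    return this.fileEntries.map((entry) => entry.id);
  }
}

const entries = new BoundedEntries({ id: "header", type: "header" }, 3);
for (let i = 1; i <= 5; i++) {
  entries.append({ id: `m${i}`, type: "message" });
}
const retainedIds = entries.ids(); // ["header", "m3", "m4", "m5"]
const prunedLookup = entries.getEntry("m1"); // undefined: evicted from memory
```

Pruning on every append keeps the array at most one entry over the cap at any moment, so each prune removes a single element in steady state.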

What changes

  • New _pruneIfNeeded() private method, called after _appendEntry() and after setSessionFile() loads entries from disk
  • New maxFileEntries private field, set via constructor (default: DEFAULT_MAX_FILE_ENTRIES = 1000)
  • Static factory methods (create, open, continueRecent, inMemory, forkFrom) accept optional maxFileEntries parameter
  • DEFAULT_MAX_FILE_ENTRIES exported for downstream configuration

What does NOT change

  • On-disk JSONL format and append-only persistence behavior
  • newSession() and createBranchedSession() (they replace fileEntries entirely)
  • Public API shape (getEntries(), getHeader(), getBranch(), buildSessionContext(), etc.)
  • All existing test behavior (876 tests pass, 0 failures)

Why 1000 as the default

  • A typical interactive turn produces 2-4 entries (user message, assistant message, optional tool calls). 1000 entries covers ~250-500 turns, well beyond what any single LLM context window can hold.
  • Compaction typically fires at 50K-200K tokens, which maps to ~100-400 entries. 1000 provides generous headroom above the compaction window.
  • OpenClaw's session.maintenance.maxEntries defaults to 500 for the session index file, so 1000 for the in-memory transcript mirror is consistent.
  • At ~1-10KB per entry, 1000 entries = 1-10MB of retained heap. Compared to unbounded growth toward 1GB+, that's a hard ceiling that stays invisible.
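
The arithmetic behind these bullets can be checked directly. The snippet uses only the figures quoted above and assumes decimal units (1 KB = 1000 bytes, 1 MB = 1,000,000 bytes); the variable names are illustrative.

```typescript
// Figures taken from the bullets above.
const maxFileEntries = 1000;
const entriesPerTurn = { min: 2, max: 4 };
const bytesPerEntry = { min: 1_000, max: 10_000 }; // ~1-10 KB per entry

// More entries per turn means fewer turns fit inside the window.
const turnsCovered = {
  min: maxFileEntries / entriesPerTurn.max, // 250
  max: maxFileEntries / entriesPerTurn.min, // 500
};

// Upper bound on retained heap for the capped array, in MB.
const retainedMB = {
  min: (maxFileEntries * bytesPerEntry.min) / 1_000_000, // 1
  max: (maxFileEntries * bytesPerEntry.max) / 1_000_000, // 10
};
```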

Trade-offs

  • Branch navigation to very old entries: If an entry has been pruned from memory, getEntry(id) returns undefined and getBranch(id) produces a truncated path. This only affects TUI users navigating to entries older than the window. The JSONL file on disk retains full history. Callers that need old entries can reload via loadEntriesFromFile().
  • getTree() shows fewer branches: Only entries within the window appear in the tree view. Disk has full history.
  • External access: OpenClaw's session-manager-init.ts accesses sm.fileEntries directly (bypassing TypeScript private). The array is still a plain FileEntry[] with the same shape, just bounded. OpenClaw's existing pattern of resetting sm.fileEntries = [header] continues to work.
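
The first trade-off implies a lookup pattern for callers that need old entries: check memory first, then fall back to the full transcript. The sketch below is an assumption-laden illustration; `parseJsonl` and `getEntryWithFallback` are hypothetical helpers, the transcript is passed as a string rather than read from disk, and the actual signature of `loadEntriesFromFile()` is not shown in this PR.

```typescript
type Entry = { id: string; parentId: string | null; text: string };

// Parses JSONL text (one JSON object per line), skipping blank lines.
function parseJsonl(jsonl: string): Entry[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as Entry);
}

// Looks in the bounded in-memory map first, then in the full transcript.
function getEntryWithFallback(
  id: string,
  inMemory: Map<string, Entry>,
  readTranscript: () => string,
): Entry | undefined {
  const cached = inMemory.get(id);
  if (cached !== undefined) return cached;
  return parseJsonl(readTranscript()).find((entry) => entry.id === id);
}

// Stand-in for the on-disk JSONL file, which keeps the full history.
const transcript = [
  JSON.stringify({ id: "e1", parentId: null, text: "first" }),
  JSON.stringify({ id: "e2", parentId: "e1", text: "second" }),
  JSON.stringify({ id: "e3", parentId: "e2", text: "third" }),
].join("\n");

// Only the newest entry survived pruning in this example.
const inMemory = new Map<string, Entry>([
  ["e3", { id: "e3", parentId: "e2", text: "third" }],
]);

const recovered = getEntryWithFallback("e1", inMemory, () => transcript);
const missing = getEntryWithFallback("nope", inMemory, () => transcript);
```

Re-parsing the whole transcript on every miss is expensive for large sessions, so a real caller would likely cache or index the reload.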

Related issues

  • openclaw/openclaw#13758: Gateway accumulates memory over long sessions (1.9GB RSS after 13h). Comment by echoVic identifies SessionManager caching as likely primary cause.
  • openclaw/openclaw#6190: Session log growing and bot hanging up (master issue for session bloat).
  • openclaw/openclaw#4948: Multiple in-memory caches grow unbounded (same class of bug, fixed).
  • openclaw/openclaw#17820: Cron runs never clean up agent-events Maps (same pattern, ~68MB/hr growth).
  • openclaw/openclaw#51031: Tool calls hang after compaction; shows sessionManager.appendMessage and pendingState map divergence.
  • openclaw/openclaw#24800: Auto-compaction not triggered during tool-use loops.
  • openclaw/openclaw#33553: Feature request for configurable sliding window to cap conversation history.

Testing

  • 15 new tests in test/session-manager-pruning.test.ts covering:
    • Cap enforcement and header preservation
    • Most-recent-entries retention
    • Pruned entry eviction from byId
    • Leaf accessibility after pruning
    • newSession/createBranchedSession after pruning
    • Compaction within capped sessions
    • Persisted sessions (in-memory pruned, disk retains full JSONL)
    • getBranch, getTree, buildSessionContext after pruning
    • Minimum cap (1 entry)
    • Default cap applied when unspecified
  • All 876 existing tests pass with zero changes

Changed files

  • packages/coding-agent/src/core/session-manager.ts (modified, +67/-12)
  • packages/coding-agent/test/session-manager-pruning.test.ts (added, +254/-0)

Bug Summary

After extended sessions with heavy tool usage under compaction.mode: safeguard, tool calls stop executing. The model generates the tool call output normally (confirmed in LM Studio logs), but OpenClaw never executes it and the session hangs indefinitely. Starting a new session immediately restores normal behavior.

Root Cause Analysis

Location of bug: src/agents/session-tool-result-guard.ts (guard closure) + src/agents/compaction.ts (compaction logic)

The Bug Mechanism

  1. Pending Tool Call Tracking:

    const pendingState = createPendingToolCallState();
    // ...
    if (toolCalls.length > 0) {
      pendingState.trackToolCalls(toolCalls);  // line 259: tracks {id, name}
    }
  2. Compaction Rewrites JSONL Directly: Compaction writes compacted messages directly to the session JSONL file, bypassing sessionManager.appendMessage. The pendingState map in the guard closure is never cleared during or after compaction.

  3. Stale Pending IDs After Compaction: After compaction rewrites the JSONL, the tool call IDs in pendingState are orphaned — they reference pre-compaction message indices that no longer exist. The compacted transcript has a completely different message structure.

  4. Confusion on Next Tool Call: When a new tool result arrives post-compaction:

    // session-tool-result-guard.ts:195
    if (id) {
      pendingState.delete(id);  // Never finds the stale ID
    }
    // ...
    if (pendingState.shouldFlushBeforeNonToolResult(nextRole, toolCalls.length)) {
      flushPendingToolResults();  // line 233: triggers with wrong IDs
    }

    The guard creates synthetic tool results with wrong IDs, corrupting the transcript state.

  5. Tool Execution Hangs: With corrupted pending state and mismatched tool_use/tool_result IDs, subsequent tool calls are either silently dropped by the guard or misrouted, causing the observed hang.
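
One way to confirm the mismatch described in steps 4 and 5 is to scan a transcript for tool results whose ID has no preceding tool call. The sketch below is a diagnostic illustration only; the `TranscriptMessage` shape and `findOrphanedToolResults` are hypothetical and do not reflect OpenClaw's actual transcript schema.

```typescript
// Hypothetical, simplified message shapes for illustration.
type TranscriptMessage =
  | { role: "user"; text: string }
  | { role: "assistant"; toolCalls: { id: string; name: string }[] }
  | { role: "toolResult"; toolCallId: string };

// Returns the IDs of tool results that do not follow a matching tool call.
function findOrphanedToolResults(transcript: TranscriptMessage[]): string[] {
  const seenCallIds = new Set<string>();
  const orphans: string[] = [];
  for (const message of transcript) {
    if (message.role === "assistant") {
      for (const call of message.toolCalls) seenCallIds.add(call.id);
    } else if (
      message.role === "toolResult" &&
      !seenCallIds.has(message.toolCallId)
    ) {
      orphans.push(message.toolCallId);
    }
  }
  return orphans;
}

// A compacted transcript into which a synthetic result for a stale ID was flushed.
const compacted: TranscriptMessage[] = [
  { role: "user", text: "summary of earlier turns" },
  { role: "toolResult", toolCallId: "call_old" },
  { role: "assistant", toolCalls: [{ id: "call_new", name: "list_dir" }] },
  { role: "toolResult", toolCallId: "call_new" },
];

const orphans = findOrphanedToolResults(compacted); // ["call_old"]
```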

Why New Sessions Work

Starting a new session creates a fresh pendingState map with no stale entries, so tool execution works normally.

Environment

  • OpenClaw 2026.3.13 (confirmed affected)
  • Model: Qwen3.5-35B via LM Studio (OpenAI-compatible endpoint)
  • OS: Ubuntu 22.04
  • compaction.mode: safeguard (default for this config)
  • Session length: Extended (heavy tool usage over many turns)

Proposed Fix

The installSessionToolResultGuard() function returns a guard object that exposes clearPendingToolResults(). The compaction completion callback should invoke this method to reset the pending state:

// In pi-embedded-runner.ts — after compaction completes:
const guard = installSessionToolResultGuard(sessionManager, opts);
// ...
// After compaction.write() succeeds:
guard.clearPendingToolResults();

Alternatively, add a compaction lifecycle hook in installSessionToolResultGuard options that calls clearPendingToolResults when the transcript is about to be rewritten.

Impact

  • Severity: High — renders sessions unusable after extended use
  • Affected Configurations: Any with compaction.mode: safeguard + extended sessions
  • Workaround: Restart session (as noted in bug report)

Tags

  • bug
  • compaction
  • tool-execution

extent analysis

Fix Plan

To resolve the issue, follow these steps:

  • Update the installSessionToolResultGuard() function to clear the pending tool results after compaction completes.
  • Add a compaction lifecycle hook in installSessionToolResultGuard options to call clearPendingToolResults when the transcript is about to be rewritten.

Example code:

// In pi-embedded-runner.ts — after compaction completes:
const guard = installSessionToolResultGuard(sessionManager, opts);
// ...
// After compaction.write() succeeds:
guard.clearPendingToolResults();

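Alternatively, have installSessionToolResultGuard expose a compaction lifecycle hook through its options object. In this sketch the guard assigns the hook, and the compaction code invokes it once the transcript has been rewritten (the option name and shape are hypothetical):

interface InstallSessionToolResultGuardOptions {
  // Assigned by the guard; call it after compaction rewrites the transcript.
  onCompactionComplete?: () => void;
}

function installSessionToolResultGuard(sessionManager: SessionManager, opts: InstallSessionToolResultGuardOptions) {
  // ...
  opts.onCompactionComplete = () => {
    clearPendingToolResults();
  };
  // ...
}

Then keep a reference to the options object and invoke the hook when compaction finishes:

const opts: InstallSessionToolResultGuardOptions = {};
const guard = installSessionToolResultGuard(sessionManager, opts);
// ...
// After compaction.write() succeeds:
opts.onCompactionComplete?.();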

Verification

To verify the fix, test the application with extended sessions and heavy tool usage under compaction.mode: safeguard. The tool calls should execute normally, and the session should not hang indefinitely.

Extra Tips

  • Ensure that the clearPendingToolResults function is properly clearing the pending state to prevent stale IDs from causing issues.
  • Consider adding logging or monitoring to detect and handle any potential issues with the compaction lifecycle hook or the clearPendingToolResults function.
