openclaw - ✅(Solved) Fix Session hangs indefinitely when compaction times out, causing repeated duplicate message sends [2 pull requests, 5 comments, 4 participants]

GitHub stats
openclaw/openclaw#43661 · Fetched 2026-04-08 00:16:12
Comments: 5 · Participants: 4 · Timeline events: 9 · Reactions: 2
Timeline (top): commented ×5, cross-referenced ×2, assigned ×1, subscribed ×1

When a session context becomes large enough to trigger compaction, and that compaction process times out, the agent enters a silent failure loop. Each timeout (~10 min) triggers a delivery retry that resends the same message to the user repeatedly — with no recovery, no fallback, and no way for the session to self-resolve.

Error Message

  • embedded run timeout: runId=... timeoutMs=600000
  • using current snapshot: timed out during compaction
  • Call to 'sendMessage' failed! (400: Bad Request: message is too long)

Root Cause

When a session context becomes large enough to trigger compaction, and that compaction process times out, the agent enters a silent failure loop. Each timeout (~10 min) triggers a delivery retry that resends the same message to the user repeatedly — with no recovery, no fallback, and no way for the session to self-resolve.

Fix Action

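Marked as solved. The two linked PRs below add diagnostics for ACP spawn failures and embedded run timeouts; both state that they do not change runtime behavior or routing.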

PR fix notes

PR #44124: Improve ACP spawn and timeout diagnostics

Description (problem / solution / changelog)

Summary

  • add a reusable ACP diagnostic formatter that preserves nested stderr/stdout/cause details
  • log richer context when sessions_spawn(runtime="acp") fails during ACP session initialization
  • include last tool error and recent message targets in embedded run timeout logs
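
The last bullet can be pictured with a small sketch. Everything in it is hypothetical: the function name `formatTimeoutLog`, the fields `lastToolError` and `recentTargets`, and the output layout are illustrative assumptions, not the code this PR adds to attempt.ts.

```javascript
// Hypothetical sketch: enrich a bare timeout log line with extra context.
// Field names and the output layout are assumptions for illustration only.
function formatTimeoutLog({ runId, timeoutMs, lastToolError, recentTargets = [] }) {
  const parts = [`embedded run timeout: runId=${runId} timeoutMs=${timeoutMs}`];
  if (lastToolError !== undefined && lastToolError !== null) {
    // Quote the error text so spaces inside it do not break log parsing.
    parts.push(`lastToolError=${JSON.stringify(String(lastToolError))}`);
  }
  if (recentTargets.length > 0) {
    // Keep only the three most recent targets so the line stays readable.
    parts.push(`recentTargets=${recentTargets.slice(-3).join(",")}`);
  }
  return parts.join(" ");
}

console.log(
  formatTimeoutLog({
    runId: "run-1",
    timeoutMs: 600000,
    lastToolError: "fetch failed",
    recentTargets: ["telegram:123"],
  })
);
```

With context like this on the timeout line, a reader of the gateway log could tell which tool call failed last and where the run had been sending messages, without reproducing the hang.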

Why

Several current ACP/session issues are hard to debug from gateway logs because failures collapse down to generic strings like ACP_SESSION_INIT_FAILED, acpx exited with code 1, or a bare embedded timeout line.

This does not change runtime behavior or routing. It makes the failure path observable enough to diagnose workspace/cwd mismatches, backend stderr, and timeout context without reproducing everything interactively.

Related

  • #43667
  • #43661

Testing

  • corepack pnpm vitest run src/acp/runtime/errors.test.ts src/agents/acp-spawn.test.ts

AI assistance

AI-assisted. I reviewed and tested the patch locally.

Changed files

  • src/acp/runtime/errors.test.ts (modified, +40/-1)
  • src/acp/runtime/errors.ts (modified, +96/-0)
  • src/agents/acp-spawn.ts (modified, +5/-0)
  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +24/-1)

PR #44151: Improve ACP spawn and timeout diagnostics

Description (problem / solution / changelog)

Summary

  • add reusable ACP diagnostic helpers that preserve nested stderr/stdout/cause details in log output
  • log richer context when sessions_spawn(runtime="acp") fails during ACP session initialization
  • include last tool error and recent message targets in embedded run timeout logs
  • carry forward the latest generated Swift protocol models required by current main
  • keep config help-label coverage in sync for the current gateway.push* fields
  • preserve falsy primitive ACP causes like 0 and false in diagnostic output
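
As a rough illustration of the first and last bullets, the sketch below walks an error's `cause` chain and keeps `stderr`/`stdout` when they are present. It is not the helper from errors.ts: the name `formatDiagnostic`, the indentation scheme, and the depth limit are assumptions made for this example.

```javascript
// Hypothetical sketch of a diagnostic formatter; not the code from errors.ts.
// It walks the `cause` chain and keeps stderr/stdout text when present.
function formatDiagnostic(err, maxDepth = 5) {
  const lines = [];
  let current = err;
  for (let depth = 0; depth <= maxDepth; depth++) {
    const indent = "  ".repeat(depth);
    if (!(current instanceof Error)) {
      // Non-Error cause: string, number, boolean, or plain object.
      const text = typeof current === "object" ? JSON.stringify(current) : String(current);
      lines.push(`${indent}cause: ${text}`);
      break;
    }
    lines.push(`${indent}${current.name}: ${current.message}`);
    for (const key of ["stderr", "stdout"]) {
      const value = current[key];
      if (typeof value === "string" && value.trim() !== "") {
        lines.push(`${indent}  ${key}: ${value.trim()}`);
      }
    }
    // Compare against undefined explicitly: a truthiness check here
    // would silently drop falsy primitive causes such as 0 or false.
    if (current.cause === undefined) break;
    current = current.cause;
  }
  return lines.join("\n");
}

const inner = new Error("acpx exited with code 1");
inner.stderr = "example stderr text"; // placeholder, not real backend output
const outer = new Error("ACP_SESSION_INIT_FAILED");
outer.cause = inner;
console.log(formatDiagnostic(outer));
```

The explicit `=== undefined` comparison is the detail the last bullet points at: `if (err.cause)` would treat a cause of `0` or `false` as absent.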

Why

Several current ACP/session issues are hard to debug from gateway logs because failures collapse down to generic strings like ACP_SESSION_INIT_FAILED, acpx exited with code 1, or a bare embedded timeout line.

This patch does not change ACP runtime behavior or routing. It makes the failure path observable enough to diagnose workspace/cwd mismatches, backend stderr, and timeout context without reproducing everything interactively.

The Swift protocol update and config help-label adjustment are carry-forward fixes needed to keep the PR green against the latest main.

Related

  • #43667
  • #43661

Testing

  • pnpm vitest run src/acp/runtime/errors.test.ts src/agents/acp-spawn.test.ts src/config/schema.help.quality.test.ts
  • pnpm protocol:check
  • pnpm check (local run hit existing unrelated format issues in ui/src/styles/chat/grouped.css, ui/src/styles/chat/layout.css, and ui/src/styles/layout.css)

AI assistance

AI-assisted. I reviewed and tested the patch locally.

Changed files

  • apps/macos/Sources/OpenClawProtocol/GatewayModels.swift (modified, +5/-1)
  • apps/shared/OpenClawKit/Sources/OpenClawProtocol/GatewayModels.swift (modified, +5/-1)
  • src/acp/runtime/errors.test.ts (modified, +57/-1)
  • src/acp/runtime/errors.ts (modified, +98/-0)
  • src/agents/acp-spawn.ts (modified, +5/-0)
  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +24/-1)
  • src/config/schema.labels.ts (modified, +5/-5)
  • src/config/sessions/targets.test.ts (modified, +9/-10)
  • ui/src/styles/chat/grouped.css (modified, +10/-7)
  • ui/src/styles/chat/layout.css (modified, +3/-1)
  • ui/src/styles/layout.css (modified, +4/-4)


Bug Report

Summary

When a session context becomes large enough to trigger compaction, and that compaction process times out, the agent enters a silent failure loop. Each timeout (~10 min) triggers a delivery retry that resends the same message to the user repeatedly — with no recovery, no fallback, and no way for the session to self-resolve.

Steps to Reproduce

  1. Send a large message to the agent (e.g. a long X/Twitter post or paste-heavy content)
  2. This bloats the session context enough to trigger compaction
  3. Compaction begins but hangs/times out (embedded run timeout: timed out during compaction)
  4. Every ~10 minutes, the timed-out run retries delivery — sending the same message again

Observed Behavior

  • Repeated embedded run timeout: timed out during compaction entries in logs, spaced ~10 minutes apart
  • Each timeout triggers a sendMessage to the user with the same content (4x in this case)
  • One final send attempt fails with 400: Bad Request: message is too long (Telegram limit hit)
  • Agent is completely unresponsive during this entire period
  • Only resolution was a manual pm2 reload

Relevant Log Excerpts

2026-03-12T03:35:13.596Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000
2026-03-12T03:35:13.599Z [agent/embedded] using current snapshot: timed out during compaction
2026-03-12T03:35:13.880Z [telegram] sendMessage ok chat=... message=579

2026-03-12T03:45:15.075Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000
2026-03-12T03:45:15.078Z [agent/embedded] using current snapshot: timed out during compaction
2026-03-12T03:45:16.385Z [telegram] sendMessage ok chat=... message=580

2026-03-12T03:55:17.499Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000
2026-03-12T03:55:17.503Z [agent/embedded] using current snapshot: timed out during compaction
2026-03-12T03:55:18.827Z [telegram] sendMessage ok chat=... message=582

2026-03-12T04:05:17.739Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000
2026-03-12T04:05:17.743Z [agent/embedded] using current snapshot: timed out during compaction

2026-03-12T04:08:51.252Z [telegram] message failed: Call to 'sendMessage' failed! (400: Bad Request: message is too long)
2026-03-12T04:08:51.256Z [delivery-recovery] Retry failed for delivery ...: Call to 'sendMessage' failed! (400: Bad Request: message is too long)
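
The spacing between the timeout entries can be checked with a few lines of JavaScript. This is only a sketch: the function name is made up for this example, and the input is a subset of the log lines quoted above.

```javascript
// Sketch: measure the gap between "embedded run timeout" entries.
// The lines are copied from the excerpt above; the helper name is hypothetical.
const excerpt = [
  "2026-03-12T03:35:13.596Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000",
  "2026-03-12T03:35:13.880Z [telegram] sendMessage ok chat=... message=579",
  "2026-03-12T03:45:15.075Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000",
  "2026-03-12T03:55:17.499Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000",
  "2026-03-12T04:05:17.739Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000",
];

function timeoutIntervalsSeconds(lines) {
  const stamps = lines
    .filter((line) => line.includes("embedded run timeout:"))
    .map((line) => Date.parse(line.split(" ", 1)[0])); // leading ISO timestamp
  // Difference between each timestamp and the one before it, in whole seconds.
  return stamps.slice(1).map((t, i) => Math.round((t - stamps[i]) / 1000));
}

console.log(timeoutIntervalsSeconds(excerpt)); // [ 601, 602, 600 ]
```

Each gap is the 600-second timeout (timeoutMs=600000) plus at most a couple of seconds, which is consistent with a new run starting almost immediately after the previous one timed out.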

Expected Behavior

  • Compaction timeout should trigger a clean failure path, not a retry loop
  • If compaction fails, the agent should surface an error to the user (once), not retry the same send repeatedly
  • The session should either recover gracefully or terminate cleanly, not hang for 30+ minutes requiring manual intervention
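
One way to picture the "send once" expectation is a delivery guard keyed by an idempotency key. This is a minimal sketch under stated assumptions: `createDeliveryGuard`, the key format, and the injected `send` callback are all hypothetical and do not describe how OpenClaw's delivery recovery works.

```javascript
// Hypothetical sketch: skip a resend when the same delivery key was already sent.
// `send` stands in for the real channel call; the key scheme is an assumption.
function createDeliveryGuard(send) {
  const delivered = new Set();
  return function deliverOnce(key, text) {
    if (delivered.has(key)) return false; // already sent, skip the resend
    send(text);
    delivered.add(key); // recorded only after send() returns without throwing
    return true;
  };
}

const sent = [];
const deliverOnce = createDeliveryGuard((text) => sent.push(text));
deliverOnce("run-1:reply", "hello");
deliverOnce("run-1:reply", "hello"); // retry after a timeout, ignored
console.log(sent.length); // 1
```

The in-memory `Set` is lost on restart, so a real implementation would likely need to persist delivered keys alongside the delivery queue.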

Environment

  • OpenClaw (self-hosted, pm2-managed)
  • Channel: Telegram
  • Trigger: Large inbound message (external content paste)

Impact

  • Agent unresponsive for 30+ minutes
  • User receives 4 identical messages
  • Manual restart required to recover

extent analysis

Fix Plan

To resolve the silent failure loop caused by compaction timeouts, we need to implement a clean failure path and prevent repeated delivery retries.

Step 1: Implement Exponential Backoff

Cap the number of delivery retries and space them out with exponential backoff, so a failing delivery is attempted a bounded number of times rather than indefinitely. This needs a retry counter, a maximum, and a backoff factor in the delivery retry logic.

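const maxRetries = 3;
const backoffFactor = 2;
const initialBackoffMs = 1000; // first retry waits 1 second

// Delay before the given retry (0-based): 1s, 2s, 4s, ...
function backoffDelay(retryCount) {
  return initialBackoffMs * backoffFactor ** retryCount;
}

// `attemptDelivery` is a placeholder for the real send; it returns a Promise.
async function retryDelivery(attemptDelivery) {
  for (let retryCount = 0; ; retryCount++) {
    try {
      return await attemptDelivery();
    } catch (err) {
      if (retryCount >= maxRetries) {
        // trigger clean failure path
        throw new Error('Max retries exceeded', { cause: err });
      }
      // wait out the backoff, then loop to attempt delivery again
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(retryCount)));
    }
  }
}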

Step 2: Surface Error to User

Surface a single error to the user when compaction fails, instead of resending the original message. Send a short error message once the retry limit is reached, and record that it was sent so it is not repeated.

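// `sendMessage` is a placeholder for the channel's send function.
// `session.errorNotified` is a hypothetical flag that limits the notice to one send.
function handleCompactionFailure(session, sendMessage) {
  if (!session.errorNotified) {
    // send custom error message to user
    const errorMessage = 'Error: Compaction failed. Please try again later.';
    sendMessage(errorMessage);
    session.errorNotified = true;
  }

  // trigger clean failure path
  throw new Error('Compaction failed');
}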

Step 3: Implement Clean Failure Path

Introduce a clean failure path so the agent cannot hang indefinitely. When compaction fails, cancel any pending retry and terminate or reset the session automatically, so that the operator does not have to restart the process by hand.

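// `session` and its fields are placeholders for illustration.
function triggerCleanFailurePath(session) {
  // stop any pending retry so the same message is not sent again
  clearTimeout(session.retryTimer);
  // mark the session as terminated so a fresh one can take over
  session.status = 'terminated';
  return session;
}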
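
The clean failure path can also be sketched as a small state machine, which makes it explicit that a failed session is terminal and notifies the user only once. The status names, event names, and the `notifyUser` flag are assumptions for this sketch, not OpenClaw internals.

```javascript
// Hypothetical sketch of a clean failure path as a state machine.
// Status and event names are assumptions made for illustration.
function nextSessionState(state, event) {
  switch (state.status) {
    case "running":
      if (event === "compaction_started") return { ...state, status: "compacting" };
      return state;
    case "compacting":
      if (event === "compaction_done") return { ...state, status: "running" };
      if (event === "compaction_timeout") {
        // Terminal state: ask the caller to notify the user exactly once.
        return { status: "failed", notifyUser: true };
      }
      return state;
    case "failed":
      // Later timeouts or retries are ignored, so no second notification.
      return { status: "failed", notifyUser: false };
    default:
      throw new Error(`unknown status: ${state.status}`);
  }
}

let session = { status: "running", notifyUser: false };
for (const event of ["compaction_started", "compaction_timeout", "compaction_timeout"]) {
  session = nextSessionState(session, event);
  console.log(event, "->", session.status, "notify:", session.notifyUser);
}
```

Because `failed` has no outgoing transition, a second timeout cannot put the session back into a state that retries delivery; recovery would have to go through an explicit new session.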

Verification

To verify that the fix worked, test the following scenarios:

  • Send a large message to the agent to trigger compaction
  • Verify that the agent surfaces an error to the user if compaction fails
  • Verify that the agent neither enters a silent failure loop nor retries the delivery repeatedly

Extra Tips

To prevent regressions, add logging and monitoring that detect compaction failures and repeated timeouts for the same session. A retry policy that separates retryable errors from permanent ones, such as the 400: Bad Request: message is too long response in the log excerpt, would also keep the delivery layer from resending a message that can never succeed.
