openclaw - ✅(Solved) Fix Session hangs indefinitely when compaction times out, causing repeated duplicate message sends [2 pull requests, 5 comments, 4 participants]

thedanchez · 2026-03-12T04:12:38Z


openclaw/openclaw#43661 • Fetched 2026-04-08 00:16:12

Timeline (top): commented ×5, cross-referenced ×2, assigned ×1, subscribed ×1

When a session context becomes large enough to trigger compaction, and that compaction process times out, the agent enters a silent failure loop. Each timeout (~10 min) triggers a delivery retry that resends the same message to the user repeatedly — with no recovery, no fallback, and no way for the session to self-resolve.

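Fix Action

Fixed

Fixed by PR: Improve ACP spawn and timeout diagnostics (https://github.com/openclaw/openclaw/pull/44124)
Fixed by PR: Improve ACP spawn and timeout diagnostics (https://github.com/openclaw/openclaw/pull/44151)

Both linked PRs are listed as closed without being merged, and both descriptions state that they add diagnostics without changing runtime behavior or routing. They make the failure easier to diagnose; they do not themselves remove the retry loop.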

PR fix notes

PR #44124: Improve ACP spawn and timeout diagnostics

Repository: openclaw/openclaw
Author: mySebbe
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/44124

Description (problem / solution / changelog)

Summary

add a reusable ACP diagnostic formatter that preserves nested stderr/stdout/cause details
log richer context when sessions_spawn(runtime="acp") fails during ACP session initialization
include last tool error and recent message targets in embedded run timeout logs

Why

Several current ACP/session issues are hard to debug from gateway logs because failures collapse down to generic strings like ACP_SESSION_INIT_FAILED, acpx exited with code 1, or a bare embedded timeout line.

This does not change runtime behavior or routing. It makes the failure path observable enough to diagnose workspace/cwd mismatches, backend stderr, and timeout context without reproducing everything interactively.

Related

#43667
#43661

Testing

corepack pnpm vitest run src/acp/runtime/errors.test.ts src/agents/acp-spawn.test.ts

AI assistance

AI-assisted. I reviewed and tested the patch locally.

Changed files

src/acp/runtime/errors.test.ts (modified, +40/-1)
src/acp/runtime/errors.ts (modified, +96/-0)
src/agents/acp-spawn.ts (modified, +5/-0)
src/agents/pi-embedded-runner/run/attempt.ts (modified, +24/-1)

PR #44151: Improve ACP spawn and timeout diagnostics

Repository: openclaw/openclaw
Author: mySebbe
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/44151

Description (problem / solution / changelog)

Summary

add reusable ACP diagnostic helpers that preserve nested stderr/stdout/cause details in log output
log richer context when sessions_spawn(runtime="acp") fails during ACP session initialization
include last tool error and recent message targets in embedded run timeout logs
carry forward the latest generated Swift protocol models required by current main
keep config help-label coverage in sync for the current gateway.push* fields
preserve falsy primitive ACP causes like 0 and false in diagnostic output

Why

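Several current ACP/session issues are hard to debug from gateway logs because failures collapse down to generic strings like ACP_SESSION_INIT_FAILED, acpx exited with code 1, or a bare embedded timeout line.

This patch does not change ACP runtime behavior or routing. It makes the failure path observable enough to diagnose workspace/cwd mismatches, backend stderr, and timeout context without reproducing everything interactively.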

The Swift protocol update and config help-label adjustment are carry-forward fixes needed to keep the PR green against the latest main.

Related

#43667
#43661

Testing

pnpm vitest run src/acp/runtime/errors.test.ts src/agents/acp-spawn.test.ts src/config/schema.help.quality.test.ts
pnpm protocol:check
pnpm check (local run hit existing unrelated format issues in ui/src/styles/chat/grouped.css, ui/src/styles/chat/layout.css, and ui/src/styles/layout.css)

AI assistance

AI-assisted. I reviewed and tested the patch locally.

Changed files

apps/macos/Sources/OpenClawProtocol/GatewayModels.swift (modified, +5/-1)
apps/shared/OpenClawKit/Sources/OpenClawProtocol/GatewayModels.swift (modified, +5/-1)
src/acp/runtime/errors.test.ts (modified, +57/-1)
src/acp/runtime/errors.ts (modified, +98/-0)
src/agents/acp-spawn.ts (modified, +5/-0)
src/agents/pi-embedded-runner/run/attempt.ts (modified, +24/-1)
src/config/schema.labels.ts (modified, +5/-5)
src/config/sessions/targets.test.ts (modified, +9/-10)
ui/src/styles/chat/grouped.css (modified, +10/-7)
ui/src/styles/chat/layout.css (modified, +3/-1)
ui/src/styles/layout.css (modified, +4/-4)


Bug Report


Steps to Reproduce

1. Send a large message to the agent (e.g. a long X/Twitter post or paste-heavy content)
2. This bloats the session context enough to trigger compaction
3. Compaction begins but hangs/times out (embedded run timeout: timed out during compaction)
4. Every ~10 minutes, the timed-out run retries delivery, sending the same message again

Observed Behavior

Repeated embedded run timeout: timed out during compaction entries in logs, spaced ~10 minutes apart
Each timeout triggers a sendMessage to the user with the same content (4x in this case)
One final send attempt fails with 400: Bad Request: message is too long (Telegram limit hit)
Agent is completely unresponsive during this entire period
Only resolution was a manual pm2 reload

Relevant Log Excerpts

2026-03-12T03:35:13.596Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000
2026-03-12T03:35:13.599Z [agent/embedded] using current snapshot: timed out during compaction
2026-03-12T03:35:13.880Z [telegram] sendMessage ok chat=... message=579

2026-03-12T03:45:15.075Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000
2026-03-12T03:45:15.078Z [agent/embedded] using current snapshot: timed out during compaction
2026-03-12T03:45:16.385Z [telegram] sendMessage ok chat=... message=580

2026-03-12T03:55:17.499Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000
2026-03-12T03:55:17.503Z [agent/embedded] using current snapshot: timed out during compaction
2026-03-12T03:55:18.827Z [telegram] sendMessage ok chat=... message=582

2026-03-12T04:05:17.739Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000
2026-03-12T04:05:17.743Z [agent/embedded] using current snapshot: timed out during compaction

2026-03-12T04:08:51.252Z [telegram] message failed: Call to 'sendMessage' failed! (400: Bad Request: message is too long)
2026-03-12T04:08:51.256Z [delivery-recovery] Retry failed for delivery ...: Call to 'sendMessage' failed! (400: Bad Request: message is too long)
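
The ~10 minute spacing can be checked directly from the timestamps. The following is a minimal TypeScript sketch, assuming the log line format shown in the excerpt above; the helper name `timeoutIntervalsMs` is illustrative and is not part of OpenClaw.

```typescript
// Hypothetical helper: collects the timestamps of "embedded run timeout" lines
// and returns the gaps between consecutive timeouts in milliseconds.
function timeoutIntervalsMs(log: string): number[] {
  const stamps = log
    .split("\n")
    .filter((line) => line.includes("embedded run timeout:"))
    .map((line) => Date.parse(line.slice(0, line.indexOf(" "))));

  const gaps: number[] = [];
  let previous: number | undefined;
  for (const stamp of stamps) {
    if (previous !== undefined) gaps.push(stamp - previous);
    previous = stamp;
  }
  return gaps;
}

// Timeout lines copied from the excerpt above.
const excerpt = [
  "2026-03-12T03:35:13.596Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000",
  "2026-03-12T03:45:15.075Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000",
  "2026-03-12T03:55:17.499Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000",
  "2026-03-12T04:05:17.739Z [agent/embedded] embedded run timeout: runId=... timeoutMs=600000",
].join("\n");

console.log(timeoutIntervalsMs(excerpt)); // [ 601479, 602424, 600240 ]
```

Each gap is only slightly above 600000 ms, which matches the timeoutMs=600000 value in the log. This suggests that every run waits out the full timeout and that the next run starts almost immediately afterwards.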

Expected Behavior

Compaction timeout should trigger a clean failure path, not a retry loop
If compaction fails, the agent should surface an error to the user (once), not retry the same send repeatedly
The session should either recover gracefully or terminate cleanly, not hang for 30+ minutes requiring manual intervention
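
One way to get a clean failure path is to make the timeout an explicit result that the caller must handle, rather than a condition that falls through into a retry. The sketch below is an assumption about how this could look in TypeScript; `withTimeout`, `Bounded`, and `demo` are hypothetical names and do not come from the OpenClaw codebase.

```typescript
// Outcome of a bounded operation: either a value or an explicit timeout.
type Bounded<T> = { kind: "ok"; value: T } | { kind: "timeout"; afterMs: number };

// Races `work` against a timer. On timeout the caller receives a typed result
// instead of waiting forever. Note: this does not cancel `work` itself.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<Bounded<T>> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => resolve({ kind: "timeout", afterMs: timeoutMs }),
      timeoutMs,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve({ kind: "ok", value });
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Usage sketch: a compaction that never settles becomes a single, explicit failure.
async function demo(): Promise<string> {
  const hungCompaction = new Promise<string>(() => {}); // stands in for a stuck compaction
  const result = await withTimeout(hungCompaction, 50);
  return result.kind === "timeout"
    ? "compaction failed: notify user once, end run"
    : result.value;
}

demo().then((outcome) => console.log(outcome));
```

Because the timeout is part of the return type, the compiler forces the caller to decide what happens in that case, which makes it harder to resend the previous message by accident.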

Environment

OpenClaw (self-hosted, pm2-managed)
Channel: Telegram
Trigger: Large inbound message (external content paste)

Impact

Agent unresponsive for 30+ minutes
User receives 4 identical messages
Manual restart required to recover

Extended analysis

Fix Plan

To resolve the silent failure loop caused by compaction timeouts, we need to implement a clean failure path and prevent repeated delivery retries.

Step 1: Implement Exponential Backoff

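Cap the number of delivery retries and space them with exponential backoff, so that a failing delivery cannot repeat indefinitely. This can be achieved by adding a retry counter and a backoff factor to the delivery retry logic.

const maxRetries = 3;          // give up after this many retries
const backoffFactor = 2;       // multiply the delay by this after each retry
const initialBackoffMs = 1000; // delay before the first retry (1 second)

// `deliver` is a placeholder for the real (async) delivery call.
// Resolves with its result, or rejects once maxRetries is exhausted.
async function retryDelivery(deliver) {
  let backoffMs = initialBackoffMs;

  for (let retryCount = 0; retryCount <= maxRetries; retryCount++) {
    try {
      return await deliver();
    } catch (err) {
      if (retryCount === maxRetries) {
        // out of retries: let the caller trigger the clean failure path
        throw new Error('Max retries exceeded', { cause: err });
      }

      // wait, then grow the delay for the next retry
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
      backoffMs *= backoffFactor;
    }
  }
}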
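
To make the retry budget concrete, the delays can be worked out ahead of time. This is a small TypeScript sketch using the values from this step (3 retries, factor 2, 1 second initial delay) and assuming the first retry waits the initial delay; `backoffSchedule` is an illustrative name.

```typescript
// Returns the delay before each retry: initialMs, initialMs * factor, ...
function backoffSchedule(initialMs: number, factor: number, maxRetries: number): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) => initialMs * Math.pow(factor, attempt));
}

const delays = backoffSchedule(1000, 2, 3);
const totalWaitMs = delays.reduce((sum, delay) => sum + delay, 0);

console.log(delays);      // [ 1000, 2000, 4000 ]
console.log(totalWaitMs); // 7000
```

Under these assumptions the retry path gives up after about 7 seconds of waiting in total, compared with the 30+ minutes of repeated attempts described in the report.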

Step 2: Surface Error to User

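Modify the delivery retry logic to surface an error to the user if compaction fails. This can be achieved by sending a single custom error message once the retry limit is reached, instead of resending the original message.

let compactionFailureReported = false; // ensures the user is notified only once

// `sendMessage` is a placeholder for the channel's (async) send function.
async function handleCompactionFailure(sendMessage) {
  if (!compactionFailureReported) {
    compactionFailureReported = true;
    // send custom error message to user
    const errorMessage = 'Error: Compaction failed. Please try again later.';
    await sendMessage(errorMessage);
  }

  // trigger clean failure path
  throw new Error('Compaction failed');
}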
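
When several sessions share one process, the "notify once" rule needs to be tracked per session. The following TypeScript sketch shows one possible shape; `createFailureNotifier` and the session IDs are hypothetical and only illustrate the idea.

```typescript
type Send = (sessionId: string, text: string) => void;

// Returns a function that sends a failure notice at most once per session.
// The returned function reports whether a message was actually sent.
function createFailureNotifier(send: Send): (sessionId: string, text: string) => boolean {
  const notified = new Set<string>();
  return (sessionId, text) => {
    if (notified.has(sessionId)) return false; // already told this session
    notified.add(sessionId);
    send(sessionId, text);
    return true;
  };
}

// Usage sketch with an in-memory stand-in for the channel.
const outbox: string[] = [];
const notifyOnce = createFailureNotifier((sessionId, text) => {
  outbox.push(`${sessionId}: ${text}`);
});

notifyOnce("session-a", "Compaction failed. Please try again later.");
notifyOnce("session-a", "Compaction failed. Please try again later."); // suppressed
console.log(outbox.length); // 1
```

A real implementation would also need to clear the entry when the session recovers or ends, otherwise later, unrelated failures in the same session would be suppressed as well.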

Step 3: Implement Clean Failure Path

Introduce a clean failure path to prevent the agent from hanging indefinitely. This can be achieved by terminating the session cleanly or restarting it automatically, so that recovery does not depend on a manual pm2 reload.

function triggerCleanFailurePath() {
  // terminate the session or restart it automatically
  // ...
}
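
One way to reason about a clean failure path is as a small state machine in which a compaction timeout leads to a terminal state and never back into another attempt. The states, events, and the `next` function below are assumptions made for illustration in TypeScript; they do not describe OpenClaw's actual session model.

```typescript
type SessionState = "active" | "compacting" | "failed" | "terminated";
type SessionEvent =
  | "compaction_started"
  | "compaction_done"
  | "compaction_timeout"
  | "cleanup_done";

// Allowed transitions. There is deliberately no edge from "failed" back to
// "compacting", so a timeout cannot feed a retry loop.
const transitions: Record<SessionState, Partial<Record<SessionEvent, SessionState>>> = {
  active: { compaction_started: "compacting" },
  compacting: { compaction_done: "active", compaction_timeout: "failed" },
  failed: { cleanup_done: "terminated" },
  terminated: {},
};

// Returns the next state, or throws if the event is not allowed in this state.
function next(state: SessionState, event: SessionEvent): SessionState {
  const target = transitions[state][event];
  if (target === undefined) {
    throw new Error(`invalid transition: ${state} + ${event}`);
  }
  return target;
}

let state: SessionState = "active";
state = next(state, "compaction_started"); // "compacting"
state = next(state, "compaction_timeout"); // "failed"
state = next(state, "cleanup_done");       // "terminated"
console.log(state); // terminated
```

Rejecting unknown transitions loudly means that an unexpected retry attempt after a failure shows up as an error in the logs instead of as another duplicate message.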

Verification

To verify that the fix worked, test the following scenarios:

Send a large message to the agent to trigger compaction
Verify that the agent surfaces an error to the user if compaction fails
Verify that the agent neither enters a silent failure loop nor retries the delivery repeatedly

Extra Tips

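To prevent regressions, consider adding logging and monitoring that detect compaction failures and repeated timeout-and-resend cycles, for example an alert when several compaction timeouts occur for the same session within a short window. Additionally, consider making delivery retries aware of what the channel has already accepted, so that a retry cannot produce a second copy of a message that was sent successfully.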
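
A monitoring check for this failure pattern can be quite small. The TypeScript sketch below flags a session when a given number of timeouts fall within a time window; the function name, the threshold of 3, and the 30 minute window are illustrative assumptions, and the timestamps are the four timeout entries from the log excerpts in this report.

```typescript
// Returns true if any `threshold` timeouts fall within `windowMs` of each other.
function detectTimeoutLoop(timestampsMs: number[], threshold: number, windowMs: number): boolean {
  if (threshold < 2) throw new RangeError("threshold must be at least 2");
  const sorted = [...timestampsMs].sort((a, b) => a - b);
  for (let start = 0; start + threshold <= sorted.length; start++) {
    const group = sorted.slice(start, start + threshold);
    if (Math.max(...group) - Math.min(...group) <= windowMs) return true;
  }
  return false;
}

// Timestamps of the four "embedded run timeout" entries from the report.
const observed = [
  "2026-03-12T03:35:13.596Z",
  "2026-03-12T03:45:15.075Z",
  "2026-03-12T03:55:17.499Z",
  "2026-03-12T04:05:17.739Z",
].map((iso) => Date.parse(iso));

console.log(detectTimeoutLoop(observed, 3, 30 * 60 * 1000)); // true
```

With these settings the loop in this report would have been flagged at the third timeout, roughly 20 minutes after the first one.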
