openclaw: Abort settle timeout causes zombie session write lock cascade

Description

When an embedded agent run hits its 300s timeout, the abort cleanup does not finish within the 2s settle window (log message: embedded abort settle timed out: timeoutMs=2000), so the session .jsonl.lock file is never released. The resulting zombie lock blocks all subsequent messages from the same chat for 5+ minutes.

Reproduction

  1. Maintain a large session context (150K+ tokens)
  2. Use a reasoning model that frequently approaches the 300s timeout
  3. When the timeout occurs, the log shows embedded abort settle timed out and the lock is not released
  4. Each subsequent message waits 60s and then fails with SessionWriteLockTimeoutError

Root Cause

proper-lockfile considers the lock "compromised" after the abort, which prevents its release. The shouldReclaimContendedLockFile logic in session-write-lock-_a5O1H8L.js reclaims a contended lock only if the lock is stale (default 30min via DEFAULT_SESSION_WRITE_LOCK_STALE_MS) or the owning PID is dead (5s via ORPHAN_LOCK_PAYLOAD_GRACE_MS). Here the owning PID is the Gateway itself, which is still alive, so the lock can only be reclaimed after 30 minutes.
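
To make the reclaim rule concrete, here is a minimal sketch of the decision as described above. It is not the actual openclaw source: canReclaim, the lock shape { pid, createdAtMs }, and the injected isPidAlive predicate are hypothetical names, and the way the 5s grace period is applied is an assumption. Only the two constants and their values come from the issue.

```javascript
// Hypothetical sketch of the reclaim rule; NOT the real shouldReclaimContendedLockFile.
const DEFAULT_SESSION_WRITE_LOCK_STALE_MS = 30 * 60 * 1000; // 30 min, per the issue
const ORPHAN_LOCK_PAYLOAD_GRACE_MS = 5 * 1000; // 5 s, per the issue

// lock: { pid, createdAtMs } (assumed shape); isPidAlive: injected predicate (assumption).
function canReclaim(lock, nowMs, isPidAlive, staleMs = DEFAULT_SESSION_WRITE_LOCK_STALE_MS) {
  const ageMs = nowMs - lock.createdAtMs;
  // A stale lock is reclaimable no matter who owns it.
  if (ageMs >= staleMs) return true;
  // A dead owner makes the lock reclaimable once the short grace period has passed.
  if (!isPidAlive(lock.pid) && ageMs >= ORPHAN_LOCK_PAYLOAD_GRACE_MS) return true;
  // Live owner and not yet stale: contenders have to keep waiting.
  return false;
}

const lock = { pid: 4242, createdAtMs: 0 };
const ownerDead = () => false;
const gatewayAlive = () => true;

console.log(canReclaim(lock, 6000, ownerDead)); // true
console.log(canReclaim(lock, 6000, gatewayAlive)); // false
console.log(canReclaim(lock, 30 * 60 * 1000, gatewayAlive)); // true
```

The middle case is the one this issue hits: the Gateway process still owns the lock and is alive, so the dead-PID shortcut never applies and only the stale threshold can free the lock.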

Observed Impact

  • 5 consecutive message failures over 5 minutes on 2026-05-27
  • Each failed message waited the full 60s DEFAULT_SESSION_WRITE_LOCK_ACQUIRE_TIMEOUT_MS
  • User experience: "Something went wrong" errors cascade

Suggested Fix

  1. Increase abort settle timeout from 2s to at least 10s
  2. Force-release the lock file after abort (delete .lock file) when stopReason is timeout
  3. Reduce DEFAULT_SESSION_WRITE_LOCK_STALE_MS from 30min to something more reasonable (e.g., 2min)
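
Suggestions 1 and 2 could be combined roughly as follows. This is a sketch under stated assumptions, not a patch against openclaw: settleWithin, abortAndRelease, abortCleanup, and forceReleaseLock are hypothetical names, and the real abort path may be structured differently. The idea is to wait up to a longer settle window and, if cleanup still has not completed and the stop reason is a timeout, release the lock explicitly.

```javascript
// Hypothetical helper: resolves true if `promise` fulfills within timeoutMs, otherwise false.
// A rejected cleanup is treated as "not settled" because the lock may still be held.
function settleWithin(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  const settled = promise.then(() => true, () => false);
  return Promise.race([settled, timeout]).finally(() => clearTimeout(timer));
}

// abortCleanup and forceReleaseLock are injected so the sketch stays independent
// of how openclaw actually aborts a run or removes the .jsonl.lock file.
async function abortAndRelease({ abortCleanup, forceReleaseLock, stopReason, settleMs = 10000 }) {
  const settled = await settleWithin(abortCleanup(), settleMs);
  if (settled) return "settled";
  if (stopReason === "timeout") {
    await forceReleaseLock(); // e.g. delete the session .lock file
    return "force-released";
  }
  return "unsettled";
}

// Demo: a cleanup that never finishes, with a short settle window for illustration.
abortAndRelease({
  abortCleanup: () => new Promise(() => {}),
  forceReleaseLock: async () => console.log("lock file removed"),
  stopReason: "timeout",
  settleMs: 50,
}).then((outcome) => console.log(outcome));
```

The default settleMs of 10000 mirrors the 10s proposed in suggestion 1. Force-releasing only when stopReason is timeout keeps the behaviour unchanged for other abort paths.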

Workaround

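Set OPENCLAW_SESSION_WRITE_LOCK_ACQUIRE_TIMEOUT_MS=10000 and OPENCLAW_SESSION_WRITE_LOCK_STALE_MS=120000 as environment variables on the Gateway service. This shortens the per-message acquire wait from 60s to 10s and the stale threshold from 30min to 2min.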
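
The effect of the two overrides can be illustrated with a small sketch. The environment variable names and the default values (60s acquire timeout, 30min stale threshold) are taken from this issue; resolveLockTimeouts and its parsing rules are hypothetical and only show how such overrides typically fall back to defaults, not how openclaw actually reads them.

```javascript
// Defaults as reported in the issue; the parsing below is an illustrative assumption.
const DEFAULTS = { acquireTimeoutMs: 60000, staleMs: 30 * 60 * 1000 };

function resolveLockTimeouts(env) {
  // Use the override only when it parses to a positive finite number.
  const read = (name, fallback) => {
    const value = Number(env[name]);
    return Number.isFinite(value) && value > 0 ? value : fallback;
  };
  return {
    acquireTimeoutMs: read("OPENCLAW_SESSION_WRITE_LOCK_ACQUIRE_TIMEOUT_MS", DEFAULTS.acquireTimeoutMs),
    staleMs: read("OPENCLAW_SESSION_WRITE_LOCK_STALE_MS", DEFAULTS.staleMs),
  };
}

const workaround = resolveLockTimeouts({
  OPENCLAW_SESSION_WRITE_LOCK_ACQUIRE_TIMEOUT_MS: "10000",
  OPENCLAW_SESSION_WRITE_LOCK_STALE_MS: "120000",
});
console.log(workaround); // { acquireTimeoutMs: 10000, staleMs: 120000 }
```

With these values a blocked message fails after 10s rather than 60s, and a zombie lock held by a live Gateway becomes reclaimable after 2 minutes rather than 30.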
