openclaw - 💡(How to fix) Fix Session lock auto-cleanup on staleness detection

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Session JSONL lock files (.jsonl.lock) can become stale even when the gateway PID is alive and actively holding them, causing file lock stale errors in the sessions_send / sessions_spawn paths. This requires manual openclaw sessions cleanup intervention.

Root Cause

This mitigates the immediate impact but does not address the root cause.

Fix Action

Fix / Workaround

Current Workaround

Severity: P1 (fleet-degrading, not P0 because manual mitigation exists)

Note: This issue is filed in parallel with our cron-based workaround deployment (Item A, Path 2 from internal substrate repair commission). We're requesting the upstream feature to enable eventual removal of the cron workaround once the feature is stable.

RAW_BUFFERClick to expand / collapse

Summary

Session JSONL lock files (.jsonl.lock) can become stale even when the gateway PID is alive and actively holding them, causing file lock stale errors in the sessions_send / sessions_spawn paths. This requires manual openclaw sessions cleanup intervention.

Background

Over the past 3 days (2026-05-26 to 2026-05-28), the OpenClaw gateway has experienced recurring "file lock stale" substrate failures affecting multiple agents simultaneously. The pattern:

  • Lock files for session JSONL files are flagged as "stale" by the runtime
  • The gateway PID is alive and actively working
  • No orphaned .lock files exist on disk at failure time
  • Logs show EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released

This appears to be a race condition in embedded-session lock management during concurrent sessions_send / sub-agent spawn operations, particularly affecting high-volume agents (woodhouse, fleet-ops, PMs).

Evidence:

  • Instance #1: 2026-05-28 04:22 UTC (347h gateway uptime)
  • Instance #2: 2026-05-28 06:45 UTC (<2h uptime, post-restart)
  • Instance #3+: 2026-05-28 14:23+ UTC (multiple agents)
  • All instances affected the same high-volume session files
  • Manual openclaw sessions cleanup clears the issue temporarily

Full forensic: [internal audit doc available on request]

Current Workaround

We've deployed a cron-based auto-cleanup sniffer:

  • Runs every 5 minutes
  • Scans for locks >5 minutes old with no active process
  • Auto-executes openclaw sessions cleanup on P1/P0 alerts
  • Alerts operators via inbox

This mitigates the immediate impact but does not address the root cause.

Requested Feature

Auto-cleanup on lock staleness detection:

  1. When the runtime detects a stale lock internally, automatically run cleanup (equivalent to openclaw sessions cleanup) before retrying the operation
  2. Log the auto-cleanup event for forensics
  3. Optional: expose a config flag to disable auto-cleanup if operators want manual control

Benefits:

  • Eliminates the 5-15 minute window between lock staleness and cron-driven cleanup
  • Reduces fleet-wide stuck-but-alive agent occurrences
  • Provides immediate self-healing for transient lock races

Alternative: Root Cause Fix

If the lock-staleness detection itself is buggy (i.e., the lock is NOT actually stale, but the runtime incorrectly thinks it is), then the root cause is in the lock validation logic. In that case:

  1. Review the EmbeddedAttemptSessionTakeoverError path
  2. Check if lock acquisition/release during sessions_send has a race window
  3. Ensure lock validation accounts for concurrent operations on the same session file

We're happy to provide additional forensic data (logs, stack traces, timing) if that helps diagnose the root cause.

Impact

Severity: P1 (fleet-degrading, not P0 because manual mitigation exists)

Frequency: 3-4 instances in 3 days across 10+ agents

Affected workflows:

  • Inter-agent messaging (sessions_send)
  • Sub-agent spawning (sessions_spawn)
  • Heartbeat coordination (agents appear "stuck but alive")

Environment

  • OpenClaw version: [current production, can provide specific version if needed]
  • Gateway uptime at first occurrence: 347h
  • Gateway uptime at recurrence: <2h (shows restart does not permanently resolve)
  • Affected session types: DM sessions (high concurrency), heartbeat sessions

Note: This issue is filed in parallel with our cron-based workaround deployment (Item A, Path 2 from internal substrate repair commission). We're requesting the upstream feature to enable eventual removal of the cron workaround once the feature is stable.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix Session lock auto-cleanup on staleness detection