openclaw - 💡(How to fix) Fix Session lock auto-cleanup on staleness detection

openclaw2026-05-28 20:55:32

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Session JSONL lock files (.jsonl.lock) can become stale even when the gateway PID is alive and actively holding them, causing file lock stale errors in the sessions_send / sessions_spawn paths. This requires manual openclaw sessions cleanup intervention.

Root Cause

This mitigates the immediate impact but does not address the root cause.

Fix Action

Fix / Workaround

Current Workaround

Severity: P1 (fleet-degrading, not P0 because manual mitigation exists)

Note: This issue is filed in parallel with our cron-based workaround deployment (Item A, Path 2 from internal substrate repair commission). We're requesting the upstream feature to enable eventual removal of the cron workaround once the feature is stable.

RAW_BUFFERClick to expand / collapse

Summary

Background

Over the past 3 days (2026-05-26 to 2026-05-28), the OpenClaw gateway has experienced recurring "file lock stale" substrate failures affecting multiple agents simultaneously. The pattern:

Lock files for session JSONL files are flagged as "stale" by the runtime
The gateway PID is alive and actively working
No orphaned .lock files exist on disk at failure time
Logs show EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released

This appears to be a race condition in embedded-session lock management during concurrent sessions_send / sub-agent spawn operations, particularly affecting high-volume agents (woodhouse, fleet-ops, PMs).

Evidence:

Instance #1: 2026-05-28 04:22 UTC (347h gateway uptime)
Instance #2: 2026-05-28 06:45 UTC (<2h uptime, post-restart)
Instance #3+: 2026-05-28 14:23+ UTC (multiple agents)
All instances affected the same high-volume session files
Manual openclaw sessions cleanup clears the issue temporarily

Full forensic: [internal audit doc available on request]

Current Workaround

We've deployed a cron-based auto-cleanup sniffer:

Runs every 5 minutes
Scans for locks >5 minutes old with no active process
Auto-executes openclaw sessions cleanup on P1/P0 alerts
Alerts operators via inbox

This mitigates the immediate impact but does not address the root cause.

Requested Feature

Auto-cleanup on lock staleness detection:

When the runtime detects a stale lock internally, automatically run cleanup (equivalent to openclaw sessions cleanup) before retrying the operation
Log the auto-cleanup event for forensics
Optional: expose a config flag to disable auto-cleanup if operators want manual control

Benefits:

Eliminates the 5-15 minute window between lock staleness and cron-driven cleanup
Reduces fleet-wide stuck-but-alive agent occurrences
Provides immediate self-healing for transient lock races

Alternative: Root Cause Fix

If the lock-staleness detection itself is buggy (i.e., the lock is NOT actually stale, but the runtime incorrectly thinks it is), then the root cause is in the lock validation logic. In that case:

Review the EmbeddedAttemptSessionTakeoverError path
Check if lock acquisition/release during sessions_send has a race window
Ensure lock validation accounts for concurrent operations on the same session file

We're happy to provide additional forensic data (logs, stack traces, timing) if that helps diagnose the root cause.

Impact

Severity: P1 (fleet-degrading, not P0 because manual mitigation exists)

Frequency: 3-4 instances in 3 days across 10+ agents

Affected workflows:

Inter-agent messaging (sessions_send)
Sub-agent spawning (sessions_spawn)
Heartbeat coordination (agents appear "stuck but alive")

Environment

OpenClaw version: [current production, can provide specific version if needed]
Gateway uptime at first occurrence: 347h
Gateway uptime at recurrence: <2h (shows restart does not permanently resolve)
Affected session types: DM sessions (high concurrency), heartbeat sessions

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix Session lock auto-cleanup on staleness detection

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Current Workaround

Summary

Background

Current Workaround

Requested Feature

Alternative: Root Cause Fix

Impact

Environment

Still need to ship something?

TRENDING