openclaw: [Bug] Concurrent lane writes to same session file cause persistent SessionWriteLockTimeoutError (dual-lane race + lock lifetime mismatch)



Bug type

Bug (v2026.5.22)

Summary

The session JSONL write lock can be held across a lane boundary when lane=main and lane=session:… write to the same session file concurrently. The waiting lane then fails with SessionWriteLockTimeoutError after 60,000 ms. The lock survives that timeout because maxHoldMs (1,020,000 ms, i.e. 17 minutes) far exceeds the 60,000 ms lane timeout. The lock owner is the current live Gateway PID, so isAlive(pid) returns true and cleanStaleLockFiles() skips the file. Only a full Gateway kill + restart clears it permanently.

Additionally, if the Gateway process is already in a compromised state (e.g., post-lock-timeout), a SIGUSR1 soft restart can trigger event loop starvation during startup: secrets.resolve fails, Discord fetches time out with eventLoopDelayHint: "timer delayed 8785ms, likely event-loop starvation", and health checks fail — leaving the old PID as a stale zombie.
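The lock-lifetime mismatch described in this summary can be modelled in a few lines. This is a minimal TypeScript sketch, not OpenClaw's actual code: `LockPayload` mirrors the fields of the lock file quoted in this report, while `isStaleByPidOnly` and the injected `isAlive` callback are hypothetical stand-ins for whatever cleanStaleLockFiles() does internally.

```typescript
// Fields as they appear in the lock file quoted in this report.
interface LockPayload {
  pid: number;
  createdAt: string;
  maxHoldMs: number;
}

// Hypothetical PID-only stale check: a lock counts as stale only if its owner is dead.
function isStaleByPidOnly(
  lock: LockPayload,
  isAlive: (pid: number) => boolean
): boolean {
  return !isAlive(lock.pid);
}

const LANE_TIMEOUT_MS = 60000; // timeout reported in the error message

const observedLock: LockPayload = {
  pid: 44536,
  createdAt: "2026-05-24T09:22:29.256Z",
  maxHoldMs: 1020000,
};

// The Gateway itself owns the lock, so the owner PID is alive for as long as it runs.
const gatewayPid = 44536;
const staleUnderPidCheck = isStaleByPidOnly(
  observedLock,
  (pid) => pid === gatewayPid
);

// Number of consecutive lane timeouts that fit inside one maxHoldMs window.
const timeoutsPerHold = observedLock.maxHoldMs / LANE_TIMEOUT_MS;
```

Under these assumptions the check never reports the lock as stale while the Gateway is running, and a single hold window spans 17 lane timeouts, which matches the "survives the timeout" behavior reported here.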

Steps to reproduce

  1. Run OpenClaw Gateway 2026.5.22 on macOS (LaunchAgent, loopback bind)
  2. Have an agent with parallel lane support (two lanes writing the same session file simultaneously):
    • lane=main processing tool results and context assembly
    • lane=session:agent:<agent>:feishu:direct:<user> processing an inbound chat message
  3. Both lanes compete for the same sessions/<uuid>.jsonl.lock
  4. The lane holding the lock does a long API call (>60s); the waiting lane times out at 60,000ms
  5. The lock file persists after the timeout. Manually deleting the .lock file restores the session for one turn, but the race reproduces on the next turn. Deleting the lock and restarting the Gateway is the only permanent fix.
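Steps 3 to 5 can be sketched as a small virtual-time model. It is an illustration only: `simulateContention` and its result shape are hypothetical names, both lanes are assumed to request the lock at the same instant, and no real file locking is involved.

```typescript
interface ContentionResult {
  waiterOutcome: "acquired" | "timeout";
  waitedMs: number;
  // Whether the original holder still owns the lock when the waiter stops waiting.
  holderStillHasLock: boolean;
}

// The holder takes the lock at t=0 and keeps it for holderHoldMs (e.g. a long API call).
// The waiter asks for the same lock at t=0 and gives up after acquireTimeoutMs.
function simulateContention(
  holderHoldMs: number,
  acquireTimeoutMs: number
): ContentionResult {
  if (holderHoldMs <= acquireTimeoutMs) {
    // Holder released in time: the waiter gets the lock once the holder is done.
    return {
      waiterOutcome: "acquired",
      waitedMs: holderHoldMs,
      holderStillHasLock: false,
    };
  }
  // Holder is still busy: the waiter times out and the lock file stays in place.
  return {
    waiterOutcome: "timeout",
    waitedMs: acquireTimeoutMs,
    holderStillHasLock: true,
  };
}

const shortCall = simulateContention(5000, 60000); // 5 s write, waiter succeeds
const longCall = simulateContention(90000, 60000); // 90 s API call, waiter times out
```

The second case corresponds to step 4: the waiting lane fails at 60,000 ms while the lock is still held.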

Observed reproduction

  • Agent: silvermoon, model: deepseek-v4-pro / deepseek-v4-flash
  • Session: agent:silvermoon:feishu:direct:ou_8072af0701714ee36e1e66d6690be4d7
  • Session file: 31b35f44-3619-49ef-a6d7-eed0db14bd40.jsonl
  • Lock file contents:
{
  "pid": 44536,
  "createdAt": "2026-05-24T09:22:29.256Z",
  "maxHoldMs": 1020000
}
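To see how long this particular lock would remain valid if nothing released it, the expiry can be computed from the two fields above. This is a sketch that assumes maxHoldMs is counted from createdAt (the report does not state this explicitly); `lockExpiryIso` is a hypothetical helper name.

```typescript
// Assumption: the hold window starts at createdAt and lasts maxHoldMs.
function lockExpiryIso(createdAt: string, maxHoldMs: number): string {
  return new Date(Date.parse(createdAt) + maxHoldMs).toISOString();
}

const expiresAt = lockExpiryIso("2026-05-24T09:22:29.256Z", 1020000);
// expiresAt is "2026-05-24T09:39:29.256Z", 17 minutes after createdAt
```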

Log evidence (two occurrences within 15 minutes, same session):

lane task error: lane=main durationMs=60271 error="SessionWriteLockTimeoutError:
  session file locked (timeout 60000ms): pid=44536 …/31b35f44-…jsonl.lock"

lane task error: lane=session:agent:silvermoon:feishu:direct:ou_8072… durationMs=60272
  error="SessionWriteLockTimeoutError: session file locked (timeout 60000ms):
  pid=44536 …/31b35f44-…jsonl.lock"

Embedded agent failed before reply:
  session file locked (timeout 60000ms): pid=44536 …/31b35f44-…jsonl.lock

Event loop starvation on soft restart

After SIGUSR1 restart from the compromised state:

secrets.resolve failed: Secrets runtime snapshot is not active.

fetch timeout reached; aborting operation
  timeoutMs=10000 elapsedMs=18785 timerDelayMs=8785
  eventLoopDelayHint="timer delayed 8785ms, likely event-loop starvation"

Health check failed: gateway timeout after 10000ms

killing 1 stale gateway process(es) before restart: 44536
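The numbers in the fetch-timeout log line are internally consistent: elapsedMs equals timeoutMs plus timerDelayMs (10000 + 8785 = 18785). A sketch of how such a hint could be derived follows; `reportTimerDelay` and the 1000 ms threshold are assumptions made for illustration, not values taken from the Gateway.

```typescript
interface TimerDelayReport {
  timerDelayMs: number;
  hint: string | null;
}

// Hypothetical threshold; the value actually used by the Gateway is not given in the report.
const STARVATION_THRESHOLD_MS = 1000;

// A timer scheduled for timeoutMs that fires after elapsedMs was delayed by the difference.
function reportTimerDelay(timeoutMs: number, elapsedMs: number): TimerDelayReport {
  const timerDelayMs = Math.max(0, elapsedMs - timeoutMs);
  const hint =
    timerDelayMs >= STARVATION_THRESHOLD_MS
      ? `timer delayed ${timerDelayMs}ms, likely event-loop starvation`
      : null;
  return { timerDelayMs, hint };
}

const delayReport = reportTimerDelay(10000, 18785);
```

A timer that fires almost 9 seconds late means the event loop was blocked for about that long, which would also explain the 10,000 ms health check timing out.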

Expected behavior

  1. Lane timeouts should release the session JSONL write lock (lock steward should not outlive lane timeout)
  2. maxHoldMs should be ≤ lane timeout, or the lock should detect that the holding lane has timed out and self-release
  3. A same-PID lock whose age (now - createdAt) exceeds laneTimeout + margin should be treated as stale even when isAlive(pid) reports the PID as alive
  4. SIGUSR1 restart from a compromised state should not trigger event loop starvation cascades
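Expectations 2 and 3 could be sketched as follows. This is one possible shape under stated assumptions, not a proposed patch: `SessionLock`, `StalePolicy`, `clampMaxHold`, `isStaleLock`, and the 5000 ms margin are all hypothetical.

```typescript
interface SessionLock {
  pid: number;
  createdAt: string;
  maxHoldMs: number;
}

interface StalePolicy {
  laneTimeoutMs: number;
  marginMs: number; // hypothetical grace period on top of the lane timeout
}

// Expectation 2: never let a lock claim a hold time longer than the lane timeout.
function clampMaxHold(requestedMs: number, policy: StalePolicy): number {
  return Math.min(requestedMs, policy.laneTimeoutMs);
}

// Expectation 3: a lock is stale if its owner is dead OR if it is older than
// laneTimeout + margin, even when the owning PID is still alive.
function isStaleLock(
  lock: SessionLock,
  nowMs: number,
  isAlive: (pid: number) => boolean,
  policy: StalePolicy
): boolean {
  if (!isAlive(lock.pid)) {
    return true;
  }
  const ageMs = nowMs - Date.parse(lock.createdAt);
  return ageMs > policy.laneTimeoutMs + policy.marginMs;
}

const policy: StalePolicy = { laneTimeoutMs: 60000, marginMs: 5000 };
const reportedLock: SessionLock = {
  pid: 44536,
  createdAt: "2026-05-24T09:22:29.256Z",
  maxHoldMs: 1020000,
};
const createdAtMs = Date.parse(reportedLock.createdAt);
const ownerAlive = (_pid: number) => true; // same live Gateway PID, as in the report

const staleAt30s = isStaleLock(reportedLock, createdAtMs + 30000, ownerAlive, policy);
const staleAt70s = isStaleLock(reportedLock, createdAtMs + 70000, ownerAlive, policy);
const clampedHold = clampMaxHold(reportedLock.maxHoldMs, policy);
```

An age-based rule like this would also expire a lock held by a legitimately long write, so it would probably need to be paired with lock renewal or with releasing the lock before long API calls. Which of those fits the Gateway's design is a question for the maintainers.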

Impact

  • User receives "Something went wrong" error in-channel with no recovery path
  • Only Gateway restart (or manual .lock file deletion) restores the session
  • If the admin uses soft restart from the degraded state, the restart itself fails catastrophically

Workaround

Delete the lock file and restart the Gateway: rm <session>.jsonl.lock, then openclaw gateway restart (or a full kill + LaunchAgent restart)

Related issues

  • #84193 — same SessionWriteLockTimeoutError but triggered by auto-compaction path
  • #85913 — EmbeddedAttemptSessionTakeoverError from lane race (heartbeat vs channel)
  • #49603 — orphaned lock files not cleared when PID matches current process

