openclaw - [Bug]: Concurrent lane writes to same session file cause persistent SessionWriteLockTimeoutError (dual-lane race + lock lifetime mismatch)

StepCodex · 2026-05-24T09:54:56Z


The session JSONL write lock can be held across a lane boundary when lane=main and lane=session:… write to the same session file concurrently, causing SessionWriteLockTimeoutError after 60,000ms. The lock survives the timeout because maxHoldMs (1,020,000ms) far exceeds the lane timeout. Because the lock owner is the current live Gateway PID, isAlive(pid) returns true and cleanStaleLockFiles() skips it. Only a full Gateway kill and restart clears it permanently.

Additionally, if the Gateway process is already in a compromised state (e.g., post-lock-timeout), a SIGUSR1 soft restart can trigger event loop starvation during startup: secrets.resolve fails, Discord fetches time out with eventLoopDelayHint: "timer delayed 8785ms, likely event-loop starvation", and health checks fail, leaving the old PID as a stale zombie.

Bug type

Bug (v2026.5.22)

Steps to reproduce

1. Run OpenClaw Gateway 2026.5.22 on macOS (LaunchAgent, loopback bind)
2. Have an agent with parallel lane support (two lanes writing the same session file simultaneously):
   - lane=main processing tool results and context assembly
   - lane=session:agent:<agent>:feishu:direct:<user> processing an inbound chat message
3. Both lanes compete for the same sessions/<uuid>.jsonl.lock
4. The lane holding the lock does a long API call (>60s); the waiting lane times out at 60,000ms
5. Lock file persists after timeout. Manually deleting .lock restores the session for one turn, but the race reproduces on the next turn. Deleting the lock + Gateway restart is the only permanent fix.

Observed reproduction

- Agent: silvermoon, model: deepseek-v4-pro / deepseek-v4-flash
- Session: agent:silvermoon:feishu:direct:ou_8072af0701714ee36e1e66d6690be4d7
- Session file: 31b35f44-3619-49ef-a6d7-eed0db14bd40.jsonl
- Lock file contents:

{
  "pid": 44536,
  "createdAt": "2026-05-24T09:22:29.256Z",
  "maxHoldMs": 1020000
}
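
To put the lock payload above in perspective, here is a minimal sketch of the arithmetic behind the lifetime mismatch. The field names come from the lock file and the 60,000ms figure from the log lines; the helper names (`holdWindowExcessMs`, `selfExpiryTime`) are hypothetical and are not OpenClaw APIs. It also assumes `maxHoldMs` is counted from `createdAt`, which the report does not confirm.

```typescript
// Shape of the lock file shown above (fields taken from the report).
interface SessionLockPayload {
  pid: number;
  createdAt: string; // ISO-8601 timestamp
  maxHoldMs: number;
}

// Lane timeout as reported in the logs ("timeout 60000ms").
const LANE_TIMEOUT_MS = 60_000;

// Hypothetical helper: how much longer the lock may live than a waiting lane will wait.
function holdWindowExcessMs(lock: SessionLockPayload, laneTimeoutMs: number): number {
  return Math.max(0, lock.maxHoldMs - laneTimeoutMs);
}

// Hypothetical helper: when the lock would expire if maxHoldMs is measured from createdAt.
function selfExpiryTime(lock: SessionLockPayload): Date {
  return new Date(Date.parse(lock.createdAt) + lock.maxHoldMs);
}

const observed: SessionLockPayload = {
  pid: 44536,
  createdAt: "2026-05-24T09:22:29.256Z",
  maxHoldMs: 1_020_000,
};

console.log(holdWindowExcessMs(observed, LANE_TIMEOUT_MS)); // 960000
console.log(selfExpiryTime(observed).toISOString()); // 2026-05-24T09:39:29.256Z
```

Under these assumptions the lock may outlive a timed-out waiter by 16 minutes (17 minutes of hold window against a 60-second lane timeout), which is consistent with two failures occurring within 15 minutes on the same session.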

Log evidence (two occurrences within 15 minutes, same session):

lane task error: lane=main durationMs=60271 error="SessionWriteLockTimeoutError:
  session file locked (timeout 60000ms): pid=44536 …/31b35f44-…jsonl.lock"

lane task error: lane=session:agent:silvermoon:feishu:direct:ou_8072… durationMs=60272
  error="SessionWriteLockTimeoutError: session file locked (timeout 60000ms):
  pid=44536 …/31b35f44-…jsonl.lock"

Embedded agent failed before reply:
  session file locked (timeout 60000ms): pid=44536 …/31b35f44-…jsonl.lock

Event loop starvation on soft restart

After SIGUSR1 restart from the compromised state:

secrets.resolve failed: Secrets runtime snapshot is not active.

fetch timeout reached; aborting operation
  timeoutMs=10000 elapsedMs=18785 timerDelayMs=8785
  eventLoopDelayHint="timer delayed 8785ms, likely event-loop starvation"

Health check failed: gateway timeout after 10000ms

killing 1 stale gateway process(es) before restart: 44536
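
The `timerDelayMs` value in the log above follows directly from the other two fields: a timer armed for `timeoutMs` that only fires after `elapsedMs` was delayed by the difference. The sketch below reproduces that arithmetic. The threshold constant and the function names are hypothetical, chosen for illustration, and do not describe how OpenClaw derives its `eventLoopDelayHint`.

```typescript
// Fields as they appear in the "fetch timeout reached" log line.
interface FetchTimeoutLog {
  timeoutMs: number; // delay the abort timer was armed with
  elapsedMs: number; // wall-clock time until the timer actually fired
}

// A late timer means the event loop could not run its callback on time.
function timerDelayMs(log: FetchTimeoutLog): number {
  return Math.max(0, log.elapsedMs - log.timeoutMs);
}

// Hypothetical threshold, not taken from OpenClaw.
const STARVATION_THRESHOLD_MS = 1_000;

function looksStarved(
  log: FetchTimeoutLog,
  thresholdMs: number = STARVATION_THRESHOLD_MS,
): boolean {
  return timerDelayMs(log) >= thresholdMs;
}

const observedFetch: FetchTimeoutLog = { timeoutMs: 10_000, elapsedMs: 18_785 };

console.log(timerDelayMs(observedFetch)); // 8785
console.log(looksStarved(observedFetch)); // true
```

A delay of 8,785ms on a 10,000ms timer means the loop was blocked for almost as long as the health check's own 10,000ms budget, which would explain why the health check failed in the same window.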

Expected behavior

1. Lane timeouts should release the session JSONL write lock (lock steward should not outlive lane timeout)
2. maxHoldMs should be ≤ lane timeout, or the lock should detect that the holding lane has timed out and self-release
3. A same-PID lock whose age (now minus createdAt) exceeds laneTimeout + margin should be treated as stale even when isAlive(pid) returns true
4. SIGUSR1 restart from a compromised state should not trigger event loop starvation cascades
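
Items 2 and 3 could look roughly like the sketch below. This is an illustration under stated assumptions, not the OpenClaw implementation: `clampMaxHoldMs`, `isLockStale`, `StaleCheckContext`, and `marginMs` are hypothetical names, `isAlive` is injected as a plain callback, and treating a dead PID or an exhausted `maxHoldMs` window as stale is assumed rather than taken from the report.

```typescript
interface SessionLockPayload {
  pid: number;
  createdAt: string; // ISO-8601 timestamp
  maxHoldMs: number;
}

// Hypothetical inputs for the staleness decision.
interface StaleCheckContext {
  nowMs: number;
  currentPid: number;
  laneTimeoutMs: number;
  marginMs: number;
  isAlive: (pid: number) => boolean;
}

// Item 2: never request a hold window longer than the lane timeout.
function clampMaxHoldMs(requestedMs: number, laneTimeoutMs: number): number {
  return Math.min(requestedMs, laneTimeoutMs);
}

// Item 3: a live owner PID is not enough to keep a lock valid.
function isLockStale(lock: SessionLockPayload, ctx: StaleCheckContext): boolean {
  if (!ctx.isAlive(lock.pid)) return true; // owner process is gone (assumed rule)
  const createdMs = Date.parse(lock.createdAt);
  if (Number.isNaN(createdMs)) return false; // unreadable timestamp: do not guess
  const ageMs = ctx.nowMs - createdMs;
  if (ageMs > lock.maxHoldMs) return true; // hold window exhausted (assumed rule)
  // Same-PID rule: our own process holds a lock older than any lane can still be running.
  return lock.pid === ctx.currentPid && ageMs > ctx.laneTimeoutMs + ctx.marginMs;
}

const lock: SessionLockPayload = {
  pid: 44536,
  createdAt: "2026-05-24T09:22:29.256Z",
  maxHoldMs: 1_020_000,
};

const ctx: StaleCheckContext = {
  nowMs: Date.parse(lock.createdAt) + 90_000, // 90s after the lock was created
  currentPid: 44536,
  laneTimeoutMs: 60_000,
  marginMs: 5_000,
  isAlive: () => true,
};

console.log(isLockStale(lock, ctx)); // true: same PID, older than 65s
console.log(isLockStale(lock, { ...ctx, currentPid: 1 })); // false: other live PID, within maxHoldMs
console.log(clampMaxHoldMs(1_020_000, 60_000)); // 60000
```

The same-PID rule is only safe if no lane may legitimately hold the lock longer than laneTimeout + margin, which is why it is paired here with the clamp from item 2.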

Impact

- User receives "Something went wrong" error in-channel with no recovery path
- Only Gateway restart (or manual .lock file deletion) restores the session
- If the admin uses soft restart from the degraded state, the restart itself fails catastrophically

Workaround

rm <session>.jsonl.lock + openclaw gateway restart (or full kill + LaunchAgent restart)

Related issues

- #84193: same SessionWriteLockTimeoutError but triggered by auto-compaction path
- #85913: EmbeddedAttemptSessionTakeoverError from lane race (heartbeat vs channel)
- #49603: orphaned lock files not cleared when PID matches current process
