openclaw: Fix SessionWriteLockTimeoutError: gateway never releases session file lock after embedded run timeout


Bug Report

Version: 2026.5.22 (also present in 2026.5.20)
Platform: Docker (Debian-based), init: true (tini PID 1), Node.js v24.14.0

Summary

The gateway acquires a write lock on a session .jsonl file during an embedded agent run. If the run times out or fails, the lock is never released. All subsequent requests to that session block for 60 seconds waiting on the lock, then fail with SessionWriteLockTimeoutError. The retained session context leaks memory, contributing to monotonic RSS growth.

Reproduction Steps

  1. Configure an agent with timeoutSeconds: 300 (or any finite timeout)
  2. Trigger an embedded agent run (e.g., via cron agentTurn payload, or subagent spawn)
  3. Ensure the run exceeds the timeout, or a tool call inside it errors/stalls
  4. Observe that the .lock file on the session .jsonl persists indefinitely
  5. Any subsequent request to the same session fails with SessionWriteLockTimeoutError

Observed Behavior

The lock file remains on disk with pid matching the gateway process. The gateway itself holds the lock — it is not an orphaned child process. The lock's maxHoldMs grows unboundedly.

Log Evidence

2026-05-24T10:10:19.763+00:00 [agent/embedded] embedded run timeout: runId=e295c559-441d-459e-aea1-a5f2268a8a10 sessionId=555b0189-2ff8-483b-87f5-ebab41995342 timeoutMs=300000

2026-05-24T10:11:51.382+00:00 [ws] ⇄ res ✗ agent errorCode=UNAVAILABLE errorMessage=SessionWriteLockTimeoutError: session file locked (timeout 60000ms): pid=7 /root/.openclaw/agents/main/sessions/555b0189-2ff8-483b-87f5-ebab41995342.jsonl.lock: code=OPENCLAW_SESSION_WRITE_LOCK_TIMEOUT

2026-05-24T10:14:01.488+00:00 Embedded agent failed before reply: session file locked (timeout 60000ms): pid=7 /root/.openclaw/agents/main/sessions/555b0189-2ff8-483b-87f5-ebab41995342.jsonl.lock
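
To find which sessions are stuck, the lock path and holder PID can be pulled out of these messages. The following is a minimal sketch: `parseLockTimeout` and `LockTimeoutInfo` are hypothetical names, and the pattern is inferred only from the log lines quoted above, so other versions may format the message differently.

```typescript
// Hypothetical helper: extract lock details from a gateway log line.
// The pattern is inferred from the log lines quoted in this report only.
interface LockTimeoutInfo {
  timeoutMs: number;
  pid: number;
  lockPath: string;
}

const LOCK_TIMEOUT_PATTERN =
  /session file locked \(timeout (\d+)ms\): pid=(\d+) (\S+\.lock)/;

function parseLockTimeout(line: string): LockTimeoutInfo | null {
  const match = LOCK_TIMEOUT_PATTERN.exec(line);
  if (match === null) {
    return null; // not a lock-timeout line
  }
  return {
    timeoutMs: Number(match[1]),
    pid: Number(match[2]),
    lockPath: String(match[3]),
  };
}
```

Grouping the parsed results by `lockPath` shows how many requests failed against each stuck session.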

Lock File Contents (19 minutes after timeout)

{
  "pid": 7,
  "createdAt": "2026-05-24T10:29:15.171Z",
  "maxHoldMs": 1020000,
  "starttime": 76698844
}

Note: pid: 7 is the gateway process itself (PID 1 is tini). The lock is held by the gateway, not an orphaned child.
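
A lock file like the one above can be inspected programmatically. This is a rough sketch under stated assumptions: the field names come from the file shown, but their exact semantics are not documented in this report, and `SessionLockMeta`, `lockAgeMs`, and `isProcessAlive` are hypothetical names. The liveness check must run in the same PID namespace as the gateway (inside the container) to be meaningful.

```typescript
// Shape of the lock file as shown in this report; field semantics are assumed.
interface SessionLockMeta {
  pid: number;
  createdAt: string; // ISO 8601 timestamp
  maxHoldMs: number;
  starttime: number;
}

// Milliseconds elapsed between the lock's createdAt and `now`.
function lockAgeMs(meta: SessionLockMeta, now: Date): number {
  return now.getTime() - new Date(meta.createdAt).getTime();
}

// True if a process with this PID exists. Signal 0 only performs the
// existence and permission check; no signal is delivered.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}
```

In the case reported here the holder is alive (it is the gateway itself), so a liveness check alone would not classify this lock as stale; the age has to be considered as well.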

Impact

  • Memory leak: Each stuck lock holds the full session object graph in memory (message history, tool results, thinking blocks). With cron jobs spawning dozens of agentTurn sessions, RSS grows monotonically at 40–160 MB/min until OOM.
  • Session unavailability: The locked session becomes permanently inaccessible until the container is restarted or the lock file is manually deleted.
  • Cascading failures: With runRetries at default (max: 160), each retry attempt against the locked session adds more to the retained graph without releasing the previous attempt.

Root Cause Hypothesis

The embedded run timeout/error handler does not release the session file write lock in its cleanup path. The lock acquisition likely happens in the session write pipeline, and the timeout interrupts execution after the lock is acquired but before the finally block (or equivalent cleanup) runs.

Specifically:

  1. Gateway acquires write lock on {session}.jsonl.lock
  2. Embedded run starts (tool calls, model API calls)
  3. Run times out or tool call fails
  4. Error propagates but bypasses the lock release path
  5. Lock file persists on disk, gateway retains in-memory references
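
The cleanup pattern the hypothesis refers to can be sketched as follows. This is not openclaw's actual code: `withSessionLock` is a hypothetical helper, written synchronously for brevity, which shows the lock being released in a `finally` block so that a thrown timeout or tool error cannot skip the release.

```typescript
import * as fs from "node:fs";

// Hypothetical helper, not openclaw's API: hold a lock file for the
// duration of `run` and always remove it afterwards.
function withSessionLock<T>(lockPath: string, run: () => T): T {
  // The "wx" flag makes openSync fail with EEXIST if the lock already exists.
  const fd = fs.openSync(lockPath, "wx");
  try {
    const meta = { pid: process.pid, createdAt: new Date().toISOString() };
    fs.writeSync(fd, JSON.stringify(meta));
    return run();
  } finally {
    // Runs on normal return and on any thrown error, including a timeout.
    fs.closeSync(fd);
    fs.rmSync(lockPath, { force: true });
  }
}
```

An asynchronous version would `await run()` inside the same `try`. Note that a timeout built on `Promise.race` only rejects the caller that is waiting; it does not stop the underlying run, so the run itself also needs to be cancelled before the lock is handed to another writer.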

Environment Details

  • Docker container with init: true (tini as PID 1 for zombie reaping)
  • 74 cron jobs using agentTurn payload (each spawns a full LLM session)
  • Multiple agents configured in single container
  • Gateway PID: 7, Docker memory limit: 6GB
  • Host: 2× Xeon Gold 6248 (40C/80T), 125GB RAM

Workaround

  1. Automated stale lock cleanup (delete .lock files older than 2 minutes via cron)
  2. Reduce runRetries.max from 160 to 32 (limits churn on stuck sessions)
  3. Set timeoutSeconds and subagents.runTimeoutSeconds to finite values (prevents unbounded runs)
  4. Container restart clears all locks and releases retained memory
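
Workaround 1 could look roughly like the sketch below. `removeStaleLocks` is a hypothetical helper, and measuring age by the file's modification time is an assumption: if the gateway rewrites the lock file while holding it, the `createdAt` field or another criterion may be needed instead. Deleting the file does not free the memory the gateway retains; only a restart does that.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical cleanup helper: delete *.lock files in `sessionsDir` whose
// mtime is older than `maxAgeMs`. Returns the paths that were removed.
function removeStaleLocks(
  sessionsDir: string,
  maxAgeMs: number,
  nowMs: number = Date.now(),
): string[] {
  const removed: string[] = [];
  for (const name of fs.readdirSync(sessionsDir)) {
    if (!name.endsWith(".lock")) continue; // leave session .jsonl files alone
    const lockPath = path.join(sessionsDir, name);
    let stat: fs.Stats;
    try {
      stat = fs.statSync(lockPath);
    } catch {
      continue; // the lock was released between readdir and stat
    }
    if (nowMs - stat.mtimeMs > maxAgeMs) {
      fs.rmSync(lockPath, { force: true });
      removed.push(lockPath);
    }
  }
  return removed;
}
```

For the two-minute threshold above, a scheduled job would call `removeStaleLocks(dir, 120000)` for each agent's sessions directory, for example the `/root/.openclaw/agents/main/sessions` path seen in the logs. The threshold should stay above the longest time a healthy run holds the lock, otherwise the job may remove a lock that is still in legitimate use.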

Expected Behavior

When an embedded run times out or fails, the gateway should:

  1. Release the session file write lock immediately
  2. Release all in-memory references to the session context
  3. Log the timeout/failure cleanly
  4. Allow subsequent requests to the session to proceed normally
