openclaw - ✅(Solved) Fix Bug: Stale session lock after timeout leaves channel permanently dead [1 pull requests, 1 comments, 2 participants]

garrett22ge · 2026-04-03T01:24:59Z




GitHub stats

openclaw/openclaw#59983 • Fetched 2026-04-08 02:37:58

Author: garrett22ge
Participants: garrett22ge, openperf
Timeline (top): referenced ×2, commented ×1, cross-referenced ×1, mentioned ×1

After an agent run times out on a group session, a stale .jsonl.lock file is left behind with no corresponding .jsonl transcript. This causes the gateway to silently drop all subsequent inbound messages for that session — the channel appears permanently dead until manual intervention.

Error Message

All subsequent messages to the group are silently consumed — no response, no error, no log entry


Workaround

Manually delete the stale lock file:

rm ~/.openclaw/agents/main/sessions/<session-id>.jsonl.lock

Channel resumes on next inbound message.

PR fix notes

PR #60015: fix(session-lock): prevent stale lock file from permanently killing channel after timeout

Repository: openclaw/openclaw
Author: openperf
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/60015

Description (problem / solution / changelog)

Summary

Problem: When an agent run times out (e.g. Opus hitting turn timeout in a Signal group with no fallback configured), a .jsonl.lock file is left behind in the sessions directory with no corresponding .jsonl transcript. All subsequent inbound messages for that channel are silently consumed with no agent response — no logs, no errors, zero visibility. The only recovery is manually deleting the lock file. See #59983.
Root Cause: The finally block in src/agents/pi-embedded-runner/run/attempt.ts (line ~1964) executes several async cleanup steps — flushPendingToolResultsAfterIdle(), session?.dispose(), bundleLspRuntime?.dispose() — sequentially without individual error guards before calling sessionLock.release(). If any cleanup step throws, the finally block short-circuits and sessionLock.release() is never reached. This causes two cascading failures depending on the process lifecycle:

In-process cascade (long-running daemon): The HELD_LOCKS in-memory map retains a stale entry with count = 1. On subsequent message arrivals, acquireSessionWriteLock finds the entry and takes the reentrant path (count += 1 → 2). When that run completes, release() decrements count to 1 but never reaches 0, so the lock file is never deleted. The watchdog timer will eventually force-release after maxHoldMs (up to ~12 minutes for Opus timeouts), but until then the lock accumulates reentrant count that never drains.
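The in-process cascade can be reproduced with a toy model of a reentrant lock table. This is a minimal sketch, not the project's code: `heldLocks`, `lockFilesOnDisk`, `acquire`, and `release` are hypothetical stand-ins for `HELD_LOCKS`, the on-disk lock files, `acquireSessionWriteLock`, and `release()`, and the real structures may differ.

```typescript
// Hypothetical, simplified model of a reentrant in-memory lock table.
type HeldLock = { count: number };

const heldLocks = new Map<string, HeldLock>();
const lockFilesOnDisk = new Set<string>(); // stands in for the .jsonl.lock files

function acquire(sessionFile: string): void {
  const held = heldLocks.get(sessionFile);
  if (held) {
    held.count += 1; // reentrant path: no new lock file is created
    return;
  }
  heldLocks.set(sessionFile, { count: 1 });
  lockFilesOnDisk.add(`${sessionFile}.lock`);
}

function release(sessionFile: string): void {
  const held = heldLocks.get(sessionFile);
  if (!held) return;
  held.count -= 1;
  if (held.count > 0) return; // still counted as held, so the file stays
  heldLocks.delete(sessionFile);
  lockFilesOnDisk.delete(`${sessionFile}.lock`);
}

// Run 1 times out and its cleanup throws before release() is called.
acquire("group.jsonl");
// (release skipped)

// Run 2 arrives, takes the reentrant path, and releases normally.
acquire("group.jsonl");
release("group.jsonl");

console.log(heldLocks.get("group.jsonl")?.count); // 1
console.log(lockFilesOnDisk.has("group.jsonl.lock")); // true
```

In this model every later run increments and decrements the count symmetrically, so the single leaked increment keeps the count above zero and the lock file is never removed.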

Cross-process cascade (CLI/restart): If the process exits before the watchdog fires, the .lock file persists on disk while the HELD_LOCKS state is lost. On restart, acquireSessionWriteLock encounters the orphaned lock file. If the original PID is dead, stale detection reclaims it. However, on macOS, where PID spaces are smaller and recycling is faster, the original PID may have been reused by an unrelated process. Because macOS has no /proc from which to read starttime, PID recycling cannot be detected and the lock becomes permanently un-reclaimable: acquireSessionWriteLock times out and throws "session file locked", killing the channel.

Additionally, shouldTreatAsOrphanSelfLock in src/agents/session-write-lock.ts has a logic gap on Linux: when a lock file contains a valid starttime field (always the case on Linux), the function unconditionally returns false, preventing the orphan-self-lock reclaim path from ever activating for same-process orphans. This matters in the edge case where the watchdog's fs.rm fails (best-effort) but HELD_LOCKS is already cleared.
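The cross-process reasoning above depends on two checks: whether the PID recorded in the lock is still alive, and whether it still belongs to the same process. The following is a rough sketch of how such checks are commonly written in Node.js, not the project's implementation. `isPidAlive`, `readStartTime`, `lockOwnerIsGone`, and the `LockPayload` shape are hypothetical names chosen for illustration; only `process.kill(pid, 0)` and the layout of `/proc/<pid>/stat` on Linux are standard.

```typescript
import { readFileSync } from "node:fs";

// Returns true if a process with this PID currently exists.
// Signal 0 performs the existence check without delivering a signal.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Linux only: field 22 of /proc/<pid>/stat is the process start time,
// in clock ticks since boot. Returns null where /proc is unavailable.
function readStartTime(pid: number): number | null {
  try {
    const stat = readFileSync(`/proc/${pid}/stat`, "utf8");
    // The command name (field 2) is parenthesised and may contain spaces,
    // so only the part after the last ")" is split.
    const rest = stat.slice(stat.lastIndexOf(")") + 2).split(" ");
    // rest[0] is field 3 (state), so field 22 is rest[19].
    const value = Number(rest[19]);
    return Number.isFinite(value) ? value : null;
  } catch {
    return null;
  }
}

// Assumed lock file payload: the owner's PID plus an optional start time.
type LockPayload = { pid: number; starttime?: number };

function lockOwnerIsGone(lock: LockPayload): boolean {
  if (!isPidAlive(lock.pid)) return true;
  const current = readStartTime(lock.pid);
  if (lock.starttime === undefined || current === null) {
    return false; // cannot rule out that the original owner is still running
  }
  return current !== lock.starttime; // same PID, different process: recycled
}
```

On a platform without `/proc`, `readStartTime` returns `null`, so a live but recycled PID is indistinguishable from the original owner. That is the un-reclaimable case described for macOS.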
Fix:
1. Primary fix (attempt.ts): Wrap each cleanup step in the finally block with its own try/catch, ensuring sessionLock.release() is always reached regardless of cleanup failures. Errors are logged at warn level with the runId for debuggability. This eliminates the root cause — lock files can no longer leak due to cleanup exceptions.
2. Secondary fix (session-write-lock.ts): Restructure shouldTreatAsOrphanSelfLock to correctly detect same-process orphan locks on Linux by comparing the lock file's starttime against the current process's starttime via getProcessStartTime(process.pid). The HELD_LOCKS.has() guard is moved to the top so actively-held locks are never misidentified. On non-Linux platforms where getProcessStartTime returns null, the function conservatively falls back to false to avoid false reclaims.
The primary fix eliminates the root cause. The secondary fix provides defense-in-depth for the edge case where the primary fix is bypassed (e.g. process crash during cleanup, or fs.rm failure in watchdog force-release on Linux).
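The primary fix can be pictured with the sketch below. It is not the code from attempt.ts: the `guarded` helper, the `runAttempt` wrapper, and the option names are hypothetical, and the PR describes inline try/catch blocks rather than a shared helper. The point it illustrates is only that each cleanup step is isolated so that `sessionLock.release()` always runs.

```typescript
// Hypothetical stand-ins for the runner's collaborators.
type Disposable = { dispose(): Promise<void> | void };
type SessionLock = { release(): Promise<void> };
type Logger = { warn(message: string): void };

// Runs one cleanup step and converts any failure into a warning.
async function guarded(
  label: string,
  runId: string,
  log: Logger,
  step: () => Promise<void> | void,
): Promise<void> {
  try {
    await step();
  } catch (err) {
    log.warn(`cleanup step "${label}" failed for run ${runId}: ${String(err)}`);
  }
}

async function runAttempt(opts: {
  runId: string;
  log: Logger;
  sessionLock: SessionLock;
  session?: Disposable;
  bundleLspRuntime?: Disposable;
  flushPendingToolResults: () => Promise<void>;
  work: () => Promise<void>;
}): Promise<void> {
  const { runId, log } = opts;
  try {
    await opts.work();
  } finally {
    await guarded("flush", runId, log, opts.flushPendingToolResults);
    await guarded("session.dispose", runId, log, () => opts.session?.dispose());
    await guarded("lsp.dispose", runId, log, () => opts.bundleLspRuntime?.dispose());
    // Reached even if every step above threw.
    await opts.sessionLock.release();
  }
}
```

An error thrown by `work` (for example a timeout) still propagates to the caller; only cleanup failures are downgraded to warnings.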
What changed:
- src/agents/pi-embedded-runner/run/attempt.ts: Added individual try/catch guards around flushPendingToolResultsAfterIdle, session?.dispose(), and bundleLspRuntime?.dispose() in the finally block, with log.warn for each failure.
- src/agents/session-write-lock.ts: Restructured shouldTreatAsOrphanSelfLock — moved HELD_LOCKS.has() check first, then added starttime comparison logic with getProcessStartTime(process.pid) for Linux and conservative fallback for non-Linux.
- src/agents/session-write-lock.test.ts: Added 2 regression tests — one verifying orphan locks with valid starttime are reclaimed on Linux, one verifying actively-held locks with valid starttime are NOT reclaimed.
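The restructured orphan check can be sketched as a pure function. This is an illustration under stated assumptions, not the function from session-write-lock.ts: the argument object, the `LockPayload` shape, and the name `shouldTreatAsOrphanSelfLockSketch` are hypothetical. The PR text does not say how lock files without a `starttime` field are handled, so the sketch simply returns `false` there and the real function may differ.

```typescript
// Assumed lock file payload: the owner's PID plus an optional start time.
type LockPayload = { pid: number; starttime?: number };

function shouldTreatAsOrphanSelfLockSketch(args: {
  payload: LockPayload;
  sessionFile: string;
  heldLocks: ReadonlySet<string>; // stand-in for HELD_LOCKS
  selfPid: number;
  selfStartTime: number | null; // null where the start time cannot be read
}): boolean {
  // 1. An actively held lock is never an orphan.
  if (args.heldLocks.has(args.sessionFile)) return false;
  // 2. Only locks written by this PID are candidates.
  if (args.payload.pid !== args.selfPid) return false;
  // 3. Without a start time on both sides, stay conservative.
  //    (Assumption: the source does not describe this branch.)
  if (args.selfStartTime === null || args.payload.starttime === undefined) {
    return false;
  }
  // 4. Same PID and same start time: this process wrote the lock
  //    and no longer tracks it.
  return args.payload.starttime === args.selfStartTime;
}
```

The two regression cases described above correspond to step 4 returning `true` for an untracked lock with a matching start time, and step 1 returning `false` for a lock that is still in the held set.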
What did NOT change (scope boundary):
- No changes to lock acquisition logic, reentrant counting, or timeout behavior in acquireSessionWriteLock.
- No changes to watchdog timer, releaseHeldLock, or releaseAllLocksSync.
- No changes to inspectLockPayload, shouldReclaimContendedLockFile, or cross-process stale detection.
- No changes to gateway message routing, queue policy, or dispatch logic.
- No changes to run.ts failover/retry logic.
- releaseWsSession is not wrapped because it is synchronous and does not throw in practice.

Reproduction

1. Configure a Signal group channel with claude-opus-4-6 and no fallback model.
2. Send a message that triggers a long agent run.
3. Wait for the run to time out (or force timeout via config).
4. Observe that a .jsonl.lock file exists in the sessions directory but no .jsonl transcript.
5. Send another message to the same channel.
6. Observe that the message is silently consumed with no agent response.
7. Manually delete the .lock file → channel recovers immediately.
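The state observed during reproduction (a lock file with no transcript) can be checked with a small script. This is a convenience sketch and not part of OpenClaw: `findLocksWithoutTranscript` is a hypothetical helper, and it assumes only the naming described above, where `<session-id>.jsonl.lock` sits next to `<session-id>.jsonl`. The demo runs against a temporary directory rather than a real sessions directory.

```typescript
import { existsSync, mkdtempSync, readdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { basename, join } from "node:path";

// Returns lock files in sessionsDir that have no matching transcript.
function findLocksWithoutTranscript(sessionsDir: string): string[] {
  return readdirSync(sessionsDir)
    .filter((name) => name.endsWith(".jsonl.lock"))
    // Strip the ".lock" suffix to get the expected transcript name.
    .filter((name) => !existsSync(join(sessionsDir, name.slice(0, -".lock".length))))
    .map((name) => join(sessionsDir, name))
    .sort();
}

// Demo against a throwaway directory.
const dir = mkdtempSync(join(tmpdir(), "sessions-"));
writeFileSync(join(dir, "a.jsonl"), "");
writeFileSync(join(dir, "a.jsonl.lock"), ""); // lock with a transcript
writeFileSync(join(dir, "b.jsonl.lock"), ""); // lock without one: the reported state

const orphans = findLocksWithoutTranscript(dir);
console.log(orphans.map((p) => basename(p))); // [ 'b.jsonl.lock' ]
```

A lock without a transcript is not necessarily stale (a first run on a new session may not have written its transcript yet), so the script only reports candidates and deletes nothing.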

Risk / Mitigation

Risk: The individual try/catch blocks in the finally clause could mask cleanup errors that indicate deeper issues.
Mitigation: Each caught error is logged at warn level with the runId, so cleanup failures remain visible in logs. The cleanup operations (dispose, flush) are already best-effort by design; their failure should never prevent lock release. The shouldTreatAsOrphanSelfLock change is conservative: on non-Linux platforms (macOS), it falls back to the existing behavior (false), so there is no behavioral change on the issue reporter's platform. On Linux, the new starttime comparison is strictly more correct than the previous unconditional false. The new regression tests verify both the positive case (orphan reclaimed) and the negative case (active lock not reclaimed).

Change Type (select all)

Bug fix

Scope (select all touched areas)

Agent Runner
Session Management
Tests

Linked Issue/PR

Fixes #59983

Changed files

src/agents/pi-embedded-runner/run/attempt.ts (modified, +31/-7)
src/agents/session-write-lock.test.ts (modified, +54/-0)
src/agents/session-write-lock.ts (modified, +38/-2)


Description

Steps to Reproduce

1. Agent session begins processing a message in a Signal group
2. The run times out (e.g., Opus hit the turn timeout)
3. Gateway logs: embedded_run_failover_decision: timeout, aborted=true, fallbackConfigured=false
4. A .jsonl.lock file remains in the sessions directory, but no .jsonl transcript exists
5. All subsequent messages to the group are silently consumed: no response, no error, no log entry

Expected Behavior

Stale locks should be auto-cleaned after a timeout/abort
At minimum, the gateway should log a warning when it encounters a lock with no corresponding transcript
Inbound messages should not be silently dropped — either queue them or surface an error

Actual Behavior

Lock file persists indefinitely after timeout
Gateway silently drops all inbound messages for the locked session
No log entries for dropped messages — zero visibility into the problem
Manual deletion of the .lock file is required to restore the channel

Environment

OpenClaw: npm global install (latest as of 2026-04-02)
macOS (Apple Silicon)
Channel: Signal group
Model: claude-opus-4-6 (no fallback configured)

Suggested Fix

1. Auto-cleanup: After a timeout/abort, if the session has a .lock but no .jsonl, remove the stale lock
2. Staleness check: On inbound message routing, if a .lock exists but is older than N minutes (e.g., 5) with no active run, treat it as stale and remove it
3. Logging: When a message is routed to a session that has a lock with no transcript, log a WARN-level entry so operators have visibility
4. Graceful degradation: Consider queuing messages for locked sessions rather than dropping them silently
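The staleness check and logging suggestions could look roughly like the sketch below. It illustrates the reporter's proposal and is not what PR #60015 implements. `isLockStale` and `reclaimIfStale` are hypothetical helpers, the lock's age is approximated by the file's modification time, and whether a run is active is supplied by the caller because the gateway's own bookkeeping is not described here.

```typescript
import { existsSync, mkdtempSync, rmSync, statSync, utimesSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// A lock counts as stale when no run is active and its mtime is old enough.
function isLockStale(
  lockPath: string,
  hasActiveRun: boolean,
  maxAgeMs: number,
  nowMs: number = Date.now(),
): boolean {
  if (hasActiveRun || !existsSync(lockPath)) return false;
  return nowMs - statSync(lockPath).mtimeMs >= maxAgeMs;
}

// Removes a stale lock and reports it; returns true if a lock was removed.
function reclaimIfStale(
  lockPath: string,
  hasActiveRun: boolean,
  maxAgeMs: number,
  warn: (message: string) => void,
): boolean {
  if (!isLockStale(lockPath, hasActiveRun, maxAgeMs)) return false;
  warn(`removing stale session lock: ${lockPath}`);
  rmSync(lockPath, { force: true });
  return true;
}

// Demo: a lock file last touched ten minutes ago, checked with a 5-minute limit.
const dir = mkdtempSync(join(tmpdir(), "sessions-"));
const lockPath = join(dir, "group.jsonl.lock");
writeFileSync(lockPath, "");
const tenMinutesAgo = new Date(Date.now() - 10 * 60_000);
utimesSync(lockPath, tenMinutesAgo, tenMinutesAgo);

const warnings: string[] = [];
const FIVE_MINUTES_MS = 5 * 60_000;
const keptWhileActive = !reclaimIfStale(lockPath, true, FIVE_MINUTES_MS, (m) => warnings.push(m));
const reclaimed = reclaimIfStale(lockPath, false, FIVE_MINUTES_MS, (m) => warnings.push(m));
console.log(keptWhileActive, reclaimed, warnings.length);
```

The threshold needs care: the PR notes that the watchdog's maxHoldMs can reach roughly 12 minutes for Opus timeouts, so an age limit shorter than the longest legitimate hold is only safe when the "no active run" signal is reliable.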


extent analysis

TL;DR

Implement auto-cleanup of stale .jsonl.lock files after a timeout or abort to prevent the gateway from silently dropping inbound messages.

Guidance

Implement a staleness check to remove .lock files older than a specified time (e.g., 5 minutes) with no active run to prevent indefinite locking.
Add logging to warn operators when a message is routed to a session with a lock but no transcript, providing visibility into the issue.
Consider queuing messages for locked sessions instead of dropping them silently to improve robustness.
Manually deleting the stale lock file can serve as a temporary workaround to restore the channel.

Example

No code snippet is provided as the issue suggests configuration or implementation changes rather than a specific code fix.

Notes

The suggested fixes and workarounds assume that the issue is primarily related to the handling of .jsonl.lock files and the gateway's behavior upon encountering them. The effectiveness of these suggestions may depend on the specific implementation details of OpenClaw and its handling of session locks and timeouts.

Recommendation

Apply the suggested fixes, particularly the auto-cleanup of stale locks and the addition of logging for visibility, to address the issue of silently dropped messages due to stale locks. This approach directly targets the identified problem and provides a more robust handling of session timeouts and locks.


#api #ssr #installation #API versioning #request timeout
