openclaw - ✅ (Solved) Fix Bug: Stale session lock after timeout leaves channel permanently dead [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#59983 · Fetched 2026-04-08 02:37:58
Comments: 1 · Participants: 2 · Timeline: 6 · Reactions: 1
Timeline (top): referenced ×2, commented ×1, cross-referenced ×1, mentioned ×1

After an agent run times out on a group session, a stale .jsonl.lock file is left behind with no corresponding .jsonl transcript. This causes the gateway to silently drop all subsequent inbound messages for that session — the channel appears permanently dead until manual intervention.

Error Message

No error is emitted; the symptom is silence:

  • All subsequent messages to the group are silently consumed: no response, no error, no log entry.
  • Expected instead: inbound messages are never silently dropped; they are either queued or surfaced as an error.
  • Expected instead: when a message is routed to a session that has a lock with no transcript, a WARN-level entry is logged so operators have visibility.

Fix Action

Workaround

Manually delete the stale lock file:

rm ~/.openclaw/agents/main/sessions/<session-id>.jsonl.lock

Channel resumes on next inbound message.

PR fix notes

PR #60015: fix(session-lock): prevent stale lock file from permanently killing channel after timeout

Description (problem / solution / changelog)

Summary

  • Problem: When an agent run times out (e.g. Opus hitting turn timeout in a Signal group with no fallback configured), a .jsonl.lock file is left behind in the sessions directory with no corresponding .jsonl transcript. All subsequent inbound messages for that channel are silently consumed with no agent response — no logs, no errors, zero visibility. The only recovery is manually deleting the lock file. See #59983.

  • Root Cause: The finally block in src/agents/pi-embedded-runner/run/attempt.ts (line ~1964) runs several async cleanup steps (flushPendingToolResultsAfterIdle(), session?.dispose(), bundleLspRuntime?.dispose()) sequentially, without individual error guards, before calling sessionLock.release(). If any cleanup step throws, the rest of the finally block is skipped and sessionLock.release() is never reached. This causes two cascading failures depending on the process lifecycle:

    In-process cascade (long-running daemon): The HELD_LOCKS in-memory map retains a stale entry with count = 1. On subsequent message arrivals, acquireSessionWriteLock finds the entry and takes the reentrant path (count += 1 → 2). When that run completes, release() decrements count to 1 but never reaches 0, so the lock file is never deleted. The watchdog timer will eventually force-release after maxHoldMs (up to ~12 minutes for Opus timeouts), but until then the lock accumulates reentrant count that never drains.

    Cross-process cascade (CLI/restart): If the process exits before the watchdog fires, the .lock file persists on disk while HELD_LOCKS state is lost. On restart, acquireSessionWriteLock encounters the orphaned lock file. If the original PID is dead, stale-detection reclaims it. However, on macOS, where PID spaces are smaller and recycling is faster, the original PID may be reused by an unrelated process. Without starttime (which is read via /proc and is therefore unavailable on macOS), PID recycling cannot be detected, and the lock becomes permanently un-reclaimable: acquireSessionWriteLock times out and throws "session file locked", killing the channel.

    Additionally, shouldTreatAsOrphanSelfLock in src/agents/session-write-lock.ts has a logic gap on Linux: when a lock file contains a valid starttime field (always the case on Linux), the function unconditionally returns false, preventing the orphan-self-lock reclaim path from ever activating for same-process orphans. This matters in the edge case where the watchdog's fs.rm fails (best-effort) but HELD_LOCKS is already cleared.

  • Fix:

    1. Primary fix (attempt.ts): Wrap each cleanup step in the finally block with its own try/catch, ensuring sessionLock.release() is always reached regardless of cleanup failures. Errors are logged at warn level with the runId for debuggability. This eliminates the root cause — lock files can no longer leak due to cleanup exceptions.
    2. Secondary fix (session-write-lock.ts): Restructure shouldTreatAsOrphanSelfLock to correctly detect same-process orphan locks on Linux by comparing the lock file's starttime against the current process's starttime via getProcessStartTime(process.pid). The HELD_LOCKS.has() guard is moved to the top so actively-held locks are never misidentified. On non-Linux platforms where getProcessStartTime returns null, the function conservatively falls back to false to avoid false reclaims.

    The primary fix eliminates the root cause. The secondary fix provides defense-in-depth for the edge case where the primary fix is bypassed (e.g. process crash during cleanup, or fs.rm failure in watchdog force-release on Linux).

  • What changed:

    • src/agents/pi-embedded-runner/run/attempt.ts: Added individual try/catch guards around flushPendingToolResultsAfterIdle, session?.dispose(), and bundleLspRuntime?.dispose() in the finally block, with log.warn for each failure.
    • src/agents/session-write-lock.ts: Restructured shouldTreatAsOrphanSelfLock — moved HELD_LOCKS.has() check first, then added starttime comparison logic with getProcessStartTime(process.pid) for Linux and conservative fallback for non-Linux.
    • src/agents/session-write-lock.test.ts: Added 2 regression tests — one verifying orphan locks with valid starttime are reclaimed on Linux, one verifying actively-held locks with valid starttime are NOT reclaimed.
  • What did NOT change (scope boundary):

    • No changes to lock acquisition logic, reentrant counting, or timeout behavior in acquireSessionWriteLock.
    • No changes to watchdog timer, releaseHeldLock, or releaseAllLocksSync.
    • No changes to inspectLockPayload, shouldReclaimContendedLockFile, or cross-process stale detection.
    • No changes to gateway message routing, queue policy, or dispatch logic.
    • No changes to run.ts failover/retry logic.
    • releaseWsSession is not wrapped because it is synchronous and does not throw in practice.
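
The difference between the unguarded and guarded finally block can be sketched in isolation. This is a minimal illustration, not the code from attempt.ts: SketchLock, finishUnguarded, finishGuarded, and the warnings array are hypothetical stand-ins for the session lock, the cleanup sequence, and log.warn.

```typescript
// Hypothetical stand-ins for illustration; none of these names come from the OpenClaw source.
type Step = () => Promise<void>;

class SketchLock {
  held = true;
  async release(): Promise<void> {
    this.held = false;
  }
}

// Unguarded: the first cleanup step that throws skips everything after it,
// including lock.release(), so the lock stays held.
async function finishUnguarded(run: Step, cleanup: Step[], lock: SketchLock): Promise<void> {
  try {
    await run();
  } finally {
    for (const step of cleanup) {
      await step(); // a throw here skips the release below
    }
    await lock.release();
  }
}

// Guarded: each cleanup step has its own try/catch, so release() is always reached.
async function finishGuarded(
  run: Step,
  cleanup: Array<{ name: string; run: Step }>,
  lock: SketchLock,
  runId: string,
  warnings: string[], // stand-in for a warn-level logger
): Promise<void> {
  try {
    await run();
  } finally {
    for (const step of cleanup) {
      try {
        await step.run();
      } catch (err) {
        warnings.push(`[${runId}] cleanup step "${step.name}" failed: ${String(err)}`);
      }
    }
    await lock.release();
  }
}
```

With a cleanup step that throws, finishUnguarded rejects and leaves the lock held, while finishGuarded records one warning and still releases the lock. Recording the runId with each warning mirrors the PR's stated goal of keeping cleanup failures visible rather than swallowing them.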

Reproduction

  1. Configure a Signal group channel with claude-opus-4-6 and no fallback model.
  2. Send a message that triggers a long agent run.
  3. Wait for the run to time out (or force timeout via config).
  4. Observe that a .jsonl.lock file exists in the sessions directory but no .jsonl transcript.
  5. Send another message to the same channel.
  6. Observe that the message is silently consumed with no agent response.
  7. Manually delete the .lock file → channel recovers immediately.
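
Step 4 above (a .jsonl.lock with no matching .jsonl) can be checked mechanically. The following is a small sketch under the assumption that lock files are named exactly like the transcript plus a ".lock" suffix, as the workaround path suggests; findOrphanLocks is a hypothetical helper, not part of OpenClaw.

```typescript
// Hypothetical helper: given the file names in a sessions directory, return the
// lock files that have no corresponding transcript next to them.
function findOrphanLocks(fileNames: readonly string[]): string[] {
  const present = new Set(fileNames);
  const lockSuffix = ".lock";
  return fileNames
    .filter((name) => name.endsWith(".jsonl.lock"))
    // "abc.jsonl.lock" -> "abc.jsonl"; keep the lock only if that transcript is missing
    .filter((name) => !present.has(name.slice(0, -lockSuffix.length)))
    .sort();
}
```

The input could come from a directory listing of the sessions directory (for example via Node's fs.readdirSync). The function only reports candidates; whether a reported lock is safe to delete still depends on no run being active for that session.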

Risk / Mitigation

  • Risk: The individual try/catch blocks in the finally clause could mask cleanup errors that indicate deeper issues.
  • Mitigation: Each caught error is logged at warn level with the runId, ensuring full visibility in logs. The cleanup operations (dispose, flush) are already best-effort by design — their failure should never prevent lock release. The shouldTreatAsOrphanSelfLock change is conservative: on non-Linux (macOS), it falls back to the existing behavior (false), so there is zero behavioral change for the Issue reporter's platform. On Linux, the new starttime comparison is strictly more correct than the previous unconditional false. The new regression tests verify both the positive case (orphan reclaimed) and the negative case (active lock not reclaimed).
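
The decision order described for shouldTreatAsOrphanSelfLock can be sketched as a pure function. This is one reading of the PR text, not the actual implementation: the payload shape, the parameter names, and looksLikeOrphanSelfLock itself are assumptions, and the real function may handle a lock file without a starttime field differently than this sketch does.

```typescript
// Hypothetical shapes for illustration; the real lock payload and helpers may differ.
interface LockPayload {
  pid: number;
  starttime?: number; // process start time recorded when the lock was written
}

interface OrphanCheckInput {
  lockPath: string;
  payload: LockPayload;
  heldLocks: ReadonlySet<string>; // stand-in for the keys of the HELD_LOCKS map
  currentPid: number;
  currentStartTime: number | null; // null where the start time cannot be read (e.g. non-Linux)
}

function looksLikeOrphanSelfLock(input: OrphanCheckInput): boolean {
  // 1. A lock this process still tracks as held is never an orphan.
  if (input.heldLocks.has(input.lockPath)) return false;
  // 2. Only a lock written under this PID can be a self-lock.
  if (input.payload.pid !== input.currentPid) return false;
  // 3. Without a start time on both sides, PID reuse cannot be ruled out,
  //    so the sketch stays conservative (an assumption about the fallback).
  if (input.currentStartTime === null || input.payload.starttime === undefined) return false;
  // 4. Same PID and same start time: written by this very process, yet no longer tracked.
  return input.payload.starttime === input.currentStartTime;
}
```

Checking the held-locks set first corresponds to the PR's note that actively held locks must never be misidentified; the start-time comparison is what distinguishes this process from an unrelated one that happens to reuse the PID.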

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Agent Runner
  • Session Management
  • Tests

Linked Issue/PR

Fixes #59983

Changed files

  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +31/-7)
  • src/agents/session-write-lock.test.ts (modified, +54/-0)
  • src/agents/session-write-lock.ts (modified, +38/-2)

Description

After an agent run times out on a group session, a stale .jsonl.lock file is left behind with no corresponding .jsonl transcript. This causes the gateway to silently drop all subsequent inbound messages for that session — the channel appears permanently dead until manual intervention.

Steps to Reproduce

  1. Agent session begins processing a message in a Signal group
  2. The run times out (e.g., Opus hit the turn timeout)
  3. Gateway logs: embedded_run_failover_decision: timeout, aborted=true, fallbackConfigured=false
  4. A .jsonl.lock file remains in the sessions directory, but no .jsonl transcript exists
  5. All subsequent messages to the group are silently consumed — no response, no error, no log entry

Expected Behavior

  • Stale locks should be auto-cleaned after a timeout/abort
  • At minimum, the gateway should log a warning when it encounters a lock with no corresponding transcript
  • Inbound messages should not be silently dropped — either queue them or surface an error

Actual Behavior

  • Lock file persists indefinitely after timeout
  • Gateway silently drops all inbound messages for the locked session
  • No log entries for dropped messages — zero visibility into the problem
  • Manual deletion of the .lock file is required to restore the channel

Environment

  • OpenClaw: npm global install (latest as of 2026-04-02)
  • macOS (Apple Silicon)
  • Channel: Signal group
  • Model: claude-opus-4-6 (no fallback configured)

Suggested Fix

  1. Auto-cleanup: After a timeout/abort, if the session has a .lock but no .jsonl, remove the stale lock
  2. Staleness check: On inbound message routing, if a .lock exists but is older than N minutes (e.g., 5) with no active run, treat it as stale and remove it
  3. Logging: When a message is routed to a session that has a lock with no transcript, log a WARN-level entry so operators have visibility
  4. Graceful degradation: Consider queuing messages for locked sessions rather than dropping them silently
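
Suggestions 2 and 3 above can be sketched together as a single check run at message-routing time. This is an illustration of the reporter's proposal, not behavior OpenClaw is known to implement: LockObservation, classifyLock, and the default 5-minute threshold (taken from the "e.g., 5" in the suggestion) are hypothetical.

```typescript
// Hypothetical inputs: how these facts are gathered (file mtime, run registry) is out of scope.
interface LockObservation {
  lockMtimeMs: number; // last-modified time of the .jsonl.lock file
  nowMs: number; // current time
  hasTranscript: boolean; // does the matching .jsonl exist?
  hasActiveRun: boolean; // is a run currently in progress for this session?
}

interface LockVerdict {
  stale: boolean; // true when the lock may be removed
  warning?: string; // WARN-level message for operators, if any
}

const DEFAULT_MAX_AGE_MS = 5 * 60 * 1000; // assumed threshold of 5 minutes

function classifyLock(obs: LockObservation, maxAgeMs: number = DEFAULT_MAX_AGE_MS): LockVerdict {
  const ageMs = obs.nowMs - obs.lockMtimeMs;
  // A lock is stale only if nothing is running and it has outlived the threshold.
  const stale = !obs.hasActiveRun && ageMs > maxAgeMs;
  // A lock without a transcript is always worth surfacing, stale or not.
  const warning = obs.hasTranscript
    ? undefined
    : `session lock has no transcript (age ${Math.round(ageMs / 1000)}s, stale=${stale})`;
  return { stale, warning };
}
```

Keeping the verdict separate from the side effects (deleting the file, writing the log line) makes the policy easy to test and leaves the caller free to queue the inbound message instead of dropping it.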

Workaround

Manually delete the stale lock file:

rm ~/.openclaw/agents/main/sessions/<session-id>.jsonl.lock

Channel resumes on next inbound message.

extent analysis

TL;DR

Implement auto-cleanup of stale .jsonl.lock files after a timeout or abort to prevent the gateway from silently dropping inbound messages.

Guidance

  • Implement a staleness check to remove .lock files older than a specified time (e.g., 5 minutes) with no active run to prevent indefinite locking.
  • Add logging to warn operators when a message is routed to a session with a lock but no transcript, providing visibility into the issue.
  • Consider queuing messages for locked sessions instead of dropping them silently to improve robustness.
  • Manually deleting the stale lock file can serve as a temporary workaround to restore the channel.

Example

No code snippet is provided as the issue suggests configuration or implementation changes rather than a specific code fix.

Notes

The suggested fixes and workarounds assume that the issue is primarily related to the handling of .jsonl.lock files and the gateway's behavior upon encountering them. The effectiveness of these suggestions may depend on the specific implementation details of OpenClaw and its handling of session locks and timeouts.

Recommendation

Apply the suggested fixes, particularly the auto-cleanup of stale locks and the addition of logging for visibility, to address the issue of silently dropped messages due to stale locks. This approach directly targets the identified problem and provides a more robust handling of session timeouts and locks.
