openclaw - ✅(Solved) Fix BUG: Agent session lock not released after crash/SIGKILL — blocks all subsequent runs [1 pull requests, 5 comments, 2 participants]

GitHub stats
openclaw/openclaw#70004 (fetched 2026-04-23 07:30:32)
Comments: 5 · Participants: 2 · Timeline events: 10 · Reactions: 0
Timeline (top): commented ×5, cross-referenced ×3, mentioned ×1, subscribed ×1

When an agent run crashes (SIGKILL) or times out, the session lock file (*.jsonl.lock) is NOT released. This prevents ALL subsequent agent runs from starting, regardless of which model is requested. The lock blocks the entire agent for ALL models in the fallback chain.

Error Message

Error: session file locked (timeout 10000ms): pid=<old_pid> /path/to/<uuid>.jsonl.lock

Root Cause

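The supervisor cancels the agent process with SIGKILL (on timeout or when a new request arrives). SIGKILL cannot be caught, so the process's cleanup handlers never run and the session lock file (*.jsonl.lock) stays on disk. Because nothing checks whether the PID recorded in the lock is still alive, every later run waits the full 10000ms on the stale lock and then fails, for every model in the fallback chain.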

PR fix notes

PR #70094: fix(agents): send SIGTERM instead of SIGKILL to allow lock cleanup (#70026)

Description (problem / solution / changelog)

Fixes #70026 — Supervisor was sending SIGKILL instead of SIGTERM, bypassing CLEANUP_SIGNALS so releaseAllLocksSync() never runs, causing session lock files to persist and cascade into subsequent run failures.

Root cause: supervisor.ts cancelAdapter sent adapter.kill('SIGKILL') which terminates the process immediately without running cleanup handlers.

Fix: Changed to adapter.kill('SIGTERM') to allow graceful shutdown including session lock cleanup.

This also resolves #70004 (session lock cascade) which was caused by the same issue.
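The PR only swaps the signal, but a supervisor that sends SIGTERM usually still needs a way to stop a child that ignores it. The TypeScript sketch below shows one common pattern: SIGTERM first, SIGKILL only after a grace period. It is not taken from the PR. The `Killable` interface, `cancelGracefully`, the 5-second default, and the injectable `schedule` parameter are all assumptions made for illustration.

```typescript
// Hypothetical minimal view of a supervised child; not OpenClaw's real adapter type.
interface Killable {
  kill(signal: "SIGTERM" | "SIGKILL"): void;
  hasExited(): boolean;
}

// Injectable timer so the escalation logic can be exercised without real delays.
type Scheduler = (fn: () => void, delayMs: number) => void;

const defaultScheduler: Scheduler = (fn, delayMs) => {
  setTimeout(fn, delayMs);
};

function cancelGracefully(
  child: Killable,
  graceMs: number = 5_000, // assumed default, not an OpenClaw setting
  schedule: Scheduler = defaultScheduler,
): void {
  // SIGTERM can be caught, so the child's cleanup handlers may release its locks.
  child.kill("SIGTERM");

  // Escalate only if the child is still running once the grace period ends.
  schedule(() => {
    if (!child.hasExited()) {
      child.kill("SIGKILL");
    }
  }, graceMs);
}
```

If the escalation path is ever taken, the lock is left behind again, so this pattern works best together with stale-lock detection on the acquiring side.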

Changed files

  • extensions/browser/src/browser/pw-session.test.ts (modified, +75/-0)
  • extensions/browser/src/browser/pw-session.ts (modified, +65/-0)
  • extensions/browser/src/browser/pw-tools-core.browser-ssrf-guard.test.ts (modified, +1/-0)
  • extensions/browser/src/browser/pw-tools-core.snapshot.ts (modified, +10/-1)
  • extensions/feishu/package.json (modified, +3/-0)
  • extensions/telegram/package.json (modified, +1/-1)
  • extensions/telegram/src/bot-message-context.body.ts (modified, +10/-1)
  • extensions/whatsapp/src/auto-reply.web-auto-reply.last-route.test.ts (modified, +109/-0)
  • extensions/whatsapp/src/auto-reply/monitor/on-message.ts (modified, +6/-0)
  • src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.custom-provider-payloads.test.ts (added, +113/-0)
  • src/agents/pi-embedded-subscribe.ts (modified, +5/-3)
  • src/agents/sandbox/remote-fs-bridge.test.ts (modified, +57/-0)
  • src/agents/sandbox/remote-fs-bridge.ts (modified, +16/-2)
  • src/infra/system-events.test.ts (modified, +31/-0)
  • src/infra/system-events.ts (modified, +6/-1)
  • src/plugins/bundled-capability-runtime.ts (modified, +1/-1)
  • src/plugins/bundled-channel-config-metadata.ts (modified, +1/-1)
  • src/plugins/loader.ts (modified, +1/-1)
  • src/plugins/public-surface-loader.ts (modified, +2/-2)
  • src/plugins/source-loader.ts (modified, +1/-1)
  • src/process/supervisor/supervisor.ts (modified, +1/-1)
  • src/tasks/task-registry.audit.test.ts (modified, +77/-0)
  • src/tasks/task-registry.ts (modified, +7/-4)


Bug Report: Agent Session Lock Not Released After Crash/SIGKILL

Summary

When an agent run crashes (SIGKILL) or times out, the session lock file (*.jsonl.lock) is NOT released. This prevents ALL subsequent agent runs from starting, regardless of which model is requested. The lock blocks the entire agent for ALL models in the fallback chain.

Environment

  • OpenClaw Version: v2026.4.15 (also observed on v2026.4.14)
  • OS: macOS 15.4.1 (Darwin 25.4.0 arm64)
  • Node.js: v25.8.1
  • Shell: zsh
  • Host: Mac mini (Apple Silicon)

Configuration

"agents": {
  "defaults": {
    "model": {
      "primary": "ollama/glm-5.1:cloud",
      "fallbacks": [
        "ollama/qwen3.5:397b-cloud",
        "ollama/glm-5.1:cloud",
        "xai/grok-4-1-fast-reasoning",
        "anthropic/claude-haiku-4-5"
      ]
    }
  }
}

Steps to Reproduce

  1. Start a long-running agent: openclaw agent --agent coder --message "complex task" --timeout 300
  2. While the agent is running (e.g., at 12+ minutes), send a new agent command OR the gateway sends a heartbeat check
  3. First agent gets SIGKILL'd by supervisor (timeout or new request)
  4. Lock file remains: agents/coder/sessions/<uuid>.jsonl.lock
  5. All subsequent agent runs fail immediately with:
    Error: session file locked (timeout 10000ms): pid=<old_pid> /path/to/<uuid>.jsonl.lock
  6. This cascades through ALL fallback models (5 attempts, all fail with same lock)

Observed Behavior

Error Pattern (repeated every ~10 seconds for 5 models):

session file locked (timeout 10000ms): pid=25358 /Users/botje/.openclaw/agents/coder/sessions/b68174ac-1d06-4f19-a8f7-f055b2fa51af.jsonl.lock
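The error text carries everything needed to identify the blocking lock: the wait timeout, the owner PID, and the lock path. A minimal TypeScript sketch for extracting those fields follows. The pattern is inferred from the samples in this report rather than from OpenClaw's source, and `SessionLockError`, `LOCK_ERROR_PATTERN`, and `parseSessionLockError` are hypothetical names.

```typescript
// Fields recoverable from the error text shown in this report.
interface SessionLockError {
  timeoutMs: number;
  pid: number;
  lockPath: string;
}

// Matches e.g. "session file locked (timeout 10000ms): pid=25358 /path/x.jsonl.lock".
// Inferred from the report's samples; the real message format may differ.
const LOCK_ERROR_PATTERN =
  /session file locked \(timeout (\d+)ms\): pid=(\d+) (.+\.lock)/;

function parseSessionLockError(message: string): SessionLockError | null {
  const match = LOCK_ERROR_PATTERN.exec(message);
  if (match === null) {
    return null; // not a session-lock error
  }
  return {
    timeoutMs: Number(match[1]),
    pid: Number(match[2]),
    lockPath: String(match[3]),
  };
}
```

The extracted PID can then be passed to a liveness check to tell a stale lock from one that is genuinely held.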

Full Fallback Chain Failure:

All models failed (5):
- ollama/kimi-k2.6:cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- ollama/qwen3.5:397b-cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- ollama/glm-5.1:cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- xai/grok-4-1-fast: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- anthropic/claude-haiku-4-5: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)

Key Observations:

  1. Lock persists until manually deleted: the lock from pid=25358 (from a run at ~07:00) was still blocking runs at ~07:43, more than 40 minutes later
  2. Lock blocks ALL models: not just the original model; every fallback model fails on the SAME lock
  3. SIGKILL doesn't clean up: when the supervisor kills the process, the lock file remains on disk
  4. Hardcoded timeout: the 10000ms (10s) timeout appears to be hardcoded, not configurable
  5. No automatic cleanup: there is no mechanism to detect stale locks (e.g., checking whether the PID is still alive)

Log Evidence

Repeated Lock Errors (40+ minutes):

# From 07:03 to 07:43 — same lock file blocks all attempts
# pid=22626: initial run
# pid=25358: subsequent run
# Both locks persisted until manually deleted

Model Fallback Decisions (all failing on same lock):

{
  "event": "model_fallback_decision",
  "runId": "cd6ed7c8-3cb3-47e2-a572-3c44c80d5ec0",
  "decision": "candidate_failed",
  "requestedModel": "kimi-k2.6:cloud",
  "candidateModel": "kimi-k2.6:cloud",
  "attempt": 1,
  "reason": "timeout",
  "errorPreview": "session file locked (timeout 10000ms): pid=25358 ...jsonl.lock"
}
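Because the lock error is identical for every candidate, the fallback events themselves are enough to detect the stuck state automatically instead of by hand. A minimal TypeScript sketch, assuming events shaped like the excerpt above: `FallbackDecision` and `findSharedLockPid` are hypothetical names, and only the fields used here are declared.

```typescript
// Subset of the "model_fallback_decision" fields shown in the log excerpt.
interface FallbackDecision {
  event: string;
  decision: string;
  candidateModel: string;
  errorPreview: string;
}

// Returns the blocking PID when every failed candidate hit the same session lock,
// otherwise null. A non-null result suggests the failure is unrelated to the model.
function findSharedLockPid(events: FallbackDecision[]): number | null {
  const failures = events.filter(
    (e) => e.event === "model_fallback_decision" && e.decision === "candidate_failed",
  );
  let sharedPid: number | null = null;
  for (const failure of failures) {
    const match = /session file locked .*pid=(\d+)/.exec(failure.errorPreview);
    if (match === null) {
      return null; // at least one candidate failed for a different reason
    }
    const pid = Number(match[1]);
    if (sharedPid !== null && sharedPid !== pid) {
      return null; // candidates were blocked by different lock owners
    }
    sharedPid = pid;
  }
  return sharedPid; // null when there were no failures at all
}
```

A caller could use a non-null result to stop walking the fallback chain early, since trying another model cannot clear a session lock.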

Workarounds Found

Manual (User-Level):

# Must be done after EVERY stuck agent run
rm -f ~/.openclaw/agents/coder/sessions/*.lock
pkill -f "openclaw agent --agent coder"
# Then retry agent run

Issues with Workaround:

  1. Must run BEFORE each new agent command (otherwise new command fails)
  2. Loses session history for debugging
  3. User must detect the stuck state manually
  4. Not feasible for automated/scripted agent runs

Expected Behavior

  1. Stale lock detection: If PID in lock file is no longer alive, automatically remove the lock
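  2. Signal cleanup: Register handlers for catchable signals (SIGTERM, SIGINT) to release locks before the process exits. SIGKILL cannot be caught, so the supervisor should send SIGTERM first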
  3. Lock timeout: Configurable timeout (not hardcoded 10s), or at least attempt cleanup on timeout
  4. Per-run locks: Each agent invocation should get its own lock, not share a single lock file

Suggested Fix

Option A: PID-based stale lock detection (Recommended)

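// Lock acquisition with PID-based stale-lock detection (Node.js sketch;
// assumes the lock file holds the owner's PID as plain text)
import * as fs from "node:fs";

function isProcessAlive(pid: number): boolean {
  try { process.kill(pid, 0); return true; } // signal 0: existence check only
  catch (err) { return (err as { code?: string }).code === "EPERM"; } // alive, other user
}

function tryAcquireLock(lockPath: string): boolean {
  if (fs.existsSync(lockPath)) {
    const lockPid = Number.parseInt(fs.readFileSync(lockPath, "utf8"), 10);
    if (Number.isNaN(lockPid) || isProcessAlive(lockPid)) return false; // real lock, wait
    fs.rmSync(lockPath, { force: true }); // stale lock, safe to remove
  }
  fs.writeFileSync(lockPath, String(process.pid), { flag: "wx" }); // throws EEXIST if raced
  return true;
}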

Option B: Process signal handlers

// On SIGTERM/SIGINT (SIGKILL cannot be caught, so no handler runs for it)
cleanupLockFile();
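Option B can be sketched in TypeScript as follows. This is an illustration only: the PR text indicates OpenClaw already has its own `CLEANUP_SIGNALS` and `releaseAllLocksSync()`, and the names used here (`heldLocks`, `releaseAllLocks`, `installLockCleanup`) are hypothetical stand-ins. The emitter and exit function are passed in so the logic can be exercised without signalling the real process.

```typescript
import * as fs from "node:fs";
import { EventEmitter } from "node:events";

// Paths of lock files this process currently owns (hypothetical bookkeeping).
const heldLocks: string[] = [];

// Synchronous on purpose: signal and exit handlers cannot wait for async work.
function releaseAllLocks(): void {
  for (const lockPath of heldLocks.splice(0)) {
    fs.rmSync(lockPath, { force: true });
  }
}

// Conventional exit codes: 128 + signal number (SIGINT = 2, SIGTERM = 15).
const EXIT_CODES = { SIGINT: 130, SIGTERM: 143 } as const;

function installLockCleanup(proc: EventEmitter, exit: (code: number) => void): void {
  for (const signal of ["SIGTERM", "SIGINT"] as const) {
    proc.once(signal, () => {
      releaseAllLocks();
      exit(EXIT_CODES[signal]);
    });
  }
  // Also covers normal exits. SIGKILL never reaches any of these handlers.
  proc.once("exit", releaseAllLocks);
}

// Typical wiring in a real process:
// installLockCleanup(process, (code) => process.exit(code));
```

Handlers like these only help when the supervisor sends a catchable signal, which is why the SIGKILL to SIGTERM change matters.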

Option C: Lock file with timestamp

// Lock includes timestamp, auto-expire after configurable timeout
// e.g., lock older than 30s is considered stale
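One caveat with Option C: an agent that legitimately runs for ten minutes would look stale under a 30-second limit unless it keeps refreshing its timestamp. The TypeScript sketch below adds that refresh step (a heartbeat). The `LockRecord` shape and all function names are assumptions; the report does not show OpenClaw's actual lock-file format.

```typescript
// Hypothetical lock-file payload; the real lock format is not shown in this report.
interface LockRecord {
  pid: number;
  refreshedAtMs: number; // last time the owner confirmed it is still working
}

function createLock(pid: number, nowMs: number): LockRecord {
  return { pid, refreshedAtMs: nowMs };
}

// The owner calls this periodically so long runs are not mistaken for stale ones.
function refreshLock(lock: LockRecord, nowMs: number): LockRecord {
  return { ...lock, refreshedAtMs: nowMs };
}

// A lock is stale once it has gone longer than maxAgeMs without a refresh.
function isLockStale(
  lock: LockRecord,
  nowMs: number,
  maxAgeMs: number = 30_000, // the 30s example from Option C, meant to be configurable
): boolean {
  return nowMs - lock.refreshedAtMs > maxAgeMs;
}
```

A killed process stops refreshing, so its lock expires after `maxAgeMs` no matter how it died, which is the property that signal handlers alone cannot provide.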

Impact

  • High: Completely blocks all agent functionality
  • Frequency: Reproducible on every long-running (> 10min) agent
  • Affected Users: Anyone using openclaw agent with timeout > 60s
  • Regression: Likely introduced in recent session persistence feature

Additional Context

  • Also observed: Gateway agent failed; falling back to embedded: Error: gateway timeout after 630000ms — suggesting the gateway timeout (10.5 min) conflicts with agent run timeout
  • When gateway restarts or sends heartbeat, it may trigger agent runs that conflict with existing long-running agents
  • The SIGKILL from supervisor (OpenClaw issue #66359/#66399) exacerbates this — killed agents leave locks behind

Related Issues

  • SIGKILL instead of SIGTERM: OpenClaw #66359/#66399
  • Gateway timeout: 630000ms (10.5 minutes) vs agent timeout

Attachments

  • Full OpenClaw log file (openclaw-2026-04-22.log)
  • Session lock files (if preserved)
  • openclaw.json configuration (sanitized)

Reported by: Johannes Huijbregts via Echo assistant
Date: 2026-04-22
OpenClaw Version: v2026.4.15

extent analysis

TL;DR

Implement a mechanism to automatically remove stale locks, such as checking if the PID in the lock file is still alive, to prevent session locks from persisting after an agent run crashes or times out.

Guidance

  1. Detect stale locks: Implement a check to see if the PID in the lock file is still alive before attempting to acquire the lock. If the PID is not alive, the lock can be safely removed.
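  2. Use signal handlers: Register handlers for SIGTERM and SIGINT to release locks before the process terminates. SIGKILL cannot be caught, so cleanup depends on the supervisor sending SIGTERM instead (the change made in PR #70094).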
  3. Implement lock timeouts: Introduce a configurable timeout for locks, so that if a lock is held for too long, it can be automatically removed.
  4. Review gateway timeouts: Investigate the gateway timeout (10.5 minutes) and how it interacts with agent run timeouts to prevent conflicts.

Notes

The provided pseudocode and suggestions are based on the information given in the issue report. Further testing and implementation details may be necessary to ensure a complete fix.

Recommendation

Implement stale-lock detection, such as the suggested PID-based approach, so that a lock is removed automatically when its owning process is no longer alive. This complements the SIGTERM change in PR #70094: a process that crashes or is killed with SIGKILL still cannot release its own lock.
