openclaw - ✅(Solved) Fix BUG: Agent session lock not released after crash/SIGKILL — blocks all subsequent runs [1 pull requests, 5 comments, 2 participants]

openclaw | 2026-04-22 05:54:04

openclaw/openclaw#70004 | Fetched 2026-04-23 07:30:32

Author: Johannes0402
Participants: dengluozhang, Johannes0402
Timeline (top): commented ×5, cross-referenced ×3, mentioned ×1, subscribed ×1

When an agent run is killed with SIGKILL or times out, its session lock file (*.jsonl.lock) is NOT released. Every subsequent run of that agent then fails to start, regardless of which model is requested, because each model in the fallback chain waits on the same lock.

Error Message

Error: session file locked (timeout 10000ms): pid=<old_pid> /path/to/<uuid>.jsonl.lock

PR fix notes

PR #70094: fix(agents): send SIGTERM instead of SIGKILL to allow lock cleanup (#70026)

Repository: openclaw/openclaw
Author: EronFan
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/70094

Description (problem / solution / changelog)

Fixes #70026 — Supervisor was sending SIGKILL instead of SIGTERM, bypassing CLEANUP_SIGNALS so releaseAllLocksSync() never runs, causing session lock files to persist and cascade into subsequent run failures.

Root cause: supervisor.ts cancelAdapter sent adapter.kill('SIGKILL') which terminates the process immediately without running cleanup handlers.

Fix: Changed to adapter.kill('SIGTERM') to allow graceful shutdown including session lock cleanup.

This also resolves #70004 (session lock cascade) which was caused by the same issue.
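
A minimal TypeScript sketch of the idea behind this change, not the actual supervisor.ts code: ask the child to exit with SIGTERM so that its cleanup handlers can run, and fall back to SIGKILL only if it has not exited after a grace period. The `Killable` interface, `terminateGracefully`, and the 5000 ms default are assumptions made for illustration; the PR as described only swaps the signal and does not mention an escalation step.

```typescript
// Hypothetical helper; names and the grace period are illustrative only.
type StopSignal = "SIGTERM" | "SIGKILL";

interface Killable {
  kill(signal: StopSignal): boolean;
  once(event: "exit", listener: () => void): unknown;
}

function terminateGracefully(child: Killable, graceMs = 5000): void {
  // Last resort: SIGKILL skips every cleanup handler in the child.
  const forceTimer = setTimeout(() => child.kill("SIGKILL"), graceMs);
  // Cancel the escalation as soon as the child exits on its own.
  child.once("exit", () => clearTimeout(forceTimer));
  // SIGTERM can be handled, so the child gets a chance to release its locks.
  child.kill("SIGTERM");
}
```

Node's `ChildProcess` exposes `kill` and `once("exit", ...)`, so an object returned by `spawn` should fit this shape. Whether the supervisor needs a forced fallback at all is a design choice the PR description leaves open.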

Changed files

extensions/browser/src/browser/pw-session.test.ts (modified, +75/-0)
extensions/browser/src/browser/pw-session.ts (modified, +65/-0)
extensions/browser/src/browser/pw-tools-core.browser-ssrf-guard.test.ts (modified, +1/-0)
extensions/browser/src/browser/pw-tools-core.snapshot.ts (modified, +10/-1)
extensions/feishu/package.json (modified, +3/-0)
extensions/telegram/package.json (modified, +1/-1)
extensions/telegram/src/bot-message-context.body.ts (modified, +10/-1)
extensions/whatsapp/src/auto-reply.web-auto-reply.last-route.test.ts (modified, +109/-0)
extensions/whatsapp/src/auto-reply/monitor/on-message.ts (modified, +6/-0)
src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.custom-provider-payloads.test.ts (added, +113/-0)
src/agents/pi-embedded-subscribe.ts (modified, +5/-3)
src/agents/sandbox/remote-fs-bridge.test.ts (modified, +57/-0)
src/agents/sandbox/remote-fs-bridge.ts (modified, +16/-2)
src/infra/system-events.test.ts (modified, +31/-0)
src/infra/system-events.ts (modified, +6/-1)
src/plugins/bundled-capability-runtime.ts (modified, +1/-1)
src/plugins/bundled-channel-config-metadata.ts (modified, +1/-1)
src/plugins/loader.ts (modified, +1/-1)
src/plugins/public-surface-loader.ts (modified, +2/-2)
src/plugins/source-loader.ts (modified, +1/-1)
src/process/supervisor/supervisor.ts (modified, +1/-1)
src/tasks/task-registry.audit.test.ts (modified, +77/-0)
src/tasks/task-registry.ts (modified, +7/-4)

Bug Report: Agent Session Lock Not Released After Crash/SIGKILL

Summary

Environment

OpenClaw Version: v2026.4.15 (also observed on v2026.4.14)
OS: macOS 15.4.1 (Darwin 25.4.0 arm64)
Node.js: v25.8.1
Shell: zsh
Host: Mac mini (Apple Silicon)

Configuration

"agents": {
  "defaults": {
    "model": {
      "primary": "ollama/glm-5.1:cloud",
      "fallbacks": [
        "ollama/qwen3.5:397b-cloud",
        "ollama/glm-5.1:cloud",
        "xai/grok-4-1-fast-reasoning",
        "anthropic/claude-haiku-4-5"
      ]
    }
  }
}

Steps to Reproduce

1. Start a long-running agent: openclaw agent --agent coder --message "complex task" --timeout 300
2. While the agent is running (e.g., at 12+ minutes), send a new agent command, or let the gateway send a heartbeat check.
3. The first agent gets SIGKILL'd by the supervisor (timeout or new request).
4. The lock file remains: agents/coder/sessions/<uuid>.jsonl.lock
5. All subsequent agent runs fail immediately with:

Error: session file locked (timeout 10000ms): pid=<old_pid> /path/to/<uuid>.jsonl.lock

6. This cascades through ALL fallback models (5 attempts, all fail with the same lock).

Observed Behavior

Error Pattern (repeated every ~10 seconds for 5 models):

session file locked (timeout 10000ms): pid=25358 /Users/botje/.openclaw/agents/coder/sessions/b68174ac-1d06-4f19-a8f7-f055b2fa51af.jsonl.lock

Full Fallback Chain Failure:

All models failed (5):
- ollama/kimi-k2.6:cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- ollama/qwen3.5:397b-cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- ollama/glm-5.1:cloud: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- xai/grok-4-1-fast: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)
- anthropic/claude-haiku-4-5: session file locked (timeout 10000ms): pid=25358 ...jsonl.lock (timeout)

Key Observations:

Lock persists long after the owner is gone: the lock from pid=25358 (from a run at ~07:00) was still blocking runs at ~07:43 (40+ minutes later)
Lock blocks ALL models: not just the original model; every fallback model fails on the SAME lock
SIGKILL doesn't clean up: when the supervisor kills the process, the lock file remains on disk
Hardcoded timeout: the 10000ms (10s) timeout appears to be hardcoded, not configurable
No automatic cleanup: there is no mechanism to detect stale locks (e.g., checking whether the PID is still alive)
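
The cascade can be illustrated with a small TypeScript simulation. `runWithFallbacks`, `Failure`, and the `acquireLock` callback are hypothetical names, not OpenClaw APIs; the sketch only assumes, as the log suggests, that every candidate model has to take the same session lock before it can run. If each failed attempt waits the full 10000 ms, five candidates add up to roughly 50 s of waiting before the run is reported as failed.

```typescript
interface Failure {
  model: string;
  error: string;
}

const LOCK_TIMEOUT_MS = 10000; // value taken from the error message

// Tries each model in order; every attempt needs the same session lock first.
function runWithFallbacks(models: string[], acquireLock: () => boolean): Failure[] {
  const failures: Failure[] = [];
  for (const model of models) {
    if (acquireLock()) {
      return failures; // a real run would now proceed with this model
    }
    failures.push({ model, error: `session file locked (timeout ${LOCK_TIMEOUT_MS}ms)` });
  }
  return failures;
}

// Model names copied from the fallback chain failure above.
const chain = [
  "ollama/kimi-k2.6:cloud",
  "ollama/qwen3.5:397b-cloud",
  "ollama/glm-5.1:cloud",
  "xai/grok-4-1-fast",
  "anthropic/claude-haiku-4-5",
];

// A stale lock never becomes free, so every candidate fails the same way.
const staleLockFailures = runWithFallbacks(chain, () => false);
const waitedSeconds = (staleLockFailures.length * LOCK_TIMEOUT_MS) / 1000;
console.log(`${staleLockFailures.length} models failed after ~${waitedSeconds}s of waiting`);
```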

Log Evidence

Repeated Lock Errors (40+ minutes):

# From 07:03 to 07:43 — same lock file blocks all attempts
# pid=22626: initial run
# pid=25358: subsequent run
# Both locks persisted until manually deleted

Model Fallback Decisions (all failing on same lock):

{
  "event": "model_fallback_decision",
  "runId": "cd6ed7c8-3cb3-47e2-a572-3c44c80d5ec0",
  "decision": "candidate_failed",
  "requestedModel": "kimi-k2.6:cloud",
  "candidateModel": "kimi-k2.6:cloud",
  "attempt": 1,
  "reason": "timeout",
  "errorPreview": "session file locked (timeout 10000ms): pid=25358 ...jsonl.lock"
}

Workarounds Found

Manual (User-Level):

# Must be done after EVERY stuck agent run
# Stop any leftover agent process first so it cannot keep holding the lock
pkill -f "openclaw agent --agent coder"
rm -f ~/.openclaw/agents/coder/sessions/*.lock
# Then retry the agent run

Issues with Workaround:

Must run BEFORE each new agent command (otherwise new command fails)
Loses session history for debugging
User must detect the stuck state manually
Not feasible for automated/scripted agent runs

Expected Behavior

Stale lock detection: If PID in lock file is no longer alive, automatically remove the lock
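Signal cleanup: Register handlers for catchable signals (SIGTERM, SIGINT) to clean up locks before the process terminates; SIGKILL cannot be caught, so that case needs stale lock detection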
Lock timeout: Configurable timeout (not hardcoded 10s), or at least attempt cleanup on timeout
Per-run locks: Each agent invocation should get its own lock, not share a single lock file

Suggested Fix

Option A: PID-based stale lock detection (Recommended)

// Pseudocode for lock acquisition
if (lockFileExists()) {
  const lockPid = readLockFile();
  if (!isProcessAlive(lockPid)) {
    deleteLockFile(); // Stale lock, safe to remove
  } else {
    waitForLock(); // Real lock, wait
  }
}
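
A runnable TypeScript version of Option A might look like the sketch below. It assumes the lock file contains only the owner's PID as plain text, which the report does not confirm (the error message merely prints a `pid=` value); `isProcessAlive` and `tryAcquireLock` are hypothetical names. The liveness probe uses `process.kill(pid, 0)`, which sends no signal and only checks whether the PID exists.

```typescript
import { readFileSync, rmSync, writeFileSync } from "node:fs";

// Assumption: the lock file holds the owner's PID as plain text.
function isProcessAlive(pid: number): boolean {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0); // signal 0 sends nothing; it only checks the PID
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Returns true if this process now owns the lock, false if a live owner holds it.
function tryAcquireLock(lockPath: string): boolean {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // "wx" creates the file and fails with EEXIST if it already exists.
      writeFileSync(lockPath, String(process.pid), { flag: "wx" });
      return true;
    } catch (err) {
      if ((err as { code?: string }).code !== "EEXIST") throw err;
    }
    let ownerPid = Number.NaN;
    try {
      ownerPid = Number.parseInt(readFileSync(lockPath, "utf8").trim(), 10);
    } catch {
      // The lock disappeared between the two calls; retry the create.
    }
    if (isProcessAlive(ownerPid)) return false; // real lock: caller should wait
    rmSync(lockPath, { force: true }); // stale lock: the owner is gone
  }
  return false;
}
```

Two caveats apply to this approach. PIDs can be reused, so a long-dead owner's PID may belong to an unrelated live process, and a lock file that has been created but not yet written looks stale for a brief moment. Combining the PID check with a timestamp narrows both gaps.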

Option B: Process signal handlers

// On SIGTERM/SIGINT (SIGKILL cannot be caught or handled)
cleanupLockFile();
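
A hedged TypeScript sketch of Option B follows. The names `CLEANUP_SIGNALS` and `releaseAllLocksSync` appear in the PR description, but their real implementations are not shown there, so the bodies below are assumptions, as are `heldLocks`, `holdLock`, and `installLockCleanup`. SIGKILL is left out on purpose: the operating system never delivers it to a handler.

```typescript
import { rmSync, writeFileSync } from "node:fs";

// Hypothetical registry of lock files owned by this process.
const heldLocks = new Set<string>();

function holdLock(lockPath: string): void {
  // "wx" fails if another owner already created the lock file.
  writeFileSync(lockPath, String(process.pid), { flag: "wx" });
  heldLocks.add(lockPath);
}

// Synchronous on purpose: async work is not guaranteed to finish during exit.
function releaseAllLocksSync(): void {
  for (const lockPath of heldLocks) {
    rmSync(lockPath, { force: true });
  }
  heldLocks.clear();
}

// Only catchable signals belong here; SIGKILL cannot be intercepted.
const CLEANUP_SIGNALS = ["SIGTERM", "SIGINT"] as const;

function installLockCleanup(): void {
  for (const signal of CLEANUP_SIGNALS) {
    process.once(signal, () => {
      releaseAllLocksSync();
      // Conventional exit codes: 128 + signal number (SIGINT = 2, SIGTERM = 15).
      process.exit(signal === "SIGINT" ? 130 : 143);
    });
  }
  // Also covers normal exits and explicit process.exit() calls.
  process.on("exit", releaseAllLocksSync);
}
```

Handlers like these help only when the process is asked to stop politely, so they complement stale lock detection rather than replace it.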

Option C: Lock file with timestamp

// Lock includes timestamp, auto-expire after configurable timeout
// e.g., lock older than 30s is considered stale
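
Option C could be sketched in TypeScript as below. The JSON payload with `pid` and `createdAtMs` is a hypothetical format (the report does not show what a *.jsonl.lock file contains), and `LockRecord`, `parseLock`, and `isLockExpired` are illustrative names.

```typescript
// Hypothetical lock payload; the real lock file format is not shown in the report.
interface LockRecord {
  pid: number;
  createdAtMs: number;
}

function parseLock(raw: string): LockRecord | null {
  try {
    const value = JSON.parse(raw) as Partial<LockRecord> | null;
    if (value && typeof value.pid === "number" && typeof value.createdAtMs === "number") {
      return { pid: value.pid, createdAtMs: value.createdAtMs };
    }
  } catch {
    // Unparseable content is treated as "no valid lock record".
  }
  return null;
}

// A lock is considered expired once it is older than maxAgeMs.
function isLockExpired(lock: LockRecord, maxAgeMs: number, nowMs: number = Date.now()): boolean {
  return nowMs - lock.createdAtMs > maxAgeMs;
}
```

With a 30 s limit, a legitimately long agent run would need to rewrite `createdAtMs` periodically (a heartbeat), otherwise its live lock would be treated as stale by the next run.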

Impact

High: Completely blocks all agent functionality
Frequency: Reproducible on every long-running (> 10min) agent
Affected Users: Anyone using openclaw agent with timeout > 60s
Regression: Likely introduced in recent session persistence feature

Additional Context

Also observed: Gateway agent failed; falling back to embedded: Error: gateway timeout after 630000ms — suggesting the gateway timeout (10.5 min) conflicts with agent run timeout
When gateway restarts or sends heartbeat, it may trigger agent runs that conflict with existing long-running agents
The SIGKILL from supervisor (OpenClaw issue #66359/#66399) exacerbates this — killed agents leave locks behind

Related Issues

SIGKILL instead of SIGTERM: OpenClaw #66359/#66399
Gateway timeout: 630000ms (10.5 minutes) vs agent timeout

Attachments

Full OpenClaw log file (openclaw-2026-04-22.log)
Session lock files (if preserved)
openclaw.json configuration (sanitized)

Reported by: Johannes Huijbregts via Echo assistant
Date: 2026-04-22
OpenClaw Version: v2026.4.15

extent analysis

TL;DR

Implement a mechanism to automatically remove stale locks, such as checking if the PID in the lock file is still alive, to prevent session locks from persisting after an agent run crashes or times out.

Guidance

Detect stale locks: Implement a check to see if the PID in the lock file is still alive before attempting to acquire the lock. If the PID is not alive, the lock can be safely removed.
Use signal handlers: Register signal handlers for SIGTERM and SIGINT to clean up locks before the process terminates. SIGKILL cannot be caught, so it must be covered by stale lock detection instead.
Implement lock timeouts: Introduce a configurable timeout for locks, so that if a lock is held for too long, it can be automatically removed.
Review gateway timeouts: Investigate the gateway timeout (10.5 minutes) and how it interacts with agent run timeouts to prevent conflicts.

Example

// Pseudocode for lock acquisition with stale lock detection
if (lockFileExists()) {
  const lockPid = readLockFile();
  if (!isProcessAlive(lockPid)) {
    deleteLockFile(); // Stale lock, safe to remove
  } else {
    waitForLock(); // Real lock, wait
  }
}

Notes

The provided pseudocode and suggestions are based on the information given in the issue report. Further testing and implementation details may be necessary to ensure a complete fix.

Recommendation

Apply workaround: Implement a stale lock detection mechanism, such as the suggested PID-based approach, to automatically remove locks when the associated process is no longer alive. This should help prevent session locks from blocking subsequent agent runs.
