openclaw - ✅(Solved) Fix BUG: Supervisor sends SIGKILL instead of SIGTERM for long-running agents — causes session lock cascade [2 pull requests, 4 comments, 2 participants]

Issue: openclaw/openclaw#70026 (fetched 2026-04-23 07:30:14)

When an agent run exceeds a certain duration (observed at 60-90+ seconds), the OpenClaw supervisor sends SIGKILL instead of SIGTERM to terminate the process. SIGKILL prevents any cleanup (including session lock removal), directly causing the session lock issue reported in #70004.

PR fix notes

PR #70094: fix(agents): send SIGTERM instead of SIGKILL to allow lock cleanup (#70026)

Description (problem / solution / changelog)

Fixes #70026. The supervisor was sending SIGKILL instead of SIGTERM. SIGKILL bypasses the CLEANUP_SIGNALS handlers, so releaseAllLocksSync() never ran; session lock files persisted and caused subsequent runs to fail.

Root cause: cancelAdapter in supervisor.ts called adapter.kill('SIGKILL'), which terminates the process immediately without running cleanup handlers.

Fix: Changed to adapter.kill('SIGTERM') to allow graceful shutdown including session lock cleanup.

This also resolves #70004 (session lock cascade) which was caused by the same issue.
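
Why the choice of signal matters can be sketched as follows. `CLEANUP_SIGNALS` and `releaseAllLocksSync()` are named in the PR description, but the contents of the signal list, the `installCleanup` helper, and the `SignalSource` interface below are assumptions made for illustration, not the project's actual code. SIGKILL cannot be caught, so no such list can include it.

```typescript
// ASSUMPTION: the real CLEANUP_SIGNALS list in openclaw may contain other signals.
const CLEANUP_SIGNALS = ["SIGINT", "SIGTERM", "SIGHUP"] as const;
type CleanupSignal = (typeof CLEANUP_SIGNALS)[number];

// Conventional exit status for termination by signal: 128 + signal number.
const EXIT_CODES: Record<CleanupSignal, number> = {
  SIGHUP: 129,
  SIGINT: 130,
  SIGTERM: 143,
};

// Minimal shape of `process` needed here, so the logic can be exercised with a fake.
interface SignalSource {
  once(signal: string, handler: () => void): unknown;
  exit(code: number): void;
}

// Hypothetical helper: run `release` (for example releaseAllLocksSync) before exiting.
function installCleanup(proc: SignalSource, release: () => void): void {
  for (const signal of CLEANUP_SIGNALS) {
    proc.once(signal, () => {
      release();
      proc.exit(EXIT_CODES[signal]);
    });
  }
}
```

With handlers registered this way, a SIGTERM from the supervisor reaches `release` before the process exits, whereas a SIGKILL ends the process without invoking any handler.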

Changed files

  • extensions/browser/src/browser/pw-session.test.ts (modified, +75/-0)
  • extensions/browser/src/browser/pw-session.ts (modified, +65/-0)
  • extensions/browser/src/browser/pw-tools-core.browser-ssrf-guard.test.ts (modified, +1/-0)
  • extensions/browser/src/browser/pw-tools-core.snapshot.ts (modified, +10/-1)
  • extensions/feishu/package.json (modified, +3/-0)
  • extensions/telegram/package.json (modified, +1/-1)
  • extensions/telegram/src/bot-message-context.body.ts (modified, +10/-1)
  • extensions/whatsapp/src/auto-reply.web-auto-reply.last-route.test.ts (modified, +109/-0)
  • extensions/whatsapp/src/auto-reply/monitor/on-message.ts (modified, +6/-0)
  • src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.custom-provider-payloads.test.ts (added, +113/-0)
  • src/agents/pi-embedded-subscribe.ts (modified, +5/-3)
  • src/agents/sandbox/remote-fs-bridge.test.ts (modified, +57/-0)
  • src/agents/sandbox/remote-fs-bridge.ts (modified, +16/-2)
  • src/infra/system-events.test.ts (modified, +31/-0)
  • src/infra/system-events.ts (modified, +6/-1)
  • src/plugins/bundled-capability-runtime.ts (modified, +1/-1)
  • src/plugins/bundled-channel-config-metadata.ts (modified, +1/-1)
  • src/plugins/loader.ts (modified, +1/-1)
  • src/plugins/public-surface-loader.ts (modified, +2/-2)
  • src/plugins/source-loader.ts (modified, +1/-1)
  • src/process/supervisor/supervisor.ts (modified, +1/-1)
  • src/tasks/task-registry.audit.test.ts (modified, +77/-0)
  • src/tasks/task-registry.ts (modified, +7/-4)

PR #69893: fix: multiple bundled plugin and channel regression fixes

Description (problem / solution / changelog)

Multiple regression fixes for bundled plugins and channel plugins.

Fixes included:

  • #69793: Telegram photo inbound media type classification
  • #69783: Bun global install hang (jitiFilename modulePath fix)
  • #69831: Telegram grammy version mismatch (staging skip + bundledDependencies)
  • #70025: Feishu @larksuiteoapi/node-sdk missing bundledDependencies
  • #70026: SIGTERM instead of SIGKILL for supervisor cleanup
  • #69478: Deduplicate enqueued system events
  • #69229: Task createdAt/startAt timestamp clamping
  • #69410: AssistantTexts populated at message_end
  • #69369: Docker binds in RemoteShellSandboxFsBridge
  • #69289: Resolve ax refs in browser actions

Changed files

  • extensions/browser/src/browser/pw-session.test.ts (modified, +75/-0)
  • extensions/browser/src/browser/pw-session.ts (modified, +65/-0)
  • extensions/browser/src/browser/pw-tools-core.browser-ssrf-guard.test.ts (modified, +1/-0)
  • extensions/browser/src/browser/pw-tools-core.snapshot.ts (modified, +10/-1)
  • extensions/feishu/package.json (modified, +3/-0)
  • extensions/telegram/package.json (modified, +4/-1)
  • extensions/telegram/src/bot-message-context.body.ts (modified, +10/-1)
  • extensions/whatsapp/src/auto-reply.web-auto-reply.last-route.test.ts (modified, +109/-0)
  • extensions/whatsapp/src/auto-reply/monitor/on-message.ts (modified, +6/-0)
  • src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.custom-provider-payloads.test.ts (added, +113/-0)
  • src/agents/pi-embedded-subscribe.ts (modified, +5/-3)
  • src/agents/sandbox/remote-fs-bridge.test.ts (modified, +57/-0)
  • src/agents/sandbox/remote-fs-bridge.ts (modified, +16/-2)
  • src/infra/system-events.test.ts (modified, +31/-0)
  • src/infra/system-events.ts (modified, +6/-1)
  • src/plugins/bundled-capability-runtime.ts (modified, +1/-1)
  • src/plugins/bundled-channel-config-metadata.ts (modified, +1/-1)
  • src/plugins/loader.ts (modified, +1/-1)
  • src/plugins/public-surface-loader.ts (modified, +2/-2)
  • src/plugins/source-loader.ts (modified, +1/-1)
  • src/process/supervisor/supervisor.ts (modified, +1/-1)
  • src/tasks/task-registry.audit.test.ts (modified, +77/-0)
  • src/tasks/task-registry.ts (modified, +7/-4)


Bug Report: Supervisor Sends SIGKILL Instead of SIGTERM for Long-Running Agents

Summary

When an agent run exceeds a certain duration (observed at 60-90+ seconds), the OpenClaw supervisor sends SIGKILL instead of SIGTERM to terminate the process. SIGKILL prevents any cleanup (including session lock removal), directly causing the session lock issue reported in #70004.

Environment

  • OpenClaw Version: v2026.4.20 (115f05d)
  • OS: macOS 15.4.1 (Darwin 25.4.0 arm64)
  • Node.js: v25.8.1

Observed Behavior

Pattern 1: Agent Runs Killed After ~60-90s

# Long-running agent (e.g., Kimi K2.6 with web search + code generation)
openclaw agent --agent coder --message "complex task" --timeout 300

# Observed: Process killed around 60-90s mark
# Result: SIGKILL (no cleanup possible)
# Evidence: Session lock file remains (.jsonl.lock)
# Process no longer exists but lock persists

Pattern 2: SIGTERM vs SIGKILL

SIGTERM (graceful - gateway shutdown):

{"subsystem":"gateway","message":"signal SIGTERM received"}
{"subsystem":"gateway","message":"received SIGTERM; shutting down"}

→ Gateway handles this gracefully, cleans up resources

SIGKILL (abrupt - agent runs):

# No log entries - process is killed without warning
# Lock file: agents/coder/sessions/<uuid>.jsonl.lock
# Lock owner PID no longer exists
# All subsequent agent runs fail with "session file locked"

→ No cleanup, session lock persists indefinitely

Pattern 3: Reproducible Steps

  1. Start a complex agent run (e.g., researcher with web search):
    openclaw agent --agent researcher \
      --message "Research GLM alternatives, 10+ sources" \
      --timeout 300
  2. Agent starts processing, makes API calls
  3. Around 60-90s: Process disappears (SIGKILL)
  4. Lock file remains: sessions/<uuid>.jsonl.lock
  5. Check: ps aux | grep <pid> → PID no longer exists
  6. New agent run: Fails with "session file locked (timeout 10000ms)"
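
Step 5 can also be checked programmatically. The sketch below uses Node's `process.kill(pid, 0)`, which tests whether a process exists without delivering a signal. The `LockInfo` shape and `isOrphaned` helper are hypothetical; the report does not show how the `.jsonl.lock` file records its owner.

```typescript
// Returns true if a process with this PID currently exists.
// Signal 0 performs the existence/permission check without sending a signal.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    // ESRCH (or anything else): treat the process as gone.
    return (err as { code?: string }).code === "EPERM";
  }
}

// ASSUMPTION: a lock can be mapped to the PID of the process that created it.
interface LockInfo {
  path: string;
  ownerPid: number;
}

// A lock is orphaned when its recorded owner is no longer running.
function isOrphaned(
  lock: LockInfo,
  alive: (pid: number) => boolean = isProcessAlive,
): boolean {
  return !alive(lock.ownerPid);
}
```

PIDs can be reused by the operating system, so a liveness check like this can occasionally report a stale lock as still owned; it should not produce the opposite error for a live owner.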

Root Cause Analysis

Evidence Points to Supervisor Timeout:

  1. Timeout mismatch:

    • User sets --timeout 300 (5 minutes)
    • Gateway timeout: 630000ms (10.5 minutes)
    • Supervisor timeout: Likely 60-90s (hardcoded?)
  2. Process lifecycle:

    • Gateway receives SIGTERM → graceful shutdown
    • Agent run receives no signal → abruptly killed (SIGKILL)
    • Suggests supervisor/process manager is killing the agent, not the gateway
  3. SIGKILL characteristics:

    • Cannot be caught or handled
    • No cleanup possible
    • Process state shows "killed" or missing PID
    • Lock files remain orphaned
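
The timeout mismatch in point 1 amounts to "the shortest timer wins". A minimal sketch, using the values from the report; the supervisor figure is the reporter's suspected upper bound, not a confirmed setting, and `firstToFire` is a hypothetical helper.

```typescript
interface TimeoutLayer {
  name: string;
  ms: number;
}

// Returns the layer whose timer fires first; that layer decides how the run ends.
function firstToFire(layers: TimeoutLayer[]): TimeoutLayer {
  if (layers.length === 0) throw new Error("no timeout layers given");
  return layers.reduce((min, layer) => (layer.ms < min.ms ? layer : min));
}

const layers: TimeoutLayer[] = [
  { name: "--timeout flag", ms: 300 * 1000 }, // from the report
  { name: "gateway", ms: 630000 },            // from the report
  { name: "supervisor", ms: 90 * 1000 },      // suspected, unconfirmed
];

console.log(firstToFire(layers).name); // "supervisor"
```

If the suspected value is right, the user's `--timeout 300` never gets a chance to take effect, which matches the observed kills at 60-90s.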

Impact

  • Session Lock Issue (#70004): Direct cause - locks not cleaned up
  • Data Loss: Agent output lost mid-generation
  • Resource Waste: Failed runs consume API tokens without completion
  • User Experience: Requires manual lock cleanup after every long run

Suggested Fix

Option 1: Use SIGTERM with Grace Period (Recommended)

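// Sketch for the supervisor; `child` is assumed to be the agent's ChildProcess handle
child.kill('SIGTERM'); // Request graceful shutdown so cleanup handlers can run
const forceKill = setTimeout(() => {
  if (child.exitCode === null && child.signalCode === null) {
    child.kill('SIGKILL'); // Force kill only after grace period
  }
}, 5000); // 5s grace period for cleanup
child.once('exit', () => clearTimeout(forceKill)); // Cancel escalation if the agent exits in time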
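
The same escalation can be wrapped as a reusable function with an injectable timer, which makes it testable without real processes. This is a sketch under assumptions: `Killable`, `Scheduler`, and `terminateGracefully` are hypothetical names, and `Killable` is only meant to mirror the parts of Node's `ChildProcess` that the logic needs.

```typescript
type TermSignal = "SIGTERM" | "SIGKILL";

// Minimal subset of a child-process handle used by the escalation logic.
interface Killable {
  exitCode: number | null;
  signalCode: string | null;
  kill(signal: TermSignal): boolean;
  once(event: "exit", listener: () => void): unknown;
}

// Schedules `fn` after `ms` milliseconds and returns a function that cancels it.
type Scheduler = (fn: () => void, ms: number) => () => void;

const realTimer: Scheduler = (fn, ms) => {
  const handle = setTimeout(fn, ms);
  return () => clearTimeout(handle);
};

// Sends SIGTERM first and escalates to SIGKILL only if the child is still
// running once the grace period has elapsed.
function terminateGracefully(
  child: Killable,
  graceMs = 5000,
  schedule: Scheduler = realTimer,
): void {
  child.kill("SIGTERM");
  const cancel = schedule(() => {
    if (child.exitCode === null && child.signalCode === null) {
      child.kill("SIGKILL");
    }
  }, graceMs);
  // If the child exits within the grace period, drop the pending SIGKILL.
  child.once("exit", cancel);
}
```

The grace period trades shutdown latency for cleanup safety: 5s is the value proposed in the report, not a measured requirement.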

Option 2: Extend Supervisor Timeout

  • Make supervisor timeout configurable or match --timeout flag
  • If user sets --timeout 300, supervisor should wait 300s before any kill
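
Option 2 could be sketched as deriving the supervisor's kill schedule from the user's `--timeout` value rather than from a fixed internal limit. `killSchedule` and `KillSchedule` are hypothetical names, and adding a grace period on top of the user's timeout is an assumption borrowed from Option 1.

```typescript
interface KillSchedule {
  sigtermAtMs: number; // when to request graceful shutdown
  sigkillAtMs: number; // when to force-kill if the process is still alive
}

// Converts a --timeout value in seconds into supervisor deadlines in milliseconds.
function killSchedule(userTimeoutSec: number, graceMs = 5000): KillSchedule {
  if (!Number.isFinite(userTimeoutSec) || userTimeoutSec <= 0) {
    throw new RangeError("timeout must be a positive number of seconds");
  }
  const sigtermAtMs = userTimeoutSec * 1000;
  return { sigtermAtMs, sigkillAtMs: sigtermAtMs + graceMs };
}

console.log(killSchedule(300)); // { sigtermAtMs: 300000, sigkillAtMs: 305000 }
```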

Option 3: Pre-Kill Hook

  • Register cleanup function before kill:
process.on('SIGTERM', () => {
  releaseSessionLock();
  process.exit(0);
});
  • Then use SIGTERM instead of SIGKILL

Workarounds

User-Level (Current):

# After every killed agent run (stop stray processes first, then clear their locks):
pkill -f "openclaw agent"
rm -f ~/.openclaw/agents/coder/sessions/*.lock

Script-Level:

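# Wrap agent calls with cleanup
# Note: this clears locks for ALL agents, including any still running concurrently
run_agent() {
  openclaw agent "$@"
  local status=$?   # preserve the agent's exit status
  sleep 1
  rm -f ~/.openclaw/agents/*/sessions/*.lock
  return "$status"
}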
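
A more conservative cleanup would remove only locks whose owner is gone, instead of every `*.lock` file. The sketch below assumes each lock file contains JSON with a numeric `pid` field; the report does not show the real `.jsonl.lock` format, so `readLockPid` is a placeholder for whatever the actual format requires. The `isAlive` callback is left to the caller (on Node, `process.kill(pid, 0)` is one way to implement it).

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// ASSUMPTION: lock files are JSON with a numeric `pid` field.
function readLockPid(file: string): number | null {
  try {
    const parsed: unknown = JSON.parse(fs.readFileSync(file, "utf8"));
    const pid = (parsed as { pid?: unknown }).pid;
    return typeof pid === "number" && Number.isInteger(pid) && pid > 0 ? pid : null;
  } catch {
    return null; // unreadable or not in the assumed format: leave the file alone
  }
}

// Removes only those *.lock files in `sessionsDir` whose recorded owner is not
// running. Returns the paths that were removed.
function removeStaleLocks(
  sessionsDir: string,
  isAlive: (pid: number) => boolean,
): string[] {
  const removed: string[] = [];
  for (const name of fs.readdirSync(sessionsDir)) {
    if (!name.endsWith(".lock")) continue;
    const file = path.join(sessionsDir, name);
    const pid = readLockPid(file);
    if (pid !== null && !isAlive(pid)) {
      fs.unlinkSync(file);
      removed.push(file);
    }
  }
  return removed;
}
```

Locks that cannot be parsed are deliberately kept, since deleting a lock held by a live run is worse than leaving a stale one for manual inspection.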

Related Issues

  • #70004: Session Lock Not Released After Crash/SIGKILL
  • Possibly related: Gateway timeout (630000ms) configuration

Additional Context

  • This may be related to openclaw agent using embedded runs vs gateway runs
  • Embedded runs might have different supervisor logic than gateway-managed runs
  • The 60-90s timeout suggests a hardcoded limit, not the user-specified --timeout

Attachments

  • Full log excerpt showing agent start → disappearance
  • Process monitor output (ps aux timestamps)
  • Session lock files with timestamps

Reported by: Johannes Huijbregts via Echo assistant
Date: 2026-04-22
OpenClaw Version: v2026.4.20 (115f05d)

extent analysis

TL;DR

The most likely fix is to modify the supervisor to send SIGTERM with a grace period instead of SIGKILL to allow for cleanup.

Guidance

  • Review the supervisor's timeout configuration to ensure it matches the user-specified --timeout flag.
  • Consider implementing a pre-kill hook to register a cleanup function before sending SIGTERM.
  • Verify that the gateway's timeout configuration (630000ms) is not interfering with the supervisor's timeout.
  • Test the suggested fix using the provided pseudocode for the supervisor.

Example

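// Sketch for the supervisor; `child` is assumed to be the agent's ChildProcess handle
child.kill('SIGTERM'); // Request graceful shutdown so cleanup handlers can run
const forceKill = setTimeout(() => {
  if (child.exitCode === null && child.signalCode === null) {
    child.kill('SIGKILL'); // Force kill only after grace period
  }
}, 5000); // 5s grace period for cleanup
child.once('exit', () => clearTimeout(forceKill)); // Cancel escalation if the agent exits in time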

Notes

The provided information suggests a hardcoded supervisor timeout, but the exact value and configuration are unclear. The suggested fix may need to be adapted based on the specific supervisor implementation.

Recommendation

Prefer SIGTERM with a grace period (Option 1). It gives the process time to release its session locks before any forced kill, so it addresses the orphaned locks directly. Extending the supervisor timeout only delays the kill, and the pre-kill hook helps only once the supervisor sends a catchable signal.
