openclaw - 💡(How to fix) Fix Bug: SIGUSR1 gateway restart leaves sessions.json.lock stale, causing message drops [1 comment, 2 participants]

openclaw/openclaw#77721 · Fetched 2026-05-06 06:22:26

When the gateway receives `SIGUSR1` mid-conversation, it can leave the `sessions.json` file lock (`sessions.json.lock`) in a stale state. New messages queue for 10–30 seconds with no retry, then silently drop. The stale lock is not recovered once the acquisition timeout (observed default: 10 s) expires, so message delivery keeps failing until the next full gateway restart.

Error Message

  • Gateway log shows no error; the messages simply vanish
  • No retry attempted, no error returned to sender



Reproduction

  1. Have an active conversation with the main agent (messages in flight)
  2. Send `SIGUSR1` to the gateway process (e.g., `openclaw gateway restart` or `kill -USR1 <pid>`)
  3. Observe: messages sent during the restart window are silently dropped
  4. Gateway log shows no error — the messages simply vanish

Expected Behavior

  • Gateway restarts gracefully, releasing all locks promptly
  • Messages sent during restart are either queued with retry or the sender receives a clear error
  • No silent message drops

Actual Behavior

  • `sessions.json.lock` remains locked after SIGUSR1, preventing session writes
  • Concurrent message delivery attempts fail to acquire lock within timeout (~10s)
  • After timeout expiry, message is dropped with no notification to sender
  • Issue is self-perpetuating: the `sessions-rebuild-poller` cron job (which could clear the lock) is itself blocked by the same lock

Version

```
OpenClaw 2026.5.3-1 (2eae30e)
```

OS: Linux 6.8.0-110-generic (x64)

Evidence

Incident log (2026-05-01 16:30 GMT+10):

  • SIGUSR1 sent to gateway process during active conversation
  • Message delivery failed silently for ~15 minutes
  • No retry attempted, no error returned to sender
  • Gateway restarted a second time to recover
  • `sessions-rebuild-poller` cron job (running every 5 minutes) observed as blocked by the same lock

Cron job also affected:

| Job ID | Name | Schedule | Issue |
| --- | --- | --- | --- |
| `9b007ea1-b85e-44ea-abbc-5de1e0f23e5e` | sessions-rebuild-poller | every 5 min | Blocked by `sessions.json.lock` |

Impact

  • Silent message drops during rolling restarts
  • Degrades user trust in the agent
  • `sessions-rebuild-poller` job (intended to recover from lock contention) becomes a victim of the same issue it is meant to fix

Severity

High — reliability/availability regression during normal gateway lifecycle operations.

Proposed Fix (direction, not prescriptive)

  1. Ensure `sessions.json.lock` is released synchronously before SIGUSR1 handler completes
  2. Alternatively: add lock acquisition retry with exponential backoff and sender notification on final failure
  3. Add a gateway-level health check that detects stale locks and forces release before accepting new sessions writes
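The second direction above (retry with exponential backoff, then notify the sender) could look roughly like the following sketch. It assumes the lock is an `flock()` advisory lock, which the issue does not confirm. The names `acquire_with_backoff`, `deliver`, `LockTimeout`, and `notify_sender` are hypothetical and do not come from the OpenClaw codebase.

```python
import fcntl
import time


class LockTimeout(Exception):
    """Raised when the lock could not be acquired after all attempts."""


def acquire_with_backoff(lock_path, attempts=5, base_delay=0.05, max_delay=1.0):
    """Try a non-blocking flock() up to `attempts` times, doubling the wait each time."""
    handle = open(lock_path, "a")  # "a" creates the file without truncating it
    delay = base_delay
    for attempt in range(attempts):
        try:
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return handle  # caller releases with flock(LOCK_UN) and close()
        except BlockingIOError:
            if attempt < attempts - 1:
                time.sleep(delay)
                delay = min(delay * 2, max_delay)
    handle.close()
    raise LockTimeout(f"could not lock {lock_path} after {attempts} attempts")


def deliver(message, lock_path, notify_sender):
    """Write under the lock, or report the failure instead of dropping the message."""
    try:
        handle = acquire_with_backoff(lock_path)
    except LockTimeout as exc:
        notify_sender(message, str(exc))  # hypothetical callback: surface the error
        return False
    try:
        pass  # the session write would go here
    finally:
        fcntl.flock(handle, fcntl.LOCK_UN)
        handle.close()
    return True
```

The point of the sketch is the failure path: when every attempt fails, the sender hears about it, which is the opposite of the silent drop described in this issue. The attempt count and delays are placeholder values.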

Related Issues

  • LRN-20260501-002 (internal) — internal learnings doc with full incident timeline
  • #11040 (Feature: First-class session/task chain tracking) — session lineage tracking, not directly related

extent analysis

TL;DR

Release the sessions.json.lock file synchronously before the SIGUSR1 handler completes to prevent silent message drops during gateway restarts.

Guidance

  • Review the SIGUSR1 handler to confirm it releases `sessions.json.lock` before it completes, so that new session writes can proceed.
  • Consider retrying lock acquisition with exponential backoff to cover cases where the lock is still held after the initial release attempt, and notify the sender if the final attempt fails.
  • Add a gateway-level health check that detects stale locks and forces their release before accepting new session writes, so that the `sessions-rebuild-poller` job is not blocked.
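A stale-lock health check along the lines of the last point might be sketched as follows. This assumes a different locking scheme from `flock()`: a plain lock file whose existence marks ownership and whose content is the owner's PID. The issue confirms neither detail, and `pid_alive`, `is_stale`, `clear_if_stale`, and the 30-second threshold are hypothetical.

```python
import os
import time


def pid_alive(pid):
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_stale(lock_path, max_age_seconds=30.0):
    """Stale means: the recorded owner is gone, or the file is older than the limit."""
    try:
        with open(lock_path) as f:
            content = f.read().strip()
        age = time.time() - os.path.getmtime(lock_path)
    except FileNotFoundError:
        return False  # no lock file, nothing to clear
    owner_dead = content.isdigit() and not pid_alive(int(content))
    return owner_dead or age > max_age_seconds


def clear_if_stale(lock_path, max_age_seconds=30.0):
    """Remove a stale lock file; return True if one was removed."""
    if not is_stale(lock_path, max_age_seconds):
        return False
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        return False  # someone else cleared it first
    return True
```

A check like this would need to run outside the code paths that wait on the lock; otherwise it can end up blocked in the same way the `sessions-rebuild-poller` job was. Forcing release on age alone is a trade-off, since it can also break a lock that a slow but healthy writer still holds.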

Example

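import fcntl

# Assumes the lock is an flock() advisory lock, which the issue does not confirm.
# flock() locks belong to the open file description that took them, so the
# handle that acquired the lock must be the one that releases it.
def acquire_lock(file_path):
    f = open(file_path, 'a')  # 'a' creates the file without truncating it
    fcntl.flock(f, fcntl.LOCK_EX)
    return f

def release_lock(f):
    fcntl.flock(f, fcntl.LOCK_UN)
    f.close()

# Call release_lock on the held handle before completing the SIGUSR1 handler
lock_handle = acquire_lock('sessions.json.lock')
release_lock(lock_handle)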

Notes

The issue proposes directions rather than a specific patch. How the lock should be released depends on how the gateway implements `sessions.json.lock`, which the issue does not describe.

Recommendation

Release `sessions.json.lock` synchronously before the SIGUSR1 handler completes. This targets the stale lock itself, which the issue identifies as the cause of the silent message drops during gateway restarts.
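As a minimal sketch of what "release before the handler completes" can mean, the handler below drops the lock as its first step. It again assumes an `flock()` advisory lock and a single module-level handle. `hold_lock`, `release_held_lock`, `lock_is_held`, and `on_sigusr1` are hypothetical names, and the gateway's real restart logic is not shown.

```python
import fcntl
import signal

_held_lock = None  # file handle that currently owns the flock(), if any


def hold_lock(lock_path):
    """Take the exclusive lock and remember the handle that owns it."""
    global _held_lock
    _held_lock = open(lock_path, "a")  # "a" creates the file without truncating it
    fcntl.flock(_held_lock, fcntl.LOCK_EX)


def release_held_lock():
    """Release through the same handle that acquired the lock, then forget it."""
    global _held_lock
    if _held_lock is not None:
        fcntl.flock(_held_lock, fcntl.LOCK_UN)
        _held_lock.close()
        _held_lock = None


def lock_is_held():
    return _held_lock is not None


def on_sigusr1(signum, frame):
    # Release first, so nothing later in the restart path can leave the lock held.
    release_held_lock()
    # ... restart logic would follow here ...


signal.signal(signal.SIGUSR1, on_sigusr1)
```

Python runs signal handlers in the main thread between bytecode instructions, so calling `flock()` and `close()` inside the handler is safe here. A runtime with truly asynchronous handlers would need a different arrangement, such as setting a flag and releasing the lock from the main loop.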
