openclaw - ✅(Solved) Fix Feature: Graceful Gateway Restart with Session Recovery [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#57425 (fetched 2026-04-08 01:49:52)
Comments: 0 · Participants: 1 · Timeline events: 2 · Reactions: 0
Timeline (top): cross-referenced ×1, referenced ×1

PR fix notes

PR #57556: feat: graceful gateway restart with session recovery

Description (problem / solution / changelog)

Problem

When the gateway restarts — whether from openclaw gateway restart, a config change, SIGUSR1, or a crash — all in-flight work is silently killed. There is no mechanism for sessions to know they were interrupted, no way for parent sessions to learn their subagents died, and no recovery path other than waiting for the next heartbeat or manual intervention.

This is the single biggest reliability gap in multi-agent OpenClaw deployments.

Real-world impact

Running 7 agents across Discord + iMessage + cron, a single gateway restart can:

  • Kill 3-4 active conversations simultaneously
  • Orphan subagents mid-task (research, code generation, file operations)
  • Drop cron jobs that were minutes into expensive multi-tool workflows
  • Leave group chat messages permanently unanswered

The blast radius scales with agent count. Community reports: #51917, #30043, #4410, #43311.

Solution

Five-part restart protection system:

1. Pre-restart session manifest

Before drain, enumerate active sessions, cron runs, and subagents. Write restart-manifest.json to the state directory with full context (session keys, status, channels, last message previews, active subagents).
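
A minimal sketch of the manifest write, in Python to match the snippet later on this page (the PR itself is TypeScript, and this is not its code). The field names follow the sample manifest on this page; `build_manifest`, `write_manifest`, and the temp-file-then-rename strategy are assumptions added for illustration.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def build_manifest(reason, triggered_by, sessions, cron_runs):
    """Assemble a manifest dict using the field names from the sample manifest."""
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "reason": reason,
        "triggeredBy": triggered_by,
        "activeSessions": list(sessions),
        "activeCronRuns": list(cron_runs),
    }


def write_manifest(state_dir, manifest, name="restart-manifest.json"):
    """Write to a temp file, then rename, so a reader never sees a partial file."""
    os.makedirs(state_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(manifest, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        final_path = os.path.join(state_dir, name)
        os.replace(tmp_path, final_path)  # atomic within one filesystem
        return final_path
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Writing through a temp file matters here because the process is about to be terminated: a half-written manifest would be worse than none.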

2. Post-restart session recovery

On startup, read manifest and:

  • Inject [System] events into interrupted sessions so agents know they were restarted
  • Re-queue interrupted cron runs (if configured)
  • Notify parent sessions about killed subagents
  • Replay messages that arrived during drain
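
The four recovery steps above could be sketched as a pure planning function that turns a manifest into a list of actions. This is an illustration in Python, not the PR's TypeScript implementation; the action names and the `queuedMessages` key are hypothetical, since the page does not show how drained messages are stored in the manifest.

```python
def plan_recovery(manifest, cron_retry=True):
    """Translate a restart manifest into an ordered list of recovery actions."""
    actions = []
    for session in manifest.get("activeSessions", []):
        # Tell the interrupted session that a restart happened.
        actions.append({"type": "inject_system_event", "session": session["key"]})
        # Tell the parent about each subagent that was killed with it.
        for sub in session.get("activeSubagents", []):
            actions.append(
                {"type": "notify_parent", "session": session["key"], "subagent": sub}
            )
    if cron_retry:  # mirrors the cronRetryOnInterrupt setting
        for run in manifest.get("activeCronRuns", []):
            actions.append({"type": "requeue_cron", "jobId": run["jobId"]})
    # "queuedMessages" is a hypothetical key for messages captured during drain.
    for message in manifest.get("queuedMessages", []):
        actions.append({"type": "replay_message", "message": message})
    return actions
```

Keeping the planning step free of side effects makes it easy to unit-test without a running gateway.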

3. Readiness gate

Both the CLI (openclaw gateway restart) and the agent gateway tool check for active sessions before proceeding. If any are found, the call returns a warning with session details instead of restarting. Pass --force (CLI) or force: true (tool) to override.
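
A rough Python sketch of the gate logic. It assumes the gate blocks when the number of active sessions exceeds `readinessGateThreshold`; the page does not spell out the threshold semantics. The function name and return shape are hypothetical.

```python
def check_restart_readiness(active_sessions, threshold=0, force=False):
    """Decide whether a restart may proceed.

    Assumption: block when len(active_sessions) > threshold, unless force is set.
    """
    count = len(active_sessions)
    if force or count <= threshold:
        return {"proceed": True, "warning": None}
    keys = ", ".join(session["key"] for session in active_sessions)
    return {
        "proceed": False,
        "warning": (
            f"{count} active session(s) would be interrupted: {keys}. "
            "Re-run with force to restart anyway."
        ),
    }
```

Returning the warning as data rather than raising lets the same check serve the CLI (print and exit) and the agent tool (hand the warning back as a tool result).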

4. Drain-aware message queuing

When a message arrives during the drain window, the resulting GatewayDrainingError is caught and the message is appended to the manifest for post-restart replay instead of being silently dropped.
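
The catch-and-queue pattern might look like the following Python sketch. `GatewayDrainingError` here is a local stand-in for the error named in the PR, and both `dispatch_or_queue` and the `queuedMessages` key are hypothetical names chosen for illustration.

```python
class GatewayDrainingError(Exception):
    """Stand-in for the gateway's real error type, which lives in the PR's code."""


def dispatch_or_queue(message, dispatch, manifest):
    """Try to dispatch; if the gateway is draining, record the message for replay."""
    try:
        dispatch(message)
        return "dispatched"
    except GatewayDrainingError:
        # "queuedMessages" is a hypothetical manifest key, not confirmed by the PR.
        manifest.setdefault("queuedMessages", []).append(message)
        return "queued"
```

Only the draining error is caught, so unrelated dispatch failures still surface normally.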

5. Configuration

{
  "gateway": {
    "restart": {
      "sessionRecovery": true,
      "cronRetryOnInterrupt": true,
      "readinessGate": true,
      "readinessGateThreshold": 0,
      "drainQueueMessages": true,
      "manifestPath": "restart-manifest.json"
    }
  }
}

All settings default to enabled with safe values.
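
As a sketch of how these defaults could be applied, the Python below merges a user config over the values shown above. The PR validates config with a Zod schema in TypeScript; `resolve_restart_config` and its strict rejection of unknown keys are assumptions for illustration only.

```python
RESTART_DEFAULTS = {
    "sessionRecovery": True,
    "cronRetryOnInterrupt": True,
    "readinessGate": True,
    "readinessGateThreshold": 0,
    "drainQueueMessages": True,
    "manifestPath": "restart-manifest.json",
}


def resolve_restart_config(user_config):
    """Return gateway.restart settings with defaults filled in."""
    restart = (user_config.get("gateway") or {}).get("restart") or {}
    unknown = sorted(set(restart) - set(RESTART_DEFAULTS))
    if unknown:
        # Fail loudly on typos rather than silently ignoring them.
        raise ValueError(f"unknown gateway.restart keys: {unknown}")
    return {**RESTART_DEFAULTS, **restart}
```

For example, `resolve_restart_config({"gateway": {"restart": {"readinessGate": False}}})` disables only the gate and leaves the other settings at their defaults.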

What this does NOT solve

  • Crash recovery — no pre-crash manifest possible; requires periodic snapshots (separate issue)
  • Subagent state resume — agents decide their own recovery strategy
  • Idempotency — recovery is a safety net, not a substitute for good workflow design

Files changed

  • New: src/gateway/restart-recovery.ts (~500 lines) — core module
  • New: src/agents/tools/gateway-tool.restart.test.ts — readiness gate tests
  • Modified: 16 files across gateway tool, CLI lifecycle, run loop, dispatch, config schema, server startup, restart infrastructure

18 files changed, 854 insertions, 6 deletions. 19 tests passing.

Testing

  • Unit tests cover readiness gate blocking, force-restart override, mock typing across all integration points
  • Built from source and ran a second isolated gateway instance (OPENCLAW_HOME=/tmp/openclaw-test, port 18790) — manifest write on SIGUSR1 confirmed working
  • Full integration test with active sessions pending (WS protocol v3 device pairing makes automated session creation non-trivial from outside the gateway)

Prior art

  • Hermes (Nous Research) is building per-tool-call checkpointing (issue #344) and has SQLite-backed ResponseStore persistence (shipped in v0.4.0)
  • Community workarounds: BOOT.md scripts scanning JSONL transcripts — fragile and expensive

Environment

  • OpenClaw 2026.3.28-3.30
  • macOS (Darwin 25.3.0, arm64), LaunchAgent
  • 7 agents, 3 Discord bots, BlueBubbles iMessage, 12 cron jobs

Closes #57425

Changed files

  • src/agents/tools/gateway-tool.restart.test.ts (added, +111/-0)
  • src/agents/tools/gateway-tool.ts (modified, +14/-0)
  • src/auto-reply/reply/dispatch-from-config.ts (modified, +8/-0)
  • src/cli/daemon-cli/lifecycle-core.ts (modified, +6/-0)
  • src/cli/daemon-cli/lifecycle.test.ts (modified, +39/-1)
  • src/cli/daemon-cli/lifecycle.ts (modified, +28/-0)
  • src/cli/daemon-cli/register-service-commands.ts (modified, +1/-0)
  • src/cli/daemon-cli/types.ts (modified, +1/-0)
  • src/cli/gateway-cli/run-loop.test.ts (modified, +55/-0)
  • src/cli/gateway-cli/run-loop.ts (modified, +36/-7)
  • src/config/schema.help.ts (modified, +14/-0)
  • src/config/schema.labels.ts (modified, +7/-0)
  • src/config/types.gateway.ts (modified, +16/-0)
  • src/config/zod-schema.ts (modified, +11/-0)
  • src/gateway/restart-recovery.ts (added, +502/-0)
  • src/gateway/server-reload-handlers.ts (modified, +3/-0)
  • src/gateway/server.impl.ts (modified, +8/-0)
  • src/infra/restart.ts (modified, +15/-0)

Code Example

{
  "timestamp": "2026-03-29T23:25:30Z",
  "reason": "config-change",
  "triggeredBy": "agent:main:guild-agent-birthing",
  "activeSessions": [
    {
      "key": "agent:main:main",
      "status": "processing",
      "lastUserMessage": "Can you check the garden plan?",
      "activeSubagents": ["agent:sage:guild-gardening"],
      "channel": "discord",
      "channelTarget": "user:344256406146383874"
    }
  ],
  "activeCronRuns": [
    {
      "jobId": "5a820e42-...",
      "jobName": "Pulse: Nightly Ecosystem Scan",
      "startedAt": "2026-03-29T23:20:00Z",
      "status": "running"
    }
  ]
}

Original issue (openclaw/openclaw#57425)

Problem

When the gateway restarts — whether from openclaw gateway restart, a config change, SIGUSR1, or a crash — all in-flight work is silently killed. There is no mechanism for sessions to know they were interrupted, no way for parent sessions to learn their subagents died, and no recovery path other than waiting for the next heartbeat or manual intervention.

This is the single biggest reliability gap in multi-agent OpenClaw deployments.

What happens today

  1. Gateway receives restart signal
  2. Drain period begins (90s timeout)
  3. Active sessions are killed after drain
  4. Gateway comes back up — fresh slate
  5. Sessions persist on disk (JSONL) but no one reads them
  6. Subagents die without reporting back to parents
  7. Cron jobs interrupted mid-execution are not retried until next scheduled time
  8. Users in group chats see "read" receipts but never get responses

Real-world impact

Running 7 agents across Discord + iMessage + cron, a single gateway restart can:

  • Kill 3-4 active conversations simultaneously
  • Orphan subagents mid-task (research, code generation, file operations)
  • Drop cron jobs that were minutes into expensive multi-tool workflows
  • Leave group chat messages permanently unanswered
  • Break multi-step workflows where step N completed but step N+1 never fires

The blast radius scales with agent count. This is not a solo-agent problem.

Existing community reports

  • #51917 — Auto-resume unanswered sessions (27-agent Signal deployment)
  • #30043 — Resume interrupted sessions and cron runs (macOS LaunchAgent)
  • #4410 — Auto-restart on stuck sessions
  • #43311 — Self-decapitation: agent-triggered restart kills its own session
  • #43178 — Telegram watchdog triggers restart under 10-agent load

Prior art: Hermes Agent

Hermes (Nous Research) is building this in their multi-agent architecture (issue #344):

  • Per-tool-call checkpointing — Sub-agent state persisted to ~/.hermes/checkpoints/ after each tool call. On failure, resume from checkpoint.
  • ResponseStore persistence — SQLite-backed state that survives restarts (shipped in v0.4.0)
  • Three-level failure escalation — Retry → Replan → Decompose further
  • One-shot job recovery — Interrupted cron-like jobs are automatically retried

Proposed solution

1. Pre-restart session manifest

Before sending SIGTERM to workers, the gateway should enumerate active sessions and write a manifest:

{
  "timestamp": "2026-03-29T23:25:30Z",
  "reason": "config-change",
  "triggeredBy": "agent:main:guild-agent-birthing",
  "activeSessions": [
    {
      "key": "agent:main:main",
      "status": "processing",
      "lastUserMessage": "Can you check the garden plan?",
      "activeSubagents": ["agent:sage:guild-gardening"],
      "channel": "discord",
      "channelTarget": "user:344256406146383874"
    }
  ],
  "activeCronRuns": [
    {
      "jobId": "5a820e42-...",
      "jobName": "Pulse: Nightly Ecosystem Scan",
      "startedAt": "2026-03-29T23:20:00Z",
      "status": "running"
    }
  ]
}

2. Post-restart session recovery

After startup, read the manifest and for each interrupted session, inject a system event:

[System] Gateway restarted at {time}. Reason: {reason}. You were interrupted mid-task. Review conversation context and respond to any unanswered messages.

For interrupted cron runs, re-queue with a retry flag. For sessions with active subagents, notify the parent that its subagent was killed.
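
Filling in the system event template could look like this Python sketch. It assumes `{time}` is taken from the manifest's `timestamp` field, which the issue does not state explicitly; the function names are hypothetical.

```python
SYSTEM_EVENT_TEMPLATE = (
    "[System] Gateway restarted at {time}. Reason: {reason}. "
    "You were interrupted mid-task. Review conversation context "
    "and respond to any unanswered messages."
)


def format_restart_event(manifest):
    """Render the system event text for one restart manifest."""
    # Assumption: {time} comes from the manifest timestamp, not the startup time.
    return SYSTEM_EVENT_TEMPLATE.format(
        time=manifest["timestamp"], reason=manifest["reason"]
    )


def events_by_session(manifest):
    """Map each interrupted session key to the event text it should receive."""
    text = format_restart_event(manifest)
    return {session["key"]: text for session in manifest.get("activeSessions", [])}
```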

3. Restart readiness gate

When gateway restart is called:

  • Enumerate active sessions
  • If count > 0, return a warning with the list of sessions that will be interrupted
  • Require --force to skip the check
  • For agent-triggered restarts, return the warning as a tool result so the agent can decide

4. Drain-aware message queuing

Messages received during drain should be queued and replayed after restart, not rejected. The current resetAllLanes() mechanism should be made reliable.

5. Configuration

{
  "gateway": {
    "restart": {
      "sessionRecovery": true,
      "cronRetryOnInterrupt": true,
      "readinessGate": true,
      "readinessGateThreshold": 0,
      "drainQueueMessages": true,
      "manifestPath": "restart-manifest.json"
    }
  }
}

What this does NOT solve (and shouldn't)

  • Crash recovery — No pre-crash manifest without periodic state snapshots or WAL-style journaling. Separate issue.
  • Subagent state resume — Injecting "you were interrupted" is enough. Actually resuming a half-completed tool call chain is complex and error-prone.
  • Idempotency — Agents should still design idempotent workflows. Recovery is a safety net, not a substitute for good design.

Current workaround

We've built a 3-layer userspace workaround that validates the approach:

  1. Pre-restart manifest script — Shell script calls openclaw sessions --all-agents --active 10 --json, writes restart-manifest.json
  2. BOOT.md hook — Reads the manifest on startup, sends a notification summarizing interrupted work, deletes the manifest
  3. One-shot POL cron — Backup proof-of-life scheduled before restart, fires after startup

This works (tested twice, clean results both times), but it's fragile — the manifest capture is best-effort, BOOT.md runs in a fresh context with no memory of what sessions were doing, and the entire thing bypasses OpenClaw's session management.
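
The startup half of this workaround (read the manifest, summarize, delete) can be sketched in Python as below. The actual workaround is a BOOT.md hook whose contents are not shown on this page, so `consume_manifest` and the summary wording are hypothetical.

```python
import json
import os


def consume_manifest(path):
    """Read the manifest once, return a short summary, and delete the file.

    Returns None when no manifest exists (i.e. no restart was recorded).
    """
    try:
        with open(path, encoding="utf-8") as f:
            manifest = json.load(f)
    except FileNotFoundError:
        return None
    except json.JSONDecodeError:
        # Best-effort capture can leave a truncated file; report it and move on.
        os.remove(path)
        return "Restart manifest was unreadable; interrupted work is unknown."
    sessions = [s.get("key", "?") for s in manifest.get("activeSessions", [])]
    crons = [c.get("jobName", "?") for c in manifest.get("activeCronRuns", [])]
    os.remove(path)  # consume it so the next startup does not repeat the notice
    return (
        f"Gateway restart interrupted {len(sessions)} session(s) "
        f"[{', '.join(sessions)}] and {len(crons)} cron run(s) [{', '.join(crons)}]."
    )
```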

Environment

  • OpenClaw 2026.3.28
  • macOS (Darwin 25.3.0, arm64), LaunchAgent
  • 7 agents, 3 Discord bots, BlueBubbles iMessage, 12 cron jobs
  • Restarts happen 2-5x per day during active development

Extended analysis

Fix Plan

To address the issue of silently killed in-flight work when the gateway restarts, we will implement the following steps:

  • Create a pre-restart session manifest to enumerate active sessions and write a manifest file
  • Implement post-restart session recovery to inject a system event for interrupted sessions and re-queue interrupted cron runs
  • Add a restart readiness gate to warn about interrupted sessions and require --force to skip the check
  • Make drain-aware message queuing reliable by queuing messages received during drain and replaying them after restart

Code Changes

Here are some example code snippets to illustrate the changes:

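Configuration (JSON):

{
  "gateway": {
    "restart": {
      "sessionRecovery": true,
      "cronRetryOnInterrupt": true,
      "readinessGate": true,
      "readinessGateThreshold": 0,
      "drainQueueMessages": true,
      "manifestPath": "restart-manifest.json"
    }
  }
}

Pre-restart manifest and post-restart recovery (Python sketch):

import json
import os
from datetime import datetime, timezone

MANIFEST_PATH = "restart-manifest.json"


def create_manifest(reason, triggered_by, sessions, cron_runs, path=MANIFEST_PATH):
    """Write the pre-restart manifest; callers pass the enumerated sessions and cron runs."""
    manifest = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "reason": reason,
        "triggeredBy": triggered_by,
        "activeSessions": list(sessions),
        "activeCronRuns": list(cron_runs),
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
    return manifest


def recover_sessions(path=MANIFEST_PATH):
    """Read the manifest after startup; return None when no restart was recorded."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            manifest = json.load(f)
    except FileNotFoundError:
        return None
    # Inject a system event into each interrupted session
    # ...
    # Re-queue interrupted cron runs
    # ...
    os.remove(path)  # consume the manifest so it is not replayed on the next start
    return manifest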

Verification

To verify that the fix worked, we can:

  • Restart the gateway and check that interrupted sessions are recovered and cron runs are re-queued
  • Verify that the restart readiness gate warns about interrupted sessions and requires --force to skip the check
  • Test that messages received during drain are queued and replayed after restart

Extra Tips

  • Make sure to handle errors and exceptions properly when creating and reading the manifest file
  • Consider adding additional logging and monitoring to track the effectiveness of the fix
  • Review the code changes carefully to ensure that they do not introduce any new issues or regressions.
