openclaw - ✅ (Solved) Feature: Graceful Gateway Restart with Session Recovery [1 pull request, 1 participant]

openclaw • 2026-03-30 02:51:24

GitHub stats

openclaw/openclaw#57425 • Fetched 2026-04-08 01:49:52

Author

lmdeagles

Participants

lmdeagles

Timeline (top)

cross-referenced ×1, referenced ×1

PR fix notes

PR #57556: feat: graceful gateway restart with session recovery

Repository: openclaw/openclaw
Author: lmdeagles
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/57556

Description (problem / solution / changelog)

Problem

When the gateway restarts — whether from openclaw gateway restart, a config change, SIGUSR1, or a crash — all in-flight work is silently killed. There is no mechanism for sessions to know they were interrupted, no way for parent sessions to learn their subagents died, and no recovery path other than waiting for the next heartbeat or manual intervention.

This is the single biggest reliability gap in multi-agent OpenClaw deployments.

Real-world impact

Running 7 agents across Discord + iMessage + cron, a single gateway restart can:

Kill 3-4 active conversations simultaneously
Orphan subagents mid-task (research, code generation, file operations)
Drop cron jobs that were minutes into expensive multi-tool workflows
Leave group chat messages permanently unanswered

The blast radius scales with agent count. Community reports: #51917, #30043, #4410, #43311.

Solution

Five-part restart protection system:

1. Pre-restart session manifest

Before drain, enumerate active sessions, cron runs, and subagents. Write restart-manifest.json to the state directory with full context (session keys, status, channels, last message previews, active subagents).
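
A minimal sketch of the manifest write, in Python for brevity (the PR itself is TypeScript, in src/gateway/restart-recovery.ts). The field names follow the manifest example in the issue. The function name write_restart_manifest and the write-to-temp-then-rename approach are assumptions for illustration, not taken from the PR.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def write_restart_manifest(state_dir, reason, triggered_by, sessions, cron_runs,
                           filename="restart-manifest.json"):
    """Write the manifest atomically so a reader never sees a partial file.

    Hypothetical helper: the name and signature are not from the PR.
    """
    manifest = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "reason": reason,
        "triggeredBy": triggered_by,
        "activeSessions": list(sessions),
        "activeCronRuns": list(cron_runs),
    }
    os.makedirs(state_dir, exist_ok=True)
    target = os.path.join(state_dir, filename)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(manifest, f, indent=2)
        os.replace(tmp_path, target)
    except BaseException:
        # Clean up the temp file if the write or rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return target
```

Renaming within one directory means a gateway that is killed mid-write leaves either the old manifest or the new one on disk, never a truncated file.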

2. Post-restart session recovery

On startup, read manifest and:

Inject [System] events into interrupted sessions so agents know they were restarted
Re-queue interrupted cron runs (if configured)
Notify parent sessions about killed subagents
Replay messages that arrived during drain
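
The four recovery steps above can be sketched as a pure function that turns a manifest into a list of actions. This is an illustrative Python sketch, not the PR's TypeScript. The action names and the queuedMessages field (with sessionKey and text) are hypothetical; the other field names come from the manifest example in the issue.

```python
def plan_recovery(manifest, cron_retry=True):
    """Turn a restart manifest into a flat list of (action, target, detail) tuples."""
    actions = []
    for session in manifest.get("activeSessions", []):
        # Tell the interrupted session that a restart happened.
        actions.append(
            ("inject_system_event", session["key"], manifest.get("reason", "unknown"))
        )
        # Tell the parent about each subagent that was running under it.
        for subagent in session.get("activeSubagents", []):
            actions.append(("notify_parent", session["key"], subagent))
    if cron_retry:
        for run in manifest.get("activeCronRuns", []):
            actions.append(("requeue_cron", run["jobId"], run.get("jobName", "")))
    # Messages captured during drain are replayed last, in arrival order.
    # "queuedMessages" is an assumed field name, not confirmed by the PR.
    for message in manifest.get("queuedMessages", []):
        actions.append(("replay_message", message["sessionKey"], message["text"]))
    return actions
```

Keeping the planning step free of side effects makes it straightforward to unit test without a running gateway; a separate executor would carry out each action.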

3. Readiness gate

Both CLI (openclaw gateway restart) and the agent gateway tool check active sessions before proceeding. Returns a warning with session details. Requires --force / force: true to override.
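
A rough Python sketch of the gate logic described above. It assumes the gate blocks when the number of active sessions exceeds readinessGateThreshold, which matches the "count > 0" rule in the issue for the default of 0. The function name and return shape are hypothetical.

```python
def check_restart_readiness(active_sessions, threshold=0, force=False):
    """Return (allowed, warning).

    Blocks when more than `threshold` sessions are active, unless forced.
    """
    count = len(active_sessions)
    if force or count <= threshold:
        return True, None
    lines = [f"{count} active session(s) would be interrupted:"]
    for session in active_sessions:
        lines.append(f"  - {session['key']} ({session.get('status', 'unknown')})")
    lines.append("Re-run with --force (CLI) or force: true (gateway tool) to restart anyway.")
    return False, "\n".join(lines)
```

Returning the warning as data rather than printing it lets the CLI show it to a user while the gateway tool hands the same text back to the agent as a tool result.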

4. Drain-aware message queuing

Messages arriving during the drain window catch GatewayDrainingError and are appended to the manifest for post-restart replay, instead of being silently dropped.
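
One way to picture the drain-aware path, as a hedged Python sketch. GatewayDrainingError is the error name used in the PR description, but the class below is a local stand-in. The dispatch callback and the queuedMessages field are assumptions made for this example.

```python
import json


class GatewayDrainingError(Exception):
    """Stand-in for the error named in the PR; the real class lives in the gateway."""


def dispatch_or_queue(message, dispatch, manifest_path):
    """Try to dispatch; if the gateway is draining, append to the manifest instead."""
    try:
        dispatch(message)
        return "dispatched"
    except GatewayDrainingError:
        # The manifest is written before drain starts, so it should already exist.
        with open(manifest_path, "r", encoding="utf-8") as f:
            manifest = json.load(f)
        # "queuedMessages" is a hypothetical field name.
        manifest.setdefault("queuedMessages", []).append(message)
        with open(manifest_path, "w", encoding="utf-8") as f:
            json.dump(manifest, f, indent=2)
        return "queued"
```

This read-modify-write is not safe if several messages arrive concurrently; a real implementation would need to serialize appends or keep the queue in memory until the final manifest flush.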

5. Configuration

{
  "gateway": {
    "restart": {
      "sessionRecovery": true,
      "cronRetryOnInterrupt": true,
      "readinessGate": true,
      "readinessGateThreshold": 0,
      "drainQueueMessages": true,
      "manifestPath": "restart-manifest.json"
    }
  }
}

All settings default to enabled with safe values.
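
A small sketch of how these settings might be resolved against defaults, written in Python for illustration (the PR defines the schema in src/config/zod-schema.ts). It assumes the values shown in the JSON above are the defaults; rejecting unknown keys is a design choice of this sketch, not confirmed behavior.

```python
RESTART_DEFAULTS = {
    "sessionRecovery": True,
    "cronRetryOnInterrupt": True,
    "readinessGate": True,
    "readinessGateThreshold": 0,
    "drainQueueMessages": True,
    "manifestPath": "restart-manifest.json",
}


def resolve_restart_config(user_config):
    """Merge the user's gateway.restart block over the assumed defaults."""
    restart = (user_config.get("gateway") or {}).get("restart") or {}
    unknown = sorted(set(restart) - set(RESTART_DEFAULTS))
    if unknown:
        raise ValueError(f"unknown gateway.restart keys: {', '.join(unknown)}")
    # Later keys win, so user values override defaults.
    return {**RESTART_DEFAULTS, **restart}
```

For example, resolve_restart_config({"gateway": {"restart": {"cronRetryOnInterrupt": False}}}) keeps every other setting at its default and disables only the cron retry.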

What this does NOT solve

Crash recovery — no pre-crash manifest possible; requires periodic snapshots (separate issue)
Subagent state resume — agents decide their own recovery strategy
Idempotency — recovery is a safety net, not a substitute for good workflow design

Files changed

New: src/gateway/restart-recovery.ts (~500 lines) — core module
New: src/agents/tools/gateway-tool.restart.test.ts — readiness gate tests
Modified: 16 files across gateway tool, CLI lifecycle, run loop, dispatch, config schema, server startup, restart infrastructure

18 files changed, 854 insertions, 6 deletions. 19 tests passing.

Testing

Unit tests cover readiness gate blocking, force-restart override, mock typing across all integration points
Built from source and ran a second isolated gateway instance (OPENCLAW_HOME=/tmp/openclaw-test, port 18790) — manifest write on SIGUSR1 confirmed working
Full integration test with active sessions pending (WS protocol v3 device pairing makes automated session creation non-trivial from outside the gateway)

Prior art

Hermes (Nous Research) is building per-tool-call checkpointing (issue #344) and has shipped SQLite-backed ResponseStore persistence in v0.4.0
Community workarounds: BOOT.md scripts scanning JSONL transcripts — fragile and expensive

Environment

OpenClaw 2026.3.28-3.30
macOS (Darwin 25.3.0, arm64), LaunchAgent
7 agents, 3 Discord bots, BlueBubbles iMessage, 12 cron jobs

Closes #57425

Changed files

src/agents/tools/gateway-tool.restart.test.ts (added, +111/-0)
src/agents/tools/gateway-tool.ts (modified, +14/-0)
src/auto-reply/reply/dispatch-from-config.ts (modified, +8/-0)
src/cli/daemon-cli/lifecycle-core.ts (modified, +6/-0)
src/cli/daemon-cli/lifecycle.test.ts (modified, +39/-1)
src/cli/daemon-cli/lifecycle.ts (modified, +28/-0)
src/cli/daemon-cli/register-service-commands.ts (modified, +1/-0)
src/cli/daemon-cli/types.ts (modified, +1/-0)
src/cli/gateway-cli/run-loop.test.ts (modified, +55/-0)
src/cli/gateway-cli/run-loop.ts (modified, +36/-7)
src/config/schema.help.ts (modified, +14/-0)
src/config/schema.labels.ts (modified, +7/-0)
src/config/types.gateway.ts (modified, +16/-0)
src/config/zod-schema.ts (modified, +11/-0)
src/gateway/restart-recovery.ts (added, +502/-0)
src/gateway/server-reload-handlers.ts (modified, +3/-0)
src/gateway/server.impl.ts (modified, +8/-0)
src/infra/restart.ts (modified, +15/-0)

Issue body (openclaw/openclaw#57425)

Problem

This is the single biggest reliability gap in multi-agent OpenClaw deployments.

What happens today

Gateway receives restart signal
Drain period begins (90s timeout)
Active sessions are killed after drain
Gateway comes back up — fresh slate
Sessions persist on disk (JSONL) but no one reads them
Subagents die without reporting back to parents
Cron jobs interrupted mid-execution are not retried until next scheduled time
Users in group chats see "read" receipts but never get responses

Real-world impact

Running 7 agents across Discord + iMessage + cron, a single gateway restart can:

Kill 3-4 active conversations simultaneously
Orphan subagents mid-task (research, code generation, file operations)
Drop cron jobs that were minutes into expensive multi-tool workflows
Leave group chat messages permanently unanswered
Break multi-step workflows where step N completed but step N+1 never fires

The blast radius scales with agent count. This is not a solo-agent problem.

Existing community reports

#51917 — Auto-resume unanswered sessions (27-agent Signal deployment)
#30043 — Resume interrupted sessions and cron runs (macOS LaunchAgent)
#4410 — Auto-restart on stuck sessions
#43311 — Self-decapitation: agent-triggered restart kills its own session
#43178 — Telegram watchdog triggers restart under 10-agent load

Prior art: Hermes Agent

Hermes (Nous Research) is building this in their multi-agent architecture (issue #344):

Per-tool-call checkpointing — Sub-agent state persisted to ~/.hermes/checkpoints/ after each tool call. On failure, resume from checkpoint.
ResponseStore persistence — SQLite-backed state that survives restarts (shipped in v0.4.0)
Three-level failure escalation — Retry → Replan → Decompose further
One-shot job recovery — Interrupted cron-like jobs are automatically retried

Proposed solution

1. Pre-restart session manifest

Before sending SIGTERM to workers, the gateway should enumerate active sessions and write a manifest:

{
  "timestamp": "2026-03-29T23:25:30Z",
  "reason": "config-change",
  "triggeredBy": "agent:main:guild-agent-birthing",
  "activeSessions": [
    {
      "key": "agent:main:main",
      "status": "processing",
      "lastUserMessage": "Can you check the garden plan?",
      "activeSubagents": ["agent:sage:guild-gardening"],
      "channel": "discord",
      "channelTarget": "user:344256406146383874"
    }
  ],
  "activeCronRuns": [
    {
      "jobId": "5a820e42-...",
      "jobName": "Pulse: Nightly Ecosystem Scan",
      "startedAt": "2026-03-29T23:20:00Z",
      "status": "running"
    }
  ]
}

2. Post-restart session recovery

After startup, read the manifest and for each interrupted session, inject a system event:

[System] Gateway restarted at {time}. Reason: {reason}. You were interrupted mid-task. Review conversation context and respond to any unanswered messages.

For interrupted cron runs, re-queue with a retry flag. For sessions with active subagents, notify the parent that its subagent was killed.
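
The system event template above can be filled in with a few lines of Python. The template text is copied from the issue. The function names and the wording of the subagent notice are hypothetical, since the issue does not specify that message.

```python
EVENT_TEMPLATE = (
    "[System] Gateway restarted at {time}. Reason: {reason}. "
    "You were interrupted mid-task. "
    "Review conversation context and respond to any unanswered messages."
)


def build_restart_event(manifest, restarted_at):
    """Render the restart notice for one interrupted session."""
    return EVENT_TEMPLATE.format(
        time=restarted_at, reason=manifest.get("reason", "unknown")
    )


def build_subagent_notice(session):
    """Render a notice for the parent session, or None if it had no subagents.

    The message wording here is an assumption, not taken from the issue.
    """
    subagents = session.get("activeSubagents", [])
    if not subagents:
        return None
    return "[System] Subagent(s) killed by gateway restart: " + ", ".join(subagents)
```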

3. Restart readiness gate

When gateway restart is called:

Enumerate active sessions
If count > 0, return a warning with the list of sessions that will be interrupted
Require --force to skip the check
For agent-triggered restarts, return the warning as a tool result so the agent can decide

4. Drain-aware message queuing

Messages received during drain should be queued and replayed after restart, not rejected. The current resetAllLanes() mechanism should be made reliable.

5. Configuration

{
  "gateway": {
    "restart": {
      "sessionRecovery": true,
      "cronRetryOnInterrupt": true,
      "readinessGate": true,
      "readinessGateThreshold": 0,
      "drainQueueMessages": true,
      "manifestPath": "restart-manifest.json"
    }
  }
}

What this does NOT solve (and shouldn't)

Crash recovery — No pre-crash manifest without periodic state snapshots or WAL-style journaling. Separate issue.
Subagent state resume — Injecting "you were interrupted" is enough. Actually resuming a half-completed tool call chain is complex and error-prone.
Idempotency — Agents should still design idempotent workflows. Recovery is a safety net, not a substitute for good design.

Current workaround

We've built a 3-layer userspace workaround that validates the approach:

Pre-restart manifest script — Shell script calls openclaw sessions --all-agents --active 10 --json, writes restart-manifest.json
BOOT.md hook — Reads the manifest on startup, sends a notification summarizing interrupted work, deletes the manifest
One-shot POL cron — Backup proof-of-life scheduled before restart, fires after startup

This works (tested twice, clean results both times), but it's fragile — the manifest capture is best-effort, BOOT.md runs in a fresh context with no memory of what sessions were doing, and the entire thing bypasses OpenClaw's session management.
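
The first two layers of the workaround could look roughly like this in Python (the author describes a shell script and a BOOT.md hook, so this is a translation for illustration only). The CLI invocation is the one quoted above. The shape of its JSON output is not documented here, so the sketch stores it verbatim under a hypothetical sessions key.

```python
import json
import os
import subprocess

# Command quoted in the workaround description above.
SESSIONS_CMD = ["openclaw", "sessions", "--all-agents", "--active", "10", "--json"]


def capture_manifest(path, reason, run=subprocess.run):
    """Layer 1: snapshot active sessions before the restart (best effort)."""
    result = run(SESSIONS_CMD, capture_output=True, text=True, check=True)
    # Store the CLI output as-is; its schema is not assumed.
    manifest = {"reason": reason, "sessions": json.loads(result.stdout)}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
    return manifest


def consume_manifest(path):
    """Layer 2: read the manifest once on startup, then delete it."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as f:
        manifest = json.load(f)
    os.remove(path)
    return manifest
```

The run parameter exists only so the CLI call can be replaced in tests. Deleting the manifest after reading prevents the same interruption from being reported on every later startup.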

Environment

OpenClaw 2026.3.28
macOS (Darwin 25.3.0, arm64), LaunchAgent
7 agents, 3 Discord bots, BlueBubbles iMessage, 12 cron jobs
Restarts happen 2-5x per day during active development

Extended analysis

Fix Plan

To address the issue of silently killed in-flight work when the gateway restarts, we will implement the following steps:

Create a pre-restart session manifest to enumerate active sessions and write a manifest file
Implement post-restart session recovery to inject a system event for interrupted sessions and re-queue interrupted cron runs
Add a restart readiness gate to warn about interrupted sessions and require --force to skip the check
Make drain-aware message queuing reliable by queuing messages received during drain and replaying them after restart

Code Changes

Here are some example code snippets to illustrate the changes:

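# Illustrative Python sketch (OpenClaw itself is TypeScript)
import json
from datetime import datetime, timezone

MANIFEST_PATH = "restart-manifest.json"

# Pre-restart session manifest
def create_manifest(reason, triggered_by, active_sessions, active_cron_runs):
    # The caller enumerates active sessions and cron runs and passes them in.
    manifest = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "reason": reason,
        "triggeredBy": triggered_by,
        "activeSessions": active_sessions,
        "activeCronRuns": active_cron_runs,
    }
    with open(MANIFEST_PATH, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
    return manifest

# Post-restart session recovery
def recover_sessions():
    try:
        with open(MANIFEST_PATH, "r", encoding="utf-8") as f:
            manifest = json.load(f)
    except FileNotFoundError:
        return None  # no manifest on disk, so there is nothing to recover
    # Inject system event for interrupted sessions
    # ...
    # Re-queue interrupted cron runs
    # ...
    return manifest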

Verification

To verify that the fix worked, we can:

Restart the gateway and check that interrupted sessions are recovered and cron runs are re-queued
Verify that the restart readiness gate warns about interrupted sessions and requires --force to skip the check
Test that messages received during drain are queued and replayed after restart

Extra Tips

Make sure to handle errors and exceptions properly when creating and reading the manifest file
Consider adding additional logging and monitoring to track the effectiveness of the fix
Review the code changes carefully to ensure that they do not introduce any new issues or regressions.
