openclaw - ✅(Solved) Fix RFC: Task Continuation Across Gateway Restarts (Checkpoint + Auto-Resume) [1 pull request, 1 participant]

Gernspi · 2026-04-04T12:55:33Z




Issue: openclaw/openclaw#60864 (fetched 2026-04-08 02:46:19)


OpenClaw currently lacks a standardized, first-class mechanism to continue in-progress tasks across gateway restarts. When a restart occurs, whether explicit or required due to config changes, runs are interrupted and may require manual recovery or external orchestration.



PR fix notes

PR #63406: fix(gateway): preserve restart continuations after reboot

Repository: openclaw/openclaw
Author: VACInc
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/63406

Description (problem / solution / changelog)

Summary

Problem: restart-sentinel wake preserved the restart notice, but requested post-restart continuation work could be dropped, delayed, or lose context after reboot.
Why it matters: gateway.restart is used specifically to resume work after reboot, so silent continuation loss breaks the main use case and is hard to diagnose.
What changed: the sentinel now carries an optional one-shot continuation, gateway.restart can request it, restart wake dispatches it back into the same session/thread, and the edge cases raised in review now warn or re-wake instead of silently failing.
What did NOT change (scope boundary): this does not add a durable continuation queue, change restart authorization, or broaden routing beyond the existing session-scoped restart sentinel flow.

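Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor required for the fix
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

Scope (select all touched areas)

- [x] Gateway / orchestration
- [x] Skills / tool execution
- [ ] Auth / tokens
- [x] Memory / storage
- [ ] Integrations
- [x] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

Linked Issue/PR

Closes #50137
Related #53940, #60864
- [x] This PR fixes a bug or regression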

Root Cause (if applicable)

Root cause: the original restart-sentinel path only needed to enqueue and wake a restart notice; once continuation support was added, the later continuation enqueue/dispatch path inherited assumptions that were only safe for the original single-event wake flow.
Missing detection / guardrail: there was no regression coverage for post-restart continuation delivery semantics, especially around schema shape, missing routing/session state, timestamp preservation, and systemEvent wake timing.
Contributing context (if known): gateway.restart allows continuation requests without an explicit sessionKey, and systemEvent continuation enqueue happens later than the original restart wake.

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- [ ] Unit test
- [x] Seam / integration test
- [ ] End-to-end test
- [ ] Existing coverage already sufficient
Target test or file: src/gateway/server-restart-sentinel.test.ts, src/agents/tools/gateway-tool.test.ts, src/infra/restart-sentinel.test.ts
Scenario the test should lock in: restart continuation survives reboot in the same routed context, warns instead of silently disappearing when session/route state is missing, keeps stamped agent context, and re-wakes systemEvent continuations.
Why this is the smallest reliable guardrail: the failures happen in the orchestration seam between persisted sentinel payloads, restart wake scheduling, session routing, and tool schema generation, so helper-only tests would miss them.
Existing test that already covers this (if any): None before this PR.
If no new test is added, why not: N/A

User-visible / Behavior Changes

gateway.restart can resume a one-shot continuation after reboot in the same session/thread instead of only sending a restart notice.
agentTurn continuations now preserve the stamped agent-facing text used for time-sensitive follow-up work.
systemEvent continuations now trigger their own wake instead of depending on the earlier restart-notice wake timing.
If restart continuation cannot run because route or session state is missing, the failure is now surfaced as a warning instead of being silently dropped.

Diagram (if applicable)

Before:
[gateway.restart + continuation] -> [restart notice wake] -> [continuation may be dropped/delayed]

After:
[gateway.restart + continuation] -> [restart notice wake] -> [continuation dispatched or warning logged] -> [result visible]
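
The "After" flow can be sketched in TypeScript. This is an illustration only, not the PR's code: `RestartSentinel`, `Continuation`, `dispatchContinuation`, and the `hasRoute` callback are hypothetical names, and the real sentinel schema in src/infra/restart-sentinel.ts may differ. Only the `agentTurn` / `systemEvent` kinds, the optional `sessionKey`, one-shot consumption, and the warn-instead-of-drop behavior are taken from the description above.

```typescript
// Hypothetical shapes for illustration; not the actual sentinel schema.
type Continuation =
  | { kind: "agentTurn"; message: string }
  | { kind: "systemEvent"; text: string };

interface RestartSentinel {
  sessionKey?: string;         // may be absent: gateway.restart allows that
  continuation?: Continuation; // optional and one-shot
}

type DispatchResult =
  | { status: "none" }
  | { status: "dispatched"; sessionKey: string; kind: "agentTurn" | "systemEvent" }
  | { status: "warned"; reason: string };

// Consume the continuation once, then either dispatch it into the same
// session or surface a warning instead of dropping it silently.
function dispatchContinuation(
  sentinel: RestartSentinel,
  hasRoute: (sessionKey: string) => boolean,
  warn: (message: string) => void,
): DispatchResult {
  const continuation = sentinel.continuation;
  if (!continuation) return { status: "none" };
  delete sentinel.continuation; // one-shot: a second wake finds nothing

  const sessionKey = sentinel.sessionKey;
  if (!sessionKey) {
    const reason = "restart continuation skipped: missing sessionKey";
    warn(reason);
    return { status: "warned", reason };
  }
  if (!hasRoute(sessionKey)) {
    const reason = "restart continuation skipped: no route for " + sessionKey;
    warn(reason);
    return { status: "warned", reason };
  }
  return { status: "dispatched", sessionKey, kind: continuation.kind };
}
```

Returning a tagged result rather than `void` keeps the "dispatched or warning logged" outcome observable in seam tests, which is the property the regression plan asks for.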

Security Impact (required)

New permissions/capabilities? (Yes/No) Yes
Secrets/tokens handling changed? (Yes/No) No
New/changed network calls? (Yes/No) No
Command/tool execution surface changed? (Yes/No) Yes
Data access scope changed? (Yes/No) No
If any Yes, explain risk + mitigation: the owner-only gateway.restart tool now accepts optional continuation inputs and can resume one follow-up turn after reboot. Risk is constrained by the existing restart auth boundary, one-shot sentinel consumption, same-session routing, and explicit warning/fail-fast handling when continuation delivery is unavailable.

Repro + Verification

Environment

OS: Linux
Runtime/container: local Node/pnpm workspace
Model/provider: N/A
Integration/channel (if any): mocked channel routing in targeted tests
Relevant config (redacted): default local test config

Steps

Invoke gateway.restart with continuationMessage in a routed session.
Let restart sentinel wake run after startup.
Exercise agentTurn, systemEvent, missing-route, and missing-sessionKey continuation cases.

Expected

Continuation resumes in the same routed context after reboot, or logs a visible warning when it cannot be delivered.
systemEvent continuation requests a wake after enqueue.
agentTurn continuation keeps stamped BodyForAgent context.

Actual

Before this fix set, continuation edge cases could be silently dropped, delayed, or lose the stamped agent context.
After this fix set, the tested restart continuation paths behave deterministically and fail visibly when delivery is unavailable.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

Verified scenarios: ran targeted restart-sentinel/tool tests covering agent-turn continuation dispatch, systemEvent continuation wake, missing-route warning, missing-sessionKey warning, one-shot consumption, and flat enum tool schema generation.
Edge cases checked: timestamp preservation in BodyForAgent, no-route fail-fast path, no-sessionKey warning path, and delayed systemEvent wake after restart notice delivery.
What you did not verify: live restart behavior against a real channel/integration.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? (Yes/No) Yes
Config/env changes? (Yes/No) No
Migration needed? (Yes/No) No
If yes, exact upgrade steps:

Risks and Mitigations

Risk: continuation behavior now depends on the persisted restart sentinel shape and the new gateway tool inputs staying aligned.
- Mitigation: targeted tests cover sentinel payload, tool schema shape, restart wake dispatch, and the reviewed edge cases.
Risk: systemEvent continuation now schedules an extra heartbeat wake for the same session.
- Mitigation: the wake is narrowly scoped to the continuation session and covered by regression test coverage in src/gateway/server-restart-sentinel.test.ts.

Changed files

src/agents/tools/gateway-tool.test.ts (added, +135/-0)
src/agents/tools/gateway-tool.ts (modified, +17/-2)
src/gateway/server-restart-sentinel.test.ts (modified, +313/-4)
src/gateway/server-restart-sentinel.ts (modified, +211/-44)
src/infra/restart-sentinel.test.ts (modified, +6/-0)
src/infra/restart-sentinel.ts (modified, +11/-0)


RFC: Task Continuation Across Gateway Restarts (Checkpoint + Auto-Resume)

Summary

This RFC proposes a pragmatic v1 checkpoint + auto-resume mechanism:

- No full in-memory state rehydration
- No external scripts required
- Continuation implemented as a new follow-up run seeded from a persisted checkpoint
- Task continues automatically without requiring a new user prompt

Goals

Persist sufficient state before restart to continue the same task
Transition run state to paused_for_restart
On startup, automatically resume exactly once (idempotent / at-most-once)
Resume via a new run with:
- same user goal
- same plan progress (last completed + next step)
- relevant intermediate context/artifacts
No manual intervention required
Explicit failure states, with no silent “running without progress”

Non-Goals (v1)

Full RAM snapshot or exact process rehydration
Perfect reproduction of all tool/subagent internal state
Mandatory full chat transcript restoration
External orchestration as a requirement

Current Limitation / Motivation

Runs are interrupted by restart.
There is no built-in continuation mechanism.
External workarounds are required today.
Runs may remain “running” without progress after restart, or require manual re-prompting to reconstruct context.

Proposed Approach

Checkpoint Lifecycle

A run that requires a restart should follow a standardized lifecycle:

running → paused_for_restart → resuming → resumed

If needed, resuming can be optional internally, but the persisted state model should support it.
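
One way to pin the lifecycle down is an explicit transition table. The sketch below is an assumption about how it could look, not part of the RFC: `RunStatus`, `ALLOWED`, `canTransition`, and `transition` are hypothetical names. It also assumes that `paused_for_restart` may skip `resuming`, and that `failed_resume` is reachable from either state, since the RFC treats `resuming` as optional internally.

```typescript
type RunStatus =
  | "running"
  | "paused_for_restart"
  | "resuming"
  | "resumed"
  | "failed_resume";

// Allowed next states per state. Terminal states have no successors.
const ALLOWED: Record<RunStatus, readonly RunStatus[]> = {
  running: ["paused_for_restart"],
  paused_for_restart: ["resuming", "resumed", "failed_resume"],
  resuming: ["resumed", "failed_resume"],
  resumed: [],
  failed_resume: [],
};

function canTransition(from: RunStatus, to: RunStatus): boolean {
  return ALLOWED[from].indexOf(to) !== -1;
}

// Returns the new state, or throws so that an illegal move is never silent.
function transition(from: RunStatus, to: RunStatus): RunStatus {
  if (!canTransition(from, to)) {
    throw new Error("illegal transition: " + from + " -> " + to);
  }
  return to;
}
```

Keeping the table as data makes the persisted state model easy to test independently of the scheduler.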

Trigger Conditions

Checkpointing should occur for:

Explicit restart, for example openclaw gateway restart
Restart required due to config reload, for example “config change requires restart”

As soon as a restart becomes part of normal task execution, continuation should apply.

Checkpoint Content (minimal)

A checkpoint should be sufficient to resume the same task functionally, not as a full runtime snapshot.

Identity

original run/task IDs
origin/routing: channel/provider, peer/chat identifiers
account ID where applicable
timestamps

Task semantics

userGoal for the original request
plan plus cursor:
- last completed step
- next step
relevant intermediate results/artifacts:
- file diffs
- computed outputs
- decisions
tool-context references, without secrets

Resume metadata

resume reason: explicit restart vs restart-required-by-reload
resume policy: at-most-once

Status

paused_for_restart | resuming | resumed | failed_resume
error diagnostics where applicable

Storage should be local, durable, and use atomic writes.
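
The checkpoint content listed above could map to a type like the following. Every field name here is illustrative, because the RFC leaves storage location and format as an open question; `validateCheckpoint` is a hypothetical helper that checks structure only.

```typescript
type CheckpointStatus = "paused_for_restart" | "resuming" | "resumed" | "failed_resume";
type ResumeReason = "explicit_restart" | "restart_required_by_reload";

// Illustrative shape only; not a schema defined by OpenClaw.
interface Checkpoint {
  // Identity
  runId: string;
  taskId: string;
  routing: { channel: string; peerId: string; accountId?: string };
  createdAt: string; // ISO 8601 timestamp
  // Task semantics
  userGoal: string;
  plan: { lastCompletedStep: string | null; nextStep: string };
  artifacts: string[]; // references such as paths or diff pointers, never secrets
  // Resume metadata
  resumeReason: ResumeReason;
  resumePolicy: "at-most-once";
  // Status
  status: CheckpointStatus;
  error?: string;
}

const KNOWN_STATUSES: readonly string[] = [
  "paused_for_restart",
  "resuming",
  "resumed",
  "failed_resume",
];

// Returns a list of problems; an empty list means the value is structurally usable.
function validateCheckpoint(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["checkpoint must be an object"];
  }
  const c = value as Record<string, unknown>;
  const problems: string[] = [];
  for (const key of ["runId", "taskId", "userGoal", "createdAt"]) {
    if (typeof c[key] !== "string" || c[key] === "") {
      problems.push(key + " must be a non-empty string");
    }
  }
  const plan = c["plan"];
  if (typeof plan !== "object" || plan === null ||
      typeof (plan as Record<string, unknown>)["nextStep"] !== "string") {
    problems.push("plan.nextStep must be a string");
  }
  if (c["resumePolicy"] !== "at-most-once") {
    problems.push("resumePolicy must be at-most-once");
  }
  if (KNOWN_STATUSES.indexOf(String(c["status"])) === -1) {
    problems.push("status is not a known checkpoint status");
  }
  return problems;
}
```

Validating on read as well as on write lets a damaged file turn into an explicit `failed_resume` rather than a half-seeded run.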

Resume Mechanism

On gateway start:

detect checkpoints in paused_for_restart
atomically claim one for resume
create a new follow-up run seeded with checkpoint context
mark checkpoint as resumed, linking it to the new run ID
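
The four startup steps can be sketched as a single pass over a checkpoint store. This is a minimal in-memory illustration under stated assumptions: `StoredCheckpoint`, `resumeOnStartup`, and the `createRun` callback are hypothetical, and a real store would need the claim in step 2 to be atomic on disk.

```typescript
interface StoredCheckpoint {
  id: string;
  status: "paused_for_restart" | "resuming" | "resumed" | "failed_resume";
  userGoal: string;
  nextStep: string;
  resumedRunId?: string;
  error?: string;
}

// `createRun` stands in for whatever the run scheduler exposes; it returns
// the ID of the new follow-up run. Returns the IDs of all runs created.
function resumeOnStartup(
  store: StoredCheckpoint[],
  createRun: (seed: { userGoal: string; nextStep: string }) => string,
): string[] {
  const newRunIds: string[] = [];
  for (const cp of store) {
    if (cp.status !== "paused_for_restart") continue; // 1. detect
    cp.status = "resuming";                           // 2. claim
    try {
      // 3. create a follow-up run seeded from the checkpoint
      const runId = createRun({ userGoal: cp.userGoal, nextStep: cp.nextStep });
      cp.status = "resumed";                          // 4. mark and link
      cp.resumedRunId = runId;
      newRunIds.push(runId);
    } catch (err) {
      cp.status = "failed_resume";
      cp.error = err instanceof Error ? err.message : String(err);
    }
  }
  return newRunIds;
}
```

Because only `paused_for_restart` checkpoints are picked up, running the hook a second time creates no additional runs.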

Idempotency / At-Most-Once

Requirements:

no duplicate resume
safe under restart loops
atomic state transitions

Suggested model:

paused_for_restart → resuming(token) → resumed | failed_resume
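
The suggested model amounts to a compare-and-set guarded by a claim token. A minimal sketch, assuming hypothetical names (`ClaimableCheckpoint`, `claim`, `complete`) and an in-memory record; in a real store the check and the write would have to happen as one atomic, persisted step.

```typescript
interface ClaimableCheckpoint {
  status: "paused_for_restart" | "resuming" | "resumed" | "failed_resume";
  claimToken?: string;
  resumedRunId?: string;
  error?: string;
}

// Succeeds only from paused_for_restart. Returns true if this caller now
// owns the checkpoint; any later claimant gets false.
function claim(cp: ClaimableCheckpoint, token: string): boolean {
  if (cp.status !== "paused_for_restart") return false;
  cp.status = "resuming";
  cp.claimToken = token;
  return true;
}

// Only the holder of the claim token may finish the resume, so a stale
// claimant cannot overwrite the outcome.
function complete(
  cp: ClaimableCheckpoint,
  token: string,
  outcome: { runId: string } | { error: string },
): boolean {
  if (cp.status !== "resuming" || cp.claimToken !== token) return false;
  if ("runId" in outcome) {
    cp.status = "resumed";
    cp.resumedRunId = outcome.runId;
  } else {
    cp.status = "failed_resume";
    cp.error = outcome.error;
  }
  return true;
}
```

A restart loop is safe under this model because a checkpoint that already left `paused_for_restart` can never be claimed a second time.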

Failure Handling

checkpoint write failure → explicit error state, no silent restart
resume failure → failed_resume with error details
bounded retries only, no infinite loops
multiple checkpoints resumed independently
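
"Bounded retries only" could look like the helper below. It is a synchronous sketch with hypothetical names (`RetryOutcome`, `resumeWithBoundedRetries`); a production version would more likely be asynchronous and add backoff between attempts.

```typescript
interface RetryOutcome<T> {
  ok: boolean;
  value?: T;
  attempts: number;
  error?: string;
}

// Try `attempt` up to `maxAttempts` times, then give up with the last error.
// There is no path that loops forever.
function resumeWithBoundedRetries<T>(attempt: () => T, maxAttempts: number): RetryOutcome<T> {
  let lastError = "";
  for (let i = 1; i <= maxAttempts; i++) {
    try {
      return { ok: true, value: attempt(), attempts: i };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { ok: false, attempts: maxAttempts, error: lastError };
}
```

When the outcome is not `ok`, the caller would mark the checkpoint `failed_resume` and store `error` as the diagnostic.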

UI / Messaging Behavior

No additional user prompt should be required to resume.
Optional minimal info: “Resumed after restart; continuing from step X…”
No duplicate notifications and no debug dumps.
If resume fails, show explicit status plus error summary.

Backwards Compatibility

Feature should activate only when restart is required, either explicitly or internally detected.
Existing behavior remains unchanged otherwise.

Acceptance Criteria

A run that hits requires_restart is checkpointed before restart.
The run transitions to paused_for_restart.
After gateway start, resume occurs automatically and exactly once.
Resume creates a new run that continues the same task with:
- same goal
- same plan progress, including last and next step
- relevant context and artifacts
No manual input is required.
No double resume occurs.
Clear failure states exist, with no silent hangs.

Open Questions

checkpoint storage location and format
definition of safe plan-step boundaries
handling non-idempotent tool calls with side effects
cross-backend consistency for subagents, ACP, and similar runtimes

PR Implementation Plan

Overview / Strategy

Implement this in two small, reviewable phases:

Checkpoint persistence + status model
Auto-resume on gateway startup

v1 should stay intentionally minimal: resume by creating a new follow-up run seeded from checkpoint data, not by trying to rehydrate in-memory runtime state.

Phase 1: Checkpoint + Status Model

Areas likely affected

Gateway restart orchestration
Restart-required-by-reload logic
Run/task runtime state model
Persistence/storage layer

Core changes

Add explicit statuses:
- paused_for_restart
- resuming
- resumed
- failed_resume
Define checkpoint schema, for example JSON, containing:
- IDs and routing info
- goal
- plan cursor with last and next step
- artifact summary such as paths and diff pointers
- tool-context references without secrets
- resume reason and policy
Implement durable checkpoint store, for example under:
- ~/.openclaw/state/checkpoints/*.json
Use atomic write strategy:
- write temp file
- rename into place
Add checkpoint creation hook:
- when restart is required, serialize checkpoint for each impacted run
- transition run state to paused_for_restart
- fail clearly if checkpoint write fails
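
The write-temp-then-rename strategy can be sketched with Node's `fs` module. `writeCheckpointAtomic` and `defaultCheckpointDir` are hypothetical helpers, and the directory simply mirrors the example path above. The sketch relies on rename being atomic when the temp file and the target are on the same filesystem, which is why the temp file is created in the target directory.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Mirrors the example location from the RFC; the final location is an open question.
function defaultCheckpointDir(): string {
  return path.join(os.homedir(), ".openclaw", "state", "checkpoints");
}

// Write the checkpoint to a temp file in the target directory, flush it to
// disk, then rename it into place. Throws on any failure so the caller can
// report the error instead of restarting silently.
function writeCheckpointAtomic(dir: string, id: string, data: unknown): string {
  fs.mkdirSync(dir, { recursive: true });
  const finalPath = path.join(dir, id + ".json");
  const tmpPath = path.join(dir, "." + id + "." + process.pid + ".tmp");
  const fd = fs.openSync(tmpPath, "w");
  try {
    fs.writeSync(fd, JSON.stringify(data));
    fs.fsyncSync(fd); // make sure the bytes are durable before the rename
  } finally {
    fs.closeSync(fd);
  }
  fs.renameSync(tmpPath, finalPath);
  return finalPath;
}
```

A reader scanning the directory for `*.json` then sees either the previous complete file or the new complete file, never a partial write.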

Tests

unit tests for schema validation
unit tests for atomic write and atomic claim
unit tests for run status transitions
integration-style test for restart-required flow creating checkpoint and paused state

Risks

defining a stable plan cursor if planner state is currently too implicit
accidentally persisting secrets from tool or environment context
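
One possible guard against the second risk is to redact by key name before serializing. This is a sketch under assumptions: `redactSecrets` is a hypothetical helper and the key pattern is an illustrative, non-exhaustive list. Key-name matching does not catch secrets embedded in free-text values, so it reduces the risk rather than eliminating it.

```typescript
// Key names treated as sensitive. Illustrative only; extend as needed.
const SENSITIVE_KEY = /(token|secret|password|api[_-]?key|authorization)/i;

// Returns a deep copy with the values of sensitive-looking keys replaced.
function redactSecrets(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactSecrets(item));
  }
  if (typeof value === "object" && value !== null) {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : redactSecrets(source[key]);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```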

Phase 2: Auto-Resume Hook

Areas likely affected

Gateway startup/bootstrap sequence
Run scheduler/dispatcher
Status reporting

Core changes

On gateway start:
- scan checkpoint store for paused_for_restart
- claim checkpoint atomically
- create a new run with:
  - same routing
  - same goal
  - seeded context summary
- mark checkpoint resumed with resumedRunId
Enforce at-most-once:
- claim token plus persisted transition prevents double resume
- handle concurrent startup paths safely

Tests

unit tests for claim semantics and no double resume
integration test:
- create paused checkpoint
- run startup resume hook
- assert exactly one new run is created
- assert checkpoint is marked resumed
failure-mode tests:
- corrupted checkpoint file → failed_resume with diagnostic
- resume creation failure → failed_resume without infinite retry
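
The corrupted-file case could be exercised against a loader like the one below. `loadCheckpoint` and `LoadResult` are hypothetical, and the two required fields are a simplification of the checkpoint content described earlier.

```typescript
type LoadResult =
  | { status: "paused_for_restart"; checkpoint: { userGoal: string; nextStep: string } }
  | { status: "failed_resume"; error: string };

// Parse raw checkpoint text. Any parse or shape problem becomes an explicit
// failed_resume result with a diagnostic, never an exception or a silent skip.
function loadCheckpoint(raw: string): LoadResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { status: "failed_resume", error: "corrupted checkpoint: " + detail };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { status: "failed_resume", error: "corrupted checkpoint: not an object" };
  }
  const fields = parsed as Record<string, unknown>;
  const userGoal = fields["userGoal"];
  const nextStep = fields["nextStep"];
  if (typeof userGoal !== "string" || typeof nextStep !== "string") {
    return { status: "failed_resume", error: "corrupted checkpoint: missing userGoal or nextStep" };
  }
  return { status: "paused_for_restart", checkpoint: { userGoal, nextStep } };
}
```

A failure-mode test then only needs to feed in truncated JSON and assert on the `failed_resume` status and its diagnostic.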

Failure Modes / Handling Checklist

Checkpoint write fails → run marked failed and restart blocked or clearly reported
Resume fails → checkpoint marked failed_resume; no follow-up run created
Restart loop → checkpoint remains paused/resuming with bounded retry or manual intervention path
Multiple paused runs → each checkpoint resumed independently with bounded concurrency

Deliberate Constraints for v1

Not full memory rehydration
Not restoring complete subagent graphs
Resume is always a new run seeded from checkpoint
No external scripts required

extent analysis

TL;DR

Implement a checkpoint and auto-resume mechanism to continue in-progress tasks across gateway restarts by creating a new follow-up run seeded from a persisted checkpoint.

Guidance

Define a checkpoint schema: Determine the necessary information to include in the checkpoint, such as task IDs, user goal, plan progress, and relevant context/artifacts.
Implement durable checkpoint storage: Choose a suitable storage location and format for checkpoints, ensuring atomic writes and data integrity.
Develop an auto-resume hook: Create a mechanism to detect and claim checkpoints on gateway startup, and then create a new run with the same goal and context as the original task.
Enforce at-most-once resume: Implement a token-based system to prevent duplicate resumes and ensure idempotency.
Test and validate: Write comprehensive unit and integration tests to verify the correctness and robustness of the checkpoint and auto-resume mechanism.

Example

An example checkpoint document, as JSON (field names are illustrative):
{
  "taskId": "example-task-123",
  "userGoal": "example-goal",
  "planProgress": {
    "lastCompletedStep": "step-1",
    "nextStep": "step-2"
  },
  "context": {
    "artifacts": ["artifact-1", "artifact-2"]
  },
  "resumeReason": "explicit-restart",
  "resumePolicy": "at-most-once"
}

Notes

The implementation should focus on a minimal, functional checkpoint and auto-resume mechanism, without attempting to rehydrate full in-memory state or restore complete subagent graphs.

Recommendation

Apply the proposed checkpoint and auto-resume mechanism to ensure task continuation across gateway restarts, as it provides a pragmatic and effective solution to the current limitation.
