openclaw - ✅ (Solved) Fix RFC: Task Continuation Across Gateway Restarts (Checkpoint + Auto-Resume) [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#60864 · Fetched 2026-04-08 02:46:19
Comments: 0 · Participants: 1 · Timeline: 2 · Reactions: 2
Timeline (top): cross-referenced ×1, subscribed ×1

OpenClaw currently lacks a standardized, first-class mechanism to continue in-progress tasks across gateway restarts. When a restart occurs, whether explicit or required due to config changes, runs are interrupted and may require manual recovery or external orchestration.

This RFC proposes a pragmatic v1 checkpoint + auto-resume mechanism:

  • No full in-memory state rehydration
  • No external scripts required
  • Continuation implemented as a new follow-up run seeded from a persisted checkpoint
  • Task continues automatically without requiring a new user prompt

Error Message

Failure reporting required by the RFC:

  • checkpoint write failure → explicit error state, no silent restart
  • resume failure → failed_resume with error details
  • if resume fails, show explicit status plus an error summary
  • error diagnostics are recorded alongside the checkpoint status where applicable

Root Cause

OpenClaw currently lacks a standardized, first-class mechanism to continue in-progress tasks across gateway restarts. When a restart occurs, whether explicit or required due to config changes, runs are interrupted and may require manual recovery or external orchestration.


Fix Action

Fix / Workaround

Current behavior without the fix:

  • Runs are interrupted by restart.
  • There is no built-in continuation mechanism.
  • External workarounds are required today.
  • Runs may remain “running” without progress after restart, or require manual re-prompting to reconstruct context.

Areas the proposed auto-resume hook is likely to affect:

  • Gateway startup/bootstrap sequence
  • Run scheduler/dispatcher
  • Status reporting

PR fix notes

PR #63406: fix(gateway): preserve restart continuations after reboot

Description (problem / solution / changelog)

Summary

  • Problem: restart-sentinel wake preserved the restart notice, but requested post-restart continuation work could be dropped, delayed, or lose context after reboot.
  • Why it matters: gateway.restart is used specifically to resume work after reboot, so silent continuation loss breaks the main use case and is hard to diagnose.
  • What changed: the sentinel now carries an optional one-shot continuation, gateway.restart can request it, restart wake dispatches it back into the same session/thread, and the edge cases raised in review now warn or re-wake instead of silently failing.
  • What did NOT change (scope boundary): this does not add a durable continuation queue, change restart authorization, or broaden routing beyond the existing session-scoped restart sentinel flow.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #50137
  • Related #53940, #60864
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: the original restart-sentinel path only needed to enqueue and wake a restart notice; once continuation support was added, the later continuation enqueue/dispatch path inherited assumptions that were only safe for the original single-event wake flow.
  • Missing detection / guardrail: there was no regression coverage for post-restart continuation delivery semantics, especially around schema shape, missing routing/session state, timestamp preservation, and systemEvent wake timing.
  • Contributing context (if known): gateway.restart allows continuation requests without an explicit sessionKey, and systemEvent continuation enqueue happens later than the original restart wake.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/server-restart-sentinel.test.ts, src/agents/tools/gateway-tool.test.ts, src/infra/restart-sentinel.test.ts
  • Scenario the test should lock in: restart continuation survives reboot in the same routed context, warns instead of silently disappearing when session/route state is missing, keeps stamped agent context, and re-wakes systemEvent continuations.
  • Why this is the smallest reliable guardrail: the failures happen in the orchestration seam between persisted sentinel payloads, restart wake scheduling, session routing, and tool schema generation, so helper-only tests would miss them.
  • Existing test that already covers this (if any): None before this PR.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

  • gateway.restart can resume a one-shot continuation after reboot in the same session/thread instead of only sending a restart notice.
  • agentTurn continuations now preserve the stamped agent-facing text used for time-sensitive follow-up work.
  • systemEvent continuations now trigger their own wake instead of depending on the earlier restart-notice wake timing.
  • If restart continuation cannot run because route or session state is missing, the failure is now surfaced as a warning instead of being silently dropped.

Diagram (if applicable)

Before:
[gateway.restart + continuation] -> [restart notice wake] -> [continuation may be dropped/delayed]

After:
[gateway.restart + continuation] -> [restart notice wake] -> [continuation dispatched or warning logged] -> [result visible]
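
As a rough illustration of the "after" flow (not the PR's actual code), the one-shot consumption and warn-instead-of-drop behavior might look like the TypeScript sketch below. The names RestartSentinel, Continuation, Outcome, and consumeContinuation are hypothetical and chosen only for this example; the agentTurn and systemEvent kinds and the optional sessionKey are the only details taken from the PR description.

```typescript
// Hypothetical shapes; the real sentinel payload in the PR may differ.
type Continuation =
  | { kind: "agentTurn"; message: string }
  | { kind: "systemEvent"; text: string };

interface RestartSentinel {
  sessionKey?: string; // may be absent: gateway.restart does not require it
  continuation?: Continuation; // optional, one-shot
}

type Outcome =
  | { status: "dispatched"; sessionKey: string; kind: Continuation["kind"] }
  | { status: "warned"; reason: string }
  | { status: "none" };

// Consumes the continuation at most once and never drops it silently.
function consumeContinuation(
  sentinel: RestartSentinel,
  hasRoute: (sessionKey: string) => boolean,
  warn: (message: string) => void,
): Outcome {
  const continuation = sentinel.continuation;
  if (!continuation) return { status: "none" };
  delete sentinel.continuation; // one-shot: a second wake finds nothing to run

  const key = sentinel.sessionKey;
  if (!key) {
    warn("restart continuation skipped: missing sessionKey");
    return { status: "warned", reason: "missing sessionKey" };
  }
  if (!hasRoute(key)) {
    warn(`restart continuation skipped: no route for session ${key}`);
    return { status: "warned", reason: "missing route" };
  }
  return { status: "dispatched", sessionKey: key, kind: continuation.kind };
}
```

A caller would pass its own route lookup and logger; the point of the sketch is that every path either dispatches or produces a visible warning.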

Security Impact (required)

  • New permissions/capabilities? (Yes/No) Yes
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) Yes
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation: the owner-only gateway.restart tool now accepts optional continuation inputs and can resume one follow-up turn after reboot. Risk is constrained by the existing restart auth boundary, one-shot sentinel consumption, same-session routing, and explicit warning/fail-fast handling when continuation delivery is unavailable.

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: local Node/pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): mocked channel routing in targeted tests
  • Relevant config (redacted): default local test config

Steps

  1. Invoke gateway.restart with continuationMessage in a routed session.
  2. Let restart sentinel wake run after startup.
  3. Exercise agentTurn, systemEvent, missing-route, and missing-sessionKey continuation cases.

Expected

  • Continuation resumes in the same routed context after reboot, or logs a visible warning when it cannot be delivered.
  • systemEvent continuation requests a wake after enqueue.
  • agentTurn continuation keeps stamped BodyForAgent context.

Actual

  • Before this fix set, continuation edge cases could be silently dropped, delayed, or lose the stamped agent context.
  • After this fix set, the tested restart continuation paths behave deterministically and fail visibly when delivery is unavailable.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: ran targeted restart-sentinel/tool tests covering agent-turn continuation dispatch, systemEvent continuation wake, missing-route warning, missing-sessionKey warning, one-shot consumption, and flat enum tool schema generation.
  • Edge cases checked: timestamp preservation in BodyForAgent, no-route fail-fast path, no-sessionKey warning path, and delayed systemEvent wake after restart notice delivery.
  • What you did not verify: live restart behavior against a real channel/integration.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:

Risks and Mitigations

  • Risk: continuation behavior now depends on the persisted restart sentinel shape and the new gateway tool inputs staying aligned.
    • Mitigation: targeted tests cover sentinel payload, tool schema shape, restart wake dispatch, and the reviewed edge cases.
  • Risk: systemEvent continuation now schedules an extra heartbeat wake for the same session.
    • Mitigation: the wake is narrowly scoped to the continuation session and covered by regression tests in src/gateway/server-restart-sentinel.test.ts.

Changed files

  • src/agents/tools/gateway-tool.test.ts (added, +135/-0)
  • src/agents/tools/gateway-tool.ts (modified, +17/-2)
  • src/gateway/server-restart-sentinel.test.ts (modified, +313/-4)
  • src/gateway/server-restart-sentinel.ts (modified, +211/-44)
  • src/infra/restart-sentinel.test.ts (modified, +6/-0)
  • src/infra/restart-sentinel.ts (modified, +11/-0)

RFC: Task Continuation Across Gateway Restarts (Checkpoint + Auto-Resume)

Summary

OpenClaw currently lacks a standardized, first-class mechanism to continue in-progress tasks across gateway restarts. When a restart occurs, whether explicit or required due to config changes, runs are interrupted and may require manual recovery or external orchestration.

This RFC proposes a pragmatic v1 checkpoint + auto-resume mechanism:

  • No full in-memory state rehydration
  • No external scripts required
  • Continuation implemented as a new follow-up run seeded from a persisted checkpoint
  • Task continues automatically without requiring a new user prompt

Goals

  • Persist sufficient state before restart to continue the same task
  • Transition run state to paused_for_restart
  • On startup, automatically resume exactly once (idempotent / at-most-once)
  • Resume via a new run with:
    • same user goal
    • same plan progress (last completed + next step)
    • relevant intermediate context/artifacts
  • No manual intervention required
  • Explicit failure states, with no silent “running without progress”

Non-Goals (v1)

  • Full RAM snapshot or exact process rehydration
  • Perfect reproduction of all tool/subagent internal state
  • Mandatory full chat transcript restoration
  • External orchestration as a requirement

Current Limitation / Motivation

  • Runs are interrupted by restart.
  • There is no built-in continuation mechanism.
  • External workarounds are required today.
  • Runs may remain “running” without progress after restart, or require manual re-prompting to reconstruct context.

Proposed Approach

Checkpoint Lifecycle

A run that requires a restart should follow a standardized lifecycle:

running → paused_for_restart → resuming → resumed

If needed, resuming can be optional internally, but the persisted state model should support it.
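
One way to make the lifecycle enforceable is a small transition table. The sketch below is an assumption-laden illustration: the status names come from this RFC, while ALLOWED_TRANSITIONS and transition are hypothetical helpers, and the sketch treats resuming as a mandatory intermediate state even though the RFC allows it to be optional internally.

```typescript
type RunStatus =
  | "running"
  | "paused_for_restart"
  | "resuming"
  | "resumed"
  | "failed_resume";

// Allowed next states for each state; terminal states have none.
const ALLOWED_TRANSITIONS: Record<RunStatus, readonly RunStatus[]> = {
  running: ["paused_for_restart"],
  paused_for_restart: ["resuming"],
  resuming: ["resumed", "failed_resume"],
  resumed: [],
  failed_resume: [],
};

// Returns the new status, or throws if the lifecycle would be violated.
function transition(from: RunStatus, to: RunStatus): RunStatus {
  if (ALLOWED_TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

Rejecting illegal transitions in one place keeps a restart loop from, for example, moving a resumed checkpoint back into resuming.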

Trigger Conditions

Checkpointing should occur for:

  • Explicit restart, for example openclaw gateway restart
  • Restart required due to config reload, for example “config change requires restart”

As soon as a restart becomes part of normal task execution, continuation should apply.

Checkpoint Content (minimal)

A checkpoint should be sufficient to resume the same task functionally, not as a full runtime snapshot.

Identity

  • original run/task IDs
  • origin/routing: channel/provider, peer/chat identifiers
  • account ID where applicable
  • timestamps

Task semantics

  • userGoal for the original request
  • plan plus cursor:
    • last completed step
    • next step
  • relevant intermediate results/artifacts:
    • file diffs
    • computed outputs
    • decisions
  • tool-context references, without secrets

Resume metadata

  • resume reason: explicit restart vs restart-required-by-reload
  • resume policy: at-most-once

Status

  • paused_for_restart | resuming | resumed | failed_resume
  • error diagnostics where applicable

Storage should be local, durable, and use atomic writes.
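
To make the minimal checkpoint content concrete, here is a hedged TypeScript sketch of one possible shape plus a small runtime check. The RFC leaves storage format as an open question, so every field name here (runId, routing, plan, and so on) is an assumption for illustration, and parseCheckpoint validates only the fields a resume would strictly need.

```typescript
const STATUSES = ["paused_for_restart", "resuming", "resumed", "failed_resume"] as const;
type CheckpointStatus = (typeof STATUSES)[number];

// Hypothetical checkpoint shape; not a schema defined by OpenClaw.
interface Checkpoint {
  runId: string;
  routing: { channel: string; peerId: string; accountId?: string };
  createdAt: string; // ISO 8601 timestamp
  userGoal: string;
  plan: { lastCompletedStep: string | null; nextStep: string };
  artifacts: string[]; // references such as paths or diff pointers, never secrets
  resumeReason: "explicit_restart" | "reload_required";
  resumePolicy: "at-most-once";
  status: CheckpointStatus;
  error?: string;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Parses and partially validates a persisted checkpoint; throws on bad input.
function parseCheckpoint(json: string): Checkpoint {
  const raw: unknown = JSON.parse(json);
  if (!isRecord(raw)) throw new Error("checkpoint must be a JSON object");
  const runId = raw["runId"];
  const userGoal = raw["userGoal"];
  const plan = raw["plan"];
  const status = raw["status"];
  if (typeof runId !== "string" || runId === "") throw new Error("runId is required");
  if (typeof userGoal !== "string") throw new Error("userGoal is required");
  if (!isRecord(plan) || typeof plan["nextStep"] !== "string") {
    throw new Error("plan.nextStep is required");
  }
  if (typeof status !== "string" || STATUSES.indexOf(status as CheckpointStatus) === -1) {
    throw new Error(`unknown status: ${String(status)}`);
  }
  return raw as unknown as Checkpoint; // remaining fields are trusted in this sketch
}
```

A corrupted or truncated file fails in JSON.parse or in one of the checks, which gives the resume hook a diagnostic to record instead of a silent hang.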

Resume Mechanism

On gateway start:

  • detect checkpoints in paused_for_restart
  • atomically claim one for resume
  • create a new follow-up run seeded with checkpoint context
  • mark checkpoint as resumed, linking it to the new run ID

Idempotency / At-Most-Once

Requirements:

  • no duplicate resume
  • safe under restart loops
  • atomic state transitions

Suggested model:

  • paused_for_restart → resuming(token) → resumed | failed_resume
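
The suggested model can be sketched as a compare-and-set claim followed by a token-checked completion. This is an in-memory, single-process illustration only: CheckpointStore, claim, finish, and resumeOnce are hypothetical names, and a real store would have to persist each transition atomically for the guarantee to hold across processes and restarts.

```typescript
type Status = "paused_for_restart" | "resuming" | "resumed" | "failed_resume";

interface Entry {
  status: Status;
  claimToken?: string;
  resumedRunId?: string;
  error?: string;
}

class CheckpointStore {
  private entries = new Map<string, Entry>();

  put(id: string): void {
    this.entries.set(id, { status: "paused_for_restart" });
  }

  get(id: string): Entry | undefined {
    return this.entries.get(id);
  }

  // Compare-and-set: only a paused checkpoint can be claimed.
  claim(id: string, token: string): boolean {
    const entry = this.entries.get(id);
    if (!entry || entry.status !== "paused_for_restart") return false;
    entry.status = "resuming";
    entry.claimToken = token;
    return true;
  }

  // Only the holder of the claim token may finish the resume.
  finish(id: string, token: string, result: { runId: string } | { error: string }): boolean {
    const entry = this.entries.get(id);
    if (!entry || entry.status !== "resuming" || entry.claimToken !== token) return false;
    if ("runId" in result) {
      entry.status = "resumed";
      entry.resumedRunId = result.runId;
    } else {
      entry.status = "failed_resume";
      entry.error = result.error;
    }
    return true;
  }
}

// Creates at most one follow-up run per checkpoint; returns its ID or null.
function resumeOnce(store: CheckpointStore, id: string, token: string, createRun: () => string): string | null {
  if (!store.claim(id, token)) return null;
  try {
    const runId = createRun();
    store.finish(id, token, { runId });
    return runId;
  } catch (err) {
    store.finish(id, token, { error: err instanceof Error ? err.message : String(err) });
    return null;
  }
}
```

A second startup path that calls resumeOnce for the same checkpoint fails the claim and creates nothing, which is the "no duplicate resume" requirement in miniature.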

Failure Handling

  • checkpoint write failure → explicit error state, no silent restart
  • resume failure → failed_resume with error details
  • bounded retries only, no infinite loops
  • multiple checkpoints resumed independently
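
A minimal sketch of "bounded retries only", assuming a synchronous createRun callback and no backoff (both simplifications for illustration). resumeWithBoundedRetries and ResumeResult are hypothetical names; the failed_resume status and the requirement to keep error details come from the RFC.

```typescript
interface ResumeResult {
  status: "resumed" | "failed_resume";
  attempts: number;
  runId?: string;
  error?: string;
}

// Tries to create the follow-up run at most maxAttempts times, then gives up
// with an explicit failed_resume result instead of looping forever.
function resumeWithBoundedRetries(createRun: () => string, maxAttempts: number): ResumeResult {
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { status: "resumed", attempts: attempt, runId: createRun() };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { status: "failed_resume", attempts: Math.max(0, maxAttempts), error: lastError };
}
```

The last error message is preserved so that the failed_resume state can carry the diagnostics the RFC asks for.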

UI / Messaging Behavior

  • No additional user prompt should be required to resume.
  • Optional minimal info: “Resumed after restart; continuing from step X…”
  • No duplicate notifications and no debug dumps.
  • If resume fails, show explicit status plus error summary.

Backwards Compatibility

  • Feature should activate only when restart is required, either explicitly or internally detected.
  • Existing behavior remains unchanged otherwise.

Acceptance Criteria

  • A run that hits requires_restart is checkpointed before restart.
  • The run transitions to paused_for_restart.
  • After gateway start, resume occurs automatically and exactly once.
  • Resume creates a new run that continues the same task with:
    • same goal
    • same plan progress, including last and next step
    • relevant context and artifacts
  • No manual input is required.
  • No double resume occurs.
  • Clear failure states exist, with no silent hangs.

Open Questions

  • checkpoint storage location and format
  • definition of safe plan-step boundaries
  • handling non-idempotent tool calls with side effects
  • cross-backend consistency for subagents, ACP, and similar runtimes

PR Implementation Plan

Overview / Strategy

Implement this in two small, reviewable phases:

  1. Checkpoint persistence + status model
  2. Auto-resume on gateway startup

v1 should stay intentionally minimal: resume by creating a new follow-up run seeded from checkpoint data, not by trying to rehydrate in-memory runtime state.

Phase 1: Checkpoint + Status Model

Areas likely affected

  • Gateway restart orchestration
  • Restart-required-by-reload logic
  • Run/task runtime state model
  • Persistence/storage layer

Core changes

  • Add explicit statuses:
    • paused_for_restart
    • resuming
    • resumed
    • failed_resume
  • Define checkpoint schema, for example JSON, containing:
    • IDs and routing info
    • goal
    • plan cursor with last and next step
    • artifact summary such as paths and diff pointers
    • tool-context references without secrets
    • resume reason and policy
  • Implement durable checkpoint store, for example under:
    • ~/.openclaw/state/checkpoints/*.json
  • Use atomic write strategy:
    • write temp file
    • rename into place
  • Add checkpoint creation hook:
    • when restart is required, serialize checkpoint for each impacted run
    • transition run state to paused_for_restart
    • fail clearly if checkpoint write fails
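
The write-temp-then-rename strategy could be sketched with Node's fs module as follows. writeCheckpointAtomic is a hypothetical helper, the temp-file naming is an assumption, and the demo writes to a throwaway directory under the OS temp dir rather than to the store path mentioned above. The temp file sits in the same directory as the target so the rename stays on one filesystem.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Writes the checkpoint to a temp file, flushes it, then renames it into place.
// Readers therefore see either the old file or the complete new one.
function writeCheckpointAtomic(dir: string, id: string, checkpoint: unknown): string {
  fs.mkdirSync(dir, { recursive: true });
  const finalPath = path.join(dir, `${id}.json`);
  const tmpPath = path.join(dir, `.${id}.${process.pid}.tmp`);
  try {
    const fd = fs.openSync(tmpPath, "w");
    try {
      fs.writeFileSync(fd, JSON.stringify(checkpoint, null, 2));
      fs.fsyncSync(fd); // flush data before the rename makes it visible
    } finally {
      fs.closeSync(fd);
    }
    fs.renameSync(tmpPath, finalPath);
  } catch (err) {
    fs.rmSync(tmpPath, { force: true }); // do not leave partial temp files behind
    throw err; // fail clearly: the caller must not proceed with the restart
  }
  return finalPath;
}

// Usage against a throwaway directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "checkpoints-"));
const writtenPath = writeCheckpointAtomic(demoDir, "run-1", { status: "paused_for_restart" });
```

Rethrowing on failure matches the hook's last requirement: a checkpoint that could not be written should block or clearly flag the restart rather than be ignored.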

Tests

  • unit tests for schema validation
  • unit tests for atomic write and atomic claim
  • unit tests for run status transitions
  • integration-style test for restart-required flow creating checkpoint and paused state

Risks

  • defining a stable plan cursor if planner state is currently too implicit
  • accidentally persisting secrets from tool or environment context
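
One possible guard against the second risk is to redact suspicious keys before serializing tool context. The sketch below is a key-name heuristic only: SECRET_KEY_PATTERN and redactSecrets are hypothetical, the pattern is an assumption, and it does not catch secrets embedded inside values, so it would complement rather than replace persisting references instead of raw context.

```typescript
// Key names that are treated as secret-bearing in this sketch.
const SECRET_KEY_PATTERN = /(token|secret|password|api[-_]?key|authorization)/i;

// Returns a deep copy with the values of secret-looking keys replaced.
function redactSecrets(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactSecrets(item));
  }
  if (typeof value === "object" && value !== null) {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SECRET_KEY_PATTERN.test(key) ? "[REDACTED]" : redactSecrets(inner);
    }
    return out;
  }
  return value; // primitives are copied as-is
}
```

Because the function builds a new object, the live tool context is left untouched while the checkpoint receives the redacted copy.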

Phase 2: Auto-Resume Hook

Areas likely affected

  • Gateway startup/bootstrap sequence
  • Run scheduler/dispatcher
  • Status reporting

Core changes

  • On gateway start:
    • scan checkpoint store for paused_for_restart
    • claim checkpoint atomically
    • create a new run with:
      • same routing
      • same goal
      • seeded context summary
    • mark checkpoint resumed with resumedRunId
  • Enforce at-most-once:
    • claim token plus persisted transition prevents double resume
    • handle concurrent startup paths safely

Tests

  • unit tests for claim semantics and no double resume
  • integration test:
    • create paused checkpoint
    • run startup resume hook
    • assert exactly one new run is created
    • assert checkpoint is marked resumed
  • failure-mode tests:
    • corrupted checkpoint file → failed_resume with diagnostic
    • resume creation failure → failed_resume without infinite retry

Failure Modes / Handling Checklist

  • Checkpoint write fails → run marked failed and restart blocked or clearly reported
  • Resume fails → checkpoint marked failed_resume; no follow-up run created
  • Restart loop → checkpoint remains paused/resuming with bounded retry or manual intervention path
  • Multiple paused runs → each checkpoint resumed independently with bounded concurrency
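
The last checklist item, independent resumes with bounded concurrency, might be sketched with a small worker pool. resumeAll and its resumeOne callback are hypothetical names for this example; the sketch assumes each checkpoint's resume is an async operation whose failure should mark only that checkpoint as failed_resume.

```typescript
type ResumeOutcome = "resumed" | "failed_resume";

// Resumes every checkpoint ID with at most `limit` resumes in flight.
// One failing checkpoint does not prevent the others from resuming.
async function resumeAll(
  ids: string[],
  resumeOne: (id: string) => Promise<void>,
  limit: number,
): Promise<Map<string, ResumeOutcome>> {
  const results = new Map<string, ResumeOutcome>();
  let next = 0;

  async function worker(): Promise<void> {
    while (next < ids.length) {
      const id = ids[next++]; // safe: no await between the check and the increment
      if (id === undefined) break;
      try {
        await resumeOne(id);
        results.set(id, "resumed");
      } catch {
        results.set(id, "failed_resume");
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, ids.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Each worker pulls the next unclaimed ID, so the number of simultaneous resumes never exceeds the limit regardless of how many checkpoints were paused.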

Deliberate Constraints for v1

  • Not full memory rehydration
  • Not restoring complete subagent graphs
  • Resume is always a new run seeded from checkpoint
  • No external scripts required

extent analysis

TL;DR

Implement a checkpoint and auto-resume mechanism to continue in-progress tasks across gateway restarts by creating a new follow-up run seeded from a persisted checkpoint.

Guidance

  1. Define a checkpoint schema: Determine the necessary information to include in the checkpoint, such as task IDs, user goal, plan progress, and relevant context/artifacts.
  2. Implement durable checkpoint storage: Choose a suitable storage location and format for checkpoints, ensuring atomic writes and data integrity.
  3. Develop an auto-resume hook: Create a mechanism to detect and claim checkpoints on gateway startup, and then create a new run with the same goal and context as the original task.
  4. Enforce at-most-once resume: Implement a token-based system to prevent duplicate resumes and ensure idempotency.
  5. Test and validate: Write comprehensive unit and integration tests to verify the correctness and robustness of the checkpoint and auto-resume mechanism.

Example

Example checkpoint schema (illustrative JSON):
{
  "taskId": "example-task-123",
  "userGoal": "example-goal",
  "planProgress": {
    "lastCompletedStep": "step-1",
    "nextStep": "step-2"
  },
  "context": {
    "artifacts": ["artifact-1", "artifact-2"]
  },
  "resumeReason": "explicit-restart",
  "resumePolicy": "at-most-once"
}

Notes

The implementation should focus on a minimal, functional checkpoint and auto-resume mechanism, without attempting to rehydrate full in-memory state or restore complete subagent graphs.

Recommendation

Apply the proposed checkpoint and auto-resume mechanism to ensure task continuation across gateway restarts, as it provides a pragmatic and effective solution to the current limitation.
