openclaw: [Feature] Session Checkpoint and Resume: survive gateway restarts without losing task context

GitHub stats for openclaw/openclaw#78409 (fetched 2026-05-07 03:37:12): 1 comment, 2 participants, 1 timeline event, 2 reactions.

Summary

When the OpenClaw gateway restarts (systemd restart, crash, or system reboot), every in-flight agent session loses its entire in-memory state, including the current task, the tool-call history, and the mid-turn execution context. Users experience this as the agent "forgetting" what it was doing and losing all progress.

This is a fundamental limitation of OpenClaw's session model: session state lives entirely in the memory of the gateway's node process. No durable checkpoint is written during normal agent operation, so nothing exists to resume from after a restart.

Problem Statement

For users running long-duration agentic workflows (code generation, research synthesis, skill evolution pipelines, KG evolution), a gateway restart mid-task means:

  1. Task progress is lost: the agent cannot report what it was working on.
  2. There is no automatic recovery: the user must manually re-explain the context.
  3. Mid-turn execution context is lost: if a long-running tool (exec, Python script, subprocess) was in progress when the restart happened, there is no way to know what state the task was in.
  4. The delivery continuation problem (#76087) compounds this: post-restart continuation messages can get stuck in ~/.openclaw/session-delivery-queue/ without automatic retry.

Observed Behavior

When systemctl --user restart openclaw-gateway is executed:

  • Gateway node process is killed (SIGTERM)
  • All sessions in memory are lost
  • Any active heartbeat cron tasks abort
  • On restart, sessions are created fresh, with no knowledge of the previous task state
  • The agent has no signal that it was "interrupted mid-task"

Related Issues

  • #76087 (open): Restart-sentinel continuation delivery can get stuck in session-delivery-queue/ after transient failure — confirms restart-resume is an acknowledged problem space
  • #75151 (open): Session orphaning after context overflow — same problem family

Proposed Solution

Tier 1: Session Checkpoint (minimum viable)

At the end of each agent turn, write a minimal checkpoint to disk:

  • Current session key
  • Current task description
  • Last executed tool call and its arguments (not results)
  • Timestamp
  • Next expected action

Checkpoint format: JSON in ~/.openclaw/sessions/checkpoints/<session_key>.json
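
A minimal sketch of what an end-of-turn checkpoint write could look like, in Python for illustration only. The function name `write_checkpoint`, the JSON field names, and the write-then-rename strategy are assumptions, not part of OpenClaw; the sketch also assumes the session key is safe to use as a file name.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def write_checkpoint(checkpoint_dir, session_key, task, tool_name, tool_args, next_action):
    """Atomically write one checkpoint file for a session and return its path.

    Hypothetical helper: field names mirror the bullet list in the proposal.
    """
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)
    record = {
        "session_key": session_key,
        "task_description": task,
        # Arguments only; tool results are deliberately not stored.
        "last_tool_call": {"name": tool_name, "arguments": tool_args},
        "timestamp": time.time(),
        "next_expected_action": next_action,
    }
    target = directory / f"{session_key}.json"
    # Write to a temp file in the same directory, then rename over the target,
    # so a crash mid-write does not leave a truncated checkpoint behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(record, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, target)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target
```

Called once per turn, for example `write_checkpoint(Path.home() / ".openclaw/sessions/checkpoints", key, task, "exec", {"cmd": "make test"}, "inspect test output")`, this overwrites the previous checkpoint for the same session, so at most one file exists per session key.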

Tier 2: Restart Detection and Notification

On gateway startup:

  1. Scan checkpoint directory for sessions with recent timestamps
  2. For each such session, inject a system prompt: "Session was interrupted. Last action: <tool_call>. Continue only if user confirms."
  3. Notify the user via channel that a task was in progress and was interrupted
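
The startup scan could be sketched roughly as follows, in Python for illustration only. The function names, the JSON layout (a `timestamp` field in epoch seconds and a `last_tool_call` object with a `name`), and the one-hour recency window are all assumptions; the proposal does not define what counts as "recent".

```python
import json
import time
from pathlib import Path

# Assumption: the proposal does not define "recent"; one hour is a placeholder.
RECENT_WINDOW_SECONDS = 3600


def find_interrupted_sessions(checkpoint_dir, now=None, window=RECENT_WINDOW_SECONDS):
    """Return checkpoint records whose timestamp falls inside the recency window."""
    now = time.time() if now is None else now
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return []
    interrupted = []
    for path in sorted(directory.glob("*.json")):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or corrupt checkpoints
        if not isinstance(record, dict):
            continue
        stamp = record.get("timestamp")
        if isinstance(stamp, (int, float)) and 0 <= now - stamp <= window:
            interrupted.append(record)
    return interrupted


def interruption_notice(record):
    """Build the system-prompt text quoted in step 2 for one checkpoint record."""
    tool = (record.get("last_tool_call") or {}).get("name", "unknown")
    return f"Session was interrupted. Last action: {tool}. Continue only if user confirms."
```

The same notice text could be reused for the channel notification in step 3; how the gateway actually injects prompts or sends channel messages is outside this sketch.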

Tier 3: Automated Resume (stretch goal)

With Tier 2 in place, add an explicit /resume command that reads the checkpoint, reconstructs state, and re-executes the last tool call with identical parameters.
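
As a rough illustration of the /resume flow, the Python sketch below reads a checkpoint and re-runs the recorded tool call. Everything here is hypothetical: `resume_session`, the `tools` registry mapping tool names to callables, and the assumed checkpoint layout (`last_tool_call` with `name` and `arguments` keys in `<session_key>.json`).

```python
import json
from pathlib import Path


def resume_session(checkpoint_dir, session_key, tools):
    """Re-run the last recorded tool call for a session with its saved arguments.

    `tools` is a hypothetical registry mapping tool names to callables.
    """
    path = Path(checkpoint_dir) / f"{session_key}.json"
    record = json.loads(path.read_text(encoding="utf-8"))
    call = record["last_tool_call"]
    handler = tools.get(call["name"])
    if handler is None:
        raise KeyError(f"no tool registered for {call['name']!r}")
    # Identical parameters, as the proposal describes; no results are restored.
    return handler(**call["arguments"])
```

Because the checkpoint stores arguments but not results, a replay like this cannot tell whether the interrupted call had already taken effect, which is one reason to keep the command explicit and user-triggered rather than automatic.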

Why This Matters

For personal AI agent setups that run multi-hour autonomous workflows, gateway restarts are inevitable (system updates, crashes, resource pressure). Without checkpoint/resume, every restart can cost hours of work.

This is a core reliability feature for autonomous agent operation.

References

  • #76087 (open): Restart-sentinel continuation delivery gets stuck — same problem family
  • #75151 (open): Session orphaning after context overflow
  • #10164: Temporal.io integration for durable workflows (closed, related architecture)
