openclaw: Config-reload deferral logged but not honored; systemd SIGTERM kills gateway, in-flight user message lost with no retry

openclaw/openclaw#73918 (fetched 2026-04-29 06:13:16)

Config-reload deferral logged but not honored: systemd SIGTERM kills gateway 9s later, inbound user message dropped, no retry

Summary

When a config change triggers a gateway restart, the gateway's "defer until N operations complete" logic runs and logs the deferral, but systemd issues SIGTERM ~9 seconds later regardless. Any user message that landed during the deferral window is persisted to the session JSONL but never receives an assistant reply, and the gateway has no retry-on-restart for in-flight dispatches — so it stays orphaned indefinitely.

In my case: an inbound Telegram message (Is that ok?, msg 3675) landed at 23:00:30 UTC. The next assistant reply for that turn happened 2 hours 14 minutes later, only after the user re-pinged. Both messages were processed in the same turn at 01:14 UTC.

This is related to but distinct from:

  • #57425 (broad "graceful restart with session recovery" feature)
  • #71178 (openclaw update mid-turn message loss)

This issue is the concrete config-reload + systemd path, with two narrow bugs that could be fixed in isolation.

Environment

  • OpenClaw: 2026.4.25
  • Linux 6.8.0-110-generic, node 25.x
  • Gateway under systemd user unit (openclaw-gateway.service)
  • Channel: Telegram, forum group with thread routing
  • Agent: claude-max-proxy backend, model claude-opus-4-7
  • Date: 2026-04-28 → 2026-04-29

Timeline (real incident, journalctl --user)

| Time (UTC) | Event |
| --- | --- |
| 23:00:30 | User msg 3675 ("Is that ok?") arrives, persisted to session JSONL with runtime context |
| 23:01:03 | [reload] config change detected; evaluating reload (browser.ssrfPolicy.allowedHostnames) |
| 23:01:09 | [reload] config change requires gateway restart — deferring until 4 operation(s), 2 reply(ies), 2 embedded run(s) complete |
| 23:01:18 | systemd: Stopping openclaw-gateway.service (deferral did not hold) |
| 23:01:18 | [gateway] signal SIGTERM received; shutting down |
| 23:01:19 | Stopped openclaw-gateway.service. Consumed 4h 22min CPU time |
| 23:01:41 | New gateway process loading configuration |
| 23:01:59 | [gateway] ready (6 plugins; 17.7s) |
| 23:01:59 | No retry of msg 3675. No assistant reply ever generated for that turn. |
| 01:14 (next day) | User re-pings ("Why didn't you respond earlier..."), Claude finally sees both messages and responds to both at once |

Total dispatch loss: 2h 14min, only resolved by user-initiated re-prompt.
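The intervals quoted in this report can be re-derived from the timeline. The snippet below is only a quick arithmetic check; the timestamps are copied from the table, except that the seconds of the 01:14 re-ping are not recorded and are assumed to be :00.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline (UTC). The report gives the re-ping
# only to the minute, so 01:14:00 is an assumption.
msg_arrived = datetime(2026, 4, 28, 23, 0, 30)
deferral_logged = datetime(2026, 4, 28, 23, 1, 9)
sigterm_received = datetime(2026, 4, 28, 23, 1, 18)
gateway_ready = datetime(2026, 4, 28, 23, 1, 59)
user_repinged = datetime(2026, 4, 29, 1, 14, 0)

deferral_held_for = sigterm_received - deferral_logged
gateway_downtime = gateway_ready - sigterm_received
reply_delay = user_repinged - msg_arrived

print(deferral_held_for)  # 0:00:09
print(gateway_downtime)   # 0:00:41
print(reply_delay)        # 2:13:30
```

The gateway itself was unavailable for well under a minute; nearly all of the delay comes from the message never being re-dispatched after the restart.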

Root cause: two narrow bugs

Bug 1 — Deferral isn't honored by systemd

The reload path logs "deferring until N operations complete" (the 23:01:09 entry in the timeline), but systemd sends SIGTERM to the unit 9 seconds later anyway. Either:

  • The deferral mechanism is purely in-process (logs intent but doesn't actually delay the systemctl restart call), or
  • systemctl restart is being called immediately after the deferral logging without honoring the in-process gate, or
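  • TimeoutStopSec in the unit file is too short for the 4 operations + 2 replies + 2 embedded runs to drain (though TimeoutStopSec only bounds the wait between SIGTERM and SIGKILL, and the log shows the gateway exiting about one second after SIGTERM, so this alone would not explain why SIGTERM arrived early)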

Whichever it is, the user-facing effect is that the deferral log message is misleading — it suggests the gateway will hold off, but it doesn't.

Bug 2 — No retry-on-restart for already-persisted messages

The user message landed at 23:00:30 and was persisted to the session JSONL before the restart. After the restart at 23:01:59, the gateway came up clean, but it never scanned the session JSONLs for user messages that have no assistant reply after them.

The data is already on disk. The only missing piece is a startup pass that:

  1. Walks recent session JSONLs (last hour, say)
  2. Identifies turns where the last entry is a user message with no assistant reply
  3. Re-dispatches those to the appropriate agent

This would close the gap for any restart cause — config reload, openclaw update, OOM, manual restart — without needing a per-cause fix.

Proposed fixes

Fix 1 (smaller, easier): make deferral actually defer

If the deferral mechanism is meant to gate restart, wire it through to systemctl. Either:

  • systemctl restart --no-block immediately, but have the gateway internally hold the SIGTERM handler until ops drain (suspect this is what the current logic thinks it does)
  • Or: the config-watcher should not call systemctl restart directly — it should set a "pending restart" flag, complete the in-flight ops, then call restart

If the deferral is purely advisory and the restart is non-negotiable, remove the misleading log line so operators don't think they have a grace period.
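A minimal sketch of the "pending restart" flag idea, assuming the gateway can count its in-flight operations. `RestartGate` and its method names are hypothetical and are not OpenClaw APIs; `do_restart` stands in for whatever actually triggers the restart (for example a wrapper around systemctl restart).

```python
import threading
from typing import Callable


class RestartGate:
    """Hypothetical in-process gate: defer a restart until in-flight work drains."""

    def __init__(self, do_restart: Callable[[], None]) -> None:
        self._do_restart = do_restart  # assumed hook that performs the real restart
        self._lock = threading.Lock()
        self._in_flight = 0
        self._restart_pending = False
        self._restarted = False

    def begin_operation(self) -> None:
        with self._lock:
            self._in_flight += 1

    def end_operation(self) -> None:
        with self._lock:
            self._in_flight -= 1
            fire = self._should_fire()
        if fire:
            self._do_restart()

    def request_restart(self) -> bool:
        """Return True if the restart ran immediately, False if it was deferred."""
        with self._lock:
            self._restart_pending = True
            fire = self._should_fire()
        if fire:
            self._do_restart()
        return fire

    def _should_fire(self) -> bool:
        # Caller holds the lock. The restart fires at most once.
        if self._restart_pending and self._in_flight == 0 and not self._restarted:
            self._restarted = True
            return True
        return False


# Usage: two operations are in flight when the config change arrives.
calls = []
gate = RestartGate(lambda: calls.append("restart"))
gate.begin_operation()
gate.begin_operation()
print(gate.request_restart())  # False: deferred, nothing restarted yet
gate.end_operation()
gate.end_operation()
print(calls)  # ['restart']
```

The point of the design is that the config watcher never triggers the restart itself; only the gate does, once the count reaches zero. A real implementation would probably also need a maximum wait and a policy for operations that start while a restart is pending, neither of which is shown here.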

Fix 2 (bigger, more durable): startup retry from session JSONL

On gateway startup, after channels and plugins load:

for each session jsonl modified in last 60 minutes:
    last_entry = tail -1
    if last_entry.role == "user" and no_assistant_reply_after(last_entry):
        dispatch_to_agent(session, last_entry)
        log "[startup] retried orphaned user message <id> from <session>"

This piggybacks on the existing JSONL persistence and would fix the entire class of "restart killed in-flight dispatch" bugs covered partially by #57425, #71178, #71429, and this issue.
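The detection half of Fix 2 could look roughly like the sketch below. It assumes one *.jsonl file per session in a single directory and a "role" field on each JSON line; the real OpenClaw session layout and schema may differ, and `last_entry` and `find_orphaned` are hypothetical names.

```python
import json
import time
from pathlib import Path
from typing import Iterator, Optional, Tuple


def last_entry(path: Path) -> Optional[dict]:
    """Return the last parseable JSON object in a JSONL file, or None."""
    last = None
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a half-written line from an abrupt shutdown
            if isinstance(obj, dict):
                last = obj
    return last


def find_orphaned(
    session_dir: Path, max_age_s: float = 3600.0, now: Optional[float] = None
) -> Iterator[Tuple[Path, dict]]:
    """Yield (file, entry) for recently modified sessions that end on a user message."""
    now = time.time() if now is None else now
    for path in sorted(session_dir.glob("*.jsonl")):
        if now - path.stat().st_mtime > max_age_s:
            continue  # outside the retry window (60 minutes by default)
        entry = last_entry(path)
        # Assumed schema: each entry carries a "role" of "user" or "assistant".
        if entry is not None and entry.get("role") == "user":
            yield path, entry
```

On startup the gateway would iterate over `find_orphaned(...)` and hand each entry to its normal dispatch path (the `dispatch_to_agent` of the pseudocode), logging each retry. Re-dispatch should be idempotent, for example by recording which message ids have already been retried, so that a second restart during the retry does not produce duplicate replies.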

Severity

High for any user using OpenClaw as a primary chat surface. Silent message loss with multi-hour delay is the worst possible failure mode — the user thinks the assistant is ignoring them, the assistant has no record of being asked, and recovery requires the user to figure out something is wrong and re-prompt. In my case the assistant only realized what happened after the user explicitly asked "why didn't you respond earlier."

The exact config change in this incident was an SSRF allowlist update — a routine operation. This will fire any time someone touches browser.ssrfPolicy (or any other reloadable config) while a conversation is active.

Workaround

None. Until either fix lands, users have to notice the silence and re-prompt manually.

Cross-references

  • #57425 (Feature: Graceful Gateway Restart with Session Recovery) — this issue is a concrete instance / smaller scope
  • #71178 (openclaw update mid-turn message loss) — same failure class, different trigger
  • #71429 (Telegram drops in-flight messages on sendChatAction failure) — same data-loss surface
  • #55412 (GatewayDrainingError should auto-retry) — adjacent

Extended analysis

TL;DR

Implement a startup retry mechanism from session JSONL to handle orphaned user messages after a gateway restart.

Guidance

  • Confirm both root causes described in the issue: the deferral is logged but not honored when systemd stops the unit, and there is no retry-on-restart for already-persisted messages.
  • Consider implementing a startup retry mechanism that walks recent session JSONLs, identifies turns with no assistant reply, and re-dispatches those messages to the agent.
  • Review the TimeoutStopSec value in the systemd unit file to ensure it allows sufficient time for in-flight operations to complete.
  • Evaluate the feasibility of making the deferral mechanism actually defer the restart by wiring it through to systemctl or setting a "pending restart" flag.

Notes

The provided solution focuses on the retry-on-restart mechanism, which addresses the specific issue of orphaned user messages. However, the deferral mechanism not being honored by systemd may still require additional attention.

Recommendation

No workaround exists today. Of the two proposed fixes, the startup retry from session JSONL is the more comprehensive, because it addresses silent message loss for every restart cause. The deferral fix is still needed so that the deferral log line reflects what actually happens.
