openclaw - [Bug]: New inbound message during deferred gateway restart leaves main session stuck in processing (restart-deferred deadlock)

openclaw/openclaw#72903 (fetched 2026-04-28 06:30:35)


Bug type

Behavior bug (incorrect output/state without crash) — runtime deadlock

Beta release blocker

No

Summary

When config.apply triggers a gateway restart that gets deferred until active task runs drain, any new inbound message that lands in that deferral window opens a brand-new task run. Since restartPending is a closure-local variable inside createGatewayReloadHandlers() and the inbound dispatch path doesn't observe it, the new run is dispatched normally and markProcessing() is called. That run keeps getActiveCounts().totalActive > 0, so deferGatewayRestartUntilIdle() waits indefinitely (default gateway.reload.deferralTimeoutMs is unset → no upper bound). The main session sits in state=processing until the gateway is force-killed.

End-user symptom: webchat hangs (assistant never responds) for tens of minutes, eventually requiring the user to force-reset the host.

Steps to reproduce

  1. Start a gateway with default config (gateway.reload.deferralTimeoutMs unset).
  2. From webchat, kick off a prompt that keeps a task run busy for a few minutes (any long tool loop works — we hit it via a mcporter call remote-browser.* automation, but the tool is incidental).
  3. While that run is active, call config.apply with a change that requires restart, e.g. agents.defaults.model.primary plus any plugins.entries.*.config change.
  4. Observe [reload] config change requires gateway restart (...) — deferring until 1 task run(s) complete.
  5. Within the deferral window, send a new message in the same webchat session.
  6. [diagnostic] stuck session ... state=processing age=Ns starts firing every 30 s and age grows without bound.
  7. The gateway never restarts. Killing the process is the only exit.

Expected behavior

While a restart is pending, inbound messages should not be allowed to start new task runs that themselves block the deferral. Reasonable options: reject with a gateway restarting reply, queue and dispatch after restart, or impose a hard timeout on the deferral.

Actual behavior

The new message is dispatched as if no restart were pending. It opens a new task run, totalActive stays > 0 forever, and the session is stuck until external kill.
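
The sequence in the reproduction steps can be replayed in a toy model. This is only a sketch under stated assumptions: ToyGateway and its methods are hypothetical names, not OpenClaw classes, and the model captures just two facts from this report, namely that dispatch never consults the pending-restart flag and that the restart waits for the active count to reach zero.

```typescript
// Toy model only: hypothetical names, not the gateway's real code.
class ToyGateway {
  activeRuns = 0;
  restartPending = false;
  restarted = false;

  // Mirrors the reported behavior: dispatch ignores restartPending.
  dispatchInbound(): void {
    this.activeRuns += 1;
  }

  // A run finishing is the only event that can release a deferred restart.
  completeRun(): void {
    if (this.activeRuns > 0) this.activeRuns -= 1;
    this.maybeRestart();
  }

  // Restart immediately if idle, otherwise defer until idle.
  requestRestart(): void {
    this.restartPending = true;
    this.maybeRestart();
  }

  private maybeRestart(): void {
    if (this.restartPending && this.activeRuns === 0) {
      this.restarted = true;
      this.restartPending = false;
    }
  }
}

const gw = new ToyGateway();
gw.dispatchInbound(); // step 2: long-running task run
gw.requestRestart();  // step 3: config.apply needs a restart, so it is deferred
gw.dispatchInbound(); // step 5: new message inside the deferral window
gw.completeRun();     // the original run finishes
console.log(gw.restarted, gw.activeRuns); // false 1
```

In this model the restart now depends on a run that started after the restart was requested. If that run never completes, nothing bounds the wait, which matches the unbounded age growth in the diagnostics.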

OpenClaw version

Reproduced against upstream/main @ 8ce4f8fc84 (2026-04-26). Originally hit on a slightly older build deployed via Sider's Go wrapper.

Operating system

Ubuntu 24.04 / kernel 6.8.0-100-generic / Aliyun ECS (2 vCPU, 4 GB)

Install method

npm global, run via Sider's Go wrapper (Sider-ai/openclaw-console). The wrapper does not touch reload or session-state code paths.

Model

gpt-5.5 via openai-responses. Not required for the deadlock — any model + any tool that keeps a task run busy works.

Provider / routing chain

OpenClaw → openai-responses; tool path uses mcporter for remote-browser MCP. Incidental.

Additional provider/model setup details

n/a

Logs, screenshots, and evidence

Trimmed timeline from the incident (host clock):

22:42:42  [gateway] config.apply write changedPaths=agents.defaults.model.primary restartReason=config.apply
22:42:43  [reload] config change detected; evaluating reload (agents.defaults.model.primary, ..., plugins.entries.openai.config, plugins.entries.test-openclaw-sider.config, plugins.entries.memory-core, commands, messages)
22:42:43  [reload] config change requires gateway restart (plugins.entries.openai.config, ..., commands) — deferring until 1 task run(s) complete
22:43:32  [reload] restart still deferred after 49378ms with 1 task run(s) active
22:44:39  [reload] restart still deferred after 115905ms with 2 operation(s), 1 reply(ies), 1 embedded run(s), 1 task run(s) active
22:44:54  before_message_write { role: 'user', content: <new prompt while reload is still deferred> }
22:47:09  [diagnostic] stuck session: sessionId=main state=processing age=135s queueDepth=1
...
23:10:40  [diagnostic] stuck session: sessionId=main state=processing age=1546s queueDepth=1
23:10:50  [reload] restart still deferred after 1687477ms with 2 operation(s), 1 reply(ies), 1 embedded run(s), 1 task run(s) active
23:11:04  <last log line — host force-reset via cloud console; 21 min of silence, then cold boot>

OS-level metrics during the same window were healthy (CPU < 50 %, mem < 40 % of 4 GB, no OOM, no kernel errors, no IO wait, journal kept writing until the last second). The deadlock is purely application-level.
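
As a cross-check on the timeline, the elapsed values in the "restart still deferred" lines can be subtracted from their timestamps to see when each one implies the deferral began. The helper below is a hypothetical sketch written for this report, not part of OpenClaw; it assumes the log timestamps have one-second resolution.

```typescript
// Convert "HH:MM:SS" to seconds since midnight.
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// (log timestamp, "restart still deferred after N ms") pairs from the timeline.
const samples: Array<[string, number]> = [
  ["22:43:32", 49378],
  ["22:44:39", 115905],
  ["23:10:50", 1687477],
];

// The "deferring until 1 task run(s) complete" line was logged at 22:42:43.
const deferralLoggedAt = toSeconds("22:42:43");

// Implied start of the deferral, in seconds since midnight, for each sample.
const impliedStarts = samples.map(
  ([at, elapsedMs]) => toSeconds(at) - elapsedMs / 1000,
);

for (const start of impliedStarts) {
  console.log((start - deferralLoggedAt).toFixed(3));
}
```

Each implied start lands within about one second of 22:42:43, which is consistent with a single deferral that ran continuously for roughly 28 minutes rather than several shorter ones.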

Code references (upstream/main @ 8ce4f8fc84)

  • restartPending is a closure-local let, never exposed: src/gateway/server-reload-handlers.ts:315, used at L332 / L338 / L349 / L360 / L366.
  • deferGatewayRestartUntilIdle() only times out when maxWaitMs is a finite positive number (src/infra/restart.ts:392-450). Default gateway.reload.deferralTimeoutMs is unset → no upper bound.
  • The websocket inbound path (src/gateway/server/ws-connection/message-handler.ts) and the dispatcher (src/auto-reply/reply/dispatch-from-config.ts) contain zero references to restartPending / reloadPending / gatewayRestarting. markProcessing() at dispatch-from-config.ts:710 runs unconditionally for every new task run.
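
The timeout rule in the second bullet can be sketched as a small decision function. This is a hypothetical model of the rule as described here, not the code in src/infra/restart.ts; decideDeferral, hasUpperBound and the constant are illustrative names, and the 60 s value is only the example given under the suggested fix directions.

```typescript
// Hypothetical model of the deferral decision described in this report.
type DeferralDecision = "restart-now" | "restart-on-timeout" | "keep-waiting";

// Example value from the issue's suggestions; today no default exists.
const SUGGESTED_DEFAULT_MAX_WAIT_MS = 60000;

// A timeout applies only when maxWaitMs is a finite positive number.
function hasUpperBound(maxWaitMs: number | undefined): maxWaitMs is number {
  return (
    typeof maxWaitMs === "number" &&
    Number.isFinite(maxWaitMs) &&
    maxWaitMs > 0
  );
}

function decideDeferral(
  totalActive: number,
  elapsedMs: number,
  maxWaitMs?: number,
): DeferralDecision {
  if (totalActive === 0) return "restart-now";
  if (hasUpperBound(maxWaitMs) && elapsedMs >= maxWaitMs) {
    return "restart-on-timeout";
  }
  return "keep-waiting";
}

// Last sample from the incident log: 1687477 ms elapsed, work still active.
console.log(decideDeferral(1, 1687477, undefined)); // keep-waiting
console.log(decideDeferral(1, 1687477, SUGGESTED_DEFAULT_MAX_WAIT_MS)); // restart-on-timeout
```

With maxWaitMs unset, no elapsed time is large enough to end the wait. Any finite positive bound would have ended it long before the host was reset.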

Relation to #63974

Not a duplicate. #63974 was closed after the fix tracked pending outbound replies in the deferral count and made outbound delivery durable across restarts. Those changes do not address the inbound side: a new inbound webchat message during the deferral still opens a fresh task run and contributes to totalActive, which is exactly what creates this deadlock. If anything, #63974's fix widens the deferral window (it now also waits for replies), which makes inbound messages more likely to land inside it.

Impact and severity

High when it triggers. Real-user symptom on the affected box: webchat appears completely frozen for ~25 minutes; only resolved by force-resetting the host. The user has no in-process recovery path — the gateway is alive and won't exit on its own. Triggering condition is a natural sequence: long task running + change default model from the UI + keep chatting.

Suggested fix directions

  1. Hoist restartPending (and the active-count reasons) into a small accessor the inbound path can read; have the inbound dispatcher either reject or queue new messages while a restart is pending.
  2. As a safety net, give deferGatewayRestartUntilIdle() a sane default maxWaitMs (e.g. 60 s) so the gateway always eventually restarts even if active counts never drop. Today the default is "wait forever".
  3. Add a regression test alongside server-restart-deferral.test.ts that holds a task run open (not just a reply) and dispatches a fresh inbound message during the deferral, asserting the session does not end up in state=processing indefinitely.
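
Direction 1 could look roughly like the sketch below: a small state holder the reload handlers write to and the inbound path reads before opening a task run. Everything here is an assumption for illustration (createRestartState, gateInbound, the policy names and the reply text are hypothetical); the real change would have to fit the existing reload handlers and dispatcher.

```typescript
// Hypothetical sketch of a shared restart-state accessor and an inbound gate.
interface RestartState {
  markPending(reason: string): void;
  clear(): void;
  pendingReason(): string | undefined;
}

function createRestartState(): RestartState {
  let reason: string | undefined;
  return {
    markPending(r: string): void {
      reason = r;
    },
    clear(): void {
      reason = undefined;
    },
    pendingReason(): string | undefined {
      return reason;
    },
  };
}

type GatePolicy = "reject" | "queue";

type InboundDecision =
  | { action: "dispatch" }
  | { action: "reject"; reply: string }
  | { action: "queue" };

// Decide what to do with a new inbound message before any task run is opened.
function gateInbound(state: RestartState, policy: GatePolicy): InboundDecision {
  const reason = state.pendingReason();
  if (reason === undefined) return { action: "dispatch" };
  if (policy === "queue") return { action: "queue" };
  return {
    action: "reject",
    reply: `Gateway is restarting (${reason}); please retry shortly.`,
  };
}

const restartState = createRestartState();
restartState.markPending("config.apply");
console.log(gateInbound(restartState, "reject").action); // reject
```

The point of gating before dispatch is that a rejected or queued message never increments the active count, so it cannot extend the deferral it arrived in.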

Happy to put up a PR if directions 1 + 2 look reasonable.

extent analysis

TL;DR

To fix the runtime deadlock, hoist restartPending into an accessible variable and modify the inbound dispatcher to either reject or queue new messages while a restart is pending.

Guidance

  1. Hoist restartPending: Make restartPending accessible to the inbound path to check the restart status.
  2. Modify the dispatcher: Update the dispatcher to either reject new messages or queue them when a restart is pending, preventing new task runs from blocking the deferral.
  3. Set a default maxWaitMs: Provide a default maxWaitMs value (e.g., 60 seconds) for deferGatewayRestartUntilIdle() to ensure the gateway restarts even if active counts never drop.
  4. Add regression testing: Create a test that simulates a task run during deferral and dispatches a new inbound message, verifying the session does not indefinitely stay in state=processing.

Example

The issue itself does not include a code example; the suggested fix directions describe the intended changes.

Notes

These directions follow the issue author's analysis and target the runtime deadlock directly. They were not checked against the codebase here, so review and test any change carefully.

Recommendation

Apply the suggested fix directions, specifically hoisting restartPending and modifying the dispatcher, as these changes directly address the identified issue and prevent the deadlock.
