openclaw - ✅(Solved) Fix Docker in-process gateway restart can leave command queue draining while healthz/readyz report OK [1 pull requests, 1 comments, 2 participants]

GitHub stats
openclaw/openclaw#78136 · Fetched 2026-05-06 06:16:37
Comments: 1 · Participants: 2 · Timeline events: 2 (commented ×1, cross-referenced ×1) · Reactions: 2

On a Docker install, an in-process gateway restart appears able to leave the command queue in the gatewayDraining state after the gateway has restarted and logged ready. During this state, model calls are rejected with "Gateway is draining for restart; new tasks are not accepted", while Docker health and OpenClaw readiness endpoints still report healthy/ready.

This looks related to restart/drain behavior in #43178, but this report is narrower: the gateway can be ready from a health/readiness perspective while still rejecting new model work because the internal drain flag is stuck.

Root Cause

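Not confirmed in the report. The observed symptom is that the internal drain flag (gatewayDraining) stays set after an in-process restart, so the gateway reports ready while rejecting new model work. The suspected trigger is two SIGUSR1 restarts handled back-to-back. This looks related to the restart/drain behavior in #43178, but it is narrower.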

Fix Action

PR fix notes

PR #78144: fix: report gateway draining in readiness

Description (problem / solution / changelog)

Summary

Fixes #78136 by making gateway readiness reflect the command-queue restart drain gate. Whenever new model/tool work would be rejected with GatewayDrainingError, /readyz now returns not-ready with the failing reason gateway-draining. /healthz remains a shallow liveness check.

Changes

  • Export isGatewayDraining() from the command queue runtime.
  • Wire gateway readiness to the command-queue drain flag.
  • Avoid caching the transient draining readiness state.
  • Add regression tests for the readiness mismatch and queue drain-state accessor.
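
The changes above can be sketched roughly as follows. This is a minimal illustration, not the code from PR #78144: only isGatewayDraining() and the gateway-draining reason come from the PR description. The flag setters, checkReadiness, readyzStatus, the cache TTL and the 503 status are hypothetical names and values chosen for the example.

```typescript
// Sketch only. Everything except isGatewayDraining() and the
// "gateway-draining" reason is a hypothetical stand-in.
type ReadinessResult = { ready: boolean; failing: string[] };

// Stand-in for the command queue's module-level drain flag.
let gatewayDraining = false;

function markGatewayDraining(): void {
  gatewayDraining = true;
}

function resetGatewayDraining(): void {
  gatewayDraining = false;
}

// Accessor the readiness check reads (name taken from the PR description).
function isGatewayDraining(): boolean {
  return gatewayDraining;
}

const CACHE_TTL_MS = 1_000; // assumed value, for illustration only
let cached: { at: number; result: ReadinessResult } | null = null;

function checkReadiness(now: number, otherFailing: string[] = []): ReadinessResult {
  // Draining is transient: check it before the cache and never store it,
  // so a cached "ready" cannot mask it and a cached "draining" cannot linger.
  if (isGatewayDraining()) {
    return { ready: false, failing: ["gateway-draining", ...otherFailing] };
  }
  if (cached !== null && now - cached.at < CACHE_TTL_MS) {
    return cached.result;
  }
  const result: ReadinessResult = {
    ready: otherFailing.length === 0,
    failing: [...otherFailing],
  };
  cached = { at: now, result };
  return result;
}

// Assumed mapping: 200 when ready, 503 otherwise. The PR text only says
// "/readyz now returns not-ready".
function readyzStatus(result: ReadinessResult): number {
  return result.ready ? 200 : 503;
}
```

Checking the drain flag before consulting the cache is the point of the "avoid caching" bullet: the readiness answer follows the flag on the very next probe in both directions.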

Testing

  • PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/test-projects.mjs src/process/command-queue.test.ts — passed (23 tests)
  • PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/run-vitest.mjs run --config test/vitest/vitest.gateway.config.ts src/gateway/server/readiness.test.ts — passed (15 tests)
  • PATH="/tmp/openclaw-pnpm-shim:$PATH" pnpm oxfmt --check src/process/command-queue.ts src/process/command-queue.test.ts src/gateway/server/readiness.ts src/gateway/server/readiness.test.ts src/gateway/server.impl.ts — passed
  • git diff --check — passed
  • PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/check-changed.mjs — failed in core test typecheck on unrelated pre-existing src/agents/model-fallback.test.ts errors (expectedReason missing at lines 1091/1093); core production typecheck passed.

Fixes openclaw/openclaw#78136

Changed files

  • src/gateway/server.impl.ts (modified, +2/-1)
  • src/gateway/server/readiness.test.ts (modified, +41/-0)
  • src/gateway/server/readiness.ts (modified, +7/-0)
  • src/process/command-queue.test.ts (modified, +4/-0)
  • src/process/command-queue.ts (modified, +4/-0)

Original issue report

Environment

  • OpenClaw: 2026.5.4
  • Image revision: 325df3efefe9c0887d9357732e68fc8556e78d79
  • Image: locally built from ghcr.io/openclaw/openclaw:latest
  • Install method: Docker Compose
  • Host: Raspberry Pi / Linux arm64
  • Kernel: Linux raspberrypi 6.12.62+rpt-rpi-2712 ... aarch64 GNU/Linux
  • Node inside image: 24.14.0
  • Gateway container: openclaw-gateway
  • Docker healthcheck uses /healthz

The compose file is a hardened Docker setup, but it does not set OPENCLAW_NO_RESPAWN. I verified this from the Docker config and from the actual gateway process environment.

What happened

A config change required a gateway restart. Separately, the gateway restart tool also requested a restart. After active work drained, two SIGUSR1 restarts were handled back-to-back. The gateway then logged ready, but subsequent model calls continued to fail with "Gateway is draining for restart; new tasks are not accepted".

Sanitized log sequence:

2026-05-05T23:25:31.686+00:00 [reload] config change requires gateway restart (plugins.installs.codex) — deferring until 2 operation(s), 1 reply(ies), 1 embedded run(s) complete
2026-05-05T23:25:39.964+00:00 [gateway-tool] gateway tool: restart requested (delayMs=default, reason=Apply approved conservative plugins.allow allowlist and bundledDiscovery=allowlist configuration.)
2026-05-05T23:28:57.692+00:00 [reload] all operations and replies completed; restarting gateway now
2026-05-05T23:28:57.694+00:00 [gateway] signal SIGUSR1 received
2026-05-05T23:28:57.705+00:00 [gateway] received SIGUSR1; restarting
2026-05-05T23:28:58.143+00:00 [gateway] restart mode: in-process restart (OPENCLAW_NO_RESPAWN)
2026-05-05T23:28:58.968+00:00 [gateway] signal SIGUSR1 received
2026-05-05T23:28:58.970+00:00 [gateway] received SIGUSR1; restarting
2026-05-05T23:28:58.974+00:00 [gateway] restart mode: in-process restart (OPENCLAW_NO_RESPAWN)
2026-05-05T23:28:59.699+00:00 [gateway] ready
2026-05-05T23:29:03.217+00:00 [model-fallback/decision] model fallback decision: decision=candidate_failed requested=openai-codex/gpt-5.5 candidate=openai-codex/gpt-5.5 reason=unknown next=openai-codex/gpt-5.4-mini detail=Gateway is draining for restart; new tasks are not accepted
...
2026-05-05T23:50:01.419+00:00 [model-fallback/decision] model fallback decision: decision=candidate_failed requested=openai-codex/gpt-5.4-mini candidate=openai-codex/gpt-5.1-codex-max reason=unknown next=openrouter/openai/gpt-5-nano detail=Gateway is draining for restart; new tasks are not accepted

Note: the log says in-process restart (OPENCLAW_NO_RESPAWN), but OPENCLAW_NO_RESPAWN was not present in the container config or in /proc/<gateway-pid>/environ. This seems to be the generic in-process fallback label. The installed code also appears to intentionally disable fresh process respawn in container environments.
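
Because the endpoints reported OK, the log sequence above was the only place the stuck state showed up. A small scanner along these lines could flag it. This is a sketch under assumptions: the marker strings are copied from the sanitized log, while detectStuckDrain and StuckDrainReport are hypothetical names, not part of OpenClaw.

```typescript
// Marker strings copied from the sanitized log above.
const READY_MARKER = "[gateway] ready";
const RESTART_MARKER = "[gateway] received SIGUSR1; restarting";
const DRAIN_MESSAGE = "Gateway is draining for restart; new tasks are not accepted";

// Hypothetical report shape for this example.
interface StuckDrainReport {
  stuck: boolean;
  rejectionsAfterReady: number;
  firstRejectionAt: string | null;
  lastRejectionAt: string | null;
}

// Counts drain rejections logged after the most recent "ready" line.
// Rejections logged while a restart is still in progress are expected
// and are not counted.
function detectStuckDrain(lines: string[]): StuckDrainReport {
  let readySeen = false;
  let count = 0;
  let first: string | null = null;
  let last: string | null = null;

  for (const line of lines) {
    if (line.includes(RESTART_MARKER)) {
      // A new restart began: draining is legitimate until the next "ready".
      readySeen = false;
      count = 0;
      first = null;
      last = null;
    } else if (line.trimEnd().endsWith(READY_MARKER)) {
      readySeen = true;
      count = 0;
      first = null;
      last = null;
    } else if (readySeen && line.includes(DRAIN_MESSAGE)) {
      // The leading token of each log line is its timestamp.
      const match = /^\S+/.exec(line);
      const timestamp = match ? match[0] : null;
      if (first === null) first = timestamp;
      last = timestamp;
      count += 1;
    }
  }

  return {
    stuck: count > 0,
    rejectionsAfterReady: count,
    firstRejectionAt: first,
    lastRejectionAt: last,
  };
}
```

Fed the log above, a scanner like this would report rejections starting at 23:29:03, a few seconds after the ready line at 23:28:59, and still occurring at 23:50:01.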

Health/readiness mismatch

While the gateway was still rejecting model work with "Gateway is draining for restart", the container stayed healthy and the endpoints reported OK:

/healthz -> 200 {"ok":true,"status":"live"}
/readyz  -> 200 {"ready":true,"failing":[], ...}

Docker also showed the container as running/healthy, so the bad state was invisible to Docker health checks.

Expected behavior

After an in-process restart completes and the gateway logs ready, new model/tool work should be accepted again.

If the gateway is still intentionally draining/rejecting new work, /readyz should not return ready.

Duplicate/coalesced restart requests during restart/startup should not leave the command queue permanently draining.

Actual behavior

The gateway logged ready and readiness endpoints returned OK, but model calls continued to fail for at least ~20 minutes with:

Gateway is draining for restart; new tasks are not accepted

Workaround

A host-side Docker Compose recreate of only the gateway clears the bad in-memory state because it gives the gateway a fresh Node process.

Example workaround:

docker compose -f docker-compose.yml -f docker-compose.db.yml up -d --force-recreate openclaw-gateway

For now I am also considering disabling OpenClaw self-restart paths in Docker and using Compose as the only restart mechanism.

Suggested fixes

  • Ensure gatewayDraining is cleared reliably after every in-process restart path.
  • Coalesce or ignore duplicate restart signals while a restart iteration is already in progress.
  • Make /readyz fail when the gateway is rejecting new model/tool work due to drain state.
  • Consider documenting Docker installs as requiring supervisor/container-level restarts for restart-required config changes, or provide a Docker-aware restart mode that asks the host/supervisor to recreate the gateway container.
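
The first two suggestions can be sketched together: coalesce restart requests that arrive while one is in flight, and clear the drain flag on every exit path. This is an illustration under assumptions, not OpenClaw's implementation. RestartCoordinator and restartOnce are hypothetical names, and the real restart loop may be structured differently.

```typescript
// Hypothetical coordinator, not part of OpenClaw.
class RestartCoordinator {
  private draining = false;
  private inFlight: Promise<void> | null = null;
  private readonly restartOnce: () => Promise<void>;

  // restartOnce stands in for whatever performs a single in-process restart.
  constructor(restartOnce: () => Promise<void>) {
    this.restartOnce = restartOnce;
  }

  isDraining(): boolean {
    return this.draining;
  }

  requestRestart(): Promise<void> {
    // Coalesce: a request that arrives mid-restart joins the one in flight
    // instead of starting a second drain cycle.
    if (this.inFlight !== null) {
      return this.inFlight;
    }
    this.draining = true;
    const run = async (): Promise<void> => {
      await this.restartOnce();
    };
    // The finally callback runs whether the restart succeeds or fails,
    // so the drain flag cannot outlive the restart.
    this.inFlight = run().finally(() => {
      this.draining = false;
      this.inFlight = null;
    });
    return this.inFlight;
  }
}
```

With this shape, two back-to-back signals like the SIGUSR1 pair in the log would share one restart, and isDraining() would return false once that restart settles.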

extent analysis

TL;DR

After an in-process restart, the gateway can stay stuck in a draining state: model calls fail while /healthz and /readyz still report OK. Recreating the gateway container with Docker Compose clears the state temporarily.

Guidance

  • Verify that the gatewayDraining flag is being properly cleared after an in-process restart by checking the gateway logs and code.
  • Consider implementing a mechanism to coalesce or ignore duplicate restart signals while a restart iteration is already in progress to prevent the gateway from getting stuck in a draining state.
  • Update the /readyz endpoint to return a failure status when the gateway is rejecting new model/tool work due to drain state, to ensure consistency with the actual gateway state.
  • Explore using a Docker-aware restart mode that asks the host/supervisor to recreate the gateway container, or document the requirement for supervisor/container-level restarts for restart-required config changes in Docker installs.

Example

docker compose -f docker-compose.yml -f docker-compose.db.yml up -d --force-recreate openclaw-gateway

This command can be used as a temporary workaround to recreate the gateway container and clear the bad in-memory state.

Notes

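The issue appears tied to the in-process restart mechanism and its handling of duplicate restart signals: two SIGUSR1 restarts were handled back-to-back before the flag got stuck. PR #78144 makes /readyz report the drain state, but its description does not mention changing how the flag is cleared, so the root cause still needs investigation.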

Recommendation

Until a permanent fix lands, recreate the gateway container with Docker Compose whenever the gateway is stuck draining. This clears the bad in-memory state so the gateway accepts new model calls again.
