openclaw - ✅(Solved) Fix Docker in-process gateway restart can leave command queue draining while healthz/readyz report OK [1 pull request, 1 comment, 2 participants]

Q: Expected behavior

After an in-process restart completes and the gateway logs `ready`, new model/tool work should be accepted again. If the gateway is still intentionally draining/rejecting new work, `/readyz` should not return ready. Duplicate/coalesced restart requests during restart/startup should not leave the command queue permanently draining.

openclaw, 2026-05-05 23:56:05

GitHub stats

openclaw/openclaw#78136•Fetched 2026-05-06 06:16:37

Author

maxschachere

Participants

clawsweeper[bot]

maxschachere

Timeline (top)

commented ×1, cross-referenced ×1

On a Docker install, an in-process gateway restart appears able to leave the command queue in the `gatewayDraining` state after the gateway has restarted and logged `ready`. In this state, model calls are rejected with `Gateway is draining for restart; new tasks are not accepted`, while Docker health and OpenClaw readiness endpoints still report healthy/ready.

This looks related to restart/drain behavior in #43178, but this report is narrower: the gateway can be ready from a health/readiness perspective while still rejecting new model work because the internal drain flag is stuck.

PR fix notes

PR #78144: fix: report gateway draining in readiness

Repository: openclaw/openclaw
Author: bryce-d-greybeard
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/78144

Description (problem / solution / changelog)

Summary

Fixes #78136 by making gateway readiness reflect the command-queue restart drain gate. While GatewayDrainingError would reject new model/tool work, /readyz now returns not-ready with gateway-draining; /healthz remains shallow liveness.

Changes

Export isGatewayDraining() from the command queue runtime.
Wire gateway readiness to the command-queue drain flag.
Avoid caching the transient draining readiness state.
Add regression tests for the readiness mismatch and queue drain-state accessor.
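
A minimal sketch of how the changes listed above might fit together, not the actual patch. Only `isGatewayDraining()`, `GatewayDrainingError`, the `gateway-draining` reason, and the `ready`/`failing` fields come from this report; `setGatewayDraining`, `enqueueCommand`, and `evaluateReadiness` are hypothetical names, and the real modules are likely structured differently.

```typescript
// Hypothetical stand-in for the command queue runtime's drain flag.
let gatewayDraining = false;

class GatewayDrainingError extends Error {
  constructor() {
    super("Gateway is draining for restart; new tasks are not accepted");
    this.name = "GatewayDrainingError";
  }
}

// Assumed setter; the real code may flip the flag differently.
function setGatewayDraining(value: boolean): void {
  gatewayDraining = value;
}

// Accessor named in the PR description.
function isGatewayDraining(): boolean {
  return gatewayDraining;
}

// Assumed entry point: new work is rejected while the drain gate is closed.
function enqueueCommand<T>(task: () => T): T {
  if (gatewayDraining) throw new GatewayDrainingError();
  return task();
}

interface ReadinessResult {
  ready: boolean;
  failing: string[];
}

// Hypothetical readiness check. It is recomputed on every call, so the
// transient draining state is never served from a cache.
function evaluateReadiness(otherFailing: string[] = []): ReadinessResult {
  const failing = otherFailing.slice();
  if (isGatewayDraining()) failing.push("gateway-draining");
  return { ready: failing.length === 0, failing };
}
```

With this shape, readiness and task admission read the same flag, so they cannot disagree: whenever `enqueueCommand` would throw, `evaluateReadiness` reports not-ready.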

Testing

PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/test-projects.mjs src/process/command-queue.test.ts — passed (23 tests)
PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/run-vitest.mjs run --config test/vitest/vitest.gateway.config.ts src/gateway/server/readiness.test.ts — passed (15 tests)
PATH="/tmp/openclaw-pnpm-shim:$PATH" pnpm oxfmt --check src/process/command-queue.ts src/process/command-queue.test.ts src/gateway/server/readiness.ts src/gateway/server/readiness.test.ts src/gateway/server.impl.ts — passed
git diff --check — passed
PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/check-changed.mjs — failed in core test typecheck on unrelated pre-existing src/agents/model-fallback.test.ts errors (expectedReason missing at lines 1091/1093); core production typecheck passed.

Fixes openclaw/openclaw#78136

Changed files

src/gateway/server.impl.ts (modified, +2/-1)
src/gateway/server/readiness.test.ts (modified, +41/-0)
src/gateway/server/readiness.ts (modified, +7/-0)
src/process/command-queue.test.ts (modified, +4/-0)
src/process/command-queue.ts (modified, +4/-0)

Summary

Environment

OpenClaw: 2026.5.4
Image revision: 325df3efefe9c0887d9357732e68fc8556e78d79
Image: locally built from ghcr.io/openclaw/openclaw:latest
Install method: Docker Compose
Host: Raspberry Pi / Linux arm64
Kernel: Linux raspberrypi 6.12.62+rpt-rpi-2712 ... aarch64 GNU/Linux
Node inside image: 24.14.0
Gateway container: openclaw-gateway
Docker healthcheck uses /healthz

The compose file is a hardened Docker setup, but it does not set OPENCLAW_NO_RESPAWN. Verified from Docker config and from the actual gateway process environment.

What happened

A config change required a gateway restart. Separately, the gateway restart tool also requested a restart. After active work drained, two SIGUSR1 restarts were handled back-to-back. The gateway then logged ready, but subsequent model calls continued to fail with Gateway is draining for restart; new tasks are not accepted.

Sanitized log sequence:

2026-05-05T23:25:31.686+00:00 [reload] config change requires gateway restart (plugins.installs.codex) — deferring until 2 operation(s), 1 reply(ies), 1 embedded run(s) complete
2026-05-05T23:25:39.964+00:00 [gateway-tool] gateway tool: restart requested (delayMs=default, reason=Apply approved conservative plugins.allow allowlist and bundledDiscovery=allowlist configuration.)
2026-05-05T23:28:57.692+00:00 [reload] all operations and replies completed; restarting gateway now
2026-05-05T23:28:57.694+00:00 [gateway] signal SIGUSR1 received
2026-05-05T23:28:57.705+00:00 [gateway] received SIGUSR1; restarting
2026-05-05T23:28:58.143+00:00 [gateway] restart mode: in-process restart (OPENCLAW_NO_RESPAWN)
2026-05-05T23:28:58.968+00:00 [gateway] signal SIGUSR1 received
2026-05-05T23:28:58.970+00:00 [gateway] received SIGUSR1; restarting
2026-05-05T23:28:58.974+00:00 [gateway] restart mode: in-process restart (OPENCLAW_NO_RESPAWN)
2026-05-05T23:28:59.699+00:00 [gateway] ready
2026-05-05T23:29:03.217+00:00 [model-fallback/decision] model fallback decision: decision=candidate_failed requested=openai-codex/gpt-5.5 candidate=openai-codex/gpt-5.5 reason=unknown next=openai-codex/gpt-5.4-mini detail=Gateway is draining for restart; new tasks are not accepted
...
2026-05-05T23:50:01.419+00:00 [model-fallback/decision] model fallback decision: decision=candidate_failed requested=openai-codex/gpt-5.4-mini candidate=openai-codex/gpt-5.1-codex-max reason=unknown next=openrouter/openai/gpt-5-nano detail=Gateway is draining for restart; new tasks are not accepted

Note: the log says in-process restart (OPENCLAW_NO_RESPAWN), but OPENCLAW_NO_RESPAWN was not present in the container config or in /proc/<gateway-pid>/environ. This seems to be the generic in-process fallback label. The installed code also appears to intentionally disable fresh process respawn in container environments.
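
The environment check described in this note can be reproduced with a small helper. Linux exposes `/proc/<pid>/environ` as NUL-separated `KEY=VALUE` entries, and the sketch below parses that format. `parseProcEnviron` and `hasEnvVar` are hypothetical helper names, not part of OpenClaw.

```typescript
// Parse the raw contents of /proc/<pid>/environ (NUL-separated KEY=VALUE entries).
function parseProcEnviron(raw: string): Record<string, string> {
  const env: Record<string, string> = {};
  const entries = raw.split("\0");
  for (let i = 0; i < entries.length; i++) {
    const entry = entries[i];
    if (entry.length === 0) continue; // a trailing NUL leaves an empty entry
    const eq = entry.indexOf("=");
    if (eq <= 0) continue; // skip malformed entries that have no key
    env[entry.slice(0, eq)] = entry.slice(eq + 1);
  }
  return env;
}

// True if the variable is set at all, even to an empty value.
function hasEnvVar(raw: string, name: string): boolean {
  return Object.prototype.hasOwnProperty.call(parseProcEnviron(raw), name);
}
```

On a Linux host the raw string could come from reading `/proc/<gateway-pid>/environ` as text; `hasEnvVar(raw, "OPENCLAW_NO_RESPAWN")` then answers whether the variable is actually present in the running process, independent of what the log label says.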

Health/readiness mismatch

While the gateway was still rejecting model work with Gateway is draining for restart, the container stayed healthy and the endpoints reported OK:

/healthz -> 200 {"ok":true,"status":"live"}
/readyz  -> 200 {"ready":true,"failing":[], ...}

Docker also showed the container as running/healthy, so the bad state was invisible to Docker health checks.
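
Until readiness reflects the drain state, one way to surface this mismatch from outside the gateway is to compare the `/readyz` body with the outcome of a real task. A minimal sketch: `ProbeSample` and `isStuckDraining` are hypothetical names, the `/readyz` fields match the response shown above, and how the last task error is collected is left to the caller.

```typescript
interface ReadyzBody {
  ready: boolean;
  failing: string[];
}

interface ProbeSample {
  readyz: ReadyzBody;
  // Error message from the most recent model/tool call, if it failed.
  lastTaskError?: string;
}

const DRAINING_MESSAGE = "Gateway is draining for restart";

// Flags the state from this report: readiness says OK while new work is
// still being rejected with the draining error.
function isStuckDraining(sample: ProbeSample): boolean {
  const rejecting =
    sample.lastTaskError !== undefined &&
    sample.lastTaskError.indexOf(DRAINING_MESSAGE) !== -1;
  return sample.readyz.ready && rejecting;
}
```

A watchdog built on this check would only be a stopgap. It detects the symptom but does not clear the stuck flag.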

Expected behavior

After an in-process restart completes and the gateway logs ready, new model/tool work should be accepted again.

If the gateway is still intentionally draining/rejecting new work, /readyz should not return ready.

Duplicate/coalesced restart requests during restart/startup should not leave the command queue permanently draining.

Actual behavior

The gateway logged ready and readiness endpoints returned OK, but model calls continued to fail for at least 20 minutes (23:29 to 23:50 in the log above) with:

Gateway is draining for restart; new tasks are not accepted

Workaround

A host-side Docker Compose recreate of only the gateway clears the bad in-memory state because it gives the gateway a fresh Node process.

Example workaround:

docker compose -f docker-compose.yml -f docker-compose.db.yml up -d --force-recreate openclaw-gateway

For now I am also considering disabling OpenClaw self-restart paths in Docker and using Compose as the only restart mechanism.

Suggested fixes

Ensure gatewayDraining is cleared reliably after every in-process restart path.
Coalesce or ignore duplicate restart signals while a restart iteration is already in progress.
Make /readyz fail when the gateway is rejecting new model/tool work due to drain state.
Consider documenting Docker installs as requiring supervisor/container-level restarts for restart-required config changes, or provide a Docker-aware restart mode that asks the host/supervisor to recreate the gateway container.
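
The first two suggestions can be sketched together. This is an illustration under stated assumptions, not OpenClaw's restart loop: `RestartCoordinator` and its members are hypothetical, and the restart is modeled as a synchronous callback for brevity, whereas the real restart is presumably asynchronous.

```typescript
class RestartCoordinator {
  private restarting = false;
  private draining = false;
  restartsRun = 0;
  signalsCoalesced = 0;

  isDraining(): boolean {
    return this.draining;
  }

  // Returns true if this signal ran a restart, false if it was coalesced
  // into a restart iteration that is already in progress.
  handleRestartSignal(restart: () => void): boolean {
    if (this.restarting) {
      this.signalsCoalesced += 1;
      return false;
    }
    this.restarting = true;
    this.draining = true;
    this.restartsRun += 1;
    try {
      restart();
    } finally {
      // Cleared on every exit path, including a restart that throws.
      this.draining = false;
      this.restarting = false;
    }
    return true;
  }
}
```

The design choice is that the drain flag is set and cleared in one place, so a second SIGUSR1 arriving mid-restart cannot set it again after the first iteration has cleared it.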

extent analysis

TL;DR

The gateway can get stuck in a draining state after an in-process restart, causing model calls to fail. Recreating the gateway container with Docker Compose clears the state as a temporary workaround.

Guidance

Verify that the gatewayDraining flag is being properly cleared after an in-process restart by checking the gateway logs and code.
Consider implementing a mechanism to coalesce or ignore duplicate restart signals while a restart iteration is already in progress to prevent the gateway from getting stuck in a draining state.
Update the /readyz endpoint to return a failure status when the gateway is rejecting new model/tool work due to drain state, to ensure consistency with the actual gateway state.
Explore using a Docker-aware restart mode that asks the host/supervisor to recreate the gateway container, or document the requirement for supervisor/container-level restarts for restart-required config changes in Docker installs.

Example

docker compose -f docker-compose.yml -f docker-compose.db.yml up -d --force-recreate openclaw-gateway

This command can be used as a temporary workaround to recreate the gateway container and clear the bad in-memory state.

Notes

The issue seems to be related to the in-process restart mechanism and the handling of duplicate restart signals. Further investigation is needed to determine the root cause and implement a reliable fix.

Recommendation

Apply the workaround by recreating the gateway container using Docker Compose, as it provides a temporary solution to clear the bad in-memory state and allow the gateway to accept new model calls. This approach can be used until a more permanent fix is implemented.
