openclaw - ✅(Solved) Fix [Bug]: Gateway "full process restart" exits PID 1 with code 0 → Docker Swarm task stays Complete and never restarts (service stuck at 0/1) [1 pull requests, 1 comments, 2 participants]

GitHub stats
openclaw/openclaw#73178 · Fetched 2026-04-29 06:22:33
Comments: 1 · Participants: 2 · Timeline events: 5 · Reactions: 0
Timeline (top): labeled ×2, closed ×1, commented ×1, cross-referenced ×1

When the gateway receives SIGUSR1 and performs a "full process restart" (triggered by a config change that requires restart), the entrypoint process (PID 1 inside the container) exits with code 0 after spawning the replacement child. In any orchestrator that treats exit 0 as a successful completion (Docker Swarm with restart_policy.condition: on-failure, Kubernetes Jobs, systemd Type=oneshot, etc.), the container/task is marked Complete and never restarts — the service is stuck at 0/1 replicas until a human manually redeploys.

The gateway logs the restart, then silence — there is no surfaced error because, from the OS's perspective, nothing went wrong.

Error Message

No error is surfaced. The gateway logs the restart and then goes silent, and docker inspect reports a clean exit (ExitCode: 0, OOMKilled: false, Error: ""):

$ docker inspect <container-id> --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.Error}}'
0 false

The nearest error-level log line precedes the config change:

2026-04-28T02:19:18.477+00:00 Embedded agent failed before reply: LLM request failed: network connection error.

Root Cause

The "full process restart" path appears to spawn a child Node process and then let PID 1 exit cleanly, instead of either:

  • replacing the running process in place via execve(2) (so PID 1 remains the same OS process, just with new code), or
  • exiting with a non-zero code so restart_policy: on-failure triggers a fresh task.

In a container, PID 1 is the supervisor as far as the orchestrator is concerned. Spawning a sibling and exiting PID 1 only works under a host-level supervisor (PM2, systemd) — never inside a container.
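To make the interaction with the restart policy concrete, here is a minimal sketch of how an "on-failure"-style policy reacts to an exit code. The type and function names (`RestartCondition`, `wouldRestart`) are hypothetical and model the behavior described above; they are not orchestrator source code.

```typescript
// Hypothetical model of restart-policy semantics, for illustration only.
type RestartCondition = "none" | "on-failure" | "any";

/** Returns true if a task that exited with `exitCode` would be relaunched. */
function wouldRestart(condition: RestartCondition, exitCode: number): boolean {
  switch (condition) {
    case "none":
      return false;
    case "on-failure":
      // Exit 0 counts as a successful completion, so the task stays Complete.
      return exitCode !== 0;
    case "any":
      return true;
  }
}

// The failure mode from this issue: PID 1 exits cleanly under on-failure.
const staysComplete = !wouldRestart("on-failure", 0);
// A non-zero exit (for example 75) is relaunched under the same policy.
const relaunched = wouldRestart("on-failure", 75);
```

Under this model, the only ways out are to avoid exiting PID 1 at all, to exit non-zero, or to switch the policy to `any`.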

Fix / Workaround

  1. Preferred: in-place re-exec. Replace the spawn-and-exit logic with execve of the same Node binary + argv. The container PID 1 keeps running; orchestrators see no exit at all. This is what tools like nginx (SIGHUP/-s reload) and many language runtimes do.
  2. Acceptable: exit non-zero on intentional restart. process.exit(75) (or any non-zero code) so restart_policy: on-failure and Kubernetes restartPolicy: Always/OnFailure will respawn the container. Document that operators must configure their restart policy to allow it.
  3. Documentation-only mitigation: prominently document in the Docker/Swarm/Kubernetes deploy guide that restart_policy.condition must be any (Swarm) / restartPolicy: Always (K8s) / Restart=always (systemd). Currently the README does not mention this and the default Easypanel template uses on-failure, which is a footgun.
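Option 1 could be sketched roughly as follows. This assumes a Node version that ships the experimental `process.execve` (POSIX only); the `ProcessLike` interface and both helper names are hypothetical, and the process object is injected so the logic can be exercised without actually replacing the running process.

```typescript
// Structural subset of Node's `process`. `execve` is optional because it is
// experimental and absent on older Node versions and on Windows (assumption).
interface ProcessLike {
  execPath: string;
  execArgv: string[];
  argv: string[];
  env: Record<string, string | undefined>;
  execve?: (file: string, args: string[], env: Record<string, string | undefined>) => void;
}

/** Builds the argv for re-running the same entrypoint with the same flags. */
function buildReexecArgv(proc: ProcessLike): string[] {
  // argv[0] is the binary path; argv[1..] is the script plus its arguments.
  return [proc.execPath, ...proc.execArgv, ...proc.argv.slice(1)];
}

/** Re-execs in place when possible; returns false so callers can fall back. */
function tryReexecInPlace(proc: ProcessLike): boolean {
  if (typeof proc.execve !== "function") return false;
  proc.execve(proc.execPath, buildReexecArgv(proc), proc.env);
  // Not reached with the real process.execve, which does not return on success.
  return true;
}
```

In a real gateway the caller would pass the global `process`. When `tryReexecInPlace` returns false, falling back to option 2 (a non-zero exit) would keep the behavior safe on runtimes without `execve`.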

PR fix notes

PR #73221: fix(gateway): exit non-zero when restarting as PID 1 in container (#73178)

Description (problem / solution / changelog)

Fixes #73178.

Root cause

When the gateway runs as PID 1 inside a container with no detected supervisor (Docker / Docker Swarm with `restart_policy.condition: on-failure`, Kubernetes with `restartPolicy: OnFailure`, etc.), the "full process restart" path in `src/cli/gateway-cli/run-loop.ts` (line 254) spawns a detached sibling Node process via `restartGatewayProcessWithFreshPid()` and then exits PID 1 with code 0.

The orchestrator interprets exit 0 as a successful completion and does not restart the task. The detached child is unsupervised by the orchestrator, the service stays at `0/1` replicas, and only a manual redeploy recovers. The user reports five clean "Complete" task shutdowns over 19 hours.

Fix

Adopt suggested-fix option 2 from the issue: exit non-zero when intentionally restarting as PID 1 unsupervised, so `on-failure` policies relaunch the container.

Add a new respawn mode `orchestrator` that fires when all of:

  • `OPENCLAW_NO_RESPAWN` is unset, AND
  • no supervisor hint is present (`detectRespawnSupervisor()` returns null — no launchd/systemd/Windows scheduled-task markers), AND
  • `process.pid === 1` (we are the container entrypoint).
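The three conditions above can be read as a small decision function. The sketch below is an illustration, not the code in `src/infra/process-respawn.ts`: the `RespawnInputs` shape, the `"spawned"` mode name, and the rule that any non-empty `OPENCLAW_NO_RESPAWN` value counts as set are all assumptions.

```typescript
// Hypothetical mode names; only "orchestrator" is named in the PR description.
type RespawnMode = "disabled" | "supervised" | "orchestrator" | "spawned";

interface RespawnInputs {
  noRespawnEnv: string | undefined; // value of OPENCLAW_NO_RESPAWN, if any
  supervisor: string | null;        // result of a supervisor probe, null if none
  pid: number;                      // process.pid of the gateway
}

function decideRespawnMode(inputs: RespawnInputs): RespawnMode {
  // Assumption: any non-empty value disables respawning.
  if (inputs.noRespawnEnv !== undefined && inputs.noRespawnEnv !== "") {
    return "disabled";
  }
  // A detected supervisor (launchd, systemd, scheduled task) restarts us itself.
  if (inputs.supervisor !== null) {
    return "supervised";
  }
  // Unsupervised PID 1: we are the container entrypoint.
  if (inputs.pid === 1) {
    return "orchestrator";
  }
  // Host-launched CLI: keep the detached-spawn fallback.
  return "spawned";
}
```

Ordering the checks this way keeps the new mode strictly behind the existing opt-out and supervisor paths, which matches the three test cases listed under Files.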

The run-loop honors this mode by exiting with code 75 (`EX_TEMPFAIL` — "transient failure, please restart"), which:

  • doesn't collide with `EX_USAGE` (64) or generic exit 1
  • follows the BSD `sysexits.h` convention, where `EX_TEMPFAIL` marks a temporary failure that should be retried
  • triggers `on-failure` and `OnFailure` restart policies
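A minimal sketch of how a run loop might honor the mode is shown below. The function name and the injected `log` and `exitProcess` callbacks are hypothetical; only the exit code and the log line come from this PR description.

```typescript
// sysexits.h: temporary failure, the caller is invited to retry.
const EX_TEMPFAIL = 75;

/** Logs the restart intent, then exits non-zero so the orchestrator relaunches. */
function handleOrchestratorRestart(
  log: (line: string) => void,
  exitProcess: (code: number) => void,
): void {
  log(
    "[gateway] restart mode: orchestrator restart " +
      "(pid-1 unsupervised: orchestrator should restart entrypoint)",
  );
  exitProcess(EX_TEMPFAIL);
}
```

Injecting the exit function keeps the handler testable; in production the caller would presumably pass something equivalent to `process.exit`.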

Other paths are unchanged:

  • Supervised (launchd KeepAlive, systemd unit, schtasks): same behavior, exit 0.
  • `OPENCLAW_NO_RESPAWN=1`: same behavior, in-process restart.
  • Non-PID-1 unsupervised processes (host-launched node CLI): same behavior, detached spawn + exit 0.

I picked option 2 over option 1 (in-place `execve`) because Node has no first-class `execve`-style API; the workarounds (`child_process` + signal forwarding, or pre-loading a wrapper that calls `execvp` via N-API) are substantially more invasive than this surgical mode addition.

Files

  • `src/infra/process-respawn.ts` — new `orchestrator` mode + PID-1 detection guard before the spawn-detached fallback.
  • `src/cli/gateway-cli/run-loop.ts` — handle `mode === "orchestrator"` by logging + `exitProcess(75)`.
  • `src/infra/process-respawn.test.ts` — 3 new cases: PID 1 unsupervised → orchestrator mode; PID 1 + supervisor hint → still supervised; PID 1 + `OPENCLAW_NO_RESPAWN` → still disabled. Adds a `setPid()` helper following the existing `setPlatform()` pattern.
  • `CHANGELOG.md`.

Verification

  • `pnpm vitest run src/infra/process-respawn.test.ts` → 20/20 passing (3 new + all existing).
  • `pnpm vitest run src/cli/gateway-cli/run-loop.test.ts` → 15/15 passing (no regression).

Operator-visible change

The startup banner will continue to show identical behavior on macOS / systemd / Windows installs. On Docker / Swarm / K8s deploys, the gateway logs:

```
[gateway] restart mode: orchestrator restart (pid-1 unsupervised: orchestrator should restart entrypoint)
```

…and exits 75. Existing `restart_policy.condition: on-failure` and `restartPolicy: OnFailure` configs will now relaunch the container. The README's `restart_policy.condition: any` workaround can stay in place, but the documented config-edit-triggers-restart flow no longer depends on it.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/cli/gateway-cli/run-loop.ts (modified, +12/-0)
  • src/infra/process-respawn.test.ts (modified, +52/-0)
  • src/infra/process-respawn.ts (modified, +16/-1)

Code Example

[reload] config change requires gateway restart (...)
   [gateway] signal SIGUSR1 received
   [gateway] received SIGUSR1; restarting
   [gmail-watcher] gmail watcher stopped
   [ws] webchat disconnected code=1012 reason=service restart
   [gateway] restart mode: full process restart (spawned pid 4657)

---

$ docker service ls | grep openclaw
c4fe3nts1w5x   ai_openclaw-gateway   replicated   0/1   ghcr.io/openclaw/openclaw:latest

$ docker inspect <container-id> --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.Error}}'
0 false

$ docker service ps ai_openclaw-gateway --no-trunc
... ai_openclaw-gateway.1   Shutdown   Complete 5 minutes ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 16 hours ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 16 hours ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 17 hours ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 19 hours ago

---

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

When the gateway receives SIGUSR1 and performs a "full process restart" (triggered by a config change that requires restart), the entrypoint process (PID 1 inside the container) exits with code 0 after spawning the replacement child. In any orchestrator that treats exit 0 as a successful completion (Docker Swarm with restart_policy.condition: on-failure, Kubernetes Jobs, systemd Type=oneshot, etc.), the container/task is marked Complete and never restarts — the service is stuck at 0/1 replicas until a human manually redeploys.

The gateway logs the restart, then silence — there is no surfaced error because, from the OS's perspective, nothing went wrong.

Environment

  • Image: ghcr.io/openclaw/openclaw:latest
  • Webchat client: v2026.4.24
  • Host: Ubuntu, Docker 29.4.0, Docker Swarm mode (managed by Easypanel)
  • Service entrypoint (from docker service inspect): /bin/sh -c 'node dist/index.js gateway --bind lan --port 18789 --allow-unconfigured'
  • Service restart policy: condition=on-failure, delay=5s, max-attempts=0

Steps to reproduce

  1. Deploy openclaw-gateway as a Docker Swarm service (or any orchestrator with an "on-failure"-style restart policy).
  2. From the Control UI, change a config field that the reload evaluator classifies as "requires gateway restart" (e.g. plugins.entries.ollama.config, plugins.entries.memory-core, commands).
  3. Save. Gateway logs:
    [reload] config change requires gateway restart (...)
    [gateway] signal SIGUSR1 received
    [gateway] received SIGUSR1; restarting
    [gmail-watcher] gmail watcher stopped
    [ws] webchat disconnected code=1012 reason=service restart
    [gateway] restart mode: full process restart (spawned pid 4657)
  4. The container exits. docker inspect confirms ExitCode: 0, OOMKilled: false, Error: "".
  5. docker service ls shows 0/1. The service never recovers on its own.

Expected behavior

$ docker service ls | grep openclaw
c4fe3nts1w5x   ai_openclaw-gateway   replicated   0/1   ghcr.io/openclaw/openclaw:latest

$ docker inspect <container-id> --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.Error}}'
0 false

$ docker service ps ai_openclaw-gateway --no-trunc
... ai_openclaw-gateway.1   Shutdown   Complete 5 minutes ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 16 hours ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 16 hours ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 17 hours ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 19 hours ago

Five clean "Complete" task shutdowns over 19 hours — every config edit that flagged requires gateway restart killed the service.

Root cause

The "full process restart" path appears to spawn a child Node process and then let PID 1 exit cleanly, instead of either:

  • replacing the running process in place via execve(2) (so PID 1 remains the same OS process, just with new code), or
  • exiting with a non-zero code so restart_policy: on-failure triggers a fresh task.

In a container, PID 1 is the supervisor as far as the orchestrator is concerned. Spawning a sibling and exiting PID 1 only works under a host-level supervisor (PM2, systemd) — never inside a container.

Suggested fixes (any one of these resolves it)

  1. Preferred: in-place re-exec. Replace the spawn-and-exit logic with execve of the same Node binary + argv. The container PID 1 keeps running; orchestrators see no exit at all. This is what tools like nginx (SIGHUP/-s reload) and many language runtimes do.
  2. Acceptable: exit non-zero on intentional restart. process.exit(75) (or any non-zero code) so restart_policy: on-failure and Kubernetes restartPolicy: Always/OnFailure will respawn the container. Document that operators must configure their restart policy to allow it.
  3. Documentation-only mitigation: prominently document in the Docker/Swarm/Kubernetes deploy guide that restart_policy.condition must be any (Swarm) / restartPolicy: Always (K8s) / Restart=always (systemd). Currently the README does not mention this and the default Easypanel template uses on-failure, which is a footgun.

Option 1 is the most robust because it removes the dependency on operator configuration entirely.

Actual behavior

2026-04-28T02:19:18.477+00:00 Embedded agent failed before reply: LLM request failed: network connection error.
2026-04-28T02:19:20.344+00:00 Config overwrite: /home/node/.openclaw/openclaw.json (sha256 5aa15348e4a7d30a228869454a2c8248bb700af76b77aec724c48414de7fdee9 -> 75853247c606e7905a13e5b77eee818b5a1b8c403563bf129b55e19c4793875c, backup=/home/node/.openclaw/openclaw.json.bak)
2026-04-28T02:19:20.362+00:00 [ws] ⇄ res ✓ config.set 282ms conn=5f0963e0…801e id=31dc6dd6…ca1b
2026-04-28T02:19:20.364+00:00 [reload] config change detected; evaluating reload (agents.defaults.maxConcurrent, agents.defaults.subagents, plugins.entries.ollama.config, plugins.entries.memory-core, commands, messages)
2026-04-28T02:19:20.371+00:00 [reload] config change requires gateway restart (plugins.entries.ollama.config, plugins.entries.memory-core, commands)
2026-04-28T02:19:20.372+00:00 [gateway] signal SIGUSR1 received
2026-04-28T02:19:20.372+00:00 [gateway] received SIGUSR1; restarting
2026-04-28T02:19:20.387+00:00 [gmail-watcher] gmail watcher stopped
2026-04-28T02:19:20.401+00:00 [ws] webchat disconnected code=1012 reason=service restart conn=5f0963e0-ed8a-45b5-89fa-a9207d13801e
2026-04-28T02:19:20.499+00:00 [ws] ⇄ res ✓ config.get 109ms conn=5f0963e0…801e id=b22393f3…4914
2026-04-28T02:19:20.505+00:00 [gateway] restart mode: full process restart (spawned pid 4657)

OpenClaw version

ghcr.io/openclaw/openclaw:latest

Operating system

ubuntu

Install method

docker

Model

9router/sonnet-4.6

Provider / routing chain

openclaw

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response

extent analysis

TL;DR

The gateway container exits with code 0 after a "full process restart" triggered by a config change, causing the service to never restart in orchestrators that treat exit 0 as a successful completion.

Guidance

  • The root cause is the "full process restart" path spawning a child Node process and then letting PID 1 exit cleanly, instead of replacing the running process in place via execve(2) or exiting with a non-zero code.
  • To fix this, consider replacing the spawn-and-exit logic with execve of the same Node binary + argv, so the container PID 1 keeps running and orchestrators see no exit at all.
  • Alternatively, exit with a non-zero code (e.g., process.exit(75)) so restart_policy: on-failure and Kubernetes restartPolicy: Always/OnFailure will respawn the container.
  • Ensure that the restart policy is configured to allow the container to restart after a non-zero exit code.

Example

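// In-place re-exec sketch. child_process.execFile only spawns a child, so it would not keep PID 1 alive.
// Assumes a Node version that ships the experimental process.execve (POSIX only), which replaces
// the current process image while keeping the same PID.
process.execve(process.execPath, [process.execPath, ...process.execArgv, ...process.argv.slice(1)], process.env);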

Notes

  • The suggested fixes assume that the openclaw image is using a Node.js runtime.
  • The execve approach is preferred as it removes the dependency on operator configuration entirely.

Recommendation

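In principle, replacing the spawn-and-exit logic with an in-place execve of the same Node binary + argv is the most robust option, because it does not depend on operator configuration. PR #73221 instead adopts option 2 (exit 75 when restarting as unsupervised PID 1), citing the lack of a first-class execve-style API in Node.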


