openclaw - ✅(Solved) Fix [Bug]: Gateway "full process restart" exits PID 1 with code 0 → Docker Swarm task stays Complete and never restarts (service stuck at 0/1) [1 pull requests, 1 comments, 2 participants]

Q: Expected behavior

```
$ docker service ls | grep openclaw
c4fe3nts1w5x   ai_openclaw-gateway   replicated   0/1   ghcr.io/openclaw/openclaw:latest

$ docker inspect <container-id> --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.Error}}'
0 false

$ docker service ps ai_openclaw-gateway --no-trunc
... ai_openclaw-gateway.1   Shutdown   Complete 5 minutes ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 16 hours ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 16 hours ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 17 hours ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 19 hours ago
```

Five clean "Complete" task shutdowns over 19 hours: every config edit that flagged `requires gateway restart` killed the service.

du-nguyen-IT007 · 2026-04-28T02:36:01Z



openclaw/openclaw#73178•Fetched 2026-04-29 06:22:33


Author

du-nguyen-IT007

Participants

du-nguyen-IT007

steipete

Timeline (top)

labeled ×2, closed ×1, commented ×1, cross-referenced ×1

When the gateway receives SIGUSR1 and performs a "full process restart" (triggered by a config change that requires restart), the entrypoint process (PID 1 inside the container) exits with code 0 after spawning the replacement child. In any orchestrator that treats exit 0 as a successful completion (Docker Swarm with restart_policy.condition: on-failure, Kubernetes Jobs, systemd Type=oneshot, etc.), the container/task is marked Complete and never restarts — the service is stuck at 0/1 replicas until a human manually redeploys.

The gateway logs the restart, then silence — there is no surfaced error because, from the OS's perspective, nothing went wrong.

Error Message

No error is surfaced: the gateway logs the restart, then goes silent. docker inspect reports ExitCode 0, OOMKilled: false, Error: "":

$ docker inspect <container-id> --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.Error}}'
0 false

The only failure message in the captured log appears before the restart sequence begins:

2026-04-28T02:19:18.477+00:00 Embedded agent failed before reply: LLM request failed: network connection error.

Root Cause

The "full process restart" path appears to spawn a child Node process and then let PID 1 exit cleanly, instead of either:

- replacing the running process in place via execve(2) (so PID 1 remains the same OS process, just with new code), or
- exiting with a non-zero code so restart_policy: on-failure triggers a fresh task.

In a container, PID 1 is the supervisor as far as the orchestrator is concerned. Spawning a sibling and exiting PID 1 only works under a host-level supervisor (PM2, systemd) — never inside a container.
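
To make the failure mode concrete, the sketch below models how an orchestrator's restart condition reacts to the entrypoint's exit code. It is an illustration only, not Docker's implementation: the names `RestartCondition` and `shouldRestart` are hypothetical, and the three condition values mirror the Docker Swarm `restart_policy.condition` options (`none`, `on-failure`, `any`).

```typescript
// Hypothetical model of an orchestrator's restart decision (not Docker's code).
type RestartCondition = "none" | "on-failure" | "any";

// Returns true when a task whose entrypoint exited with `exitCode`
// would be relaunched under the given restart condition.
function shouldRestart(condition: RestartCondition, exitCode: number): boolean {
  if (condition === "none") return false; // never relaunch
  if (condition === "any") return true;   // relaunch regardless of exit code
  return exitCode !== 0;                  // on-failure: exit 0 counts as success
}

// The reported scenario: on-failure policy, PID 1 exits with code 0.
console.log(shouldRestart("on-failure", 0));  // false
// A non-zero exit under the same policy triggers a relaunch.
console.log(shouldRestart("on-failure", 75)); // true
// The documented workaround: condition "any" relaunches even on exit 0.
console.log(shouldRestart("any", 0));         // true
```

Under this model the reported behavior follows directly: with `on-failure`, a clean exit from PID 1 is treated as completion, so the task is never relaunched.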


PR fix notes

PR #73221: fix(gateway): exit non-zero when restarting as PID 1 in container (#73178)

Repository: openclaw/openclaw
Author: hclsys
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/73221

Description (problem / solution / changelog)

Fixes #73178.

Root cause

When the gateway runs as PID 1 inside a container with no detected supervisor (Docker / Docker Swarm with `restart_policy.condition: on-failure`, Kubernetes with `restartPolicy: OnFailure`, etc.), the "full process restart" path in `src/cli/gateway-cli/run-loop.ts` (line 254) spawns a detached sibling Node process via `restartGatewayProcessWithFreshPid()` and then exits PID 1 with code 0.

The orchestrator interprets exit 0 as a successful completion and does not restart the task. The detached child is unsupervised by the orchestrator, the service stays at `0/1` replicas, and only a manual redeploy recovers. The user reports five clean "Complete" task shutdowns over 19 hours.

Fix

Adopt suggested-fix option 2 from the issue: exit non-zero when intentionally restarting as PID 1 unsupervised, so `on-failure` policies relaunch the container.

Add a new respawn mode `orchestrator` that fires when all of:

- `OPENCLAW_NO_RESPAWN` is unset, AND
- no supervisor hint is present (`detectRespawnSupervisor()` returns null: no launchd/systemd/Windows scheduled-task markers), AND
- `process.pid === 1` (we are the container entrypoint).
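
The three conditions above can be sketched as a small pure function. This is a minimal illustration, not the code in `src/infra/process-respawn.ts`: `chooseRespawnMode`, `RespawnInputs`, and the mode names other than `orchestrator` are assumptions, the `supervisor` field stands in for the result of `detectRespawnSupervisor()`, and treating any non-empty `OPENCLAW_NO_RESPAWN` value as "set" is also an assumption.

```typescript
// Hypothetical sketch of the mode selection described in the PR.
// Only OPENCLAW_NO_RESPAWN and the "orchestrator" mode come from the source;
// every other name here is illustrative.
type RespawnMode = "disabled" | "supervised" | "orchestrator" | "spawned";

interface RespawnInputs {
  env: Record<string, string | undefined>; // e.g. process.env
  supervisor: string | null;               // stand-in for detectRespawnSupervisor()
  pid: number;                             // e.g. process.pid
}

function chooseRespawnMode({ env, supervisor, pid }: RespawnInputs): RespawnMode {
  // Assumption: any non-empty value counts as "set".
  if (env.OPENCLAW_NO_RESPAWN) return "disabled"; // in-process restart
  if (supervisor !== null) return "supervised";   // launchd/systemd/schtasks relaunches
  if (pid === 1) return "orchestrator";           // container entrypoint, unsupervised
  return "spawned";                               // host-launched CLI: detached spawn
}

console.log(chooseRespawnMode({ env: {}, supervisor: null, pid: 1 }));      // orchestrator
console.log(chooseRespawnMode({ env: {}, supervisor: "systemd", pid: 1 })); // supervised
console.log(chooseRespawnMode({ env: { OPENCLAW_NO_RESPAWN: "1" }, supervisor: null, pid: 1 })); // disabled
console.log(chooseRespawnMode({ env: {}, supervisor: null, pid: 4657 }));   // spawned
```

Checking the PID last keeps the previously existing paths untouched: the new mode is only reachable when neither the opt-out variable nor a supervisor hint applies.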

The run-loop honors this mode by exiting with code 75 (`EX_TEMPFAIL` — "transient failure, please restart"), which:

- doesn't collide with `EX_USAGE` (64) or generic exit 1
- is what BSD daemons use to signal restart intent
- triggers `on-failure` and `OnFailure` restart policies
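
As a rough illustration of the run-loop side, the sketch below logs the restart line and exits with 75. The function name `handleOrchestratorRestart` and its injected `exit` and `log` parameters are hypothetical (the PR itself refers to an `exitProcess(75)` call); injecting them lets the sketch run without terminating the current process.

```typescript
// EX_TEMPFAIL as defined in BSD sysexits.h.
const EX_TEMPFAIL = 75;

// Hypothetical handler for the "orchestrator" respawn mode.
// `exit` and `log` are injected so the behavior can be observed in a test;
// in a real entrypoint they would be process.exit and the gateway logger.
function handleOrchestratorRestart(
  exit: (code: number) => void,
  log: (line: string) => void,
): void {
  log(
    "[gateway] restart mode: orchestrator restart " +
      "(pid-1 unsupervised: orchestrator should restart entrypoint)",
  );
  exit(EX_TEMPFAIL); // non-zero, so on-failure / OnFailure policies relaunch
}

// Demonstration with a recording exit function instead of a real exit.
const recordedExitCodes: number[] = [];
handleOrchestratorRestart((code) => { recordedExitCodes.push(code); }, console.log);
console.log(recordedExitCodes[0] === EX_TEMPFAIL); // true
```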

Other paths are unchanged:

- Supervised (launchd KeepAlive, systemd unit, schtasks): same behavior, exit 0.
- `OPENCLAW_NO_RESPAWN=1`: same behavior, in-process restart.
- Non-PID-1 unsupervised processes (host-launched node CLI): same behavior, detached spawn + exit 0.

I picked option 2 over option 1 (in-place `execve`) because Node has no first-class `execve`-style API; the workarounds (`child_process` + signal forwarding, or pre-loading a wrapper that calls `execvp` via N-API) are substantially more invasive than this surgical mode addition.

Files

- `src/infra/process-respawn.ts`: new `orchestrator` mode + PID-1 detection guard before the spawn-detached fallback.
- `src/cli/gateway-cli/run-loop.ts`: handle `mode === "orchestrator"` by logging + `exitProcess(75)`.
- `src/infra/process-respawn.test.ts`: 3 new cases: PID 1 unsupervised → orchestrator mode; PID 1 + supervisor hint → still supervised; PID 1 + `OPENCLAW_NO_RESPAWN` → still disabled. Adds a `setPid()` helper following the existing `setPlatform()` pattern.
- `CHANGELOG.md`.

Verification

- `pnpm vitest run src/infra/process-respawn.test.ts` → 20/20 passing (3 new + all existing).
- `pnpm vitest run src/cli/gateway-cli/run-loop.test.ts` → 15/15 passing (no regression).

Operator-visible change

The startup banner will continue to show identical behavior on macOS / systemd / Windows installs. On Docker / Swarm / K8s deploys, the gateway logs:

```
[gateway] restart mode: orchestrator restart (pid-1 unsupervised: orchestrator should restart entrypoint)
```

…and exits 75. Existing `restart_policy.condition: on-failure` and `restartPolicy: OnFailure` configs will now relaunch the container. The README docs for `restart_policy.condition: any` workaround can be left in place but will no longer be load-bearing for the documented config-edit-triggers-restart flow.

Changed files

- `CHANGELOG.md` (modified, +1/-0)
- `src/cli/gateway-cli/run-loop.ts` (modified, +12/-0)
- `src/infra/process-respawn.test.ts` (modified, +52/-0)
- `src/infra/process-respawn.ts` (modified, +16/-1)


Bug type

Regression (worked before, now fails)

Beta release blocker

Summary

The gateway logs the restart, then silence — there is no surfaced error because, from the OS's perspective, nothing went wrong.

Environment

- Image: ghcr.io/openclaw/openclaw:latest
- Webchat client: v2026.4.24
- Host: Ubuntu, Docker 29.4.0, Docker Swarm mode (managed by Easypanel)
- Service entrypoint (from docker service inspect): /bin/sh -c 'node dist/index.js gateway --bind lan --port 18789 --allow-unconfigured'
- Service restart policy: condition=on-failure, delay=5s, max-attempts=0

Steps to reproduce

1. Deploy openclaw-gateway as a Docker Swarm service (or any orchestrator with an "on-failure"-style restart policy).
2. From the Control UI, change a config field that the reload evaluator classifies as "requires gateway restart" (e.g. plugins.entries.ollama.config, plugins.entries.memory-core, commands).
3. Save. Gateway logs:

       [reload] config change requires gateway restart (...)
       [gateway] signal SIGUSR1 received
       [gateway] received SIGUSR1; restarting
       [gmail-watcher] gmail watcher stopped
       [ws] webchat disconnected code=1012 reason=service restart
       [gateway] restart mode: full process restart (spawned pid 4657)

4. The container exits. docker inspect confirms ExitCode: 0, OOMKilled: false, Error: "".
5. docker service ls shows 0/1. The service never recovers on its own.

Expected behavior

$ docker service ls | grep openclaw
c4fe3nts1w5x   ai_openclaw-gateway   replicated   0/1   ghcr.io/openclaw/openclaw:latest

$ docker inspect <container-id> --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.Error}}'
0 false

$ docker service ps ai_openclaw-gateway --no-trunc
... ai_openclaw-gateway.1   Shutdown   Complete 5 minutes ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 16 hours ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 16 hours ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 17 hours ago
... \_ ai_openclaw-gateway.1 Shutdown   Complete 19 hours ago

Five clean "Complete" task shutdowns over 19 hours — every config edit that flagged requires gateway restart killed the service.


Suggested fixes (any one of these resolves it)

1. **Preferred: in-place re-exec.** Replace the spawn-and-exit logic with `execve` of the same Node binary + argv. The container PID 1 keeps running; orchestrators see no exit at all. This is what tools like nginx (`SIGHUP`/`-s reload`) and many language runtimes do.
2. **Acceptable: exit non-zero on intentional restart.** `process.exit(75)` (or any non-zero code) so `restart_policy: on-failure` and Kubernetes `restartPolicy: Always/OnFailure` will respawn the container. Document that operators must configure their restart policy to allow it.
3. **Documentation-only mitigation:** prominently document in the Docker/Swarm/Kubernetes deploy guide that `restart_policy.condition` must be `any` (Swarm) / `restartPolicy: Always` (K8s) / `Restart=always` (systemd). Currently the README does not mention this and the default Easypanel template uses `on-failure`, which is a footgun.

Option 1 is the most robust because it removes the dependency on operator configuration entirely.

Actual behavior

2026-04-28T02:19:18.477+00:00 Embedded agent failed before reply: LLM request failed: network connection error.
2026-04-28T02:19:20.344+00:00 Config overwrite: /home/node/.openclaw/openclaw.json (sha256 5aa15348e4a7d30a228869454a2c8248bb700af76b77aec724c48414de7fdee9 -> 75853247c606e7905a13e5b77eee818b5a1b8c403563bf129b55e19c4793875c, backup=/home/node/.openclaw/openclaw.json.bak)
2026-04-28T02:19:20.362+00:00 [ws] ⇄ res ✓ config.set 282ms conn=5f0963e0…801e id=31dc6dd6…ca1b
2026-04-28T02:19:20.364+00:00 [reload] config change detected; evaluating reload (agents.defaults.maxConcurrent, agents.defaults.subagents, plugins.entries.ollama.config, plugins.entries.memory-core, commands, messages)
2026-04-28T02:19:20.371+00:00 [reload] config change requires gateway restart (plugins.entries.ollama.config, plugins.entries.memory-core, commands)
2026-04-28T02:19:20.372+00:00 [gateway] signal SIGUSR1 received
2026-04-28T02:19:20.372+00:00 [gateway] received SIGUSR1; restarting
2026-04-28T02:19:20.387+00:00 [gmail-watcher] gmail watcher stopped
2026-04-28T02:19:20.401+00:00 [ws] webchat disconnected code=1012 reason=service restart conn=5f0963e0-ed8a-45b5-89fa-a9207d13801e
2026-04-28T02:19:20.499+00:00 [ws] ⇄ res ✓ config.get 109ms conn=5f0963e0…801e id=b22393f3…4914
2026-04-28T02:19:20.505+00:00 [gateway] restart mode: full process restart (spawned pid 4657)

OpenClaw version

ghcr.io/openclaw/openclaw:latest

Operating system

ubuntu

Install method

docker

Model

9router/sonnet-4.6

Provider / routing chain

openclaw


extent analysis

TL;DR

The gateway container exits with code 0 after a "full process restart" triggered by a config change, causing the service to never restart in orchestrators that treat exit 0 as a successful completion.

Guidance

The root cause is the "full process restart" path spawning a child Node process and then letting PID 1 exit cleanly, instead of replacing the running process in place via execve(2) or exiting with a non-zero code.
To fix this, consider replacing the spawn-and-exit logic with execve of the same Node binary + argv, so the container PID 1 keeps running and orchestrators see no exit at all.
Alternatively, exit with a non-zero code (e.g., process.exit(75)) so restart_policy: on-failure and Kubernetes restartPolicy: Always/OnFailure will respawn the container.
Ensure that the restart policy is configured to allow the container to restart after a non-zero exit code.

Example

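// Note: child_process.execFile spawns a child; it does not replace the process the way execve(2) does.
// Sketch of the child_process + signal-forwarding workaround instead: PID 1 stays alive and proxies the child.
const { spawn } = require('child_process');
const child = spawn(process.execPath, ['dist/index.js', 'gateway', '--bind', 'lan', '--port', '18789', '--allow-unconfigured'], { stdio: 'inherit' });
['SIGTERM', 'SIGINT'].forEach((sig) => process.on(sig, () => child.kill(sig))); // forward stop signals to the child
child.on('exit', (code) => process.exit(code ?? 1)); // mirror the child's exit code; treat signal deaths as failure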

Notes

The suggested fixes assume that the openclaw image is using a Node.js runtime.
The execve approach is preferred as it removes the dependency on operator configuration entirely.

Recommendation

Apply the workaround by replacing the spawn-and-exit logic with execve of the same Node binary + argv, as it is the most robust solution.
