openclaw - ✅(Solved) Fix [Bug]: Dockerfile HEALTHCHECK --start-period=15s flips containers to (unhealthy) during legitimate cold start [2 pull requests, 1 comments, 2 participants]

Q: Expected behavior

`HEALTHCHECK --start-period` should reflect the realistic worst-case time-to-first-bind. Even with #74948's lock-staleness fix and a warm cache, the first-ever boot on a fresh host has to run the staging script once. Recommend `--start-period=600s` to cover both the warm-cache cold start (~10 s) and the worst-case fresh-host first-ever start (~5 min, see #73339 / #75069 / #73055). `--interval=3m --timeout=10s --retries=3` are reasonable as-is.

openclaw2026-05-01 14:58:32

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

openclaw/openclaw#75701•Fetched 2026-05-02 05:31:26

View on GitHub

Comments

Participants

Timeline

Reactions

Author

inxaos

Participants

clawsweeper[bot]

inxaos

Timeline (top)

mentioned ×6subscribed ×6cross-referenced ×2commented ×1

The image's built-in HEALTHCHECK directive in Dockerfile uses --start-period=15s. Verified still present on main and at tag v2026.4.29:

HEALTHCHECK --interval=3m --timeout=10s --start-period=15s --retries=3 \
  CMD node -e "fetch('http://127.0.0.1:18789/healthz').then((r)=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

15 seconds is shorter than even a hot-restart on a non-WSL Docker host, and far shorter than the cold-start envelope documented in #73339 / #75069 / #73055 (~3+ minutes from [gateway] starting... to http server listening, dominated by bundled runtime mirror work even after #74948's lock fix lands). On a true cold start (no persistent volume or freshly seeded one) the bundled-runtime-deps install itself adds another ~30 seconds on top.

Anyone running OpenClaw under an orchestrator that consumes the OCI healthcheck will see false-positive unhealthy flags during every legitimate cold start:

Docker Compose with depends_on: condition: service_healthy
Docker Swarm
podman / podman-compose
Kubernetes pods that derive livenessProbe/readinessProbe/startupProbe from the OCI healthcheck via the CRI (some operators / Helm charts default to this)
systemd units gated on docker inspect health state (e.g., systemd-runs-docker patterns)

Root Cause

The image's built-in HEALTHCHECK directive in Dockerfile uses --start-period=15s. Verified still present on main and at tag v2026.4.29:

HEALTHCHECK --interval=3m --timeout=10s --start-period=15s --retries=3 \
  CMD node -e "fetch('http://127.0.0.1:18789/healthz').then((r)=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

Anyone running OpenClaw under an orchestrator that consumes the OCI healthcheck will see false-positive unhealthy flags during every legitimate cold start:

Docker Compose with depends_on: condition: service_healthy
Docker Swarm
podman / podman-compose
Kubernetes pods that derive livenessProbe/readinessProbe/startupProbe from the OCI healthcheck via the CRI (some operators / Helm charts default to this)
systemd units gated on docker inspect health state (e.g., systemd-runs-docker patterns)

Code Example

HEALTHCHECK --interval=3m --timeout=10s --start-period=15s --retries=3 \
  CMD node -e "fetch('http://127.0.0.1:18789/healthz').then((r)=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

---

docker run --rm --name oc-test \
  -p 18789:18789 \
  -v /tmp/oc-test-state:/home/node/.openclaw \
  ghcr.io/openclaw/openclaw:2026.4.29 \
  node openclaw.mjs gateway --bind lan

---

# 15s in:
docker ps --format '{{.Status}}' | grep oc-test
# Up 16 seconds (unhealthy)

---

- HEALTHCHECK --interval=3m --timeout=10s --start-period=15s --retries=3 \
+ HEALTHCHECK --interval=3m --timeout=10s --start-period=600s --retries=3 \
    CMD node -e "fetch('http://127.0.0.1:18789/healthz').then((r)=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

RAW_BUFFERClick to expand / collapse

Bug type

Behavior bug (incorrect output/state without crash) — Dockerfile config

Beta release blocker

Summary

The image's built-in HEALTHCHECK directive in Dockerfile uses --start-period=15s. Verified still present on main and at tag v2026.4.29:

HEALTHCHECK --interval=3m --timeout=10s --start-period=15s --retries=3 \
  CMD node -e "fetch('http://127.0.0.1:18789/healthz').then((r)=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

Anyone running OpenClaw under an orchestrator that consumes the OCI healthcheck will see false-positive unhealthy flags during every legitimate cold start:

Docker Compose with depends_on: condition: service_healthy
Docker Swarm
podman / podman-compose
Kubernetes pods that derive livenessProbe/readinessProbe/startupProbe from the OCI healthcheck via the CRI (some operators / Helm charts default to this)
systemd units gated on docker inspect health state (e.g., systemd-runs-docker patterns)

Steps to reproduce

docker run --rm --name oc-test \
  -p 18789:18789 \
  -v /tmp/oc-test-state:/home/node/.openclaw \
  ghcr.io/openclaw/openclaw:2026.4.29 \
  node openclaw.mjs gateway --bind lan

In a separate terminal:

# 15s in:
docker ps --format '{{.Status}}' | grep oc-test
# Up 16 seconds (unhealthy)

The container is mid-staging at this point and will eventually become healthy (assuming a warm cache it's ~10-15s; on a truly cold start it can be several minutes per #73339). But anything that gates on the OCI health state has already failed.

Expected behavior

HEALTHCHECK --start-period should reflect the realistic worst-case time-to-first-bind. Even with #74948's lock-staleness fix and a warm cache, the first-ever boot on a fresh host has to run the staging script once. Recommend --start-period=600s to cover both the warm-cache cold start (~10 s) and the worst-case fresh-host first-ever start (~5 min, see #73339 / #75069 / #73055).

--interval=3m --timeout=10s --retries=3 are reasonable as-is.

Actual behavior

Container flips to (unhealthy) ~15 seconds after docker run, while the gateway is still legitimately progressing through [plugins] staging bundled runtime deps before gateway startup → [plugins] installed bundled runtime deps → [gateway] starting HTTP server... → http server listening.

Environment

OpenClaw image: ghcr.io/openclaw/openclaw:2026.4.29 (current latest), and the same value on main and back through several prior tags
Affects: any orchestrator that consumes the OCI HEALTHCHECK state — Docker Compose, Swarm, podman, systemd units gating on docker inspect health, k8s clusters whose operators / Helm charts mirror the OCI probe
Doesn't affect: orchestrators that ignore the image-level healthcheck and define their own (e.g., k8s livenessProbe/readinessProbe/startupProbe set explicitly by the user)

Suggested fix

Trivial — Dockerfile:

- HEALTHCHECK --interval=3m --timeout=10s --start-period=15s --retries=3 \
+ HEALTHCHECK --interval=3m --timeout=10s --start-period=600s --retries=3 \
    CMD node -e "fetch('http://127.0.0.1:18789/healthz').then((r)=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

If 600 s feels too long, 300 s would still be a meaningful improvement and would cover the warm-cache cold-start case comfortably.

Cross-references

#73339 — Gateway cold start dominated by bundled runtime mirror work (closed)
#75069 — Bundled plugin runtime mirror runs synchronously per pi-agent invocation (closed)
#73055 — Gateway blocks HTTP server startup while reinstalling 63 bundled runtime deps on every restart (closed)
#74948 — Legacy runtime-deps lock without starttime/createdAtMs not expired (closed; the fix removes one large class of "never becomes healthy" cases but doesn't shorten the legitimate cold start)
#75032 — Feature request for installBundledRuntimeDeps: false to skip staging entirely (open; complementary)
#62850 — Docker HEALTHCHECK uses node -e fetch which fails intermittently in 4.5+ (closed; same directive, different concern)

Repro evidence

The value --start-period=15s is verified directly against the upstream Dockerfile (current main and tag v2026.4.29) and reproduces with the docker run command in Steps to reproduce. v2026.4.29 carries the same directive value as v2026.4.26 and earlier.

extent analysis

TL;DR

Update the HEALTHCHECK directive in the Dockerfile to increase the --start-period to a value that reflects the realistic worst-case time-to-first-bind, such as 600s.

Guidance

Review the Dockerfile and update the HEALTHCHECK directive to increase the --start-period value.
Consider the worst-case time-to-first-bind for the container, including the time it takes to run the staging script and start the HTTP server.
Test the updated HEALTHCHECK directive to ensure it resolves the issue with false-positive unhealthy flags during legitimate cold starts.
If the updated value feels too long, consider a shorter value, such as 300s, which would still be an improvement over the current 15s value.

Example

The suggested fix is to update the Dockerfile with the following change:

- HEALTHCHECK --interval=3m --timeout=10s --start-period=15s --retries=3 \
+ HEALTHCHECK --interval=3m --timeout=10s --start-period=600s --retries=3 \
    CMD node -e "fetch('http://127.0.0.1:18789/healthz').then((r)=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

Notes

The issue affects any orchestrator that consumes the OCI HEALTHCHECK state, including Docker Compose, Swarm, podman, and Kubernetes clusters. The suggested fix is a trivial update to the Dockerfile and should resolve the issue with false-positive unhealthy flags during legitimate cold starts.

Recommendation

Apply the suggested fix to update the HEALTHCHECK directive in the Dockerfile to increase the --start-period value to 600s, as this reflects the realistic worst-case time-to-first

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

--interval=3m --timeout=10s --retries=3 are reasonable as-is.

#SSR setup #ISR setup #authentication setup #request error #file not found

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

openclaw - ✅(Solved) Fix [Bug]: Dockerfile HEALTHCHECK --start-period=15s flips containers to (unhealthy) during legitimate cold start [2 pull requests, 1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fixed

PR fix notes

PR #75766: fix(docker): bump HEALTHCHECK start-period from 15s to 120s (AI-assisted)

Description (problem / solution / changelog)

Summary

Change Type

Scope

Linked Issue/PR

Root Cause

Regression Test Plan

User-visible / Behavior Changes

Security Impact

AI-Assisted

Human Verification

Compatibility / Migration

Changed files

PR #75809: fix(docker): raise HEALTHCHECK start-period to 600s

Description (problem / solution / changelog)

Summary

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Root Cause

Regression Test Plan

User-visible / Behavior Changes

Security Impact

Repro + Verification

Environment

Steps

Expected

Actual (before)

Evidence

Human Verification

Compatibility / Migration

Risks and Mitigations

Changed files

Code Example

Bug type

Beta release blocker

Summary

Steps to reproduce

Expected behavior

Actual behavior

Environment

Suggested fix

Cross-references

Repro evidence

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING