openclaw - ✅(Solved) Fix [Bug]: Dockerfile HEALTHCHECK --start-period=15s flips containers to (unhealthy) during legitimate cold start [2 pull requests, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#75701 · Fetched 2026-05-02 05:31:26
Comments: 1 · Participants: 2 · Timeline events: 15 · Reactions: 2
Timeline (top): mentioned ×6, subscribed ×6, cross-referenced ×2, commented ×1

The image's built-in HEALTHCHECK directive in Dockerfile uses --start-period=15s. Verified still present on main and at tag v2026.4.29:

HEALTHCHECK --interval=3m --timeout=10s --start-period=15s --retries=3 \
  CMD node -e "fetch('http://127.0.0.1:18789/healthz').then((r)=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

15 seconds is shorter than even a hot-restart on a non-WSL Docker host, and far shorter than the cold-start envelope documented in #73339 / #75069 / #73055 (~3+ minutes from [gateway] starting... to http server listening, dominated by bundled runtime mirror work even after #74948's lock fix lands). On a true cold start (no persistent volume or freshly seeded one) the bundled-runtime-deps install itself adds another ~30 seconds on top.

Anyone running OpenClaw under an orchestrator that consumes the OCI healthcheck will see false-positive unhealthy flags during every legitimate cold start:

  • Docker Compose with depends_on: condition: service_healthy
  • Docker Swarm
  • podman / podman-compose
  • Kubernetes pods that derive livenessProbe/readinessProbe/startupProbe from the OCI healthcheck via the CRI (some operators / Helm charts default to this)
  • systemd units gated on docker inspect health state (e.g., systemd-runs-docker patterns)
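
To make the failure mode concrete, here is a minimal sketch of the kind of gate such consumers apply, assuming a wrapper that polls docker inspect. The function name wait_for_healthy and its defaults are hypothetical and not part of OpenClaw or Docker; only the docker inspect call and the three health states (starting, healthy, unhealthy) are standard.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenClaw): block until a container's
# OCI health state becomes "healthy".
# Usage: wait_for_healthy <container> [timeout_seconds] [poll_seconds]
wait_for_healthy() {
  name=$1
  timeout=${2:-600}
  poll=${3:-5}
  waited=0
  while :; do
    # .State.Health.Status is one of: starting, healthy, unhealthy
    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null) || status=unknown
    case $status in
      healthy)   return 0 ;;
      unhealthy) echo "$name is unhealthy after ${waited}s" >&2; return 1 ;;
    esac
    if [ "$waited" -ge "$timeout" ]; then
      echo "$name still '$status' after ${waited}s" >&2
      return 2
    fi
    sleep "$poll"
    waited=$((waited + poll))
  done
}
```

A gate written this way gives up as soon as it sees unhealthy, so a premature flip fails the dependent unit even though the gateway would have bound later.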

Fix Action

Fixed

PR fix notes

PR #75766: fix(docker): bump HEALTHCHECK start-period from 15s to 120s (AI-assisted)

Description (problem / solution / changelog)

Summary

  • Problem: Docker HEALTHCHECK --start-period=15s is too aggressive — cold starts (npm install, model downloads, bootstrapping) can take 30-60 seconds on resource-constrained/ARM machines, causing premature unhealthy status and unnecessary restarts.
  • Why it matters: Containers get killed/restarted before they finish booting, causing flaky deployments.
  • What changed: Bumped --start-period from 15s to 120s in the root Dockerfile.
  • What did NOT change: Interval (3m), timeout (10s), retries (3), and the health check command itself are untouched. Only the start grace period.

Change Type

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #75701

Root Cause

  • Root cause: --start-period=15s was chosen for fast CI/cloud environments but does not account for cold starts on resource-constrained hardware (Raspberry Pi, ARM Mac, low-RAM VPS).
  • Missing detection: No CI coverage for container health check timing on ARM/slow hardware.
  • Contributing context: OpenClaw's startup involves pnpm dependencies, plugin loading, and model resolution — all of which compound on first runs.

Regression Test Plan

N/A — this is a one-line config value change. The existing Docker health check logic remains unchanged; only the grace period is widened. No new test needed.

User-visible / Behavior Changes

None. Users only see the difference if they were previously hitting unhealthy state during cold starts — now they won't.

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

AI-Assisted

  • AI-assisted PR (Claude via OpenClaw)
  • Testing: light — verified Dockerfile syntax, confirmed only one HEALTHCHECK line exists
  • I understand what the code does
  • One-line change with clear rationale

Human Verification

  • Verified: Single HEALTHCHECK line in root Dockerfile; no other Dockerfiles have HEALTHCHECK; grep confirmed only one instance.
  • Edge cases: Considered 60s, 90s, 120s — went with 120s to be generous without being wasteful.
  • Not verified: Did not build the Docker image end-to-end (requires full container build).

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Changed files

  • Dockerfile (modified, +1/-1)

PR #75809: fix(docker): raise HEALTHCHECK start-period to 600s

Description (problem / solution / changelog)

Summary

  • Problem: Dockerfile HEALTHCHECK --start-period=15s flips containers to (unhealthy) ~15s after docker run while bundled runtime deps stage and the Gateway is still progressing through [plugins] staging bundled runtime deps before gateway startup → [plugins] installed bundled runtime deps → [gateway] starting HTTP server → http server listening.
  • Why it matters: Anyone consuming the OCI healthcheck (Compose depends_on: service_healthy, Swarm, podman, k8s probes that mirror the OCI healthcheck via the CRI, systemd units gating on docker inspect health) sees false-positive unhealthy on every legitimate cold start.
  • What changed: --start-period=15s → --start-period=600s. Other healthcheck knobs (--interval=3m --timeout=10s --retries=3) are unchanged.
  • What did NOT change: the probe command, the bind defaults, and any non-Docker health check paths.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #75701
  • Related #73339, #75069, #73055, #74948 (cold-start envelope context)
  • This PR fixes a bug or regression

Root Cause

  • Root cause: --start-period was set to a value (15s) shorter than even a hot-restart, much shorter than the documented cold-start envelope dominated by bundled runtime mirror work, and shorter than the additional ~30s a fresh-host first-ever bundled-runtime-deps install adds on top.
  • Missing detection / guardrail: no test pins the start-period to the documented envelope; the value drifted without anyone catching that orchestrators consuming OCI health flip on every cold start.
  • Contributing context: same directive value carried forward across releases; downstream operators worked around it with bespoke healthcheck overrides per #62850.

Regression Test Plan

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Why this is the smallest reliable guardrail: this is a single Dockerfile literal whose appropriate value is determined by documented cold-start envelope (#73339 / #75069 / #73055). A unit/seam/E2E test would either tautologically read back the literal or run a real Docker cold start, neither of which adds durable signal beyond the linked observability issues.
  • If no new test is added, why not: see above.
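
For readers who still want a guardrail, one low-cost shape is a static check on the Dockerfile literal. The sketch below is an assumption, not something either PR ships: check_start_period is a hypothetical name, the floor is supplied by the caller, and it only understands values written in whole seconds (for example 600s). As the plan above notes, a check like this mostly reads back the literal, so it guards against an accidental regression rather than validating the value itself.

```shell
#!/bin/sh
# Hypothetical guard: fail when a Dockerfile's --start-period (in seconds)
# is below a caller-supplied floor.
# Usage: check_start_period <dockerfile> <floor_seconds>
check_start_period() {
  file=$1
  floor=$2
  # Extract the digits from the first "--start-period=<N>s" occurrence.
  value=$(sed -n 's/.*--start-period=\([0-9][0-9]*\)s.*/\1/p' "$file" | head -n 1)
  if [ -z "$value" ]; then
    echo "no --start-period=<N>s found in $file" >&2
    return 2
  fi
  if [ "$value" -lt "$floor" ]; then
    echo "start-period ${value}s is below the ${floor}s floor" >&2
    return 1
  fi
}
```

A CI step could call check_start_period Dockerfile 600 and fail the job on a non-zero exit status.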

User-visible / Behavior Changes

  • During the first 600s after docker run, the OCI health state stays starting instead of flipping to (unhealthy). After the start-period, behavior is unchanged (--interval=3m --timeout=10s --retries=3 continue to apply).

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Ubuntu 24.04 / macOS / Windows Docker Desktop (any host running the openclaw image)
  • Runtime/container: ghcr.io/openclaw/openclaw:2026.4.29 (and earlier; same value still on main before this PR)

Steps

  1. docker run --rm --name oc-test -p 18789:18789 -v /tmp/oc-test-state:/home/node/.openclaw ghcr.io/openclaw/openclaw:2026.4.29 node openclaw.mjs gateway --bind lan
  2. After ~15s, run docker ps --format '{{.Status}}' | grep oc-test.

Expected

  • Container reports Up <n> seconds (health: starting) until the gateway has actually bound.

Actual (before)

  • Container reports Up 16 seconds (unhealthy) while the gateway is still legitimately staging runtime deps.

Evidence

  • Trace/log snippets — see issue #75701 for the staging → bind log sequence and the (unhealthy) flip timestamp.

Human Verification

  • Verified scenarios: Dockerfile literal change matches the suggested fix in #75701; no other Dockerfile or CI scripts depend on the previous 15s value (grep -rn 'start-period' --include='*.ts' --include='*.mjs' --include='Dockerfile*' returns only this directive).
  • Edge cases checked: existing reasoning around --interval=3m --timeout=10s --retries=3 is preserved; the change is limited to --start-period.
  • What I did not verify: I did not run a fresh-host cold start end-to-end; the appropriate envelope is documented in #73339 / #75069 / #73055.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: a genuinely broken container that never becomes healthy now stays in health: starting for 10 minutes instead of unhealthy after 15s.
    • Mitigation: --interval=3m --timeout=10s --retries=3 is unchanged, so once the start period elapses the container still flips to unhealthy on the same schedule as before; orchestrators that need a tighter probe can override the image-level healthcheck with their own.
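
As an illustration of overriding the image-level healthcheck at run time, the sketch below wraps the repro command and sets docker run's --health-start-period flag. The wrapper name run_openclaw_with_grace is hypothetical; the container name, port, volume, and image tag are copied from the repro steps. The expectation is that healthcheck settings not overridden here are inherited from the image, which is worth confirming with docker inspect on the running container.

```shell
#!/bin/sh
# Hypothetical wrapper: start the gateway with a start period chosen by the
# operator instead of the image default.
# Usage: run_openclaw_with_grace [duration]   (default: 600s)
run_openclaw_with_grace() {
  grace=${1:-600s}
  # --health-start-period overrides only the start period for this container.
  docker run --rm --name oc-test \
    --health-start-period="$grace" \
    -p 18789:18789 \
    -v /tmp/oc-test-state:/home/node/.openclaw \
    ghcr.io/openclaw/openclaw:2026.4.29 \
    node openclaw.mjs gateway --bind lan
}
```

The same flag family (--health-interval, --health-timeout, --health-retries) can tighten the probe for operators who want faster detection after startup.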

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • Dockerfile (modified, +1/-1)

Code Example

HEALTHCHECK --interval=3m --timeout=10s --start-period=15s --retries=3 \
  CMD node -e "fetch('http://127.0.0.1:18789/healthz').then((r)=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

---

docker run --rm --name oc-test \
  -p 18789:18789 \
  -v /tmp/oc-test-state:/home/node/.openclaw \
  ghcr.io/openclaw/openclaw:2026.4.29 \
  node openclaw.mjs gateway --bind lan

---

# 15s in:
docker ps --format '{{.Status}}' | grep oc-test
# Up 16 seconds (unhealthy)

---

- HEALTHCHECK --interval=3m --timeout=10s --start-period=15s --retries=3 \
+ HEALTHCHECK --interval=3m --timeout=10s --start-period=600s --retries=3 \
    CMD node -e "fetch('http://127.0.0.1:18789/healthz').then((r)=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

RAW_BUFFER

Bug type

Behavior bug (incorrect output/state without crash) — Dockerfile config

Beta release blocker

No

Steps to reproduce

docker run --rm --name oc-test \
  -p 18789:18789 \
  -v /tmp/oc-test-state:/home/node/.openclaw \
  ghcr.io/openclaw/openclaw:2026.4.29 \
  node openclaw.mjs gateway --bind lan

In a separate terminal:

# 15s in:
docker ps --format '{{.Status}}' | grep oc-test
# Up 16 seconds (unhealthy)

The container is mid-staging at this point and will eventually become healthy (assuming a warm cache it's ~10-15s; on a truly cold start it can be several minutes per #73339). But anything that gates on the OCI health state has already failed.

Expected behavior

HEALTHCHECK --start-period should reflect the realistic worst-case time-to-first-bind. Even with #74948's lock-staleness fix and a warm cache, the first-ever boot on a fresh host has to run the staging script once. Recommend --start-period=600s to cover both the warm-cache cold start (~10 s) and the worst-case fresh-host first-ever start (~5 min, see #73339 / #75069 / #73055).

--interval=3m --timeout=10s --retries=3 are reasonable as-is.
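
One way to ground the choice of value is to measure time-to-first-bind on the target host. The sketch below is a rough approach under assumptions: time_to_first_bind is a hypothetical helper, it probes from the host through the published port 18789 used in the repro, and it relies on curl being installed. It should be started right after docker run so the elapsed time approximates the startup duration.

```shell
#!/bin/sh
# Hypothetical helper: print the seconds elapsed until <url> first answers
# without an HTTP error.
# Usage: time_to_first_bind [url] [limit_seconds]
time_to_first_bind() {
  url=${1:-http://127.0.0.1:18789/healthz}
  limit=${2:-900}
  start=$(date +%s)
  while :; do
    # -f: treat HTTP 4xx/5xx as failure; -s: silent; --max-time: cap each attempt
    if curl -fs --max-time 10 -o /dev/null "$url"; then
      echo $(( $(date +%s) - start ))
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$limit" ]; then
      echo "no response within ${limit}s" >&2
      return 1
    fi
    sleep 1
  done
}
```

Running this across a warm-cache start and a fresh-host start gives two data points to compare against the proposed start period.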

Actual behavior

Container flips to (unhealthy) ~15 seconds after docker run, while the gateway is still legitimately progressing through [plugins] staging bundled runtime deps before gateway startup → [plugins] installed bundled runtime deps → [gateway] starting HTTP server... → http server listening.

Environment

  • OpenClaw image: ghcr.io/openclaw/openclaw:2026.4.29 (current latest), and the same value on main and back through several prior tags
  • Affects: any orchestrator that consumes the OCI HEALTHCHECK state — Docker Compose, Swarm, podman, systemd units gating on docker inspect health, k8s clusters whose operators / Helm charts mirror the OCI probe
  • Doesn't affect: orchestrators that ignore the image-level healthcheck and define their own (e.g., k8s livenessProbe/readinessProbe/startupProbe set explicitly by the user)

Suggested fix

Trivial — Dockerfile:

- HEALTHCHECK --interval=3m --timeout=10s --start-period=15s --retries=3 \
+ HEALTHCHECK --interval=3m --timeout=10s --start-period=600s --retries=3 \
    CMD node -e "fetch('http://127.0.0.1:18789/healthz').then((r)=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

If 600 s feels too long, 300 s would still be a meaningful improvement and would cover the warm-cache cold-start case comfortably.

Cross-references

  • #73339 — Gateway cold start dominated by bundled runtime mirror work (closed)
  • #75069 — Bundled plugin runtime mirror runs synchronously per pi-agent invocation (closed)
  • #73055 — Gateway blocks HTTP server startup while reinstalling 63 bundled runtime deps on every restart (closed)
  • #74948 — Legacy runtime-deps lock without starttime/createdAtMs not expired (closed; the fix removes one large class of "never becomes healthy" cases but doesn't shorten the legitimate cold start)
  • #75032 — Feature request for installBundledRuntimeDeps: false to skip staging entirely (open; complementary)
  • #62850 — Docker HEALTHCHECK uses node -e fetch which fails intermittently in 4.5+ (closed; same directive, different concern)

Repro evidence

The value --start-period=15s is verified directly against the upstream Dockerfile (current main and tag v2026.4.29) and reproduces with the docker run command in Steps to reproduce. v2026.4.29 carries the same directive value as v2026.4.26 and earlier.
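
The same value can also be read from a pulled image rather than from the Dockerfile source. This is a minimal sketch, assuming the image is available locally; image_start_period_seconds is a hypothetical helper name. Docker records healthcheck durations in the image config as integer nanoseconds, so the helper converts to seconds.

```shell
#!/bin/sh
# Hypothetical helper: print an image's configured healthcheck start period
# in whole seconds.
# Usage: image_start_period_seconds <image>
image_start_period_seconds() {
  # The json template function prints the raw integer (nanoseconds).
  ns=$(docker inspect --format '{{json .Config.Healthcheck.StartPeriod}}' "$1") || return 1
  case $ns in
    ''|*[!0-9]*) echo "unexpected start period value: $ns" >&2; return 1 ;;
  esac
  echo $((ns / 1000000000))
}
```

For example, image_start_period_seconds ghcr.io/openclaw/openclaw:2026.4.29 could be compared before and after the fix lands in a published tag.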

extent analysis

TL;DR

Update the HEALTHCHECK directive in the Dockerfile to increase the --start-period to a value that reflects the realistic worst-case time-to-first-bind, such as 600s.

Guidance

  • Review the Dockerfile and update the HEALTHCHECK directive to increase the --start-period value.
  • Consider the worst-case time-to-first-bind for the container, including the time it takes to run the staging script and start the HTTP server.
  • Test the updated HEALTHCHECK directive to ensure it resolves the issue with false-positive unhealthy flags during legitimate cold starts.
  • If the updated value feels too long, consider a shorter value, such as 300s, which would still be an improvement over the current 15s value.

Example

The suggested fix is to update the Dockerfile with the following change:

- HEALTHCHECK --interval=3m --timeout=10s --start-period=15s --retries=3 \
+ HEALTHCHECK --interval=3m --timeout=10s --start-period=600s --retries=3 \
    CMD node -e "fetch('http://127.0.0.1:18789/healthz').then((r)=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"

Notes

The issue affects any orchestrator that consumes the OCI HEALTHCHECK state, including Docker Compose, Swarm, podman, and Kubernetes clusters. The suggested fix is a trivial update to the Dockerfile and should resolve the issue with false-positive unhealthy flags during legitimate cold starts.

Recommendation

Apply the suggested fix to update the HEALTHCHECK directive in the Dockerfile to increase the --start-period value to 600s, as this reflects the realistic worst-case time-to-first-bind.
