openclaw - ✅(Solved) Fix [Bug] plugin-runtime-deps lock staleness check uses PID alone, blocks Docker gateway restarts (PID is always 7) [1 pull requests, 4 comments, 4 participants]

GitHub stats
openclaw/openclaw#74346 · Fetched 2026-04-30 06:25:12
Comments: 4 · Participants: 4 · Timeline: 13 · Reactions: 2
Timeline (top): commented ×4, cross-referenced ×2, mentioned ×2, referenced ×2

Root Cause

shouldRemoveRuntimeDepsLock (in the bundled plugin-runtime-deps installer) treats a lock as "fresh" whenever owner.pid matches a live PID, even if the lock is stale. Inside Docker, the gateway's Node process is always PID 1 (or PID 7 with init: true) in its container PID namespace. Two different incarnations of the gateway therefore share the same PID: the new process inspects a lock left behind by the previous one, sees its own PID listed as the owner, and treats the lock as live, even though the writer is long gone.

Result: the gateway hangs at starting… for the full lock-wait window (5 min) and then keeps retrying. We've seen 13+ minute hangs that only resolve when the operator manually removes ~/.openclaw/plugin-runtime-deps/openclaw-<version>/.openclaw-runtime-deps.lock.
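
To make the collision concrete, the sketch below reproduces the legacy predicate with the liveness probe injected as a parameter, so a container restart can be simulated without real processes. The LockOwner type, the injected isAlive parameter, and the value of the stale threshold are assumptions for illustration; only the shape of the predicate comes from the issue.

```typescript
// Hypothetical shape of owner.json; field names follow the issue text.
interface LockOwner {
  pid?: number;
  createdAtMs?: number;
}

// Assumed value for illustration only; the real constant is not given in the issue.
const BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS = 10 * 60_000;

// Legacy predicate as quoted in the issue, with isAlive injected for testing.
function shouldRemoveRuntimeDepsLockLegacy(
  owner: LockOwner | null,
  nowMs: number,
  isAlive: (pid: number) => boolean,
): boolean {
  if (!owner) return true;
  // Early return: createdAtMs is never consulted once a pid is present.
  if (typeof owner.pid === "number") return !isAlive(owner.pid);
  return (
    typeof owner.createdAtMs === "number" &&
    nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS
  );
}

// Simulated restart: the dead writer and the new gateway are both PID 7.
const newGatewayPid = 7;
const leftoverOwner: LockOwner = { pid: 7, createdAtMs: 0 };
const probe = (pid: number): boolean => pid === newGatewayPid;

// The lock is an hour old, yet the predicate refuses to remove it.
const removable = shouldRemoveRuntimeDepsLockLegacy(leftoverOwner, 60 * 60_000, probe);
console.log(removable); // false
```

On a host with a normal PID namespace the probe would report the old PID as dead, and the same call would return true.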

Fix Action

Fix / Workaround

  1. Document the workaround in the Docker install docs and have the gateway's startup script rm -rf any lock dir whose owner.json.createdAtMs is older than e.g. 30s before invoking the gateway.

PR fix notes

PR #74361: fix(plugins): disambiguate runtime-deps lock owners by process start-time (Docker PID reuse)

Description (problem / solution / changelog)

Summary

shouldRemoveRuntimeDepsLock short-circuits on isAlive(owner.pid) alone, which is unsafe inside containers because PIDs are recycled deterministically — the new gateway is always PID 1 (or PID 7 with init: true) in its container PID namespace, so a stale lock left behind by a previous incarnation looks "live" to the new one and never gets reclaimed.

This PR captures pidStartTimeMs at lock-acquisition time and consults it in the staleness check. When both sides have start-time evidence and they disagree, the lock is treated as stale. When evidence is missing on either side (legacy locks, non-Linux hosts), the existing PID-alive-means-fresh behavior is preserved exactly.

Closes #74346.

The bug

Setup: OpenClaw on Docker (ghcr.io/openclaw/openclaw:2026.4.24+), Linux host. Reproduced both with docker compose down/up and with a hard-killed gateway (SIGKILL, OOM, container kill).

  1. Old gateway dies without graceful cleanup — .openclaw-runtime-deps.lock/owner.json is left behind with {"pid": 7, "createdAtMs": <T0>}.
  2. New container starts. The new Node process becomes PID 7 in its namespace.
  3. Lock-acquisition path calls removeRuntimeDepsLockIfStale(lockDir, nowMs), which calls shouldRemoveRuntimeDepsLock(owner, nowMs). Owner has pid: 7. isAlive(7) returns true (the new process is PID 7). The function returns false — lock stays.
  4. mkdirSync(lockDir) returns EEXIST. The wait loop spins until BUNDLED_RUNTIME_DEPS_LOCK_TIMEOUT_MS (5 min) elapses, then errors out and is restarted by the supervisor. Cycle repeats indefinitely.
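
Steps 1 and 4 above can be sketched as a directory-based lock: mkdirSync is atomic, so a leftover directory makes every later attempt fail with EEXIST. The helper name tryAcquireLock and the temporary directory layout are hypothetical; only mkdirSync, EEXIST, and the owner.json fields come from the PR text.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: attempts the mkdir-based lock once.
// Returns true when the lock directory was created, false when it already exists.
function tryAcquireLock(
  lockDir: string,
  owner: { pid: number; createdAtMs: number },
): boolean {
  try {
    fs.mkdirSync(lockDir); // fails with EEXIST if the directory is already there
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") return false;
    throw err;
  }
  fs.writeFileSync(path.join(lockDir, "owner.json"), JSON.stringify(owner));
  return true;
}

const root = fs.mkdtempSync(path.join(os.tmpdir(), "runtime-deps-lock-"));
const lockDir = path.join(root, ".openclaw-runtime-deps.lock");

// First incarnation takes the lock and then "dies" without cleaning up.
const first = tryAcquireLock(lockDir, { pid: 7, createdAtMs: Date.now() });
// Second incarnation finds the leftover directory and cannot acquire.
const second = tryAcquireLock(lockDir, { pid: 7, createdAtMs: Date.now() });

console.log(first, second); // true false
```

Because nothing in the filesystem ties the directory to a living process, the only way out is a staleness check that correctly decides the first holder is gone.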

Operators have been working around this by stopping the container and manually removing .openclaw-runtime-deps.lock. After lock removal the gateway boots in ~35 seconds.

The fix

Strictly-additive change to the lock owner record and the staleness predicate:

  1. New pidStartTimeMs field on the lock owner record. Computed once at module load as Date.now() - process.uptime() * 1000 and persisted alongside pid / createdAtMs in owner.json.
  2. New readProcessStartTimeMs(pid) helper that returns the live PID's start-time in epoch ms on Linux (parses /proc/<pid>/stat field 22, anchored on the last ) to handle process names with spaces/parens), or null elsewhere. Hard-coded 100 for _SC_CLK_TCK since the constant cancels out — both ends of the comparison use the same conversion.
  3. shouldRemoveRuntimeDepsLock extended: when owner.pidStartTimeMs is set AND readStartTimeMs(owner.pid) returns a value AND the two disagree, the lock is stale. When evidence is missing on either side, the existing PID-alive-means-fresh path is taken unchanged.
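
A minimal sketch of items 2 and 3, assuming the liveness probe and the start-time reader are injected as parameters. The function parseStartTimeTicks, the LockOwner type, the parameter names, and the threshold value are hypothetical and are not taken from the PR's code. The parser returns raw clock ticks only; converting them to epoch milliseconds also needs the system boot time and tick rate, which this sketch leaves out.

```typescript
interface LockOwner {
  pid?: number;
  createdAtMs?: number;
  pidStartTimeMs?: number;
}

// Assumed value for illustration; the real constant is not given in the PR text.
const BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS = 10 * 60_000;

// Extracts field 22 (starttime, in clock ticks since boot) from a /proc/<pid>/stat line.
// Field 2 (comm) is wrapped in parentheses and may contain spaces or ")",
// so parsing anchors on the last ")" in the line.
function parseStartTimeTicks(statLine: string): number | null {
  const close = statLine.lastIndexOf(")");
  if (close < 0) return null;
  const rest = statLine.slice(close + 1).trim().split(/\s+/);
  // rest[0] is field 3 (state), so field 22 sits at index 22 - 3 = 19.
  const ticks = Number(rest[19]);
  return Number.isFinite(ticks) ? ticks : null;
}

function shouldRemoveRuntimeDepsLock(
  owner: LockOwner | null,
  nowMs: number,
  isAlive: (pid: number) => boolean,
  readStartTimeMs: (pid: number) => number | null,
): boolean {
  if (!owner) return true;
  if (typeof owner.pid === "number") {
    if (!isAlive(owner.pid)) return true;
    const live = readStartTimeMs(owner.pid);
    // Stale only when both sides have evidence and they disagree.
    if (typeof owner.pidStartTimeMs === "number" && live !== null) {
      return Math.round(owner.pidStartTimeMs) !== Math.round(live);
    }
    return false; // legacy path: PID alive means fresh
  }
  return (
    typeof owner.createdAtMs === "number" &&
    nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS
  );
}

// A stat line whose process name contains both spaces and parentheses.
const sample =
  "7 (node (gateway) x) S 1 7 7 0 -1 4194560 100 0 0 0 5 3 0 0 20 0 11 0 123456 1000000 200";
console.log(parseStartTimeTicks(sample)); // 123456

// Docker PID-reuse case: same PID, different start-time, so the lock is stale.
const owner: LockOwner = { pid: 7, createdAtMs: 0, pidStartTimeMs: 1_000 };
console.log(shouldRemoveRuntimeDepsLock(owner, 5_000, () => true, () => 9_000)); // true
```

Injecting the two readers keeps the predicate testable on hosts without /proc, where the reader would simply return null and the legacy branch applies.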

Why strictly additive

  • Wire format: lock files written by older releases lack pidStartTimeMs. They continue to take the legacy path. No migration step needed.
  • Predicate: when start-time evidence cannot be confirmed (legacy lock, non-Linux host), behavior is identical to today's code. The pre-existing test "does not expire active runtime-deps install locks by age alone" continues to pass.
  • Platforms: macOS / Windows hosts return null from readProcessStartTimeMs (no /proc) and take the legacy path. Linux hosts (including Docker Desktop's Linux VM, which all Linux containers run inside on macOS / Windows) get the disambiguation.

Tests

src/plugins/bundled-runtime-deps.test.ts adds four new cases:

  • matching start-time → lock is fresh (same incarnation re-checking its own lock).
  • mismatched start-time, PID alive → lock is stale (the Docker PID-reuse case this PR targets).
  • PID alive, no live start-time available → fall back to legacy behavior. Asserts that even with pidStartTimeMs set on the owner, we don't stomp the lock when the live side cannot be verified.
  • (existing test) "does not expire active runtime-deps install locks by age alone" → still passes unchanged, because the legacy path is preserved exactly when evidence is missing.

All 94 tests in bundled-runtime-deps.test.ts pass locally.

Verified

  • pnpm vitest run --reporter verbose --no-coverage src/plugins/bundled-runtime-deps.test.ts — 94/94 pass.
  • pnpm tsc --noEmit -p tsconfig.json — no new errors related to this file.

Notes for review

  • The capture timing (Date.now() - process.uptime() * 1000) drifts a few ms relative to /proc's starttime jiffies. That's fine because the comparison only runs against the same PID's start-time read from /proc later — equality holds within rounding tolerance and we use Math.round on both sides.
  • I deliberately kept this as a "fix the predicate" change rather than a flock(2) rewrite. A flock-based design (kernel-released on process exit, no staleness logic needed) would be cleaner architecturally but is a much larger change and would require picking a Node lockfile library. Happy to follow up with an RFC if you'd like to go that direction; this PR addresses the immediate Docker hang while preserving every existing invariant.
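
The first note above concerns a few milliseconds of drift between the two start-time sources. The PR text describes rounding both sides and comparing for equality; the sketch below shows a different, hypothetical variant that compares within an explicit tolerance window instead. The function names and the 50 ms default are assumptions, not part of the PR.

```typescript
// Start time of the current process in epoch milliseconds, derived as in the PR text.
// process.uptime() returns seconds as a floating-point number.
function captureOwnStartTimeMs(): number {
  return Math.round(Date.now() - process.uptime() * 1000);
}

// Hypothetical comparison with an explicit tolerance window (not from the PR).
function startTimesMatch(recordedMs: number, liveMs: number, toleranceMs = 50): boolean {
  return Math.abs(recordedMs - liveMs) <= toleranceMs;
}

const recorded = captureOwnStartTimeMs();
console.log(startTimesMatch(recorded, recorded + 3)); // true: a few ms of drift
console.log(startTimesMatch(recorded, recorded + 40_000)); // false: a different incarnation
```

A tolerance trades a small risk of confusing two incarnations that started within the window for robustness against rounding differences between the two sources.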

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/plugins/bundled-runtime-deps.test.ts (modified, +45/-0)
  • src/plugins/bundled-runtime-deps.ts (modified, +49/-4)
  • ui/src/ui/components/dashboard-header.ts (modified, +2/-2)

Raw issue body

Affected versions

Reproduced on ghcr.io/openclaw/openclaw:2026.4.24 and :2026.4.25-beta.4. Code path is unchanged on current main.

Source

/app/dist/bundled-runtime-deps-BdEAdjwi.js (in the v2026.4.24 dist), corresponding to bundled-runtime-deps.ts:

function shouldRemoveRuntimeDepsLock(owner, nowMs) {
  if (!owner) return true;
  if (typeof owner.pid === "number") return !isAlive(owner.pid);
  return typeof owner.createdAtMs === "number"
    && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS;
}

The early return on owner.pid short-circuits the time-based fallback: createdAtMs is only consulted when pid is missing. As long as PID N is alive (which it always is inside the container running the new gateway), the time-based stale check never fires.

Reproduction

  1. docker compose up -d openclaw-gateway — gateway starts cleanly, writes ~/.openclaw/plugin-runtime-deps/openclaw-<version>/.openclaw-runtime-deps.lock/owner.json with {"pid": 7, "createdAtMs": <T0>}.
  2. Force-kill or hard-restart the container in a way that prevents Node's normal shutdown cleanup. We hit this via docker compose down && docker compose up -d, but anything that bypasses graceful exit (OOM, container kill, SIGKILL) reproduces it.
  3. New container starts. The new Node process is also PID 7 inside the container.
  4. bundled-runtime-deps.ts:withBundledRuntimeDepsInstallRootLock calls removeRuntimeDepsLockIfStale(lockDir, nowMs). It reads the leftover owner.json and calls isAlive(7), which returns true (the new process is PID 7).
  5. Lock is not removed. mkdirSync(lockDir) returns EEXIST. Loop spins waiting for the lock until BUNDLED_RUNTIME_DEPS_LOCK_TIMEOUT_MS = 5 * 60_000 elapses, then errors and is retried by the supervisor — the gateway log stays parked at starting… with no further entries.

We have repeatedly worked around this with:

docker compose down openclaw-gateway
rm -rf data/config/plugin-runtime-deps/openclaw-<version>-*/.openclaw-runtime-deps.lock
docker compose up -d openclaw-gateway

and after the lock removal, gateway boots in ~35 seconds.

Why the bug doesn't surface outside Docker

On a host with a normal PID namespace, the previous Node's PID is gone after exit, isAlive(<old-pid>) returns false, and the lock is removed. The bug is invisible. It only bites in containers where PIDs are recycled deterministically.

Recommended fixes (any one would help)

  1. Always consult createdAtMs even when pid is set. A lock older than BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS is stale regardless of PID. Small change to the predicate body:

    if (typeof owner.pid === "number" && isAlive(owner.pid)) {
      if (typeof owner.createdAtMs === "number"
          && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS) {
        return true;
      }
      return false;
    }
    return typeof owner.createdAtMs === "number"
      && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS;
  2. Use process start-time alongside PID (Linux: /proc/<pid>/stat field 22 / starttime jiffies). Two PID-7 processes in different container incarnations have different start-times. isAlive(pid) && startTimeMatches(pid, owner.startTime) distinguishes them.

  3. Use flock(2) on a sentinel file instead of an mkdir lock + owner-json. The kernel releases the lock when the holding process exits (clean or not), so stale locks don't persist across container restarts.

  4. Document the workaround in the Docker install docs and have the gateway's startup script rm -rf any lock dir whose owner.json.createdAtMs is older than e.g. 30s before invoking the gateway.

(1) is the smallest change and the lowest risk. (3) is the most architecturally sound but a bigger refactor.
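
Option 4 could look roughly like the following pre-start cleanup, run before the gateway is invoked so that no live holder can exist yet. The function name removeLockIfOlderThan, the decision to treat a missing or unreadable owner.json as stale, and the temporary directory used in the demo are assumptions for illustration.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical pre-start cleanup: removes lockDir when owner.json is missing,
// unreadable, or older than maxAgeMs. Returns true when the lock was removed.
function removeLockIfOlderThan(lockDir: string, maxAgeMs: number, nowMs: number): boolean {
  if (!fs.existsSync(lockDir)) return false;
  let createdAtMs: unknown;
  try {
    const raw = fs.readFileSync(path.join(lockDir, "owner.json"), "utf8");
    createdAtMs = JSON.parse(raw).createdAtMs;
  } catch {
    createdAtMs = undefined; // assumption: an unreadable owner record counts as stale
  }
  const fresh = typeof createdAtMs === "number" && nowMs - createdAtMs <= maxAgeMs;
  if (fresh) return false;
  fs.rmSync(lockDir, { recursive: true, force: true });
  return true;
}

// Demo against a throwaway directory rather than a real install root.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "runtime-deps-cleanup-"));
const lockDir = path.join(root, ".openclaw-runtime-deps.lock");
fs.mkdirSync(lockDir);
const now = Date.now();
fs.writeFileSync(
  path.join(lockDir, "owner.json"),
  JSON.stringify({ pid: 7, createdAtMs: now - 60_000 }),
);

// The lock is 60 s old and the threshold is 30 s, so it is removed.
const removed = removeLockIfOlderThan(lockDir, 30_000, now);
console.log(removed, fs.existsSync(lockDir)); // true false
```

This only works as a startup step; running it while another gateway process legitimately holds the lock would delete a live lock.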

Adjacent context

This isn't the only failure-mode involving plugin-runtime-deps — #73520 covers stale cross-version directories causing crash-loops on openclaw update, and #71818 / #71599 covered runtime-deps re-install loops on cold start. This issue is distinct: same-version, same-installation, just an unsafe staleness predicate that happens to short-circuit on container PID reuse.

extent analysis

TL;DR

The most likely fix is to always consult createdAtMs even when pid is set, to correctly determine if a lock is stale.

Guidance

  • The issue arises from the shouldRemoveRuntimeDepsLock function not correctly handling the case where a new process has the same PID as a previous process in a Docker container.
  • To fix this, the function should always check the createdAtMs timestamp to determine if a lock is stale, even if the PID is still alive.
  • One possible solution is to modify the shouldRemoveRuntimeDepsLock function to check createdAtMs in addition to pid, as shown in the recommended fixes.
  • Another option is to use a different locking mechanism, such as flock(2), which would release the lock when the holding process exits.

Example

if (typeof owner.pid === "number" && isAlive(owner.pid)) {
  if (typeof owner.createdAtMs === "number"
      && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS) {
    return true;
  }
  return false;
}
return typeof owner.createdAtMs === "number"
  && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS;

Notes

  • The issue only occurs in Docker containers, where PIDs are recycled deterministically.
  • The recommended fixes have different levels of complexity and risk, with option (1) being the smallest change and lowest risk.

Recommendation

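Apply fix (1), always consulting createdAtMs even when pid is set, as it is the smallest change and lowest risk. A lock older than BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS is then treated as stale even if its PID is alive, which prevents the gateway from hanging indefinitely. Note that the linked PR #74361 instead implements option (2), comparing process start-times.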
