[Bug]: Telegram ingress spool orphan claim from prior container blocks all inbound (processExists(1) false positive in PID-namespaced runtimes)

Summary

After a non-graceful container restart (e.g. docker kill, host reboot, OOM), the Telegram polling channel reports healthy but never drains queued updates. The root cause is a PID-liveness check in isTelegramSpooledUpdateClaimOwnedByOtherLiveProcess that returns a false positive in any PID-namespaced runtime (Docker, Podman, LXC), because both the old and new gateway processes run as PID 1. Recovery is gated on this check, so orphan claims persist for the full staleMs window (default 6h).

This is a reliability bug, not a security boundary bypass — SECURITY.md trust model does not apply, so filing publicly per guidance.

Bug type

Behavior bug (incorrect output/state without crash). Channel reports running: true, connected: true, lastError: null indefinitely; lastInboundAt stays null.

Beta release blocker

No.

Steps to reproduce

Requires OpenClaw 2026.5.19 with Telegram polling, running in any container (Docker / Podman / LXC).

  1. Start a fresh container, configure Telegram polling, send one DM to the bot.
  2. Confirm the message processes normally (lastInboundAt advances).
  3. Force a controlled stall so a .processing file persists on disk (e.g. an await in the lane handler that holds long enough), then kill the container non-gracefully (docker kill <container>) while the handler is still stalled. Alternative without a stall: send a message and run docker kill <container> within ~50ms of the inbound; this is racy but reproduces after a few attempts.
  4. Verify on disk that ls $XDG_STATE_HOME/openclaw/telegram/ingress-spool-default/*.json.processing lists at least one file.
  5. Start the container again. PID 1 will be the new node gateway.
  6. Send any new DM. The new .json file lands in the spool but is never processed.
  7. Observe: lastInboundAt unchanged. No error logs. openclaw channels status reports healthy.
  8. Wait staleMs (default 6h) OR run the workaround below for recovery.

The race in step 3 can be eliminated for tests by lowering staleMs via recoverStaleTelegramSpooledUpdateClaims({staleMs: 60000, ...}) and timing the kill to land within the 60s window.

Expected behavior

Orphan claims from a previous container instance should be reclaimed within one drain cycle, since the previous instance is no longer running. The existing staleMs mtime gate would already provide this if the PID check did not short-circuit it.

Actual behavior

Orphan claims persist for the full staleMs window (default 6h = 216e5 ms). During that window, all new updates queue on disk but never enter the agent loop. lastInboundAt does not advance. No diagnostics indicate the stall.

OpenClaw version

2026.5.19

Operating system

Debian 12 (Proxmox LXC, container CT112). Reproduced behavior is independent of host OS — the trigger is PID-namespaced runtime, not the host.

Install method

Docker container (custom image based on ghcr.io/openclaw/openclaw:latest, deployed via docker-compose). The container runs the gateway as PID 1.

Model

Not relevant to this bug (any model). For reference: openai-codex/gpt-5.5 via OAuth.

Provider / routing chain

Telegram polling via getUpdates. No gateway/proxy in front. The bug is in the on-disk spool reclamation logic, upstream of provider routing.

Logs, screenshots, and evidence

Real orphan file captured 2026-05-21 from a Marvin gateway (OpenClaw 2026.5.19, container lifeos-stack-openclaw-1):

-rw------- 1 node node 4569 May 21 16:26 0000000885526467.json.processing
-rw------- 1 node node  424 May 21 16:42 0000000885526468.json
... 15 more files accumulated over 4 hours ...
-rw------- 1 node node  423 May 21 20:10 0000000885526484.json

Orphan claim contents:

{
  "version": 1,
  "updateId": 885526467,
  "receivedAt": 1779380802999,
  "update": { /* message body */ },
  "claim": {
    "processId": "1:8f095cfa-c66d-4c99-956b-12fe3365eaf6",
    "processPid": 1,
    "claimedAt": 1779380803393
  }
}

Container restarted at 1779389240521 (~2h21m after the claim). All 18 messages, including the one in the .processing file, sat undispatched for ~4 hours. lastInboundAt stayed null throughout.
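The timeline can be checked directly from the two timestamps above. The sketch below is only arithmetic on values quoted in this report; it assumes the stale window is measured from roughly the claim time (the report describes an mtime gate, and the file's mtime matches the claim time to the minute). The helper name formatDuration is illustrative.

```javascript
// Timestamps copied from the orphan claim and the restart noted above (epoch ms).
const claimedAt = 1779380803393;
const restartedAt = 1779389240521;
const STALE_MS = 6 * 60 * 60 * 1000; // documented default: 6h = 21,600,000 ms

// Format a millisecond duration as "XhYmZs" (whole seconds, rounded down).
function formatDuration(ms) {
  const totalSeconds = Math.floor(ms / 1000);
  const h = Math.floor(totalSeconds / 3600);
  const m = Math.floor((totalSeconds % 3600) / 60);
  const s = totalSeconds % 60;
  return `${h}h${m}m${s}s`;
}

// Age of the claim when the new container came up.
const claimAgeAtRestart = restartedAt - claimedAt;
// Assumption: the window counts from claimedAt, so this is how long the
// restarted gateway would keep skipping the file without intervention.
const remainingStall = STALE_MS - claimAgeAtRestart;

console.log(formatDuration(claimAgeAtRestart)); // 2h20m37s
console.log(formatDuration(remainingStall)); // 3h39m22s
```

Under that assumption, the restarted gateway would have stayed silent for roughly another 3h39m had the workaround not been applied.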

After applying the workaround below (strip the claim field and rename the file from .json.processing to .json), the drain loop reclaimed the file within seconds with a fresh processId, logged [telegram] Inbound message ... (3843 chars), and lastInboundAt advanced.

Impact and severity

  • Affected: any OpenClaw deployment running Telegram polling in a containerized runtime (Docker, Podman, LXC). Bare-metal deployments are largely unaffected because the gateway does not run as PID 1 there, so the PID recorded in an orphan claim is normally dead after a restart.
  • Severity: HIGH. Silent total inbound stall on the Telegram channel. The channel and openclaw doctor both report healthy, so the failure is invisible to monitoring built around those signals.
  • Frequency: 100% after any non-graceful restart that leaves a .processing file on disk. Operator-initiated graceful shutdowns avoid the bug because the worker finishes the in-flight handler before exit.
  • Consequence: missed messages until the staleMs window expires (default 6h). Users perceive the bot as ignoring them; operator perceives a healthy system.

Additional information

Root cause (function-level)

telegram-ingress-spool.ts (bundled as dist/telegram-ingress-spool-*.js), function isTelegramSpooledUpdateClaimOwnedByOtherLiveProcess:

function isTelegramSpooledUpdateClaimOwnedByOtherLiveProcess(claim) {
  return Boolean(
    claim.claim
    && claim.claim.processId !== TELEGRAM_SPOOLED_UPDATE_PROCESS_ID  // ✓ different UUID
    && isFreshClaimOwner(claim.claim)                                 // ✓ claimedAt within staleMs default
    && processExists(claim.claim.processPid)                          // ✗ — see below
  );
}

function processExists(pid) {
  try { process.kill(pid, 0); return true; }
  catch (err) { return err.code !== "ESRCH"; }
}

In a container, the main process is always PID 1. After a restart, the OLD container's claim with processPid: 1 is checked against the NEW container's process table. process.kill(1, 0) succeeds because PID 1 exists (the new gateway). The orphan claim is therefore classified as held by another live process, and the file is never reclaimed.
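The false positive can be reproduced without a container by injecting the liveness probe. The following is a minimal sketch, not the gateway's actual code: the function is a simplified stand-in for the check quoted above, the freshness test (now - claimedAt < staleMs) is an assumed reading of isFreshClaimOwner, and ownProcessId is a made-up value for the restarted gateway.

```javascript
// Simplified stand-in for the quoted check, with the process ID, clock, and
// liveness probe injected so the container case can be simulated anywhere.
function isClaimOwnedByOtherLiveProcess(record, { ownProcessId, now, staleMs, processExists }) {
  const claim = record.claim;
  return Boolean(
    claim &&
      claim.processId !== ownProcessId &&
      now - claim.claimedAt < staleMs && // assumed semantics of isFreshClaimOwner
      processExists(claim.processPid)
  );
}

// Orphan claim left behind by the previous container (values from the report).
const orphan = {
  claim: {
    processId: "1:8f095cfa-c66d-4c99-956b-12fe3365eaf6",
    processPid: 1,
    claimedAt: 1779380803393,
  },
};

const ctx = {
  ownProcessId: "1:new-instance-uuid", // hypothetical ID of the restarted gateway
  now: 1779389240521, // restart time from the report
  staleMs: 6 * 60 * 60 * 1000,
};

// In a PID-namespaced container, PID 1 always exists: it is the new gateway.
const containerProbe = (pid) => pid === 1;
// A probe that reports the recorded PID as gone, as is typical after a
// restart outside a container.
const deadPidProbe = () => false;

const blockedInContainer = isClaimOwnedByOtherLiveProcess(orphan, { ...ctx, processExists: containerProbe });
const blockedWhenPidGone = isClaimOwnedByOtherLiveProcess(orphan, { ...ctx, processExists: deadPidProbe });

console.log({ blockedInContainer, blockedWhenPidGone });
// { blockedInContainer: true, blockedWhenPidGone: false }
```

The only input that differs between the two calls is the probe result, which is why the same claim is recoverable when the PID is gone and stuck inside a container.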

The drain loop already invokes recoverStaleTelegramSpooledUpdateClaims with staleMs: 0 on every cycle (monitor-polling.runtime.ts, the #drainSpooledUpdates method), but its shouldRecover predicate defers to the same function:

shouldRecover: (claim) =>
  !activeLaneKeys.has(this.#spooledUpdateLaneKey(claim))
  && !isTelegramSpooledUpdateClaimOwnedByOtherLiveProcess(claim)

So both recovery layers are gated on a check that returns false positives in containers. The staleMs mtime fallback eventually fires at the 6h ceiling, but the silent stall in the meantime is the user-visible symptom.

The PID-liveness check is reasonable for non-containerized deployments where PID-1 reuse doesn't happen. The issue is specific to PID-namespaced runtimes.

Suggested fix

Two viable approaches, ordered by code change size:

Option A — UUID-only ownership (minimum change). Drop the processExists check entirely. If claim.processId !== TELEGRAM_SPOOLED_UPDATE_PROCESS_ID, the claim is by a different runtime instance; combined with the existing staleMs mtime gate, this is sufficient. Operators who want faster recovery can lower staleMs. Trade-off: two healthy peers on the same machine targeting the same spool directory would race, but that topology is already not supported (one gateway per spool directory per docs/gateway/index.md lines 152-180).

Option B — process start time check (Linux/proc-aware). Read /proc/<claim.processPid>/stat field 22 (start_time in clock ticks since boot) and convert to wall-clock via /proc/stat's btime line. If (btime + start_time / sysconf(_SC_CLK_TCK)) * 1000 > claim.claimedAt, the current PID started AFTER the claim, so it's a different process and the claim is stale. Falls back to current behavior on non-Linux. More work, more precise, container-correct.
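A rough sketch of the Option B comparison follows. It assumes a clock tick rate of 100 per second (the usual Linux USER_HZ; Node has no sysconf(), so a real fix would need to confirm it, for example with getconf CLK_TCK). All function names are illustrative and not part of OpenClaw, and the file reader is injectable so the parsing can be exercised off Linux.

```javascript
const fs = require("node:fs");

// Assumption: 100 ticks per second, the usual Linux value.
const ASSUMED_CLK_TCK = 100;

// Extract field 22 (starttime, clock ticks since boot) from /proc/<pid>/stat.
// Field 2 (comm) may contain spaces or parentheses, so split after the last ')'.
function parseStartTicks(statText) {
  const idx = statText.lastIndexOf(")");
  if (idx < 0) return null;
  const rest = statText.slice(idx + 1).trim().split(/\s+/);
  const ticks = Number(rest[19]); // rest[0] is field 3, so field 22 is rest[19]
  return Number.isFinite(ticks) ? ticks : null;
}

// Extract boot time (epoch seconds) from the "btime" line of /proc/stat.
function parseBootTimeSeconds(procStatText) {
  const match = /^btime\s+(\d+)$/m.exec(procStatText);
  return match ? Number(match[1]) : null;
}

// Wall-clock start time of the process in epoch ms, or null if unparseable.
function processStartMs(statText, procStatText, clkTck = ASSUMED_CLK_TCK) {
  const ticks = parseStartTicks(statText);
  const btime = parseBootTimeSeconds(procStatText);
  if (ticks === null || btime === null) return null;
  return btime * 1000 + Math.round((ticks * 1000) / clkTck);
}

// True when the process now holding the PID started after the claim was
// written, meaning it cannot be the process that wrote the claim.
function pidStartedAfterClaim(claim, readFile = (p) => fs.readFileSync(p, "utf8")) {
  try {
    const startMs = processStartMs(readFile(`/proc/${claim.processPid}/stat`), readFile("/proc/stat"));
    return startMs !== null && startMs > claim.claimedAt;
  } catch {
    return false; // no /proc (non-Linux) or unreadable: keep current behavior
  }
}
```

One way this could slot into the existing check is to treat a claim as live only when processExists(pid) is true and pidStartedAfterClaim(claim) is false; whether that matches the maintainers' preferred structure is an open question.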

Option A is the minimum surface change; Option B adds robustness at the cost of platform-specific code.

Suggested regression test

Write a .json.processing file with claim.processId="1:fake-uuid", claim.processPid=1, claim.claimedAt=Date.now()-60000. Start the spool worker with staleMs: 600000 (10 min, well above the claim age). Assert that within one drain cycle, the file is reclaimed (its claim.processId becomes the worker's own UUID) — NOT skipped.

Under current code the test would fail: the worker sees processId !== ours ✓ + staleMs > claim age ✓ + processExists(1) ✓ → concludes "live owner" → leaves the file.
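The fixture half of that test could look roughly like the sketch below. The file name, updateId, and empty update body are placeholders, and the helper name is illustrative; how the spool worker is started and how a drain cycle is triggered is left out because that API is not shown in this report.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Write an orphan claim like the one described above into spoolDir.
// The claim is 60 seconds old and names PID 1 with a foreign process ID.
function writeOrphanClaimFixture(spoolDir, now = Date.now()) {
  const record = {
    version: 1,
    updateId: 1, // placeholder update ID
    receivedAt: now - 60000,
    update: {}, // placeholder message body
    claim: { processId: "1:fake-uuid", processPid: 1, claimedAt: now - 60000 },
  };
  // Placeholder name following the zero-padded pattern seen in the listing.
  const file = path.join(spoolDir, "0000000000000001.json.processing");
  fs.writeFileSync(file, JSON.stringify(record));
  return file;
}

// Use a throwaway directory so the fixture never touches a real spool.
const spoolDir = fs.mkdtempSync(path.join(os.tmpdir(), "spool-fixture-"));
const fixturePath = writeOrphanClaimFixture(spoolDir);
console.log("fixture written to", fixturePath);

// Next (not shown): run one drain cycle against spoolDir with staleMs: 600000
// and assert that the file's claim.processId is now the worker's own ID.
```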

Operator workaround (no upstream change required)

Inside the affected container:

SPOOL=/home/node/.openclaw/telegram/ingress-spool-default
for f in "$SPOOL"/*.json.processing; do
  [ -e "$f" ] || continue
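  # Only remove the claimed file if jq succeeded; otherwise discard the partial output.
  jq 'del(.claim)' "$f" > "${f%.processing}" || { rm -f "${f%.processing}"; continue; }
  rm "$f"
  echo "released $(basename "$f")"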
done

jq is present in the official openclaw container image. For environments without jq, equivalent via node -e:

for f in "$SPOOL"/*.json.processing; do
  [ -e "$f" ] || continue
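  # Pass the path as an argument (process.argv[1] under node -e) so quotes in the path cannot break the script.
  node -e 'const fs=require("fs"),p=process.argv[1]; const d=JSON.parse(fs.readFileSync(p,"utf8")); delete d.claim; fs.writeFileSync(p.replace(/\.processing$/,""), JSON.stringify(d)); fs.unlinkSync(p);' "$f"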
done

No container restart needed; the drain loop picks up renamed files within one cycle.

Investigation and review

Investigation, function-level forensics, and write-up were done by Claude Opus 4.7 and Codex CLI (gpt-5.5) in collaboration with the operator. The report was then independently reviewed by two adversarial reviewers — Kimi K2.6 and DeepSeek v4 Pro, both via Ollama Cloud — acting as verifiers. Their MUST-FIX items have been integrated into this version: specifically, the earlier draft contained an incorrect /proc/stat calculation in Option B (mixing jiffies-since-boot with epoch ms without btime conversion) and an overclaimed "monitoring is deceived" framing that wasn't grounded in code. Both have been corrected.

Happy to provide additional artifacts (the full live orphan file with redacted message body, raw container logs spanning the stall, comparison getWebhookInfo output) or join a debugging session if useful.


Reported by Michael Guirguis (operator of the lifeos-stack Marvin gateway). Verified live on OpenClaw 2026.5.19, 2026-05-21.
