openclaw - 💡(How to fix) Fix [Bug]: Heartbeat death loop — pendingFinalDelivery stuck on agent main session, blocks all future heartbeats for days

Bug type

Behavior bug (silent state corruption, no crash, no error logs)

Beta release blocker

No (workaround exists)

Summary

A heartbeat run that returns any non-token text to a session whose origin.to is the auto-reply pseudo-target "heartbeat" puts the agent's main session into a permanent pendingFinalDelivery: true state. Every subsequent heartbeat tick retries the dead delivery first. The retry fails silently (pendingFinalDeliveryLastError stays null while pendingFinalDeliveryAttemptCount climbs) and bumps updatedAt = now, which keeps the 30-second skip-window check (runHeartbeatOnce, lines 866-870 of heartbeat-runner-DpQCcYf2.js) perpetually true. Result: the heartbeat scheduler logs heartbeat: started intervalMs: 3600000 cleanly at every gateway boot, but no actual heartbeat run ever happens again. We observed 64 consecutive hours of silence (2026-05-05 17:54 → 2026-05-08 13:53 WIB) before investigating. cron list, doctor, and system heartbeat last all surface nothing about the stuck state.

Reproduction (deterministic on v2026.5.7)

  1. Configure an agent with heartbeat: { every: "60m" } and leave target unset (it defaults to "none"). The agent has no real Telegram/Discord delivery target for its heartbeat.
  2. Let the agent's heartbeat fire normally a few times. The agent:<agent>:main entry in ~/.openclaw/agents/<agent>/sessions/sessions.json accumulates origin: { label: "heartbeat", from: "heartbeat", to: "heartbeat" }.
  3. Force the heartbeat to return any text other than the bare token. In our case the agent prepended a preamble: "All clear.\n\nHEARTBEAT_OK" instead of bare HEARTBEAT_OK. (See "Even bare HEARTBEAT_OK still triggers pending" below: a non-token preamble accelerates the problem but isn't required.)
  4. The session enters pendingFinalDelivery: true with the preamble text.
  5. Wait one heartbeat interval. Inspect:
    python3 -c "import json; e=json.load(open('/home/openclaw/.openclaw/agents/<agent>/sessions/sessions.json'))['agent:<agent>:main']; print({k:e.get(k) for k in ['pendingFinalDelivery','pendingFinalDeliveryAttemptCount','pendingFinalDeliveryLastError','updatedAt']})"
  6. pendingFinalDeliveryAttemptCount climbs by 1 every hour, but pendingFinalDeliveryLastError stays null. updatedAt matches each retry timestamp.

In our reproduction we hit attemptCount: 64 before noticing.
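The inspection in step 5 can also be run across every session at once. The following is a minimal sketch, not part of OpenClaw: it assumes sessions.json is a JSON object keyed by session key (as the one-liner above implies), and the helper names pending_summary and load_sessions are hypothetical.

```python
import json
from pathlib import Path

# Field names come from the issue; the helper names are hypothetical.
FIELDS = (
    "pendingFinalDelivery",
    "pendingFinalDeliveryAttemptCount",
    "pendingFinalDeliveryLastError",
    "updatedAt",
)


def pending_summary(sessions: dict) -> dict:
    """Return the four diagnostic fields for every entry flagged as pending."""
    return {
        key: {field: entry.get(field) for field in FIELDS}
        for key, entry in sessions.items()
        if isinstance(entry, dict) and entry.get("pendingFinalDelivery") is True
    }


def load_sessions(path: str) -> dict:
    """Load sessions.json, assumed to be a JSON object keyed by session key."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling pending_summary(load_sessions(".../sessions/sessions.json")) once per heartbeat interval should show the attempt count rising while the last-error field stays None, if the session is in the state described here.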

Even bare HEARTBEAT_OK still triggers pending

After tightening the agent's HEARTBEAT.md to forbid preamble, we forced a run on a freshly created session entry (manually, with openclaw system event --mode now --text "test" --url ws://127.0.0.1:18789 --token $OPENCLAW_GATEWAY_TOKEN). The agent returned the literal token HEARTBEAT_OK, and the session still ended with pendingFinalDelivery: true, pendingFinalDeliveryText: "HEARTBEAT_OK". So the stripHeartbeatToken call in heartbeat-Dynyl6hI.js:52-87 either runs after the pending-queue write, or its empty-after-strip output isn't gating the queueing. The runner should treat "output that strips to empty" as effectively empty and skip the final-delivery queue entirely.

The only thing that kept the death loop from re-forming on the fresh session was that the new session has origin: null and lastTo: null (rather than origin.to: "heartbeat"), so there is nothing for dispatch to retry against. The pending flag stays cosmetically true, but updatedAt doesn't get bumped, and the next 60m tick fires normally.

Two distinct issues compounding

Bug A — pending-delivery flag set even when output is the bare token

  • agent-runner.runtime-DQsCsHUA.js:4093-4095 sets pendingFinalDelivery: true and pendingFinalDeliveryText: pendingText whenever pendingText is non-empty by the runner's metric.
  • For heartbeat sessions, "output that strips to the empty string" should count as effectively-empty. Currently pendingText = "HEARTBEAT_OK" reaches that block.
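To make the intended gate concrete, here is a small Python model of the decision. It is a sketch under stated assumptions, not the runner's code: strip_heartbeat_token is a deliberately simplified stand-in for the real stripHeartbeatToken (whose exact behavior is not shown in this issue), and pending_fields_for is a hypothetical name.

```python
HEARTBEAT_TOKEN = "HEARTBEAT_OK"


def strip_heartbeat_token(text: str) -> str:
    """Hypothetical simplification: drop the token and surrounding whitespace."""
    return text.replace(HEARTBEAT_TOKEN, "").strip()


def pending_fields_for(text: str, is_heartbeat: bool) -> dict:
    """Return the session fields to write, or {} when nothing should be queued."""
    # For heartbeat sessions, judge emptiness after stripping the token.
    effective = strip_heartbeat_token(text) if is_heartbeat else text.strip()
    if not effective:
        return {}
    return {"pendingFinalDelivery": True, "pendingFinalDeliveryText": text}
```

Under this model a bare HEARTBEAT_OK on a heartbeat session queues nothing, while "All clear.\n\nHEARTBEAT_OK" is still queued because real text remains after stripping.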

Bug B — silent retry against pseudo-target with no error captured

  • dispatch-8E8vi2HV.js:227-246 (clearPendingFinalDeliveryAfterSuccess) only clears the flag on success. There is no corresponding recordPendingFinalDeliveryFailure that captures the error string into pendingFinalDeliveryLastError — so failures look identical to "still trying" and never surface in logs.
  • When delivery.to === "heartbeat" (the auto-reply pseudo-channel set on the session origin) and no real channel adapter resolves, the dispatch path returns silently. Compare #78532 (closed 2026-05-07) which addressed a similar deliverySucceeded=true masquerade — this is the same family of telemetry-vs-state mismatch on the failure side.
  • The 30s skip window in runHeartbeatOnce:
    if (recentSessionEntry?.pendingFinalDelivery === true
        && recentSessionEntry?.updatedAt
        && startedAt - recentSessionEntry.updatedAt < 3e4) return SKIP_REQUESTS_IN_FLIGHT;
    is correct in principle, but combined with retry logic that bumps updatedAt = now on each silent failure, it becomes a perpetual block.
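The interaction between the retry and the skip window can be reproduced in a few lines. This is a toy Python model, not the actual runner: it assumes the retry happens at the start of each tick and sets updatedAt to that same instant, and the names SessionEntry, tick, and simulate are hypothetical. The has_retry_target flag models the difference between origin.to: "heartbeat" (dispatch retries and bumps updatedAt) and origin: null (nothing to retry against).

```python
from dataclasses import dataclass

SKIP_WINDOW_MS = 30_000      # the 3e4 in the quoted check
INTERVAL_MS = 3_600_000      # intervalMs from the scheduler log line


@dataclass
class SessionEntry:
    pending_final_delivery: bool
    updated_at: int
    attempt_count: int = 0


def tick(entry: SessionEntry, started_at: int, has_retry_target: bool) -> str:
    """Model one heartbeat tick: retry the pending delivery, then apply the skip check."""
    if entry.pending_final_delivery and has_retry_target:
        entry.attempt_count += 1        # the retry fails silently
        entry.updated_at = started_at   # updatedAt = now
    if (entry.pending_final_delivery
            and entry.updated_at
            and started_at - entry.updated_at < SKIP_WINDOW_MS):
        return "SKIP_REQUESTS_IN_FLIGHT"
    return "RUN"


def simulate(ticks: int, has_retry_target: bool) -> list:
    """Run hourly ticks against a session that is already stuck in pending."""
    entry = SessionEntry(pending_final_delivery=True, updated_at=1)
    return [tick(entry, (i + 1) * INTERVAL_MS, has_retry_target)
            for i in range(ticks)]
```

With has_retry_target=True every tick is skipped, because the retry has just refreshed updatedAt. With has_retry_target=False the hour-old updatedAt falls outside the window and every tick runs, which matches the behavior reported for the fresh session.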

Workaround (proven on v2026.5.7)

Stop gateway → drop the entire agent:<agent>:main entry from sessions.json and remove its associated *.jsonl/*.trajectory.jsonl files → restart gateway. The runner re-creates the session on the next tick with origin: null, breaking the dispatch retry loop.

Clearing only the pendingFinalDelivery* fields is insufficient — we verified within 8 minutes that the same heartbeat output re-creates the stuck state, because origin.to: "heartbeat" is still on the session and keeps re-dispatching.
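The sessions.json part of the workaround could be scripted roughly as follows. This is a sketch only: drop_session_entry is a hypothetical helper, it assumes sessions.json is a JSON object keyed by session key, and it must be run only while the gateway is stopped. It does not touch the *.jsonl/*.trajectory.jsonl files, because their naming is not given in this issue; remove those by hand.

```python
import json
import shutil
from pathlib import Path


def drop_session_entry(sessions_path: str, session_key: str) -> bool:
    """Remove one entry from sessions.json, keeping a .bak copy of the original.

    Returns False (and changes nothing) when the key is not present.
    """
    path = Path(sessions_path)
    sessions = json.loads(path.read_text(encoding="utf-8"))
    if session_key not in sessions:
        return False
    # Back up first so the previous state can be restored if needed.
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    del sessions[session_key]
    path.write_text(json.dumps(sessions, indent=2), encoding="utf-8")
    return True
```

For example, drop_session_entry(".../sessions/sessions.json", "agent:<agent>:main") removes the whole entry, including origin, rather than only the pendingFinalDelivery* fields.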

Suggested fixes

  1. (Bug A) In the agent-runner pending-delivery write, gate on isHeartbeatContentEffectivelyEmpty(stripHeartbeatToken(text).text). If stripped output is empty, skip the pending queue write entirely.
  2. (Bug B-1) Capture dispatch failures into pendingFinalDeliveryLastError so silent failures become visible.
  3. (Bug B-2) When delivery.to === "heartbeat" and no channel plugin resolves, treat the delivery as successful and run clearPendingFinalDeliveryAfterSuccess. Reaching the pseudo-target is itself the acknowledgement; persistent retry is the bug.
  4. (Hardening) openclaw doctor should warn when any session has pendingFinalDelivery: true AND now - pendingFinalDeliveryCreatedAt > 1h AND pendingFinalDeliveryLastError === null. Together, those three conditions are the signature that currently hides this bug.
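Until a check like the one in item 4 exists in openclaw doctor, the same triple can be evaluated externally. The following is a minimal sketch, assuming timestamps are epoch milliseconds (consistent with the 3e4 comparison on updatedAt, but not confirmed for pendingFinalDeliveryCreatedAt); stuck_sessions is a hypothetical helper, not an OpenClaw API.

```python
import time
from typing import Optional

STALE_AFTER_MS = 60 * 60 * 1000  # "> 1h" from the suggested check


def stuck_sessions(sessions: dict, now_ms: Optional[int] = None) -> list:
    """Return keys that are pending, older than one hour, and have no recorded error."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    stuck = []
    for key, entry in sessions.items():
        if not isinstance(entry, dict):
            continue
        created = entry.get("pendingFinalDeliveryCreatedAt")
        if (entry.get("pendingFinalDelivery") is True
                and isinstance(created, (int, float))
                and now_ms - created > STALE_AFTER_MS
                and entry.get("pendingFinalDeliveryLastError") is None):
            stuck.append(key)
    return stuck
```

Entries without a pendingFinalDeliveryCreatedAt value are skipped rather than flagged, which errs on the side of fewer false warnings.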

Environment

  • OpenClaw: 2026.5.7 (eeef486)
  • OS: Ubuntu 24.04 / Linux 6.8.0-110-generic x86_64
  • Node: v22.22.2
  • Install: npm global as system-level systemd service
  • Topology: 5-agent (orchestrator + reasoner + coder + fast + multimodal)
  • Affected agent: fast — 60m heartbeat, no target set, runs deepseek-v4-flash via opencode-go provider
  • Channel: Telegram-bound orchestrator; the affected agent has no direct user-facing channel.

Related issues (not duplicates)

  • #59710 — Heartbeat silently stops after ~20h (this issue's diagnostic mechanism may be the underlying cause)
  • #78187 — Heartbeat polling silently stops after SIGUSR1 gateway restart
  • #74257 — HEARTBEAT_OK/internal text leak (inverse symptom of same path)
  • #78532 — deliverySucceeded=true when no adapter invoked (CLOSED, sibling)
  • #55882 — Agent can drop promised outputs after task switching (broader pending-deliverables queue durability)
  • #65498 — Main-session user task can lose final reply after heartbeat or exec-completion interrupt (CLOSED — related fix area)
