openclaw - ✅(Solved) Fix stopChannel abort-timeout leaves zombie task in store, preventing health-monitor restart [1 pull requests, 1 comments, 2 participants]

GitHub stats
openclaw/openclaw#71412 · Fetched 2026-04-26 05:13:08
Comments: 1 · Participants: 2 · Timeline: 4 · Reactions: 0
Timeline (top): referenced ×2, commented ×1, cross-referenced ×1

When stopChannel exceeds CHANNEL_STOP_ABORT_TIMEOUT_MS (5 s), the channel runtime is left in a state where startChannelInternal silently no-ops and the health-monitor believes the restart succeeded. The channel appears running: true, connected: true in the runtime snapshot but no polling loop is active. Recovery requires a manual launchctl kickstart of the gateway.

In practice this bites Telegram polling mode after any network-stack suspend/resume event, most commonly macOS laptop sleep/wake. In both observed incidents polling stayed dead for 6+ hours while openclaw doctor reported Telegram: ok.

Error Message

  • openclaw channels status --probe reports error: channel stop timed out after 5000ms, while the channel is still listed as running, connected, works.
  • The gateway log shows the warning emitted by log.warn?.(`[${id}] channel stop exceeded ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms after abort; continuing shutdown`).

Root Cause

Two functions interact badly on the timeout path. Locations are from the bundled dist/server.impl-*.js in v2026.4.21 (npm openclaw); the line numbers refer to that bundle, and the equivalent source paths in the repo will differ.

stopChannel (around line 2695 of the bundle):

```js
if (!await waitForChannelStopGracefully(task, CHANNEL_STOP_ABORT_TIMEOUT_MS)) {
  log.warn?.(`[${id}] channel stop exceeded ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms after abort; continuing shutdown`);
  setRuntime(channelId, id, {
    accountId: id,
    running: true, // runtime still marked running
    restartPending: false,
    lastError: `channel stop timed out after ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms`
  });
  return; // early return: store.tasks and store.aborts NOT cleared
}
store.aborts.delete(id); // only reached on success
store.tasks.delete(id);
```

startChannelInternal (around line 2491):

```js
if (store.tasks.has(id)) return; // silent no-op if a task is already registered
```
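The issue does not show the body of waitForChannelStopGracefully, only that it yields a falsy value when the task outlives the timeout. As a rough illustration of that contract, here is a minimal sketch built on Promise.race. The name waitForStopSketch and its whole body are assumptions for illustration; the real helper in openclaw may work differently.

```js
// Hypothetical sketch, not the openclaw implementation.
// Resolves true if `task` settles within `timeoutMs`, false otherwise.
async function waitForStopSketch(task, timeoutMs) {
  let timer;
  const timedOut = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  // A task that rejects has still stopped, so both outcomes map to true.
  const settled = Promise.resolve(task).then(() => true, () => true);
  try {
    return await Promise.race([settled, timedOut]);
  } finally {
    clearTimeout(timer); // avoid keeping the event loop alive after the race
  }
}
```

Note that losing the race does not cancel the task: it keeps running (or stays blocked) in the background, which is why the caller has to decide what to do with the leftover reference.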

When the health-monitor runs its await stopChannel(...); await startChannel(...) sequence:

  • stopChannel returns normally (doesn't throw) even on timeout.
  • The zombie task reference remains in store.tasks.
  • startChannelInternal sees it and returns without starting anything.
  • The outer try block in the monitor completes without error, so record.lastRestartAt = now is set and a 10-minute cooldown begins.
  • Every 10 min the same no-op cycle repeats; after maxRestartsPerHour (default 10) it prints "skipping" and goes silent.
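The sequence above can be reproduced in a small toy model. The function and field names mirror the issue text, but every body here is a simplified assumption (including the 20 ms timeout standing in for CHANNEL_STOP_ABORT_TIMEOUT_MS and the never-settling poll task), so treat it as a sketch of the interaction rather than the bundled code.

```js
// Toy model of the pre-fix timeout path; bodies are simplified assumptions.
const STOP_TIMEOUT_MS = 20; // stand-in for CHANNEL_STOP_ABORT_TIMEOUT_MS (5 s in the issue)
const store = { tasks: new Map(), aborts: new Map() };
const runtime = {};
let startCalls = 0;

function startChannelInternal(id) {
  if (store.tasks.has(id)) return; // silent no-op, as in the bundle
  startCalls += 1;
  // A poll loop that never honors abort, like a half-closed socket after sleep.
  const task = new Promise(() => {});
  store.aborts.set(id, new AbortController());
  store.tasks.set(id, task);
  runtime[id] = { running: true, connected: true };
}

async function stopChannel(id) {
  const task = store.tasks.get(id);
  if (!task) return;
  store.aborts.get(id)?.abort();
  const stopped = await Promise.race([
    task.then(() => true),
    new Promise((resolve) => setTimeout(() => resolve(false), STOP_TIMEOUT_MS)),
  ]);
  if (!stopped) {
    // Pre-fix behavior: mark running again and return without clearing the store.
    runtime[id] = { ...runtime[id], running: true, lastError: "channel stop timed out" };
    return;
  }
  store.aborts.delete(id);
  store.tasks.delete(id);
}

async function healthMonitorRestart(id) {
  await stopChannel(id);    // returns normally even on timeout
  startChannelInternal(id); // no-ops because the zombie is still registered
}

const demo = (async () => {
  startChannelInternal("telegram");
  await healthMonitorRestart("telegram");
  return { startCalls, zombie: store.tasks.has("telegram"), runtime: runtime.telegram };
})();
demo.then((result) => console.log(result));
```

In this model the restart never increments startCalls past the initial start, and the runtime entry still claims running and connected, which matches the symptoms described in the issue.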

Why it looks fine from the outside:

  • connected is never written on the timeout path, so it stays true.
  • running is explicitly set back to true.
  • openclaw doctor's Telegram probe is just a getMe call on the bot token — it verifies credentials, not polling liveness.

PR fix notes

PR #71456: fix(channels): release zombie task on stopChannel timeout (#71412)

Description (problem / solution / changelog)

Closes #71412.

Bug

When stopChannel times out (CHANNEL_STOP_ABORT_TIMEOUT_MS), the runtime entry was left at running: true and the timed-out task was still registered in store.tasks / store.aborts. The next startChannelInternal saw the leftover tasks[id] entry and silently no-op'd, so the channel was stuck running:true, connected:true with no live poll loop — e.g. Telegram polling after a macOS sleep/wake.

Fix

Clear store.aborts[id] and store.tasks[id] before recording runtime, and set running: false, so a subsequent restartChannel() registers a fresh poll. Patch is the verbatim diff supplied by the reporter on #71412.

The leaked task may still be blocked on its HTTP read; when the new getUpdates fires, Telegram returns 409 Conflict and terminates it. So this is safe for Telegram and a generic resource-leak fix that's channel-agnostic.
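To make the effect of the fix concrete, the following sketch models the post-fix timeout path and counts how often a mock startAccount runs. The names follow the PR text, but the bodies are simplified assumptions (for example, a single cleanup path and a 20 ms timeout), not the code in src/gateway/server-channels.ts.

```js
// Sketch of the post-fix timeout path; bodies are simplified assumptions.
const STOP_TIMEOUT_MS = 20; // stand-in for CHANNEL_STOP_ABORT_TIMEOUT_MS
const store = { tasks: new Map(), aborts: new Map() };
const runtime = new Map();
let startAccountCalls = 0;

// Hypothetical account starter whose poll loop ignores abort forever.
function startAccount() {
  startAccountCalls += 1;
  return new Promise(() => {});
}

function startChannelInternal(id) {
  if (store.tasks.has(id)) return; // still a no-op while a task is registered
  store.aborts.set(id, new AbortController());
  store.tasks.set(id, startAccount());
  runtime.set(id, { running: true, lastError: null });
}

function waitWithTimeout(task, ms) {
  return Promise.race([
    task.then(() => true, () => true),
    new Promise((resolve) => setTimeout(() => resolve(false), ms)),
  ]);
}

async function stopChannel(id) {
  const task = store.tasks.get(id);
  if (!task) return;
  store.aborts.get(id)?.abort();
  const stopped = await waitWithTimeout(task, STOP_TIMEOUT_MS);
  // Post-fix: the store is released on the timeout path too.
  store.aborts.delete(id);
  store.tasks.delete(id);
  runtime.set(id, {
    running: false,
    lastError: stopped ? null : `channel stop timed out after ${STOP_TIMEOUT_MS}ms`,
  });
}

const fixedRun = (async () => {
  startChannelInternal("telegram");
  await stopChannel("telegram");
  const afterStop = { ...runtime.get("telegram") }; // snapshot before the restart
  startChannelInternal("telegram"); // now registers a fresh task
  return { startAccountCalls, afterStop, afterStart: runtime.get("telegram") };
})();
fixedRun.then((result) => console.log(result));
```

The leaked promise is simply dropped from the store here; nothing in this sketch terminates it, which is the part the 409 Conflict behavior is expected to cover for Telegram.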

Test

Renamed "does not allow a second account task to start when stop times out" → "releases the zombie task when stop times out so the next start can register a fresh one (#71412)":

  • expect(startAccount).toHaveBeenCalledTimes(1) → expect(startAccount).toHaveBeenCalledTimes(2) (a fresh start now actually fires)
  • Dropped the lastError stale-message check; the successful restart overwrites it

Lint clean: pnpm oxlint src/gateway/server-channels.ts — 0 warnings, 0 errors.

Note: full repo pnpm vitest is currently broken on origin/main independent of this change (vitest config workspace error), so I targeted the test rename + assertion update at the file level.

🤖 generated with assistance from Claude Code
Co-authored-by: HCL [email protected]

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/gateway/server-channels.test.ts (modified, +69/-6)
  • src/gateway/server-channels.ts (modified, +34/-11)

Reproduction

  1. Run the gateway with a Telegram polling account on a MacBook.
  2. Unplug the charger and close the lid (or let the machine sleep for ~30 min).
  3. Wake the machine.
  4. Within the next health-monitor cycle, the Telegram getUpdates long-poll socket is dead; the monitor detects stale-socket and calls stopChannel.
  5. The in-flight HTTP request doesn't honor the abort within 5 s (half-closed socket after sleep). stopChannel logs channel stop exceeded 5000ms after abort; continuing shutdown and returns.
  6. Telegram polling never resumes. openclaw channels status --probe reports error: channel stop timed out after 5000ms, yet still lists the channel as running, connected, works.

Proposed fix

On the timeout path, release the stale task reference so the next startChannel can proceed:

```diff
 if (!await waitForChannelStopGracefully(task, CHANNEL_STOP_ABORT_TIMEOUT_MS)) {
   log.warn?.(`[${id}] channel stop exceeded ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms after abort; continuing shutdown`);
+  store.aborts.delete(id);
+  store.tasks.delete(id);
   setRuntime(channelId, id, {
     accountId: id,
-    running: true,
+    running: false,
     restartPending: false,
     lastError: `channel stop timed out after ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms`
   });
   return;
 }
```

Safety of leaving the zombie alive: the timed-out task may still be blocked on its HTTP read. When the next polling loop calls getUpdates, Telegram returns 409 Conflict: terminated by other getUpdates request, which surfaces as an error in the zombie's poll callback and terminates it. This is the same mechanism that makes a full process restart (launchctl kickstart) work cleanly today.
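The safety argument can be illustrated with a toy stand-in for the Bot API. FakeTelegram and pollLoop are hypothetical names, and the model hard-codes the behavior the issue describes (a newer getUpdates call ends the older one with 409 Conflict); it does not talk to Telegram and says nothing about whether a request stuck on a half-closed socket would actually receive that response.

```js
// Toy stand-in for the Telegram Bot API: one parked long-poll at a time.
// Assumption taken from the issue text: a newer getUpdates call ends the
// older one with "409 Conflict: terminated by other getUpdates request".
class FakeTelegram {
  constructor() {
    this.activeReject = null; // reject handle of the currently parked long-poll
  }
  getUpdates() {
    if (this.activeReject) {
      const err = new Error("409 Conflict: terminated by other getUpdates request");
      err.code = 409;
      this.activeReject(err);
    }
    return new Promise((_resolve, reject) => {
      this.activeReject = reject; // park this request until another displaces it
    });
  }
}

// Hypothetical poll loop that exits on 409 instead of retrying.
async function pollLoop(api, name, log) {
  try {
    await api.getUpdates();
  } catch (err) {
    if (err.code === 409) {
      log.push(`${name} terminated by 409`);
      return;
    }
    throw err;
  }
}

const log = [];
const api = new FakeTelegram();
const zombie = pollLoop(api, "zombie", log); // leaked task from before the restart
const fresh = pollLoop(api, "fresh", log);   // task registered after the fix
const conflictDemo = zombie.then(() => log.slice());
conflictDemo.then((entries) => console.log(entries));
```

In this model only the older loop is displaced; the fresh loop stays parked on its own request.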

Possible secondary improvement

openclaw doctor's Telegram health check currently returns OK when the bot token is valid. Consider extending it to also check that channels status --probe shows recent inbound activity (or no stop-timed-out error), so the doctor matches the liveness semantics a user expects.
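One way such a check could be shaped is sketched below. The snapshot fields (tokenValid, lastError, lastInboundAt), the function name assessTelegramHealth, and the 10-minute staleness threshold are all assumptions made for illustration; the real doctor and status output may expose different data.

```js
// Hypothetical liveness check; field names and threshold are assumptions.
const STALE_AFTER_MS = 10 * 60 * 1000; // assumed threshold, not from the issue

function assessTelegramHealth(snapshot, now = Date.now()) {
  const problems = [];
  // Credential check, equivalent in spirit to the existing getMe probe.
  if (!snapshot.tokenValid) problems.push("bot token rejected by getMe");
  // Surface a recorded stop timeout instead of ignoring it.
  if (snapshot.lastError && /stop timed out/.test(snapshot.lastError)) {
    problems.push(`stop timeout recorded: ${snapshot.lastError}`);
  }
  // Flag channels with no recent inbound activity.
  const idleMs = now - (snapshot.lastInboundAt ?? 0);
  if (idleMs > STALE_AFTER_MS) {
    problems.push(`no inbound activity for ${Math.round(idleMs / 1000)}s`);
  }
  return { ok: problems.length === 0, problems };
}

// Usage: a snapshot resembling the zombie state from the issue.
const now = Date.parse("2026-04-26T05:00:00Z");
const zombieSnapshot = {
  tokenValid: true,
  lastError: "channel stop timed out after 5000ms",
  lastInboundAt: now - 6 * 60 * 60 * 1000,
};
const report = assessTelegramHealth(zombieSnapshot, now);
console.log(report.ok, report.problems);
```

A threshold based purely on inbound messages would also flag a bot that is healthy but quiet, so a real check would likely need a signal that does not depend on user traffic.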

Environment

  • macOS 25.4.0 (Darwin), MacBook Pro M5 Max
  • openclaw v2026.4.21 (npm + Homebrew cask)
  • Telegram channel in polling mode

Happy to open a PR if helpful — the patch above is the entire fix as far as I can tell from the bundled output.

extent analysis

TL;DR

The proposed fix involves releasing the stale task reference on the timeout path in the stopChannel function to allow the next startChannel to proceed.

Guidance

  • Review the stopChannel function to ensure that the stale task reference is released on the timeout path.
  • Verify that the store.tasks and store.aborts are properly cleared in the stopChannel function.
  • Test the proposed fix by reproducing the issue and checking if the Telegram polling resumes after the machine wakes up from sleep.
  • Consider extending the openclaw doctor's Telegram health check to also verify recent inbound activity or the absence of stop-timed-out errors.

Example

```js
if (!await waitForChannelStopGracefully(task, CHANNEL_STOP_ABORT_TIMEOUT_MS)) {
    log.warn?.(`[${id}] channel stop exceeded ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms after abort; continuing shutdown`);
    store.aborts.delete(id);
    store.tasks.delete(id);
    setRuntime(channelId, id, {
        accountId: id,
        running: false,
        restartPending: false,
        lastError: `channel stop timed out after ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms`
    });
    return;
}
```

Notes

The proposed fix assumes that releasing the stale task reference will allow the next startChannel to proceed. However, it's essential to test this fix thoroughly to ensure it resolves the issue.

Recommendation

Apply the proposed workaround by releasing the stale task reference in the stopChannel function, as it is a targeted fix for the identified issue.
