openclaw - ✅(Solved) Fix [Bug]: Slack socket permanently dead after event-loop starvation — manuallyStopped suppresses auto-reconnect [2 pull requests, 1 comments, 2 participants]

GitHub stats
openclaw/openclaw#77651 (fetched 2026-05-06 06:23:23)
Comments: 1
Participants: 2
Timeline: 4
Reactions: 2
Timeline (top): cross-referenced ×2, closed ×1, commented ×1

Fix Action

PR fix notes

PR #77682: Fix: Issue 77651 channel stop timeout

Description (problem / solution / changelog)

Summary

  • Problem: health-monitor recovery stops could time out while leaving a channel account treated like an explicit manual stop, suppressing later reconnects.
  • Why it matters: Slack Socket Mode and other long-lived channel tasks could stay dead until a full gateway restart after event-loop starvation or an abort-ignoring provider task.
  • What changed: health-monitor restarts now use a non-manual stop mode; non-manual stop timeouts detach stale tasks so replacements can start; stale task completion and status writes are guarded so old tasks cannot clobber replacement runtime state.
  • What did NOT change (scope boundary): no Slack-specific plugin logic, no health threshold/backoff changes, no new config, no UI/API surface changes beyond the internal optional stop mode.
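The non-manual stop mode described above can be pictured with a small model. This is a minimal sketch, not the gateway's code: `markStop`, `mayAutoRestart`, the `"slack:default"` key format, and the choice of `manual: true` as the default are all assumptions made for illustration. Only the `{ manual: false }` option shape and the `manuallyStopped` set come from the issue and PR text.

```typescript
// Hypothetical, simplified model of the stop bookkeeping (not the real gateway code).
type StopOptions = { manual?: boolean };

const manuallyStopped = new Set<string>();

// Records the stop intent. Only explicit manual stops suppress auto-reconnect.
function markStop(rKey: string, opts: StopOptions = {}): void {
  const manual = opts.manual ?? true; // assumed default: existing callers stay manual
  if (manual) {
    manuallyStopped.add(rKey);
  }
}

// The auto-restart loop would consult a check like this before reconnecting.
function mayAutoRestart(rKey: string): boolean {
  return !manuallyStopped.has(rKey);
}

markStop("slack:default", { manual: false }); // health-monitor recovery stop
markStop("slack:archivist");                  // user-initiated stop

console.log(mayAutoRestart("slack:default"));   // true
console.log(mayAutoRestart("slack:archivist")); // false
```

Because the recovery stop never touches `manuallyStopped`, a later timeout has nothing to clean up on that path, which is the property the PR relies on.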

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #77651
  • Related #77634
  • Related #77626
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: stopChannel() always marked channel accounts as manually stopped before aborting. When a health-monitor stop timed out, the timeout path returned without clearing manuallyStopped or the tracked task, so recovery starts were suppressed or blocked by stale task state.
  • Missing detection / guardrail: there was coverage for manual stop timeout duplicate-task protection, but not for health-monitor recovery stop timeout, replacement start, or stale task status writes after detachment.
  • Contributing context (if known): Slack Socket Mode can lose heartbeat during event-loop starvation, and an abort-ignoring task can keep the old provider task alive past the gateway stop timeout.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/server-channels.test.ts, src/gateway/channel-health-monitor.test.ts
  • Scenario the test should lock in: non-manual recovery stop timeouts must not poison manual-stop state; replacement tasks must be able to start; stale task completion/status writes must not clobber replacement runtime state.
  • Why this is the smallest reliable guardrail: the bug is in gateway channel lifecycle state, so mocked channel tasks can deterministically reproduce abort-ignoring timeout behavior without live Slack credentials.
  • Existing test that already covers this (if any): existing manual stop timeout coverage protected duplicate-task behavior but encoded the manual/ghost-running path, not health-monitor recovery.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Gateway channel health recovery can reconnect a channel account after a timed-out recovery stop instead of leaving it indefinitely suppressed as manually stopped.

Diagram (if applicable)

Before: [health monitor restart] -> [stop timeout] -> [manual stop marker + stale task] -> [no reconnect]

After: [health monitor restart] -> [non-manual stop timeout] -> [detach stale task] -> [replacement starts] -> [stale writes ignored]

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local Node/pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): Gateway channel lifecycle; reported via Slack Socket Mode
  • Relevant config (redacted): N/A

Steps

  1. Start a channel account whose task ignores abort and never settles.
  2. Trigger stopChannel(..., { manual: false }) and advance past the 5000ms stop timeout.
  3. Start the same account again and allow the stale task to complete or publish status.
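The repro steps above can be sketched without Slack credentials. The code below is an illustrative stand-in, not the project's test: `settlesWithin` plays the role of waitForChannelStopGracefully (whose real implementation is not shown in the issue), `startStuckTask` is a hypothetical abort-ignoring task, and a 50ms timeout stands in for the 5000ms stop timeout.

```typescript
// Resolves true if the task settles within timeoutMs, false otherwise.
// Assumed stand-in for waitForChannelStopGracefully; the real helper may differ.
async function settlesWithin(task: Promise<unknown>, timeoutMs: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  // Treat both fulfilment and rejection of the task as "settled".
  const settled = task.then(() => true, () => true);
  try {
    return await Promise.race([settled, timedOut]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Hypothetical task that ignores its AbortSignal and settles only when released.
function startStuckTask(_signal: AbortSignal): { task: Promise<void>; release: () => void } {
  let release: () => void = () => {};
  const task = new Promise<void>((resolve) => {
    release = resolve;
  });
  return { task, release };
}

async function demo(): Promise<boolean> {
  const controller = new AbortController();
  const { task, release } = startStuckTask(controller.signal); // step 1
  controller.abort();                                          // step 2: abort is ignored
  const stoppedInTime = await settlesWithin(task, 50);         // stands in for 5000ms
  release();                                                   // step 3: stale task settles late
  return stoppedInTime;
}

demo().then((stoppedInTime) => console.log("stop completed in time:", stoppedInTime));
```

With a task that never reacts to abort, `demo()` resolves to `false`, which is the timeout branch the regression tests need to exercise.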

Expected

  • Recovery stop timeout does not leave the account manually stopped.
  • Replacement channel task can start.
  • Stale task completion/status writes do not overwrite the replacement runtime state.

Actual

  • Before this fix, timeout left manual-stop/stale-task state that suppressed reconnect.
  • After this fix, targeted regression tests pass for recovery timeout, replacement start, and stale write guarding.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios:
    • pnpm test src/gateway/server-channels.test.ts src/gateway/channel-health-monitor.test.ts
    • pnpm build
    • OPENCLAW_LOCAL_CHECK=1 OPENCLAW_LOCAL_CHECK_MODE=throttled pnpm check:changed
    • codex review --base origin/main
  • Edge cases checked:
    • manual stop timeout still prevents duplicate task start
    • recovery stop timeout clears manual-stop suppression
    • replacement task starts after recovery timeout
    • stale task completion/status writes cannot clobber replacement state
  • What you did not verify:
    • live Slack Socket Mode disconnect/reconnect with real credentials
    • Blacksmith Testbox, because blacksmith was not installed locally

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: detaching a timed-out recovery task can temporarily overlap with a replacement task if the old provider ignores abort.
    • Mitigation: replacement is allowed only for non-manual recovery stops, and stale task completion plus task-scoped status writes are guarded by active task identity checks.
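One way to picture the "active task identity check" mentioned in the mitigation is a token comparison on every status write. This is a sketch under assumptions: the `AccountRuntime` class, its method names, and the use of a `symbol` as the task identity are hypothetical and are not taken from src/gateway/server-channels.ts.

```typescript
type RuntimeState = { running: boolean; lastError: string | null };

// Hypothetical per-account runtime that remembers which task is currently active.
class AccountRuntime {
  private activeTask: symbol | null = null;
  state: RuntimeState = { running: false, lastError: null };

  // Starting a task makes it the active one and returns its identity token.
  start(): symbol {
    const token = Symbol("channel-task");
    this.activeTask = token;
    this.state = { running: true, lastError: null };
    return token;
  }

  // Detaching drops the active task without waiting for it to settle.
  detach(): void {
    this.activeTask = null;
    this.state = { ...this.state, running: false };
  }

  // A write is applied only when it comes from the currently active task.
  write(token: symbol, patch: Partial<RuntimeState>): boolean {
    if (token !== this.activeTask) return false; // stale task: ignore the write
    this.state = { ...this.state, ...patch };
    return true;
  }
}

const runtime = new AccountRuntime();
const oldTask = runtime.start();
runtime.detach();                 // recovery stop timed out, stale task detached
runtime.start();                  // replacement task becomes active
runtime.write(oldTask, { running: false, lastError: "socket closed" }); // ignored
console.log(runtime.state);       // replacement state is untouched
```

The overlap risk remains (the old task may still be running), but its late completion or status updates can no longer overwrite what the replacement reports.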

Built with GPT 5.5

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/gateway/channel-health-monitor.test.ts (modified, +36/-4)
  • src/gateway/channel-health-monitor.ts (modified, +3/-1)
  • src/gateway/server-channels.test.ts (modified, +99/-1)
  • src/gateway/server-channels.ts (modified, +49/-6)

PR #77686: fix: recover Slack channel restart after stop timeout

Description (problem / solution / changelog)

Summary

  • Treat channel health-monitor restarts as recovery stops, not manual stops, so a timed-out stop does not poison future auto-reconnect state.
  • Preserve manual-stop timeout behavior for user-initiated stops while marking recovery stop timeouts as restart-pending.
  • Add regression coverage for the timed-out recovery stop path and the manual-stop-during-recovery-backoff path.
  • Update the changelog.

Fixes #77651.

Real behavior proof

  • Behavior or issue addressed: A recovery restart that times out while stopping a stuck channel task should remain eligible for auto-restart when the old task finally settles, while later manual stops should still cancel recovery backoff.
  • Real environment tested: Local OpenClaw checkout on macOS, running the actual gateway createChannelManager implementation through Node/tsx with a temporary channel plugin installed in the runtime registry.
  • Exact steps or command run after this patch: From repo root, ran a node --import tsx --input-type=module runtime harness that imports src/gateway/server-channels.ts, starts a channel account whose first task ignores abort until explicitly released, calls stopChannel('discord', 'default', { manual: false }), immediately requests startChannel, releases the old task, waits for the manager's real auto-restart path, then performs a manual cleanup stop.
  • Evidence after fix: Terminal output from the local runtime harness:
after start: startCalls=1
[recovery-proof] [default] channel stop exceeded 5000ms after abort; continuing shutdown
after recovery stop timeout: elapsedMs=5023 running=false restartPending=true manuallyStopped=false lastError=channel stop timed out after 5000ms
after immediate restart request while old task is stuck: startCalls=1
[recovery-proof] [default] auto-restart attempt 1/10 in 5s
after old task settled: startCalls=2 running=true restartPending=false reconnectAttempts=1 lastError=null
events=startAccount(default) #1 | first task received abort but stays stuck until release | first task settled after simulated stuck stop | startAccount(default) #2
cleanup manual stop: running=false restartPending=false manuallyStopped=true
  • Observed result after fix: The recovery stop timed out without setting the account as manually stopped, the immediate restart request did not double-start while the old task was still stuck, and after the old task settled the manager auto-restarted the account (startCalls=2, running=true, lastError=null). A later manual stop left restartPending=false and manuallyStopped=true.
  • What was not tested: Live Slack credentials/socket reconnection. The proof uses the real gateway channel manager and a temporary local channel plugin to reproduce the stuck-task lifecycle deterministically.
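The difference between the two timeout outcomes can be summarised as a small pure function. This is an illustrative sketch: `timeoutPatch` and the `StopMode` type are hypothetical names. The recovery values mirror the harness output above (running=false, restartPending=true), and the manual values mirror the pre-fix timeout branch quoted in the issue, which this PR says it preserves.

```typescript
type StopMode = "manual" | "recovery";

interface RuntimePatch {
  running: boolean;
  restartPending: boolean;
  lastError: string;
}

// Runtime state to publish when a stop exceeds the timeout, by stop mode.
function timeoutPatch(mode: StopMode, timeoutMs: number): RuntimePatch {
  const lastError = `channel stop timed out after ${timeoutMs}ms`;
  return mode === "manual"
    ? { running: true, restartPending: false, lastError }  // manual shape from the issue's snippet
    : { running: false, restartPending: true, lastError }; // recovery shape from the harness output
}

console.log(timeoutPatch("recovery", 5000));
console.log(timeoutPatch("manual", 5000));
```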

Verification

  • pnpm exec oxfmt --check --threads=1 src/gateway/server-channels.ts src/gateway/server-channels.test.ts src/gateway/channel-health-monitor.ts src/gateway/channel-health-monitor.test.ts
  • git diff --check HEAD~2..HEAD
  • pnpm changed:lanes --json
  • pnpm test src/gateway/server-channels.test.ts src/gateway/channel-health-monitor.test.ts
  • Local Node/tsx runtime harness output copied in Real behavior proof above.

Notes

  • pnpm changed:lanes --json selected broad lanes after the rebase because the branch had been rewritten on top of current main; the touched-surface checks above are the local targeted proof for this PR.
  • Testbox/Crabbox broad validation was not run from this environment because neither blacksmith nor crabbox is installed here.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/gateway/channel-health-monitor.test.ts (modified, +5/-5)
  • src/gateway/channel-health-monitor.ts (modified, +3/-1)
  • src/gateway/server-channels.test.ts (modified, +79/-0)
  • src/gateway/server-channels.ts (modified, +40/-7)

Bug Description

When a stalled agent run starves the Node.js event loop long enough to drop the Slack WebSocket heartbeat, the gateway's stopChannel() cleanup path hits the 5000ms timeout and leaves manuallyStopped set for the Slack channel account. The gateway process stays alive but the Slack socket never reconnects — manuallyStopped.has(rKey) is true, so the auto-restart loop exits immediately without scheduling a reconnect.

Environment

  • OpenClaw version: 2026.5.3-1 (2eae30e)
  • Platform: macOS (Darwin, Apple Silicon)
  • Channel: Slack (socket mode, two accounts: default + archivist)

Failure chain

stalled model call (~10 min, auditor:main, lmstudio-lab1)
  → event loop blocked (P99 delay 7692ms, utilization 0.922)
  → Slack SDK WS heartbeat fails → connection drops
  → health monitor aborts stalled session → calls stopChannel()
  → stopChannel(): manuallyStopped.add(rKey)          ← poison pill set
  → waitForChannelStopGracefully() times out at 5000ms (loop still starved)
  → timeout branch: setRuntime(running: true), return  ← no cleanup
  → event loop clears, gateway process continues alive
  → auto-restart loop: manuallyStopped.has(rKey) === true → returns, no reconnect
  → Slack dead indefinitely; only fix is launchctl kickstart -k

Relevant log sequence

gateway.err.log:

[diagnostic] liveness warning: reasons=event_loop_delay,cpu interval=33s eventLoopDelayP99Ms=7692.4 eventLoopDelayMaxMs=7893.7 eventLoopUtilization=0.922 cpuCoreRatio=0.944
[slack] [default] channel stop exceeded 5000ms after abort; continuing shutdown

gateway.log (after the above — no further Slack events until manual kickstart):

[ws] ⇄ res ✓ health ...   ← gateway WS still alive
[ws] ⇄ res ✓ health ...
... (silence from Slack)

Code location

server-channels-DtnF0i8E.js (compiled), stopChannel(), line ~512:

// CHANNEL_STOP_ABORT_TIMEOUT_MS = 5e3
if (!await waitForChannelStopGracefully(task, CHANNEL_STOP_ABORT_TIMEOUT_MS)) {
    log.warn?.(`[${id}] channel stop exceeded ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms after abort; continuing shutdown`);
    setRuntime(channelId, id, {
        accountId: id,
        running: true,          // ← should not be true; connection is dead
        restartPending: false,
        lastError: `channel stop timed out after ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms`
    });
    return;  // ← exits without store.aborts.delete / store.tasks.delete
             //   and manuallyStopped remains set from line ~495
}
// happy path clears aborts, tasks, sets running:false
store.aborts.delete(id);
store.tasks.delete(id);

manuallyStopped.add(rKey) is called unconditionally at the top of stopChannel() (line ~495), before the timeout check. On the timeout path it is never cleared, so the auto-restart loop at line ~354 sees manuallyStopped.has(rKey) === true and returns without reconnecting.

Expected behavior

When waitForChannelStopGracefully times out, the channel should either:

Option A (minimal fix): Remove rKey from manuallyStopped in the timeout branch, set running: false, and let the auto-restart loop reconnect.

Option B (explicit reconnect): After the timeout, schedule a reconnect attempt directly (bypassing manuallyStopped) with a short delay to let the event loop recover.

Either option prevents the "ghost alive" state where the gateway is running but the Slack socket is permanently dead.
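Option B could look roughly like the following. It is a minimal sketch, assuming a hypothetical `scheduleReconnect` helper and a caller-supplied `start` callback; the gateway's actual auto-restart loop and its backoff policy are not shown in the issue.

```typescript
// Schedules start() after delayMs and returns a function that cancels it.
function scheduleReconnect(start: () => void, delayMs: number): () => void {
  const timer = setTimeout(start, delayMs);
  return () => clearTimeout(timer);
}

// Usage: give the event loop a moment to recover, then reconnect.
const cancelReconnect = scheduleReconnect(() => console.log("reconnecting"), 1000);

// A later explicit manual stop should cancel the pending reconnect.
cancelReconnect();
```

Returning a cancel function keeps manual stops authoritative: a user-initiated stop that arrives during the delay can still prevent the reconnect.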

Workaround

Until fixed, a watchdog cron job running launchctl kickstart -k gui/<uid>/ai.openclaw.gateway on detection of the pattern (last channel stop exceeded timestamp > last socket mode connected timestamp in the logs) recovers the socket automatically.
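The detection half of that watchdog can be sketched as a pure function over log lines. Several things here are assumptions: the function name `slackSocketLooksDead` is hypothetical, the exact wording of the "socket mode connected" log line is inferred from the workaround description, and line order in a single chronologically merged stream is used in place of real timestamp parsing because the log excerpts in this report carry no timestamps.

```typescript
// Returns true when the most recent "channel stop exceeded" line appears after
// the most recent "socket mode connected" line, or when no connect line exists.
// Assumes logLines is one chronologically ordered stream.
function slackSocketLooksDead(logLines: readonly string[]): boolean {
  let lastStopTimeout = -1;
  let lastConnected = -1;
  for (let i = 0; i < logLines.length; i++) {
    const line = logLines[i].toLowerCase();
    if (line.includes("channel stop exceeded")) lastStopTimeout = i;
    if (line.includes("socket mode connected")) lastConnected = i; // assumed wording
  }
  return lastStopTimeout !== -1 && lastStopTimeout > lastConnected;
}

console.log(
  slackSocketLooksDead([
    "[slack] socket mode connected",
    "[slack] [default] channel stop exceeded 5000ms after abort; continuing shutdown",
  ]),
); // true
```

When the function returns true, the cron job would run the launchctl kickstart command quoted above; that step is left out of the sketch on purpose.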

Related

  • Issue #77634 (Discord fetch timeout blocking event loop) — same root category (event-loop starvation), different failure surface.
  • Issue #77626 (Liveness-based turn timeouts) — would mitigate the stalled model call trigger.

extent analysis

TL;DR

The most likely fix is to remove rKey from manuallyStopped in the timeout branch of stopChannel() and set running: false to allow the auto-restart loop to reconnect.

Guidance

  • Review the stopChannel() function, specifically the timeout branch, to ensure manuallyStopped is cleared and running is set to false when waitForChannelStopGracefully times out.
  • Consider implementing a reconnect attempt with a short delay after the timeout to let the event loop recover.
  • Verify the fix by testing the scenario that triggers the channel stop exceeded warning and checking if the Slack socket reconnects automatically.
  • Monitor logs for the channel stop exceeded warning and the subsequent reconnect attempt to ensure the fix is working as expected.

Example

if (!await waitForChannelStopGracefully(task, CHANNEL_STOP_ABORT_TIMEOUT_MS)) {
    log.warn?.(`[${id}] channel stop exceeded ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms after abort; continuing shutdown`);
    manuallyStopped.delete(rKey); // Clear manuallyStopped
    setRuntime(channelId, id, {
        accountId: id,
        running: false, // Set running to false
        restartPending: false,
        lastError: `channel stop timed out after ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms`
    });
    return;
}

Notes

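The example assumes that clearing manuallyStopped and setting running: false is enough for the auto-restart loop to reconnect. It still returns without store.aborts.delete / store.tasks.delete, and the root-cause notes in PR #77682 state that recovery starts can also be blocked by stale task state, so the tracked task may need to be detached as well. It also clears manuallyStopped for every timed-out stop, including user-initiated ones.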

Recommendation

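Apply Option A as the minimal fix: remove rKey from manuallyStopped and set running: false in the timeout branch of stopChannel(). This prevents the "ghost alive" state and lets the Slack socket reconnect. The linked PRs (#77682, #77686) take a broader approach: health-monitor restarts use a non-manual stop mode, so the recovery path never sets manuallyStopped in the first place.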
