openclaw - ✅(Solved) Fix [Bug] Discord plugin fetch timeout blocks Node.js event loop, causing liveness warnings [1 pull requests, 2 comments, 3 participants]

openclaw/openclaw#77634 (fetched 2026-05-06 06:23:42)
Comments: 2 · Participants: 3 · Timeline events: 5 · Reactions: 2
Timeline (top): cross-referenced ×3, commented ×2

Fix Action

Fixed

PR fix notes

PR #77682: Fix: Issue 77651 channel stop timeout

Description (problem / solution / changelog)

Summary

  • Problem: health-monitor recovery stops could time out while leaving a channel account treated like an explicit manual stop, suppressing later reconnects.
  • Why it matters: Slack Socket Mode and other long-lived channel tasks could stay dead until a full gateway restart after event-loop starvation or an abort-ignoring provider task.
  • What changed: health-monitor restarts now use a non-manual stop mode; non-manual stop timeouts detach stale tasks so replacements can start; stale task completion and status writes are guarded so old tasks cannot clobber replacement runtime state.
  • What did NOT change (scope boundary): no Slack-specific plugin logic, no health threshold/backoff changes, no new config, no UI/API surface changes beyond the internal optional stop mode.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #77651
  • Related #77634
  • Related #77626
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: stopChannel() always marked channel accounts as manually stopped before aborting. When a health-monitor stop timed out, the timeout path returned without clearing manuallyStopped or the tracked task, so recovery starts were suppressed or blocked by stale task state.
  • Missing detection / guardrail: there was coverage for manual stop timeout duplicate-task protection, but not for health-monitor recovery stop timeout, replacement start, or stale task status writes after detachment.
  • Contributing context (if known): Slack Socket Mode can lose heartbeat during event-loop starvation, and an abort-ignoring task can keep the old provider task alive past the gateway stop timeout.
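
The root cause above can be illustrated with a minimal sketch of the stop path. This is not the code in src/gateway/server-channels.ts: createChannelManager, its methods, and the return values are hypothetical names chosen for illustration, and only the manual/non-manual distinction and the 5000ms stop timeout come from the PR description.

```javascript
// Hypothetical sketch of a channel lifecycle with manual and non-manual stops.
// None of these names are taken from the actual gateway source.
function createChannelManager(stopTimeoutMs = 5000) {
  const accounts = new Map(); // accountId -> { task, abort, manuallyStopped }

  function start(accountId, runTask) {
    const entry =
      accounts.get(accountId) ?? { task: null, abort: null, manuallyStopped: false };
    accounts.set(accountId, entry);
    // A manual-stop marker or a still-tracked task suppresses the start.
    if (entry.manuallyStopped || entry.task) return false;
    const abort = new AbortController();
    const task = { id: Symbol(accountId) };
    entry.task = task;
    entry.abort = abort;
    task.done = Promise.resolve()
      .then(() => runTask(abort.signal))
      .catch(() => {})
      .finally(() => {
        // Only the currently tracked task may clear the slot; a detached
        // (stale) task finishing later must not clobber its replacement.
        if (entry.task === task) {
          entry.task = null;
          entry.abort = null;
        }
      });
    return true;
  }

  async function stop(accountId, { manual = true } = {}) {
    const entry = accounts.get(accountId);
    if (!entry || !entry.task) return "idle";
    if (manual) entry.manuallyStopped = true; // only explicit stops set the marker
    const task = entry.task;
    entry.abort.abort();
    let timer;
    const timeout = new Promise((resolve) => {
      timer = setTimeout(() => resolve(true), stopTimeoutMs);
    });
    const timedOut = await Promise.race([task.done.then(() => false), timeout]);
    clearTimeout(timer);
    if (!timedOut) return "stopped";
    if (!manual && entry.task === task) {
      // Recovery stop timed out: detach the stale task so a replacement can start.
      entry.task = null;
      entry.abort = null;
    }
    return "timeout";
  }

  const isRunning = (accountId) => Boolean(accounts.get(accountId)?.task);
  return { start, stop, isRunning };
}
```

In this sketch a timed-out manual stop keeps the old task tracked, which preserves the duplicate-start protection, while a timed-out recovery stop frees the slot without setting the manual-stop marker.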

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/server-channels.test.ts, src/gateway/channel-health-monitor.test.ts
  • Scenario the test should lock in: non-manual recovery stop timeouts must not poison manual-stop state; replacement tasks must be able to start; stale task completion/status writes must not clobber replacement runtime state.
  • Why this is the smallest reliable guardrail: the bug is in gateway channel lifecycle state, so mocked channel tasks can deterministically reproduce abort-ignoring timeout behavior without live Slack credentials.
  • Existing test that already covers this (if any): existing manual stop timeout coverage protected duplicate-task behavior but encoded the manual/ghost-running path, not health-monitor recovery.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Gateway channel health recovery can reconnect a channel account after a timed-out recovery stop instead of leaving it indefinitely suppressed as manually stopped.

Diagram (if applicable)

Before: [health monitor restart] -> [stop timeout] -> [manual stop marker + stale task] -> [no reconnect]

After: [health monitor restart] -> [non-manual stop timeout] -> [detach stale task] -> [replacement starts] -> [stale writes ignored]

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local Node/pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): Gateway channel lifecycle; reported via Slack Socket Mode
  • Relevant config (redacted): N/A

Steps

  1. Start a channel account whose task ignores abort and never settles.
  2. Trigger stopChannel(..., { manual: false }) and advance past the 5000ms stop timeout.
  3. Start the same account again and allow the stale task to complete or publish status.

Expected

  • Recovery stop timeout does not leave the account manually stopped.
  • Replacement channel task can start.
  • Stale task completion/status writes do not overwrite the replacement runtime state.

Actual

  • Before this fix, timeout left manual-stop/stale-task state that suppressed reconnect.
  • After this fix, targeted regression tests pass for recovery timeout, replacement start, and stale write guarding.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios:
    • pnpm test src/gateway/server-channels.test.ts src/gateway/channel-health-monitor.test.ts
    • pnpm build
    • OPENCLAW_LOCAL_CHECK=1 OPENCLAW_LOCAL_CHECK_MODE=throttled pnpm check:changed
    • codex review --base origin/main
  • Edge cases checked:
    • manual stop timeout still prevents duplicate task start
    • recovery stop timeout clears manual-stop suppression
    • replacement task starts after recovery timeout
    • stale task completion/status writes cannot clobber replacement state
  • What you did not verify:
    • live Slack Socket Mode disconnect/reconnect with real credentials
    • Blacksmith Testbox, because blacksmith was not installed locally

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: detaching a timed-out recovery task can temporarily overlap with a replacement task if the old provider ignores abort.
    • Mitigation: replacement is allowed only for non-manual recovery stops, and stale task completion plus task-scoped status writes are guarded by active task identity checks.
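
The "active task identity checks" in the mitigation can be sketched as a task-scoped status writer. This is an illustrative assumption about the shape of such a guard, not the PR's implementation; createRuntime, beginTask, and setStatus are hypothetical names.

```javascript
// Hypothetical sketch: status writes are accepted only from the active task.
function createRuntime() {
  const state = { activeTask: null, status: "stopped" };
  return {
    state,
    // Registers a new task as active and returns a writer bound to it.
    beginTask() {
      const task = { id: Symbol("task") };
      state.activeTask = task;
      state.status = "starting";
      return {
        task,
        setStatus(status) {
          // A stale task (already replaced) is ignored instead of overwriting
          // the replacement's runtime state.
          if (state.activeTask !== task) return false;
          state.status = status;
          return true;
        },
      };
    },
  };
}

// Usage: the old task's late write is dropped once a replacement is active.
const runtime = createRuntime();
const oldTask = runtime.beginTask();
const replacement = runtime.beginTask();
replacement.setStatus("connected");
oldTask.setStatus("stopped"); // returns false; status stays "connected"
```

Comparing object identity rather than an account ID is the design choice that matters here: the stale task and its replacement share the same account, so only the task object itself distinguishes them.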

Built with GPT 5.5

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/gateway/channel-health-monitor.test.ts (modified, +36/-4)
  • src/gateway/channel-health-monitor.ts (modified, +3/-1)
  • src/gateway/server-channels.test.ts (modified, +99/-1)
  • src/gateway/server-channels.ts (modified, +49/-6)

Bug Description

The Discord plugin's HTTP fetch calls to https://discord.com/api/v10/users/@me are causing the Node.js event loop to block for extended periods (3+ seconds), resulting in severe liveness warning alerts.

Environment

  • OpenClaw version: 2026.5.3-1
  • Node.js version: 24.15.0
  • Platform: Windows_NT 10.0.26100 (x64)
  • Host: GGX-THINKPAD

Steps to Reproduce

  1. Enable Discord channel plugin (channels.discord.enabled: true)
  2. Gateway starts and connects to Discord API
  3. Observe liveness warnings in logs immediately

Observed Behavior

Log excerpt:

[fetch-timeout] fetch timeout after 2500ms (elapsed 3013ms) operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me
[diagnostic] liveness warning: reasons=event_loop_delay interval=30s eventLoopDelayP99Ms=35.3 eventLoopDelayMaxMs=1362.1 eventLoopUtilization=0.096 cpuCoreRatio=0.093

Key observations:

  1. Direct Node.js fetch test is FAST: fetch('https://discord.com/api/v10/users/@me', {headers:{Authorization:'Bot ...'}}) returns in ~600ms, which is completely normal
  2. Direct curl test is FAST: ~232ms
  3. The fetchWithTimeout wrapper in the Discord plugin causes blocking: The AbortController timeout mechanism itself appears to block the event loop for the full timeout duration (2500ms + overhead = 3013ms)
  4. Timer is delayed: The timeout fired at 3013ms instead of 2500ms, indicating the timer was itself blocked by ~513ms
  5. Low CPU utilization during block: Only 23% CPU — this is I/O wait, not computation
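
One way to sanity-check observation 4 is to measure how late a timer fires when something holds the event loop synchronously. The sketch below is a standalone illustration, not code from the plugin; measureTimerLateness is a hypothetical helper, and the busy-wait stands in for whatever synchronous work is delaying the real timer.

```javascript
// Hypothetical helper: schedules a timer, then blocks the event loop
// synchronously and reports how many milliseconds late the timer fired.
function measureTimerLateness(timeoutMs, blockMs) {
  return new Promise((resolve) => {
    const scheduledAt = performance.now();
    setTimeout(() => {
      resolve(performance.now() - scheduledAt - timeoutMs);
    }, timeoutMs);
    // Synchronous busy-wait: the timer callback cannot run until this ends.
    const blockUntil = performance.now() + blockMs;
    while (performance.now() < blockUntil) {
      // spin
    }
  });
}

// Usage: a 50ms timer scheduled before a 150ms synchronous block fires
// roughly blockMs - timeoutMs late.
measureTimerLateness(50, 150).then((lateMs) => {
  console.log(`timer fired about ${Math.round(lateMs)}ms late`);
});
```

A pending fetch or an armed AbortController does not normally delay timers on its own, so a timer that fires about 513ms late points at synchronous work running somewhere on the main thread during that window.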

Root Cause Hypothesis

The fetchWithTimeout from openclaw/plugin-sdk/text-runtime uses setTimeout wrapped around an AbortController-based fetch. When the fetch is in flight, something in the Node.js 24.x fetch/undici implementation or the plugin-sdk's getResolvedFetch appears to block the event loop synchronously during DNS/TLS/hot connection reuse, causing the timeout timer itself to be delayed.

The Discord plugin probe at dist/probe-DmHUl6wI.js calls:

const res = await fetchWithTimeout(
  `${DISCORD_API_BASE}/users/@me`,
  { headers: { Authorization: "Bot ..." } }, // token elided
  timeoutMs,
  getResolvedFetch(fetcher),
);
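
For reference, a timeout wrapper of the kind described in the hypothesis usually looks like the sketch below. This is an assumption about the general pattern, not the plugin SDK's actual fetchWithTimeout; fetchWithTimeoutSketch is a hypothetical name, and the argument order simply mirrors the call above.

```javascript
// Hypothetical stand-in for a setTimeout + AbortController fetch wrapper.
// The real fetchWithTimeout in the plugin SDK may differ.
async function fetchWithTimeoutSketch(url, init, timeoutMs, fetchFn = fetch) {
  const controller = new AbortController();
  // Arm the timer; aborting rejects the in-flight fetch.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetchFn(url, { ...init, signal: controller.signal });
  } finally {
    // Always disarm the timer so it cannot fire after the request settles.
    clearTimeout(timer);
  }
}
```

Nothing in this pattern runs synchronously for long, so if the real wrapper is similar, the delay is more likely to come from the injected fetch implementation or from unrelated work on the main thread than from the timer and abort mechanics.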

Impact

  • Gateway event loop blocks for 3+ seconds during Discord API probe
  • Causes severe liveness warning with eventLoopDelayMaxMs up to 5440ms
  • Affects overall gateway responsiveness

Suggested Fix Directions

  1. Investigate whether getResolvedFetch(fetcher) returns a different fetch implementation that behaves synchronously under Node.js 24.x
  2. Consider using a non-blocking timeout approach (e.g., separate worker thread for Discord API calls)
  3. Increase the timeout from 2500ms to something more generous, or make it configurable
  4. Add retry logic with exponential backoff instead of blocking the main thread

Tags

bug, discord, event-loop, node.js-24, performance

extent analysis

TL;DR

The Discord plugin's fetchWithTimeout calls are blocking the Node.js event loop, causing severe liveness warnings, and a non-blocking timeout approach or retry logic with exponential backoff may help mitigate the issue.

Guidance

  • Investigate the getResolvedFetch(fetcher) function to determine if it returns a different fetch implementation that behaves synchronously under Node.js 24.x.
  • Consider using a separate worker thread for Discord API calls to avoid blocking the main thread.
  • Evaluate increasing the timeout from 2500ms to a more generous value or making it configurable to reduce the frequency of timeouts.
  • Implement retry logic with exponential backoff to handle failed API calls without blocking the main thread.

Example

// Retry with exponential backoff: waits delay, 2 * delay, 4 * delay, ... between attempts.
const retry = async (fn, retries = 3, delay = 500) => {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Rethrow once the final attempt has failed.
      if (attempt === retries - 1) throw error;
      await new Promise((resolve) => setTimeout(resolve, delay * 2 ** attempt));
    }
  }
};

// Usage. DISCORD_API_BASE, timeoutMs, fetcher, fetchWithTimeout, and getResolvedFetch
// are assumed to be in scope, as in the plugin's probe module. `token` is a
// hypothetical variable holding the bot token.
const fetchWithRetry = () =>
  retry(() =>
    fetchWithTimeout(
      `${DISCORD_API_BASE}/users/@me`,
      { headers: { Authorization: `Bot ${token}` } },
      timeoutMs,
      getResolvedFetch(fetcher),
    ),
  );

Notes

The provided example is a basic illustration of retry logic with exponential backoff and may need to be adapted to the specific requirements of the Discord plugin.

Recommendation

Apply a non-blocking timeout approach, such as using a separate worker thread for Discord API calls or implementing retry logic with exponential backoff, to mitigate the event loop blocking issue.
