openclaw - ✅(Solved) Fix [Bug] Discord plugin fetch timeout blocks Node.js event loop, causing liveness warnings [1 pull requests, 2 comments, 3 participants]

guguangxin-eng · 2026-05-05T02:18:55Z

[openclaw] PR #77682: Fix: Issue 77651 channel stop timeout
- Repository: openclaw/openclaw
- Author: sahilsatralkar
- State: open | merged: False
- Link: https://github.com/openclaw/openclaw/pull/77682

## Description (problem / solution / changelog)

## Summary

- Problem: health-monitor recovery stops could time out while leaving a channel account treated like an explicit manual stop, suppressing later reconnects.
- Why it matters: Slack Socket Mode and other long-lived channel tasks could stay dead until a full gateway restart after event-loop starvation or an abort-ignoring provider task.
- What changed: health-monitor restarts now use a non-manual stop mode; non-manual stop timeouts detach stale tasks so replacements can start; stale task completion and status writes are guarded so old tasks cannot clobber replacement runtime state.
- What did NOT change (scope boundary): no Slack-specific plugin logic, no health threshold/backoff changes, no new config, no UI/API surface changes beyond the internal optional stop mode.

## Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor required for the fix
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [x] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [x] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Closes #77651
- Related #77634
- Related #77626
- [x] This PR fixes a bug or regression

## Root Cause (if applicable)

- Root cause: stopChannel() always marked channel accounts as manually stopped before aborting. When a health-monitor stop timed out, the timeout path returned without clearing manuallyStopped or the tracked task, so recovery starts were suppressed or blocked by stale task state.
- Missing detection / guardrail: there was coverage for manual stop timeout duplicate-task protection, but not for health-monitor recovery stop timeout, replacement start, or stale task status writes after detachment.
- Contributing context (if known): Slack Socket Mode can lose heartbeat during event-loop starvation, and an abort-ignoring task can keep the old provider task alive past the gateway stop timeout.

## Regression Test Plan (if applicable)

- Coverage level that should have caught this:
  - [ ] Unit test
  - [x] Seam / integration test
  - [ ] End-to-end test
  - [ ] Existing coverage already sufficient
- Target test or file: src/gateway/server-channels.test.ts, src/gateway/channel-health-monitor.test.ts
- Scenario the test should lock in: non-manual recovery stop timeouts must not poison manual-stop state; replacement tasks must be able to start; stale task completion/status writes must not clobber replacement runtime state.
- Why this is the smallest reliable guardrail: the bug is in gateway channel lifecycle state, so mocked channel tasks can deterministically reproduce abort-ignoring timeout behavior without live Slack credentials.
- Existing test that already covers this (if any): existing manual stop timeout coverage protected duplicate-task behavior but encoded the manual/ghost-running path, not health-monitor recovery.
- If no new test is added, why not: N/A

## User-visible / Behavior Changes

Gateway channel health recovery can reconnect a channel account after a timed-out recovery stop instead of leaving it indefinitely suppressed as manually stopped.

## Diagram (if applicable)

Before:
[health monitor restart] -> [stop timeout] -> [manual stop marker + stale task] -> [no reconnect]

After:
[health monitor restart] -> [non-manual stop timeout] -> [detach stale task] -> [replacement starts] -> [stale writes ignored]

## Security Impact (required)

- New permissions/capabilities? (Yes/No) No
- Secrets/tokens handling changed? (Yes/No) No
- New/changed network calls? (Yes/No) No
- Command/tool execution surface changed? (Yes/No) No
- Data access scope changed? (Yes/No) No
- If any Yes, explain risk + mitigation: N/A

## Repro + Verification

### Environment

- OS: macOS
- Runtime/container: local Node/pnpm workspace
- Model/provider: N/A
- Integration/channel (if any): Gateway channel lifecycle; reported via Slack Socket Mode
- Relevant config (redacted): N/A

### Steps

1. Start a channel account whose task ignores abort and never settles.
2. Trigger stopChannel(..., { manual: false }) and advance past the 5000ms stop timeout.
3. Start the same account again and allow the stale task to complete or publish status.

### Expected

- Recovery stop timeout does not leave the account manually stopped.
- Replacement channel task can start.
- Stale task completion/status writes do not overwrite the replacement runtime state.

### Actual

- Before this fix, timeout left manual-stop/stale-task state that suppressed reconnect.
- After this fix, targeted regression tests pass for recovery
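The lifecycle change described in this PR can be sketched roughly as follows. This is a minimal illustration, not the PR's code: only stopChannel, manuallyStopped, the { manual: false } option and the 5000ms stop timeout come from the PR text, while ChannelRegistry, startChannel, entry, setStatus and the per-start generation counter are hypothetical names. It shows a non-manual stop that times out detaching the stale task so a replacement can start, and a generation check that keeps the detached task's late completion and status writes from clobbering the replacement.

```javascript
// Hypothetical sketch of the PR's lifecycle fix. Only stopChannel, manuallyStopped,
// the { manual } option and the 5000ms timeout come from the PR text.
const STOP_TIMEOUT_MS = 5000;

class ChannelRegistry {
  constructor(stopTimeoutMs = STOP_TIMEOUT_MS) {
    this.stopTimeoutMs = stopTimeoutMs;
    this.accounts = new Map(); // accountId -> lifecycle state
  }

  entry(id) {
    if (!this.accounts.has(id)) {
      this.accounts.set(id, {
        task: null, abort: null, generation: 0, manuallyStopped: false, status: "stopped",
      });
    }
    return this.accounts.get(id);
  }

  startChannel(id, runTask) {
    const e = this.entry(id);
    if (e.manuallyStopped || e.task) return false; // suppressed, or a task is still tracked
    const generation = ++e.generation;
    const controller = new AbortController();
    // Only the task that owns the current generation may write status.
    const setStatus = (status) => {
      if (e.generation === generation) e.status = status;
    };
    setStatus("running");
    e.abort = () => controller.abort();
    e.task = Promise.resolve()
      .then(() => runTask(controller.signal, setStatus))
      .catch(() => {}) // errors are ignored in this sketch
      .finally(() => {
        // A detached (stale) task must not clear its replacement's state.
        if (e.generation === generation) {
          e.task = null;
          e.status = "stopped";
        }
      });
    return true;
  }

  async stopChannel(id, { manual = true } = {}) {
    const e = this.entry(id);
    if (manual) e.manuallyStopped = true; // only explicit stops suppress reconnects
    if (!e.task) return "stopped";
    e.abort();
    let timer;
    const timedOut = new Promise((resolve) => {
      timer = setTimeout(() => resolve("timeout"), this.stopTimeoutMs);
    });
    const result = await Promise.race([e.task.then(() => "stopped"), timedOut]);
    clearTimeout(timer);
    if (result === "timeout" && !manual) {
      // Detach: bump the generation so the stale task's late writes are ignored.
      e.generation++;
      e.task = null;
      e.status = "stopped";
    }
    return result;
  }
}
```

A manual stop that times out still keeps the task tracked (the ghost-running path that protects against duplicate tasks); only the non-manual recovery path detaches it.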


GitHub issue: openclaw/openclaw#77634 (fetched 2026-05-06 06:23:42)


Timeline: cross-referenced ×3, commented ×2


Bug Description

The Discord plugin's HTTP fetch calls to https://discord.com/api/v10/users/@me are causing the Node.js event loop to block for extended periods (3+ seconds), resulting in severe liveness warning alerts.

Environment

OpenClaw version: 2026.5.3-1
Node.js version: 24.15.0
Platform: Windows_NT 10.0.26100 (x64)
Host: GGX-THINKPAD

Steps to Reproduce

1. Enable the Discord channel plugin (channels.discord.enabled: true).
2. The gateway starts and connects to the Discord API.
3. Observe liveness warnings in the logs immediately.

Observed Behavior

Log excerpt:

[fetch-timeout] fetch timeout after 2500ms (elapsed 3013ms) operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me
[diagnostic] liveness warning: reasons=event_loop_delay interval=30s eventLoopDelayP99Ms=35.3 eventLoopDelayMaxMs=1362.1 eventLoopUtilization=0.096 cpuCoreRatio=0.093

Key observations:

Direct Node.js fetch test is FAST: fetch('https://discord.com/api/v10/users/@me', {headers:{Authorization:'Bot ...'}}) returns in ~600ms, which is completely normal.
Direct curl test is FAST: ~232ms.
The fetchWithTimeout wrapper in the Discord plugin causes blocking: the AbortController timeout mechanism itself appears to block the event loop for the full timeout duration (2500ms + ~513ms overhead = 3013ms).
Timer is delayed: the timeout fired at 3013ms instead of 2500ms, indicating that the timer itself was held up by ~513ms.
Low CPU utilization during the block: only 23% CPU, which suggests I/O wait rather than computation.
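The "timer is delayed" observation can be checked independently of the Discord plugin. The sketch below is a minimal diagnostic built on Node's built-in perf_hooks.monitorEventLoopDelay and a plain setTimeout: it measures how late a timer fires while a workload runs and records the event-loop delay histogram over the same window. The helpers measureTimerLateness, blockFor and diagnose are hypothetical; blockFor busy-waits only to simulate a synchronous stall and would be replaced by the call under suspicion. A timer that fires late means the main thread was occupied (or the process was not scheduled) for that long, which is the signal the 3013ms-versus-2500ms log line is reporting.

```javascript
const { monitorEventLoopDelay, performance } = require("node:perf_hooks");

// Resolve with how many ms later than requested a timer actually fired.
function measureTimerLateness(delayMs) {
  const start = performance.now();
  return new Promise((resolve) => {
    setTimeout(() => resolve(performance.now() - start - delayMs), delayMs);
  });
}

// Hypothetical stand-in for a synchronous stall on the main thread.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy-wait: no timers or I/O callbacks can run meanwhile
  }
}

async function diagnose(workload, timerMs = 100) {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  const lateness = measureTimerLateness(timerMs);
  workload();
  const timerLatenessMs = await lateness;
  histogram.disable();
  return {
    timerLatenessMs,
    // Histogram values are reported in nanoseconds.
    eventLoopDelayMaxMs: histogram.max / 1e6,
    eventLoopDelayP99Ms: histogram.percentile(99) / 1e6,
  };
}
```

For example, diagnose(() => blockFor(300)) should report a timer lateness of roughly 200ms, since the 100ms timer cannot fire until the 300ms stall ends.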

Root Cause Hypothesis

The fetchWithTimeout from openclaw/plugin-sdk/text-runtime uses setTimeout wrapped around an AbortController-based fetch. While the fetch is in flight, something in the Node.js 24.x fetch/undici implementation or in the plugin-sdk's getResolvedFetch appears to block the event loop synchronously during DNS resolution, the TLS handshake, or reuse of a warm (keep-alive) connection, causing the timeout timer itself to be delayed.

The Discord plugin probe at dist/probe-DmHUl6wI.js calls:

const res = await fetchWithTimeout(`${DISCORD_API_BASE}/users/@me`, { headers: { Authorization: 'Bot ...' } }, timeoutMs, getResolvedFetch(fetcher));
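For reference, a wrapper with the call shape used in the probe (url, init, timeoutMs, fetchFn) is commonly written as below. This is an assumption about what fetchWithTimeout looks like, not its actual source in openclaw/plugin-sdk/text-runtime. What it illustrates: the timeout is an ordinary setTimeout that calls controller.abort(), so it can only fire once the event loop is free, and the injected fetchFn (for example, whatever getResolvedFetch returns) must honour init.signal for the abort to have any effect.

```javascript
// Hypothetical re-creation for illustration; the real plugin-sdk helper may differ.
async function fetchWithTimeout(url, init = {}, timeoutMs = 2500, fetchFn = globalThis.fetch) {
  const controller = new AbortController();
  // The timer is a normal macrotask: if the main thread is busy, it fires late.
  const timer = setTimeout(() => {
    controller.abort(new Error(`fetch timeout after ${timeoutMs}ms`));
  }, timeoutMs);
  try {
    // The injected fetch must respect `signal`, otherwise the abort does nothing.
    return await fetchFn(url, { ...init, signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```

Because the abort is driven by a timer callback, a 513ms overshoot points at whatever else occupied the main thread during the request rather than at the AbortController itself.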

Impact

Gateway event loop blocks for 3+ seconds during Discord API probe
Causes severe liveness warnings, with eventLoopDelayMaxMs up to 5440ms
Affects overall gateway responsiveness

Suggested Fix Directions

Investigate whether getResolvedFetch(fetcher) returns a different fetch implementation that behaves synchronously under Node.js 24.x
Consider using a non-blocking timeout approach (e.g., separate worker thread for Discord API calls)
Increase the timeout from 2500ms to something more generous, or make it configurable
Add retry logic with exponential backoff instead of blocking the main thread
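Making the probe timeout configurable could look like the sketch below. The config path channels.discord.probeTimeoutMs, the clamping bounds, and resolveProbeTimeoutMs are all hypothetical and not existing OpenClaw options; the only value taken from the issue is the current 2500ms default.

```javascript
// Hypothetical: config key and bounds are illustrative, not an existing OpenClaw option.
const DEFAULT_PROBE_TIMEOUT_MS = 2500; // current hard-coded value from the issue
const MIN_PROBE_TIMEOUT_MS = 1000;
const MAX_PROBE_TIMEOUT_MS = 30000;

function resolveProbeTimeoutMs(config) {
  const raw = config?.channels?.discord?.probeTimeoutMs;
  // Fall back to the default for missing or non-numeric values.
  if (typeof raw !== "number" || !Number.isFinite(raw)) return DEFAULT_PROBE_TIMEOUT_MS;
  // Clamp so a typo cannot make the timeout absurdly short or effectively unbounded.
  return Math.min(MAX_PROBE_TIMEOUT_MS, Math.max(MIN_PROBE_TIMEOUT_MS, Math.round(raw)));
}
```

A longer timeout only reduces how often the probe reports a timeout; it does not remove whatever is delaying the event loop.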

extent analysis

TL;DR

The Discord plugin's fetchWithTimeout calls appear to block the Node.js event loop, causing severe liveness warnings; a non-blocking timeout approach or retry logic with exponential backoff may help mitigate the issue.

Guidance

Investigate the getResolvedFetch(fetcher) function to determine if it returns a different fetch implementation that behaves synchronously under Node.js 24.x.
Consider using a separate worker thread for Discord API calls to avoid blocking the main thread.
Evaluate increasing the timeout from 2500ms to a more generous value or making it configurable to reduce the frequency of timeouts.
Implement retry logic with exponential backoff to handle failed API calls without blocking the main thread.
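The worker-thread suggestion can be sketched with Node's built-in worker_threads module. This is a minimal illustration under stated assumptions: probeInWorker is a hypothetical helper, the worker uses the global fetch with AbortSignal.timeout instead of the plugin's fetchWithTimeout/getResolvedFetch, and it posts back only a small summary object because a Response cannot be sent between threads. It helps only if the stall originates inside the fetch stack; if something else is holding the main thread, moving the request will not remove the liveness warnings.

```javascript
const { Worker } = require("node:worker_threads");

// Source for the worker; it runs on its own thread with its own event loop.
const WORKER_SOURCE = `
const { parentPort, workerData } = require("node:worker_threads");
(async () => {
  const { url, headers, timeoutMs } = workerData;
  try {
    const res = await fetch(url, { headers, signal: AbortSignal.timeout(timeoutMs) });
    parentPort.postMessage({ ok: res.ok, status: res.status });
  } catch (err) {
    parentPort.postMessage({ ok: false, error: String(err) });
  }
})();
`;

// Hypothetical helper: run one HTTP probe in a worker and resolve with a summary.
function probeInWorker(url, headers = {}, timeoutMs = 2500) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(WORKER_SOURCE, {
      eval: true,
      workerData: { url, headers, timeoutMs },
    });
    worker.once("message", (summary) => {
      resolve(summary);
      // Keep-alive sockets could otherwise keep the worker alive for a while.
      worker.terminate();
    });
    worker.once("error", reject);
    worker.once("exit", (code) => {
      // No-op if the promise already resolved from the message above.
      if (code !== 0) reject(new Error(`probe worker exited with code ${code}`));
    });
  });
}
```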

Example

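// Example of retry logic with exponential backoff.
// fetchWithTimeout, getResolvedFetch, DISCORD_API_BASE, timeoutMs and fetcher are assumed
// to be in scope, as in the Discord probe quoted under Root Cause Hypothesis.
const retry = async (fn, retries = 3, delay = 500) => {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === retries - 1) throw error;
      // Back off 500ms, then 1000ms, ... with a timer, so the event loop stays free.
      await new Promise((resolve) => setTimeout(resolve, delay * 2 ** i));
    }
  }
};

// Usage
const fetchWithRetry = (token) =>
  retry(async () => {
    const res = await fetchWithTimeout(
      `${DISCORD_API_BASE}/users/@me`,
      { headers: { Authorization: `Bot ${token}` } },
      timeoutMs,
      getResolvedFetch(fetcher),
    );
    // fetch resolves on HTTP errors; throw on 5xx so only server errors are retried.
    if (res.status >= 500) throw new Error(`Discord API returned HTTP ${res.status}`);
    return res;
  });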

Notes

The provided example is a basic illustration of retry logic with exponential backoff and may need to be adapted to the specific requirements of the Discord plugin.

Recommendation

Apply a non-blocking timeout approach, such as using a separate worker thread for Discord API calls or implementing retry logic with exponential backoff, to mitigate the event loop blocking issue.

