openclaw - Fix Feishu channel: bot identity recovery race condition causes permanent disconnection [1 comment, 2 participants]

GitHub stats: openclaw/openclaw#77717, fetched 2026-05-06 06:22:32
Comments: 1 · Participants: 2 · Timeline: 1 (commented ×1) · Reactions: 2


Feishu channel: bot identity recovery race condition causes permanent disconnection

Description

When the Gateway reloads its configuration (e.g., during a config change or restart), Feishu WebSocket channels restart their identity probe. If the abortSignal is aborted before the identity probe completes, the condition that gates startBotIdentityRecovery evaluates to false, and the channel is left with bot open_id = unknown, with no retry and no recovery path.

Severity

High: the Feishu channel stays broken until a full Gateway restart.

Steps to reproduce

  1. Have one or more Feishu accounts configured (WebSocket mode)
  2. Trigger a Gateway config reload (e.g., openclaw gateway restart, or a config hot-reload)
  3. Observe the timing window where fetchBotIdentityForMonitor is still pending when the abort signal is delivered to the channel

The race window is ~5–10 seconds in practice.

Root cause

In the monitorSingleAccount function in extensions/feishu/src/monitor.account.ts (line ~4553):

log(`feishu[${accountId}]: bot open_id resolved: ${botOpenId ?? "unknown"}`);
if (!botOpenId && !abortSignal?.aborted) startBotIdentityRecovery({
  account, accountId, runtime, abortSignal
});

The startBotIdentityRecovery call is gated by !abortSignal?.aborted. During a config reload, an abort signal is broadcast to all channels before the new channel initialization completes. If an account's identity probe finishes after the abort signal arrives (but before the new channel initialization finishes), the abortSignal?.aborted check returns true, the retry is never scheduled, and the channel is left in a broken state.

This is a classic TOCTOU (time-of-check to time-of-use) race condition.
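
The gate can be reproduced in isolation. The following is a minimal TypeScript sketch, not the OpenClaw source: gateRecovery and its return values are hypothetical names that only mirror the condition quoted above. It assumes a runtime with a global AbortController (Node.js 15+ or a browser).

```typescript
// Hypothetical stand-in for the gate quoted above; not the real OpenClaw code.
type GateResult = "resolved" | "recovery-scheduled" | "skipped";

function gateRecovery(botOpenId: string | undefined, abortSignal?: AbortSignal): GateResult {
  if (botOpenId) return "resolved";
  // Same condition as the original: recovery runs only if the signal is not aborted.
  if (!abortSignal?.aborted) return "recovery-scheduled";
  // Identity is unknown and nothing will ever retry.
  return "skipped";
}

// Case 1: the probe finishes (with no identity) before the abort arrives.
const early = new AbortController();
const resultBeforeAbort = gateRecovery(undefined, early.signal);

// Case 2: the abort arrives first, then the probe finishes with no identity.
const late = new AbortController();
late.abort();
const resultAfterAbort = gateRecovery(undefined, late.signal);

console.log(resultBeforeAbort, resultAfterAbort); // prints "recovery-scheduled skipped"
```

The second case is the broken path: the probe returned after the abort, so the only code that could have scheduled a retry was skipped.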

Observed log

11:27:03  [health-monitor] [feishu:main] health-monitor: restarting (reason: stopped)
11:27:04  [feishu] feishu[third]: bot open_id unknown; starting background retry (delays: 60s, 120s...)
          ← third's probe happened to complete before abort arrived
11:27:12  [feishu] feishu[third]: abort signal received, stopping
11:27:12  [feishu] feishu[main]: bot open_id resolved: unknown
          ← main's probe completed after abort arrived; no retry triggered
11:27:12  [feishu] feishu[second]: bot open_id resolved: unknown
          ← same; no retry triggered
11:27:14  [gateway] loading configuration… (reload complete)
          ← main and second are now permanently broken

Impact

  • Feishu direct messages are silently dropped (the bot cannot receive any messages)
  • The health-monitor does not detect this as a "stopped" state because the WebSocket process is still running, just with an invalid identity
  • Only a full Gateway restart resolves it

Suggested fix

Option A (defensive): When abortSignal is already aborted but botOpenId is unknown, schedule recovery with a fresh abort signal rather than skipping it entirely. The new channel initialization should use a new abort signal scoped to its own lifecycle.

Option B (robust): Move the bot identity probe before the abort signal is registered, or separate the identity resolution from the channel lifecycle abort signal entirely (use a dedicated identity-probe abort signal).
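
A rough sketch of what Option B could look like, assuming the channel keeps two controllers instead of one. ChannelSignals, createChannelSignals, and shouldStartRecovery are hypothetical names for illustration and do not come from the OpenClaw code base.

```typescript
// Hypothetical sketch of Option B; names are illustrative, not OpenClaw's API.
interface ChannelSignals {
  lifecycle: AbortController; // aborted on every config reload or channel restart
  identity: AbortController;  // aborted only when the account is removed for good
}

function createChannelSignals(): ChannelSignals {
  return { lifecycle: new AbortController(), identity: new AbortController() };
}

function shouldStartRecovery(botOpenId: string | undefined, signals: ChannelSignals): boolean {
  // Gate on the identity signal only, so a reload abort no longer suppresses recovery.
  return !botOpenId && !signals.identity.signal.aborted;
}

const signals = createChannelSignals();

// A config reload aborts the lifecycle while the probe is still pending.
signals.lifecycle.abort();
const recoversAfterReload = shouldStartRecovery(undefined, signals);

// The account is actually removed, so recovery should stop.
signals.identity.abort();
const recoversAfterRemoval = shouldStartRecovery(undefined, signals);

console.log(recoversAfterReload, recoversAfterRemoval); // prints "true false"
```

The design choice here is that a reload and a permanent shutdown are different events, so each gets its own signal, and identity recovery listens only to the second.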

Option C (fail-safe): If fetchBotIdentityForMonitor fails or returns unknown after the initial probe, always schedule a retry with a new independent abort signal — don't gate it on the current channel's abort state.
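
Option C can be sketched as a retry scheduler that owns its abort signal. Everything below is an assumption made for illustration: scheduleIdentityRetry, backoffDelaysMs, the injected timer, and the placeholder open_id are hypothetical, and the doubling schedule is a guess, since the observed log only shows the first two delays (60s, 120s).

```typescript
// Hypothetical sketch of Option C; none of these names come from OpenClaw.
type Probe = () => string | undefined;
type Timer = (fn: () => void, delayMs: number) => void;

function backoffDelaysMs(attempts: number, baseMs = 60000): number[] {
  // Assumed doubling schedule: 60s, 120s, 240s, ...
  return Array.from({ length: attempts }, (_, i) => baseMs * 2 ** i);
}

function scheduleIdentityRetry(
  probe: Probe,
  onResolved: (id: string) => void,
  setTimer: Timer,
  attempts = 3,
): AbortController {
  // Independent controller: the channel's own abort state is never consulted.
  const own = new AbortController();
  const delays = backoffDelaysMs(attempts);
  const attempt = (index: number): void => {
    if (own.signal.aborted || index >= delays.length) return;
    setTimer(() => {
      if (own.signal.aborted) return;
      const id = probe();
      if (id) onResolved(id);
      else attempt(index + 1);
    }, delays[index]);
  };
  attempt(0);
  return own; // the caller keeps this to cancel recovery explicitly
}

// Usage with a fake timer that runs callbacks immediately and records the delays.
const seenDelays: number[] = [];
const immediateTimer: Timer = (fn, delayMs) => { seenDelays.push(delayMs); fn(); };
const resolvedIds: string[] = [];
let probeCalls = 0;
scheduleIdentityRetry(
  () => (++probeCalls < 2 ? undefined : "ou_example"), // placeholder open_id
  (id) => { resolvedIds.push(id); },
  immediateTimer,
);
console.log(seenDelays, resolvedIds); // prints [ 60000, 120000 ] [ 'ou_example' ]
```

In real use the timer would be setTimeout. It is injected here only so the schedule can be checked without waiting.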

Environment

  • OpenClaw 2026.4.26 (be8c246)
  • macOS
  • Feishu WebSocket mode, 3 accounts

extent analysis

TL;DR

Implement one of the suggested fixes (A, B, or C) to address the TOCTOU race condition causing permanent disconnection of Feishu channels.

Guidance

  • Review the monitorSingleAccount function in extensions/feishu/src/monitor.account.ts to understand the current implementation and identify the best fix.
  • Consider implementing Option B (robust) to separate the identity resolution from the channel lifecycle abort signal, ensuring a dedicated identity-probe abort signal.
  • When implementing the fix, verify that the startBotIdentityRecovery call is correctly gated and that the retry mechanism is triggered as expected.
  • Test the fix by reproducing the issue and observing the logs to ensure that the Feishu channels recover correctly after a config reload.

Example

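// Example of Option A (defensive) fix.
// AbortSignal cannot be constructed with `new`; a fresh signal comes from an AbortController.
if (!botOpenId) {
  // If the channel's signal is already aborted, fall back to a fresh one so recovery still runs.
  // The new channel lifecycle should keep the controller so it can cancel recovery later.
  const recoveryController = new AbortController();
  const recoverySignal = abortSignal?.aborted ? recoveryController.signal : abortSignal;
  startBotIdentityRecovery({
    account, accountId, runtime, abortSignal: recoverySignal
  });
}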

Notes

Whichever option is chosen, test it under different timings: a probe that completes before the abort, a probe that completes after it, and a reload that lands inside the ~5–10 second race window.

Recommendation

Apply Option B (robust), which separates identity resolution from the channel lifecycle abort signal; it addresses the TOCTOU race condition more comprehensively.
