openclaw - ✅(Solved) Fix [Bug]: Feishu WebSocket connection does not recover after transient token refresh failure [2 pull requests, 1 comments, 2 participants]

GitHub stats
openclaw/openclaw#68766 · Fetched 2026-04-19 15:07:55
Comments: 1 · Participants: 2 · Timeline events: 7 · Reactions: 0
Timeline (top): cross-referenced ×2, labeled ×2, referenced ×2, commented ×1

When the Feishu tenant_access_token refresh fails due to a transient timeout (e.g., open.feishu.cn responds slowly during off-peak hours), the Feishu WebSocket connection drops and does not automatically recover.

The current reconnection logic attempts only one retry after the initial failure. If that retry also fails (which is likely when the upstream issue is still ongoing), the plugin gives up entirely and stops receiving Feishu events. The connection remains dead until the gateway is manually restarted.

This means a single transient API hiccup can cause hours of silent message loss with no visible error to the user.

Error Message

2026-04-19T01:19:40 [error]: AxiosError: timeout of 30000ms exceeded
   url: 'https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal'
   code: 'ECONNABORTED'
2026-04-19T01:20:39 [info]: [ '[ws]', 'reconnect' ]
2026-04-19T01:21:40 [error]: AxiosError: timeout of 30000ms exceeded
   url: 'https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal'
   code: 'ECONNABORTED'
# No further reconnection attempts after this. Next Feishu event only after manual gateway restart at 09:48.

Fix Action

Fixed

PR fix notes

PR #68840: fix: add exponential backoff retry for Feishu WebSocket reconnection

Description (problem / solution / changelog)

Fixes #68766

Problem

Feishu WebSocket connection doesn't recover after transient tenant_access_token refresh failure. The current reconnection logic retries only once — if that retry also fails, the plugin gives up entirely, causing hours of silent message loss.

Solution

Replace the single-shot retry with an exponential backoff retry loop in monitorWebSocket():

  • Backoff delays: 5s → 10s → 30s → 60s → 120s
  • Never gives up: After exhausting the 5 backoff steps, retries indefinitely at 120s
  • Clean shutdown: Respects abortSignal during backoff waits
  • Cleanup: wsClients properly cleaned up on each failed attempt
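
The schedule described above can be illustrated with a small sketch. This is not the code from the PR: `backoffDelayMs` and `sleepUnlessAborted` are hypothetical helper names, and only the delay values (5s, 10s, 30s, 60s, 120s) and the abort-aware wait come from the PR description.

```typescript
// Backoff steps from the PR description, in milliseconds.
const BACKOFF_STEPS_MS: readonly number[] = [5_000, 10_000, 30_000, 60_000, 120_000];

// Hypothetical helper: delay before retry number `attempt` (0-based).
// Attempts past the end of the table are clamped to the last step (120s),
// which is what "retries indefinitely at 120s" amounts to.
function backoffDelayMs(attempt: number): number {
  const index = Math.min(Math.max(attempt, 0), BACKOFF_STEPS_MS.length - 1);
  return BACKOFF_STEPS_MS[index];
}

// Hypothetical helper: resolves to true after `ms`, or to false as soon as
// the signal aborts, so a shutdown does not have to wait out the backoff.
function sleepUnlessAborted(ms: number, signal?: AbortSignal): Promise<boolean> {
  return new Promise((resolve) => {
    if (signal?.aborted) {
      resolve(false);
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      resolve(false);
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve(true);
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```

A retry loop built on these would call `sleepUnlessAborted(backoffDelayMs(attempt), abortSignal)` after each failed attempt and stop when it resolves to `false`.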

Tests

6 new tests covering retry behavior, backoff clamping, cleanup, and abort handling.

Changed files

  • extensions/feishu/src/monitor.cleanup.test.ts (modified, +6/-4)
  • extensions/feishu/src/monitor.transport.test.ts (added, +178/-0)
  • extensions/feishu/src/monitor.transport.ts (modified, +64/-9)

PR #68865: fix(feishu): add application-level WebSocket reconnection with backoff

Description (problem / solution / changelog)

Fixes #68766

Summary

The Feishu WebSocket transport relied solely on the Lark SDK's built-in autoReconnect, which silently gives up after exhausting its internal retry budget. When this happens the bot goes permanently offline with no recovery path.

Root Cause

  • The Lark SDK's internal reconnection has a limited retry budget
  • cleanup() calls wsClient.close() which permanently kills the SDK's reconnection loop
  • No application-level recovery exists — once the SDK gives up, the bot stays offline

Fix

Wrap the WebSocket lifecycle in an application-level reconnection loop with exponential backoff (2s initial, 60s max), following the same pattern used by the Mattermost channel (runWithReconnect).

The loop:

  • Retries on both client creation failures (e.g. token refresh timeout) and runtime disconnects
  • Resets backoff on successful connections (normal close)
  • Respects the abort signal for clean shutdown
  • Cleans up wsClient/botOpenIds/botNames state on each cycle
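
A minimal sketch of such a loop follows, under stated assumptions. `runWithBackoff`, `ConnectOnce`, and `ReconnectOptions` are hypothetical names for illustration; this is not the PR's code and not Mattermost's `runWithReconnect`. It assumes `connectOnce` resolves when an established connection closes normally and rejects on a creation failure or an abnormal disconnect. Only the 2s initial delay, the 60s cap, the reset on success, and the abort handling come from the PR description.

```typescript
// Hypothetical shape: resolves on normal close, rejects on failure.
type ConnectOnce = (signal: AbortSignal) => Promise<void>;

interface ReconnectOptions {
  initialDelayMs?: number; // 2s in the PR description
  maxDelayMs?: number; // 60s in the PR description
  sleep?: (ms: number, signal: AbortSignal) => Promise<void>; // injectable for tests
}

// Waits `ms`, returning early if the signal aborts.
const defaultSleep = (ms: number, signal: AbortSignal): Promise<void> =>
  new Promise((resolve) => {
    if (signal.aborted) return resolve();
    const timer = setTimeout(done, ms);
    function done() {
      clearTimeout(timer);
      signal.removeEventListener("abort", done);
      resolve();
    }
    signal.addEventListener("abort", done, { once: true });
  });

async function runWithBackoff(
  connectOnce: ConnectOnce,
  signal: AbortSignal,
  opts: ReconnectOptions = {},
): Promise<void> {
  const initial = opts.initialDelayMs ?? 2_000;
  const max = opts.maxDelayMs ?? 60_000;
  const sleep = opts.sleep ?? defaultSleep;
  let delay = initial;

  while (!signal.aborted) {
    let wait: number;
    try {
      await connectOnce(signal);
      // The connection was healthy and closed normally: reset the backoff.
      delay = initial;
      wait = initial;
    } catch {
      // Creation failure (e.g. token refresh timeout) or runtime disconnect:
      // wait the current delay, then double it up to the cap for next time.
      wait = delay;
      delay = Math.min(delay * 2, max);
    }
    if (signal.aborted) break;
    await sleep(wait, signal);
  }
}
```

Per-cycle cleanup of client state would belong inside `connectOnce` (for example in a `finally` block), so that each iteration starts from a clean slate.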

Test Plan

  • Start Feishu WebSocket transport
  • Simulate token refresh failure (e.g. network interruption during refresh)
  • Verify bot reconnects with exponential backoff instead of going permanently offline
  • Verify abort signal still cleanly stops the transport

Changed files

  • extensions/feishu/src/monitor.transport.ts (modified, +83/-33)

---
Original issue body

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

When the Feishu tenant_access_token refresh fails due to a transient timeout (e.g., open.feishu.cn responds slowly during off-peak hours), the Feishu WebSocket connection drops and does not automatically recover.

The current reconnection logic attempts only one retry after the initial failure. If that retry also fails (which is likely when the upstream issue is still ongoing), the plugin gives up entirely and stops receiving Feishu events. The connection remains dead until the gateway is manually restarted.

This means a single transient API hiccup can cause hours of silent message loss with no visible error to the user.

Steps to reproduce

  1. Configure Feishu plugin with connectionMode: "websocket".
  2. Wait for a transient tenant_access_token timeout (or simulate by temporarily blocking open.feishu.cn for ~60s).
  3. Observe gateway logs:
      2026-04-19T01:19:40 [error]: AxiosError: timeout of 30000ms exceeded
         url: 'https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal'
         code: 'ECONNABORTED'
      2026-04-19T01:20:39 [info]: [ '[ws]', 'reconnect' ]
      2026-04-19T01:21:40 [error]: AxiosError: timeout of 30000ms exceeded
         url: 'https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal'
         code: 'ECONNABORTED'
      # No further reconnection attempts after this. Next Feishu event only after manual gateway restart at 09:48.
  4. After the transient issue resolves, Feishu events are never received again — no further reconnection attempts are made.
  5. openclaw doctor still reports Feishu: ok (does not detect the dead ws connection).

Expected behavior

  • The Feishu WebSocket plugin should implement exponential backoff with persistent retries (e.g., 1s → 2s → 4s → ... → max 5min, retrying indefinitely until reconnected).
  • After successful reconnection, the plugin should log a clear [feishu] reconnected message.
  • Optionally: openclaw doctor could check whether the Feishu ws connection is actually alive (not just configured).
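
One way the optional liveness check could work is sketched below. Everything here is hypothetical: `WsLivenessTracker` and its methods are illustrative names, not part of OpenClaw or the Lark SDK, and the sketch assumes the transport can report state changes and some form of periodic activity (events or heartbeat frames) to the tracker.

```typescript
type WsState = "connecting" | "connected" | "disconnected";

// Hypothetical tracker that a health check could query instead of only
// checking that the channel is configured.
class WsLivenessTracker {
  private state: WsState = "connecting";
  private lastActivityMs: number | null = null;
  private readonly staleAfterMs: number;

  constructor(staleAfterMs: number) {
    this.staleAfterMs = staleAfterMs;
  }

  markConnected(nowMs: number): void {
    this.state = "connected";
    this.lastActivityMs = nowMs;
  }

  // Call on every received event or heartbeat frame (assumed to exist).
  markActivity(nowMs: number): void {
    this.lastActivityMs = nowMs;
  }

  markDisconnected(): void {
    this.state = "disconnected";
  }

  // Reports not-ok when the socket is not connected or has been silent
  // for longer than the configured threshold.
  check(nowMs: number): { ok: boolean; reason: string } {
    if (this.state !== "connected") {
      return { ok: false, reason: `state is ${this.state}` };
    }
    const idleMs = nowMs - (this.lastActivityMs ?? nowMs);
    if (idleMs > this.staleAfterMs) {
      return { ok: false, reason: `no activity for ${idleMs}ms` };
    }
    return { ok: true, reason: "connected" };
  }
}
```

If the transport offers no heartbeat, the staleness threshold would need to be generous or dropped, since a quiet chat is not the same as a dead connection.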

Actual behavior

In our case, the token refresh timed out at 01:19 AM. The connection was not restored until a manual gateway restart at 09:48 AM — 8.5 hours of silent message loss.

OpenClaw version

2026.4.15

Operating system

Ubuntu 20.04

Model

opus-4-6

Provider / routing chain

openclaw -> anthropic-API

extent analysis

TL;DR

Implement exponential backoff with persistent retries for the Feishu WebSocket connection to handle transient timeouts.

Guidance

  • Modify the Feishu plugin's reconnection logic to use exponential backoff (e.g., 1s → 2s → 4s → ... → max 5min) with indefinite retries until reconnected.
  • After a successful reconnection, log a clear [feishu] reconnected message to indicate the connection is restored.
  • Consider enhancing openclaw doctor to check the Feishu WebSocket connection's liveness, not just its configuration.
  • Review the current retry mechanism to ensure it does not give up after a single failure, causing prolonged message loss.

Example

// Sketch of exponential backoff with persistent retries.
// reconnectWebSocket() is a placeholder for the plugin's actual connect routine.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function reconnectFeishuWebSocket() {
  const initialDelay = 1000; // 1s
  const maxDelay = 300000; // 5min
  const backoffFactor = 2;
  let delay = initialDelay;

  while (true) {
    try {
      // Attempt to reconnect; throws (or rejects) on failure
      await reconnectWebSocket();
      console.log('[feishu] reconnected');
      return;
    } catch (error) {
      // Wait for the current delay, then double it up to the cap
      await sleep(delay);
      delay = Math.min(delay * backoffFactor, maxDelay);
    }
  }
}

Notes

The provided example is a simplified illustration and may require adaptation to fit the actual implementation details of the Feishu plugin and OpenClaw's architecture.

Recommendation

Implement exponential backoff with persistent retries for the Feishu WebSocket connection. This directly addresses the silent message loss caused by transient timeouts, and both linked PRs (#68840 and #68865) propose this approach.
