openclaw - ✅(Solved) Fix [Bug]: Feishu WebSocket connection does not recover after transient token refresh failure [2 pull requests, 1 comments, 2 participants]

openclaw · 2026-04-19 01:58:53

GitHub stats

openclaw/openclaw#68766 • Fetched 2026-04-19 15:07:55

Author

jw8957

Participants

jw8957

martingarramon

Timeline (top)

cross-referenced ×2, labeled ×2, referenced ×2, commented ×1

When the Feishu tenant_access_token refresh fails due to a transient timeout (e.g., open.feishu.cn responds slowly during off-peak hours), the Feishu WebSocket connection drops and does not automatically recover.

The current reconnection logic attempts only one retry after the initial failure. If that retry also fails (which is likely when the upstream issue is still ongoing), the plugin gives up entirely and stops receiving Feishu events. The connection remains dead until the gateway is manually restarted.

This means a single transient API hiccup can cause hours of silent message loss with no visible error to the user.

Error Message

  2026-04-19T01:19:40 [error]: AxiosError: timeout of 30000ms exceeded
     url: 'https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal'
     code: 'ECONNABORTED'
  2026-04-19T01:20:39 [info]: [ '[ws]', 'reconnect' ]
  2026-04-19T01:21:40 [error]: AxiosError: timeout of 30000ms exceeded
     url: 'https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal'
     code: 'ECONNABORTED'
  # No further reconnection attempts after this. Next Feishu event only after manual gateway restart at 09:48.

Root Cause

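The reconnection logic attempts only one retry after the initial tenant_access_token refresh failure. If that retry also fails, the plugin stops reconnecting and receives no further Feishu events until the gateway is restarted manually.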

Fix Action

Fix proposed; both pull requests below are listed as open and not merged.

Proposed fix: fix: add exponential backoff retry for Feishu WebSocket reconnection (https://github.com/openclaw/openclaw/pull/68840)
Proposed fix: fix(feishu): add application-level WebSocket reconnection with backoff (https://github.com/openclaw/openclaw/pull/68865)

PR fix notes

PR #68840: fix: add exponential backoff retry for Feishu WebSocket reconnection

Repository: openclaw/openclaw
Author: kagura-agent
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/68840

Description (problem / solution / changelog)

Fixes #68766

Problem

Feishu WebSocket connection doesn't recover after transient tenant_access_token refresh failure. The current reconnection logic retries only once — if that retry also fails, the plugin gives up entirely, causing hours of silent message loss.

Solution

Replace the single-shot retry with an exponential backoff retry loop in monitorWebSocket():

- Backoff delays: 5s → 10s → 30s → 60s → 120s
- Never gives up: After exhausting the 5 backoff steps, retries indefinitely at 120s
- Clean shutdown: Respects abortSignal during backoff waits
- Cleanup: wsClients properly cleaned up on each failed attempt
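
The schedule and the abort-aware wait described above could be sketched roughly as follows. This is not the PR's code: only the delay values (5s, 10s, 30s, 60s, 120s) come from the description, while the names BACKOFF_STEPS_MS, backoffDelayMs, and sleepWithAbort are hypothetical and chosen for illustration.

```typescript
// Hypothetical names throughout; only the delay values come from the PR description.
const BACKOFF_STEPS_MS: readonly number[] = [5_000, 10_000, 30_000, 60_000, 120_000];

/** Delay before retry number `attempt` (0-based), clamped to the last step. */
function backoffDelayMs(attempt: number): number {
  const lastIndex = BACKOFF_STEPS_MS.length - 1;
  const index = Math.min(Math.max(attempt, 0), lastIndex);
  return BACKOFF_STEPS_MS[index] ?? 120_000;
}

/** Resolves after `ms`, or rejects early if `signal` aborts during the wait. */
function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```

Clamping the index rather than stopping after the last step is what makes the loop retry indefinitely at 120s, and rejecting on abort lets a shutdown interrupt a long wait instead of blocking on it.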

Tests

6 new tests covering retry behavior, backoff clamping, cleanup, and abort handling.

Changed files

extensions/feishu/src/monitor.cleanup.test.ts (modified, +6/-4)
extensions/feishu/src/monitor.transport.test.ts (added, +178/-0)
extensions/feishu/src/monitor.transport.ts (modified, +64/-9)

PR #68865: fix(feishu): add application-level WebSocket reconnection with backoff

Repository: openclaw/openclaw
Author: tianhaocui
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/68865

Description (problem / solution / changelog)

Fixes #68766

Summary

The Feishu WebSocket transport relied solely on the Lark SDK's built-in autoReconnect, which silently gives up after exhausting its internal retry budget. When this happens the bot goes permanently offline with no recovery path.

Root Cause

- The Lark SDK's internal reconnection has a limited retry budget
- cleanup() calls wsClient.close() which permanently kills the SDK's reconnection loop
- No application-level recovery exists: once the SDK gives up, the bot stays offline

Fix

Wrap the WebSocket lifecycle in an application-level reconnection loop with exponential backoff (2s initial, 60s max), following the same pattern used by the Mattermost channel (runWithReconnect).

The loop:

- Retries on both client creation failures (e.g. token refresh timeout) and runtime disconnects
- Resets backoff on successful connections (normal close)
- Respects the abort signal for clean shutdown
- Cleans up wsClient/botOpenIds/botNames state on each cycle
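
A minimal sketch of such a loop, under stated assumptions, is shown below. It is not the PR's implementation and not the Mattermost runWithReconnect helper, whose signature is not given here. The names reconnectLoop, connectOnce, and ReconnectOptions are hypothetical; connectOnce is assumed to resolve on a normal close and reject on a creation failure or runtime disconnect. Only the 2s initial and 60s maximum delays come from the description.

```typescript
type ReconnectOptions = {
  initialDelayMs?: number; // 2s in the PR description
  maxDelayMs?: number; // 60s in the PR description
  signal?: AbortSignal;
  // Injectable so the loop can be exercised without real timers (assumption).
  sleep?: (ms: number) => Promise<void>;
  onRetry?: (delayMs: number, error: unknown) => void;
};

/**
 * Calls `connectOnce` repeatedly until the signal aborts.
 * Per-cycle state cleanup is assumed to happen inside `connectOnce`.
 */
async function reconnectLoop(
  connectOnce: () => Promise<void>,
  options: ReconnectOptions = {},
): Promise<void> {
  const initialDelayMs = options.initialDelayMs ?? 2_000;
  const maxDelayMs = options.maxDelayMs ?? 60_000;
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let delayMs = initialDelayMs;

  while (!options.signal?.aborted) {
    let waitMs: number;
    try {
      await connectOnce();
      // Normal close: reset the backoff and reconnect after the initial delay.
      delayMs = initialDelayMs;
      waitMs = initialDelayMs;
    } catch (error) {
      // Failure: wait the current delay, then double it up to the cap.
      waitMs = delayMs;
      delayMs = Math.min(delayMs * 2, maxDelayMs);
      options.onRetry?.(waitMs, error);
    }
    if (options.signal?.aborted) break;
    await sleep(waitMs);
  }
}
```

With the defaults, three consecutive failures wait 2s, 4s, and 8s; a later normal close brings the next wait back to 2s. Passing a fake sleep makes that sequence easy to assert in a unit test.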

Test Plan

1. Start Feishu WebSocket transport
2. Simulate token refresh failure (e.g. network interruption during refresh)
3. Verify bot reconnects with exponential backoff instead of going permanently offline
4. Verify abort signal still cleanly stops the transport

Changed files

extensions/feishu/src/monitor.transport.ts (modified, +83/-33)

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

A single transient tenant_access_token refresh timeout can cause hours of silent message loss with no visible error to the user.

Steps to reproduce

1. Configure Feishu plugin with connectionMode: "websocket".
2. Wait for a transient tenant_access_token timeout (or simulate by temporarily blocking open.feishu.cn for ~60s).
3. Observe gateway logs:

   2026-04-19T01:19:40 [error]: AxiosError: timeout of 30000ms exceeded
      url: 'https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal'
      code: 'ECONNABORTED'
   2026-04-19T01:20:39 [info]: [ '[ws]', 'reconnect' ]
   2026-04-19T01:21:40 [error]: AxiosError: timeout of 30000ms exceeded
      url: 'https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal'
      code: 'ECONNABORTED'
   # No further reconnection attempts after this. Next Feishu event only after manual gateway restart at 09:48.

4. After the transient issue resolves, Feishu events are never received again; no further reconnection attempts are made.
5. openclaw doctor still reports Feishu: ok (does not detect the dead ws connection).

Expected behavior

- The Feishu WebSocket plugin should implement **exponential backoff with persistent retries** (e.g., 1s → 2s → 4s → ... → max 5min, retrying indefinitely until reconnected).
- After successful reconnection, the plugin should log a clear `[feishu] reconnected` message.
- Optionally: `openclaw doctor` could check whether the Feishu ws connection is actually alive (not just configured).
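
The optional liveness check could be approached along the lines below. This is purely illustrative: ConnectionLiveness and its methods are hypothetical, and the issue does not say how openclaw doctor gathers its status or whether the Lark SDK exposes heartbeat events. The sketch assumes the transport can call markActivity on every received event or heartbeat frame.

```typescript
type LivenessStatus = "ok" | "stale" | "never-connected";

/** Tracks the most recent sign of life from a connection (hypothetical helper). */
class ConnectionLiveness {
  private lastActivityMs: number | null = null;

  /** Call on every received event or heartbeat frame. */
  markActivity(nowMs: number = Date.now()): void {
    this.lastActivityMs = nowMs;
  }

  /** Reports "ok" only if activity was seen within the last `maxSilenceMs`. */
  status(maxSilenceMs: number, nowMs: number = Date.now()): LivenessStatus {
    if (this.lastActivityMs === null) return "never-connected";
    return nowMs - this.lastActivityMs <= maxSilenceMs ? "ok" : "stale";
  }
}
```

A check based only on chat events would flag quiet channels as stale, so a design like this depends on some periodic signal, such as a heartbeat, reaching markActivity.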

Actual behavior

In our case, the token refresh timed out at 01:19 AM. The connection was not restored until a manual gateway restart at 09:48 AM — 8.5 hours of silent message loss.

OpenClaw version

2026.4.15

Operating system

Ubuntu 20.04

Install method

No response

Model

opus-4-6

Provider / routing chain

openclaw -> anthropic-API

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response

extent analysis

TL;DR

Implement exponential backoff with persistent retries for the Feishu WebSocket connection to handle transient timeouts.

Guidance

- Modify the Feishu plugin's reconnection logic to use exponential backoff (e.g., 1s → 2s → 4s → ... → max 5min) with indefinite retries until reconnected.
- After a successful reconnection, log a clear [feishu] reconnected message to indicate the connection is restored.
- Consider enhancing openclaw doctor to check the Feishu WebSocket connection's liveness, not just its configuration.
- Review the current retry mechanism to ensure it does not give up after a single failure, causing prolonged message loss.

Example

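// Sketch of exponential backoff with persistent retries.
// reconnectWebSocket() is a placeholder for the plugin's real connect call;
// it is assumed to return a promise that rejects when the attempt fails.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function reconnectFeishuWebSocket() {
  const initialDelay = 1000; // 1s
  const maxDelay = 300000; // 5min
  const backoffFactor = 2;
  let delay = initialDelay;

  while (true) {
    try {
      // Attempt to reconnect
      await reconnectWebSocket();
      console.log('[feishu] reconnected');
      return;
    } catch (error) {
      // Wait for the current delay, then grow it for the next attempt
      await sleep(delay);
      delay = Math.min(delay * backoffFactor, maxDelay);
    }
  }
}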

Notes

The provided example is a simplified illustration and may require adaptation to fit the actual implementation details of the Feishu plugin and OpenClaw's architecture.

Recommendation

Implement exponential backoff with persistent retries for the Feishu WebSocket connection; this directly addresses the silent message loss caused by transient timeouts. Until a fix is released, the only recovery described in the issue is a manual gateway restart.
