openclaw - ✅ (Solved) Fix Gateway/provider race: stale heartbeat reconnect callback throws after disconnect [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#63387 (fetched 2026-04-09 07:54:26)
Comments: 0
Participants: 1
Timeline: 0
Reactions: 0

I hit an internal invariant failure in the gateway/provider websocket heartbeat logic:

Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)

This appears to be a race where a stale heartbeat reconnect callback fires after the websocket/session has already been closed or disconnected.

Error Message

Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)

PR fix notes

PR #68159: fix(discord): prevent Identify silent-drop race in gateway startup

Description (problem / solution / changelog)

Summary

Fixes the Discord gateway stuck at "awaiting gateway readiness" reported in #52372.

Root cause: @buape/carbon's Client constructor does not await registerClient(). If the lifecycle readiness-timeout handler invokes connect() → identify() while SafeGatewayPlugin.registerClient is still awaiting the gateway-metadata fetch, Carbon's identify() reads this.client as undefined and silently drops the Identify frame. The WebSocket opens and receives Hello, but never reaches READY.

Two minimal additions in SafeGatewayPlugin.registerClient (extensions/discord/src/monitor/gateway-plugin.ts):

  1. Assign this.client = client at the top of the method so a concurrent identify() always sees a defined client while the async fetchDiscordGatewayInfoWithTimeout is still resolving.
  2. After metadata has been resolved, if an external caller already set ws or isConnecting on the plugin (i.e. a connect() raced ahead), skip super.registerClient() to avoid tearing down the live WebSocket that the external connect() just opened.

The ws / isConnecting access is a typed cast: if Carbon ever renames these fields, the guard silently degrades to a no-op and super.registerClient() runs again — which may cause a brief reconnect but not a permanent hang, because fix (1) already prevents Identify from being dropped.
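The two additions described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: BaseGatewayPlugin, SafeGatewayPluginSketch, fetchGatewayInfo, and baseRegisterCalls are hypothetical stand-ins, and only the fields named in the description (client, ws, isConnecting) are modeled.

```javascript
// Hypothetical stand-in for Carbon's GatewayPlugin; not the real class.
class BaseGatewayPlugin {
  constructor() {
    this.client = undefined;
    this.ws = undefined;
    this.isConnecting = false;
    this.baseRegisterCalls = 0; // illustration-only counter
  }

  registerClient(client) {
    this.client = client;
    this.baseRegisterCalls += 1;
  }
}

// Hypothetical sketch of the override described in the PR notes.
class SafeGatewayPluginSketch extends BaseGatewayPlugin {
  constructor(fetchGatewayInfo) {
    super();
    this.fetchGatewayInfo = fetchGatewayInfo; // assumed injected metadata fetch
  }

  async registerClient(client) {
    // (1) Publish the client before the first await, so a concurrent
    //     identify() never observes an undefined client.
    this.client = client;

    this.gatewayInfo = await this.fetchGatewayInfo();

    // (2) If a connect() raced ahead during the fetch, leave its socket alone.
    if (this.ws || this.isConnecting) return;

    super.registerClient(client);
  }
}

// Usage: keep the metadata fetch pending and inspect the plugin mid-flight.
const pendingFetch = new Promise(() => {}); // never resolves in this sketch
const plugin = new SafeGatewayPluginSketch(() => pendingFetch);
const client = { name: "demo-client" };
plugin.registerClient(client); // intentionally not awaited
```

While the fetch is still pending, plugin.client already refers to the client and the base registerClient() has not run, which is the window in which the dropped Identify frame was reported.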

Why low-risk

  • Only touches SafeGatewayPlugin, Carbon's public contract is unchanged.
  • The early this.client = client is a no-op for all existing paths except the pre-await race window.
  • The ws / isConnecting guard is placed after the testing.registerClient hook (which already bypasses super.registerClient()), so the existing test injection semantics are preserved.
  • The override connect() heartbeat-stale-timer guard added in the meantime (for #65009 / #64011 / #63387) is untouched.

Tests

Adds two regression tests in provider.proxy.test.ts:

  • sets client reference before the async gateway-info fetch resolves — mocks a pending fetch, calls registerClient, asserts this.client === clientArg before resolving the fetch.
  • skips super.registerClient when an external connect() sets ws during the metadata fetch — simulates the race by setting plugin.ws mid-fetch, asserts baseRegisterClientSpy was not called.

The mock GatewayPlugin in that file is extended with client, ws, isConnecting fields to match Carbon's shape.

Community validation

@Skeptomenos verified the same fix on a live 4-bot setup with runtime instrumentation of Carbon's GatewayPlugin (see #53039 comments):

  • All 4 bots: connect → WS open → IDENTIFY sent → READY received → heartbeats flowing
  • Zero errors, zero guard blocks, zero close events during startup.

History

Supersedes #53039. That branch had diverged ~10k commits from main and could no longer be rebased cleanly. The fix is re-applied here on top of current main, which has since added an override connect() heartbeat-timer guard and a testing.registerClient injection hook that both need to coexist with this change.

Closes #52372

Test plan

  • extensions/discord unit suite (provider.proxy.test.ts, gateway-plugin.test.ts) — new regression tests added, existing assertions unchanged.
  • Community field test on 4-bot Discord setup — see #53039 thread for full WS-event trace.
  • CI (check, extension-fast (discord), checks (node, extensions, pnpm test:extensions)) will run on push.

Changed files

  • extensions/discord/src/monitor/gateway-plugin.ts (modified, +24/-0)
  • extensions/discord/src/monitor/provider.proxy.test.ts (modified, +88/-0)

Summary

I hit an internal invariant failure in the gateway/provider websocket heartbeat logic:

Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)

This appears to be a race where a stale heartbeat reconnect callback fires after the websocket/session has already been closed or disconnected.

Error

Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)

What I found

The installed code throws from the provider heartbeat/reconnect path in the built distribution:

  • file: dist/provider-DEWH9yd9.js

Relevant logic is effectively:

startHeartbeat(this, {
  interval,
  reconnectCallback: () => {
    if (closed) throw new Error("Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)");
    closed = true;
    this.handleZombieConnection();
  }
});

And heartbeat scheduling looks like:

function startHeartbeat(manager, options) {
  stopHeartbeat(manager);
  const sendHeartbeat = () => {
    if (!manager.lastHeartbeatAck) {
      options.reconnectCallback();
      return;
    }
    manager.lastHeartbeatAck = false;
    manager.send({ op: GatewayOpcodes.Heartbeat, d: manager.sequence });
  };
  manager.firstHeartbeatTimeout = setTimeout(() => {
    sendHeartbeat();
    manager.heartbeatInterval = setInterval(sendHeartbeat, interval);
  }, initialDelay);
}

disconnect() does call stopHeartbeat(this), but it looks like a stale timer / overlapping close-reconnect state can still let reconnectCallback() run on an already-closed connection object.
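One interleaving that would produce this is sketched below. It is an assumption, not something the report confirms: it supposes that the zombie handler reaches stopHeartbeat() synchronously while the first-heartbeat timeout is still running, so the interval is armed after the cleanup and survives it. The fake timer helpers and this stopHeartbeat() body are hypothetical and exist only to make the sequence deterministic.

```javascript
// Hypothetical fake timers so the interleaving can be replayed step by step.
const pending = new Map();
let nextTimerId = 1;
function fakeSetTimer(fn) { const id = nextTimerId++; pending.set(id, fn); return id; }
function fakeClearTimer(id) { pending.delete(id); }
function fire(id) { const fn = pending.get(id); pending.delete(id); fn(); } // one tick

// Assumed shape of stopHeartbeat(); the real implementation is not shown in the issue.
function stopHeartbeat(manager) {
  fakeClearTimer(manager.firstHeartbeatTimeout);
  fakeClearTimer(manager.heartbeatInterval);
  manager.firstHeartbeatTimeout = undefined;
  manager.heartbeatInterval = undefined;
}

// Same structure as the quoted startHeartbeat(), with fake timers swapped in.
function startHeartbeat(manager, options) {
  stopHeartbeat(manager);
  const sendHeartbeat = () => {
    if (!manager.lastHeartbeatAck) {
      options.reconnectCallback();
      return;
    }
    manager.lastHeartbeatAck = false;
  };
  manager.firstHeartbeatTimeout = fakeSetTimer(() => {
    sendHeartbeat();
    // If sendHeartbeat() already triggered stopHeartbeat(), this timer is
    // created after the cleanup and nothing clears it.
    manager.heartbeatInterval = fakeSetTimer(sendHeartbeat);
  });
}

const manager = { lastHeartbeatAck: false };
let closed = false;
startHeartbeat(manager, {
  reconnectCallback: () => {
    if (closed) throw new Error("Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)");
    closed = true;
    stopHeartbeat(manager); // stands in for handleZombieConnection() -> disconnect()
  },
});

fire(manager.firstHeartbeatTimeout); // missed ack: reconnect path runs, then the interval is armed
const leakedTimers = pending.size;   // timers still registered after the "disconnect"

let staleError = null;
try {
  fire(manager.heartbeatInterval);   // the stale tick hits the closed flag
} catch (err) {
  staleError = err;
}
```

In this replay one timer is still registered after cleanup, and its next tick throws the invariant error. Other interleavings (for example an overlapping close and reconnect) may equally explain the report.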

Expected behavior

A stale heartbeat callback should exit quietly or no-op once the connection/session has already been closed, not throw an exception.

Actual behavior

An exception is thrown from reconnect logic for a connection that was already considered closed/disconnected.

Suspected cause

Race between:

  • websocket/session disconnect/cleanup
  • pending heartbeat timeout or interval callback
  • reconnect callback closure retaining stale closed state

Observed impact

  • noisy internal exception
  • possible gateway/provider instability after the event
  • openclaw status did not return cleanly around the same time, which may indicate daemon state disruption

Trigger conditions

Not 100% certain, but likely one of:

  • transient network flap
  • websocket close during heartbeat timing window
  • rapid gateway restart/reload while old provider session is unwinding
  • delayed/missed heartbeat ack followed by overlapping reconnect and close

Suggested fix

Defensively guard reconnect callback / heartbeat send path so stale callbacks do not throw after disconnect. For example:

  • no-op if websocket/session is no longer current
  • no-op if manager is already disconnected
  • bind heartbeat callbacks to a connection generation/token and ignore stale generations
  • avoid throwing on closed === true; log/debug and return instead
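The generation/token option from the list above could look roughly like this. It is a sketch under assumptions: HeartbeatManagerSketch and its counters are hypothetical, and the timers are omitted so that the tick a timer would call can be invoked by hand.

```javascript
// Hypothetical manager illustrating generation-bound heartbeat callbacks.
class HeartbeatManagerSketch {
  constructor() {
    this.generation = 0;          // bumped on every start and disconnect
    this.lastHeartbeatAck = true;
    this.sent = 0;                // heartbeats "sent"
    this.reconnects = 0;          // zombie reconnects requested
    this.ignored = 0;             // stale ticks dropped
  }

  // Returns the tick function that a timeout or interval would call.
  startHeartbeat() {
    const myGeneration = ++this.generation;
    return () => {
      if (myGeneration !== this.generation) {
        this.ignored += 1;        // tick belongs to an old connection: no-op
        return;
      }
      if (!this.lastHeartbeatAck) {
        this.reconnects += 1;     // current connection missed an ack
        return;
      }
      this.lastHeartbeatAck = false;
      this.sent += 1;
    };
  }

  disconnect() {
    this.generation += 1;         // invalidates every tick created earlier
  }
}

const heartbeatManager = new HeartbeatManagerSketch();
const tick = heartbeatManager.startHeartbeat();
tick();                        // current generation: heartbeat sent, ack pending
heartbeatManager.disconnect();
tick();                        // stale generation: ignored instead of reconnecting
```

The appeal of this design is that correctness no longer depends on every timer being cleared at the right moment: a tick that outlives its connection compares generations and returns without touching reconnect state.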

Environment

  • OpenClaw installed via npm on Windows
  • observed in built dist file: dist/provider-DEWH9yd9.js

If useful, I can provide a fuller stack trace/log context.

extent analysis

TL;DR

Defensively guard the reconnect callback and heartbeat send path to prevent stale callbacks from throwing after disconnect.

Guidance

  • Review the startHeartbeat function and reconnectCallback to ensure they properly handle the case where the connection is already closed or disconnected.
  • Consider adding a check for the connection's state before calling reconnectCallback to prevent it from running on a stale connection.
  • Instead of throwing an error when closed === true, log a debug message and return to prevent exceptions from being thrown.
  • Investigate using a connection generation/token to bind heartbeat callbacks and ignore stale generations.

Example

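reconnectCallback: () => {
  if (closed) {
    // Stale timer fired after disconnect: log and return instead of throwing.
    console.debug("Stale reconnect callback ignored");
    return;
  }
  closed = true;
  this.handleZombieConnection();
}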

Notes

The suggested fix is based only on the snippets quoted from the built distribution file, so the actual source may need further changes. Test the change under the suspected trigger conditions (network flap, websocket close during the heartbeat timing window, rapid gateway restart) to confirm that no stale callback still throws.

Recommendation

Guard the reconnect callback and the heartbeat send path so that a stale timer firing after disconnect logs and returns instead of throwing. This addresses the reported exception directly, but it does not explain why the timer outlives stopHeartbeat(), so binding callbacks to a connection generation is worth investigating as a follow-up.
