openclaw - ✅(Solved) Fix [Bug]: Gateway crashes with "Attempted to reconnect zombie connection after disconnecting first" and is auto-restarted by systemd [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#65009 · Fetched 2026-04-12 13:26:03
Comments: 0 · Participants: 1 · Timeline: 4 · Reactions: 0
Timeline (top): labeled ×2, cross-referenced ×1, referenced ×1

My OpenClaw gateway process intermittently crashes with:

Error: Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)

The process exits, then systemd --user automatically restarts openclaw-gateway.service.

This appears to be happening in the gateway heartbeat / reconnect path, not as a full machine reboot.

Error Message

The gateway runs normally for a while, then crashes with an uncaught exception. systemd notices the failure and restarts the service.

Root Cause

What I checked

  • This is not a host reboot. The machine stayed on the same boot.
  • I checked for OOM / kernel panic / thermal issues and did not find evidence that those caused the restart.
  • The crash appears to come from OpenClaw itself, specifically the heartbeat / reconnect logic.
  • I checked whether cron jobs posting to Discord were the immediate cause. I did not find strong evidence that Discord posting is the root cause. At most, activity may expose the bug, but the fatal event is in heartbeat/reconnect handling.

Fix Action

Fixed

PR fix notes

PR #65087: fix(discord): clear stale heartbeat timers in SafeGatewayPlugin.connect()

Description (problem / solution / changelog)

What does this PR do?

Adds a connect() override to SafeGatewayPlugin that clears stale heartbeat timers before delegating to the parent, preventing an intermittent uncaught exception that crashes the Discord gateway process and drops in-flight replies.

Root Cause

The @buape/carbon version in use has a race in its heartbeat initialisation:

setTimeout(() => {
    sendHeartbeat()                                          // stopHeartbeat() runs here —
                                                             // but heartbeatInterval is still undefined
    heartbeatInterval = setInterval(sendHeartbeat, interval) // stale interval created after the clear
}, initialDelay)

When sendHeartbeat detects a zombie connection it calls stopHeartbeat(), which clears heartbeatInterval — but the interval has not been assigned yet. The setInterval on the next line then creates a timer whose closure holds closed=true. When it fires ~41 seconds later, reconnectCallback(closed=true) throws inside a setInterval callback. Node.js routes this to process.on('uncaughtException'), bypassing the EventEmitter.on('error') path the gateway supervisor monitors. systemd restarts the service and in-flight replies fail.
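The ordering problem described above can be reproduced in isolation. The following is a minimal sketch, not the real @buape/carbon code: `HeartbeatRaceDemo`, `TimerHandle`, and `firstHeartbeat` are hypothetical names, and the timer is modelled as a plain object so the example runs deterministically.

```typescript
// Hypothetical stand-in for a timer handle; lets the demo run without real timers.
type TimerHandle = { cleared: boolean };

// Hypothetical model of the heartbeat state; not the real @buape/carbon code.
class HeartbeatRaceDemo {
  heartbeatInterval: TimerHandle | undefined;

  // Mirrors stopHeartbeat(): clears whatever interval is currently assigned.
  stopHeartbeat(): void {
    if (this.heartbeatInterval !== undefined) {
      this.heartbeatInterval.cleared = true;
      this.heartbeatInterval = undefined;
    }
  }

  // Mirrors the first-heartbeat callback: send first, assign the interval second.
  firstHeartbeat(zombieDetected: boolean): void {
    if (zombieDetected) {
      this.stopHeartbeat(); // no-op here: heartbeatInterval is still undefined
    }
    this.heartbeatInterval = { cleared: false }; // stale timer created after the clear
  }
}

const demo = new HeartbeatRaceDemo();
demo.firstHeartbeat(true);

const staleTimerSurvives =
  demo.heartbeatInterval !== undefined && !demo.heartbeatInterval.cleared;
console.log(staleTimerSurvives); // true
```

Because the clear runs before the assignment, nothing ever cancels the interval, which matches the stale timer the PR description blames for the later throw.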

Solution Applied

Override connect() in SafeGatewayPlugin to unconditionally clear both heartbeatInterval and firstHeartbeatTimeout before calling super.connect():

public override connect(resume = false): void {
  if (this.heartbeatInterval !== undefined) {
    clearInterval(this.heartbeatInterval);
    this.heartbeatInterval = undefined;
  }
  if (this.firstHeartbeatTimeout !== undefined) {
    clearTimeout(this.firstHeartbeatTimeout);
    this.firstHeartbeatTimeout = undefined;
  }
  super.connect(resume);
}

The parent's connect() only calls stopHeartbeat() when isConnecting=false. When isConnecting=true it returns early — leaving any stale timer alive. This override runs before that early-return check, ensuring stale timers are always cleared on reconnect.
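To illustrate why the override has to run before the parent's early return, here is a small sketch with a fake parent class. `FakeGatewayPlugin`, `SafeFakeGatewayPlugin`, `armStaleTimers`, and `hasStaleTimers` are hypothetical names for illustration; only the field names and the `isConnecting` early return are taken from the description above.

```typescript
type IntervalHandle = ReturnType<typeof setInterval>;
type TimeoutHandle = ReturnType<typeof setTimeout>;

// Hypothetical parent that mimics the early return described above; not the real GatewayPlugin.
class FakeGatewayPlugin {
  heartbeatInterval: IntervalHandle | undefined;
  firstHeartbeatTimeout: TimeoutHandle | undefined;
  isConnecting = false;

  // Simulates timers left behind by an earlier connection attempt.
  armStaleTimers(): void {
    this.heartbeatInterval = setInterval(() => {}, 60000);
    this.firstHeartbeatTimeout = setTimeout(() => {}, 60000);
  }

  stopHeartbeat(): void {
    if (this.heartbeatInterval !== undefined) clearInterval(this.heartbeatInterval);
    if (this.firstHeartbeatTimeout !== undefined) clearTimeout(this.firstHeartbeatTimeout);
    this.heartbeatInterval = undefined;
    this.firstHeartbeatTimeout = undefined;
  }

  connect(resume = false): void {
    if (this.isConnecting) return; // early return: stale timers stay alive
    this.stopHeartbeat();
    this.isConnecting = true;
    void resume; // a real implementation would resume or identify here
  }

  hasStaleTimers(): boolean {
    return this.heartbeatInterval !== undefined || this.firstHeartbeatTimeout !== undefined;
  }
}

// Same shape as the override in the PR: clear both timers, then delegate.
class SafeFakeGatewayPlugin extends FakeGatewayPlugin {
  override connect(resume = false): void {
    if (this.heartbeatInterval !== undefined) {
      clearInterval(this.heartbeatInterval);
      this.heartbeatInterval = undefined;
    }
    if (this.firstHeartbeatTimeout !== undefined) {
      clearTimeout(this.firstHeartbeatTimeout);
      this.firstHeartbeatTimeout = undefined;
    }
    super.connect(resume);
  }
}

const unsafe = new FakeGatewayPlugin();
unsafe.isConnecting = true;
unsafe.armStaleTimers();
unsafe.connect();
const parentLeavesTimers = unsafe.hasStaleTimers();
unsafe.stopHeartbeat(); // clean up so the demo process can exit

const safe = new SafeFakeGatewayPlugin();
safe.isConnecting = true;
safe.armStaleTimers();
safe.connect();
const overrideLeavesTimers = safe.hasStaleTimers();

console.log(parentLeavesTimers, overrideLeavesTimers); // true false
```

The sketch only models the control flow; the actual parent class does more in connect() than is shown here.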

Bottleneck Solved

  • Eliminates intermittent gateway crashes caused by the stale setInterval firing with a closed reconnectCallback
  • No process restart, no dropped in-flight replies
  • Works with the currently published @buape/carbon release without requiring a version bump

Testing

pnpm test:extension discord

Two new unit tests added to extensions/discord/src/monitor/gateway-plugin.test.ts verifying that heartbeatInterval and firstHeartbeatTimeout are cleared when connect() is called while isConnecting=true.

Fixes #65009

Changed files

  • extensions/discord/src/monitor/gateway-plugin.test.ts (added, +115/-0)
  • extensions/discord/src/monitor/gateway-plugin.ts (modified, +17/-0)
  • scripts/check-no-raw-channel-fetch.mjs (modified, +2/-2)

Code Example

[openclaw] Uncaught exception: Error: Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)
    at Object.reconnectCallback (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/src/plugins/gateway/GatewayPlugin.ts:211/217:...)
    at Timeout.sendHeartbeat [as _onTimeout] (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/src/plugins/gateway/utils/heartbeat.ts:31:...)

---

openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
openclaw-gateway.service: Failed with result 'exit-code'.
openclaw-gateway.service: Scheduled restart job ...
Starting openclaw-gateway.service - OpenClaw Gateway ...
Started openclaw-gateway.service - OpenClaw Gateway ...

---

[plugins] codex failed during register from /usr/lib/node_modules/openclaw/dist/extensions/codex/index.js: TypeError: api.registerAgentHarness is not a function

---

2026-04-11T18:30:54+00:00 [openclaw] Uncaught exception: Error: Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)
    at Object.reconnectCallback (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/src/plugins/gateway/GatewayPlugin.ts:211:...)
    at Timeout.sendHeartbeat [as _onTimeout] (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/src/plugins/gateway/utils/heartbeat.ts:31:...)
2026-04-11T18:30:55+00:00 openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
2026-04-11T18:31:00+00:00 openclaw-gateway.service: Scheduled restart job, restart counter is at 8.
2026-04-11T18:31:16+00:00 Started openclaw-gateway.service - OpenClaw Gateway (v2026.3.12).

Bug type

Crash (process/app exits or hangs)

Beta release blocker

No

Summary

My OpenClaw gateway process intermittently crashes with:

Error: Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)

The process exits, then systemd --user automatically restarts openclaw-gateway.service.

This appears to be happening in the gateway heartbeat / reconnect path, not as a full machine reboot.

Environment

  • OpenClaw version: v2026.3.12
    (I updated the gateway, but the issue still occurred afterward)
  • Host: Linux
  • Service managed by: systemd --user
  • Channel in use: Discord
  • Model: openai-codex/gpt-5.4

Observed behavior

The gateway runs normally for a while, then crashes with an uncaught exception. systemd notices the failure and restarts the service.

Representative crash sequence:

[openclaw] Uncaught exception: Error: Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)
    at Object.reconnectCallback (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/src/plugins/gateway/GatewayPlugin.ts:211/217:...)
    at Timeout.sendHeartbeat [as _onTimeout] (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/src/plugins/gateway/utils/heartbeat.ts:31:...)

Then:

openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
openclaw-gateway.service: Failed with result 'exit-code'.
openclaw-gateway.service: Scheduled restart job ...
Starting openclaw-gateway.service - OpenClaw Gateway ...
Started openclaw-gateway.service - OpenClaw Gateway ...

What I checked

  • This is not a host reboot. The machine stayed on the same boot.
  • I checked for OOM / kernel panic / thermal issues and did not find evidence that those caused the restart.
  • The crash appears to come from OpenClaw itself, specifically the heartbeat / reconnect logic.
  • I checked whether cron jobs posting to Discord were the immediate cause. I did not find strong evidence that Discord posting is the root cause. At most, activity may expose the bug, but the fatal event is in heartbeat/reconnect handling.

Extra observations

After updating, I also saw:

[plugins] codex failed during register from /usr/lib/node_modules/openclaw/dist/extensions/codex/index.js: TypeError: api.registerAgentHarness is not a function

But this did not seem to be the direct crash trigger. The actual fatal event was still the zombie reconnect exception.

Expected behavior

If a connection is stale/dead, the gateway should:

  • ignore the stale connection
  • establish a fresh one if needed
  • log a warning/error
  • continue running

It should not throw an uncaught exception that kills the whole gateway process.
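One way to get the "log and continue" behavior listed above is to guard timer callbacks so that a throw is reported to a handler instead of escaping as an uncaught exception. This is a hedged sketch, not what OpenClaw does: `guardTimerCallback` and the `onError` handler are hypothetical, and a real supervisor might forward the error to its own 'error' path and schedule a fresh connection.

```typescript
// Hypothetical helper: wraps a timer callback so a throw is handed to onError
// instead of propagating out of the timer and killing the process.
function guardTimerCallback(
  callback: () => void,
  onError: (err: Error) => void,
): () => void {
  return () => {
    try {
      callback();
    } catch (err) {
      onError(err instanceof Error ? err : new Error(String(err)));
    }
  };
}

const seen: string[] = [];

// Stand-in for a stale heartbeat tick that hits the zombie-reconnect error.
const staleHeartbeat = guardTimerCallback(
  () => {
    throw new Error("Attempted to reconnect zombie connection after disconnecting first");
  },
  (err) => {
    seen.push(err.message); // a real handler might log a warning and reconnect here
  },
);

staleHeartbeat(); // does not throw; the caller keeps running
console.log(seen.length); // 1
```

A guard like this only contains the symptom; clearing the stale timer, as the PR does, removes the cause.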

Steps to reproduce

  1. Run OpenClaw gateway as a systemd --user service on Linux with Discord enabled.
  2. Keep the gateway running for several hours under normal use, including cron activity and Discord messaging.
  3. Intermittently, the gateway crashes with: Attempted to reconnect zombie connection after disconnecting first
  4. systemd then auto-restarts the service.

OpenClaw version

v2026.3.12

Operating system

Ubuntu 24.04.4 LTS (GNU/Linux 6.8.0-107-generic x86_64)

Install method

npm global

Model

openai-codex/gpt-5.4 / anthropic/claude-opus-4.6 / anthropic/claude-sonnet-4.6

Provider / routing chain

Discord -> OpenClaw gateway -> openai-codex/gpt-5.4

Additional provider/model setup details

No response

Logs, screenshots, and evidence

2026-04-11T18:30:54+00:00 [openclaw] Uncaught exception: Error: Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)
    at Object.reconnectCallback (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/src/plugins/gateway/GatewayPlugin.ts:211:...)
    at Timeout.sendHeartbeat [as _onTimeout] (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/src/plugins/gateway/utils/heartbeat.ts:31:...)
2026-04-11T18:30:55+00:00 openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
2026-04-11T18:31:00+00:00 openclaw-gateway.service: Scheduled restart job, restart counter is at 8.
2026-04-11T18:31:16+00:00 Started openclaw-gateway.service - OpenClaw Gateway (v2026.3.12).

Impact and severity

Impact:

  • Gateway process crashes intermittently during normal operation
  • Messaging becomes temporarily unavailable until systemd restarts the service
  • In-flight replies or tool actions may fail or be interrupted
  • Reduces reliability of cron jobs and Discord interactions
  • Creates noisy restart churn and makes the system hard to trust for unattended use

Severity: High

Additional information

  • The crash appears to be in the gateway connection layer, not in the model provider itself.
  • systemd --user auto-restart masked the failure somewhat by bringing the service back quickly, but the crash is still user-visible because in-flight replies can fail.
  • At least one run showed a Discord-side symptom shortly before a crash:
    • discord final reply failed: AbortError: This operation was aborted
    • This may be related, or just a downstream symptom of connection instability.
  • After updating, I also observed a likely separate compatibility warning:
    • TypeError: api.registerAgentHarness is not a function
    • from the Codex plugin registration path
    • This did not appear to be the direct fatal error, but it may indicate a version mismatch worth checking.
  • The gateway logs also showed repeated Bonjour advertiser warnings before/after restarts:
    • watchdog detected non-announced service; attempting re-advertise
    • probably secondary, but worth mentioning in case it points to broader connection lifecycle issues.
  • Repro is not deterministic on demand

Extent analysis

TL;DR

The OpenClaw gateway process crashes intermittently when a stale heartbeat timer in the Discord gateway path fires after a disconnect and throws an uncaught "zombie connection" exception. PR #65087 addresses this by clearing stale heartbeat timers in SafeGatewayPlugin.connect().

Guidance

  • Investigate the reconnectCallback function in GatewayPlugin.ts to understand how zombie connections are handled and consider adding a check to ignore stale connections.
  • Verify if the sendHeartbeat function in heartbeat.ts is correctly implemented to avoid reconnecting zombie connections.
  • Check for any version mismatches between the OpenClaw gateway and the Codex plugin, as indicated by the TypeError: api.registerAgentHarness is not a function error.
  • Monitor the gateway logs for any repeated Bonjour advertiser warnings, which may indicate broader connection lifecycle issues.

Example

The fix in PR #65087 overrides connect() in SafeGatewayPlugin to clear heartbeatInterval and firstHeartbeatTimeout before calling super.connect(). A further defensive option would be a check in the reconnectCallback function that ignores stale connections before attempting to reconnect.
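The stale-connection check suggested here could look roughly like the sketch below. It is an assumption about shape only: `ConnectionState`, `makeReconnectCallback`, and the `warn` parameter are hypothetical, and the real reconnectCallback in GatewayPlugin.ts is not shown in the issue.

```typescript
// Hypothetical connection state; the real library's shape is not shown in the issue.
interface ConnectionState {
  closed: boolean;
}

// Hypothetical factory: returns a reconnect callback that ignores closed
// connections and warns, instead of throwing.
function makeReconnectCallback(
  state: ConnectionState,
  reconnect: () => void,
  warn: (message: string) => void,
): () => void {
  return () => {
    if (state.closed) {
      warn("ignoring reconnect request from a closed connection");
      return;
    }
    reconnect();
  };
}

let reconnects = 0;
const warnings: string[] = [];
const state: ConnectionState = { closed: false };

const reconnectCallback = makeReconnectCallback(
  state,
  () => {
    reconnects += 1;
  },
  (message) => {
    warnings.push(message);
  },
);

reconnectCallback(); // live connection: reconnect runs
state.closed = true;
reconnectCallback(); // closed connection: warning only, no throw

console.log(reconnects, warnings.length); // 1 1
```

Whether silently ignoring the request is safe depends on the library's own reconnect bookkeeping, so this is a workaround idea rather than a replacement for clearing the stale timer.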

Notes

The issue report itself does not identify the root cause. PR #65087 traces it to a race in the @buape/carbon heartbeat initialisation, where the heartbeat interval is created after stopHeartbeat() has already run, so the stale timer is never cleared.

Recommendation

Apply the change from PR #65087, or run a build that includes it, since the crash originates in the gateway connection layer rather than in the model provider. The Codex plugin registration error and the Bonjour advertiser warnings appear to be separate and can be checked independently.
