
openclaw - ✅(Solved) Fix [Bug]: Gateway crashes with Attempted to reconnect zombie connection after disconnecting first and is auto-restarted by systemd [1 pull requests, 1 participants]

openclaw, 2026-04-11 20:58:27

GitHub stats

openclaw/openclaw#65009•Fetched 2026-04-12 13:26:03

Author

rivluc

Participants

rivluc

Timeline (top)

labeled ×2, cross-referenced ×1, referenced ×1

My OpenClaw gateway process intermittently crashes with:

Error: Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)

The process exits, then systemd --user automatically restarts openclaw-gateway.service.

This appears to be happening in the gateway heartbeat / reconnect path, not as a full machine reboot.

Error Message

The gateway runs normally for a while, then crashes with an uncaught exception. systemd notices the failure and restarts the service.

Root Cause

What I checked

This is not a host reboot. The machine stayed on the same boot.
I checked for OOM / kernel panic / thermal issues and did not find evidence that those caused the restart.
The crash appears to come from OpenClaw itself, specifically the heartbeat / reconnect logic.
I checked whether cron jobs posting to Discord were the immediate cause. I did not find strong evidence that Discord posting is the root cause. At most, activity may expose the bug, but the fatal event is in heartbeat/reconnect handling.

Fix Action

Fix proposed (PR open, not yet merged at fetch time)

Proposed in PR: fix(discord): clear stale heartbeat timers in SafeGatewayPlugin.connect() (https://github.com/openclaw/openclaw/pull/65087)

PR fix notes

PR #65087: fix(discord): clear stale heartbeat timers in SafeGatewayPlugin.connect()

Repository: openclaw/openclaw
Author: SARAMALI15792
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/65087

Description (problem / solution / changelog)

What does this PR do?

Adds a connect() override to SafeGatewayPlugin that clears stale heartbeat timers before delegating to the parent, preventing an intermittent uncaught exception that crashes the Discord gateway process and drops in-flight replies.

Root Cause

The published @buape/carbon release has a race in its heartbeat initialisation:

setTimeout(() => {
    sendHeartbeat()                                          // stopHeartbeat() runs here —
                                                             // but heartbeatInterval is still undefined
    heartbeatInterval = setInterval(sendHeartbeat, interval) // stale interval created after the clear
}, initialDelay)

When sendHeartbeat detects a zombie connection it calls stopHeartbeat(), which clears heartbeatInterval — but the interval has not been assigned yet. The setInterval on the next line then creates a timer whose closure holds closed=true. When it fires ~41 seconds later, reconnectCallback(closed=true) throws inside a setInterval callback. Node.js routes this to process.on('uncaughtException'), bypassing the EventEmitter.on('error') path the gateway supervisor monitors. systemd restarts the service and in-flight replies fail.
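The ordering problem can be reproduced in isolation. The sketch below is a simplified model, not Carbon's actual code: HeartbeatModel, its zombie flag, and both firstBeat methods are hypothetical names introduced only for illustration. It shows that stopping before the interval is assigned leaves a live timer behind, while assigning first lets the stop take effect. The reordered variant is shown for contrast only and is not the approach the PR takes.

```typescript
// Hypothetical model of the ordering problem; names are illustrative, not Carbon's API.
type IntervalHandle = ReturnType<typeof setInterval>;

class HeartbeatModel {
  interval: IntervalHandle | undefined;
  private zombie: boolean;

  // `zombie` stands in for "the connection was detected as dead".
  constructor(zombie: boolean) {
    this.zombie = zombie;
  }

  stop(): void {
    if (this.interval !== undefined) {
      clearInterval(this.interval);
      this.interval = undefined;
    }
  }

  private send(): void {
    if (this.zombie) this.stop();
  }

  // Same order as the snippet above: send first, assign the interval second.
  firstBeatSendThenAssign(periodMs: number): void {
    this.send(); // stop() runs here and finds nothing to clear
    this.interval = setInterval(() => this.send(), periodMs); // survives the stop
  }

  // Reordered variant: assign first, so stop() can see the interval.
  firstBeatAssignThenSend(periodMs: number): void {
    this.interval = setInterval(() => this.send(), periodMs);
    this.send(); // stop() now clears the interval created on the previous line
  }
}

const sendFirst = new HeartbeatModel(true);
sendFirst.firstBeatSendThenAssign(60000);
const staleTimerLeft = sendFirst.interval !== undefined;
sendFirst.stop(); // clean up so the sketch does not keep the process alive

const assignFirst = new HeartbeatModel(true);
assignFirst.firstBeatAssignThenSend(60000);
const reorderedTimerLeft = assignFirst.interval !== undefined;
```

In this model staleTimerLeft ends up true and reorderedTimerLeft ends up false, which matches the stale interval described in the paragraph above.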

Solution Applied

Override connect() in SafeGatewayPlugin to unconditionally clear both heartbeatInterval and firstHeartbeatTimeout before calling super.connect():

public override connect(resume = false): void {
  if (this.heartbeatInterval !== undefined) {
    clearInterval(this.heartbeatInterval);
    this.heartbeatInterval = undefined;
  }
  if (this.firstHeartbeatTimeout !== undefined) {
    clearTimeout(this.firstHeartbeatTimeout);
    this.firstHeartbeatTimeout = undefined;
  }
  super.connect(resume);
}

The parent's connect() only calls stopHeartbeat() when isConnecting=false. When isConnecting=true it returns early — leaving any stale timer alive. This override runs before that early-return check, ensuring stale timers are always cleared on reconnect.
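The effect of the early return can be sketched with a stand-in parent class. ParentGatewayModel and SafeGatewayModel below are hypothetical simplifications written for this example; the real parent has more state, and its stopHeartbeat() may differ from the one modelled here.

```typescript
type TimerHandle = ReturnType<typeof setTimeout>;

// Hypothetical stand-in for the parent plugin; not Carbon's real class.
class ParentGatewayModel {
  isConnecting = false;
  heartbeatInterval: TimerHandle | undefined;
  firstHeartbeatTimeout: TimerHandle | undefined;
  lastResume: boolean | undefined;

  stopHeartbeat(): void {
    if (this.heartbeatInterval !== undefined) clearInterval(this.heartbeatInterval);
    if (this.firstHeartbeatTimeout !== undefined) clearTimeout(this.firstHeartbeatTimeout);
    this.heartbeatInterval = undefined;
    this.firstHeartbeatTimeout = undefined;
  }

  hasLiveTimers(): boolean {
    return this.heartbeatInterval !== undefined || this.firstHeartbeatTimeout !== undefined;
  }

  connect(resume = false): void {
    if (this.isConnecting) return; // early return: existing timers are left alone
    this.stopHeartbeat();
    this.isConnecting = true;
    this.lastResume = resume;
  }
}

class SafeGatewayModel extends ParentGatewayModel {
  override connect(resume = false): void {
    // Same effect as the explicit clears in the override above,
    // and it runs before the parent's early-return check.
    this.stopHeartbeat();
    super.connect(resume);
  }
}

// Put a gateway into the "already connecting, timers still armed" state.
function armWhileConnecting(gateway: ParentGatewayModel): void {
  gateway.isConnecting = true;
  gateway.heartbeatInterval = setInterval(() => {}, 60000);
  gateway.firstHeartbeatTimeout = setTimeout(() => {}, 60000);
}

const parentModel = new ParentGatewayModel();
armWhileConnecting(parentModel);
parentModel.connect();
const parentKeepsTimers = parentModel.hasLiveTimers();
parentModel.stopHeartbeat(); // clean up so the sketch can exit

const safeModel = new SafeGatewayModel();
armWhileConnecting(safeModel);
safeModel.connect();
const safeKeepsTimers = safeModel.hasLiveTimers();
```

In this model the parent keeps its timers when connect() is called while isConnecting is true, and the subclass does not. That is the same property the PR's two unit tests are described as checking.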

Bottleneck Solved

Eliminates intermittent gateway crashes caused by the stale setInterval firing with a closed reconnectCallback
No process restart, no dropped in-flight replies
Works with the currently published @buape/carbon release without requiring a version bump

Testing

pnpm test:extension discord

Two new unit tests added to extensions/discord/src/monitor/gateway-plugin.test.ts verifying that heartbeatInterval and firstHeartbeatTimeout are cleared when connect() is called while isConnecting=true.

Fixes #65009

Changed files

extensions/discord/src/monitor/gateway-plugin.test.ts (added, +115/-0)
extensions/discord/src/monitor/gateway-plugin.ts (modified, +17/-0)
scripts/check-no-raw-channel-fetch.mjs (modified, +2/-2)

Bug type

Crash (process/app exits or hangs)

Beta release blocker

Environment

OpenClaw version: v2026.3.12
(I updated the gateway, but the issue still occurred afterward)
Host: Linux
Service managed by: systemd --user
Channel in use: Discord
Model: openai-codex/gpt-5.4

Observed behavior

The gateway runs normally for a while, then crashes with an uncaught exception. systemd notices the failure and restarts the service.

Representative crash sequence:

[openclaw] Uncaught exception: Error: Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)
    at Object.reconnectCallback (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/src/plugins/gateway/GatewayPlugin.ts:211/217:...)
    at Timeout.sendHeartbeat [as _onTimeout] (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/src/plugins/gateway/utils/heartbeat.ts:31:...)

Then:

openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
openclaw-gateway.service: Failed with result 'exit-code'.
openclaw-gateway.service: Scheduled restart job ...
Starting openclaw-gateway.service - OpenClaw Gateway ...
Started openclaw-gateway.service - OpenClaw Gateway ...

Extra observations

After updating, I also saw:

[plugins] codex failed during register from /usr/lib/node_modules/openclaw/dist/extensions/codex/index.js: TypeError: api.registerAgentHarness is not a function

But this did not seem to be the direct crash trigger. The actual fatal event was still the zombie reconnect exception.

Expected behavior

If a connection is stale/dead, the gateway should:

ignore the stale connection
establish a fresh one if needed
log a warning/error
continue running

It should not throw an uncaught exception that kills the whole gateway process.
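One way to get this behavior for timer-driven code is to catch a throw inside the timer callback and turn it into an "error" event. The sketch below uses Node's standard EventEmitter. guardTimerCallback and gatewayEvents are hypothetical names, and nothing here is taken from OpenClaw or Carbon.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical helper: wraps a timer callback so a throw becomes an "error" event
// instead of an uncaught exception.
function guardTimerCallback(emitter: EventEmitter, callback: () => void): () => void {
  return () => {
    try {
      callback();
    } catch (err) {
      const error = err instanceof Error ? err : new Error(String(err));
      // emit("error") itself throws when no "error" listener is registered,
      // so a listener must be attached before any guarded timer starts.
      emitter.emit("error", error);
    }
  };
}

const gatewayEvents = new EventEmitter();
const reported: string[] = [];
gatewayEvents.on("error", (error: Error) => {
  reported.push(error.message); // stand-in for "log a warning/error"
});

const guardedBeat = guardTimerCallback(gatewayEvents, () => {
  throw new Error("Attempted to reconnect zombie connection");
});

// Called directly here; in practice the guarded function would be passed to setInterval.
guardedBeat();
```

After the call, the error sits in reported and the process continues. Establishing a fresh connection would then be up to whatever listens for the event.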

Steps to reproduce

Run OpenClaw gateway as a systemd --user service on Linux with Discord enabled.
Keep the gateway running for several hours under normal use, including cron activity and Discord messaging.
Intermittently, the gateway crashes with: Attempted to reconnect zombie connection after disconnecting first
systemd then auto-restarts the service.

OpenClaw version

v2026.3.12

Operating system

Ubuntu 24.04.4 LTS (GNU/Linux 6.8.0-107-generic x86_64)

Install method

npm global

Model

openai-codex/gpt-5.4 / anthropic/claude-opus-4.6 / anthropic/claude-sonnet-4.6

Provider / routing chain

Discord -> OpenClaw gateway -> openai-codex/gpt-5.4

Additional provider/model setup details

No response

Logs, screenshots, and evidence

2026-04-11T18:30:54+00:00 [openclaw] Uncaught exception: Error: Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)
    at Object.reconnectCallback (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/src/plugins/gateway/GatewayPlugin.ts:211:...)
    at Timeout.sendHeartbeat [as _onTimeout] (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/src/plugins/gateway/utils/heartbeat.ts:31:...)
2026-04-11T18:30:55+00:00 openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
2026-04-11T18:31:00+00:00 openclaw-gateway.service: Scheduled restart job, restart counter is at 8.
2026-04-11T18:31:16+00:00 Started openclaw-gateway.service - OpenClaw Gateway (v2026.3.12).

Impact and severity

Impact:

Gateway process crashes intermittently during normal operation
Messaging becomes temporarily unavailable until systemd restarts the service
In-flight replies or tool actions may fail or be interrupted
Reduces reliability of cron jobs and Discord interactions
Creates noisy restart churn and makes the system hard to trust for unattended use

Severity: High

Additional information

The crash appears to be in the gateway connection layer, not in the model provider itself.
systemd --user auto-restart masked the failure somewhat by bringing the service back quickly, but the crash is still user-visible because in-flight replies can fail.
At least one run showed a Discord-side symptom shortly before a crash:
- discord final reply failed: AbortError: This operation was aborted
- This may be related, or just a downstream symptom of connection instability.
After updating, I also observed a likely separate compatibility warning:
- TypeError: api.registerAgentHarness is not a function
- from the Codex plugin registration path
- This did not appear to be the direct fatal error, but it may indicate a version mismatch worth checking.
The gateway logs also showed repeated Bonjour advertiser warnings before/after restarts:
- watchdog detected non-announced service; attempting re-advertise
- probably secondary, but worth mentioning in case it points to broader connection lifecycle issues.
Repro is not deterministic on demand

extent analysis

TL;DR

The OpenClaw gateway process crashes intermittently when its heartbeat path tries to reconnect a zombie connection and throws an uncaught exception. Handling stale connections defensively, or adopting the fix proposed in PR #65087, may resolve the issue.

Guidance

Investigate the reconnectCallback function in GatewayPlugin.ts to understand how zombie connections are handled and consider adding a check to ignore stale connections.
Verify if the sendHeartbeat function in heartbeat.ts is correctly implemented to avoid reconnecting zombie connections.
Check for any version mismatches between the OpenClaw gateway and the Codex plugin, as indicated by the TypeError: api.registerAgentHarness is not a function error.
Monitor the gateway logs for any repeated Bonjour advertiser warnings, which may indicate broader connection lifecycle issues.

Example

A potential workaround is to add a check in the reconnectCallback function that ignores stale connections before attempting to reconnect.
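As a rough illustration of such a check, each connection could carry a generation number, and a reconnect request from an older generation could be ignored with a warning instead of throwing. ConnectionGenerations and its methods are hypothetical; this is not how reconnectCallback in @buape/carbon is written.

```typescript
// Hypothetical sketch: each connection gets a generation number, and reconnect
// requests from older generations are ignored instead of throwing.
class ConnectionGenerations {
  private current = 0;
  readonly warnings: string[] = [];

  // Call when a fresh connection is established; returns its generation token.
  open(): number {
    this.current += 1;
    return this.current;
  }

  // Returns true only when the caller belongs to the live connection.
  shouldReconnect(generation: number): boolean {
    if (generation !== this.current) {
      this.warnings.push(`ignored reconnect request from stale connection ${generation}`);
      return false;
    }
    return true;
  }
}

const generations = new ConnectionGenerations();
const firstConnection = generations.open();
const secondConnection = generations.open(); // replaces the first connection

const staleAllowed = generations.shouldReconnect(firstConnection); // stale: warns and refuses
const liveAllowed = generations.shouldReconnect(secondConnection); // live: proceeds
```

A timer left over from the first connection would then produce a warning rather than a crash. The leftover timer would still need to be cleared separately.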

Notes

The issue report on its own does not identify a root cause. The notes for PR #65087 attribute the crash to a heartbeat initialisation race in @buape/carbon, so the guidance above is best read alongside that analysis.

Recommendation

Apply a workaround to handle stale connections, as the issue appears to be related to the gateway connection layer. Upgrading to a potentially fixed version may also resolve the issue, but this is not explicitly implied in the provided information.
