openclaw - 💡(How to fix) Gateway becomes zombie after system CA rotation; internal reconnect loop cannot recover; Discord READY log line also missing in 2026.4.5 [1 participant]

openclaw · 2026-04-08 15:27:48

openclaw/openclaw#63223 • Fetched 2026-04-09 07:56:43

Author

jdroblee-afk

Participants

jdroblee-afk

Two related problems surfaced together during a 2026-04-08 outage: a gateway that cannot recover after a system CA rotation, and a missing Discord READY log line.

Error Message

[discord] suppressed late gateway other error after dispose: Error: certificate has expired
[discord] suppressed late gateway reconnect-exhausted error after dispose: Error: Max reconnect attempts (50) reached after code 1006

Root Cause

The embedded Discord provider's auto-restart loop (attempts 2/10 → 10/10 with exponential backoff up to 300s) cannot recover from this, because every restart-within-the-same-process reuses the same cached CAs.



Upstream Bug 3 — Gateway becomes a zombie after system CA rotation; Discord "logged in" READY log line also missing

Repo: github.com/openclaw/openclaw
Suggested labels: bug, gateway, discord, stability, observability
OpenClaw version: 2026.4.5
Node: 22.22.1
OS: macOS 14 (Apple Silicon)
Severity: High (silent outage that the built-in reconnect loop cannot recover from)

Title

Long-running gateway becomes permanently unable to connect to Discord after midnight system CA rotation ("certificate has expired"); internal reconnect loop cannot recover, only a full launchctl kickstart fixes it. Separately, the [discord] logged in to discord as … READY log line no longer fires, making the broken state invisible to watchdogs.

Summary

Two related problems surfaced together during a 2026-04-08 outage:

Problem A — Cached system CAs cause unrecoverable zombie state

The gateway daemon reads the system's root CA store once at process startup (it runs with NODE_USE_SYSTEM_CA=1 per its launchd env). When the OS keychain rotates an intermediate or root CA while the process is already running, the gateway's cached TLS context retains the old trust anchors forever. All subsequent outbound TLS connections — most visibly to gateway.discord.gg and discord.com — fail with Error: certificate has expired, even though a freshly-spawned Node process on the same machine can complete the TLS handshake without issue.

Problem B — "logged in" READY log line is missing in 2026.4.5

In prior builds the gateway emitted [discord] logged in to discord as <id> (<username>) when the Discord client's 'ready' event fired. In 2026.4.5 we only see [discord] client initialized as <id> (<username>); awaiting gateway readiness and then nothing — even when the bot is actually fully connected, has live presence, and is responding to REST operations. This makes it impossible for watchdogs (or humans grepping the log) to tell whether the bot is healthy or stuck.

Environment

OpenClaw 2026.4.5
Node 22.22.1 (NODE_USE_SYSTEM_CA=1 in launchd env)
macOS 14 (Apple Silicon)
LaunchAgent: ai.openclaw.gateway

Reproduction — Problem A

Start the gateway via launchctl. Let it run overnight.
During the overnight period, a root or intermediate CA in the macOS keychain expires (or is rotated).

The gateway will begin emitting, at ~30s intervals:

[discord] gateway metadata lookup failed transiently; using default gateway url
  (Failed to get gateway information from Discord: fetch failed)
[discord] channel resolve failed; using config entries. fetch failed | certificate has expired

The internal auto-restart loop will cycle attempts 2→10 over ~10 minutes, all failing the same way.

At ~30 min the Discord provider hits Max reconnect attempts (50) reached after code 1006 and is disposed. The gateway then enters a permanent suppressed-error state:

[discord] suppressed late gateway other error after dispose: Error: certificate has expired
[discord] suppressed late gateway reconnect-exhausted error after dispose: Error: Max reconnect attempts (50) reached after code 1006

Verify the problem is process-local: in a separate terminal, run

node -e "const tls=require('tls'); const s=tls.connect(443,'gateway.discord.gg',{servername:'gateway.discord.gg'},()=>{console.log(s.authorized); s.end()});"

→ fresh Node process trusts the cert fine. Only the running gateway daemon is broken.
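The one-liner above throws if the handshake fails. A slightly longer sketch of the same check, run as its own script so that it is a fresh process, reports the outcome instead. `probeTls` is a hypothetical helper name, not part of OpenClaw, and the 5-second timeout is an arbitrary choice.

```javascript
const tls = require('node:tls');

// Resolve with { ok, error } instead of throwing, so the result is easy to log.
// With Node's default rejectUnauthorized=true, the connect callback only runs
// when the certificate chain was accepted; failures arrive as 'error' events.
function probeTls(host, port = 443, timeoutMs = 5000) {
  return new Promise((resolve) => {
    const socket = tls.connect({ host, port, servername: host }, () => {
      resolve({ ok: socket.authorized, error: null });
      socket.end();
    });
    // Give up on a silent peer; destroy() with an error triggers the handler below.
    socket.setTimeout(timeoutMs, () => socket.destroy(new Error('timeout')));
    socket.on('error', (err) => resolve({ ok: false, error: err.code || err.message }));
  });
}
```

Calling `probeTls('gateway.discord.gg').then(console.log)` from a new terminal while the daemon is failing would show whether the trust problem is limited to the long-running process, as the issue reports.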

Actual log evidence (2026-04-08 incident)

2026-04-07T18:05:11.650-05:00 [discord] logged in to discord as 1484016201360212069 (Clawbot)   ← last healthy login (old log format)
2026-04-07T21:13:01        gateway process PID 49603 started (launchd)
2026-04-08T00:29:40.940-05:00 [discord] gateway metadata lookup failed transiently; using default gateway url (Failed to get gateway information from Discord: fetch failed)
2026-04-08T00:30:22.368-05:00 [discord] channel resolve failed; using config entries. fetch failed | certificate has expired
2026-04-08T00:30:11.218-05:00 [discord] [default] auto-restart attempt 2/10 in 11s
... attempts 3-10 all fail identically ...
2026-04-08T01:16:57.772-05:00 [discord] suppressed late gateway reconnect-exhausted error after dispose: Error: Max reconnect attempts (50) reached after code 1006
2026-04-08T01:22:28.689-05:00 [discord] suppressed late gateway reconnect-exhausted error after dispose: Error: Max reconnect attempts (50) reached after code 1006
... "certificate has expired" flood every 30s for next 9 hours ...
2026-04-08T10:12:51.582-05:00 [gateway] ready (5 plugins, 0.7s)                                    ← after manual `launchctl kickstart -k`
2026-04-08T10:13:00.170-05:00 [discord] client initialized as 1484016201360212069 (Clawbot); awaiting gateway readiness
                              ← no "logged in" line, and that's the LAST [discord] log entry,
                                 but the bot is actually READY and presence/REST both work.

Reproduction — Problem B

Restart the gateway cleanly (launchctl kickstart -k gui/$UID/ai.openclaw.gateway).
Run tail -f ~/.openclaw/logs/gateway.log | grep '\[discord\]'.
Observe: [discord] client initialized as <id> (<username>); awaiting gateway readiness fires, then the [discord] log channel goes silent.
Verify the bot actually is READY:
- lsof -p <gateway_pid> shows ESTABLISHED TCP to 162.159.130.234:https / 162.159.134.234:https (gateway.discord.gg IPs)
- REST calls via the bot token succeed: curl -H "Authorization: Bot $TOKEN" https://discord.com/api/v10/users/@me → 200
- A POST to any channel via REST succeeds and shows in Discord
- Hard-refreshing the Discord client shows the bot as online
The missing READY log line means you cannot tell from logs alone whether the bot is stuck awaiting IDENTIFY/READY or is actually fine.
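The curl check in the list above can be scripted so a watchdog can run it. This is a minimal sketch, assuming Node 18+ for the global `fetch`; `botTokenRestOk` is a hypothetical name. A 200 here only confirms that the token and the REST path work. It does not prove that the gateway WebSocket reached READY, which is the gap this issue describes.

```javascript
// Returns true when GET /users/@me answers 200 for the given bot token.
// `fetchImpl` defaults to Node's global fetch and is injectable for tests.
async function botTokenRestOk(token, fetchImpl = fetch) {
  const res = await fetchImpl('https://discord.com/api/v10/users/@me', {
    headers: { Authorization: `Bot ${token}` },
  });
  return res.status === 200;
}
```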

Expected behavior

Problem A

The gateway should either:

Periodically refresh its cached TLS trust store (read system CAs on each new outbound connection, not just at process start), or
Detect the "certificate has expired" signature in its own error stream and trigger a hard process respawn (not just an internal reconnect attempt), or
Expose a plugin-level hook so operators can trigger launchctl kickstart -k from a watchdog when this signature is detected.

Problem B

When the embedded Discord client fires 'ready', the gateway should log it with identifying detail, at minimum:

[discord] ready: logged in as <username> (<id>) · <guild_count> guilds · shard <n>/<m>

Watchdogs and operators rely on this line to distinguish "bot is healthy" from "bot is stuck in IDENTIFY".

Actual behavior

Problem A

Internal reconnect loop hits its 50-attempt ceiling within ~30 min, then gives up entirely
The "dispose" state is terminal within the process lifetime
Operators must manually launchctl kickstart -k to recover
We discovered this 9 hours after the outage began because there was no externally-visible signal

Problem B

[discord] client initialized as … ; awaiting gateway readiness is the terminal log line on a healthy boot
Log-based health checks are blind to actual bot state
The discord-post.mjs workaround in our repo exists specifically because REST is the only reliable write path

Impact

9-hour silent outage on 2026-04-08 (00:29 → 10:12 CT). Bot appeared offline in Discord's member list. All Discord-channel-based agent interactions were dead.
No alerts fired because the built-in auto-restart loop swallowed the error state and the watchdog had no "healthy" log line to look for.
Only resolved when a human noticed the bot was gray and manually kicked the gateway.

Suggested fix

For Problem A

Add a periodic system-CA refresh — either on a timer (hourly?) or per-new-outbound-TLS-session.
Pattern-match "certificate has expired" in error handlers and escalate to process.exit(1) so launchd respawns cleanly (instead of an internal reconnect that inherits the broken state).
Document the NODE_USE_SYSTEM_CA=1 caveat — anyone operating this long-running service needs to know that system CA rotations require a full restart.
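The escalation idea in the list above could look roughly like the following. It is a sketch under assumptions, not OpenClaw code: the helper names, the threshold of 3, and the 5-minute window are all hypothetical. It relies on Node reporting expired certificates with the code `CERT_HAS_EXPIRED` and on `fetch failed` errors carrying the underlying error in `cause`. It also assumes the LaunchAgent is configured so that launchd restarts the process after a non-zero exit.

```javascript
// Walk an error and its `cause` chain looking for the expired-certificate signature.
function isCertExpiredError(err) {
  const seen = new Set();
  for (let e = err; e && !seen.has(e); e = e.cause) {
    seen.add(e);
    if (e.code === 'CERT_HAS_EXPIRED') return true;
    if (typeof e.message === 'string' && e.message.includes('certificate has expired')) return true;
  }
  return false;
}

// Returns a reporter that calls exit(1) once `threshold` matching errors were
// seen within `windowMs`. `exit` and `now` are injectable so the logic is testable.
function createCertExpiryEscalator({
  threshold = 3,
  windowMs = 5 * 60_000,
  exit = (code) => process.exit(code),
  now = Date.now,
} = {}) {
  let hits = [];
  return function report(err) {
    if (!isCertExpiredError(err)) return false;
    const t = now();
    hits = hits.filter((h) => t - h < windowMs); // drop hits outside the window
    hits.push(t);
    if (hits.length >= threshold) {
      exit(1); // non-zero exit so the supervisor can start a fresh process
      return true;
    }
    return false;
  };
}
```

Counting several hits before exiting is a design choice to avoid restarting on a single transient failure; a stricter policy could exit on the first match.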

For Problem B

Restore the "logged in" log line on Discord client 'ready' event — or add a new one in the same spirit.
Also log identify/resume attempts and their outcomes (session_id, shard_id, intents ack'd) so the handshake path is observable.
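A minimal sketch of the restored log line, using the format proposed under "Expected behavior". The `client` here is a plain `EventEmitter` standing in for the embedded Discord client; the real client object and the shape of its `'ready'` payload are assumptions, as are the function names.

```javascript
const { EventEmitter } = require('node:events');

// Build the line in the format proposed in the issue.
function formatReadyLine({ username, id, guildCount, shardId, shardCount }) {
  return `[discord] ready: logged in as ${username} (${id}) · ${guildCount} guilds · shard ${shardId}/${shardCount}`;
}

// Log every 'ready' event, including those after a reconnect.
// `client` is any emitter that fires 'ready' with the fields above (hypothetical shape).
function logOnReady(client, log = console.log) {
  client.on('ready', (info) => log(formatReadyLine(info)));
}
```

With a stand-in emitter, `logOnReady(client)` followed by `client.emit('ready', {...})` prints one line per READY, which gives a watchdog a positive signal to grep for.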

Workaround we are using

One-shot manual recovery: launchctl kickstart -k gui/$UID/ai.openclaw.gateway after noticing the bot is gray.
REST-only post path (tools/discord-post.mjs) that bypasses the gateway entirely for cron jobs — noted in its own header comment that this was added to avoid "session conflicts with the OpenClaw gateway bot, which was causing Clawbot to appear offline every time a cron job posted a message."
Planned: patching our gateway-watchdog to grep gateway.err.log for N+ occurrences of certificate has expired in 5 min and force a kickstart.
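The decision logic for the planned watchdog could be sketched as follows. It assumes each line of gateway.err.log starts with an ISO-8601 timestamp, as in the log excerpt in this issue; that format has not been confirmed for the error log. The function names and the default threshold of 3 are hypothetical (the issue only says "N+").

```javascript
// Count lines carrying the signature whose timestamp falls inside the window.
function countRecentCertErrors(logText, nowMs, windowMs = 5 * 60_000) {
  let count = 0;
  for (const line of logText.split('\n')) {
    if (!line.includes('certificate has expired')) continue;
    const stamp = Date.parse(line.split(' ', 1)[0]);
    if (Number.isNaN(stamp)) continue; // skip lines without a parseable timestamp
    if (stamp <= nowMs && nowMs - stamp <= windowMs) count += 1;
  }
  return count;
}

// True when the signature appeared at least `threshold` times in the window.
function shouldKickstart(logText, nowMs, { threshold = 3, windowMs = 5 * 60_000 } = {}) {
  return countRecentCertErrors(logText, nowMs, windowMs) >= threshold;
}
```

When `shouldKickstart(logText, Date.now())` returns true, the watchdog would run the same `launchctl kickstart -k gui/$UID/ai.openclaw.gateway` command used for manual recovery.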

Related

Sister issue: pi-agent-core lifecycle race → #63220
Sister issue: sessions_spawn modelApplied lying → #63221
Full incident report for the 04-08 outage (local): ~/.openclaw/workspace/output/post-restart-fallback-cascade-incident-report.md

extent analysis

TL;DR

The most likely fix for the gateway becoming a zombie after system CA rotation is to implement a periodic system-CA refresh or pattern-match "certificate has expired" in error handlers to escalate to a clean process restart.

Guidance

To address Problem A, consider adding a timer-based or per-new-outbound-TLS-session system-CA refresh to prevent the gateway from becoming a zombie after system CA rotation.
Implement error handling to detect the "certificate has expired" signature and trigger a hard process respawn instead of an internal reconnect attempt.
For Problem B, restore the "logged in" log line on the Discord client 'ready' event to provide visibility into the bot's state.
Log identify/resume attempts and their outcomes to make the handshake path observable.

Example

No code snippet is provided as the issue does not contain sufficient information to create a specific example.

Notes

The provided information suggests that the issue is related to the gateway's handling of system CA rotations and the Discord client's 'ready' event. However, without more context or code, it is difficult to provide a comprehensive solution.

Recommendation

For Problem A, apply a workaround: either refresh the system CAs periodically, or detect "certificate has expired" in error handlers and trigger a process restart. Either approach keeps the gateway from staying a zombie after a system CA rotation. For Problem B, restore the READY log line so that logs reflect the bot's actual state.
