openclaw: Gateway crashes with uncaught ENETDOWN inside SSRF guard's outbound connect; macOS launchd silently parks the LaunchAgent

Summary

The gateway process exits with an uncaught Error: connect ENETDOWN … (and likely sibling codes ENETUNREACH/EHOSTUNREACH/ECONNREFUSED) when the local network briefly drops while the SSRF guard is performing an outbound DNS-lookup-and-connect to Telegram's API. The exception propagates to the top-level uncaught handler, OpenClaw writes a stability bundle, and the process terminates.

On macOS this is doubly costly: launchd's hidden respawn protection silently parks the LaunchAgent after a crash burst, so KeepAlive=true does not recover the service. One deployment saw a 2.4-day silent outage (2026-05-21 → 2026-05-23) before the operator noticed. A 5-minute cron watchdog installed since then had to recover the gateway again within ~10 minutes of an openclaw update, which suggests the underlying crash recurs frequently, not just on rare network events.

Versions reproduced on

  • 2026.5.7 (initial reproduction; original outage)
  • 2026.5.20 (after first upgrade)
  • 2026.5.22 (after second upgrade; recurrence caught by watchdog within 10 minutes)

Platform: macOS (Apple Silicon), Node 25.9.0, install kind pnpm.

Expected behavior

Transient network errors during an outbound DNS-lookup-and-connect should be caught and either retried with backoff or surfaced as a logged warning, not propagate to the process's uncaught handler.

Actual behavior

The exception is uncaught. Process exits. Stability bundle is written. Gateway is dead until manually restarted (or, on macOS, until something forces launchd to restart it past the respawn-protection gate).

Stack trace

From the gateway's stderr log immediately preceding the stability-bundle write:

Error: connect ENETDOWN <telegram-api-ip>:443 - Local <redacted>
    at defaultTriggerAsyncIdScope (node:internal/async_hooks:473:12)
    at emitLookup (node:net:1491:9)
    at file:///opt/homebrew/lib/node_modules/openclaw/dist/ssrf-<HASH>.js:207:3
    at node:net:1468:5
    at defaultTriggerAsyncIdScope (node:internal/async_hooks:473:12)
    at lookupAndConnect (node:net:1467:3)
    at TLSSocket.Socket.connect (node:net:1344:5)
    at Object.connect (node:internal/tls/wrap:1789:13)
    at connect (/opt/homebrew/lib/node_modules/openclaw/node_modules/undici/lib/core/connect.js:86:20)

The chunk hash on ssrf-<HASH>.js varies per build, but the file/line maps to the SSRF guard's outbound-connect callback inside the Node net lookup-and-connect path. The crash fires when the kernel returns ENETDOWN (or one of the sibling codes) from connect(2) after DNS resolution has already succeeded — i.e. the egress interface flapped between the DNS callback and the actual TCP connect.

Stability bundle (redacted)

{
  "version": 1,
  "generatedAt": "2026-05-21T07:08:43.211Z",
  "reason": "uncaught_exception",
  "process": {
    "pid": <redacted>,
    "platform": "darwin",
    "arch": "arm64",
    "node": "25.9.0",
    "uptimeMs": 316604590
  },
  "error": {
    "name": "Error",
    "code": "ENETDOWN",
    "message": "connect ENETDOWN <telegram-api-ip>:443 - Local <redacted>"
  }
}

The code: "ENETDOWN" is the key field — the exception object has a .code property set by Node's net module. A pattern match on err.code in {ENETDOWN, ENETUNREACH, EHOSTUNREACH, ECONNREFUSED, EAI_AGAIN} would catch the entire family.
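
The pattern match described above can be sketched as a small predicate. This is only an illustration: the helper name isTransientNetworkError and the constant TRANSIENT_NETWORK_CODES are hypothetical and not part of OpenClaw; the code list is the one proposed in this issue.

```typescript
// Hypothetical helper, not an OpenClaw API. The code list comes from the issue text.
const TRANSIENT_NETWORK_CODES: ReadonlySet<string> = new Set([
  "ENETDOWN",
  "ENETUNREACH",
  "EHOSTUNREACH",
  "ECONNREFUSED",
  "EAI_AGAIN",
]);

// Returns true when `err` is an object whose string `.code` is in the list above.
function isTransientNetworkError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return typeof code === "string" && TRANSIENT_NETWORK_CODES.has(code);
}
```

Matching on .code rather than on the message text avoids depending on the redacted IP address and port that appear in the message.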

Reproduction

Hard to reproduce on demand — depends on the local egress interface flapping at the exact moment of an outbound Telegram polling/send. In practice:

  • On a stable network, the gateway can run for hours.
  • On a flaky network (or across a macOS sleep/wake cycle, or after toggling Wi-Fi with networksetup -setairportpower en0 off; sleep 1; networksetup -setairportpower en0 on), the crash fires within minutes.
  • After installing a 5-min launchctl kickstart watchdog on top of v2026.5.22, we observed one auto-recovery within ~10 minutes of openclaw update completing — suggesting it's not just a once-per-deploy event.

Related changelog item that hinted at this family

v2026.5.22 shipped:

Cron: honor cron.retry.retryOn: ["network"] for common network error codes such as EAI_AGAIN, EHOSTUNREACH, and ENETUNREACH.

That change covers the cron subsystem's retry policy but doesn't affect the SSRF guard's outbound connect callback, which is in the gateway's general network path (used by Telegram polling and sends, not only by cron).

Suggested fix

Wrap the connect callback inside the SSRF guard (ssrf-*.js:~207, the file that's currently bundled from src/ssrf/connect.ts or similar) so that errors with code ∈ {ENETDOWN, ENETUNREACH, EHOSTUNREACH, ECONNREFUSED, EAI_AGAIN} either:

  1. Bubble up to a higher-level retry-with-backoff handler in the Telegram client (preferred — Telegram already has retry logic for HTTP 421; extending it to transport-level errors is natural), or
  2. At minimum, emit a warn-level log line and resolve the callback with an error that callers can .catch — instead of propagating to uncaughtException.

Either approach prevents the process exit. The watchdog/launchd issue then becomes irrelevant.
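
A minimal sketch of option 1, assuming the SSRF guard has already been changed so that the connect error reaches the caller as a promise rejection instead of escaping to uncaughtException. The names withNetworkRetry, RetryOptions, hasRetryableCode and RETRYABLE_CODES are hypothetical and do not describe OpenClaw's actual Telegram client.

```typescript
// Hypothetical sketch, not OpenClaw code: retry an async operation on transient network errors.
const RETRYABLE_CODES: ReadonlySet<string> = new Set([
  "ENETDOWN", "ENETUNREACH", "EHOSTUNREACH", "ECONNREFUSED", "EAI_AGAIN",
]);

interface RetryOptions {
  maxAttempts: number; // total attempts, including the first one
  baseDelayMs: number; // delay before the second attempt
  maxDelayMs: number;  // upper bound for the exponential delay
  sleep?: (ms: number) => Promise<void>; // injectable, e.g. for tests
}

function hasRetryableCode(err: unknown): boolean {
  const code = (err as { code?: unknown } | null | undefined)?.code;
  return typeof code === "string" && RETRYABLE_CODES.has(code);
}

async function withNetworkRetry<T>(op: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      // Non-network errors, and network errors once the budget is spent, go back to the caller.
      if (!hasRetryableCode(err) || attempt >= opts.maxAttempts) throw err;
      // Exponential backoff: base, 2x base, 4x base, ... capped at maxDelayMs.
      const delay = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** (attempt - 1));
      console.warn(`transient network error (attempt ${attempt}/${opts.maxAttempts}); retrying in ${delay} ms`);
      await sleep(delay);
    }
  }
}
```

A caller would wrap each outbound request, for example withNetworkRetry(() => sendRequest(), { maxAttempts: 5, baseDelayMs: 500, maxDelayMs: 30000 }), where sendRequest stands in for whatever function performs the Telegram call. Once the attempts are exhausted the error is rethrown, so the caller still has to handle it rather than let it reach the top-level handler.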

Workarounds in use (for affected operators)

  1. 5-min watchdog script that runs launchctl print … | grep "state = running" and launchctl kickstart on a non-running state. Catches the silent-park-after-crash-burst pattern on macOS.
  2. Daily dashboard surfacing state so the outage isn't invisible behind "integrity jobs: all clean ✓".

Both are operator-side workarounds for a bug that should be fixed in the gateway.
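
The decision logic of the first workaround can be sketched as below. This is an illustration rather than the operator's actual script: isServiceRunning, watchdogTick and the injected run function are hypothetical names, and the service target is left as a parameter because the issue does not state it. In a real script, run would wrap something like Node's child_process.execFile and return the command's stdout.

```typescript
// Hypothetical sketch of the watchdog check; not the operator's actual script.
// `run` executes a command and resolves with its stdout (injected so the logic stays testable).
type Run = (cmd: string, args: string[]) => Promise<string>;

// Mirrors `launchctl print ... | grep "state = running"`.
function isServiceRunning(printOutput: string): boolean {
  return printOutput.includes("state = running");
}

// One watchdog pass: check the job and kickstart it if it is not running.
async function watchdogTick(
  serviceTarget: string,
  run: Run,
): Promise<"running" | "kickstarted"> {
  let output = "";
  try {
    output = await run("launchctl", ["print", serviceTarget]);
  } catch {
    // A failing `launchctl print` is treated the same as "not running".
  }
  if (isServiceRunning(output)) return "running";
  await run("launchctl", ["kickstart", serviceTarget]);
  return "kickstarted";
}
```

Scheduling one watchdogTick every 5 minutes, with the same service target as the existing launchctl print command, would match the cadence described above. Logging each "kickstarted" result would also give the recurrence data mentioned later in this report.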

Why this is high-impact on macOS specifically

macOS launchd's KeepAlive=true is not a sufficient recovery mechanism. After a crash burst, launchd silently parks the job (no documented signal, no log entry, no audit trail). Combined with the fact that integrity-monitoring jobs typically check file hashes / config drift and not service liveness, the gateway can stay dead indefinitely while monitoring reports "all clean." Linux/systemd users may not see the same outage duration because systemd's Restart=on-failure doesn't have the same hidden gate.

The OpenClaw fix should not rely on the supervisor catching the crash, because the supervisor often won't.

Operator data available on request

Happy to share, in private if preferred:

  • The full sanitized stability JSON (openclaw-stability-2026-05-21T07-08-43-211Z-…uncaught_exception.json)
  • Last ~30 sanitized lines of gateway.err.log around the crash
  • The watchdog's stdout log showing the recurrence cadence over the next 48h once it accumulates
