openclaw: agents.create triggers gateway restart, breaking in-flight calls (~1.5–2s port unavailability) [1 comment, 2 participants]

GitHub stats
openclaw/openclaw#76208 (fetched 2026-05-03 04:40:45)
Comments: 1 · Participants: 2 · Timeline: 3 (closed ×1, commented ×1, cross-referenced ×1) · Reactions: 2

Summary

When a backend client calls agents.create over WebSocket, the gateway writes the new agent's auth token to <state>/devices/paired.json. The config-watcher detects this change and concludes a gateway restart is needed; SIGUSR1 fires immediately, the gateway shuts down cleanly, then re-binds — closing port 18789 for ~1.5–2 seconds.

Any in-flight HTTP or WS calls during that window fail with Connection closed (WS — pending requests rejected when the socket closes) or ECONNREFUSED (HTTP — port not listening). The client can't tell the gateway did this to itself, so the failures look like a genuine outage.

Repro

  1. Backend with operator scopes connects via WS.
  2. Backend issues agents.create for a new agent, immediately followed by agents.files.set for several workspace files (e.g. SOUL.md, IDENTITY.md, USER.md, MEMORY.md) and a POST /v1/chat/completions.
  3. The follow-up calls fail.

Gateway log timeline

18:15:11.366  Config overwrite: paired.json (sha256 → new)
18:15:12.224  [ws] ⇄ res ✓ agents.create 4758ms
18:15:12.230  [reload] config change detected; evaluating reload
              (gateway.auth.token, gateway.tailscale, agents.list, meta.lastTouchedAt)
18:15:12.237  [reload] config change requires gateway restart
              (gateway.auth.token, gateway.tailscale)
18:15:12.241  [gateway] signal SIGUSR1 received
18:15:12.666  [shutdown] completed cleanly in 371ms
              ⏤ port 18789 closed ⏤
18:15:14.231  [gateway] starting HTTP server...
18:15:15.281  [gateway] ready

That is ~1.6 s with port 18789 closed (18:15:12.666 to 18:15:14.231), and ~2.6 s until the gateway reports ready, on a routine "add a token" operation.
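The figure can be checked directly from the log timestamps. A minimal sketch, assuming the listener goes away at the "[shutdown] completed cleanly" line and comes back at "[gateway] starting HTTP server..." (the log does not say exactly when the socket is closed or re-bound, so treat both as approximations):

```python
from datetime import datetime

# Timestamps copied from the gateway log timeline.
SHUTDOWN_DONE = "18:15:12.666"   # [shutdown] completed cleanly
HTTP_STARTING = "18:15:14.231"   # [gateway] starting HTTP server...
GATEWAY_READY = "18:15:15.281"   # [gateway] ready


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two HH:MM:SS.mmm timestamps."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


port_closed = seconds_between(SHUTDOWN_DONE, HTTP_STARTING)
until_ready = seconds_between(SHUTDOWN_DONE, GATEWAY_READY)
print(f"port closed ~{port_closed:.3f}s, not ready for ~{until_ready:.3f}s")
```

Any client call issued inside that interval is the one that sees Connection closed or ECONNREFUSED.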

Why this is a bug

  • agents.create is on the hot path for any multi-tenant deployment (one gateway hosting many users' agents). Every new user / re-provision pays the restart tax.
  • Token addition is logically additive; the running process should be able to absorb it without restarting.
  • gateway.tailscale showing in the reload diff is suspicious — adding an agent token shouldn't dirty Tailscale config.
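To make "logically additive" concrete, here is a rough sketch of a check the reload path could run before deciding to restart. It is not OpenClaw's code: the function names are hypothetical, and the sample entry shape (`deviceId`, `tokens.operator` as a list) is an assumption made only for illustration.

```python
from typing import Any


def is_additive(old: Any, new: Any) -> bool:
    """Return True if `new` keeps everything in `old` and only adds to it."""
    if isinstance(old, dict) and isinstance(new, dict):
        # Every existing key must survive, and its value may only grow.
        return all(k in new and is_additive(v, new[k]) for k, v in old.items())
    if isinstance(old, list) and isinstance(new, list):
        # Existing entries keep their position; new entries are appended.
        return len(new) >= len(old) and all(
            is_additive(o, n) for o, n in zip(old, new)
        )
    # Scalars (token strings, flags) must be unchanged.
    return old == new


def reload_action(old: Any, new: Any) -> str:
    """Pick a hot reload for purely additive changes, a restart otherwise."""
    return "hot-reload" if is_additive(old, new) else "restart"


# Hypothetical paired.json contents before and after agents.create.
before = [{"deviceId": "backend", "tokens": {"operator": ["tok-a"]}}]
added = [{"deviceId": "backend", "tokens": {"operator": ["tok-a", "tok-b"]}}]
rotated = [{"deviceId": "backend", "tokens": {"operator": ["tok-z"]}}]

print(reload_action(before, added))    # additive change
print(reload_action(before, rotated))  # existing token replaced
```

A real implementation would likely also need an allow-list for bookkeeping fields that change on every write, such as the meta.lastTouchedAt path seen in the reload diff.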

Proposed fixes (any of)

  1. Don't restart on agent-token additions. A new entry in paired.json[].tokens.operator shouldn't require a re-bind. The runtime can re-read the trusted-token table on the next handshake.
  2. Drain in-flight calls before closing the port. Keep the listener up until pending requests drain (or a short timeout), then re-bind.
  3. Atomic restart — bring the new instance up on a different ephemeral port, hand off via SO_REUSEPORT or socket-passing, then close the old listener. No window of zero-listeners.
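Option 3 can be illustrated with plain sockets. This is a sketch, not the gateway's restart path: both listeners live in one process here, whereas a real handoff would span the old and new gateway processes, and it shares the same port through SO_REUSEPORT rather than moving through a separate ephemeral port. SO_REUSEPORT is assumed to be available (Linux and the BSDs; not Windows).

```python
import socket


def listen_reuseport(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Open a listening TCP socket that shares its port via SO_REUSEPORT."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    sock.settimeout(2)  # keep the demo from hanging if accept() never fires
    return sock


def demo_handoff() -> bool:
    """Start a second listener on the same port, then retire the first."""
    old = listen_reuseport(0)      # port 0: the OS picks a free port
    port = old.getsockname()[1]
    new = listen_reuseport(port)   # both listeners now share the port
    old.close()                    # the port never has zero listeners
    client = socket.create_connection(("127.0.0.1", port), timeout=2)
    conn, _ = new.accept()         # served by the new listener
    conn.close()
    client.close()
    new.close()
    return True


print(demo_handoff())
```

One caveat with this approach: connections already queued on the old listener but not yet accepted can still be dropped when it closes, so the old side should stop being a target only after it has drained.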

(1) is the cleanest. (2) is the smallest change that fixes user-visible impact.
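For (2), the core of the change is tracking in-flight requests and waiting for them before the listener closes. A minimal asyncio sketch of that idea follows; `InFlightTracker` and `demo` are hypothetical names, and how this would hook into the gateway's actual shutdown handler is not something the issue specifies.

```python
import asyncio


class InFlightTracker:
    """Counts running requests so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self._count = 0
        self._idle = asyncio.Event()
        self._idle.set()  # nothing in flight yet

    async def run(self, coro):
        """Run one request while keeping the in-flight count up to date."""
        self._count += 1
        self._idle.clear()
        try:
            return await coro
        finally:
            self._count -= 1
            if self._count == 0:
                self._idle.set()

    async def drain(self, timeout: float) -> bool:
        """Wait until idle; return False if the timeout expired first."""
        try:
            await asyncio.wait_for(self._idle.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def demo() -> list:
    tracker = InFlightTracker()
    events = []

    async def slow_request():
        await asyncio.sleep(0.05)  # stands in for a pending WS/HTTP call
        events.append("request done")

    task = asyncio.create_task(tracker.run(slow_request()))
    await asyncio.sleep(0)         # let the request start
    drained = await tracker.drain(timeout=1.0)
    events.append("listener closed" if drained else "drain timed out")
    await task
    return events


print(asyncio.run(demo()))
```

The timeout matters: without it, one stuck request would block the restart indefinitely.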

Workaround

Clients can wrap WS/HTTP calls in retry-with-backoff to mask the window, but that's a per-client fix and adds latency to every legitimate failure too. Fixing it at the gateway is the right place.
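For anyone who needs the client-side workaround in the meantime, a minimal sketch looks like this. `call_with_retry` is a hypothetical helper, and it assumes the client surfaces the failure as a Python `ConnectionError` (which covers ECONNREFUSED via `ConnectionRefusedError`); a WebSocket library may raise its own exception type for "Connection closed", which would need to be added to the `except` clause.

```python
import time


def call_with_retry(call, attempts=5, base_delay=0.2, sleep=time.sleep):
    """Call `call()` and retry on connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real failure
            sleep(base_delay * 2 ** attempt)  # 0.2s, 0.4s, 0.8s, 1.6s


# Demo: a fake call that fails twice, as if the gateway were mid-restart.
attempts_seen = []


def flaky():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise ConnectionRefusedError("gateway restarting")
    return "ok"


delays = []
result = call_with_retry(flaky, sleep=delays.append)  # record delays, don't wait
print(result, delays)
```

With these defaults the retries span 3 s in total, which covers the window in the log above. It also means a genuine outage takes 3 s longer to report, which is the latency cost already mentioned.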

Environment

  • OpenClaw image built from main (commit 47009dd7)
  • Container running on GCE Container-Optimized OS
  • Gateway started with agents.defaults.skipBootstrap: true, pre-shared device pairing

Extent analysis

TL;DR

The gateway can be fixed by not restarting on agent-token additions or by draining in-flight calls before closing the port.

Guidance

  • Verify that the issue is caused by the gateway restarting after an agent-token addition by checking the gateway log timeline for the "config change detected" and "signal SIGUSR1 received" messages.
  • Consider implementing one of the proposed fixes: not restarting on agent-token additions, draining in-flight calls before closing the port, or using an atomic restart.
  • To mitigate the issue, clients can wrap WS/HTTP calls in retry-with-backoff, but this is not a recommended long-term solution.
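The log check in the first point can be scripted. A small sketch, assuming the message texts match the ones in the gateway log timeline from the issue (other versions may word them differently); `restart_triggered_by_reload` is a hypothetical helper name:

```python
def restart_triggered_by_reload(log_lines):
    """Return True if a restart-requiring reload is followed by SIGUSR1."""
    saw_reload = False
    for line in log_lines:
        if "config change requires gateway restart" in line:
            saw_reload = True
        elif saw_reload and "signal SIGUSR1 received" in line:
            return True
    return False


# Lines copied from the issue's log timeline.
sample = [
    "18:15:12.224  [ws] ⇄ res ✓ agents.create 4758ms",
    "18:15:12.230  [reload] config change detected; evaluating reload",
    "18:15:12.237  [reload] config change requires gateway restart",
    "18:15:12.241  [gateway] signal SIGUSR1 received",
]
print(restart_triggered_by_reload(sample))
```

If this matches right after an agents.create response, the failures are most likely the self-inflicted restart rather than a genuine outage.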

Example

The issue itself includes no code snippet. The fix is a change to the gateway's reload and restart behavior, not to client code.

Notes

The issue was observed on the OpenClaw image built from main (commit 47009dd7). Other versions or environments may behave differently.

Recommendation

Prefer fix (1): don't restart on agent-token additions. It is the cleanest option, since the runtime can re-read the trusted-token table on the next handshake without a restart. Client-side retry-with-backoff is only a stopgap.
