openclaw - 💡(How to fix) Fix agents.create triggers gateway restart, breaking in-flight calls (~1.5–2s port unavailability) [1 comments, 2 participants]

openclaw, 2026-05-02 18:34:42

GitHub stats

openclaw/openclaw#76208 • Fetched 2026-05-03 04:40:45

Author

jldmentagent

Participants

clawsweeper[bot]

jldmentagent

Timeline (top)

closed ×1, commented ×1, cross-referenced ×1

When a backend client calls agents.create over WebSocket, the gateway writes the new agent's auth token to <state>/devices/paired.json. The config-watcher detects this change and concludes a gateway restart is needed; SIGUSR1 fires immediately, the gateway shuts down cleanly, then re-binds — closing port 18789 for ~1.5–2 seconds.

Any in-flight HTTP or WS calls during that window fail with Connection closed (WS — pending requests rejected when the socket closes) or ECONNREFUSED (HTTP — port not listening). The client can't tell the gateway did this to itself, so the failures look like a genuine outage.

Repro

1. Backend with operator scopes connects via WS.
2. Backend issues agents.create for a new agent, immediately followed by agents.files.set for several workspace files (e.g. SOUL.md, IDENTITY.md, USER.md, MEMORY.md) and a POST /v1/chat/completions.
3. The follow-up calls fail.

Gateway log timeline

18:15:11.366  Config overwrite: paired.json (sha256 → new)
18:15:12.224  [ws] ⇄ res ✓ agents.create 4758ms
18:15:12.230  [reload] config change detected; evaluating reload
              (gateway.auth.token, gateway.tailscale, agents.list, meta.lastTouchedAt)
18:15:12.237  [reload] config change requires gateway restart
              (gateway.auth.token, gateway.tailscale)
18:15:12.241  [gateway] signal SIGUSR1 received
18:15:12.666  [shutdown] completed cleanly in 371ms
              ⏤ port 18789 closed ⏤
18:15:14.231  [gateway] starting HTTP server...
18:15:15.281  [gateway] ready

~1.6 s of port closure on a routine "add a token" operation.
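
The ~1.6 s figure can be checked from the log timestamps. This is a minimal sketch that assumes the port is closed from the "[shutdown] completed cleanly" line until "[gateway] starting HTTP server..."; the log does not show the exact moment the listener re-binds, so the "ready" line is measured as well.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

def gap_seconds(start: str, end: str) -> float:
    """Seconds between two same-day HH:MM:SS.mmm log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()

# Timestamps copied from the gateway log timeline.
shutdown_done = "18:15:12.666"  # [shutdown] completed cleanly
http_starting = "18:15:14.231"  # [gateway] starting HTTP server...
gateway_ready = "18:15:15.281"  # [gateway] ready

closed = gap_seconds(shutdown_done, http_starting)
until_ready = gap_seconds(shutdown_done, gateway_ready)
print(f"shutdown to HTTP server start: {closed:.3f} s")
print(f"shutdown to ready: {until_ready:.3f} s")
```

Clients that only count the gateway as usable once it logs "ready" would see the longer of the two gaps.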

Why this is a bug

agents.create is on the hot path for any multi-tenant deployment (one gateway hosting many users' agents). Every new user / re-provision pays the restart tax.
Token addition is logically additive; the running process should be able to absorb it without restarting.
gateway.tailscale showing in the reload diff is suspicious — adding an agent token shouldn't dirty Tailscale config.
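
One way to investigate the last two points is to diff the config before and after agents.create and check whether the token change is purely additive. The sketch below is illustrative only: the nested-dict layout, the key names, and both helper functions are hypothetical and are not taken from the OpenClaw code base.

```python
def changed_paths(old: dict, new: dict, prefix: str = "") -> set:
    """Return dotted paths of values that differ between two nested dicts."""
    paths = set()
    for key in old.keys() | new.keys():
        path = f"{prefix}.{key}" if prefix else str(key)
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            paths |= changed_paths(a, b, path)
        elif a != b:
            paths.add(path)
    return paths

def is_additive(old_tokens: dict, new_tokens: dict) -> bool:
    """True if every existing token entry survives unchanged."""
    return all(k in new_tokens and new_tokens[k] == v
               for k, v in old_tokens.items())

# Hypothetical snapshots; real config shapes will differ.
before = {"gateway": {"tailscale": {"mode": "off"}},
          "tokens": {"agent-a": "t1"}}
after = {"gateway": {"tailscale": {"mode": "off"}},
         "tokens": {"agent-a": "t1", "agent-b": "t2"}}

diff = changed_paths(before, after)
additive = is_additive(before["tokens"], after["tokens"])
print(sorted(diff), additive)
```

If a value-level diff like this does not report gateway.tailscale while the watcher does, the watcher may be comparing something other than the values, for example serialization order or a rewritten default. That is a hypothesis to test, not a confirmed cause.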

Proposed fixes (any of)

1. Don't restart on agent-token additions. A new entry in paired.json[].tokens.operator shouldn't require a re-bind. The runtime can re-read the trusted-token table on the next handshake.
2. Drain in-flight calls before closing the port. Keep the listener up until pending requests drain (or a short timeout), then re-bind.
3. Atomic restart: bring the new instance up on a different ephemeral port, hand off via SO_REUSEPORT or socket-passing, then close the old listener. No window of zero listeners.

(1) is the cleanest. (2) is the smallest change that fixes user-visible impact.
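
The drain option can be sketched with a small in-flight counter. This is a generic asyncio illustration under stated assumptions, not the gateway's implementation: DrainGate and its methods are hypothetical names, and the sketch refuses calls that arrive after draining starts, which is one possible policy among several.

```python
import asyncio

class DrainGate:
    """Tracks in-flight calls so shutdown can wait for them (hypothetical helper)."""

    def __init__(self) -> None:
        self._in_flight = 0
        self._closing = False
        self._idle = asyncio.Event()
        self._idle.set()

    async def run(self, handler):
        """Run one request handler, counting it as in-flight."""
        if self._closing:
            raise ConnectionRefusedError("gateway is draining")
        self._in_flight += 1
        self._idle.clear()
        try:
            return await handler()
        finally:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    async def drain(self, timeout: float) -> bool:
        """Stop admitting calls; return True if all pending calls finished in time."""
        self._closing = True
        try:
            await asyncio.wait_for(self._idle.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False

async def demo() -> list:
    gate = DrainGate()

    async def slow_call():
        await asyncio.sleep(0.05)
        return "ok"

    task = asyncio.create_task(gate.run(slow_call))
    await asyncio.sleep(0)  # let the call start before draining
    drained = await gate.drain(timeout=1.0)
    return [drained, await task]

result = asyncio.run(demo())
print(result)
```

In this sketch the listener would be closed only after drain() returns. Requests that arrive after draining starts are still refused, so this shortens the failure window for pending calls but does not remove it for new ones.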

Workaround

Clients can wrap WS/HTTP calls in retry-with-backoff to mask the window, but that's a per-client fix and adds latency to every legitimate failure too. Fixing it at the gateway is the right place.
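
A minimal sketch of such a client-side wrapper follows. The function name, attempt count, and delays are assumptions chosen so that the total wait (0.5 + 1 + 2 = 3.5 s) exceeds the observed restart window. It retries only on connection errors, which in Python include ECONNREFUSED (ConnectionRefusedError).

```python
import time

def call_with_retry(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real failure
            sleep(base_delay * 2 ** attempt)

# Simulated call that is refused twice, as if the port were briefly closed.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionRefusedError("ECONNREFUSED")
    return "ok"

delays = []
result = call_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Retrying is only safe for calls that can be repeated. A request such as POST /v1/chat/completions may already have been processed when the socket closed, so a blind retry could duplicate work.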

Environment

OpenClaw image built from main (commit 47009dd7)
Container running on GCE Container-Optimized OS
Gateway started with agents.defaults.skipBootstrap: true, pre-shared device pairing

extent analysis

TL;DR

The gateway can be fixed by not restarting on agent-token additions or by draining in-flight calls before closing the port.

Guidance

Verify that the issue is caused by the gateway restarting after an agent-token addition by checking the gateway log timeline for the "config change detected" and "signal SIGUSR1 received" messages.
Consider implementing one of the proposed fixes: not restarting on agent-token additions, draining in-flight calls before closing the port, or using an atomic restart.
To mitigate the issue, clients can wrap WS/HTTP calls in retry-with-backoff, but this is not a recommended long-term solution.
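
The log check in the first point can be automated. The sketch below looks for the three messages from the gateway log timeline appearing in order; the function name is hypothetical, and the marker strings assume the log wording shown in this issue.

```python
MARKERS = (
    "config change detected",
    "config change requires gateway restart",
    "signal SIGUSR1 received",
)

def restart_triggered_by_reload(lines) -> bool:
    """True if the reload markers appear in order in the given log lines."""
    idx = 0
    for line in lines:
        if MARKERS[idx] in line:
            idx += 1
            if idx == len(MARKERS):
                return True
    return False

# Lines copied from the gateway log timeline in this issue.
log = [
    "18:15:12.224  [ws] ⇄ res ✓ agents.create 4758ms",
    "18:15:12.230  [reload] config change detected; evaluating reload",
    "18:15:12.237  [reload] config change requires gateway restart",
    "18:15:12.241  [gateway] signal SIGUSR1 received",
    "18:15:12.666  [shutdown] completed cleanly in 371ms",
]

matched = restart_triggered_by_reload(log)
print(matched)
```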

Example

The issue does not include a patch; the fix is a change in the gateway's behavior rather than a specific code change.

Notes

The issue was reported against the OpenClaw image built from the main branch (commit 47009dd7); other environments or versions may behave differently.

Recommendation

Prefer proposed fix (1): don't restart on agent-token additions. It is the cleanest option, since the runtime can re-read the trusted-token table on the next handshake without a restart. The client-side retry workaround only masks the window.
