openclaw - Gateway WS `agent` dispatch times out at 60 s + embedded mode contends with running daemon for session file locks

openclaw/openclaw#71605, fetched 2026-04-26 05:10:47
Summary

On openclaw 2026.4.23, attempting to invoke the agent runtime from an external client (CLI subprocess or direct WS) hits two cooperating bugs that make any "external integration" architecture impractical:

  1. Gateway-side agent WS dispatch times out at 60 s for fresh, never-seen session-keys, even after a clean daemon restart, even with the discord channel disabled.
  2. CLI --local / embedded fallback mode then contends with the running daemon for session-state file locks, blocking 10 s per attempt and failing over multiple times, for a total wall time of ~70–130 s per invocation.

Combined effect: a single openclaw agent --message "ping" --json invocation against a healthy running daemon takes 70–130 s wall time and produces multiple lane task error and FailoverError messages, even though the agent itself eventually completes the run in ~5 s.

This breaks the sidecar/external-bot pattern that issues #38596 / #71546 implicitly require if Discord-transport is to be replaced with an alternative library (per the analysis at https://github.com/openclaw/openclaw/issues/71546 and the WS-resilience research).

Environment

OpenClaw: 2026.4.23 (a979721)
OS: macOS 15.6.1 (arm64)
Node (gateway runtime): 22.22.2
Setup: Single bot account, single guild, default config; channels.telegram.enabled: false; channels.discord.enabled: true
Network: wired Ethernet, 9–31 ms ping to gateway.discord.gg, 0% packet loss
Service: launchd ai.openclaw.gateway, OPENCLAW_GATEWAY_TOKEN configured

Bug 1 — gateway-side agent dispatch times out 60 s

$ openclaw agent --to 9999999999999999997 --message "ping" --json --timeout 30
gateway connect failed: GatewayClientRequestError: ...
Gateway agent failed; falling back to embedded: Error: gateway timeout after 60000ms
Gateway target: ws://127.0.0.1:18789

This happens for:

  • Brand-new never-seen Discord-id session-keys (rules out stuck-session theory)
  • Both with and without channels.discord.enabled (rules out Carbon WS holding a runtime lock)
  • After a fresh launchctl kickstart -k gui/$UID/ai.openclaw.gateway (rules out accumulated daemon state)
  • After openclaw devices approve <pending-scope-upgrade-request> (rules out auth scope issues)

The daemon DOES respond to agent requests in some cases — there's a [ws] ⇄ res ✓ agent 489ms runId=... line in gateway.log from approximately the same time window — so it's not a total failure of the dispatch path. It's intermittent/conditional. We did not fully isolate the trigger.
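To help narrow down the trigger, the successful responses in gateway.log can be tallied and compared against the timestamps of the timed-out calls. The following is a minimal Python sketch, assuming every response line resembles the single line quoted above; the regular expression and the function name `agent_latencies_ms` are illustrative and not part of openclaw.

```python
import re

# Pattern inferred from the one gateway.log line quoted in this issue.
# The real log format may carry extra fields; treat this as an assumption.
RES_LINE = re.compile(r"res\s+✓\s+(?P<method>[\w.]+)\s+(?P<ms>\d+)ms")


def agent_latencies_ms(lines, method="agent"):
    """Return the logged response latencies (ms) for one WS method."""
    latencies = []
    for line in lines:
        match = RES_LINE.search(line)
        if match and match.group("method") == method:
            latencies.append(int(match.group("ms")))
    return latencies


sample = ["[ws] ⇄ res ✓ agent 489ms runId=..."]
print(agent_latencies_ms(sample))
```

Requests that never produce a matching `res` line within the 60 s window would be the ones worth correlating with other daemon activity.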

Bug 2 — embedded mode contends with the running daemon for session-file locks

When the gateway-path falls back to embedded:

[diagnostic] lane task error: lane=session:agent:main:explicit:sidecar:isolation-test durationMs=14709
  error="Error: session file locked (timeout 10000ms): pid=19479
    /Users/x./.openclaw/agents/main/sessions/6f6fa460-4849-4f68-86ce-4a7941cc2e05.jsonl.lock"
[model-fallback/decision] model fallback decision: decision=candidate_failed
  requested=openai-codex/gpt-5.4 reason=timeout
FailoverError: session file locked (timeout 10000ms): pid=19479

pid=19479 is the running openclaw-gateway daemon itself (ps -p 19479 confirms). The embedded subprocess used a unique --session-id, sidecar:isolation-test, that the daemon had never seen, yet its attempt to lock the freshly created session file still blocks. This suggests the running daemon holds a process-level lock on the parent directory or on shared state (which of the two is not confirmed).

Effect: every embedded subprocess hits a 10 s lock-wait, fails over (no fallback model configured), retries on a different lane (also locked), eventually succeeds after multiple retries. Net wall time 70–130 s per call.
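The lock-holder check above (reading the pid from the error and confirming it with ps) can be automated. This is a rough Python sketch, assuming the error text always follows the shape shown in the log excerpt; `parse_lock_error` and `pid_alive` are hypothetical helper names, and the liveness check is POSIX-only.

```python
import os
import re

# Shape inferred from the log excerpt in this issue; the path is optional
# because the FailoverError line repeats the message without it.
LOCK_ERR = re.compile(
    r"session file locked \(timeout (?P<timeout_ms>\d+)ms\): pid=(?P<pid>\d+)"
    r"(?:\s+(?P<path>\S+\.lock))?"
)


def parse_lock_error(text):
    """Extract timeout, holder pid and lock path from a lock error message."""
    match = LOCK_ERR.search(text)
    if match is None:
        return None
    return {
        "timeout_ms": int(match.group("timeout_ms")),
        "pid": int(match.group("pid")),
        "path": match.group("path"),
    }


def pid_alive(pid):
    """POSIX-only: signal 0 tests for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


excerpt = (
    'error="Error: session file locked (timeout 10000ms): pid=19479\n'
    "    /Users/x./.openclaw/agents/main/sessions/"
    '6f6fa460-4849-4f68-86ce-4a7941cc2e05.jsonl.lock"'
)
info = parse_lock_error(excerpt)
print(info)
```

If `pid_alive(info["pid"])` is true and the pid matches the gateway daemon, the lock is live rather than stale, which is the situation described here.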

Why this matters for external integrations

The ability to invoke openclaw's agent runtime from an external process is the foundation for:

  • Replacing or augmenting transport plugins (the sidecar pattern relevant to #38596 / #71546)
  • Building scripted/scheduled agent workflows
  • Programmatic test harnesses

Today, openclaw agent ... --json is documented as the canonical CLI for agent invocation (per openclaw agent --help examples). In practice it doesn't work usefully against a running daemon — every call takes 70 s+ and produces failure-mode log noise. The only fast path is to run openclaw agent --local with the daemon STOPPED, which defeats the purpose of having a long-running daemon.

Reproduction

  1. openclaw 2026.4.23 daemon running healthy, all default config except channels.discord.enabled: false (to rule out Carbon noise — though same behavior either way).
  2. From shell: openclaw agent --to 1111 --session-id test-isolation --message "ping" --json --timeout 30.
  3. Observe: 60 s gateway timeout → fallback to embedded → 10 s lock contention → fail over → retry → eventually returns "pong" after 70–130 s wall time.
  4. gateway.log shows the daemon was idle (no concurrent agent runs) during this window.
  5. Lock file in ~/.openclaw/agents/main/sessions/<sid>.jsonl.lock is held by the daemon's own PID.
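The reproduction can be wrapped in a small harness that records wall time and counts the failure markers, which makes before/after comparisons easier. The Python sketch below assumes the marker strings appear exactly as in the transcripts in this issue; `summarize_run` and `run_and_summarize` are hypothetical names.

```python
import re
import subprocess
import sys
import time

# Marker strings copied from the transcripts in this issue.
MARKERS = {
    "gateway_timeout": re.compile(r"gateway timeout after (\d+)ms"),
    "embedded_fallback": re.compile(r"falling back to embedded"),
    "lock_wait": re.compile(r"session file locked \(timeout (\d+)ms\)"),
}


def summarize_run(output):
    """Count how often each failure marker occurs in captured output."""
    return {name: len(pat.findall(output)) for name, pat in MARKERS.items()}


def run_and_summarize(cmd):
    """Run a command, capture stdout and stderr, and report markers plus wall time."""
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    summary = summarize_run(proc.stdout + proc.stderr)
    summary["wall_s"] = round(time.monotonic() - start, 1)
    return summary


sample = (
    "gateway connect failed: GatewayClientRequestError: ...\n"
    "Gateway agent failed; falling back to embedded: "
    "Error: gateway timeout after 60000ms\n"
    "[diagnostic] lane task error: "
    "lane=session:agent:main:explicit:sidecar:isolation-test durationMs=14709\n"
    '  error="Error: session file locked (timeout 10000ms): pid=19479\n'
    "FailoverError: session file locked (timeout 10000ms): pid=19479\n"
)
print(summarize_run(sample))
```

For the real reproduction, the command from step 2 would be passed as an argument list, for example `run_and_summarize(["openclaw", "agent", "--to", "1111", "--session-id", "test-isolation", "--message", "ping", "--json", "--timeout", "30"])`. Note that a single lock failure shows up twice in the sample (once as a lane task error and once as a FailoverError), so the `lock_wait` count is an upper bound on distinct lock waits.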

Asks

Three concrete options ordered by effort:

  1. Fix the gateway-side agent dispatch timeout. When the daemon is healthy, the WS agent method should return a runId in <1 s (we observed it doing so in one log line; the failure is intermittent). Identify the trigger for the 60 s timeout and fix.

  2. Make embedded mode safe for parallel-with-daemon execution. Two openclaw runtimes operating against the same ~/.openclaw/agents/... directory should either:

    • Cooperate via shared lock with reasonable timeouts and clear error messaging
    • Use separate state directories (e.g., --state-dir <path> flag for embedded mode)
    • OR refuse to start with a clear error: "embedded mode unavailable while daemon is running on this state dir"
  3. Document the limitation. If "external clients should never use the CLI when the daemon is running" is the intended design, that needs to be in the docs. Today the CLI silently degrades to a 70–130 s slow path with confusing error noise. Either flag it loudly at startup or improve the error messaging.
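To make the first sub-option under ask 2 concrete, here is one way a cooperating lock with a short timeout and a clear error could look. It is a generic Python sketch of the technique, not openclaw's actual locking code (which is not shown in this issue); `session_lock`, `LockBusy`, and the default timeout are all hypothetical.

```python
import os
import tempfile
import time
from contextlib import contextmanager


class LockBusy(RuntimeError):
    """Raised when the lock file stays held past the timeout."""


def _read_holder(lock_path):
    """Best-effort read of the pid recorded in the lock file."""
    try:
        with open(lock_path) as f:
            return f.read().strip() or "unknown"
    except OSError:
        return "unknown"


@contextmanager
def session_lock(lock_path, timeout_s=2.0, poll_s=0.05):
    """Acquire an exclusive lock file, or fail fast with a clear message."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # O_CREAT | O_EXCL makes creation atomic: exactly one process wins.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise LockBusy(
                    f"{lock_path} is held by pid {_read_holder(lock_path)}; "
                    "another runtime is using this state directory"
                ) from None
            time.sleep(poll_s)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))  # record the holder for error messages
        yield
    finally:
        os.unlink(lock_path)


# Demo: a second acquisition attempt fails quickly and names the holder.
demo_lock = os.path.join(tempfile.mkdtemp(), "session.jsonl.lock")
with session_lock(demo_lock):
    try:
        with session_lock(demo_lock, timeout_s=0.2):
            pass
    except LockBusy as err:
        print(err)
```

The sketch does not handle stale locks left behind by a crashed holder; a fuller version would need to check whether the recorded pid is still alive before giving up.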

Related issues filed today (2026-04-25)

  • #38596 — Discord health-monitor restart loop (un-staled with fresh evidence)
  • #71546 — Discord ingest lag of 100–400 s on stable connection (the user-facing surface)
  • #71568 — doctor.memory.status blocks 6.7–105 s on live embedding probe
  • (this one) — Gateway dispatch timeout + embedded-mode lock contention

These four together describe the structural state of openclaw 2026.4.23 from an external-integration standpoint.

Logs / artifacts available

  • /Users/x./.openclaw/logs/gateway.log — gateway-side agent 489ms line + 1006/1000 close cadence
  • /tmp/openclaw/openclaw-2026-04-25.log — full ndjson trace
  • Subprocess output samples showing the pid=N session file locked / FailoverError cycle, available on request as gist links.

extent analysis

TL;DR

The most likely fix involves addressing the gateway-side agent dispatch timeout and the embedded mode's contention with the running daemon for session-file locks.

Guidance

  1. Investigate the gateway-side agent dispatch timeout: Identify the trigger for the 60 s timeout and fix it to ensure the WS agent method returns a runId in <1 s when the daemon is healthy.
  2. Improve embedded mode's parallel execution safety: Modify embedded mode to either cooperate with the daemon via shared locks, use separate state directories, or refuse to start with a clear error message when the daemon is running.
  3. Enhance error messaging and documentation: Clearly document the limitation of using the CLI with a running daemon and improve error messaging to avoid confusing error noise.
  4. Verify the fixes: Test the changes by reproducing the issue and checking for the absence of the 60 s timeout and lock contention errors.

Notes

The provided information suggests that the issues are related to the interaction between the gateway and embedded modes, but without further details, it's challenging to provide a more specific solution. The guidance provided is based on the information given and may require additional investigation to fully resolve the issues.

Recommendation

As a first mitigation, change embedded mode either to use a separate state directory (the issue proposes a --state-dir <path> flag) or to refuse to start while the daemon is running on the same state directory. Either change would remove the lock contention in the embedded fallback path and make the remaining gateway dispatch timeout easier to isolate.
