openclaw - 💡(How to fix) Fix Gateway HTTP/WS dispatch deadlock on Windows + Docker Desktop bind-mount setups (regression in 2026.4.24, persists in .25 and .26) [3 comments, 3 participants]

openclaw/openclaw#73874, fetched 2026-04-29 06:13:51
On Windows host + Docker Desktop + bind-mounted ~/.openclaw/ setups, the gateway in 2026.4.24, 2026.4.25, and 2026.4.26 logs ready and binds the listener socket, but the request-handling dispatch is deadlocked. Both HTTP and internal-WebSocket transports accept TCP connections and never respond. Slack DMs are queued behind a stuck agent:main:main session and never delivered. Reproduces deterministically on two distinct bots (Bragi and Kvasir) with different histories. Fully working on 2026.4.23 with the same compose/config.

Reproduction environment

  • Host: Windows 11 Pro 26200, Docker Desktop (WSL2 backend)
  • Container image: ghcr.io/openclaw/openclaw:2026.4.26 (and .25, .24) extended via Dockerfile (FROM) with Playwright/Chromium/gh/gog CLI installs
  • Bind mount: ./config:/home/node/.openclaw from a Windows NTFS path
  • Container user: node (uid 1000); compose overrides user: "0:0" for an entrypoint wrapper that runs runuser -u node -- "$@" after fix-up perms
  • Two bots tested:
    • "Bragi" — extensively used since 2026.3.x, accumulated state, 2.4 MB sessions.json with one large 88%-context-utilized session
    • "Kvasir" — clean state, came directly from 2026.4.23 to .26 with no intermediate migrations

Symptoms (identical on both bots)

After gateway ready:

| Probe | Behavior |
| --- | --- |
| curl http://127.0.0.1:18789/healthz | TCP connection accepted, never any HTTP response; times out at 8s with HTTP 000 |
| curl http://127.0.0.1:18789/ (gateway dashboard) | Same: TCP accept, no response |
| curl http://127.0.0.1:18789/__openclaw__/canvas/ | Same |
| curl http://127.0.0.1:18789/api/status | Same |
| openclaw gateway status --deep (WebSocket probe to ws://127.0.0.1:18789) | Same: timeout |
| openclaw plugins inspect <id> (CLI → gateway RPC) | Hangs |
| openclaw plugins doctor (CLI → gateway RPC) | Hangs / silent |
| codex exec directly (CLI, bypasses gateway) | Works; returns gpt-5.4 reply |
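
The distinction the probes above rely on, "connection refused" versus "accepted but silent", can be checked with a short script. This is a minimal sketch and not part of openclaw: the function name `probe` and its result labels are made up for illustration, and the 8-second default only mirrors the curl timeout quoted in the table.

```python
import http.client
import socket


def probe(host: str, port: int, path: str = "/healthz", timeout: float = 8.0) -> str:
    """Classify an HTTP endpoint by how far a GET request gets."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.connect()
    except OSError:
        # Nothing is listening, or the TCP handshake itself timed out.
        return "connect-failed"
    try:
        conn.request("GET", path)
        return f"http-{conn.getresponse().status}"
    except (socket.timeout, TimeoutError):
        # The handshake succeeded but no response bytes arrived in time,
        # which is the behavior the table describes.
        return "accepted-no-response"
    except (http.client.HTTPException, OSError):
        return "error"
    finally:
        conn.close()
```

Calling `probe("127.0.0.1", 18789)` inside the container returns one of the labels above; per the report, the affected releases would be expected to land in the `accepted-no-response` branch, while 2026.4.23 answers `/healthz` with HTTP 200.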

Process state: openclaw-gateway PID is Sl (sleeping, multi-threaded), 4–9% CPU. Not idle, not CPU-spinning. All threads S (sleeping). Event-loop deadlock signature.

Kernel TCP state for port 18789:

00000000:4965 -> 00000000:0000 [LISTEN]
0100007F:4965 -> 0100007F:C998 [CLOSE_WAIT]
0100007F:4965 -> 0100007F:E732 [CLOSE_WAIT]
0100007F:4965 -> 0100007F:E29C [CLOSE_WAIT]
... (one CLOSE_WAIT per probe attempt)

Each probe leaves a CLOSE_WAIT — server accepted the TCP connection, peer eventually closed, server never close()'d its side. Classic Node await-never-resolves signature.
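
The hex pairs in the listing are the raw `/proc/net/tcp` encoding: a little-endian IPv4 address and a hex port, so `0100007F:4965` is 127.0.0.1:18789. A small sketch for decoding them and counting states per port follows. The helper names `decode_endpoint` and `count_states` are illustrative, and the state table is only a subset of the kernel's codes.

```python
import socket
import struct

# Subset of Linux TCP state codes as they appear in /proc/net/tcp.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "08": "CLOSE_WAIT", "0A": "LISTEN"}


def decode_endpoint(hex_endpoint: str) -> tuple[str, int]:
    """Turn '0100007F:4965' into ('127.0.0.1', 18789)."""
    hex_ip, hex_port = hex_endpoint.split(":")
    # The IPv4 address is printed as a little-endian 32-bit hex value.
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)


def count_states(proc_net_tcp_text: str, port: int) -> dict[str, int]:
    """Count sockets per TCP state whose local port equals `port`."""
    counts: dict[str, int] = {}
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        _, local_port = decode_endpoint(fields[1])
        if local_port != port:
            continue
        state = TCP_STATES.get(fields[3], fields[3])
        counts[state] = counts.get(state, 0) + 1
    return counts
```

Feeding it the contents of `/proc/net/tcp` with `port=18789` and re-running it after each probe shows whether the CLOSE_WAIT count keeps growing.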

Cascading downstream symptoms

The dispatch deadlock causes:

  1. Slack provider stalls after channels resolved with no socket mode connected line. The slack plugin needs an internal handshake to the gateway HTTP/WS path during init, and that handshake hangs.

  2. session-write-lock held for 200,000+ ms (max 15,000 ms expected) — the agent grabs the lock to process a request, model call hangs because of the upstream dispatch issue, lock held for minutes. New Slack DMs queue behind it indefinitely.

  3. stuck session: sessionId=main sessionKey=agent:main:main state=processing age=595s queueDepth=0. The same agent session sits "processing" for ~10 minutes, gets watchdog-released, and the next request immediately gets stuck the same way. Self-perpetuating.

  4. [ws] ⇄ res ✗ nativeHook.invoke errorCode=INVALID_REQUEST errorMessage=native hook relay not found — slack plugin tries to invoke a hook it registered on the gateway. Registration succeeded silently but the gateway's registry doesn't have it on lookup. Strongly suggests plugin registry mismatch / multiple registry instances.

  5. 5 plugin(s) failed to initialize (validation: anthropic, codex, memory-core, openai, slack) — sometimes appears after restart, sometimes doesn't. When it does, the codex agent harness isn't registered, so embedded agent requests fail with Requested agent harness "codex" is not registered and PI fallback is disabled. Even the fallback to anthropic fails because that plugin also failed validation. Inconsistent run-to-run.

  6. [skills] watcher error: EACCES: permission denied, watch '/home/node/.openclaw' (and stat of various subpaths) — the skills watcher subsystem can't traverse ~/.openclaw/ because it's drwx------ root:root (Windows-NTFS bind-mount default mode 0700 owned by uid 0). New behavior in 2026.4.x. Workaround: chown the dir to node before startup.

  7. [heartbeat] failed: EACCES: permission denied, mkdir '/home/node/.openclaw/workspace' — heartbeat subsystem tries to mkdir an already-existing bind-mount sub-mount. Same root cause.

  8. Plugin runtime-deps mirror-lock contention: on first 2026.4.24/.26 startup, plugin runtime deps install into ~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/. If the previous startup died holding the mirror-lock, subsequent startups wait 5 minutes per-plugin (300050ms timeout) for the lock and give up loading that plugin. Lock dir at ~/.openclaw/plugin-runtime-deps/<id>/.openclaw-runtime-mirror.lock/owner.json persists across container restarts. Have to manually rm -rf the lock dir to recover.

  9. Cannot find module '.../slack/pipeline.runtime-<hash>.js' — slack plugin's runtime-deps install reports success, but at least one bundle file is silently missing on first install. Eventually self-resolves on a later startup.

  10. openclaw doctor --fix writes openclaw.json with 0600 root:root perms when invoked via docker exec -t (which inherits compose user: "0:0"). The gateway then can't read its own config and enters a restart loop, failing with "Missing config. Run openclaw setup or set gateway.mode=local". The file has to be chowned manually to recover.

  11. pending.json/paired.json parse-handle race: [gateway] parse/handle error: JsonFileReadError: Failed to read JSON file: ~/.openclaw/devices/pending.json fires every 30s. Files exist and are valid JSON. Probably racing with atomic-rename .tmp swap.

  12. 2026.4.24+ silently rewrites openclaw.json on first start: agents.defaults.model.primary from codex/gpt-5.4 to openai/gpt-5.4, adds openai plugin entry. Persists across rollback to .23 — manual revert needed. (Note: the openai provider in 2026.4.x does work with codex/ChatGPT OAuth via the agentRuntime: {id: "codex"} runtime, but the rewrite caught us off guard initially.)
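
Item 8 describes a lock directory that survives container restarts. A read-only check for such leftovers could look like the sketch below. It is not an openclaw tool: `find_stale_locks` and the 300-second default are assumptions chosen for illustration (the default loosely follows the roughly 300 s wait quoted above), and only the lock directory name comes from the report. It lists candidates and deletes nothing.

```python
import time
from pathlib import Path

# Directory name as quoted in the report.
LOCK_DIR_NAME = ".openclaw-runtime-mirror.lock"


def find_stale_locks(deps_root: Path, max_age_s: float = 300.0) -> list[Path]:
    """Return lock directories one level below deps_root older than max_age_s."""
    now = time.time()
    stale = []
    for lock in deps_root.glob(f"*/{LOCK_DIR_NAME}"):
        # Age is judged by the directory's mtime only; owner.json is not parsed
        # because its fields are not documented in the report.
        if lock.is_dir() and now - lock.stat().st_mtime > max_age_s:
            stale.append(lock)
    return sorted(stale)
```

Running it against `~/.openclaw/plugin-runtime-deps` before starting the gateway would show whether a previous startup left a lock behind. Whether a given lock is safe to remove is still a manual judgment.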

What works on 2026.4.23 with the same setup

  • /healthz returns HTTP 200 in ~20ms
  • All plugins load without validation failures
  • Slack socket-mode connects within ~30s of ready
  • Session-write-lock acquired/released in milliseconds
  • No nativeHook registry mismatches
  • No plugin-runtime-deps install needed (.23 doesn't use that mechanism)

Diagnostic data we collected

  • gateway process state (Sl/Rl, CPU %, thread count, all wchan=0 sleeping)
  • TCP socket state (LISTEN + N CLOSE_WAIT accumulating)
  • Stability bundles in ~/.openclaw/logs/stability/ (only one from a MODULE_NOT_FOUND during very first .24 attempt; nothing for the dispatch deadlocks themselves)
  • openclaw plugins list output (6 plugins enabled — most plugins are still 2026.4.25 in the .26 release, only cerebras/migrate-claude/qqbot bumped to .26)
  • Full container logs from multiple startup attempts

Happy to attach files / run additional diagnostics on request.

What I tried and what did/didn't help

| Step | Effect |
| --- | --- |
| chown -R node:node ~/.openclaw/{tasks,memory,flows,extensions,plugin-runtime-deps,node_modules} | Fixes lots of unrelated EACCES errors but does not fix dispatch deadlock |
| chown node:node ~/.openclaw (the bind-mount root, mode 0700) | Fixes [skills] watcher EACCES storm and [heartbeat] mkdir EACCES |
| rm -rf ~/.openclaw/plugin-runtime-deps/<id>/.openclaw-runtime-mirror.lock/ | Unblocks plugin loading after a stuck-lock startup |
| chown node:node ~/.openclaw/openclaw.json | Fixes "Missing config" restart-loop after openclaw doctor --fix |
| openclaw doctor --fix (interactive) | Migrates legacy embeddedHarness to agentRuntime, but writes config with bad perms (see above) |
| Renaming sessions.json aside | Caused a different startup hang; restoring fixed that |
| compose down && up --force-recreate | Doesn't help; same regression |
| Rolling back to 2026.4.23 (retag local image, restore openclaw.json from pre-update commit, wipe stale node_modules + plugin-runtime-deps) | Fully restores working state |
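
Several of the steps above come down to whether uid 1000 (node) can read a path that the bind mount presents as root:root. The sketch below evaluates plain mode bits only. `usable_by` and `audit` are hypothetical helper names, and ACLs, supplementary groups, and root's override are deliberately ignored.

```python
import os
import stat


def usable_by(mode: int, owner_uid: int, owner_gid: int, uid: int, gid: int) -> bool:
    """True if the mode bits grant uid:gid read access (plus search on directories)."""
    if owner_uid == uid:
        bits = (mode >> 6) & 0o7  # owner permission triplet
    elif owner_gid == gid:
        bits = (mode >> 3) & 0o7  # group permission triplet
    else:
        bits = mode & 0o7  # "other" permission triplet
    needed = 0o5 if stat.S_ISDIR(mode) else 0o4
    return (bits & needed) == needed


def audit(paths, uid: int = 1000, gid: int = 1000) -> list[tuple[str, str]]:
    """Return (path, reason) pairs for paths that uid:gid cannot read."""
    problems = []
    for path in paths:
        try:
            st = os.stat(path)
        except OSError as exc:
            problems.append((str(path), f"stat failed: {exc.strerror}"))
            continue
        if not usable_by(st.st_mode, st.st_uid, st.st_gid, uid, gid):
            detail = f"uid={st.st_uid} gid={st.st_gid} mode={stat.S_IMODE(st.st_mode):04o}"
            problems.append((str(path), detail))
    return problems
```

For example, `audit(["/home/node/.openclaw", "/home/node/.openclaw/openclaw.json"])` run as root in the entrypoint wrapper would flag the `drwx------ root:root` bind-mount root described earlier before the gateway starts as node.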

Compose context

services:
  gateway:
    image: openclaw-bot:latest    # FROM ghcr.io/openclaw/openclaw:latest
    user: "0:0"
    entrypoint: ["/bin/sh", "/usr/local/bin/fix-codex-perms.sh", "docker-entrypoint.sh"]
    command: ["node", "dist/index.js", "gateway", "--bind", "lan", "--port", "18789"]
    volumes:
      - ./config:/home/node/.openclaw
      - ./workspace:/home/node/.openclaw/workspace
      # plus ./config/codex-config:/home/node/.codex etc.

fix-codex-perms.sh chowns ~/.codex/, ~/.openclaw/{tasks,memory,flows,cron,delivery-queue,node_modules,plugin-runtime-deps}, strips world-writable bits on ~/.openclaw/extensions/*/, then runuser -u node -- "$@".

Asks

  1. Is this dispatch deadlock a known issue you're tracking? It's been present in three consecutive releases (.24, .25, .26).
  2. Would adding ~/.openclaw/ itself (the bind-mount root) and ~/.openclaw/openclaw.json to whatever owns initial perm setup help, or is the breakage independent of that?
  3. Any way to enable verbose dispatcher logging that would capture why the dispatcher is stuck post-ready?
  4. Can you confirm whether the openai/gpt-5.4 provider is intended to work with ChatGPT OAuth (via codex runtime) or whether the auto-migration in .24+ should be conditional on actual API-key auth being present?

extent analysis

TL;DR

The report identifies no confirmed root cause for the dispatch deadlock. The only remedy verified in the report is rolling back to 2026.4.23 with the same compose and config. The nativeHook registry mismatch, the plugin runtime-deps install and lock problems, and the bind-mount permission errors are candidate causes that still need investigation.

Guidance

  • Investigate the nativeHook registry mismatch ("native hook relay not found") and the plugin runtime-deps install and mirror-lock failures first; neither occurs on 2026.4.23, and both may contribute to the dispatch deadlock.
  • Verify ownership and mode of ~/.openclaw/ (the bind-mount root) and ~/.openclaw/openclaw.json. In the report, chowning them to node cleared the EACCES errors and the "Missing config" restart loop, but not the deadlock.
  • Ask the maintainers whether verbose dispatcher logging can be enabled, to capture why dispatch stalls after ready.
  • Review the Compose context and fix-codex-perms.sh, which as described does not chown the bind-mount root or openclaw.json.

Example

No specific code snippet is provided, as the issue is complex and requires a more thorough investigation.

Notes

The issue is likely related to changes introduced in version 2026.4.24, and rolling back to version 2026.4.23 may be a temporary solution. However, it is essential to investigate and address the root causes of the issue to ensure a permanent fix.

Recommendation

Apply a workaround by rolling back to version 2026.4.23, which is known to work with the same setup, while investigating the root causes of the issue.
