openclaw - 💡(How to fix) Fix Gateway HTTP/WS dispatch deadlock on Windows + Docker Desktop bind-mount setups (regression in 2026.4.24, persists in .25 and .26) [3 comments, 3 participants]

purpleant · 2026-04-28T23:52:56Z

On Windows host + Docker Desktop + bind-mounted `~/.openclaw/` setups, the gateway in 2026.4.24, 2026.4.25, and 2026.4.26 logs `ready` and binds the listener socket, but the **request-handling dispatch is deadlocked**. Both HTTP and internal-WebSocket transports accept TCP connections and never respond. Slack DMs are queued behind a stuck `agent:main:main` session and never delivered. **Reproduces deterministically on two distinct bots (Bragi and Kvasir) with different histories.** Fully working on 2026.4.23 with the same compose/config.


openclaw/openclaw#73874•Fetched 2026-04-29 06:13:51




Reproduction environment

- **Host**: Windows 11 Pro 26200, Docker Desktop (WSL2 backend)
- **Container image**: `ghcr.io/openclaw/openclaw:2026.4.26` (and .25, .24), extended via Dockerfile (`FROM`) with Playwright/Chromium/gh/gog CLI installs
- **Bind mount**: `./config:/home/node/.openclaw` from a Windows NTFS path
- **Container user**: `node` (uid 1000); compose overrides `user: "0:0"` for an entrypoint wrapper that runs `runuser -u node -- "$@"` after fixing up permissions
- **Two bots tested**:
  - "Bragi": extensively used since 2026.3.x, accumulated state, 2.4 MB `sessions.json` with one large 88%-context-utilized session
  - "Kvasir": clean state, came directly from 2026.4.23 to .26 with no intermediate migrations

Symptoms (identical on both bots)

After gateway ready:

| Probe | Behavior |
|---|---|
| `curl http://127.0.0.1:18789/healthz` | TCP connection accepted, **never any HTTP response**; times out at 8s with `HTTP 000` |
| `curl http://127.0.0.1:18789/` (gateway dashboard) | Same: TCP accept, no response |
| `curl http://127.0.0.1:18789/__openclaw__/canvas/` | Same |
| `curl http://127.0.0.1:18789/api/status` | Same |
| `openclaw gateway status --deep` (WebSocket probe to `ws://127.0.0.1:18789`) | **Same: timeout** |
| `openclaw plugins inspect <id>` (CLI → gateway RPC) | Hangs |
| `openclaw plugins doctor` (CLI → gateway RPC) | Hangs / silent |
| `codex exec` directly (CLI, bypasses gateway) | **Works**: returns gpt-5.4 reply |
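
The distinction this table draws (TCP connection accepted but no HTTP reply, as opposed to a refused connection) can be checked from a script. The following is a minimal sketch in Python, assuming the gateway port is reachable at `127.0.0.1:18789` from wherever the script runs. The function name `probe` and its result labels are illustrative and not part of openclaw.

```python
import http.client
import socket


def probe(host: str, port: int, path: str = "/healthz", timeout: float = 8.0) -> str:
    """Classify one HTTP probe.

    Returns "connect-failed" if the TCP connection cannot be opened,
    "no-response" if the connection opens but no HTTP reply arrives
    within `timeout` seconds, "error: <type>" for other failures,
    or "http <status>" when the server answers.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.connect()  # TCP handshake only; nothing is sent yet
    except OSError:
        return "connect-failed"
    try:
        conn.request("GET", path)
        return f"http {conn.getresponse().status}"
    except (socket.timeout, TimeoutError):
        return "no-response"
    except (OSError, http.client.HTTPException) as exc:
        return f"error: {type(exc).__name__}"
    finally:
        conn.close()
```

On an affected gateway, `probe("127.0.0.1", 18789, "/healthz")` would be expected to return `"no-response"` after the 8-second timeout, matching the `HTTP 000` result from `curl`; on 2026.4.23 the report says the same endpoint returns HTTP 200.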

Process state: the `openclaw-gateway` PID is `Sl` (sleeping, multi-threaded) at 4–9% CPU, so it is neither idle nor CPU-spinning. All threads are `S` (sleeping), which is consistent with a stalled event loop rather than a busy loop.

Kernel TCP state for port 18789:

00000000:4965 -> 00000000:0000 [LISTEN]
0100007F:4965 -> 0100007F:C998 [CLOSE_WAIT]
0100007F:4965 -> 0100007F:E732 [CLOSE_WAIT]
0100007F:4965 -> 0100007F:E29C [CLOSE_WAIT]
... (one CLOSE_WAIT per probe attempt)

Each probe leaves a `CLOSE_WAIT`: the server accepted the TCP connection, the peer eventually closed, and the server never called `close()` on its side. This is the classic signature of a Node `await` that never resolves.
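
The hex pairs in the listing appear to use the kernel's `/proc/net/tcp` notation, where `0100007F:4965` is `127.0.0.1:18789`. A minimal sketch in Python for decoding them and counting `CLOSE_WAIT` sockets on the gateway port follows. It assumes a little-endian host, IPv4 only (`/proc/net/tcp`, not `tcp6`), and that the file's text was captured from inside the container; the function names are illustrative.

```python
import socket
import struct

# TCP state codes used in /proc/net/tcp (from the Linux kernel's tcp_states.h).
CLOSE_WAIT = 0x08
LISTEN = 0x0A


def decode_endpoint(hex_endpoint: str) -> tuple[str, int]:
    """Turn '0100007F:4965' into ('127.0.0.1', 18789)."""
    hex_ip, hex_port = hex_endpoint.split(":")
    # The address is printed as a host-order 32-bit integer, so on a
    # little-endian machine the bytes appear reversed.
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)


def count_state(proc_net_tcp: str, local_port: int, state: int = CLOSE_WAIT) -> int:
    """Count sockets bound to `local_port` that are in `state`."""
    count = 0
    for line in proc_net_tcp.splitlines():
        fields = line.split()
        # Data rows start with an index such as '0:'; the header row does not.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        _, port = decode_endpoint(fields[1])
        if port == local_port and int(fields[3], 16) == state:
            count += 1
    return count
```

A count that grows by one per probe attempt and never shrinks matches the behavior described here.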

Cascading downstream symptoms

The dispatch deadlock causes:

1. **Slack provider stalls after `channels resolved`** with no `socket mode connected` line. The Slack plugin needs an internal handshake to the gateway HTTP/WS path during init, and that handshake hangs.
2. **`session-write-lock` held for 200,000+ ms (max 15,000 ms expected)**: the agent grabs the lock to process a request, the model call hangs because of the upstream dispatch issue, and the lock is held for minutes. New Slack DMs queue behind it indefinitely.
3. **`stuck session: sessionId=main sessionKey=agent:main:main state=processing age=595s queueDepth=0`**: the same agent session sits in "processing" for ~10 minutes, gets watchdog-released, and the next request immediately gets stuck the same way. Self-perpetuating.
4. **`[ws] ⇄ res ✗ nativeHook.invoke errorCode=INVALID_REQUEST errorMessage=native hook relay not found`**: the Slack plugin tries to invoke a hook it registered on the gateway. Registration succeeded silently, but the gateway's registry doesn't have it on lookup. This strongly suggests a plugin registry mismatch or multiple registry instances.
5. **`5 plugin(s) failed to initialize (validation: anthropic, codex, memory-core, openai, slack)`**: sometimes appears after restart, sometimes doesn't. When it does, the codex agent harness isn't registered, so embedded agent requests fail with `Requested agent harness "codex" is not registered and PI fallback is disabled`. Even the fallback to anthropic fails because that plugin also failed validation. Inconsistent run-to-run.
6. **`[skills] watcher error: EACCES: permission denied, watch '/home/node/.openclaw'`** (and `stat` of various subpaths): the skills watcher subsystem can't traverse `~/.openclaw/` because it's `drwx------ root:root` (Windows-NTFS bind-mount default mode 0700, owned by uid 0). New behavior in 2026.4.x. Workaround: chown the dir to `node` before startup.
7. **`[heartbeat] failed: EACCES: permission denied, mkdir '/home/node/.openclaw/workspace'`**: the heartbeat subsystem tries to mkdir an already-existing bind-mount sub-mount. Same root cause as item 6.
8. **Plugin runtime-deps mirror-lock contention**: on first 2026.4.24/.26 startup, plugin runtime deps install into `~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/`. If the previous startup died holding the mirror-lock, subsequent startups wait 5 minutes per plugin (300050ms timeout) for the lock and give up loading that plugin. The lock dir at `~/.openclaw/plugin-runtime-deps/<id>/.openclaw-runtime-mirror.lock/owner.json` persists across container restarts and has to be removed manually (`rm -rf`) to recover.
9. **`Cannot find module '.../slack/pipeline.runtime-<hash>.js'`**: the Slack plugin's runtime-deps install reports success, but at least one bundle file is silently missing on first install. Eventually self-resolves on a later startup.
10. **`openclaw doctor --fix` writes `openclaw.json` with 0600 root:root perms** when invoked via `docker exec -t` (which inherits compose `user: "0:0"`). The gateway then can't read its own config and enters a restart loop with "Missing config. Run openclaw setup or set gateway.mode=local". The file has to be chowned manually to recover.
11. **`pending.json`/`paired.json` parse-handle race**: `[gateway] parse/handle error: JsonFileReadError: Failed to read JSON file: ~/.openclaw/devices/pending.json` fires every 30s. The files exist and are valid JSON. Probably racing with an atomic-rename `.tmp` swap.
12. **2026.4.24+ silently rewrites `openclaw.json` on first start**: `agents.defaults.model.primary` changes from `codex/gpt-5.4` to `openai/gpt-5.4`, and an `openai` plugin entry is added. The change persists across rollback to .23, so a manual revert is needed. (Note: the openai provider in 2026.4.x does work with codex/ChatGPT OAuth via the `agentRuntime: {id: "codex"}` runtime, but the rewrite caught us off guard initially.)
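
With this many overlapping symptoms, it can help to tally which signatures actually occur in a given startup log. The following is a minimal sketch in Python, assuming the container log has been saved as text. The substrings are the ones quoted in this report (exact log wording may differ between builds), and the labels and function names are illustrative.

```python
import re
from collections import Counter

# Substrings quoted in the report, mapped to short labels (labels are illustrative).
SIGNATURES = {
    "session-write-lock": "write-lock held",
    "stuck session:": "stuck session",
    "native hook relay not found": "hook registry mismatch",
    "plugin(s) failed to initialize": "plugin validation failure",
    "EACCES: permission denied": "bind-mount permissions",
    "JsonFileReadError": "devices JSON read race",
    "Cannot find module": "missing runtime-deps bundle",
}

STUCK_AGE = re.compile(r"stuck session:.*?\bage=(\d+)s")


def triage(log_text: str) -> Counter:
    """Count how many log lines match each known signature."""
    counts = Counter()
    for line in log_text.splitlines():
        for needle, label in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts


def max_stuck_age(log_text: str) -> int:
    """Return the largest `age=<n>s` seen on a stuck-session line, or 0."""
    ages = [int(m.group(1)) for m in STUCK_AGE.finditer(log_text)]
    return max(ages, default=0)
```

Comparing the tallies from a 2026.4.23 run and a 2026.4.26 run of the same compose file would show which symptoms are new and which (such as the EACCES lines) disappear once ownership is fixed.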

What works on 2026.4.23 with the same setup

/healthz returns HTTP 200 in ~20ms
All plugins load without validation failures
Slack socket-mode connects within ~30s of ready
Session-write-lock acquired/released in milliseconds
No nativeHook registry mismatches
No plugin-runtime-deps install needed (.23 doesn't use that mechanism)

Diagnostic data we collected

gateway process state (Sl/Rl, CPU %, thread count, all wchan=0 sleeping)
TCP socket state (LISTEN + N CLOSE_WAIT accumulating)
Stability bundles in ~/.openclaw/logs/stability/ (only one from a MODULE_NOT_FOUND during very first .24 attempt; nothing for the dispatch deadlocks themselves)
openclaw plugins list output (6 plugins enabled — most plugins are still 2026.4.25 in the .26 release, only cerebras/migrate-claude/qqbot bumped to .26)
Full container logs from multiple startup attempts

Happy to attach files / run additional diagnostics on request.

What I tried and what did/didn't help

| Step | Effect |
|---|---|
| `chown -R node:node ~/.openclaw/{tasks,memory,flows,extensions,plugin-runtime-deps,node_modules}` | Fixes lots of unrelated EACCES errors but does not fix the dispatch deadlock |
| `chown node:node ~/.openclaw` (the bind-mount root, mode 0700) | Fixes the `[skills]` watcher EACCES storm and the `[heartbeat]` mkdir EACCES |
| `rm -rf ~/.openclaw/plugin-runtime-deps/<id>/.openclaw-runtime-mirror.lock/` | Unblocks plugin loading after a stuck-lock startup |
| `chown node:node ~/.openclaw/openclaw.json` | Fixes the "Missing config" restart loop after `openclaw doctor --fix` |
| `openclaw doctor --fix` (interactive) | Migrates legacy `embeddedHarness` → `agentRuntime`, but writes config with bad perms (see above) |
| Renaming `sessions.json` aside | Caused a different startup hang; restoring fixed that |
| `compose down && up --force-recreate` | Doesn't help; same regression |
| Rolling back to 2026.4.23 (retag local image, restore `openclaw.json` from pre-update commit, wipe stale `node_modules` + `plugin-runtime-deps`) | Fully restores working state |
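
The manual `rm -rf` of leftover mirror-lock directories can be scripted so that stale locks are found before the next startup waits on them. The following is a minimal sketch in Python, assuming it runs against the bind-mounted config directory while the gateway container is stopped. The lock directory name comes from this report; the 600-second staleness threshold and the function names are assumptions for illustration.

```python
import shutil
import time
from pathlib import Path

LOCK_DIR_NAME = ".openclaw-runtime-mirror.lock"  # name taken from the report


def find_mirror_locks(openclaw_home: Path) -> list[tuple[Path, float]]:
    """Return (lock_dir, age_in_seconds) for every mirror-lock directory found."""
    deps = openclaw_home / "plugin-runtime-deps"
    if not deps.is_dir():
        return []
    now = time.time()
    return [
        (lock, now - lock.stat().st_mtime)
        for lock in sorted(deps.glob(f"*/{LOCK_DIR_NAME}"))
        if lock.is_dir()
    ]


def remove_stale_locks(openclaw_home: Path, min_age_s: float = 600.0) -> list[Path]:
    """Delete lock directories older than `min_age_s`; return what was removed.

    Only call this while the gateway container is stopped, otherwise a live
    lock could be removed from under a running install.
    """
    removed = []
    for lock, age in find_mirror_locks(openclaw_home):
        if age >= min_age_s:
            shutil.rmtree(lock)
            removed.append(lock)
    return removed
```

For the compose file in this report, the directory to pass would be the host-side `./config` folder, since that is what is mounted at `/home/node/.openclaw`.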

Compose context

services:
  gateway:
    image: openclaw-bot:latest    # FROM ghcr.io/openclaw/openclaw:latest
    user: "0:0"
    entrypoint: ["/bin/sh", "/usr/local/bin/fix-codex-perms.sh", "docker-entrypoint.sh"]
    command: ["node", "dist/index.js", "gateway", "--bind", "lan", "--port", "18789"]
    volumes:
      - ./config:/home/node/.openclaw
      - ./workspace:/home/node/.openclaw/workspace
      # plus ./config/codex-config:/home/node/.codex etc.

fix-codex-perms.sh chowns ~/.codex/, ~/.openclaw/{tasks,memory,flows,cron,delivery-queue,node_modules,plugin-runtime-deps}, strips world-writable bits on ~/.openclaw/extensions/*/, then runuser -u node -- "$@".

Asks

Is this dispatch deadlock a known issue you're tracking? It's been present in three consecutive releases (.24, .25, .26).
Would adding ~/.openclaw/ itself (the bind-mount root) and ~/.openclaw/openclaw.json to whatever owns initial perm setup help, or is the breakage independent of that?
Any way to enable verbose dispatcher logging that would capture why the dispatcher is stuck post-ready?
Can you confirm whether the openai/gpt-5.4 provider is intended to work with ChatGPT OAuth (via codex runtime) or whether the auto-migration in .24+ should be conditional on actual API-key auth being present?

extent analysis

TL;DR

The report does not identify the root cause of the dispatch deadlock. The only confirmed remedy is rolling back to 2026.4.23, which works with the same setup. Candidates worth investigating are the nativeHook registry mismatch, the plugin runtime-deps install and mirror-lock behavior, and file permissions on the bind mount, although the reporter's permission fixes cleared the EACCES errors without clearing the deadlock.

Guidance

Investigate the nativeHook registry mismatch and plugin runtime dependencies installation issues, as they may be contributing to the dispatch deadlock.
Verify file permissions and ownership for the ~/.openclaw/ directory and its contents, and adjust them if necessary to ensure proper access.
Consider adding verbose dispatcher logging to capture more information about the deadlock.
Review the Compose context and fix-codex-perms.sh script to ensure that they are not contributing to the issue.

Example

No specific code snippet is provided, as the issue is complex and requires a more thorough investigation.

Notes

The issue is likely related to changes introduced in version 2026.4.24, and rolling back to version 2026.4.23 may be a temporary solution. However, it is essential to investigate and address the root causes of the issue to ensure a permanent fix.

Recommendation

Apply a workaround by rolling back to version 2026.4.23, which is known to work with the same setup, while investigating the root causes of the issue.
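
One caveat from the report applies to this rollback: 2026.4.24+ rewrites `agents.defaults.model.primary` in `openclaw.json`, and the change survives the downgrade. The following is a minimal sketch in Python for detecting that drift between a pre-update copy and the current file. It assumes `openclaw.json` is plain JSON and that the dotted key maps to nested objects; the function names are illustrative.

```python
import json
from pathlib import Path
from typing import Any, Optional


def get_path(config: dict, dotted: str) -> Optional[Any]:
    """Look up a dotted key such as 'agents.defaults.model.primary'."""
    node: Any = config
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def primary_model_changed(before: dict, after: dict) -> Optional[tuple]:
    """Return (old, new) if the primary model differs, else None."""
    key = "agents.defaults.model.primary"
    old, new = get_path(before, key), get_path(after, key)
    return (old, new) if old != new else None


def load(path: str) -> dict:
    """Read a JSON config file (assumes plain JSON, as the .json name suggests)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Running `primary_model_changed(load(old_path), load(new_path))` on the pre-update and current files returns a pair such as `("codex/gpt-5.4", "openai/gpt-5.4")` when the rewrite described in the report has happened, and `None` otherwise.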
