openclaw: [Bug] 2026.4.26 hangs in futex_wait_queue on docker restart of an existing container; fresh boot works

GitHub stats
openclaw/openclaw#75224 · Fetched 2026-05-01 05:36:41
Comments: 1 · Participants: 2 · Timeline: 2 · Reactions: 2
Timeline (top): closed ×1, commented ×1

Error Message

  • Gateway log file gets the [gateway] starting... line and then nothing further. No error, no panic, no shutdown signal.

Root Cause

No confirmed root cause. The report lists suspected cause categories only.

TL;DR

docker restart of an existing OpenClaw 2026.4.26 container deadlocks at [gateway] starting…. All Node threads sit idle in futex_wait_queue, the event loop is empty, and the gateway never reaches http server listening. A fresh image pull boots fine; the hang only manifests on restart of a previously-running 2026.4.26 container. We had to roll back to 2026.4.25 to recover.

Repro

  1. Run OpenClaw 2026.4.26 in a Docker container as the gateway foreground process. Initial boot is healthy (http server listening, channels online, normal traffic).
  2. After some uptime (we hit it after several hours), run docker restart on the container.
  3. The container starts. The gateway prints [gateway] starting... and never advances.
  4. Healthcheck never passes. No subsequent log lines. Channels never reconnect.
  5. Container stays in this state indefinitely.

This is reproducible across multiple restart attempts on the same container — we tried 5+ on 2026-04-29.
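
The repro steps can be scripted so that each restart attempt gets the same deadline. The sketch below is not OpenClaw tooling: it shells out to the real `docker restart` command, then polls the gateway port until it accepts a TCP connection. It assumes the port 18789 from the curl check in this report is published on the host, and the 300-second deadline is an assumed value chosen to exceed the 2-4 minute healthy boot time mentioned in this report.

```python
import socket
import subprocess
import time
from typing import Callable


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def wait_until(probe: Callable[[], bool], deadline_s: float, interval_s: float = 1.0) -> bool:
    """Poll probe() until it returns True or deadline_s seconds have elapsed."""
    end = time.monotonic() + deadline_s
    while True:
        if probe():
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval_s)


def restart_and_check(container: str, port: int = 18789, deadline_s: float = 300.0) -> bool:
    """Run `docker restart <container>`, then report whether the port ever opens.

    The port and deadline defaults are assumptions taken from this report,
    not values documented by OpenClaw.
    """
    subprocess.run(["docker", "restart", container], check=True)
    return wait_until(lambda: port_is_listening("localhost", port), deadline_s)
```

Calling `restart_and_check("<container name>")` returns False on a hung boot instead of waiting indefinitely, which makes repeated attempts easy to log.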

Symptoms / diagnostic signals

  • Process is alive, just blocked. top / ps show the Node process consuming no CPU.
  • Inspecting threads via strace / gdb / top -H (or docker exec ... top) shows all 11 Node threads idle in futex_wait_queue. Nothing is ever woken.
  • Event loop appears empty — no timers, no I/O, no pending work surfaces in any inspection we tried.
  • Gateway log file gets the [gateway] starting... line and then nothing further. No error, no panic, no shutdown signal.
  • The healthcheck endpoint never responds; curl localhost:18789 fails with connection refused.
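
One way to capture the per-thread state without strace or gdb is to read the Linux `/proc/<pid>/task/<tid>/wchan` files, which name the kernel function each sleeping thread is waiting in. The sketch below is a minimal helper, assuming it runs somewhere the gateway's PID is visible (for example inside the container, since a macOS host has no `/proc`). The `proc_root` parameter is a hypothetical convenience for testing, and some kernels report `0` here for unprivileged readers.

```python
from collections import Counter
from pathlib import Path


def thread_wait_channels(pid: int, proc_root: str = "/proc") -> dict[int, str]:
    """Map each thread ID of `pid` to the kernel symbol it is sleeping in."""
    channels: dict[int, str] = {}
    for task in sorted(Path(proc_root, str(pid), "task").iterdir()):
        try:
            wchan = (task / "wchan").read_text().strip()
        except OSError:
            wchan = "?"  # thread exited mid-scan or permission was denied
        # An empty file is normalized to "0", the kernel's "not waiting" value.
        channels[int(task.name)] = wchan or "0"
    return channels


def summarize(channels: dict[int, str]) -> Counter:
    """Count how many threads are parked in each wait channel."""
    return Counter(channels.values())
```

On a hung instance matching this report, `summarize(thread_wait_channels(pid))` would be expected to show every thread under `futex_wait_queue`.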

Things that did NOT recover the hang

We tried each of these and re-attempted the restart; none made the gateway progress past [gateway] starting...:

  • Removing the Chrome WhatsApp profile LOCK files (in case a stale Chrome lock was wedging startup).
  • Removing stale app-state-sync-version-*.json files.
  • Removing stale sqlite-WAL files alongside the channel auth DB.
  • Disabling broken plugins (audit-webhook, deepseek).
  • Editing the plugins.allow whitelist in openclaw.json. This actually made things worse: a subsequent fresh-pull boot also hung at the same spot, which suggests plugins.allow interacts with whatever causes the deadlock.
  • docker compose down + docker compose up -d (full container teardown — same hang).
  • docker kill followed by docker compose up -d (SIGKILL force-stop — same hang on next boot).

Recovery path that worked

Rolled back to 2026.4.25 by editing ~/openclaw/.env to pin the older image tag and running docker compose up -d. 2026.4.25 booted cleanly, taking 2-4 minutes. As a separate observation, healthy boot times in 2026.4.x are noticeably longer than in older versions, which is worth flagging in the release notes if not already known.
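
The .env edit can be done programmatically so the rollback is repeatable. The report does not say which variable in ~/openclaw/.env holds the image tag, so the key `OPENCLAW_IMAGE_TAG` in the usage note is purely hypothetical; substitute whatever key your compose file actually reads. This is a minimal sketch for simple `KEY=value` lines, not a full dotenv parser.

```python
import re


def pin_env_value(env_text: str, key: str, value: str) -> str:
    """Set KEY=value in dotenv-style text, replacing the first matching line
    or appending a new line if the key is absent."""
    pattern = re.compile(rf"^{re.escape(key)}=.*$", flags=re.MULTILINE)
    line = f"{key}={value}"
    if pattern.search(env_text):
        # A callable replacement avoids backslash interpretation in `value`.
        return pattern.sub(lambda _match: line, env_text, count=1)
    separator = "" if (not env_text or env_text.endswith("\n")) else "\n"
    return f"{env_text}{separator}{line}\n"
```

For example, `pin_env_value(text, "OPENCLAW_IMAGE_TAG", "2026.4.25")` rewrites the tag line, after which `docker compose up -d` picks up the pinned version.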

Suspected cause categories (we don't have a confirmed root cause)

We didn't get to a definitive root cause, but the symptom set narrows things:

  • All threads in futex_wait_queue strongly suggests every thread is blocked on a mutex/condvar that's never going to be signaled. Classic deadlock or missing wake-up.
  • [gateway] starting... is the last line — this is from the early-boot phase, BEFORE plugin loading completes (channels never initialize, MCP never registers, control UI never serves). The hang is in early init.
  • Fresh boot vs restart asymmetry suggests something in the persistent state (config? lockfiles? sqlite WAL? channel auth state?) is differently shaped after a previous run, and the new boot's init code path encounters that state and waits on a primitive that won't be released.
  • plugins.allow worsening the situation suggests plugin-loading is in or near the critical section of the deadlock.

This is not enough to pinpoint the cause, but a heap dump from a hung instance, or a debugger attached after kill -USR1 <pid> (the signal that tells a running Node process to start its V8 inspector), would likely localize it within minutes for someone with the source map.
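
The fresh-boot versus restart asymmetry could also be narrowed from the outside by snapshotting the persistent state directory right after a healthy fresh boot and again before a failing restart, then diffing the two. The sketch below assumes the state lives in a directory you can read from the host or container; the report does not give its path, so any path passed in is your own. It reads whole files into memory, which is fine for configs and small databases but not for very large files.

```python
import hashlib
from pathlib import Path


def snapshot(state_dir: str) -> dict[str, tuple[int, str]]:
    """Map each file's relative path to (size in bytes, SHA-256 hex digest)."""
    root = Path(state_dir)
    result: dict[str, tuple[int, str]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            result[str(path.relative_to(root))] = (len(data), hashlib.sha256(data).hexdigest())
    return result


def diff_snapshots(before: dict, after: dict) -> dict[str, list[str]]:
    """List files that were added, removed, or changed between two snapshots."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return {"added": added, "removed": removed, "changed": changed}
```

Files that show up under "added" or "changed" after a run (lockfiles, WAL files, auth state) are the candidates to remove one at a time when bisecting the hang.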

Why we're filing this separately

Mentioned in passing in issue #75153 as context (we rolled back to 2026.4.25 to fix the WhatsApp channel issue described there), but this hang deserves its own surface for two reasons:

  1. Separate root cause: the WhatsApp channel-stopped-after-DNS-blip in #75153 is a channel-runtime retry-budget issue, while the 2026.4.26 hang is a gateway-init deadlock. These are different areas of the code with different fix paths, so there is no point bundling them.
  2. Other operators are likely hitting this too: a hang-on-restart on a recent release would affect anyone running a long-uptime container who triggers a restart for any reason (config change, host reboot, etc.). That is worth a discoverable bug record.

Environment

  • OpenClaw 2026.4.26
  • Docker on macOS Sequoia (host: Mac Studio M2 Ultra)
  • Single-tenant, gateway-as-foreground container
  • Channels active at the time of the failed restart: WhatsApp, Discord
  • Plugins enabled: standard bundled set + (briefly) audit-webhook, deepseek — we tried disabling both, no effect

What would help us help you

If a future build can either:

  • Print a thread dump or other node --inspect-friendly state around the early-boot critical section, or
  • Add a 30-second watchdog around the starting -> http server listening transition that bails with a panic instead of hanging silently,

…we'd have actionable debugging info instead of a black-box silence next time. Right now operators have no signal between "starting…" and either "running" or "rolled back to a previous version after wasting 90+ minutes."
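
The watchdog idea can be illustrated in a few lines. The sketch below is in Python and is not OpenClaw code (the gateway runs on Node); it only shows the pattern: arm a timer when startup begins, cancel it when the server is listening, and otherwise dump every thread's stack and exit non-zero. The class name and the default action are hypothetical. Whether an equivalent in-process timer would fire in the hung gateway depends on where it is blocked, which this report does not establish.

```python
import faulthandler
import os
import sys
import threading
from typing import Callable, Optional


class StartupWatchdog:
    """Run `on_timeout` unless `ready()` is called within `timeout_s` seconds."""

    def __init__(self, timeout_s: float, on_timeout: Optional[Callable[[], None]] = None):
        self._timer = threading.Timer(timeout_s, on_timeout or self._dump_and_exit)
        self._timer.daemon = True  # never keep the process alive just for the watchdog

    def start(self) -> "StartupWatchdog":
        self._timer.start()
        return self

    def ready(self) -> None:
        """Call once the server is listening; disarms the watchdog."""
        self._timer.cancel()

    @staticmethod
    def _dump_and_exit() -> None:
        sys.stderr.write("startup watchdog: not listening before deadline; dumping threads\n")
        faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
        os._exit(1)  # hard exit so a supervisor or healthcheck sees the failure
```

Usage would be `watchdog = StartupWatchdog(30.0).start()` before the early-boot section and `watchdog.ready()` right after the listening log line, so a silent hang becomes a logged stack dump and a failed exit.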

Happy to capture more diagnostics next time if we hit it again — let us know what would be most useful (heap dump, strace, gdb backtrace from a container with debug symbols, etc).

extent analysis

TL;DR

The only confirmed recovery for the OpenClaw 2026.4.26 restart hang is to roll back to 2026.4.25; the reporter's gateway booted cleanly on that version. No fix for 2026.4.26 itself is identified in the issue.

Guidance

  • To troubleshoot the issue, capture a thread dump or use node --inspect to analyze the early-boot critical section and identify the cause of the deadlock.
  • Consider adding a watchdog around the starting -> http server listening transition to detect and handle hangs.
  • If the issue recurs, collect diagnostics such as a heap dump, strace, or gdb backtrace from a container with debug symbols to aid in debugging.
  • Be cautious when editing plugins.allow in openclaw.json, as this has been observed to worsen the situation.

Example

The original issue does not include a code snippet; the hang is specific to OpenClaw 2026.4.26 running under Docker.

Notes

The root cause of the issue is still unknown, but the symptoms suggest a deadlock or missing wake-up in the early-boot phase. The fact that a fresh boot works fine, but a restart hangs, implies that something in the persistent state is causing the issue.

Recommendation

Roll back to 2026.4.25 as a workaround until the root cause is identified and a fix is available.
