openclaw - ✅(Solved) Fix Post-upgrade stability regressions — v2026.4.5 (3e72c03) [1 pull requests, 1 comments, 1 participants]

GitHub stats
openclaw/openclaw#62095 · Fetched 2026-04-08 03:09:04
Comments: 1 · Participants: 1 · Timeline: 4 · Reactions: 0
Timeline (top): cross-referenced ×2, commented ×1, subscribed ×1

Error Message

openclaw node run --host 192.168.x.x --port 18789 now fails with SECURITY ERROR: Cannot connect over plaintext ws://. This is a new check that broke existing setups where gateway binds to loopback but node was configured with the LAN IP. The node crash-looped until the plist was manually edited to use 127.0.0.1. Could use a clearer migration note or auto-detection when both processes are on the same machine.

Gateway reached 1.5GB RAM and 47% CPU within a few hours of running. Contributing factors: 379 accumulated session files (187MB), 167MB error log, zombie WebSocket connections. May be a connection/session cleanup regression: sessions and connections don't seem to be released properly.
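
The SECURITY ERROR above depends on whether the node's --host value is a loopback address or a LAN address. The following is a minimal sketch of that distinction using Python's standard ipaddress module. It is not OpenClaw's actual check, which the report does not show. classify_host is a hypothetical helper, and 192.168.1.20 is a made-up sample standing in for the elided 192.168.x.x.

```python
import ipaddress


def classify_host(host: str) -> str:
    """Label an IP literal as loopback, private-lan, or public."""
    addr = ipaddress.ip_address(host)
    # Check loopback first: Python also reports 127.0.0.0/8 as private.
    if addr.is_loopback:
        return "loopback"
    if addr.is_private:
        return "private-lan"
    return "public"


# Sample values only; substitute the host from your own node configuration.
for host in ("127.0.0.1", "192.168.1.20"):
    print(host, "->", classify_host(host))
```

A check like this could be run over a service definition before upgrading, to flag any node that points at a LAN IP while the gateway binds to loopback on the same machine.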

Fix Action

Fix / Workaround

2. Subagent announce timeout defaults to 120s: causes gateway hangs
When the main session is busy processing a turn, subagent completion announcements block for 120s per attempt, then retry 4x. That's ~8 minutes of gateway pressure per failed announce. Had 119 announce timeouts in one day. This causes iMessage to stop responding and webchat to disconnect. Workaround: set agents.defaults.subagents.announceTimeoutMs: 15000. Suggest a much lower default (15-30s): if the main session is busy, fail fast.

4. Slack health-monitor stale-socket reconnection loop
Both Slack accounts reconnected every ~35 minutes due to health-monitor: restarting (reason: stale-socket). This ran continuously all day, adding gateway churn. Each reconnect also triggered channel resolve failed: missing_scope errors. Workaround: disabled Slack entirely.
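
The "~8 minutes" figure in item 2 follows from the timeout multiplied by the number of attempts. The sketch below works through that arithmetic for the default and for the 15000 ms workaround. It assumes "retry 4x" means four attempts in total, which is the reading that matches the reported figure. announce_block_seconds is a hypothetical helper, not an OpenClaw function.

```python
def announce_block_seconds(timeout_ms: int, attempts: int) -> float:
    """Worst-case time one failed announce can occupy the gateway, in seconds."""
    return timeout_ms / 1000 * attempts


# Assumption: "retry 4x" is read as four attempts in total.
ATTEMPTS = 4

default_s = announce_block_seconds(120_000, ATTEMPTS)
workaround_s = announce_block_seconds(15_000, ATTEMPTS)

print(f"default:    {default_s:.0f} s ({default_s / 60:.0f} min)")
print(f"workaround: {workaround_s:.0f} s ({workaround_s / 60:.0f} min)")
```

Under that assumption the default blocks for 480 s (8 min) per failed announce, while the workaround caps it at 60 s (1 min).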

PR fix notes

PR #8: fix(evals): pin openclaw 2026.4.2, restore log-based detection, add group/cross-convo evals

Description (problem / solution / changelog)

Summary

Fix eval infrastructure hanging caused by two issues:

  • OpenClaw v2026.4.5 has a confirmed memory leak and 12+ open regressions (Issue #62095). Pinned to v2026.4.2, the last stable release.
  • Container readiness detection was rewritten to poll file logs and send probe DMs requiring LLM inference. Reverted to the original stream-based docker logs -f approach, which works correctly.

New eval capabilities:

  • Group conversations with bystander agents (EVAL-006 updated, EVAL-010, EVAL-011 new)
  • Cross-conversation information leak probes (EVAL-008 updated with real cross-convo probe)
  • Deterministic pass/fail checks (skip LLM judge for obvious results)
  • Full transcript tracking for multi-turn and cross-conversation scenarios
  • SessionKey fix: uses actual chat type (group/dm) instead of hardcoding dm

Code quality:

  • Client cleanup moved to finally block (prevents resource leaks on early exit)
  • Bystander registration parallelized via Promise.all
  • Removed unused crossConversationExpected type field
  • Fixed "connected" pattern to "connected as" to avoid false-matching on "disconnected"
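
The last item is a substring-matching pitfall: "disconnected" contains "connected". The PR's code is TypeScript and is not shown here, so this is a Python sketch of the same idea. The two log lines are invented samples, not real OpenClaw output.

```python
import re

# Invented sample lines for illustration only.
lines = [
    "agent connected as eval-agent",
    "agent disconnected",
]

loose = re.compile(r"connected")      # also matches inside "disconnected"
strict = re.compile(r"connected as")  # requires the trailing " as"

loose_hits = [line for line in lines if loose.search(line)]
strict_hits = [line for line in lines if strict.search(line)]

print("loose: ", loose_hits)
print("strict:", strict_hits)
```

The loose pattern flags both lines, so a readiness check built on it could treat a disconnect as a successful connection. The stricter pattern matches only the first line.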

Test plan

  • Build passes (all 6 packages)
  • 193 unit tests pass across 5 packages (protocol, cli, server-core, openclaw-channel)
  • Lint: 0 errors, 0 warnings
  • E2E evals verified: 7/11 pass (4 failures are agent behavior, not infra)

🤖 Generated with Claude Code

Changed files

  • packages/evals/Dockerfile.eval-agent (modified, +1/-1)
  • packages/evals/scripts/build-eval-agent.sh (modified, +1/-1)
  • packages/evals/src/e2e-infra/llm-judge.ts (modified, +53/-6)
  • packages/evals/src/e2e-infra/model-config.ts (modified, +6/-0)
  • packages/evals/src/e2e-infra/runner.ts (modified, +161/-10)
  • packages/evals/src/e2e-infra/scenarios.ts (modified, +63/-11)
  • packages/evals/src/e2e-infra/types.ts (modified, +20/-0)
  • packages/openclaw-channel/src/openclaw-entry.inbound-contract.test.ts (modified, +1/-0)
  • packages/openclaw-channel/src/openclaw-entry.ts (modified, +1/-1)
  • packages/openclaw-channel/src/test-utils/container-core.ts (modified, +2/-2)

Post-upgrade stability issues — v2026.4.5 (3e72c03)

Upgraded to 2026.4.5 this morning. System was stable before the upgrade. Experienced 10 gateway restarts in ~8 hours due to several issues. Environment: Mac Studio M3 Ultra, BlueBubbles iMessage, local loopback gateway.

1. doctor --fix doesn't fix its own warnings
doctor --fix reports legacy config keys (channels.slack.channels.<id>.allow, messages.tts.<provider>, plugins.entries.voice-call.config.tts.<provider>) and tells you to run doctor --fix, but running it doesn't actually fix them. Had to manually edit openclaw.json to rename allow to enabled and nest TTS provider configs under providers. The --fix flag should handle these migrations automatically.

2. Subagent announce timeout defaults to 120s: causes gateway hangs
When the main session is busy processing a turn, subagent completion announcements block for 120s per attempt, then retry 4x. That's ~8 minutes of gateway pressure per failed announce. Had 119 announce timeouts in one day. This causes iMessage to stop responding and webchat to disconnect. Workaround: set agents.defaults.subagents.announceTimeoutMs: 15000. Suggest a much lower default (15-30s): if the main session is busy, fail fast.

3. Node refuses plaintext WS to private LAN IPs (new security check)
openclaw node run --host 192.168.x.x --port 18789 now fails with SECURITY ERROR: Cannot connect over plaintext ws://. This is a new check that broke existing setups where gateway binds to loopback but node was configured with the LAN IP. The node crash-looped until the plist was manually edited to use 127.0.0.1. Could use a clearer migration note or auto-detection when both processes are on the same machine.

4. Slack health-monitor stale-socket reconnection loop
Both Slack accounts reconnected every ~35 minutes due to health-monitor: restarting (reason: stale-socket). This ran continuously all day, adding gateway churn. Each reconnect also triggered channel resolve failed: missing_scope errors. Workaround: disabled Slack entirely.

5. Gateway memory growth: 1.5GB within hours
Gateway reached 1.5GB RAM and 47% CPU within a few hours of running. Contributing factors: 379 accumulated session files (187MB), 167MB error log, zombie WebSocket connections. May be a connection/session cleanup regression: sessions and connections don't seem to be released properly.
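
Item 5 quotes file counts and sizes (379 session files, 187MB). The sketch below shows one way to measure a directory the same way before deciding what to clean up. The report does not say where OpenClaw stores sessions or logs, so the directory is left as a parameter. dir_usage and format_mb are hypothetical helpers.

```python
from __future__ import annotations

from pathlib import Path


def dir_usage(root: Path) -> tuple[int, int]:
    """Return (file_count, total_bytes) for regular files under root."""
    files = [p for p in root.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


def format_mb(num_bytes: int) -> str:
    """Render a byte count in decimal megabytes."""
    return f"{num_bytes / 1_000_000:.1f} MB"


# Usage (the path is a placeholder; point it at your own session directory):
# count, size = dir_usage(Path("/path/to/sessions"))
# print(f"{count} files, {format_mb(size)}")
```

Recording these numbers at intervals after a restart would show whether session files keep accumulating, which is what the suspected cleanup regression predicts.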

extent analysis

TL;DR

Downgrade to a previous version (PR #8 pins v2026.4.2 as the last stable release) or apply the per-issue workarounds: lower the subagent announce timeout, point the node at 127.0.0.1, and edit openclaw.json by hand to migrate the legacy config keys.

Guidance

  • Manually edit openclaw.json to rename legacy config keys, such as allow to enabled and nest TTS provider configs under providers, as the doctor --fix flag does not handle these migrations automatically.
  • Set agents.defaults.subagents.announceTimeoutMs to a lower value, such as 15000-30000 (15-30s), to prevent gateway hangs caused by subagent announce timeouts.
  • Update the node configuration to use 127.0.0.1 instead of the LAN IP to avoid the new security check that refuses plaintext WS connections to private LAN IPs.
  • Disable Slack or investigate the health-monitor stale-socket reconnection loop to reduce gateway churn.
  • Monitor and clean up accumulated session files and error logs to limit gateway memory growth; the report cites 379 session files (187MB) and a 167MB error log as contributing factors.
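
The announce-timeout bullet above names a dotted config path. The sketch below writes such a path into a nested dictionary. It assumes openclaw.json is plain JSON and that the dotted path maps onto nested objects; the report confirms neither. set_dotted is a hypothetical helper, and the starting config is a stand-in rather than a real file.

```python
import json
from typing import Any


def set_dotted(config: dict, dotted_key: str, value: Any) -> dict:
    """Set config[a][b][c] = value for dotted_key "a.b.c", creating parents."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


# Stand-in for the parsed contents of openclaw.json (assumed layout).
config = {"agents": {"defaults": {}}}
set_dotted(config, "agents.defaults.subagents.announceTimeoutMs", 15000)

print(json.dumps(config, indent=2))
```

If the real file uses a different layout, only the mapping from path to structure changes; the value to set is still 15000.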

Example

The issue calls for configuration edits and workarounds rather than a specific code change.

Notes

The provided workarounds may not fix all issues, and a more thorough investigation may be required to address the underlying causes. Additionally, downgrading to a previous version may not be feasible or desirable in all cases.

Recommendation

Apply workarounds, as downgrading to a previous version may not be a viable long-term solution, and the workarounds provided can help mitigate the specific issues mentioned in the problem report.
