openclaw: Gateway restart treats managed listener as port conflict; embedded Codex direct session can wedge during compaction/rebind

I am seeing two related reliability problems in OpenClaw 2026.5.22 on macOS with a LaunchAgent-managed local gateway on 127.0.0.1:18789.

  1. openclaw gateway restart can fail because port 18789 is already occupied by the expected managed OpenClaw gateway process. The command reports this as a port-busy failure rather than treating that process as the restart target.
  2. A Telegram direct session backed by embedded Codex can wedge during app-server compaction/rebinding. The gateway and Telegram transport remain healthy, but the user-facing assistant session appears stalled/dead.

These two failure modes are especially damaging when making configuration changes from an active messaging session, because the operator can lose, or severely delay, the communication path used to supervise recovery.

Error Message

Gateway restart failed: Error: gateway port 18789 is still busy before LaunchAgent restart
Port 18789 is already in use.

  • pid <pid> <user>: /opt/homebrew/opt/node/bin/node /opt/homebrew/lib/node_modules/openclaw/dist/index.js gateway --port 18789 (127.0.0.1:18789)

Environment

  • OpenClaw version: 2026.5.22
  • Runtime: Node 25.9.0
  • Platform: macOS 26.4 arm64
  • Gateway: LaunchAgent-managed local gateway
  • Gateway bind: 127.0.0.1:18789
  • Channel involved: Telegram polling
  • Agent runtime involved: embedded Codex app-server

Failure Mode 1: openclaw gateway restart false port-busy failure

Observed behavior:

Gateway restart failed: Error: gateway port 18789 is still busy before LaunchAgent restart
Port 18789 is already in use.
- pid <pid> <user>: /opt/homebrew/opt/node/bin/node /opt/homebrew/lib/node_modules/openclaw/dist/index.js gateway --port 18789 (127.0.0.1:18789)

The process holding the port is the expected managed OpenClaw gateway process.

Expected behavior:

  • If port 18789 is held by the managed LaunchAgent gateway, openclaw gateway restart should treat that process as the restart target, not an unexpected blocker.
  • The command should perform a graceful launchd/internal handoff and wait for the replacement gateway to become reachable.
  • If active work makes restart unsafe, the CLI should explicitly defer/refuse non-destructively and keep the gateway running.
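
The expected behavior above amounts to classifying the port owner before calling the port "busy". A minimal Python sketch of that decision, assuming the CLI can obtain both the PID listening on the port and the PID of the managed LaunchAgent gateway. The names `PortOwner`, `RestartDecision`, `classify_restart`, and `has_active_work` are hypothetical and do not come from OpenClaw's code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RestartDecision(Enum):
    START_FRESH = "start_fresh"        # nothing is listening; just start
    RESTART_TARGET = "restart_target"  # the listener is the managed gateway
    DEFER = "defer"                    # managed gateway has active work; keep it running
    CONFLICT = "conflict"              # some other process owns the port


@dataclass(frozen=True)
class PortOwner:
    pid: int
    command: str


def classify_restart(
    owner: Optional[PortOwner],
    managed_pid: Optional[int],
    has_active_work: bool = False,
) -> RestartDecision:
    """Decide how a restart should treat the current listener on the gateway port."""
    if owner is None:
        return RestartDecision.START_FRESH
    if managed_pid is not None and owner.pid == managed_pid:
        # The listener is the very process we were asked to restart.
        if has_active_work:
            return RestartDecision.DEFER
        return RestartDecision.RESTART_TARGET
    # Unknown managed PID or a different owner: a genuine conflict.
    return RestartDecision.CONFLICT
```

Only the `CONFLICT` outcome would justify the "Port 18789 is already in use" message; the other three outcomes map to the start, handoff, and defer paths described in the list.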

Actual behavior:

  • The expected running gateway is reported as a blocking port conflict.
  • This pushes the operator toward manual stop/start/recovery paths.
  • In an active messaging setup, this can break communication continuity or make recovery appear unreliable.

Failure Mode 2: embedded Codex direct session wedges during compaction/rebind

After the gateway was healthy again and Telegram was connected, the user-visible direct session still stalled.

Current health at the time of investigation:

gateway pid: <pid>
gateway rpc reachable: true
port 18789: busy, held by expected gateway
Telegram channel: running
Telegram accounts: connected

Relevant log signatures:

started codex app-server compaction

stalled session:
sessionKey=agent:main:telegram:direct:<redacted>
state=processing
queueDepth=2
reason=active_work_without_progress
classification=stalled_agent_run
activeWorkKind=embedded_run
lastProgress=embedded_run:started

codex app-server compaction timed out; restarting app-server

codex app-server compaction could not use thread binding
reason="thread not found: <thread-id>"
recovery=stale_thread_binding

native harness compaction could not use its session binding; falling back to context engine:
thread not found: <thread-id>
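
For triage, the key=value lines in these excerpts can be turned into a dictionary and checked for the stalled signature. This is a rough Python sketch that assumes the log format is exactly as quoted above; `parse_fields` and `is_wedged_session` are hypothetical helper names, not part of OpenClaw.

```python
def parse_fields(lines):
    """Collect key=value pairs from one log excerpt, skipping free-text lines."""
    fields = {}
    for line in lines:
        key, sep, value = line.strip().partition("=")
        if not sep or not key or " " in key:
            continue  # e.g. "stalled session:" carries no key=value pair
        fields[key] = value.strip('"')  # drop quotes around values like reason="..."
    return fields


def is_wedged_session(fields):
    """True when the fields match the stalled-session signature quoted above."""
    return (
        fields.get("state") == "processing"
        and fields.get("classification") == "stalled_agent_run"
        and fields.get("reason") == "active_work_without_progress"
    )
```

Each excerpt should be parsed on its own, because the `reason` key appears in more than one excerpt with different meanings.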

Expected behavior:

  • Compaction/rebind should either complete or recover into a fresh valid session binding.
  • A stale thread binding should not leave new inbound messages queued behind a dead/stale embedded run.
  • openclaw status --deep should clearly distinguish "gateway/channel healthy" from "agent session wedged".
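
The layer distinction requested in the last item could be reported as a single verdict that names the first unhealthy layer. This is an illustrative Python sketch only; `summarize_health` and its return values are assumptions, not the actual output of openclaw status --deep.

```python
def summarize_health(gateway_rpc_reachable, channel_connected, wedged_session_keys):
    """Return the first unhealthy layer, checking transport layers before agent sessions."""
    if not gateway_rpc_reachable:
        return "gateway_down"
    if not channel_connected:
        return "channel_down"
    if wedged_session_keys:
        # Gateway and channel are fine, yet at least one agent session is stuck.
        return "agent_session_wedged"
    return "ok"
```

With the health shown in this report (RPC reachable, Telegram connected, one stalled direct session), the sketch returns "agent_session_wedged" rather than "ok".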

Actual behavior:

  • The direct messaging session entered a stalled processing state.
  • Repeated stalled-session warnings appeared.
  • The gateway and Telegram transport were healthy, but the assistant appeared delayed/dead to the user.

Related Context

The incident occurred around a configuration change touching messages.tts. Logs showed gateway reload detection for the relevant config paths, but a stale status read led to a restart attempt.

Even if the operator should avoid restarting from inside an active messaging session, the runtime should still be robust against:

  • false port-busy restart failures when the port owner is the managed gateway itself
  • stale embedded Codex thread bindings after compaction/app-server restart
  • direct-session queues stuck behind a stale run

Suggested Fix Criteria

For gateway restart:

  • Distinguish expected managed gateway listener from unexpected port conflict.
  • Restart the managed LaunchAgent gateway gracefully when it owns the configured port.
  • If active work exists, produce a clear deferred/refused state rather than a misleading port-busy error.
  • Keep enough state to verify replacement PID/RPC readiness before returning success.
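
The last criterion, verifying replacement PID and RPC readiness before returning success, can be sketched as a bounded polling loop. The Python below is an assumption-laden illustration: `get_gateway_pid` and `rpc_reachable` are hypothetical probes supplied by the caller, and the timeout values are arbitrary.

```python
import time
from typing import Callable, Optional


def wait_for_replacement(
    old_pid: int,
    get_gateway_pid: Callable[[], Optional[int]],
    rpc_reachable: Callable[[], bool],
    timeout_s: float = 30.0,
    poll_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Return the replacement PID once it answers RPC, or None on timeout."""
    deadline = clock() + timeout_s
    while True:
        pid = get_gateway_pid()
        # Success requires a *different* PID that is also reachable over RPC.
        if pid is not None and pid != old_pid and rpc_reachable():
            return pid
        if clock() >= deadline:
            return None
        sleep(poll_s)
```

Remembering `old_pid` is the "enough state" part: without it, a still-running old gateway that answers RPC would be mistaken for a successful restart.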

For embedded Codex session recovery:

  • Automatically recover from stale_thread_binding / thread not found.
  • Rebind the direct session to a fresh valid thread or clear/requeue stale work safely.
  • Ensure new inbound channel messages are not blocked behind a dead embedded run.
  • Surface wedged agent sessions in openclaw status --deep even when gateway/channel health is OK.
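
One possible shape for the recovery criteria above: when a run fails because the bound thread no longer exists, discard the binding, create a fresh thread, and retry the same queued message so nothing is lost or stuck. This Python sketch uses invented names throughout (`ThreadNotFound`, `DirectSession`, `drain_queue`, `run_on_thread`, `create_thread`) and does not reflect the real Codex app-server client API.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional


class ThreadNotFound(Exception):
    """Raised by the (hypothetical) app-server client when a thread binding is stale."""


@dataclass
class DirectSession:
    thread_id: Optional[str]
    queue: Deque[str] = field(default_factory=deque)
    rebinds: int = 0


def drain_queue(
    session: DirectSession,
    run_on_thread: Callable[[str, str], str],
    create_thread: Callable[[], str],
    max_rebinds: int = 1,
) -> List[str]:
    """Process queued messages, rebinding to a fresh thread if the binding is stale."""
    replies = []
    while session.queue:
        message = session.queue[0]  # peek: drop the message only after it succeeds
        attempts = 0
        while True:
            if session.thread_id is None:
                session.thread_id = create_thread()
            try:
                replies.append(run_on_thread(session.thread_id, message))
                break
            except ThreadNotFound:
                if attempts >= max_rebinds:
                    raise  # surface the failure instead of retrying forever
                attempts += 1
                session.rebinds += 1
                session.thread_id = None  # discard the stale binding
        session.queue.popleft()
    return replies
```

Peeking instead of popping keeps the inbound message queued if recovery itself fails, and the `max_rebinds` cap turns a persistent failure into a visible error rather than a silent stall.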

Why this matters

For messaging-first OpenClaw setups, communication continuity is the control plane. A restart or session compaction failure that leaves the channel technically connected but the agent session wedged looks like total assistant failure to the user. The diagnostics need to make that layer distinction clear, and the runtime should recover automatically where possible.
