openclaw: Gateway restart treats managed listener as port conflict; embedded Codex direct session can wedge during compaction/rebind

I am seeing two related reliability problems in OpenClaw 2026.5.22 on macOS with a LaunchAgent-managed local gateway on 127.0.0.1:18789.

`openclaw gateway restart` can fail because port 18789 is already held by the expected managed OpenClaw gateway process. The command reports a port-busy failure instead of treating that process as the restart target.
A Telegram direct session backed by embedded Codex can wedge during app-server compaction/rebinding. The gateway and Telegram transport remain healthy, but the user-facing assistant session appears stalled or dead.

These two failure modes are especially damaging when configuration changes are made from an active messaging session, because the operator can lose, or see severe delays on, the communication path used to supervise recovery.

Why this matters

For messaging-first OpenClaw setups, communication continuity is the control plane. A restart or session compaction failure that leaves the channel technically connected but the agent session wedged looks like total assistant failure to the user. The diagnostics need to make that layer distinction clear, and the runtime should recover automatically where possible.

Environment

OpenClaw version: 2026.5.22
Runtime: Node 25.9.0
Platform: macOS 26.4 arm64
Gateway: LaunchAgent-managed local gateway
Gateway bind: 127.0.0.1:18789
Channel involved: Telegram polling
Agent runtime involved: embedded Codex app-server

Failure Mode 1: `openclaw gateway restart` false port-busy failure

Observed behavior:

Gateway restart failed: Error: gateway port 18789 is still busy before LaunchAgent restart
Port 18789 is already in use.
- pid <pid> <user>: /opt/homebrew/opt/node/bin/node /opt/homebrew/lib/node_modules/openclaw/dist/index.js gateway --port 18789 (127.0.0.1:18789)

The process holding the port is the expected managed OpenClaw gateway process.

Expected behavior:

If port 18789 is held by the managed LaunchAgent gateway, `openclaw gateway restart` should treat that process as the restart target, not as an unexpected blocker.
The command should perform a graceful launchd/internal handoff and wait for the replacement gateway to become reachable.
If active work makes restart unsafe, the CLI should explicitly defer/refuse non-destructively and keep the gateway running.
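The expected behavior above amounts to a small decision table. The following is a minimal sketch in Python, not OpenClaw's actual implementation: `plan_restart`, `RestartPlan`, and every parameter name are hypothetical, and it assumes the CLI can already determine the PID listening on the configured port and the PID the service manager reports for the managed gateway.

```python
from enum import Enum
from typing import Optional


class RestartPlan(Enum):
    START_FRESH = "start_fresh"         # nothing is listening, just start
    HANDOFF = "graceful_handoff"        # the listener is the managed gateway
    DEFER = "defer_active_work"         # managed gateway is busy, keep it running
    CONFLICT = "unexpected_port_owner"  # some other process holds the port


def plan_restart(listener_pid: Optional[int],
                 managed_pid: Optional[int],
                 active_runs: int) -> RestartPlan:
    """Decide what a restart should do, given who holds the port.

    Hypothetical inputs: listener_pid is the PID bound to the configured
    port (None if the port is free); managed_pid is the PID reported for
    the managed gateway job (None if it is not running).
    """
    if listener_pid is None:
        return RestartPlan.START_FRESH
    if managed_pid is None or listener_pid != managed_pid:
        return RestartPlan.CONFLICT
    if active_runs > 0:
        return RestartPlan.DEFER
    return RestartPlan.HANDOFF
```

Under these assumptions, only `CONFLICT` corresponds to a genuine port-busy error. The situation in this report (listener PID equal to the managed gateway PID) would map to `HANDOFF` or `DEFER`.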

Actual behavior:

The expected running gateway is reported as a blocking port conflict.
This pushes the operator toward manual stop/start/recovery paths.
In an active messaging setup, this can break communication continuity or make recovery appear unreliable.

Failure Mode 2: embedded Codex direct session wedges during compaction/rebind

After the gateway was healthy again and Telegram was connected, the user-visible direct session still stalled.

Current health at the time of investigation:

gateway pid: <pid>
gateway rpc reachable: true
port 18789: busy, held by expected gateway
Telegram channel: running
Telegram accounts: connected

Relevant log signatures:

started codex app-server compaction

stalled session:
sessionKey=agent:main:telegram:direct:<redacted>
state=processing
queueDepth=2
reason=active_work_without_progress
classification=stalled_agent_run
activeWorkKind=embedded_run
lastProgress=embedded_run:started

codex app-server compaction timed out; restarting app-server

codex app-server compaction could not use thread binding
reason="thread not found: <thread-id>"
recovery=stale_thread_binding

native harness compaction could not use its session binding; falling back to context engine:
thread not found: <thread-id>
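For triage, the stalled-session record above can be matched mechanically. This is a rough sketch that assumes the real log fields have the same one-per-line `key=value` shape as the redacted excerpt, which may not hold for the raw log format; `parse_fields` and `is_stalled_run` are hypothetical helper names.

```python
from typing import Dict, Iterable


def parse_fields(lines: Iterable[str]) -> Dict[str, str]:
    """Collect key=value pairs, skipping lines that are free-form text."""
    fields: Dict[str, str] = {}
    for line in lines:
        key, sep, value = line.strip().partition("=")
        # Keep only lines whose left-hand side looks like a field name.
        if sep and key.isidentifier():
            fields[key] = value.strip('"')
    return fields


def is_stalled_run(fields: Dict[str, str]) -> bool:
    """True when the record matches the stalled-run signature in this report."""
    return (fields.get("classification") == "stalled_agent_run"
            and fields.get("reason") == "active_work_without_progress")
```

Feeding the `stalled session:` block from the excerpt through `parse_fields` yields a dictionary with `state`, `queueDepth`, `reason`, and the other fields, while the free-form lines such as the compaction timeout message are ignored.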

Expected behavior:

Compaction/rebind should either complete or recover into a fresh valid session binding.
A stale thread binding should not leave new inbound messages queued behind a dead/stale embedded run.
`openclaw status --deep` should clearly distinguish "gateway/channel healthy" from "agent session wedged".
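One way to picture the layer distinction requested above is a summary that reports each layer separately instead of collapsing everything into a single health flag. The sketch below is illustrative only: the `Health` fields and the output strings are assumptions, not the real output of `openclaw status --deep`.

```python
from dataclasses import dataclass


@dataclass
class Health:
    gateway_rpc_reachable: bool
    channel_connected: bool
    stalled_sessions: int  # sessions currently classified as stalled runs


def summarize(h: Health) -> str:
    """Report the first unhealthy layer, checking from the transport upward."""
    if not h.gateway_rpc_reachable:
        return "gateway: unreachable"
    if not h.channel_connected:
        return "gateway: ok; channel: disconnected"
    if h.stalled_sessions > 0:
        return (f"gateway: ok; channel: ok; "
                f"agent sessions: {h.stalled_sessions} wedged")
    return "gateway: ok; channel: ok; agent sessions: ok"
```

With the health values recorded in this incident (RPC reachable, Telegram connected, one stalled direct session), such a summary would flag the agent layer rather than reporting everything as healthy.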

Actual behavior:

The direct messaging session entered a stalled processing state.
Repeated stalled-session warnings appeared.
The gateway and Telegram transport were healthy, but the assistant appeared delayed/dead to the user.

Related Context

The incident occurred around a configuration change touching `messages.tts`. Logs showed that the gateway detected the reload for the relevant config paths, but a stale status read led to a restart attempt anyway.

Even if the operator should avoid restarting from inside an active messaging session, the runtime should still be robust against:

false port-busy restart failures when the port owner is the managed gateway itself
stale embedded Codex thread bindings after compaction/app-server restart
direct-session queues stuck behind a stale run

Suggested Fix Criteria

For gateway restart:

Distinguish expected managed gateway listener from unexpected port conflict.
Restart the managed LaunchAgent gateway gracefully when it owns the configured port.
If active work exists, produce a clear deferred/refused state rather than a misleading port-busy error.
Keep enough state to verify replacement PID/RPC readiness before returning success.
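The last criterion, verifying the replacement before reporting success, can be sketched as a polling loop. This is a minimal illustration under stated assumptions: `get_pid` and `rpc_ok` are hypothetical probes supplied by the caller (for example a service-manager lookup and an RPC ping), and the timeout values are arbitrary.

```python
import time
from typing import Callable, Optional


def wait_for_replacement(old_pid: int,
                         get_pid: Callable[[], Optional[int]],
                         rpc_ok: Callable[[], bool],
                         timeout_s: float = 30.0,
                         interval_s: float = 0.5,
                         clock: Callable[[], float] = time.monotonic,
                         sleep: Callable[[float], None] = time.sleep) -> int:
    """Return the new gateway PID once it differs from old_pid and answers RPC.

    Raises TimeoutError if no healthy replacement appears within timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        pid = get_pid()
        # Success requires both a new process and a reachable RPC endpoint.
        if pid is not None and pid != old_pid and rpc_ok():
            return pid
        if clock() >= deadline:
            raise TimeoutError(f"no healthy replacement for pid {old_pid}")
        sleep(interval_s)
```

Remembering `old_pid` is the "enough state" part: without it, a restart that silently left the original process in place would look identical to a successful handoff.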

For embedded Codex session recovery:

Automatically recover from `stale_thread_binding` / `thread not found`.
Rebind the direct session to a fresh valid thread or clear/requeue stale work safely.
Ensure new inbound channel messages are not blocked behind a dead embedded run.
Surface wedged agent sessions in `openclaw status --deep` even when gateway/channel health is OK.
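The recovery criteria above could look roughly like the following. This is a sketch, not the embedded Codex runtime's real API: `ThreadNotFound`, `DirectSession`, `compact`, and `new_thread` are hypothetical stand-ins for whatever the app-server integration actually exposes.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Optional


class ThreadNotFound(Exception):
    """Hypothetical error for a binding whose thread no longer exists."""


@dataclass
class DirectSession:
    thread_id: Optional[str]
    queue: Deque[str] = field(default_factory=deque)  # pending inbound messages
    state: str = "idle"


def compact_with_recovery(session: DirectSession,
                          compact: Callable[[Optional[str]], None],
                          new_thread: Callable[[], str]) -> str:
    """Compact the session, rebinding to a fresh thread if the binding is stale."""
    try:
        compact(session.thread_id)
        return "compacted"
    except ThreadNotFound:
        # The old thread is gone: bind a fresh one and release the session
        # so queued messages are not stuck behind the dead run.
        session.thread_id = new_thread()
        session.state = "idle"
        return "rebound"
```

The design choice here is that a stale binding ends the old run instead of retrying it: the queue is left intact and the session returns to `idle`, so a dispatcher can deliver the waiting messages to the new thread.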
