openclaw: codex app-server children orphan to PPID=1 across Gateway restarts, accumulate over days, and drive an OAuth refresh storm and silent agent turn timeouts

Summary

Embedded codex app-server child processes spawned by the Gateway are not terminated when the Gateway exits/restarts. They get re-parented to init (PPID=1) and continue running for days, each holding cached OAuth profile state and periodically attempting refresh. Over time this produces:

  1. A storm of OAuth refresh failures (observed: 56 failures in one day from a single user)
  2. Profile lock contention when a live agent tries to dispatch to embedded codex
  3. "codex app-server turn idle timed out" followed by "client retired" in the Gateway log, with no fallback succeeding
  4. End-user symptom on Telegram: bot shows "typing…", typing disappears, no message is ever delivered — silent failure

Evidence

On a single user's machine (after multiple Gateway restarts over 48h), 13 orphan codex app-server processes were running concurrently:

PID    PPID  ELAPSED      COMMAND
64334     1  01-01:18:00  node .../codex app-server --enable goals --listen unix://app-server.sock
64335 64334  01-01:18:00  .../codex app-server --enable goals --listen unix://app-server.sock
66282     1  02-20:25:15  node .../codex app-server --enable goals --listen unix://app-server.sock
66283 66282  02-20:25:15  .../codex app-server --enable goals --listen unix://app-server.sock
69580     1  02-20:08:26  node .../codex app-server --enable goals --listen unix://app-server.sock
69581 69580  02-20:08:26  .../codex app-server --enable goals --listen unix://app-server.sock
78340     1  02-19:17:23  node .../codex app-server --enable goals --listen unix://app-server.sock
78341 78340  02-19:17:23  .../codex app-server --enable goals --listen unix://app-server.sock
78359     1  02-19:17:16  node .../codex app-server --enable goals --listen unix://app-server.sock
78360 78359  02-19:17:16  .../codex app-server --enable goals --listen unix://app-server.sock
... (plus earlier-killed: 9139/9140, 10674/10691, 24161)

All orphan node parents have PPID=1, indicating the original Gateway died without taking them down.
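
The listing above can also be checked mechanically. The following is a minimal sketch, assuming the text comes from something like `ps -eo pid,ppid,etime,command`; the names `ProcRow`, `parse_etime`, `parse_ps`, and `orphan_roots` are hypothetical helpers for illustration, not part of OpenClaw or codex.

```python
from dataclasses import dataclass

# Excerpt of the listing above, elided paths kept as in the report.
SAMPLE = """\
PID    PPID  ELAPSED      COMMAND
64334     1  01-01:18:00  node .../codex app-server --enable goals --listen unix://app-server.sock
64335 64334  01-01:18:00  .../codex app-server --enable goals --listen unix://app-server.sock
66282     1  02-20:25:15  node .../codex app-server --enable goals --listen unix://app-server.sock
66283 66282  02-20:25:15  .../codex app-server --enable goals --listen unix://app-server.sock
... (plus earlier-killed: 9139/9140, 10674/10691, 24161)
"""


@dataclass
class ProcRow:
    pid: int
    ppid: int
    elapsed_seconds: int
    command: str


def parse_etime(etime):
    """Convert the ps ELAPSED format [[dd-]hh:]mm:ss into seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:  # pad a missing hours field with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_ps(text):
    """Parse 'pid ppid etime command' rows, skipping the header and notes."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        pid, ppid, etime, command = fields
        rows.append(ProcRow(int(pid), int(ppid), parse_etime(etime), command))
    return rows


def orphan_roots(rows, needle="codex app-server"):
    """Rows re-parented to PID 1 whose command line contains `needle`."""
    return [r for r in rows if r.ppid == 1 and needle in r.command]


if __name__ == "__main__":
    for row in orphan_roots(parse_ps(SAMPLE)):
        print(row.pid, row.elapsed_seconds)
```

Filtering on PPID=1 selects only the `node` wrappers; the second process of each pair still has a live parent (the wrapper) and would need to be handled together with it.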

Gateway log shows the user-visible failure mode (timestamps in CST):

10:34:10  started codex app-server compaction
10:34:39  completed codex app-server compaction
10:37:33  liveness warning: event_loop_delay max 21055ms, utilization 0.854
10:39:18  Inbound message telegram:group:...:topic:14 -> @...bot  (the user message)
10:39:46  event_loop_delay max 21223ms
10:41:48  event_loop_delay max 9420ms
10:43:11  memory sync failed (session-start): Unknown system error -11
10:43:57  codex app-server turn idle timed out waiting for completion
10:43:57  [agent/embedded] codex app-server client retired after timed-out turn
10:43:57  Profile openai-codex:default timed out. Trying next account...
(no further turn completion logged for this run; no telegram outbound for this inbound)

OAuth refresh failures from [agents/auth-profiles] log channel: 56 entries on a single day.

Why it matters

This is the root cause of an entire class of "Telegram agent shows typing but never replies" reports. The bug is silent (no error surfaced to the user) and self-amplifying (each Gateway restart adds more orphans, which makes the next failure more likely). It also makes Gateway upgrades land into a worse state than they started.

Suggested fixes

  1. Spawn codex children with a die-when-parent-dies guarantee. On macOS/Linux this needs an explicit dance because Node doesn't get PR_SET_PDEATHSIG-equivalent for free. Options:
    • On Linux, set prctl(PR_SET_PDEATHSIG, SIGTERM) in the child (via a small launcher) before exec.
    • On macOS, use a watchdog goroutine/thread in the child that polls getppid() == 1 and exits, OR have the Gateway pass its pid via env and the child watches via kqueue EVFILT_PROC.
    • Cross-platform fallback: spawn via stdio pipe; child exits on stdin EOF. (The newer stdio:// variant of codex app-server appears to do this — the bug is specific to the older unix://app-server.sock --enable goals variant.)
  2. On Gateway shutdown, explicitly SIGTERM all known codex children before exiting. Track child pids in a registry; when the Gateway reaches the "shutdown started: gateway restarting" step, walk the registry and signal each child.
  3. Detect-and-reap on Gateway startup. Scan for codex app-server processes with PPID=1 belonging to the current user and matching the OpenClaw codex install path, and reap them before binding the OAuth profile.
  4. Backoff on OAuth refresh failures. Even with orphans gone, an OAuth refresh that fails should not retry tightly enough to produce 56 failures/day from one machine — add exponential backoff with a ceiling.
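
The cross-platform fallback from item 1 (child exits on stdin EOF) can be sketched as follows. This is an illustration in Python, not the Gateway's or codex's actual code: `CHILD_SOURCE` and `spawn_child` are hypothetical names, and the child's loop is a placeholder for a real server.

```python
import subprocess
import sys
import textwrap

# Hypothetical stand-in for a long-running child such as an app-server.
CHILD_SOURCE = textwrap.dedent(
    """
    import os, sys, threading, time

    def exit_when_parent_goes_away():
        # read() returns only once every write end of the pipe is closed,
        # which also happens when the parent dies without any cleanup.
        sys.stdin.buffer.read()
        os._exit(0)

    threading.Thread(target=exit_when_parent_goes_away, daemon=True).start()
    while True:  # placeholder for the real serving loop
        time.sleep(0.1)
    """
)


def spawn_child():
    # stdin=PIPE hands the child the read end of a pipe; the parent keeps
    # the write end open for as long as it is alive.
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE], stdin=subprocess.PIPE
    )


if __name__ == "__main__":
    child = spawn_child()
    child.stdin.close()  # simulates the parent disappearing
    print("child exit code:", child.wait(timeout=10))
```

The appeal of this approach is that it needs no signal handling in the parent: the kernel closes the pipe even if the parent is killed with SIGKILL. It only works if no other process inherits the write end of the pipe.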

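Item 4 (exponential backoff with a ceiling) can be sketched as a small pure function. The 30-second base and one-hour ceiling are illustrative assumptions, not values taken from OpenClaw, and `backoff_delay` is a hypothetical name.

```python
import random


def backoff_delay(failures, base=30.0, ceiling=3600.0, jitter=0.0,
                  rng=random.random):
    """Seconds to wait after `failures` consecutive refresh failures.

    The delay doubles with each failure and is capped at `ceiling`.
    `jitter` (0.0 to 1.0) shortens the delay by a random fraction so that
    several processes do not retry in lockstep; it never exceeds the cap.
    """
    if failures <= 0:
        return 0.0
    exponent = min(failures - 1, 32)  # keep the power bounded
    delay = min(ceiling, base * (2 ** exponent))
    return delay * (1.0 - jitter * rng())


if __name__ == "__main__":
    for n in range(1, 10):
        print(n, backoff_delay(n))
```

With these assumed values, a refresh that keeps failing settles at roughly one attempt per hour once the ceiling is reached.
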
Workaround

Periodically:

pkill -f 'codex.*app-server.*--enable goals --listen unix://'

…or just restart the Gateway and then immediately:

ps -eo pid,ppid,command | awk '$2==1 && /codex app-server/ {print $1}' | xargs -r kill

After cleanup, agent turns recover (verified locally).

Related

  • openclaw/openclaw#86308 (separate bug: update.run handoff cwd ENOENT also strands the Gateway, which is one of the ways orphans get created in the first place)
