openclaw - Regression: app-server per-agent CODEX_HOME is not a replacement for codex-cli process isolation



Summary

OpenClaw 2026.5.20+ removed the bundled codex-cli backend and migrates legacy codex-cli/* model refs to the openai/* Codex app-server route.

This is a regression for deployments that run many high-context agents concurrently, especially through Telegram. The Codex app-server route provides some state isolation through per-agent CODEX_HOME, but that is not equivalent to the old codex-cli execution model.

Please restore codex-cli as a first-class, supported, opt-in runtime.

I am not asking to remove the Codex app-server path. The app-server route can remain the default for normal users. The request is to preserve a real CLI-isolated runtime for users who need stronger operational isolation.

Why per-agent CODEX_HOME is not enough

The current Codex app-server plugin does isolate some state:

  • each agent gets its own CODEX_HOME
  • Codex config, account state, plugin cache/data, and thread state are scoped per agent
  • local app-server launches use a child process

That is useful, but it is state isolation, not execution-path isolation.

The old codex-cli backend had a different operational shape:

Gateway -> codex exec -> CLI process / CLI session

The 2026.5.20+ Codex app-server path has this shape:

Gateway -> Codex app-server harness -> app-server thread / compaction / tool bridge

With app-server routing:

  • Codex app-server owns the model loop
  • Codex app-server owns canonical thread state
  • compaction is Codex-native and coordinated through the app-server path
  • OpenClaw dynamic tools are bridged through the Codex adapter
  • session recovery, progress, tool calls, compaction notifications, and Telegram delivery remain coupled through the Gateway/app-server control path

So even with per-agent CODEX_HOME, long-running or large-context app-server turns can still interfere with the responsiveness of unrelated Telegram agents.
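A deliberately simplified toy model can illustrate this kind of interference. Purely for illustration, it assumes that a shared control path behaves like a single FIFO lane, while CLI execution gives each turn its own process. It does not model how the Codex app-server actually schedules work. The job names and durations are made up.

```typescript
// Toy model only: each job needs `work` seconds on whatever lane it runs on.
interface Job {
  name: string;
  arrival: number; // seconds since t = 0
  work: number;    // seconds of lane time the job needs
}

// Shared lane: every job waits in one FIFO queue (stand-in for a busy shared control path).
function sharedLaneLatencies(jobs: Job[]): Map<string, number> {
  const ordered = [...jobs].sort((a, b) => a.arrival - b.arrival);
  const latency = new Map<string, number>();
  let laneFreeAt = 0;
  for (const job of ordered) {
    const start = Math.max(laneFreeAt, job.arrival);
    laneFreeAt = start + job.work;
    latency.set(job.name, laneFreeAt - job.arrival);
  }
  return latency;
}

// Isolated lanes: each job runs in its own process, so it never waits on another job.
function isolatedLatencies(jobs: Job[]): Map<string, number> {
  return new Map(jobs.map((job): [string, number] => [job.name, job.work]));
}

const jobs: Job[] = [
  { name: "cron-research", arrival: 0, work: 600 }, // long background turn
  { name: "telegram-reply", arrival: 5, work: 3 },  // short interactive turn
];

console.log(sharedLaneLatencies(jobs).get("telegram-reply")); // prints 598
console.log(isolatedLatencies(jobs).get("telegram-reply"));   // prints 3
```

In the shared lane, the 3-second interactive turn waits behind the 10-minute background turn. That is the "typing, then silence" pattern described below.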

The old codex-cli backend provided a stronger boundary because each turn/session ran through external codex exec CLI execution instead of the shared app-server harness path.

Affected use case

My deployment has:

  • many Telegram bot accounts
  • multiple agents active at the same time
  • very large contexts
  • background cron jobs
  • long-running research/code tasks that can run for many minutes

This is exactly the kind of workload where process/runtime isolation matters. A heavy background agent should not make unrelated interactive Telegram agents appear silent or stalled.

Observed behavior on 2026.5.22

Environment:

  • macOS
  • Node.js v22.22.2
  • OpenClaw 2026.5.22
  • primary model: openai/gpt-5.5
  • Codex runtime: app-server route through agentRuntime.id: "codex"
  • multiple Telegram agents configured
  • long-context sessions and cron jobs active

Symptoms:

  • Telegram shows the typing indicator, which then stops without a reply.
  • Replies eventually arrive late, after a long quiet gap.
  • Gateway requests can exceed the client timeout even when the Gateway eventually answers.
  • App-server compaction / recovery activity correlates with the stalls.

Representative log symptoms:

codex app-server compaction timed out; restarting app-server
context-engine compaction failed
stuck session recovery ... aborted=true drained=true
fetch timeout ... timer delayed ... likely event-loop starvation
GatewayTransportError: gateway timeout after 10000ms

In one case, the client timed out at 10s while Gateway health later completed after about 12.8s.
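A small scanner can count these symptoms per log chunk, which helps check whether they line up with the quiet periods. The patterns are copied from the excerpt above. The issue does not give the log file location or any timestamp/prefix format, so matching is substring-based. countSymptoms is a hypothetical helper.

```typescript
// Substring patterns taken from the representative log lines; real lines may
// carry extra prefixes (timestamps, subsystem tags), so nothing is anchored.
const SYMPTOMS: Record<string, RegExp> = {
  compactionTimeout: /codex app-server compaction timed out/,
  compactionFailed: /context-engine compaction failed/,
  stuckRecovery: /stuck session recovery/,
  eventLoopStarvation: /likely event-loop starvation/,
  gatewayTimeout: /GatewayTransportError: gateway timeout after \d+ms/,
};

// Count how many lines of a log chunk match each symptom.
function countSymptoms(logText: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const key of Object.keys(SYMPTOMS)) counts[key] = 0;
  for (const line of logText.split("\n")) {
    for (const [key, pattern] of Object.entries(SYMPTOMS)) {
      // No "g" flag, so test() is stateless across lines.
      if (pattern.test(line)) counts[key] = (counts[key] ?? 0) + 1;
    }
  }
  return counts;
}
```

You could split the log into time windows (for example, per minute) and compare the counts with the times Telegram went silent. That gives a rough correlation, not proof of cause.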

Concurrency check:

  • single openclaw gateway health call: about 2s
  • 8 concurrent openclaw gateway health calls on 2026.5.22: about 20.3s to 20.5s each

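This suggests the Gateway/app-server control path becomes a bottleneck under concurrent Codex app-server workloads. If the eight calls ran independently, each should finish near the single-call time of about 2s. Instead, each one took roughly ten times longer. That is in the same range as the ~16s (8 × ~2s) one shared resource would need to serve all eight.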
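The concurrency check can be repeated with a small Node.js harness. It runs a command once, then N times at the same moment. The command openclaw gateway health comes from the issue. The harness and its names are assumptions, as is measuring wall-clock time including process startup.

```typescript
import { spawn } from "node:child_process";

interface RunResult {
  ms: number;          // wall-clock time from spawn to exit
  code: number | null; // exit code (null if the process was killed by a signal)
}

// Spawn one command and measure how long it takes to exit.
function timeOnce(cmd: string, args: string[]): Promise<RunResult> {
  return new Promise((resolve, reject) => {
    const start = performance.now();
    const child = spawn(cmd, args, { stdio: "ignore" });
    child.on("error", reject); // e.g. command not found
    child.on("close", (code) => resolve({ ms: performance.now() - start, code }));
  });
}

// Start n copies at the same moment and wait for all of them.
function timeConcurrent(cmd: string, args: string[], n: number): Promise<RunResult[]> {
  return Promise.all(Array.from({ length: n }, () => timeOnce(cmd, args)));
}

// Example (not run here):
//   const single = await timeOnce("openclaw", ["gateway", "health"]);
//   const burst = await timeConcurrent("openclaw", ["gateway", "health"], 8);
//   console.log(single.ms, burst.map((r) => r.ms));
```

Exit codes are recorded rather than treated as failures, so runs that hit a client-side timeout still report how long they took.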

Version comparison

I compared package behavior/docs:

  • 2026.5.12 still documents and registers bundled codex-cli.
  • 2026.5.20 already states that OpenClaw no longer keeps a bundled Codex CLI backend.
  • 2026.5.22 keeps that direction.

In 2026.5.12 this worked:

openclaw agent --message "hi" --model codex-cli/gpt-5.5

The bundled OpenAI plugin registered a default backend that ran:

codex exec ...

In 2026.5.20+, legacy codex-cli/* refs are migrated to openai/* and run through the Codex app-server route instead.
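For contrast with the app-server route, the CLI execution shape can be sketched as a fresh codex exec invocation per turn that carries the agent's own CODEX_HOME. The issue elides the arguments the bundled backend actually passed ("codex exec ..."). This sketch therefore uses only exec, --model (which appears in the process list quoted in the rollback result), and a trailing prompt. codexExecSpec and the per-agent home layout are hypothetical.

```typescript
import { join } from "node:path";

interface ExecSpec {
  command: string;
  args: string[];
  env: Record<string, string | undefined>;
}

// Build one per-turn CLI invocation: a separate OS process for every turn,
// with CODEX_HOME pointing at a directory owned by that agent.
function codexExecSpec(
  codexBin: string,
  homeRoot: string, // hypothetical parent directory holding one CODEX_HOME per agent
  agentId: string,
  model: string,
  prompt: string,
  baseEnv: Record<string, string | undefined>,
): ExecSpec {
  return {
    command: codexBin,
    args: ["exec", "--model", model, prompt],
    env: { ...baseEnv, CODEX_HOME: join(homeRoot, agentId) },
  };
}

const spec = codexExecSpec("/opt/homebrew/bin/codex", "/tmp/codex-homes", "telegram-support", "gpt-5.5", "hi", {});
// spec.args is ["exec", "--model", "gpt-5.5", "hi"]
// spec.env.CODEX_HOME is "/tmp/codex-homes/telegram-support"
```

Each spec would then be launched as its own process, for example with spawn(spec.command, spec.args, { env: spec.env }) from node:child_process. A long turn then occupies its own OS process instead of a shared harness.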

Rollback result

After rolling back to OpenClaw 2026.5.12 and configuring agents to use:

{
  agents: {
    defaults: {
      model: {
        primary: "codex-cli/gpt-5.5",
        fallbacks: []
      },
      models: {
        "codex-cli/gpt-5.5": {}
      },
      cliBackends: {
        "codex-cli": {
          command: "/opt/homebrew/bin/codex"
        }
      }
    }
  }
}

A smoke test confirmed:

runner: cli
provider: codex-cli
model: gpt-5.5
result: OK

The process list showed codex exec ... --model gpt-5.5, not codex app-server.

This restored the desired execution model.
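The process-list check from the rollback can be scripted. This sketch classifies lines of ps -ax -o command output as codex exec or codex app-server. It assumes the Codex binary's file name is codex or begins with codex-. The helper names are hypothetical.

```typescript
type CodexMode = "exec" | "app-server";

// Classify one process command line by the subcommand that follows the Codex binary.
function classifyCodexProcess(commandLine: string): CodexMode | null {
  const tokens = commandLine.trim().split(/\s+/);
  for (let i = 0; i < tokens.length - 1; i++) {
    const base = (tokens[i] ?? "").split("/").pop() ?? "";
    if (base !== "codex" && !base.startsWith("codex-")) continue;
    const sub = tokens[i + 1];
    if (sub === "exec" || sub === "app-server") return sub;
  }
  return null;
}

// Tally both modes across a whole process-list dump.
function tallyCodexModes(psOutput: string): Record<CodexMode, number> {
  const tally: Record<CodexMode, number> = { exec: 0, "app-server": 0 };
  for (const line of psOutput.split("\n")) {
    const mode = classifyCodexProcess(line);
    if (mode) tally[mode] += 1;
  }
  return tally;
}
```

On the rolled-back setup, a run during active turns should show exec counts and no app-server entries.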

Expected behavior

OpenClaw should support both Codex runtime modes:

agentRuntime.id: "codex"      -> Codex app-server runtime
agentRuntime.id: "codex-cli"  -> Codex CLI runtime using codex exec

The app-server route can remain the default. But users should be able to intentionally choose CLI execution when they need isolation.
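Together with a model-scoped override like the one listed under Requested changes, the two modes imply a simple resolution order. This sketch is hypothetical: resolveRuntime and these types are not OpenClaw's code. An explicit model-scoped agentRuntime.id wins, and an explicit codex-cli/* ref keeps the CLI runtime instead of being migrated. Everything else uses the app-server default.

```typescript
type RuntimeId = "codex" | "codex-cli";

interface ModelPolicy {
  agentRuntime?: { id: RuntimeId };
}

interface AgentDefaults {
  models?: Record<string, ModelPolicy>;
}

function resolveRuntime(modelRef: string, defaults: AgentDefaults): RuntimeId {
  // 1. An explicit model-scoped runtime policy wins.
  const scoped = defaults.models?.[modelRef]?.agentRuntime?.id;
  if (scoped !== undefined) return scoped;
  // 2. An explicit codex-cli/* ref keeps the CLI runtime (no forced migration).
  if (modelRef.startsWith("codex-cli/")) return "codex-cli";
  // 3. Otherwise fall back to the app-server default.
  return "codex";
}

// Mirrors the model-scoped example from the requested changes.
const defaults: AgentDefaults = {
  models: { "openai/gpt-5.5": { agentRuntime: { id: "codex-cli" } } },
};

console.log(resolveRuntime("openai/gpt-5.5", defaults)); // prints codex-cli
console.log(resolveRuntime("openai/gpt-5.5", {}));       // prints codex
console.log(resolveRuntime("codex-cli/gpt-5.5", {}));    // prints codex-cli
```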

Requested changes

Please consider:

  • restore the bundled codex-cli backend in 2026.5.20+
  • keep codex-cli/gpt-* refs working for users who explicitly choose them
  • do not forcibly migrate explicit codex-cli/* refs to openai/* app-server
  • support model-scoped runtime policy such as:
{
  agents: {
    defaults: {
      models: {
        "openai/gpt-5.5": {
          agentRuntime: { id: "codex-cli" }
        }
      }
    }
  }
}
  • document two valid Codex modes:
    • app-server mode for native Codex integration
    • CLI mode for high-concurrency / large-context / stronger isolation deployments
  • keep Gateway control-plane operations responsive even when app-server compaction, recovery, or long-running turns are active
  • expose clearer backpressure/status instead of letting Telegram users see typing followed by silence
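For the last request, one possible shape is a limiter that reports queue position to the caller. A Telegram front end could then say a turn is queued instead of showing typing and going silent. TurnLimiter, its Status type, and the concurrency limit are hypothetical. This sketches the idea and does not propose where OpenClaw should enforce it.

```typescript
type Status = { state: "running" } | { state: "queued"; ahead: number };

// Minimal concurrency limiter that reports queue position instead of waiting silently.
class TurnLimiter {
  private readonly maxConcurrent: number;
  private active = 0;
  private waiting: Array<() => void> = [];

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  // Runs `task` once a slot is free; `onStatus` learns whether the turn started
  // immediately or how many queued turns are ahead of it.
  async run<T>(task: () => Promise<T>, onStatus: (s: Status) => void): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      onStatus({ state: "queued", ahead: this.waiting.length });
      // The slot is handed over directly by a finishing turn (see finally below).
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active += 1;
    }
    onStatus({ state: "running" });
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) next(); // pass this slot to the next queued turn
      else this.active -= 1;
    }
  }
}
```

Handing the slot directly to the next waiter keeps a newly arriving turn from slipping in ahead of one that was already queued.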

Why this should be first-class

This is not just a user preference for an older model ref. It is a runtime isolation boundary.

Per-agent CODEX_HOME prevents state leakage, but it does not prevent a busy app-server harness/control path from affecting unrelated agent turns. For deployments with many concurrent long-context agents, codex-cli is still operationally necessary.
