openclaw - 💡(How to fix) Fix Codex app-server thread lifecycle latency is hidden between attempt-dispatch and session.started

StepCodex · 2026-05-20T17:00:51Z

[openclaw] Problem A production OpenClaw Docker gateway on 2026.5.18 shows noticeable latency in short Codex-backed Dashboard/Webchat turns, but the existing t… ## Fix / Workaround With `OPENCLAW_LOG_LEVEL=trace`, `[trace:embedded-run] startup stages` ends at `attempt-dispatch` and reports only ~1.36s of startup/prep. However, `session.started` is recorded several seconds later. That gap appears to be inside the Codex app-server thread lifecycle (`startOrResumeThread` / `thread.resume` or `thread.start`), but today it has no comparable stage breakdown. ```text [trace:embedded-run] startup stages: runId=7141c10d-811a-49b8-a5e2-c3ca23f08b84 sessionId=18b56039-14d7-448f-95bd-edef2c59acd5 phase=attempt-dispatch totalMs=1365 stages=workspace:1ms@1ms,runtime-plugins:24ms@25ms,hooks:1ms@26ms,model-resolution:841ms@867ms,auth:482ms@1349ms,context-engine:7ms@1356ms,attempt-workspace:8ms@1364ms,attempt-prompt:0ms@1364ms,attempt-runtime-plan:1ms@1365ms,attempt-dispatch:0ms@1365ms ``` - OpenClaw lane wait is tiny: `5-6ms`. - Existing embedded startup trace reports only `1365ms`. - `session.started` does not occur until ~`6.3s` after `strict-agentic active` / `attempt-dispatch`. - `prompt.submitted` is immediate after `session.started` (`~11ms`). - Therefore the largest unaccounted startup block is before `session.started`, likely in Codex app-server thread lifecycle. ## Problem A production OpenClaw Docker gateway on `2026.5.18` shows noticeable latency in short Codex-backed Dashboard/Webchat turns, but the existing trace output makes the slow section hard to localize. With `OPENCLAW_LOG_LEVEL=trace`, `[trace:embedded-run] startup stages` ends at `attempt-dispatch` and reports only ~1.36s of startup/prep. However, `session.started` is recorded several seconds later. That gap appears to be inside the Codex app-server thread lifecycle (`startOrResumeThread` / `thread.resume` or `thread.start`), but today it has no comparable stage breakdown. This makes operators think prep is cheap while the user-visible turn is still slow, and it makes it hard to tell whether the time is spent in binding reads, plugin app config recovery/build, thread resume/start RPC, or binding writes. ## Environment - OpenClaw: `2026.5.18` - Runtime: Docker, derived runtime image - Channel tested: Dashboard/Webchat direct conversation - Model/provider: `openai-codex/gpt-5.5` - Gateway health: healthy during the clean sample - `OPENCLAW_LOG_LEVEL=trace` was enabled temporarily for the measurement, then restored - `diagnostics.cacheTrace` was not enabled ## Observed clean sample Run: `7141c10d-811a-49b8-a5e2-c3ca23f08b84` Timeline from gateway logs / trajectory: ```text 08:37:16.717 message queued 08:37:20.273 lane enqueue 08:37:20.278 lane dequeue waitMs=5 08:37:20.282 main lane enqueue 08:37:20.289 main lane dequeue waitMs=6 08:37:21.642 strict-agentic execution contract active 08:37:21.658 [trace:embedded-run] startup stages ... totalMs=1365 08:37:27.981 session.started 08:37:27.992 prompt.submitted 08:37:32.186 tool.call message 08:37:32.260 tool.result message 08:37:34.222 model.completed 08:37:35.022 message processed duration=18334ms ``` Existing embedded startup trace: ```text [trace:embedded-run] startup stages: runId=7141c10d-811a-49b8-a5e2-c3ca23f08b84 sessionId=18b56039-14d7-448f-95bd-edef2c59acd5 phase=attempt-dispatch totalMs=1365 stages=workspace:1ms@1ms,runtime-plugins:24ms@25ms,hooks:1ms@26ms,model-resolution:841ms@867ms,auth:482ms@1349ms,context-engine:7ms@1356ms,attempt-workspace:8ms@1364ms,attempt-prompt:0ms@1364ms,attempt-runtime-plan:1ms@1365ms,attempt-dispatch:0ms@1365ms ``` Trajectory for the same run: ```text session.started 08:37:27.981 context.compiled 08:37:27.984 (+3ms) prompt.submitted 08:37:27.992 (+11ms) tool.call 08:37:32.186 (+4205ms) tool.result 08:37:32.260 (+4279ms) model.completed 08:37:34.222 (+6241ms) session.ended 08:37:34.222 (+6241ms) ``` Key observation: - OpenClaw lane wait is tiny: `5-6ms`. - Existing embedded startup trace reports only `1365ms`. - `session.started` does not occur until ~`6.3s` after `strict-agentic active` / `attempt-dispatch`. - `prompt.submitted` is immediate after `session.started` (`~11ms`). - Therefore the largest unaccounted startup block is before `session.started`, likely in Codex app-server thread lifecycle. ## Relevant code paths The current startup trace is emitted in: - `src/agents/pi-embedded-runner/run.ts` It marks and emits at: ```ts startupStages.mark(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch); notifyExecutionPhase("attempt_dispatch", { provider, model: modelId }); emitStartupStageSummary(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch); startupStagesEmitted = true; ``` The next expensive path appears to be in: - `extensions/codex/src/app-server/run-attempt.ts` - `extensions/codex/src/app-server/thread-lifecycle.ts` Specifically around: ```ts const startupThread = await startOrResumeThread(buildThreadLifecycleParams()); ``` and inside `startOrResumeThread`, including: - `re

openclaw2026-05-20 17:00:51

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Error Message

Warn automatically if total thread lifecycle exceeds e.g. 2000ms, or any stage exceeds e.g. 1000ms.

Root Cause

For short turns like “reply only OK”, a user-visible 18s turn currently looks like:

queue: essentially free
embedded startup trace: ~1.3s
prompt submit after session start: immediate
model/tool work: ~6s

But the missing ~6s before session.started is not attributable from current logs. Adding thread lifecycle timing would make future reports actionable and would clarify whether the latency is caused by plugin app config recovery/build, app-server thread resume/start RPC, binding I/O, or something else.

Fix Action

Fix / Workaround

With OPENCLAW_LOG_LEVEL=trace, [trace:embedded-run] startup stages ends at attempt-dispatch and reports only ~1.36s of startup/prep. However, session.started is recorded several seconds later. That gap appears to be inside the Codex app-server thread lifecycle (startOrResumeThread / thread.resume or thread.start), but today it has no comparable stage breakdown.

[trace:embedded-run] startup stages: runId=7141c10d-811a-49b8-a5e2-c3ca23f08b84 sessionId=18b56039-14d7-448f-95bd-edef2c59acd5 phase=attempt-dispatch totalMs=1365 stages=workspace:1ms@1ms,runtime-plugins:24ms@25ms,hooks:1ms@26ms,model-resolution:841ms@867ms,auth:482ms@1349ms,context-engine:7ms@1356ms,attempt-workspace:8ms@1364ms,attempt-prompt:0ms@1364ms,attempt-runtime-plan:1ms@1365ms,attempt-dispatch:0ms@1365ms

OpenClaw lane wait is tiny: 5-6ms.
Existing embedded startup trace reports only 1365ms.
session.started does not occur until ~6.3s after strict-agentic active / attempt-dispatch.
prompt.submitted is immediate after session.started (~11ms).
Therefore the largest unaccounted startup block is before session.started, likely in Codex app-server thread lifecycle.

Code Example

08:37:16.717 message queued
08:37:20.273 lane enqueue
08:37:20.278 lane dequeue waitMs=5
08:37:20.282 main lane enqueue
08:37:20.289 main lane dequeue waitMs=6
08:37:21.642 strict-agentic execution contract active
08:37:21.658 [trace:embedded-run] startup stages ... totalMs=1365
08:37:27.981 session.started
08:37:27.992 prompt.submitted
08:37:32.186 tool.call message
08:37:32.260 tool.result message
08:37:34.222 model.completed
08:37:35.022 message processed duration=18334ms

---

[trace:embedded-run] startup stages: runId=7141c10d-811a-49b8-a5e2-c3ca23f08b84 sessionId=18b56039-14d7-448f-95bd-edef2c59acd5 phase=attempt-dispatch totalMs=1365 stages=workspace:1ms@1ms,runtime-plugins:24ms@25ms,hooks:1ms@26ms,model-resolution:841ms@867ms,auth:482ms@1349ms,context-engine:7ms@1356ms,attempt-workspace:8ms@1364ms,attempt-prompt:0ms@1364ms,attempt-runtime-plan:1ms@1365ms,attempt-dispatch:0ms@1365ms

---

session.started   08:37:27.981
context.compiled  08:37:27.984  (+3ms)
prompt.submitted  08:37:27.992  (+11ms)
tool.call         08:37:32.186  (+4205ms)
tool.result       08:37:32.260  (+4279ms)
model.completed   08:37:34.222  (+6241ms)
session.ended     08:37:34.222  (+6241ms)

---

startupStages.mark(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch);
notifyExecutionPhase("attempt_dispatch", { provider, model: modelId });
emitStartupStageSummary(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch);
startupStagesEmitted = true;

---

const startupThread = await startOrResumeThread(buildThreadLifecycleParams());

---

[trace:codex-app-server] thread lifecycle: runId=... sessionId=... action=resumed totalMs=6339 stages=read-binding:12ms@12ms,binding-compat-checks:18ms@30ms,plugin-thread-config-recheck:2100ms@2130ms,thread-resume-request:3900ms@6030ms,write-binding:220ms@6250ms,thread-ready:89ms@6339ms

---

native hook relay bridge server address unavailable

---

server.listen(0, "127.0.0.1", () => {
  writeNativeHookRelayBridgeRecordForRegistration(registration, bridge);
});
if (relayBridges.get(registration.relayId) === bridge) {
  writeNativeHookRelayBridgeRecordForRegistration(registration, bridge);
}

RAW_BUFFERClick to expand / collapse

Problem

A production OpenClaw Docker gateway on 2026.5.18 shows noticeable latency in short Codex-backed Dashboard/Webchat turns, but the existing trace output makes the slow section hard to localize.

This makes operators think prep is cheap while the user-visible turn is still slow, and it makes it hard to tell whether the time is spent in binding reads, plugin app config recovery/build, thread resume/start RPC, or binding writes.

Environment

OpenClaw: 2026.5.18
Runtime: Docker, derived runtime image
Channel tested: Dashboard/Webchat direct conversation
Model/provider: openai-codex/gpt-5.5
Gateway health: healthy during the clean sample
OPENCLAW_LOG_LEVEL=trace was enabled temporarily for the measurement, then restored
diagnostics.cacheTrace was not enabled

Observed clean sample

Run: 7141c10d-811a-49b8-a5e2-c3ca23f08b84

Timeline from gateway logs / trajectory:

08:37:16.717 message queued
08:37:20.273 lane enqueue
08:37:20.278 lane dequeue waitMs=5
08:37:20.282 main lane enqueue
08:37:20.289 main lane dequeue waitMs=6
08:37:21.642 strict-agentic execution contract active
08:37:21.658 [trace:embedded-run] startup stages ... totalMs=1365
08:37:27.981 session.started
08:37:27.992 prompt.submitted
08:37:32.186 tool.call message
08:37:32.260 tool.result message
08:37:34.222 model.completed
08:37:35.022 message processed duration=18334ms

Existing embedded startup trace:

[trace:embedded-run] startup stages: runId=7141c10d-811a-49b8-a5e2-c3ca23f08b84 sessionId=18b56039-14d7-448f-95bd-edef2c59acd5 phase=attempt-dispatch totalMs=1365 stages=workspace:1ms@1ms,runtime-plugins:24ms@25ms,hooks:1ms@26ms,model-resolution:841ms@867ms,auth:482ms@1349ms,context-engine:7ms@1356ms,attempt-workspace:8ms@1364ms,attempt-prompt:0ms@1364ms,attempt-runtime-plan:1ms@1365ms,attempt-dispatch:0ms@1365ms

Trajectory for the same run:

session.started   08:37:27.981
context.compiled  08:37:27.984  (+3ms)
prompt.submitted  08:37:27.992  (+11ms)
tool.call         08:37:32.186  (+4205ms)
tool.result       08:37:32.260  (+4279ms)
model.completed   08:37:34.222  (+6241ms)
session.ended     08:37:34.222  (+6241ms)

Key observation:

OpenClaw lane wait is tiny: 5-6ms.
Existing embedded startup trace reports only 1365ms.
session.started does not occur until ~6.3s after strict-agentic active / attempt-dispatch.
prompt.submitted is immediate after session.started (~11ms).
Therefore the largest unaccounted startup block is before session.started, likely in Codex app-server thread lifecycle.

Relevant code paths

The current startup trace is emitted in:

src/agents/pi-embedded-runner/run.ts

It marks and emits at:

startupStages.mark(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch);
notifyExecutionPhase("attempt_dispatch", { provider, model: modelId });
emitStartupStageSummary(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch);
startupStagesEmitted = true;

The next expensive path appears to be in:

extensions/codex/src/app-server/run-attempt.ts
extensions/codex/src/app-server/thread-lifecycle.ts

Specifically around:

const startupThread = await startOrResumeThread(buildThreadLifecycleParams());

and inside startOrResumeThread, including:

readCodexAppServerBinding(...)
context-engine/user-MCP/MCP/plugin binding compatibility checks
optional pluginThreadConfig.build() / recoverable plugin binding recheck
client.request("thread/resume", ...) or client.request("thread/start", ...)
writeCodexAppServerBinding(...)

Suggested fix

Add a Codex app-server thread lifecycle stage summary, analogous to the embedded startup summary, around startOrResumeThread().

Suggested stages:

dynamic-tools-fingerprint
context-engine-binding
user-mcp-config
read-binding
binding-compat-checks
plugin-binding-stale-check
plugin-thread-config-build / plugin-thread-config-recheck
thread-resume-request
thread-start-request
write-binding
thread-ready

Suggested behavior:

Emit at trace level when OPENCLAW_LOG_LEVEL=trace.
Warn automatically if total thread lifecycle exceeds e.g. 2000ms, or any stage exceeds e.g. 1000ms.
Include lifecycle action (started vs resumed) and whether plugin app config was built/reused.

Example target log shape:

[trace:codex-app-server] thread lifecycle: runId=... sessionId=... action=resumed totalMs=6339 stages=read-binding:12ms@12ms,binding-compat-checks:18ms@30ms,plugin-thread-config-recheck:2100ms@2130ms,thread-resume-request:3900ms@6030ms,write-binding:220ms@6250ms,thread-ready:89ms@6339ms

Secondary observation: native hook relay address debug race

In the same trace window, this line appears repeatedly a few seconds after harness selection:

native hook relay bridge server address unavailable

The source appears to call writeNativeHookRelayBridgeRecordForRegistration(...) immediately after server.listen(...), before the listen callback is guaranteed to have a bound address:

server.listen(0, "127.0.0.1", () => {
  writeNativeHookRelayBridgeRecordForRegistration(registration, bridge);
});
if (relayBridges.get(registration.relayId) === bridge) {
  writeNativeHookRelayBridgeRecordForRegistration(registration, bridge);
}

This may be a harmless debug race because the callback writes the record later, so I am not claiming it is the main latency cause. But it can confuse latency investigations because it appears in the hidden gap. Consider either removing the eager write, making it explicitly best-effort/noisy only at trace, or exposing a bridge-ready timing if it matters for hook startup.

Why this matters

For short turns like “reply only OK”, a user-visible 18s turn currently looks like:

queue: essentially free
embedded startup trace: ~1.3s
prompt submit after session start: immediate
model/tool work: ~6s

What was not tested

I did not test with diagnostics.cacheTrace because of the privacy footprint.
I did not patch runtime code locally for this report.
I did not isolate whether native Codex plugin app config (#78733) specifically contributes to the thread lifecycle cost; the requested change is primarily observability so that can be measured cleanly.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering