openclaw - 💡(How to fix) Fix Codex app-server thread lifecycle latency is hidden between attempt-dispatch and session.started

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

  • Warn automatically if total thread lifecycle exceeds e.g. 2000ms, or any stage exceeds e.g. 1000ms.

Root Cause

For short turns like “reply only OK”, a user-visible 18s turn currently looks like:

  • queue: essentially free
  • embedded startup trace: ~1.3s
  • prompt submit after session start: immediate
  • model/tool work: ~6s

But the missing ~6s before session.started is not attributable from current logs. Adding thread lifecycle timing would make future reports actionable and would clarify whether the latency is caused by plugin app config recovery/build, app-server thread resume/start RPC, binding I/O, or something else.

Fix Action

Fix / Workaround

With OPENCLAW_LOG_LEVEL=trace, [trace:embedded-run] startup stages ends at attempt-dispatch and reports only ~1.36s of startup/prep. However, session.started is recorded several seconds later. That gap appears to be inside the Codex app-server thread lifecycle (startOrResumeThread / thread.resume or thread.start), but today it has no comparable stage breakdown.

[trace:embedded-run] startup stages: runId=7141c10d-811a-49b8-a5e2-c3ca23f08b84 sessionId=18b56039-14d7-448f-95bd-edef2c59acd5 phase=attempt-dispatch totalMs=1365 stages=workspace:1ms@1ms,runtime-plugins:24ms@25ms,hooks:1ms@26ms,model-resolution:841ms@867ms,auth:482ms@1349ms,context-engine:7ms@1356ms,attempt-workspace:8ms@1364ms,attempt-prompt:0ms@1364ms,attempt-runtime-plan:1ms@1365ms,attempt-dispatch:0ms@1365ms
  • OpenClaw lane wait is tiny: 5-6ms.
  • Existing embedded startup trace reports only 1365ms.
  • session.started does not occur until ~6.3s after strict-agentic active / attempt-dispatch.
  • prompt.submitted is immediate after session.started (~11ms).
  • Therefore the largest unaccounted startup block is before session.started, likely in Codex app-server thread lifecycle.

Code Example

08:37:16.717 message queued
08:37:20.273 lane enqueue
08:37:20.278 lane dequeue waitMs=5
08:37:20.282 main lane enqueue
08:37:20.289 main lane dequeue waitMs=6
08:37:21.642 strict-agentic execution contract active
08:37:21.658 [trace:embedded-run] startup stages ... totalMs=1365
08:37:27.981 session.started
08:37:27.992 prompt.submitted
08:37:32.186 tool.call message
08:37:32.260 tool.result message
08:37:34.222 model.completed
08:37:35.022 message processed duration=18334ms

---

[trace:embedded-run] startup stages: runId=7141c10d-811a-49b8-a5e2-c3ca23f08b84 sessionId=18b56039-14d7-448f-95bd-edef2c59acd5 phase=attempt-dispatch totalMs=1365 stages=workspace:1ms@1ms,runtime-plugins:24ms@25ms,hooks:1ms@26ms,model-resolution:841ms@867ms,auth:482ms@1349ms,context-engine:7ms@1356ms,attempt-workspace:8ms@1364ms,attempt-prompt:0ms@1364ms,attempt-runtime-plan:1ms@1365ms,attempt-dispatch:0ms@1365ms

---

session.started   08:37:27.981
context.compiled  08:37:27.984  (+3ms)
prompt.submitted  08:37:27.992  (+11ms)
tool.call         08:37:32.186  (+4205ms)
tool.result       08:37:32.260  (+4279ms)
model.completed   08:37:34.222  (+6241ms)
session.ended     08:37:34.222  (+6241ms)

---

startupStages.mark(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch);
notifyExecutionPhase("attempt_dispatch", { provider, model: modelId });
emitStartupStageSummary(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch);
startupStagesEmitted = true;

---

const startupThread = await startOrResumeThread(buildThreadLifecycleParams());

---

[trace:codex-app-server] thread lifecycle: runId=... sessionId=... action=resumed totalMs=6339 stages=read-binding:12ms@12ms,binding-compat-checks:18ms@30ms,plugin-thread-config-recheck:2100ms@2130ms,thread-resume-request:3900ms@6030ms,write-binding:220ms@6250ms,thread-ready:89ms@6339ms

---

native hook relay bridge server address unavailable

---

server.listen(0, "127.0.0.1", () => {
  writeNativeHookRelayBridgeRecordForRegistration(registration, bridge);
});
if (relayBridges.get(registration.relayId) === bridge) {
  writeNativeHookRelayBridgeRecordForRegistration(registration, bridge);
}
RAW_BUFFERClick to expand / collapse

Problem

A production OpenClaw Docker gateway on 2026.5.18 shows noticeable latency in short Codex-backed Dashboard/Webchat turns, but the existing trace output makes the slow section hard to localize.

With OPENCLAW_LOG_LEVEL=trace, [trace:embedded-run] startup stages ends at attempt-dispatch and reports only ~1.36s of startup/prep. However, session.started is recorded several seconds later. That gap appears to be inside the Codex app-server thread lifecycle (startOrResumeThread / thread.resume or thread.start), but today it has no comparable stage breakdown.

This makes operators think prep is cheap while the user-visible turn is still slow, and it makes it hard to tell whether the time is spent in binding reads, plugin app config recovery/build, thread resume/start RPC, or binding writes.

Environment

  • OpenClaw: 2026.5.18
  • Runtime: Docker, derived runtime image
  • Channel tested: Dashboard/Webchat direct conversation
  • Model/provider: openai-codex/gpt-5.5
  • Gateway health: healthy during the clean sample
  • OPENCLAW_LOG_LEVEL=trace was enabled temporarily for the measurement, then restored
  • diagnostics.cacheTrace was not enabled

Observed clean sample

Run: 7141c10d-811a-49b8-a5e2-c3ca23f08b84

Timeline from gateway logs / trajectory:

08:37:16.717 message queued
08:37:20.273 lane enqueue
08:37:20.278 lane dequeue waitMs=5
08:37:20.282 main lane enqueue
08:37:20.289 main lane dequeue waitMs=6
08:37:21.642 strict-agentic execution contract active
08:37:21.658 [trace:embedded-run] startup stages ... totalMs=1365
08:37:27.981 session.started
08:37:27.992 prompt.submitted
08:37:32.186 tool.call message
08:37:32.260 tool.result message
08:37:34.222 model.completed
08:37:35.022 message processed duration=18334ms

Existing embedded startup trace:

[trace:embedded-run] startup stages: runId=7141c10d-811a-49b8-a5e2-c3ca23f08b84 sessionId=18b56039-14d7-448f-95bd-edef2c59acd5 phase=attempt-dispatch totalMs=1365 stages=workspace:1ms@1ms,runtime-plugins:24ms@25ms,hooks:1ms@26ms,model-resolution:841ms@867ms,auth:482ms@1349ms,context-engine:7ms@1356ms,attempt-workspace:8ms@1364ms,attempt-prompt:0ms@1364ms,attempt-runtime-plan:1ms@1365ms,attempt-dispatch:0ms@1365ms

Trajectory for the same run:

session.started   08:37:27.981
context.compiled  08:37:27.984  (+3ms)
prompt.submitted  08:37:27.992  (+11ms)
tool.call         08:37:32.186  (+4205ms)
tool.result       08:37:32.260  (+4279ms)
model.completed   08:37:34.222  (+6241ms)
session.ended     08:37:34.222  (+6241ms)

Key observation:

  • OpenClaw lane wait is tiny: 5-6ms.
  • Existing embedded startup trace reports only 1365ms.
  • session.started does not occur until ~6.3s after strict-agentic active / attempt-dispatch.
  • prompt.submitted is immediate after session.started (~11ms).
  • Therefore the largest unaccounted startup block is before session.started, likely in Codex app-server thread lifecycle.

Relevant code paths

The current startup trace is emitted in:

  • src/agents/pi-embedded-runner/run.ts

It marks and emits at:

startupStages.mark(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch);
notifyExecutionPhase("attempt_dispatch", { provider, model: modelId });
emitStartupStageSummary(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch);
startupStagesEmitted = true;

The next expensive path appears to be in:

  • extensions/codex/src/app-server/run-attempt.ts
  • extensions/codex/src/app-server/thread-lifecycle.ts

Specifically around:

const startupThread = await startOrResumeThread(buildThreadLifecycleParams());

and inside startOrResumeThread, including:

  • readCodexAppServerBinding(...)
  • context-engine/user-MCP/MCP/plugin binding compatibility checks
  • optional pluginThreadConfig.build() / recoverable plugin binding recheck
  • client.request("thread/resume", ...) or client.request("thread/start", ...)
  • writeCodexAppServerBinding(...)

Suggested fix

Add a Codex app-server thread lifecycle stage summary, analogous to the embedded startup summary, around startOrResumeThread().

Suggested stages:

  • dynamic-tools-fingerprint
  • context-engine-binding
  • user-mcp-config
  • read-binding
  • binding-compat-checks
  • plugin-binding-stale-check
  • plugin-thread-config-build / plugin-thread-config-recheck
  • thread-resume-request
  • thread-start-request
  • write-binding
  • thread-ready

Suggested behavior:

  • Emit at trace level when OPENCLAW_LOG_LEVEL=trace.
  • Warn automatically if total thread lifecycle exceeds e.g. 2000ms, or any stage exceeds e.g. 1000ms.
  • Include lifecycle action (started vs resumed) and whether plugin app config was built/reused.

Example target log shape:

[trace:codex-app-server] thread lifecycle: runId=... sessionId=... action=resumed totalMs=6339 stages=read-binding:12ms@12ms,binding-compat-checks:18ms@30ms,plugin-thread-config-recheck:2100ms@2130ms,thread-resume-request:3900ms@6030ms,write-binding:220ms@6250ms,thread-ready:89ms@6339ms

Secondary observation: native hook relay address debug race

In the same trace window, this line appears repeatedly a few seconds after harness selection:

native hook relay bridge server address unavailable

The source appears to call writeNativeHookRelayBridgeRecordForRegistration(...) immediately after server.listen(...), before the listen callback is guaranteed to have a bound address:

server.listen(0, "127.0.0.1", () => {
  writeNativeHookRelayBridgeRecordForRegistration(registration, bridge);
});
if (relayBridges.get(registration.relayId) === bridge) {
  writeNativeHookRelayBridgeRecordForRegistration(registration, bridge);
}

This may be a harmless debug race because the callback writes the record later, so I am not claiming it is the main latency cause. But it can confuse latency investigations because it appears in the hidden gap. Consider either removing the eager write, making it explicitly best-effort/noisy only at trace, or exposing a bridge-ready timing if it matters for hook startup.

Why this matters

For short turns like “reply only OK”, a user-visible 18s turn currently looks like:

  • queue: essentially free
  • embedded startup trace: ~1.3s
  • prompt submit after session start: immediate
  • model/tool work: ~6s

But the missing ~6s before session.started is not attributable from current logs. Adding thread lifecycle timing would make future reports actionable and would clarify whether the latency is caused by plugin app config recovery/build, app-server thread resume/start RPC, binding I/O, or something else.

What was not tested

  • I did not test with diagnostics.cacheTrace because of the privacy footprint.
  • I did not patch runtime code locally for this report.
  • I did not isolate whether native Codex plugin app config (#78733) specifically contributes to the thread lifecycle cost; the requested change is primarily observability so that can be measured cleanly.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix Codex app-server thread lifecycle latency is hidden between attempt-dispatch and session.started