openclaw: pi-embedded-runner bash tool inside codex_app_server has no per-tool-call cap, wedging the run lane for up to ~30.5 min before failover

openclaw, 2026-05-22 13:25:30

The pi-embedded command-lane backstop timeout (~30.5 min: runTimeoutMs plus EMBEDDED_RUN_LANE_TIMEOUT_GRACE_MS = 30 s) is the only safety net for a hung bash tool call inside the codex app-server. When a bash tool call started by the codex app-server stops emitting progress (no tool:bash:ended, no further outputDelta), no dedicated cap on activeToolAge applies: the lane waits out the full backstop before FailoverError: LLM request timed out fires and the model-fallback chain triggers. During that window the gateway diagnostic logs stalled session ... reason=blocked_tool_call ... activeToolAge=786s ... 1150s ... recovery=none every 30 s but does not intervene.

On interactive Discord/Telegram lanes, this means a wedged tool call blocks user-facing turns for roughly 30 minutes before anything recovers. The proposed fix is a tighter per-tool-call cap (default ~10 min for bash tools running under codex_app_server), distinct from both the LLM stream timeout and the lane backstop.

Error Message

2026-05-22T15:15:37 stalled session: sessionId=f2199b12-9630-42b8-b83d-b8ba76dd9935 sessionKey=agent:ops:discord:channel:1485601326640664766 state=processing age=593s queueDepth=1 reason=blocked_tool_call classification=blocked_tool_call activeWorkKind=tool_call lastProgress=codex_app_server:notification:thread/tokenUsage/updated lastProgressAge=579s activeTool=bash activeToolCallId=exec-6a479699-dff5-4bb8-9afa-b675a0843cb8 activeToolAge=786s recovery=none

2026-05-22T15:21:41 stalled session: ... activeToolAge=1150s recovery=none

2026-05-22T15:31:07 lane task error: lane=session:agent:ops:discord:channel:1485601326640664766 durationMs=1816681 error="FailoverError: LLM request timed out."

Root Cause

src/agents/pi-embedded-runner/run.ts:178-193 wires the lane timeout to runTimeoutMs + EMBEDDED_RUN_LANE_TIMEOUT_GRACE_MS:

const EMBEDDED_RUN_LANE_TIMEOUT_GRACE_MS = 30_000;

function resolveEmbeddedRunLaneTimeoutMs(timeoutMs: number): number | undefined {
  if (!Number.isFinite(timeoutMs) || timeoutMs <= 0) return undefined;
  return Math.floor(timeoutMs) + EMBEDDED_RUN_LANE_TIMEOUT_GRACE_MS;
}

This becomes taskTimeoutMs on the CommandQueueEnqueueOptions. The bash-tool exec helpers use DEFAULT_EXEC_APPROVAL_TIMEOUT_MS = 1_800_000 (src/infra/exec-approvals.ts:196) for the approval timeout. Neither is a per-active-tool-call cap. src/logging/diagnostic-session-attention.ts:35 only logs blocked_tool_call once activity.activeToolAgeMs > params.staleMs; it does not abort. There is no equivalent of turnCompletionIdleTimeoutMs (referenced in #84076) wired specifically to activeToolAge for codex app-server bash tools.

The result: a hung bash tool call inside codex_app_server can sit for more than 19 minutes (activeToolAge=1150s in the last diagnostic) with no intervention, and nothing ends it until the run-lane backstop arrives. That is far beyond what users expect on an interactive Discord lane. The bash subprocess limit we already document elsewhere, DEFAULT_JOB_TTL_MS = 30 * 60 * 1000 for bash-process-registry, is itself 30 minutes, so it does not bound the call any tighter.
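To make the timeout arithmetic concrete, the sketch below reuses the resolveEmbeddedRunLaneTimeoutMs snippet quoted above and evaluates it for an assumed runTimeoutMs of exactly 30 minutes (the issue only states "~30 min"). The 10-minute value is the cap proposed in this issue, not something present in the codebase.

```typescript
// Copied from the snippet quoted from src/agents/pi-embedded-runner/run.ts.
const EMBEDDED_RUN_LANE_TIMEOUT_GRACE_MS = 30_000;

function resolveEmbeddedRunLaneTimeoutMs(timeoutMs: number): number | undefined {
  if (!Number.isFinite(timeoutMs) || timeoutMs <= 0) return undefined;
  return Math.floor(timeoutMs) + EMBEDDED_RUN_LANE_TIMEOUT_GRACE_MS;
}

// Assumption: runTimeoutMs is exactly 30 minutes.
const assumedRunTimeoutMs = 30 * 60 * 1000;
const laneTimeoutMs = resolveEmbeddedRunLaneTimeoutMs(assumedRunTimeoutMs) ?? 0;

// Proposed per-tool-call cap from this issue (hypothetical, not implemented).
const proposedToolCapMs = 10 * 60 * 1000;

// Extra time a wedged tool call holds the lane without the proposed cap.
const extraWaitMs = laneTimeoutMs - proposedToolCapMs;

console.log(laneTimeoutMs);            // 1830000
console.log(laneTimeoutMs / 60_000);   // 30.5
console.log(extraWaitMs / 60_000);     // 20.5
console.log(resolveEmbeddedRunLaneTimeoutMs(0)); // undefined (no lane timeout)
```

Under that assumption, a wedged call holds the lane about 20.5 minutes longer than it would with the proposed cap.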

Fix Action

Workaround

Reduce the agent's per-run timeoutSeconds config (currently effectively ~30 min via runTimeoutMs) so the run-lane backstop fires earlier. Cost: tighter ceiling on legitimate long-running runs.

Environment

OpenClaw 2026.5.20 (e510042) — npm install at ~/.local/lib/node_modules/openclaw
Node 25.8.1, macOS 25.3.0 (arm64)
Provider: openai-codex / model gpt-5.5 / modelApi: openai-codex-responses
Codex app-server: stdio
Bash tool approval timeout: DEFAULT_EXEC_APPROVAL_TIMEOUT_MS = 1_800_000 (30 min)
Lane backstop: runTimeoutMs (~30 min, agent config) + EMBEDDED_RUN_LANE_TIMEOUT_GRACE_MS (30 s) ≈ 30.5 min nominal; the observed lane duration was 1,816,681 ms
Channel: Discord channel session (agent:ops:discord:channel:1485601326640664766)

Reproduction

1. Configure an OpenClaw agent to use openai-codex with gpt-5.5 and let it issue a bash tool call via the codex app-server.
2. Cause the bash subprocess to stop emitting output without terminating. In our incident, the underlying subprocess produced notification:thread/tokenUsage/updated and notification:item/completed and then went silent. This differs from the closed #82640 case, where the subprocess exited.
3. Observe the gateway diagnostic emit stalled session ... reason=blocked_tool_call ... activeToolAge=Xs recovery=none every 30 s.
4. Wait. The lane unblocks only when the ~30.5 minute backstop fires, not on any tool-specific cap.

Error / log evidence

From /Users/agent/.openclaw/logs/gateway.log.20260522-180005 (incident 2026-05-22 15:15-15:31 ICT, UTC +7):

2026-05-22T15:15:37 stalled session: sessionId=f2199b12-9630-42b8-b83d-b8ba76dd9935
  sessionKey=agent:ops:discord:channel:1485601326640664766
  state=processing age=593s queueDepth=1
  reason=blocked_tool_call classification=blocked_tool_call
  activeWorkKind=tool_call
  lastProgress=codex_app_server:notification:thread/tokenUsage/updated
  lastProgressAge=579s
  activeTool=bash activeToolCallId=exec-6a479699-dff5-4bb8-9afa-b675a0843cb8
  activeToolAge=786s
  recovery=none

2026-05-22T15:21:41 stalled session: ... activeToolAge=1150s recovery=none

2026-05-22T15:31:07 lane task error: lane=session:agent:ops:discord:channel:1485601326640664766
  durationMs=1816681
  error="FailoverError: LLM request timed out."

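durationMs=1816681 is about 30 min 16.7 s, which is the run-lane backstop rather than any tool-specific cap. activeToolAge had already passed 19 minutes (1150 s) in the last diagnostic before any timeout fired. The Discord/embedded run failed over only at the backstop, and every diagnostic in between reported recovery=none.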
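For anyone checking their own gateway log for the same pattern, here is a minimal sketch that pulls the key=value fields out of diagnostic lines like the ones above and converts the durations. The helper names (parseDiagnosticFields, parseSeconds) are made up for illustration and are not part of OpenClaw; the sample lines are abbreviated copies of the log excerpt.

```typescript
// Hypothetical helper: collect key=value pairs (values may be double-quoted).
function parseDiagnosticFields(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const match of line.matchAll(/(\w+)=("[^"]*"|\S+)/g)) {
    fields[match[1]] = match[2].replace(/^"|"$/g, "");
  }
  return fields;
}

// Hypothetical helper: turn "786s" into 786; anything else yields undefined.
function parseSeconds(value: string | undefined): number | undefined {
  const m = /^(\d+)s$/.exec(value ?? "");
  return m ? Number(m[1]) : undefined;
}

// Abbreviated copies of the lines from the log excerpt above.
const stalledLine =
  "2026-05-22T15:15:37 stalled session: state=processing age=593s queueDepth=1 " +
  "reason=blocked_tool_call lastProgressAge=579s activeTool=bash " +
  "activeToolAge=786s recovery=none";
const laneErrorLine =
  "2026-05-22T15:31:07 lane task error: durationMs=1816681 " +
  'error="FailoverError: LLM request timed out."';

const stalled = parseDiagnosticFields(stalledLine);
const laneError = parseDiagnosticFields(laneErrorLine);

const toolAgeSec = parseSeconds(stalled.activeToolAge);
const durationMs = Number(laneError.durationMs);
const minutes = Math.floor(durationMs / 60_000);
const seconds = ((durationMs % 60_000) / 1000).toFixed(1);

console.log(stalled.recovery, toolAgeSec);   // none 786
console.log(`${minutes} min ${seconds} s`);  // 30 min 16.7 s
```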

Root cause

src/agents/pi-embedded-runner/run.ts:178-193 wires the lane timeout to runTimeoutMs + EMBEDDED_RUN_LANE_TIMEOUT_GRACE_MS:

const EMBEDDED_RUN_LANE_TIMEOUT_GRACE_MS = 30_000;

function resolveEmbeddedRunLaneTimeoutMs(timeoutMs: number): number | undefined {
  if (!Number.isFinite(timeoutMs) || timeoutMs <= 0) return undefined;
  return Math.floor(timeoutMs) + EMBEDDED_RUN_LANE_TIMEOUT_GRACE_MS;
}

Suggested fix

Add a per-active-tool-call backstop specifically for bash (or all codex_app_server tools), bounded much tighter than the run-lane timeout:

// new: src/agents/codex-app-server/tool-call-timeout.ts (or similar)
const DEFAULT_CODEX_BASH_TOOL_CALL_TIMEOUT_MS = 10 * 60 * 1000; // 10 min

// when activeToolAge > DEFAULT_CODEX_BASH_TOOL_CALL_TIMEOUT_MS
// and lastProgress is stale (no outputDelta/notification in N seconds),
// emit a synthetic tool-error to unwedge the lane, then proceed to
// the normal failover/retry path.

The cap should be:

default 10 * 60 * 1000 (10 min) — well below the run-lane backstop
configurable per-provider/per-agent via agents.defaults.timeoutMs.codexBashToolCallMs or similar
scoped to bash tools under codex_app_server (and ACP tool calls generally), since other providers carry their own timeouts
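The decision logic described above could look roughly like the sketch below. Everything except DEFAULT_CODEX_BASH_TOOL_CALL_TIMEOUT_MS is a hypothetical name I made up for illustration (ActiveToolActivity, ToolCallCapOptions, evaluateToolCallCap), and the 2-minute progress-staleness threshold stands in for the unspecified "N seconds"; the real wiring into the codex app-server client would be up to the maintainers.

```typescript
const DEFAULT_CODEX_BASH_TOOL_CALL_TIMEOUT_MS = 10 * 60 * 1000; // 10 min

// Hypothetical shape; field names mirror the diagnostic log, not real types.
interface ActiveToolActivity {
  toolName: string;
  activeToolAgeMs: number;   // time since the tool call started
  lastProgressAgeMs: number; // time since the last outputDelta/notification
}

interface ToolCallCapOptions {
  capMs: number;           // per-tool-call cap
  progressStaleMs: number; // silence required before the cap may fire
}

type CapDecision = { action: "wait" } | { action: "abort"; reason: string };

// Abort only when the call is a bash tool, is past the cap, AND has gone
// quiet, so a long-running command that still streams output is left alone.
function evaluateToolCallCap(
  activity: ActiveToolActivity,
  options: ToolCallCapOptions,
): CapDecision {
  if (activity.toolName !== "bash") return { action: "wait" };
  if (activity.activeToolAgeMs <= options.capMs) return { action: "wait" };
  if (activity.lastProgressAgeMs < options.progressStaleMs) return { action: "wait" };
  return {
    action: "abort",
    reason: `bash tool call exceeded ${options.capMs} ms with no progress`,
  };
}

const capOptions: ToolCallCapOptions = {
  capMs: DEFAULT_CODEX_BASH_TOOL_CALL_TIMEOUT_MS,
  progressStaleMs: 2 * 60 * 1000, // assumption: stands in for "N seconds"
};

// Values from the first stalled-session log line in this issue.
const incident = evaluateToolCallCap(
  { toolName: "bash", activeToolAgeMs: 786_000, lastProgressAgeMs: 579_000 },
  capOptions,
);
console.log(incident.action); // abort
```

With these assumed thresholds, the state captured in the first diagnostic line (activeToolAge=786s, lastProgressAge=579s) would already qualify for an abort, so the caller could emit the synthetic tool-error and hand off to the normal failover path instead of waiting for the lane backstop.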

This is independent of #84076 (codex turnCompletionIdleTimeoutMs recovery semantics, fixed); my case is a tool call that never emits its terminal event, not a turn that never completes.

Workaround

Reduce the agent's per-run timeoutSeconds config (currently effectively ~30 min via runTimeoutMs) so the run-lane backstop fires earlier. Cost: tighter ceiling on legitimate long-running runs.

Severity

P2: interactive lanes (Discord/Telegram) wait up to 30.5 min for a wedged bash tool to fail before the model-fallback chain can take over. In our incident, this held up three sessions (ops, main queued behind it, and chatgpt upstream) for the full ~30-minute backstop instead of a ~10-minute per-tool cap.

Related

#82640: Codex harness session stalls forever after bash subprocess exits (closed; different: terminal tool:bash:ended arrived, gateway didn't observe)
#83474: Codex dynamic tool calls can leave sessions stuck as blocked_tool_call (closed; symptomatic overlap, different cause)
#84076: Codex app-server stalls after item/completed, then aborts without recovery/status (closed; turnCompletionIdleTimeoutMs semantics, orthogonal to bash-tool-specific cap)
#71127: Stuck processing sessions are detected but never aborted; gateway requires external restart to recover (closed; same family)
