openclaw: Discord embedded-run prep wedge before strict-agentic, recovery skips sessionId=unknown lanes

Summary

  • Version: OpenClaw 2026.5.18 (50a2481).
  • Channel: Discord (@openclaw/discord)
  • Runtime / providers reproduced: harness=codex (openai-codex/gpt-5.5), harness=pi (Pi-compat over codex OAuth, openai/gpt-5.5), and at least one confirmed wedge on provider=google model=gemini-3.5-flash (channel sticky-pinned via session.json) — so the wedge is not provider-specific.
  • Symptom:
    • before_dispatch / embedded_run:started observed
    • stalls before [agent/embedded] strict-agentic execution contract active
    • no llm_input, no before_tool_call, no agent_end
    • known-session lane: stuck session recovery fires at age=360s with action=abort_embedded_run aborted=true drained=false forceCleared=true released=1
    • unknown-session lane: recovery=none continues past 270s until gateway restart — no recovery observed
    • Gateway restart clears in-flight stuck lanes but new dispatches re-wedge within minutes — appears persistent in the gateway runtime state.
  • References (already landed in 2026.5.18, this happens after them):
    • #82782 / 91ae1a6c03 — split embedded attempt dispatch timing
    • #82891 / 8a060b2904 — release embedded session write lock before model I/O
    • e30be460e1 — shortened stalled Codex recovery window

Environment

  • macOS 26.3 (arm64), Node 22.22.1
  • Profile: secondary (~/.openclaw-hidetoshi/)
  • Auth: openai-codex:* OAuth profile only (no OPENAI_API_KEY)
  • 21 plugins (incl. observability for the diagnostic markers)
  • Pi-compat route configured via model-level override:
    {
      "agents.defaults.model.primary": "openai/gpt-5.5",
      "agents.defaults.models": {
        "openai/gpt-5.5":   { "agentRuntime": { "id": "pi" } },
        "openai/gpt-5.4":   { "agentRuntime": { "id": "pi" } },
        "openai/chat-latest": { "agentRuntime": { "id": "pi" } }
      }
    }
    Verified resolving: /status shows Runtime: OpenClaw Pi Default + 🔑 oauth (openai-codex:<email>); successful turns log provider=openai-codex/gpt-5.5 harness=pi.

Reproduction signal — primary case (full lifecycle through 360s auto-recovery, harness=codex)

24-line redacted slice from a single channel covering one successful turn, a wedged turn on the same session, the 30s-cadence stall diagnostics, and the 360s abort:

# Prior sessionId=unknown wedge being cleared by lane-suspension TTL — different recovery path:
13:14:34.648 [diagnostic] stalled session: sessionId=unknown sessionKey=…/channel:<chan-1>
              state=processing age=149s queueDepth=1 reason=active_work_without_progress
              classification=stalled_agent_run activeWorkKind=embedded_run
              lastProgress=embedded_run:started lastProgressAge=148s recovery=none
13:14:36.183 [diagnostic] lane wait exceeded: lane=main waitedMs=1626821 …
13:14:36.188 [diagnostic] lane wait exceeded: lane=main waitedMs=149154 queueAhead=1 activeAhead=0 activeNow=1
13:14:36.192 [session-suspension] auto-resumed lane after suspension TTL    ← recovery via TTL

# Successful turn — full marker sequence:
13:14:37.049 [agent/embedded] strict-agentic execution contract active:
              runId=4fcbf81b… sessionId=0e9608aa… provider=openai-codex/gpt-5.5 harness=codex
13:14:40.516 [observability] llm_input sessionKey=…/channel:<chan-1> provider=openai-codex model=gpt-5.5
13:14:50.717 [observability] agent_end sessionKey=…/channel:<chan-1> success=true durationMs=13664

# Wedged turn — reuses sessionId=0e9608aa…, no contract activation follows:
13:19:26.153 [observability] before_dispatch …/channel:<chan-1> conversationId=channel:slash:…
13:19:34.810 [observability] before_dispatch …/channel:<chan-1> conversationId=channel:<chan-1>
              (no `strict-agentic execution contract active`, no `llm_input`,
               no tool calls, no agent_end — for 360s)

# 30s-cadence stall diagnostics (gateway.err.log):
13:21:35.159 [diagnostic] long-running session: sessionId=0e9608aa… age=120s queueDepth=1
              reason=queued_behind_active_work classification=long_running activeWorkKind=embedded_run
              lastProgress=embedded_run:started lastProgressAge=118s recovery=none
13:22:05.159 [diagnostic] stalled session: … age=150s … recovery=none
13:22:35.161 [diagnostic] stalled session: … age=180s … recovery=none
13:23:05.162 [diagnostic] stalled session: … age=210s … recovery=none
13:23:35.166 [diagnostic] stalled session: … age=240s … recovery=none
13:24:05.163 [diagnostic] stalled session: … age=270s … recovery=none
13:24:35.164 [diagnostic] stalled session: … age=300s … recovery=none
13:25:05.168 [diagnostic] stalled session: … age=330s … recovery=none
13:25:35.170 [diagnostic] stalled session: … age=360s … recovery=checking    ← threshold reached

# Auto-recovery fires:
13:25:50.196 [diagnostic] stuck session recovery: sessionId=0e9608aa… age=360s
              action=abort_embedded_run aborted=true drained=false released=1
13:25:50.199 [diagnostic] stuck session recovery outcome: status=aborted
              action=abort_embedded_run … activeWorkKind=embedded_run
              lane=session:agent:hidetoshi:discord:channel:<chan-1>
              aborted=true drained=false forceCleared=true released=1

# Next dispatch was a user /new, 5.5 min later:
13:31:38.913 [observability] before_dispatch …/channel:<chan-1> conversationId=channel:slash:…
13:31:39.310 [observability] session_end …/channel:<chan-1> reason=new hadBinding=false

(Full file: stuck-turn-window-filtered.log.)
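The wedge signature in the slice above (a before_dispatch with no contract activation after it) can be checked mechanically. The following is a minimal sketch, not part of OpenClaw: the function names, the 60s window, and the choice to treat session_end as a resolution are assumptions made for illustration. It assumes lines shaped like "HH:MM:SS.mmm [tag] message", sorted by time, within a single day.

```python
import re
from datetime import datetime, timedelta

# Matches the "HH:MM:SS.mmm [tag] rest" shape used in the slice above.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[([^\]]+)\] (.*)$")


def parse(line):
    """Return (timestamp, tag, rest), or None for continuation/comment lines."""
    m = LINE_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
    return ts, m.group(2), m.group(3)


def is_resolution(tag, rest):
    """A dispatch counts as resolved by contract activation or a session_end."""
    if tag == "agent/embedded" and "strict-agentic execution contract active" in rest:
        return True
    return tag == "observability" and rest.startswith("session_end")


def find_wedged_dispatches(lines, window_s=60.0):
    """List before_dispatch timestamps with no resolution within window_s seconds.

    window_s is a hypothetical threshold chosen for this sketch.
    """
    events = [e for e in map(parse, lines) if e]
    wedged = []
    for i, (ts, tag, rest) in enumerate(events):
        if tag != "observability" or not rest.startswith("before_dispatch"):
            continue
        deadline = ts + timedelta(seconds=window_s)
        resolved = any(
            t2 <= deadline and is_resolution(tag2, rest2)
            for t2, tag2, rest2 in events[i + 1:]
        )
        if not resolved:
            wedged.append(ts.strftime("%H:%M:%S.%f")[:-3])
    return wedged
```

A dispatch logged less than window_s before the end of the file is also reported, so the tail of a truncated log needs a manual look.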

Reproduction signal — same wedge on harness=pi with sessionId=unknown (no recovery)

After a gateway restart that picked up agentRuntime.id: "pi" (subsequent successful turns logged harness=pi), two Discord channels wedged with an identical signature, but with sessionId=unknown and recovery=none for the full observation window:

15:06:45.110 before_dispatch …/channel:<chan-B>
15:07:50.477 before_dispatch …/channel:<chan-A>
              (no contract activation, no llm_input, no agent_end follows either)

15:08:47 [diagnostic] stalled session: sessionId=unknown sessionKey=…/<chan-B> age=122s recovery=none
…30s cadence continues, both lanes…
15:11:17 [diagnostic] stalled session: sessionId=unknown sessionKey=…/<chan-B> age=272s recovery=none
15:11:17 [diagnostic] stalled session: sessionId=unknown sessionKey=…/<chan-A> age=207s recovery=none
15:11:23 [gateway] SIGTERM (manual restart — would not have hit the 360s threshold)

Both lanes had lastProgress=embedded_run:started lastProgressAge≈age and recovery=none throughout. The diagnostic emitter has enough identity to log the channel/session-key and keep the lane wedged, but the recovery path appears to require sessionId=known — so these lanes never get aborted automatically.
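To make the suspected gap concrete, here is a minimal sketch of a recovery decision that falls back to the session key when no sessionId is registered. This is not OpenClaw's implementation: LaneState, plan_recovery, and the "missing_session_id" reason are hypothetical names, and only the 360s threshold and the abort_embedded_run action come from the logs above.

```python
from dataclasses import dataclass
from typing import Optional

STUCK_THRESHOLD_S = 360  # threshold observed in the known-session logs


@dataclass
class LaneState:
    session_key: str           # present in every stalled-session diagnostic
    session_id: Optional[str]  # None models sessionId=unknown
    age_s: int


def plan_recovery(lane: LaneState) -> dict:
    """Decide what a recovery pass could do for one stalled lane (hypothetical)."""
    if lane.age_s < STUCK_THRESHOLD_S:
        return {"recovery": "none", "reason": "below_threshold"}
    if lane.session_id is not None:
        return {
            "recovery": "abort_embedded_run",
            "key": ("sessionId", lane.session_id),
        }
    # Fallback: key the abort on the session key instead of skipping the lane,
    # and record why the usual key was unavailable.
    return {
        "recovery": "abort_embedded_run",
        "key": ("sessionKey", lane.session_key),
        "reason": "missing_session_id",
    }
```

Even if keying on sessionKey is not viable, returning an explicit reason for the skipped lane would make the diagnostic distinguish "not yet due" from "cannot recover".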

(Full file: stuck-turn-window-pi-harness.log.)

Notes and questions

  1. Wedge zone is pre-runtime. The lane reaches embedded_run:started but never strict-agentic execution contract active. Whatever blocks sits in the embedded-run prep path (workspace-sandbox / runtime-plugins / hooks / model-resolution / auth / context-engine / attempt-workspace / attempt-prompt) — the same prep stages I see traced in [trace:embedded-run] prep stages lines elsewhere. Is there a code path in there that can block-forever without a timeout enforcer?

  2. Affects both harnesses. Same signature on harness=codex and harness=pi, on the same gateway, same channel. This isn't a runtime bug — it's in the layer underneath both. The three adjacent fixes (#82782, #82891, e30be460e1) are already in 2026.5.18 and don't cover this case.

  3. sessionId=unknown falls outside recovery. In these logs, the 360s stuck session recovery fired only for a lane with a registered sessionId. Wedges that hang before sessionId registration appear to be unrecoverable by that path: the diagnostic has the lane's channel/session-key identity, but recovery skips it. This looks like either a recovery coverage gap (recovery should also key on lane / sessionKey) or a missing "cannot recover because missing session id" diagnostic reason.

  4. drained=false forceCleared=true on the recovery outcome — the lane wasn't drained, only force-cleared. Is the embedded-run / codex-app-server child cleaned up cleanly when this happens, or could there be a leak that contributes to the gateway accumulating stale state over time? (For context: this gateway has been seeing wedges roughly every few hours of active Discord use.)
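For question 1, this is the kind of timeout enforcer I have in mind, sketched in Python with asyncio rather than in the gateway's own code. Everything here is hypothetical (PrepStageTimeout, run_prep_stages, the per-stage deadline); the stage names in the usage comment are only the ones listed above.

```python
import asyncio


class PrepStageTimeout(Exception):
    """Raised when one prep stage exceeds its deadline (hypothetical type)."""

    def __init__(self, stage, timeout_s):
        super().__init__(f"prep stage {stage!r} exceeded {timeout_s}s")
        self.stage = stage


async def run_prep_stages(stages, timeout_s):
    """Run (name, coroutine_function) pairs in order, each under its own deadline.

    Returns the names of the completed stages, or raises PrepStageTimeout
    naming the stage that hung, so the failure is attributable.
    """
    completed = []
    for name, stage_fn in stages:
        try:
            await asyncio.wait_for(stage_fn(), timeout=timeout_s)
        except asyncio.TimeoutError:
            raise PrepStageTimeout(name, timeout_s) from None
        completed.append(name)
    return completed


# Usage sketch:
#   asyncio.run(run_prep_stages([("auth", do_auth), ("context-engine", do_ctx)], 30))
```

asyncio.wait_for works by cancelling the awaited task, so a stage that blocks in synchronous code or swallows cancellation would still hang. The point of the sketch is only that a per-stage deadline turns a silent wedge into a named error.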

Adjacent observation (possibly related — not proven)

In the same window as the primary wedge, lane=main showed accumulated congestion:

  • [diagnostic] lane wait exceeded: lane=main waitedMs=1626821 (27 minutes of accumulated wait) at 13:14:36, cleared by [session-suspension] auto-resumed lane after suspension TTL
  • Cascade of failing-fast embedded runs on lane=main 13:14:59–13:15:30:
    • five workspace-lead agents (baikinman, design-library-lead, takeshi, design-lead, content-ops-lead) fail with No API key found for provider "openai-codex" in ~1s each; each emits [diagnostic] lane task error: lane=main durationMs=~1140
    • infra-lead failing-fast with xai grok 403 billing error (agent_end success=true durationMs=696 — billing-error fast-fail papered as success at observability layer); recurs every ~30 min on the gateway
    • [fetch-timeout] fetch timeout after 10000ms (elapsed 13365ms) operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me
  • ~4 min after the cascade, the primary wedge fires.

These failing-fast paths emit lane-task errors but the observability layer logs agent_end success=true for some — possibly polluting shared scheduler state (lane queue, codex-app-server pool, auth cache, workspace-sandbox prep) without surfacing a leak. Couldn't prove causation without source-level tracing — flagging in case it points to a shared lock / pool / queue path that the maintainer would recognize.
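One cheap way to surface the "fast-fail logged as success" pattern is to filter agent_end lines that report success=true with a very short durationMs. The sketch below is a heuristic only: the function name and the 1000 ms cutoff are assumptions, and a fast success is not necessarily a masked failure.

```python
import re

# Picks success and durationMs out of an agent_end line such as:
#   13:14:50.717 [observability] agent_end sessionKey=... success=true durationMs=13664
AGENT_END_RE = re.compile(
    r"agent_end\b.*?\bsuccess=(true|false)\b.*?\bdurationMs=(\d+)"
)


def suspicious_fast_successes(lines, max_duration_ms=1000):
    """Return agent_end lines with success=true and durationMs below the cutoff.

    max_duration_ms is a hypothetical cutoff; tune it against real successful turns.
    """
    hits = []
    for line in lines:
        m = AGENT_END_RE.search(line)
        if m and m.group(1) == "true" and int(m.group(2)) < max_duration_ms:
            hits.append(line.strip())
    return hits
```

Cross-checking the hits against lane task error lines in the same second would show which of them are actually failures.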

What I can share on request

  • Full unfiltered merged log window (gateway + err) for both cases
  • The lane=main cascade window (workspace-lead + xai heartbeat failures)
  • models status, config get agents.defaults.models, config get channelModels outputs
  • [trace:embedded-run] startup stages / prep stages line samples from successful turns for comparison
  • Longer-window repro if a "leave-it-running" repro would be useful
