codex: Session task panics can leave turns without TurnComplete

GitHub stats
openai/codex#19934 (fetched 2026-04-29 06:25:12)
Comments: 0 · Participants: 1 · Timeline events: 6 · Reactions: 0
Timeline (top): labeled ×4, unlabeled ×2

Summary

While investigating a stuck app-server runner in an external WebSocket client, I found a Codex lifecycle failure mode where a panicking SessionTask::run() can leave the turn without a TurnComplete event.

The process and WebSocket can remain alive, but clients that wait for the semantic turn lifecycle never observe completion and keep the task running indefinitely.

Root cause

In codex-rs/core/src/tasks/mod.rs, Session::start_task spawns an async task and awaits task_for_run.run(...) directly. If the task future panics, execution exits the spawned task before the cleanup path runs:

  • flush_rollout() is skipped
  • on_task_finished(...) is skipped
  • waiters are not notified
  • no TurnComplete is emitted
  • the active turn can remain stuck from the client perspective
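
The failure mode above can be reproduced in miniature. The following is a minimal sketch using only the standard library and a worker thread rather than the async runtime that Codex uses; `Event`, `run_unguarded`, `ok_task`, and `panicking_task` are hypothetical names for illustration and are not taken from codex-core. It only shows that code placed after an unguarded call is skipped when that call panics.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the lifecycle events a client would observe.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    TurnStarted,
    TurnComplete,
}

/// Runs `task` on a worker thread and emits the completion event only
/// after `task` returns, mirroring the unguarded shape described above.
fn run_unguarded(task: fn()) -> Vec<Event> {
    let events = Arc::new(Mutex::new(vec![Event::TurnStarted]));
    let worker_events = Arc::clone(&events);
    let handle = thread::spawn(move || {
        task(); // a panic here unwinds past the line below
        worker_events.lock().unwrap().push(Event::TurnComplete);
    });
    // The join error is ignored on purpose, like a detached spawned task.
    let _ = handle.join();
    let observed = events.lock().unwrap().clone();
    observed
}

fn ok_task() {}

fn panicking_task() {
    panic!("synthetic task panic");
}

fn main() {
    println!("{:?}", run_unguarded(ok_task));
    println!("{:?}", run_unguarded(panicking_task));
}
```

In this sketch the second call returns only the start event: the worker unwound before the completion event was pushed, which is the same shape as a turn that never reports TurnComplete.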

Impact

This is especially visible for app-server/WebSocket clients that treat TurnComplete as the authoritative completion signal. Transport-level heartbeat/ping/pong can still be healthy, so this does not look like a dead process or broken socket.

I do not want to overclaim that every externally observed stuck turn is caused by this path, but this is a concrete lifecycle hole that can produce the same class of symptom.

Proposed patch

I prepared a small patch in my fork:

https://github.com/dyjxg4xygary/codex/commit/c51d3c56f0efdcbbf494b5a65e4309a8410d99c5

Compare link:

https://github.com/openai/codex/compare/main...dyjxg4xygary:fix/websocket-semantic-idle?expand=1

The patch wraps SessionTask::run() with catch_unwind(). On panic it emits an internal error event, then continues through the existing finish path so rollout flush, task finish handling, waiter notification, and TurnComplete still happen.
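
The general idea can be sketched as follows. This is not the code from the linked commit: it is a synchronous simplification built on `std::panic::catch_unwind`, and `Event` and `run_guarded` are hypothetical names. The real `SessionTask::run()` is async, so the actual patch presumably needs a future-aware wrapper, which is not shown here.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical event type; the real code uses EventMsg variants.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Error(String),
    TurnComplete,
}

/// Runs `task`, turns a panic into an error event, and then always
/// continues through the finish path.
fn run_guarded<F: FnOnce()>(task: F) -> Vec<Event> {
    let mut events = Vec::new();

    // AssertUnwindSafe is an assumption of this sketch: the closure's
    // captured state is not reused after a panic.
    if let Err(payload) = panic::catch_unwind(AssertUnwindSafe(task)) {
        // Panic payloads are usually &str or String; fall back otherwise.
        let message = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string());
        events.push(Event::Error(message));
    }

    // Finish path: reached whether or not the task panicked.
    events.push(Event::TurnComplete);
    events
}

fn main() {
    println!("{:?}", run_guarded(|| {}));
    println!("{:?}", run_guarded(|| { panic!("boom"); }));
}
```

Note that `catch_unwind` only intercepts unwinding panics. A build configured with `panic = "abort"` terminates the process instead, so a guard of this kind would not apply there.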

I could not open a PR because this repository currently limits PR creation to collaborators.

Regression test

The patch adds panicking_task_emits_error_and_completes_turn, which creates a synthetic SessionTask that panics in run() and asserts that Codex emits:

  1. EventMsg::Error with CodexErrorInfo::InternalServerError
  2. EventMsg::TurnComplete
  3. a cleared active_turn
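
The three assertions above can be illustrated against a toy model. The sketch below does not use the real codex-core types or test harness; `MiniSession`, `run_turn`, and `check_panicking_task_completes_turn` are hypothetical, and the task is a plain closure rather than a `SessionTask`.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical, simplified events; the real test checks EventMsg values.
#[derive(Debug, PartialEq)]
enum Event {
    InternalError,
    TurnComplete,
}

/// Hypothetical, simplified session state.
struct MiniSession {
    active_turn: Option<u32>,
    events: Vec<Event>,
}

impl MiniSession {
    /// Runs one turn with a panic guard around the task body.
    fn run_turn(&mut self, turn_id: u32, task: impl FnOnce()) {
        self.active_turn = Some(turn_id);
        if panic::catch_unwind(AssertUnwindSafe(task)).is_err() {
            self.events.push(Event::InternalError);
        }
        self.events.push(Event::TurnComplete);
        self.active_turn = None; // the turn is cleared on every path
    }
}

/// Drives a synthetic panicking task and returns the resulting state.
fn check_panicking_task_completes_turn() -> MiniSession {
    let mut session = MiniSession { active_turn: None, events: Vec::new() };
    session.run_turn(1, || { panic!("synthetic panic in run()"); });
    session
}

fn main() {
    let session = check_panicking_task_completes_turn();
    // 1 and 2: the error event arrives first, then the completion event.
    assert_eq!(session.events, vec![Event::InternalError, Event::TurnComplete]);
    // 3: no turn is left active.
    assert!(session.active_turn.is_none());
    println!("panicking task still completed the turn");
}
```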

Verification

Targeted tests passed locally:

CARGO_NET_GIT_FETCH_WITH_CLI=true cargo test -p codex-core session::tests::panicking_task_emits_error_and_completes_turn
CARGO_NET_GIT_FETCH_WITH_CLI=true cargo test -p codex-core session::tests::spawn_task_turn_span_inherits_dispatch_trace_context
just fmt
git diff --check

I also ran the full cargo test -p codex-core. It completed with unrelated local failures in existing agent init / shell snapshot / unified exec timing tests: 1617 passed; 9 failed; 3 ignored.

Extent analysis

TL;DR

The most likely fix is the proposed patch, which wraps SessionTask::run() with catch_unwind() so that TurnComplete is emitted even if the task panics.

Guidance

  • Review the proposed patch in the provided GitHub commit and compare link to understand the changes made to handle panicking tasks.
  • Apply the patch to your local codebase and run targeted tests, such as panicking_task_emits_error_and_completes_turn, to verify the fix.
  • Run the full cargo test -p codex-core suite to ensure no regressions are introduced.
  • Consider reaching out to repository collaborators to discuss opening a PR for the proposed patch.

Example

No code snippet is provided as the issue already includes a proposed patch and the focus is on reviewing and applying that patch.

Notes

The patch's effectiveness in resolving stuck app-server runners and WebSocket clients depends on the specific use case and environment. Thorough testing is recommended before deploying the patch to production.

Recommendation

Apply the workaround by implementing the proposed patch, as it directly addresses the identified lifecycle hole and ensures TurnComplete is emitted even in the event of a panicking task.
