codex: Demote `failed to record rollout items` from error to warn so harness stderr stops looking like a fatal failure

openai/codex#22055 (fetched 2026-05-11 03:20:02)

Session::persist_rollout_items (codex-rs/core/src/session/mod.rs:2777-2783) logs at error! whenever the active LiveThread rejects an append:

pub(crate) async fn persist_rollout_items(&self, items: &[RolloutItem]) {
    if let Some(live_thread) = self.live_thread()
        && let Err(e) = live_thread.append_items(items).await
    {
        error!("failed to record rollout items: {e:#}");
    }
}

The append path can fail for reasons that are routine in normal operation and not user-actionable from a parent harness's perspective:

  • the live recorder for that thread_id was already removed by a concurrent shutdown / discard,
  • the state-db row points at a rollout file that has since been pruned,
  • ThreadStoreError::ThreadNotFound { thread_id } is returned because the row was deleted between the resume resolve and the append (race),
  • the writer task is mid-shutdown and refuses new appends.

In all of these cases the active turn continues normally — only the persistence of those rollout items is skipped. The error! level is misleading because nothing is broken for the user.
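The behaviour described above (the turn completes, only persistence is skipped) can be sketched with standard-library Rust. This is an illustrative model, not the codex-rs implementation: `AppendError`, `TurnOutcome`, and `finish_turn` are hypothetical names, and the real error types may be shaped differently.

```rust
use std::fmt;

// Hypothetical stand-ins for the four failure cases listed above; the real
// error types in codex-rs may be shaped differently.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AppendError {
    RecorderRemoved,
    RolloutFilePruned,
    ThreadNotFound { thread_id: String },
    WriterShuttingDown,
}

impl fmt::Display for AppendError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppendError::RecorderRemoved => write!(f, "live recorder already removed"),
            AppendError::RolloutFilePruned => write!(f, "rollout file pruned"),
            AppendError::ThreadNotFound { thread_id } => {
                write!(f, "thread {thread_id} not found")
            }
            AppendError::WriterShuttingDown => write!(f, "writer is shutting down"),
        }
    }
}

// What the caller observes once the turn ends (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct TurnOutcome {
    pub turn_completed: bool,
    pub items_persisted: bool,
    pub skipped_reason: Option<String>,
}

// An append failure never fails the turn; it only records why the
// rollout items were not persisted.
pub fn finish_turn(append_result: Result<(), AppendError>) -> TurnOutcome {
    match append_result {
        Ok(()) => TurnOutcome {
            turn_completed: true,
            items_persisted: true,
            skipped_reason: None,
        },
        Err(e) => TurnOutcome {
            turn_completed: true,
            items_persisted: false,
            skipped_reason: Some(e.to_string()),
        },
    }
}
```

In this model every error variant leaves `turn_completed` set to `true`, which is the reason the issue argues the event is a degraded-persistence warning and not an error.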

Error Message

ERROR codex_core::session: failed to record rollout items: thread <uuid> not found

Root Cause

codex exec and codex app-server (in-process or stdio) are increasingly driven by parent harnesses (cdx, IDE clients, scripted runners) that pipe stderr to a user-facing pane. With tracing-subscriber at default info, every one of these append failures shows up as:

ERROR codex_core::session: failed to record rollout items: thread <uuid> not found

Downstream, harness authors end up writing ad-hoc stderr regex filters to suppress the line so the UI doesn't look like it's exploding. Real example from a wrapper TUI:

if (/^(items:\s*)?thread\s+[0-9a-f-]+\s+not found$/i.test(trimmed)) return false;

That filter would not be needed if the upstream log level reflected the actual severity (degraded persistence, turn still succeeded).
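To make the cost of that workaround concrete, here is an approximate standard-library Rust port of the wrapper's regex. The function name `is_thread_not_found_noise` is hypothetical, and whitespace handling follows Rust's `trim_start`, which may differ slightly from JavaScript's `\s`.

```rust
// Approximate port of /^(items:\s*)?thread\s+[0-9a-f-]+\s+not found$/i.
// Returns true when the (already trimmed) line is the noise a harness
// would suppress. The function name is hypothetical.
pub fn is_thread_not_found_noise(trimmed: &str) -> bool {
    // The `i` flag: compare case-insensitively by lowercasing first.
    let lower = trimmed.to_ascii_lowercase();

    // Optional "items:" prefix followed by any amount of whitespace.
    let rest = match lower.strip_prefix("items:") {
        Some(r) => r.trim_start(),
        None => lower.as_str(),
    };

    // Literal "thread".
    let Some(rest) = rest.strip_prefix("thread") else {
        return false;
    };

    // `\s+`: at least one whitespace character must follow.
    let after_ws = rest.trim_start();
    if after_ws.len() == rest.len() {
        return false;
    }

    // `[0-9a-f-]+`: a non-empty run of hex digits and dashes (all ASCII,
    // so the character count is also a valid byte offset).
    let id_len = after_ws
        .chars()
        .take_while(|c| matches!(c, '0'..='9' | 'a'..='f' | '-'))
        .count();
    if id_len == 0 {
        return false;
    }

    // `\s+not found$`: whitespace, then exactly "not found" to end of line.
    let rest = &after_ws[id_len..];
    let tail = rest.trim_start();
    tail.len() != rest.len() && tail == "not found"
}
```

Every harness that pipes stderr to a user-facing pane has to carry and maintain a matcher like this for as long as the upstream line stays at ERROR.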

Proposed change

Demote error! to warn! and shorten the phrasing so the line stops reading as a fatal failure, while keeping it visible by default for operators who care about persistence health:

pub(crate) async fn persist_rollout_items(&self, items: &[RolloutItem]) {
    if let Some(live_thread) = self.live_thread()
        && let Err(e) = live_thread.append_items(items).await
    {
        warn!("rollout persistence skipped: {e}");
    }
}

Operators who genuinely need the full error chain while debugging rollout I/O can re-enable verbose logging with RUST_LOG=codex_core::session=debug (or promote the line back to error! with a custom filter). The line stays in the binary; it just stops emitting at ERROR level on a hot path.
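As a rough illustration of how a `target=level` directive such as `codex_core::session=debug` gates output, the following standard-library sketch models a single directive plus a default level. It is a deliberate simplification and not the tracing-subscriber implementation; `Level`, `parse_level`, and `enabled` are hypothetical names.

```rust
// Verbosity order: Error is the least verbose, Trace the most.
// Declaration order drives the derived `Ord`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

pub fn parse_level(s: &str) -> Option<Level> {
    match s.to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

// Simplified model: one optional `target=level` directive, with `default`
// applied to every target the directive does not cover.
pub fn enabled(directive: &str, default: Level, target: &str, level: Level) -> bool {
    let max = match directive.split_once('=') {
        Some((prefix, lvl))
            if target == prefix || target.starts_with(&format!("{prefix}::")) =>
        {
            parse_level(lvl).unwrap_or(default)
        }
        _ => default,
    };
    // An event is emitted when it is no more verbose than the maximum.
    level <= max
}
```

Under this model a warn-level line is still emitted at a default of info, which matches the stated goal of keeping the message visible by default.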

Related: `thread <uuid> not found` resume noise

The companion thread <uuid> not found line that parent harnesses also filter (turn_processor.rs:184, thread_processor.rs:654/3692, thread_lifecycle.rs:151, returning invalid_request(format!("thread not found: {thread_id}"))) is a JSON-RPC error response, not a tracing log. It surfaces when a thread/resume lookup races against a shutdown/delete and the resolved id is stale by the time the resume request runs. The cleanest fix for that case is server-side: accept thread/start { resumeId }, fall back to a fresh thread when the target is missing, and surface a thread/resumeFailed notification instead of a hard error. That is a separate, larger change; this issue is scoped to the rollout-persistence log level only.
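The server-side fallback described above could look roughly like this. It is a standard-library sketch only: `StartOutcome`, `start_thread`, the in-memory `HashSet` of known threads, and the notification string format are all hypothetical, and only the `thread/resumeFailed` name comes from the proposal.

```rust
use std::collections::HashSet;

// Hypothetical result of handling `thread/start { resumeId }`.
#[derive(Debug, PartialEq, Eq)]
pub enum StartOutcome {
    // The requested thread still exists and is resumed.
    Resumed { thread_id: String },
    // A fresh thread was started; `notification` is set when a resume
    // was requested but the target was missing.
    Fresh { thread_id: String, notification: Option<String> },
}

// `known_threads` stands in for the thread store; `fresh_id` is the id the
// server would allocate for a new thread.
pub fn start_thread(
    known_threads: &HashSet<String>,
    resume_id: Option<&str>,
    fresh_id: &str,
) -> StartOutcome {
    match resume_id {
        Some(id) if known_threads.contains(id) => StartOutcome::Resumed {
            thread_id: id.to_string(),
        },
        // Stale id: fall back to a fresh thread and notify, not a hard error.
        Some(id) => StartOutcome::Fresh {
            thread_id: fresh_id.to_string(),
            notification: Some(format!("thread/resumeFailed: {id}")),
        },
        None => StartOutcome::Fresh {
            thread_id: fresh_id.to_string(),
            notification: None,
        },
    }
}
```

The point of the design is that a stale id degrades to a new thread plus a notification, so the client never has to pattern-match an error string.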

Reference implementation

A working implementation lives on the team-wcv fork:

Only codex-rs/core/src/session/mod.rs changes (six lines: one log macro swap plus a 5-line comment pointing at `RUST_LOG=codex_core::session=debug`). cargo check -p codex-core and cargo clippy -p codex-core --lib --no-deps are clean.

Per `docs/contributing.md`

External contributions are by invitation only, so this is filed as an enhancement request with reference implementation attached. Happy to open the PR against this repo if the approach is acceptable.
