codex: /goal can make Desktop threads unrecoverable by growing rollout JSONL beyond resume/list limits


Summary

A long-running /goal workflow can create a rollout JSONL that Codex Desktop can no longer safely resume or list, even though the workflow is behaving as designed.

In this case, the thread was running normally under Codex Desktop with WSL enabled. The user paused the active goal to install a new Desktop update, and the updated app then could not resume the same thread because the local rollout had grown to about 3.06 GB. Small threads still loaded. Opening the large goal thread repeatedly crashed the app-server, and the UI fell back to "Codex app-server is not available" errors.

This is not just a generic "large file" problem. /goal encourages exactly this usage pattern: an agent runs for a long time, loops through work, compacts context, spawns/reviews sub-work, and keeps appending local history. The model-visible context is bounded by compaction, but the persisted rollout file is not bounded or rotated.

Environment

  • Codex Desktop: Windows Store app, observed after update to package family OpenAI.Codex_26.527.x_x64__2p2nqsd0c76g0
  • App-server: WSL/Linux Codex runtime launched by Desktop
  • Platform: Windows 11 x64 with Codex Desktop configured to run the agent in WSL
  • Subscription/model: ChatGPT-authenticated Codex Desktop, high-reasoning model
  • Local state paths, user names, and project names are intentionally redacted

Reproduction shape

  1. Use Codex Desktop on Windows with WSL-backed execution.
  2. Start a long-running /goal workflow that performs repeated local work, review loops, sub-agent/tool loops, and iterative improvement passes.
  3. Let the goal run for days, including many turns and multiple context compactions.
  4. Pause the thread to install a Desktop update.
  5. Reopen Codex Desktop and click/resume the same large thread.
  6. The app-server crashes or becomes unavailable while hydrating/resuming the thread; a smaller thread still loads successfully.

Local evidence from the affected profile

The affected rollout file:

~/.codex/sessions/2026/05/19/rollout-<timestamp>-<thread-id>.jsonl
size: 3,063,881,398 bytes

A rescue rollout built from the latest compacted checkpoint plus subsequent records was much smaller:

~/.codex/rescue_rollouts/rollout-rescue-<timestamp>-<thread-id>.jsonl
size: 20,737,905 bytes
lines: 382
records: session_meta=1, compacted=1, turn_context=1, event_msg=140, response_item=239
payloads included: thread_goal_updated=61, function_call=82, function_call_output=82, token_count=53, context_compacted=1

The rescue was generated only after a full backup, and only by extracting the latest compaction boundary and following records. That is not a reasonable normal recovery path for a product feature.
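For readers who need to do the same extraction, the following is a minimal sketch of the approach described above, not the tool that was actually used. It assumes that a compaction record can be recognized by the literal text "type":"compacted" on its line and that the first line of the file is the session_meta record; both are assumptions about the on-disk schema and should be checked against a real rollout. The function names are hypothetical. Run it only against a backup copy.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Seek, SeekFrom, Write};
use std::path::Path;

/// Assumed marker for a compaction record; adjust to the real schema.
const COMPACTED_MARKER: &[u8] = br#""type":"compacted""#;

/// Byte-wise substring search (the needle is never empty here).
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|w| w == needle)
}

/// Pass 1: stream the file line by line and return the byte offset of the
/// last line containing the marker, or None if there is no such line.
fn last_compaction_offset(path: &Path) -> io::Result<Option<u64>> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut line = Vec::new();
    let mut offset = 0u64;
    let mut last = None;
    loop {
        line.clear();
        let n = reader.read_until(b'\n', &mut line)?;
        if n == 0 {
            break;
        }
        if contains(&line, COMPACTED_MARKER) {
            last = Some(offset);
        }
        offset += n as u64;
    }
    Ok(last)
}

/// Pass 2: write the first line (assumed to be session_meta) followed by
/// everything from the last compaction record onward. Returns bytes written.
fn write_rescue(src: &Path, dst: &Path) -> io::Result<u64> {
    let start = last_compaction_offset(src)?
        .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no compaction record"))?;
    let mut reader = BufReader::new(File::open(src)?);
    let mut out = BufWriter::new(File::create(dst)?);
    let mut written = 0u64;
    if start > 0 {
        // Keep the header line unless the compaction record is itself line 1.
        let mut first = Vec::new();
        written += reader.read_until(b'\n', &mut first)? as u64;
        out.write_all(&first)?;
    }
    reader.seek(SeekFrom::Start(start))?;
    written += io::copy(&mut reader, &mut out)?;
    out.flush()?;
    Ok(written)
}
```

Both passes hold at most one line in memory, so the 3 GB source is never loaded whole. The sketch does not update any thread metadata; pointing the app at the rescue file remains a separate manual step.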

Desktop logs around the failing resume showed the large thread pending thread/resume and thread/goal/get, then the app-server becoming unavailable:

app_server_connection.closed code=9 ... transport=stdio
fatal_error_broadcasted ... (code=9, signal=null)
Request failed ... method=thread/resume ... error={"code":-32000,"message":"Codex app-server is not available"}
Request failed ... method=thread/goal/get ... error={"code":-32000,"message":"Codex app-server is not available"}
app_server_restart_recovery_failed ... errorMessage="Codex app-server is not available"

A small-thread control test in the same app and profile loaded successfully, so the failure correlates with the large rollout/resume path rather than with Desktop startup in general.

Source-level RCA

Current origin/main at 00ca857d3ff6883b7334292d887601344e1bd029 still has full-file rollout hydration in key resume/list paths.

codex-rs/rollout/src/recorder.rs reads the whole rollout into one string, then parses all lines into a Vec:

pub async fn load_rollout_items(
    path: &Path,
) -> std::io::Result<(Vec<RolloutItem>, Option<ThreadId>, usize)> {
    trace!("Resuming rollout from {path:?}");
    let text = tokio::fs::read_to_string(path).await?;
    ...
    let mut items: Vec<RolloutItem> = Vec::new();
    for line in text.lines() {
        ... serde_json::from_str(line) ...

codex-rs/thread-store/src/local/read_thread.rs attaches history by loading the full rollout when include_history=true:

let items = load_history_items(&path).await?;
thread.history = Some(StoredThreadHistory { thread_id, items });

load_history_items() delegates back to the full-file rollout loader:

let (items, _, _) = RolloutRecorder::load_rollout_items(path).await?;

codex-rs/app-server/src/request_processors/thread_processor.rs resumes stored threads with history included:

.read_stored_thread_for_resume(thread_id, path, /*include_history*/ true)

The thread/turns/list path also has a source comment acknowledging the scalability problem:

// This API optimizes network transfer by letting clients page through a
// thread's turns incrementally, but it still replays the entire rollout on
// every request. Rollback and compaction events can change earlier turns, so
// the server has to rebuild the full turn list until turn metadata is indexed
// separately.

That design makes a 3 GB goal-created rollout inherently unsafe to hydrate. Even if context compaction keeps model context manageable, the local JSONL keeps growing and later resume/list paths still eagerly read and replay it.
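To illustrate the alternative, here is a minimal std-only sketch of reading a JSONL file one line at a time with explicit caps, so that peak memory is bounded by the longest permitted line rather than by file size. It is not the Codex implementation: the Caps struct, the for_each_line function, and the cap values are hypothetical, and JSON parsing is left to the callback (the real loader uses serde_json, as shown in the excerpt above).

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Read};
use std::path::Path;

/// Hypothetical caps; suitable values would need tuning.
pub struct Caps {
    pub max_line_bytes: u64,
    pub max_total_bytes: u64,
}

/// Streams a JSONL file line by line and hands each non-empty line to
/// `on_line`. Returns the number of lines delivered, or an error as soon as
/// a single line or the running total exceeds its cap.
pub fn for_each_line<F>(path: &Path, caps: &Caps, mut on_line: F) -> io::Result<usize>
where
    F: FnMut(&[u8]),
{
    let mut reader = BufReader::new(File::open(path)?);
    let mut line = Vec::new();
    let mut total = 0u64;
    let mut count = 0usize;
    loop {
        line.clear();
        // Read at most cap + 1 bytes so an oversized line is detected
        // without buffering all of it.
        let n = reader
            .by_ref()
            .take(caps.max_line_bytes.saturating_add(1))
            .read_until(b'\n', &mut line)? as u64;
        if n == 0 {
            return Ok(count);
        }
        if n > caps.max_line_bytes {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "line exceeds cap"));
        }
        total += n;
        if total > caps.max_total_bytes {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "rollout exceeds byte cap",
            ));
        }
        let trimmed = line.strip_suffix(b"\n").unwrap_or(&line[..]);
        if !trimmed.is_empty() {
            on_line(trimmed);
            count += 1;
        }
    }
}
```

A caller could treat the cap error as a per-thread failure and surface it in the UI instead of letting the whole process run out of memory. Whether hitting the total cap should abort or fall back to reading from the latest compaction record is a design choice this sketch leaves open.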

Expected behavior

A /goal thread should remain recoverable after an app update if it was running normally before the update.

At minimum:

  • Long-running /goal workflows should not create local session files that the product cannot reopen.
  • The app should not need to read a multi-GB rollout into one string to resume a thread or list turns.
  • If a thread is too large, Codex should show a recoverable per-thread error and offer a built-in compact/export/rescue path.
  • Goal state should be exportable/restorable from the latest compacted checkpoint without manual SQLite edits or JSONL surgery.
  • Auto-update should preflight active goal/session size and warn or create a safe checkpoint before replacing the running app.

Actual behavior

  • The active goal thread became unrecoverable from Desktop after updating.
  • Resuming the large thread made the app-server unavailable while smaller threads loaded.
  • Recovering the work required unsupported manual steps: backing up the profile, finding the latest compaction boundary in a 3 GB JSONL, creating a reduced rescue rollout, and redirecting local thread metadata.
  • The recovered handoff is necessarily lower fidelity than a normal /goal resume because old pre-compaction records had to be dropped to keep the app usable.

Suggested fixes

  1. Stop using read_to_string for rollout hydration on resume/list paths. Stream parse JSONL, enforce record/byte caps, and avoid building a full Vec<RolloutItem> unless explicitly exporting.
  2. Index turn metadata separately so thread/turns/list can page without replaying the full rollout on every request.
  3. Make context compaction also produce a storage checkpoint. After a successful compaction, Codex should be able to prune or archive pre-compaction records while preserving an exportable full archive.
  4. Add rollout rotation or hard warnings for /goal sessions. Example thresholds: 250 MB warning, 500 MB danger, 1 GB force checkpoint/export path.
  5. Add a built-in "recover from latest compaction" command that creates a continuation thread or handoff without requiring direct DB edits.
  6. During Desktop update, detect active goal threads and write a verified resume checkpoint before replacing the app/runtime.
  7. Isolate per-thread failures. One oversized thread should not make the whole app-server or WSL-backed Desktop unusable.
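The example thresholds in item 4 could be expressed as a small size check that runs before resume or before an update is applied. This is only a sketch: the RolloutHealth enum and both function names are hypothetical, and it assumes decimal megabytes because the thresholds above do not say whether MB or MiB is meant.

```rust
use std::io;
use std::path::Path;

/// Hypothetical health levels matching the example thresholds above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum RolloutHealth {
    Ok,
    Warning,
    Danger,
    ForceCheckpoint,
}

/// Decimal megabyte (assumption; MiB would shift the boundaries slightly).
const MB: u64 = 1_000_000;

/// Maps a rollout size in bytes to a health level:
/// 250 MB warning, 500 MB danger, 1 GB force checkpoint/export.
pub fn classify_rollout_size(bytes: u64) -> RolloutHealth {
    if bytes >= 1_000 * MB {
        RolloutHealth::ForceCheckpoint
    } else if bytes >= 500 * MB {
        RolloutHealth::Danger
    } else if bytes >= 250 * MB {
        RolloutHealth::Warning
    } else {
        RolloutHealth::Ok
    }
}

/// Reads only the file metadata, so the check is cheap even for a
/// multi-GB rollout.
pub fn rollout_health(path: &Path) -> io::Result<RolloutHealth> {
    Ok(classify_rollout_size(std::fs::metadata(path)?.len()))
}
```

Under these thresholds the 3,063,881,398-byte rollout from this report falls in the force-checkpoint band, while the 20,737,905-byte rescue file is well below the warning level.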

Related issues

  • #22004: Desktop main-process crash when rollout JSONL exceeds V8 max string length
  • #22991: app freezes with very large rollout/history JSONL
  • #21134: long active thread caused multi-GB app-server memory footprint
  • #22411: app-server loads/deserializes all session files for thread/list
  • #24510: long-running goal sessions emit many goal/progress events and stress local history/list paths
  • #24544: long sessions break /goal workflows via compaction failures
  • #21291: /goal and compaction behavior issue
  • #23340: /goal long-running loop caused runaway log growth
  • #23777: separate Windows Desktop WSL update failure that triggered this recovery attempt
  • #23053: update prompt should surface target versions/environment impact before users accept risky updates

Privacy note

I cannot attach the raw 3 GB rollout because it contains private local conversation history and project content. The sizes, record counts, method names, source paths, and stack-level RCA above are sanitized.
