codex - ✅ (Solved) Remote TUI can remain stale after app-server slow-websocket disconnect [1 pull request, 4 comments, 3 participants]

GitHub stats
openai/codex#18860 (fetched 2026-04-22 07:51:19)
Comments: 4 · Participants: 3 · Timeline: 14 · Reactions: 0
Timeline (top): labeled ×6, commented ×4, unlabeled ×3, cross-referenced ×1

Error Message

WARN codex_app_server::transport
disconnecting slow connection after outbound queue filled: ConnectionId(0)

Root Cause

Per PR #18932, remote app-server clients could stop draining websocket events when their bounded local event channel filled. Once the app-server disconnected the slow connection, the client missed the disconnect and the terminal lifecycle frames (turn/completed and the idle thread/status/changed), so the TUI stayed stuck on a stale in-progress turn.

Fix Action

Fixed

PR fix notes

PR #18932: TUI: Keep remote app-server events draining

Description (problem / solution / changelog)

Addresses #18860

Problem: Remote app-server clients could stop draining websocket events when their bounded local event channel filled, leaving clients stuck on stale in-progress turns after a disconnect.

Solution: Use an unbounded local event channel for the remote client so the websocket reader can keep forwarding disconnect and progress events instead of blocking or dropping them.

Why this is reasonable: This does not make the remote websocket itself unbounded. The changed queue lives inside the remote client, between the task that reads the remote websocket and the API consumer in the same client process. Once an event has been received from the remote server, preserving it is preferable to blocking websocket reads or dropping disconnect/lifecycle events; network-level backpressure still happens at the websocket boundary if the remote side outpaces the client.
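The difference between a bounded and an unbounded local queue can be illustrated with a minimal sketch. It is not the code from remote.rs: it uses std::sync::mpsc as a stand-in for whatever channel type the real client uses, the Event enum and both function names are hypothetical, and the bounded variant models the "dropping" failure mode only (a blocking send would instead stall the websocket reader).

```rust
use std::sync::mpsc::{channel, sync_channel, TrySendError};

/// Hypothetical event type; the real client's event enum is richer.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Progress(u32),
    Disconnected,
}

/// Forward events into a bounded queue that nobody is draining yet,
/// dropping whatever does not fit. Returns what the consumer eventually sees.
pub fn forward_bounded(events: &[Event], capacity: usize) -> Vec<Event> {
    let (tx, rx) = sync_channel::<Event>(capacity);
    for ev in events {
        match tx.try_send(ev.clone()) {
            Ok(()) => {}
            // Queue full: the event is lost, including a trailing Disconnected.
            Err(TrySendError::Full(_)) => {}
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    drop(tx); // close the channel so the iterator below terminates
    rx.iter().collect()
}

/// Same forwarding loop with an unbounded queue: nothing received from the
/// remote side is lost, at the cost of local memory growth.
pub fn forward_unbounded(events: &[Event]) -> Vec<Event> {
    let (tx, rx) = channel::<Event>();
    for ev in events {
        if tx.send(ev.clone()).is_err() {
            break;
        }
    }
    drop(tx);
    rx.iter().collect()
}
```

With three progress events followed by Disconnected and a capacity of 2, the bounded variant delivers only the first two progress events, while the unbounded variant delivers all four, so the consumer still learns about the disconnect.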

Changed files

  • codex-rs/app-server-client/src/remote.rs (modified, +14/-161)


What version of Codex CLI is running?

codex-cli 0.122.0

What subscription do you have?

Pro

Which model were you using?

gpt-5.4

What platform is your computer?

Microsoft Windows NT 10.0.19045.0 x64

What terminal emulator and version are you using (if applicable)?

Windows Terminal, PowerShell 7

What issue are you seeing?

This is related to #18203, which reports the app-server outbound websocket queue disconnect trigger. This issue is specifically about the TUI stale-state/reconciliation failure after that kind of disconnect: the server-side thread can be completed/idle, while the TUI remains in Working state and routes the next prompt as turn/steer against the completed turn.

In the captured reproduction, the TUI websocket connection saw a normal turn start:

seq 53  TUI -> app-server  turn/start   id=6
seq 54  app-server -> TUI  response     id=6 result.turn.status=inProgress
seq 55  app-server -> TUI  thread/status/changed status=active
seq 56  app-server -> TUI  turn/started
seq 57-247 app-server -> TUI item/hook/output frames for that turn

Then app-server stderr reported:

WARN codex_app_server::transport
disconnecting slow connection after outbound queue filled: ConnectionId(0)

This was followed by 65 "dropping message for disconnected connection: ConnectionId(0)" warnings.

The TUI relay log never received the terminal lifecycle frames for that turn:

  • no turn/completed
  • no idle thread/status/changed

At the same time, a separate passive observer against the same app-server reported the authoritative thread state as:

{
  "threadStatusType": "idle",
  "turnCount": 1,
  "latestTurnStatus": "completed",
  "inProgressTurnCount": 0,
  "activeTurnId": null
}

A direct websocket turn/start against the same app-server and same thread then completed successfully:

{
  "directTurnStartStatus": "inProgress",
  "completed": true,
  "notificationCounts": {
    "turn/started": 1,
    "turn/completed": 1,
    "thread/status/changed": 2
  }
}

This suggests the app-server/thread was healthy, while the original TUI connection retained stale active-turn state.

When a later visible prompt was entered into the stale TUI, the TUI sent:

TUI -> app-server turn/steer id=7
expectedTurnId=<previous completed turn id>

No response to that turn/steer request was observed in the TUI relay log.

Evidence chain:

| Claim | Evidence |
| --- | --- |
| TUI started a normal turn | Stale relay metadata: turn/start, response inProgress, active status, turn/started |
| app-server hit outbound websocket backpressure | Transport analysis: one "disconnecting slow connection after outbound queue filled" warning |
| app-server dropped later messages for that disconnected connection | Transport analysis: 65 dropped-message warnings for the same connection id |
| stale TUI did not receive terminal lifecycle frames | Stale relay analysis: zero turn/completed, no idle status delivered |
| server-side thread was actually complete/idle | Passive observer summary: threadStatusType=idle, latestTurnStatus=completed, inProgressTurnCount=0, activeTurnId=null |
| app-server/thread were still capable of work | Direct websocket probe: new turn/start completed with turn/started, turn/completed, and status updates |
| same wrapper/relay path can complete normally | Clean control: two turn/start, two turn/started, two turn/completed, zero turn/steer, zero backpressure events |

What steps can reproduce the bug?

I do not have a minimal upstream-only repro script for the stale-state recovery part. The strongest reproduction used a local remote-TUI wrapper/relay and a turn that produced a burst of output frames large enough to fill the app-server outbound websocket queue.

The queue-fill disconnect trigger itself is already reported with an upstream-only reproduction in #18203. The additional observation here is that after such a disconnect, the TUI can remain stale rather than clearly exiting/reconciling.

The useful maintainer-side repro direction is likely:

  1. Start Codex TUI in remote app-server websocket mode.
  2. Put a slow/throttled websocket client or proxy between the TUI and app-server.
  3. Run a turn that emits many output delta frames.
  4. Observe whether the app-server logs the slow-connection disconnect.
  5. Check whether the TUI exits/reconciles, or instead remains in Working and routes the next prompt as turn/steer for the previous turn id.
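
Steps 2 to 4 depend on the server giving up on a client whose outbound queue is full. The following is a rough simulation of that behavior, not the code in transport/mod.rs: Connection, its fields, and the decision not to count the triggering frame as "dropped" are all assumptions made for illustration, and std::sync::mpsc stands in for the real queue.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical per-connection state on the server side.
pub struct Connection {
    /// `None` once the connection has been disconnected.
    outbound: Option<SyncSender<String>>,
    pub slow_disconnects: usize,
    pub dropped_after_disconnect: usize,
}

impl Connection {
    pub fn new(capacity: usize) -> (Self, Receiver<String>) {
        let (tx, rx) = sync_channel(capacity);
        let conn = Self {
            outbound: Some(tx),
            slow_disconnects: 0,
            dropped_after_disconnect: 0,
        };
        (conn, rx)
    }

    /// Queue a frame without blocking. A full queue disconnects the client;
    /// every later frame is counted as dropped.
    pub fn send(&mut self, frame: &str) {
        match self.outbound.take() {
            None => self.dropped_after_disconnect += 1,
            Some(tx) => match tx.try_send(frame.to_string()) {
                Ok(()) => self.outbound = Some(tx),
                // Slow client: record the disconnect and let `tx` drop.
                Err(TrySendError::Full(_)) => self.slow_disconnects += 1,
                // Reader already went away: stay disconnected.
                Err(TrySendError::Disconnected(_)) => {}
            },
        }
    }
}
```

In this model, sending ten frames into a capacity-3 queue that no one reads yields one slow-connection disconnect and six dropped frames, which mirrors the shape of the one warning followed by 65 dropped-message warnings in the captured run.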

In my captured Worker 06 reproduction, the relevant thread id was:

019db067-8e04-71e0-a0a4-e1106ee75148

The initial stale turn id was:

019db068-4be4-7063-a7dc-55d20ed439dd

The later stale turn/steer used that same completed turn id as expectedTurnId.

What is the expected behavior?

After app-server transport disconnects a slow websocket client, the remote TUI should do one of the following:

  • receive a close/error and exit clearly;
  • reconnect/resume and reconcile with authoritative server thread state;
  • clear stale active-turn/running state before accepting the next user prompt.

It should not continue accepting prompts while still believing a completed turn is active.
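
The second option, reconciling with authoritative server state, could look roughly like the sketch below. The field names follow the observer summary shown earlier (threadStatusType, activeTurnId), but the struct and function names are hypothetical and the rule "running only when the thread is active and has an active turn" is an assumption, not something the issue specifies.

```rust
/// Hypothetical client-side view of the current turn.
#[derive(Debug, Default, PartialEq)]
pub struct LocalTurnState {
    pub active_turn_id: Option<String>,
    pub agent_turn_running: bool,
}

/// Hypothetical snapshot of what the server reports for the thread.
pub struct ServerSnapshot {
    pub thread_status_type: String,
    pub active_turn_id: Option<String>,
}

/// Overwrite local state with the server's view.
/// Returns true when the local state was stale and had to change.
pub fn reconcile(local: &mut LocalTurnState, server: &ServerSnapshot) -> bool {
    let reconciled = LocalTurnState {
        active_turn_id: server.active_turn_id.clone(),
        // Assumption: a turn is running only if the thread is active
        // and the server names an active turn.
        agent_turn_running: server.thread_status_type == "active"
            && server.active_turn_id.is_some(),
    };
    let changed = *local != reconciled;
    *local = reconciled;
    changed
}
```

Applied to the stale run, a local state still holding the completed turn id would be reset to "no active turn, not running" by a snapshot with threadStatusType "idle" and a null activeTurnId.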

Additional information

Mechanical analysis for the stale run:

{
  "frameCount": 248,
  "malformedLineCount": 0,
  "turnStartCount": 1,
  "turnStartedCount": 1,
  "turnCompletedCount": 0,
  "turnSteerCount": 1,
  "staleTurnStateSuspected": true
}

App-server transport analysis for the stale run:

{
  "slowConnectionDisconnectCount": 1,
  "droppedDisconnectedMessageCount": 65,
  "connectionIds": ["0"],
  "backpressureDisconnectObserved": true
}

Clean control under the same wrapper/relay instrumentation:

{
  "turnStartCount": 2,
  "turnStartedCount": 2,
  "turnCompletedCount": 2,
  "turnSteerCount": 0,
  "staleTurnStateSuspected": false,
  "backpressureDisconnectObserved": false
}
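
The reporter's analysis script is not included, so the exact rule behind staleTurnStateSuspected is unknown. One plausible heuristic that is consistent with both summaries above is sketched here; the struct, the function, and the rule itself are assumptions.

```rust
/// Counts taken from a relay log summary (field names mirror the JSON keys).
pub struct RelayCounts {
    pub turn_start_count: u32,
    pub turn_started_count: u32,
    pub turn_completed_count: u32,
    pub turn_steer_count: u32,
}

/// Assumed heuristic: at least one started turn never completed from the
/// client's point of view, and the client later tried to steer a turn.
pub fn stale_turn_state_suspected(counts: &RelayCounts) -> bool {
    counts.turn_started_count > counts.turn_completed_count && counts.turn_steer_count > 0
}
```

Under this rule the stale run (1 started, 0 completed, 1 steer) is flagged and the clean control (2 started, 2 completed, 0 steer) is not.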

Likely source areas:

  • App-server bounded outbound queue and slow-client disconnect:
    • codex-rs/app-server/src/transport/mod.rs
  • Websocket close/EOF propagation:
    • codex-rs/app-server/src/transport/websocket.rs
    • codex-rs/app-server-client/src/remote.rs
  • TUI routing of next input as turn/steer based on cached active turn id:
    • codex-rs/tui/src/app/thread_routing.rs
  • Clearing active turn and visible Working state after turn/completed:
    • codex-rs/tui/src/app/thread_events.rs
    • codex-rs/tui/src/chatwidget.rs

More detailed source links and a claim-to-evidence map are included in the attached redacted evidence package.

Hypothesis:

The trigger is app-server websocket outbound backpressure from a burst of output frames. The app-server intentionally disconnects the slow websocket and drops later messages for that connection. The TUI then misses turn/completed and the idle status frame, leaving both client-side state machines stale:

  • ThreadEventStore.active_turn_id remains set, so the next prompt is routed as turn/steer.
  • ChatWidget.agent_turn_running remains true, so the visible UI can remain in Working / queued-input mode.
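
The hypothesis can be reduced to a small state machine. This is a simplified sketch, not the TUI's actual types: ClientState merges the two fields named above into one hypothetical struct, and ServerEvent and Route are invented for illustration. It shows why a missing turn/completed leaves the next prompt routed as turn/steer, and how handling a disconnect event would clear that state.

```rust
/// Hypothetical subset of events the client receives.
#[derive(Debug, Clone, PartialEq)]
pub enum ServerEvent {
    TurnStarted { turn_id: String },
    TurnCompleted { turn_id: String },
    Disconnected,
}

/// How the next user prompt would be sent.
#[derive(Debug, PartialEq)]
pub enum Route {
    TurnStart,
    TurnSteer { expected_turn_id: String },
}

/// Simplified stand-in for ThreadEventStore.active_turn_id and
/// ChatWidget.agent_turn_running combined.
#[derive(Debug, Default)]
pub struct ClientState {
    pub active_turn_id: Option<String>,
    pub agent_turn_running: bool,
}

impl ClientState {
    fn clear(&mut self) {
        self.active_turn_id = None;
        self.agent_turn_running = false;
    }

    pub fn apply(&mut self, event: &ServerEvent) {
        match event {
            ServerEvent::TurnStarted { turn_id } => {
                self.active_turn_id = Some(turn_id.clone());
                self.agent_turn_running = true;
            }
            ServerEvent::TurnCompleted { turn_id } => {
                // Only clear if the completion matches the cached turn.
                if self.active_turn_id.as_deref() == Some(turn_id.as_str()) {
                    self.clear();
                }
            }
            // A disconnect invalidates everything the client believed.
            ServerEvent::Disconnected => self.clear(),
        }
    }

    pub fn route_next_prompt(&self) -> Route {
        match &self.active_turn_id {
            Some(id) => Route::TurnSteer { expected_turn_id: id.clone() },
            None => Route::TurnStart,
        }
    }
}
```

If only TurnStarted is applied, route_next_prompt returns TurnSteer with the stale id, which matches the captured behavior. Applying Disconnected afterwards returns the state to TurnStart.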

Possible regression tests:

  • TUI/client test: simulate loss of turn/completed after turn/started, then verify the next prompt cannot be silently routed as turn/steer against a completed/non-active turn.
  • Remote-client test: force the websocket read side to receive close/error/EOF after app-server disconnect and verify AppServerEvent::Disconnected reaches the TUI fatal-exit/reconciliation path.
  • App-server transport test: fill a per-connection outbound queue and verify the disconnect behavior is observable by that client, or that terminal lifecycle notifications cannot leave the client in stale state.
  • Protocol test: turn/steer with expectedTurnId for a completed turn should return a response/error that allows the TUI to clear stale state and start a fresh turn.
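
The protocol test in the last bullet presumes the server can reject a steer aimed at a turn that is no longer active. A minimal sketch of such a check follows; the function name and the SteerError variants are hypothetical, since the issue does not describe the actual response shape.

```rust
/// Hypothetical error a server could return for a rejected turn/steer.
#[derive(Debug, PartialEq)]
pub enum SteerError {
    /// The thread has no active turn (for example, the turn already completed).
    NoActiveTurn,
    /// A different turn is active than the one the client expected.
    TurnMismatch { active: String },
}

/// Validate a turn/steer request against the server's current active turn.
pub fn check_steer(
    expected_turn_id: &str,
    active_turn_id: Option<&str>,
) -> Result<(), SteerError> {
    match active_turn_id {
        None => Err(SteerError::NoActiveTurn),
        Some(active) if active == expected_turn_id => Ok(()),
        Some(active) => Err(SteerError::TurnMismatch {
            active: active.to_string(),
        }),
    }
}
```

A client that receives NoActiveTurn could clear its cached turn id and resend the prompt as a fresh turn/start instead of waiting for a response that never arrives.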

Attachment package:

I am attaching codex-stale-tui-evidence-redacted.zip.

It contains:

  • triage-summary.md - one-page maintainer summary.
  • claim-to-evidence-map.md - each claim mapped to exact redacted evidence.
  • worker-run-matrix.md - all supervised worker runs and outcomes.
  • source-analysis.md - upstream source pointers and inferred failure path.
  • redaction-report.md - what was removed from raw evidence.
  • stale-run-relay-analysis.redacted.json
  • stale-run-relay-metadata.jsonl
  • stale-run-transport-events.redacted.json
  • stale-run-observer-summary.redacted.json
  • direct-turn-probe-summary.redacted.json
  • clean-control-relay-analysis.redacted.json
  • clean-control-relay-metadata.jsonl
  • clean-control-transport-analysis.redacted.json
  • screenshots-redacted/*.png

Raw logs are not attached because they contain local paths, prompt text, command transcripts, hook paths, private repository remotes, and unrelated large remote response bodies. The package contains derived redacted metadata, summaries, and aggressively redacted screenshots.

Redacted screenshots can be inlined separately if useful; the protocol evidence in the zip is the primary evidence.

codex-stale-tui-evidence-redacted.zip

extent analysis

TL;DR

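The linked fix (PR #18932) switches the remote client's local event channel from bounded to unbounded so the websocket reader keeps forwarding disconnect and lifecycle events to the TUI. The issue additionally suggests that the TUI clear stale active-turn state, or reconcile with server state, after a websocket disconnect.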

Guidance

  1. Review websocket close/EOF propagation: Examine codex-rs/app-server/src/transport/websocket.rs and codex-rs/app-server-client/src/remote.rs to ensure that the TUI correctly handles websocket closure and propagates the event to clear stale state.
  2. Implement TUI routing fix: Modify codex-rs/tui/src/app/thread_routing.rs to prevent routing the next input as turn/steer based on a cached active turn id when the turn is completed.
  3. Verify turn/completed handling: Check codex-rs/tui/src/app/thread_events.rs and codex-rs/tui/src/chatwidget.rs to ensure that the TUI correctly handles the turn/completed event and clears the active turn state.
  4. Test with simulated disconnect: Test the TUI with a simulated websocket disconnect to verify that it correctly recovers and does not remain in a stale state.

Example

No specific code example is provided, as the issue requires a thorough review of the codebase and modifications to multiple components.

Notes

The provided evidence package and redacted screenshots may be useful in further debugging and testing the issue. Additionally, the suggested regression tests can help ensure that the fix does not introduce new issues.

Recommendation

Pick up the change from PR #18932, which keeps the remote client draining websocket events so that a disconnect reaches the TUI instead of leaving it stale. The disconnect trigger itself (the app-server outbound queue filling for a slow client) is reported separately in #18203.
