codex - ✅ (Solved) Responses streams need lifecycle diagnostics for close, timeout, and partial-output failures [1 pull request, 1 participant]

GitHub stats
openai/codex#19745 · Fetched 2026-04-28 06:37:55
Comments: 0
Participants: 1
Timeline: 3
Reactions: 0
Timeline (top): labeled ×2, cross-referenced ×1

Error Message

  • terminal stream state: completed, closed before completion, idle timeout, or stream error
  1. close before response.completed: preserves the ordinary error and records the last observed event kind

Root Cause

These cases need different follow-up behavior. Some are plain stream failures, some are partial-output failures, some are retry/resume candidates, and some may indicate a transport-specific issue.

Without lifecycle diagnostics, downstream retry/fallback bugs are harder to debug and issue reports have to rely on private logs or manual transcript forensics.

Fix Action

Fixed

PR fix notes

PR #19755: Add Responses stream lifecycle diagnostics

Description (problem / solution / changelog)

Why

Refs #19745.

Responses stream failures are currently hard to diagnose because the client often reports only the terminal transport error. That makes it difficult to tell whether a stream was silent from the start, closed after response.created, stalled before durable output, stalled after text began, or completed with inconsistent response IDs.

This adds diagnostics for those lifecycle boundaries without changing retry budgets, idle timeouts, fallback policy, model behavior, or app-server protocol shape.

What changed

  • Added a private ResponseStreamLifecycleRecorder in codex-api that tracks stream attempt, transport, terminal state, response IDs, first and last event timing, output milestones, observed event kinds, and event count.
  • Wired lifecycle capture into both Responses HTTP SSE and Responses WebSocket streams, including warning logs plus appended ApiError::Stream details for failed streams and completed streams with mismatched response IDs.
  • Updated internal ResponseEvent::Created parsing to carry the created response ID so lifecycle capture uses parsed event data instead of reparsing raw JSON.
  • Threaded stream attempt numbers from the core retry loop into ModelClientSession, with the existing WebSocket-to-HTTP fallback reset preserving fresh attempt numbering for the fallback transport.
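
The recorder described above can be pictured with a minimal Python sketch. This is not the actual ResponseStreamLifecycleRecorder from codex-api (which is Rust and private); the class name `LifecycleRecorder`, the helper `append_details`, the field names, and the terminal-state spelling `closed_before_completion` are all assumptions made for illustration. It shows only the general idea of tracking a few lifecycle fields and appending them to an existing error message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LifecycleRecorder:
    """Hypothetical recorder; not the real ResponseStreamLifecycleRecorder."""
    attempt: int
    transport: str
    created_id: Optional[str] = None
    completed_id: Optional[str] = None
    last_event_kind: Optional[str] = None
    event_count: int = 0
    terminal_state: Optional[str] = None

    def observe(self, kind: str, response_id: Optional[str] = None) -> None:
        # Count every event and remember the most recent kind.
        self.event_count += 1
        self.last_event_kind = kind
        if kind == "response.created":
            self.created_id = response_id
        elif kind == "response.completed":
            self.completed_id = response_id

    def finish(self, terminal_state: str) -> None:
        # Record how the stream ended (assumed spellings, e.g. "idle_timeout").
        self.terminal_state = terminal_state

    def ids_mismatch(self) -> bool:
        # True only when both IDs were seen and they differ.
        return (
            self.created_id is not None
            and self.completed_id is not None
            and self.created_id != self.completed_id
        )

    def describe(self) -> str:
        return (
            f"attempt={self.attempt} transport={self.transport} "
            f"terminal_state={self.terminal_state} "
            f"last_event_kind={self.last_event_kind} "
            f"event_count={self.event_count} "
            f"created_id={self.created_id} completed_id={self.completed_id}"
        )


def append_details(error_message: str, recorder: LifecycleRecorder) -> str:
    # Keep the ordinary error text first, then append the lifecycle details.
    return f"{error_message} [{recorder.describe()}]"
```

In this sketch the original error text is left untouched and the diagnostics are appended after it, which mirrors the PR's stated intent of adding details without changing retry or fallback behavior.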

Verification

  • cargo check -p codex-api -p codex-core
  • cargo test -p codex-api
  • cargo test -p codex-core responses_websocket_v2_surfaces_terminal_error_without_close_handshake
  • cargo test -p codex-otel

Changed files

  • codex-rs/codex-api/src/common.rs (modified, +3/-1)
  • codex-rs/codex-api/src/endpoint/responses.rs (modified, +32/-1)
  • codex-rs/codex-api/src/endpoint/responses_websocket.rs (modified, +57/-8)
  • codex-rs/codex-api/src/lib.rs (modified, +3/-0)
  • codex-rs/codex-api/src/sse/responses.rs (modified, +77/-13)
  • codex-rs/codex-api/src/stream_lifecycle.rs (added, +431/-0)
  • codex-rs/core/src/client.rs (modified, +51/-2)
  • codex-rs/core/src/session/turn.rs (modified, +5/-2)
  • codex-rs/core/src/turn_timing.rs (modified, +1/-1)
  • codex-rs/core/src/turn_timing_tests.rs (modified, +1/-1)
  • codex-rs/core/tests/suite/client.rs (modified, +15/-6)
  • codex-rs/otel/src/events/session_telemetry.rs (modified, +1/-1)
RAW_BUFFER

What version of Codex is running?

Current main as of 2026-04-27.

What issue are you seeing?

When a Responses stream closes, errors, or idles before response.completed, Codex reports the failure but does not expose enough lifecycle evidence to diagnose where the stream died.

Today a stream that never produced meaningful output and a stream that reached response.output_item.added or response.output_item.done can collapse into a similar operator-visible failure shape. That makes it hard to distinguish:

  • provider silence before first event
  • connection loss after response.created
  • an early stall after response.output_item.added
  • a later stall after durable output started
  • close-before-completion after partial assistant output
  • response-id correlation problems between created and completed events
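
One way to tell these cases apart, given the list of observed event kinds, is sketched below. The function `classify_failure` and every label it returns are hypothetical names chosen for this example; only the event kinds come from the issue. The sketch treats `response.output_item.done` and `response.output_text.delta` as the markers of durable output, following the validation cases later in the issue.

```python
from typing import List, Optional

# Event kinds treated here as evidence that durable output had started.
DURABLE_KINDS = {"response.output_item.done", "response.output_text.delta"}


def classify_failure(
    observed_kinds: List[str],
    created_id: Optional[str] = None,
    completed_id: Optional[str] = None,
) -> str:
    """Return a hypothetical label for how far the stream progressed."""
    kinds = set(observed_kinds)
    if "response.completed" in kinds:
        ids_known = created_id is not None and completed_id is not None
        if ids_known and created_id != completed_id:
            return "response_id_mismatch"
        return "completed"
    if not observed_kinds:
        return "silent_before_first_event"
    if kinds & DURABLE_KINDS:
        return "failed_after_durable_output"
    if "response.output_item.added" in kinds:
        return "failed_after_output_item_added"
    if "response.created" in kinds:
        return "failed_after_created"
    return "failed_before_created"
```

The label says only where the stream stopped. Whether it stopped because of a close, an idle timeout, or a stream error would be carried separately as the terminal state.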

Why this matters

These cases need different follow-up behavior. Some are plain stream failures, some are partial-output failures, some are retry/resume candidates, and some may indicate a transport-specific issue.

Without lifecycle diagnostics, downstream retry/fallback bugs are harder to debug and issue reports have to rely on private logs or manual transcript forensics.

Expected behavior

For Responses SSE/WebSocket stream completion and failure paths, Codex should record structured lifecycle evidence such as:

  • request attempt sequence
  • transport path or transport reason
  • created response id
  • completed response id, if any
  • first event elapsed time
  • last event elapsed time
  • last event kind
  • first response.output_item.added elapsed time, if any
  • first response.output_item.done elapsed time, if any
  • first response.output_text.delta elapsed time, if any
  • observed stream event kinds
  • stream event count
  • terminal stream state: completed, closed before completion, idle timeout, or stream error

This would make close-before-completion and idle-timeout reports actionable without changing retry or fallback policy.
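
As a rough illustration, the fields listed above could be held in a single record and serialized for logs. The sketch below is written under assumptions: the names `StreamEvidence`, `record`, and `to_json`, the `_ms` field names, and the choice of milliseconds are all invented for this example and are not taken from Codex.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# Maps milestone event kinds to the (hypothetical) field that stores their first timing.
MILESTONES = {
    "response.output_item.added": "first_output_item_added_ms",
    "response.output_item.done": "first_output_item_done_ms",
    "response.output_text.delta": "first_output_text_delta_ms",
}


@dataclass
class StreamEvidence:
    attempt: int
    transport: str
    created_response_id: Optional[str] = None
    completed_response_id: Optional[str] = None
    first_event_ms: Optional[int] = None
    last_event_ms: Optional[int] = None
    last_event_kind: Optional[str] = None
    first_output_item_added_ms: Optional[int] = None
    first_output_item_done_ms: Optional[int] = None
    first_output_text_delta_ms: Optional[int] = None
    observed_event_kinds: List[str] = field(default_factory=list)
    event_count: int = 0
    terminal_state: Optional[str] = None


def record(evidence: StreamEvidence, kind: str, elapsed_ms: int,
           response_id: Optional[str] = None) -> None:
    """Fold one event into the evidence; elapsed_ms is time since stream start."""
    if evidence.first_event_ms is None:
        evidence.first_event_ms = elapsed_ms
    evidence.last_event_ms = elapsed_ms
    evidence.last_event_kind = kind
    evidence.event_count += 1
    if kind not in evidence.observed_event_kinds:
        evidence.observed_event_kinds.append(kind)
    milestone = MILESTONES.get(kind)
    # Only the first occurrence of each milestone is kept.
    if milestone is not None and getattr(evidence, milestone) is None:
        setattr(evidence, milestone, elapsed_ms)
    if kind == "response.created":
        evidence.created_response_id = response_id
    elif kind == "response.completed":
        evidence.completed_response_id = response_id


def to_json(evidence: StreamEvidence) -> str:
    return json.dumps(asdict(evidence), sort_keys=True)
```

Passing the elapsed time in explicitly, rather than reading a clock inside `record`, keeps the sketch deterministic and easy to exercise with fixed timings.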

Non-goals

This issue is not asking Codex to:

  • shorten the default stream idle timeout
  • switch to HTTP fallback after a particular event pattern
  • change retry budgets
  • change model behavior

The ask is diagnostics only: expose enough stream lifecycle evidence that the correct retry/fallback policy can be reasoned about separately.

Minimal validation shape

A useful test suite would cover:

  1. completed stream: records created/completed response id and terminal state completed
  2. close before response.completed: preserves the ordinary error and records the last observed event kind
  3. idle timeout before any event: records no first event and terminal state idle_timeout
  4. idle timeout after response.output_item.added: records the first output-item-added timing and terminal state idle_timeout
  5. idle timeout after durable output: records first output-item-done and/or first text-delta timing separately from the early-output case
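
The five cases above might be exercised along the following lines. This is a sketch only: `summarize` is a hypothetical stand-in for whatever recorder is under test, the timings are made up, and the terminal-state spelling `closed_before_completion` is an assumption. Preservation of the ordinary error in case 2 is not modeled here; only the recorded last event kind is checked.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, int, Optional[str]]  # (kind, elapsed_ms, response_id)

FIRSTS = {
    "response.output_item.added": "first_added_ms",
    "response.output_item.done": "first_done_ms",
    "response.output_text.delta": "first_delta_ms",
}


def summarize(events: List[Event], terminal_state: str) -> Dict[str, object]:
    """Hypothetical stand-in for the recorder under test."""
    summary: Dict[str, object] = {
        "terminal_state": terminal_state, "first_event_ms": None,
        "last_event_kind": None, "created_id": None, "completed_id": None,
        "first_added_ms": None, "first_done_ms": None, "first_delta_ms": None,
    }
    for kind, elapsed_ms, response_id in events:
        if summary["first_event_ms"] is None:
            summary["first_event_ms"] = elapsed_ms
        summary["last_event_kind"] = kind
        key = FIRSTS.get(kind)
        if key is not None and summary[key] is None:
            summary[key] = elapsed_ms
        if kind == "response.created":
            summary["created_id"] = response_id
        elif kind == "response.completed":
            summary["completed_id"] = response_id
    return summary


def test_completed_stream() -> None:
    s = summarize([("response.created", 10, "r1"),
                   ("response.completed", 90, "r1")], "completed")
    assert s["created_id"] == s["completed_id"] == "r1"
    assert s["terminal_state"] == "completed"


def test_close_before_completed() -> None:
    s = summarize([("response.created", 10, "r1"),
                   ("response.output_text.delta", 50, None)],
                  "closed_before_completion")
    assert s["last_event_kind"] == "response.output_text.delta"
    assert s["completed_id"] is None


def test_idle_timeout_before_any_event() -> None:
    s = summarize([], "idle_timeout")
    assert s["first_event_ms"] is None
    assert s["terminal_state"] == "idle_timeout"


def test_idle_timeout_after_output_item_added() -> None:
    s = summarize([("response.created", 10, "r1"),
                   ("response.output_item.added", 40, None)], "idle_timeout")
    assert s["first_added_ms"] == 40
    assert s["first_done_ms"] is None and s["first_delta_ms"] is None


def test_idle_timeout_after_durable_output() -> None:
    s = summarize([("response.created", 10, "r1"),
                   ("response.output_item.added", 40, None),
                   ("response.output_text.delta", 70, None),
                   ("response.output_item.done", 95, None)], "idle_timeout")
    assert (s["first_added_ms"], s["first_delta_ms"], s["first_done_ms"]) == (40, 70, 95)
```

The functions follow the `test_` naming convention, so a runner such as pytest would pick them up if they were placed in a test module.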

extent analysis

TL;DR

To address the issue, Codex should record and expose structured lifecycle evidence for Responses SSE/WebSocket stream completion and failure paths.

Guidance

  • Review the current implementation of stream lifecycle tracking in Codex to identify gaps in recording and exposing necessary diagnostics.
  • Modify the stream handling logic to capture and store relevant lifecycle events, such as request attempt sequence, transport path, and event timings.
  • Implement a mechanism to correlate response IDs between created and completed events to facilitate accurate diagnostics.
  • Develop a test suite covering various stream scenarios, including completed streams, close before completion, and idle timeouts, to validate the diagnostics.

Example

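# Minimal sketch of recording lifecycle evidence (illustrative, not the Codex implementation)
from typing import Any, Dict, List, Optional


class StreamLifecycle:
    def __init__(self) -> None:
        self.events: List[str] = []                # event kinds, in arrival order
        self.terminal_state: Optional[str] = None  # e.g. 'completed' or 'idle_timeout'

    def record_event(self, event: str) -> None:
        """Append one observed stream event kind."""
        self.events.append(event)

    def set_terminal_state(self, state: str) -> None:
        """Record how the stream ended."""
        self.terminal_state = state

    def get_lifecycle_evidence(self) -> Dict[str, Any]:
        """Return a snapshot suitable for logs or error details."""
        return {
            'events': list(self.events),  # copy so callers cannot mutate internal state
            'terminal_state': self.terminal_state,
            'event_count': len(self.events),
            'last_event_kind': self.events[-1] if self.events else None,
            # Add other relevant lifecycle data (timings, response IDs) here
        }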

Notes

The provided guidance assumes that the necessary data is available within the Codex system and that the modifications can be made without significant changes to the existing architecture.

Recommendation

Apply workaround: Implement the suggested modifications to record and expose lifecycle evidence. This supplies the diagnostics needed to reason about retry and fallback policy without changing the underlying stream handling logic.
