Codex App: false-positive safety-risk stops interrupt benign browser QA and benchmark sessions


What version of the Codex App are you using (From “About Codex” dialog)?

Observed in Codex Desktop session logs across:

  • 0.118.0-alpha.2
  • 0.130.0-alpha.5

Current session log reports 0.130.0-alpha.5.

What subscription do you have?

Not available from local session logs.

What platform is your computer?

Darwin 25.4.0 arm64 arm

What issue are you seeing?

Codex App conversations are being stopped by a safety-risk gate during benign, user-authorized software engineering work. The local JSONL session logs do not expose the internal classifier verdict, so the evidence below is based on the user-visible symptom pattern:

  • an in-progress turn ends with task_complete but an empty last_agent_message;
  • the user immediately reports that the thread was stopped or flagged;
  • the surrounding task is ordinary repo-local QA, benchmark, browser-proof, model-provider telemetry, or dashboard work;
  • no credentials are printed, no auth bypass is requested, no payment or irreversible action is completed, and the workflows include explicit stop boundaries.
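
The symptom pattern above can be checked mechanically. The following is a minimal Python sketch, not the actual Codex log format: it assumes each JSONL line is an object whose event carries a `type` key and a `last_agent_message` key, either at the top level or nested under a `payload` key, plus an optional `timestamp`. Only the names task_complete and last_agent_message come from this report; everything else is an assumption to adjust against the real files.

```python
import json
from typing import Iterable, Iterator


def find_empty_completions(lines: Iterable[str]) -> Iterator[dict]:
    """Yield task_complete events whose last_agent_message is missing or blank.

    Assumed (hypothetical) schema: one JSON object per line, with the event
    either at the top level or nested under a "payload" key.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or corrupt lines rather than abort the scan
        if not isinstance(record, dict):
            continue
        payload = record.get("payload")
        event = payload if isinstance(payload, dict) else record
        if event.get("type") != "task_complete":
            continue
        message = event.get("last_agent_message")
        if message is None or (isinstance(message, str) and not message.strip()):
            # Report where the empty completion sits so it can be correlated
            # with the user message that follows it.
            yield {"line": line_number, "timestamp": record.get("timestamp")}
```

To scan a file, pass an open file handle, for example `with open(path, encoding="utf-8") as f: hits = list(find_empty_completions(f))`, where `path` is whichever local session log you want to inspect.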

Uploaded thread reference from the issue URL:

Uploaded thread: 0ce8e98a-7678-47c4-997b-ef2a9209d861

What steps can reproduce the bug?

I found the following instances in local Codex session logs. Session IDs are included so the uploaded thread/log bundle can be correlated server-side.

| Time (UTC) | Session | Evidence in JSONL | Task context | Why the stop was wrong |
| --- | --- | --- | --- | --- |
| 2026-05-01 17:30-17:33 | 019d3aa6-6fbe-7543-8abc-2a48fa2cd0b9 | Empty task_complete at 17:30:02 and 17:32:36, followed by user: “you keep triggering a ... flag that stops the conversation” | User-authorized property-search portal research. The agent was waiting for pages loaded by the user to settle, then planning passive DOM structure capture into minimized JSON fixtures. | This was QA/instrumentation for the user’s own repo. It was not auth bypass, scraping at scale, or evasion. The safe path was explicitly passive capture from a user-loaded page, with no raw page storage. |
| 2026-05-11 00:01-00:07 | 019e1403-b2f7-78f1-97e8-ae61596db57d | Empty task_complete at 00:01:59 and 00:02:14, followed by user asking why the chat was flagged | Rapid Web Agent repo work. The agent was adding local synthetic browser-runtime tests: cookie banner fixture, blocker page fixture, redaction checks, and needs-input artifacts. | This was mostly local fixture testing. The code added human-intervention and redaction safeguards; it did not target third-party systems or request misuse. |
| 2026-05-11 08:54-09:08 | 019e1634-7321-7ee1-ac99-586344009604 | Empty task_complete at 08:54:26; user asks why flagged. Another empty task_complete at 09:08:49; user says the classifier prevented continuation. | Rapid Web Agent acceptance work across retail, automotive, property, and lead-form flows. The task used dummy data only and stopped before login, account creation, payment, or irreversible submission. | This is legitimate QA automation with explicit non-destructive boundaries and acceptance proof. The classifier appears to overweight “web agent”, “real sites”, “forms”, and “checkout” without honoring the repo context and hard stops. |
| 2026-05-11 09:43-09:49 | 019e1634-7312-7ee1-ac99-586344009604 / same thread family | User reports that both the original and a cleaned continuation prompt were still flagged. | The agent tried to provide a safer continuation prompt for the same repo task, including guardrails like no authentication, no accounts, no payment, and dummy data. | Guardrail language should reduce risk, but here it seems to have increased the chance of a false positive by repeating words like validation, proof, dummy data, and forbidden actions. |
| 2026-05-11 12:17-12:25 | 019e166c-16b8-7842-811f-7955c141504c | Empty task_complete at 12:17:24 and 12:19:06, followed by user asking why it was flagged | Rapid Web Agent PRD implementation: OpenRouter model gateway, role telemetry, dummy persona, hard blockers, and artifacts such as models.jsonl, agent-loop.json, and observability logs. | This was repo-local implementation and telemetry. The hard blockers were designed to prevent unsafe behavior; nothing requested credential theft, unauthorized access, destructive actions, or payment. |
| 2026-05-11 18:39 | 019e1716-73fc-73f2-bac6-625bc32b37e8 | Empty task_complete at 18:39:27. The previous assistant messages were about classifying repeated 403/site-protection pages and adding benchmark tests; user immediately said the previous task did not finish because the classifier tripped. | Production-readiness benchmark work. The agent was distinguishing a harness bug from a repeated 403/site-protection failure class, then adding tests and dashboard aggregation. | Classifying a QA failure mode is defensive reliability engineering, not misuse. The task was to avoid mislabeling failures and to surface honest benchmark stats. |
| 2026-05-11 18:52 | 019e1716-73fc-73f2-bac6-625bc32b37e8 | Empty task_complete at 18:52:02 after the user asked for free OpenRouter model-role benchmarking and dashboard visibility | The agent was building a model-role scorecard: pass rate, latency, token use, provider failures, evaluator disagreement, and dashboard explanation. It also used the project’s own secret resolver and did not print keys. | This is normal LLM benchmarking and product telemetry. It should not be blocked as a security-risk task. |

Concrete reproduction shape:

  1. In Codex App, work in an owned repo that performs browser QA/acceptance testing.
  2. Ask Codex to run user-authorized browser proof against public product pages with guardrails: dummy data only, no login, no account creation, no payment, no irreversible submit.
  3. Ask Codex to classify failures such as 403/site-protection pages separately from selector bugs and surface the stats in a dashboard.
  4. The session may terminate mid-turn with an empty completion event instead of allowing Codex to continue or ask for clarification.
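
Step 3 can be sketched as a small labeling function. This is an illustrative Python sketch, not the affected repo's harness: `RunResult`, its fields, and the label names are all hypothetical, and treating HTTP 429 as the rate-limit signal alongside 403 is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RunResult:
    """One benchmark run; every field name here is hypothetical."""
    url: str
    http_status: Optional[int] = None  # status of the main document, if known
    error: Optional[str] = None        # harness error text, if any


def classify(result: RunResult) -> str:
    """Return a coarse label so blocked pages are not counted as harness bugs."""
    if result.http_status in (403, 429):
        # The site refused or throttled the request: record it and stop.
        return "site_protection"
    if result.error and "selector" in result.error.lower():
        # The page loaded, but the harness could not find an element.
        return "selector_bug"
    if result.error:
        return "harness_error"
    return "pass"


def summarize(results: Iterable[RunResult]) -> Counter:
    """Aggregate labels into counts suitable for a dashboard."""
    return Counter(classify(r) for r in results)
```

Checking the status code before the error text is deliberate: a missing selector on a 403 page is a consequence of the block, so it is labeled as site protection, not as a selector bug.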

What is the expected behavior?

Codex should not stop benign software-engineering sessions just because the repo uses browser automation, acceptance proof, model-role benchmarking, or failure classification language.

Expected handling:

  • Use repo and conversation context, including explicit user authorization and hard stop boundaries.
  • Treat guardrails such as “no login”, “no accounts”, “no payment”, and “dummy data only” as risk-reducing signals, not risk-increasing signals.
  • Allow defensive QA labels for site-protection or rate-limit failures in benchmark dashboards.
  • If the classifier is uncertain, ask a targeted clarification or downgrade capabilities for that turn rather than killing the session.
  • Preserve a user-visible error reason in the JSONL/logs instead of only an empty task_complete marker.
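
To make the last point concrete, here is a hedged Python sketch of what a completion event with a preserved reason could look like. Nothing here is Codex's actual schema: only the names task_complete and last_agent_message appear in this report, while the `type` key, `stop_reason`, `user_message`, and the `"safety_gate"` value are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TaskComplete:
    """Hypothetical event shape for illustration only."""
    last_agent_message: Optional[str]
    stop_reason: Optional[str] = None   # assumed field, e.g. "safety_gate"
    user_message: Optional[str] = None  # assumed field: text shown to the user
    type: str = "task_complete"


def to_jsonl(event: TaskComplete) -> str:
    """Serialize one event, refusing empty completions that carry no reason."""
    if not event.last_agent_message and not event.stop_reason:
        raise ValueError("an empty completion must carry a stop_reason")
    return json.dumps(asdict(event), sort_keys=True)
```

With a rule like this, an interrupted turn would leave a line that both the user and a log scan can interpret, instead of an empty marker.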

Additional information

The repeated false positives caused substantial duplicate work: the user had to start new threads and re-prompt with less precise language. Worse, safer guardrail language appears to increase false positives, which encourages users to provide less safety context.

No secrets were included in this issue. Local dotenv discovery in the affected sessions reported only paths/source availability and did not print key values.
