codex - 💡(How to fix) Fix Codex App: false-positive safety-risk stops interrupt benign browser QA and benchmark sessions

StepCodex · 2026-05-12T09:15:40Z





What version of the Codex App are you using (From “About Codex” dialog)?

Observed in Codex Desktop session logs across:

- `0.118.0-alpha.2`
- `0.130.0-alpha.5`

Current session log reports `0.130.0-alpha.5`.

What subscription do you have?

Not available from local session logs.

What platform is your computer?

Darwin 25.4.0 arm64 arm

What issue are you seeing?

Codex App conversations are being stopped by a safety-risk gate during benign, user-authorized software engineering work. The local JSONL session logs do not expose the internal classifier verdict, so the evidence below is based on the user-visible symptom pattern:

- an in-progress turn ends with `task_complete` but an empty `last_agent_message`;
- the user immediately reports that the thread was stopped or flagged;
- the surrounding task is ordinary repo-local QA, benchmark, browser-proof, model-provider telemetry, or dashboard work;
- no credentials are printed, no auth bypass is requested, no payment/action is completed, and the workflows include explicit stop boundaries.
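The first symptom can be checked mechanically. The following is a minimal sketch, not the Codex log format itself: it assumes each JSONL line is one JSON object and that a completion event is a dict carrying `"type": "task_complete"` and a `last_agent_message` key somewhere in the record. The nesting, the `type` key, and the helper names `iter_dicts` and `find_empty_completions` are assumptions for illustration; only the two identifiers `task_complete` and `last_agent_message` come from the report.

```python
import json
from typing import Any, Iterable, Iterator


def iter_dicts(node: Any) -> Iterator[dict]:
    """Yield every dict nested anywhere inside a decoded JSON value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_dicts(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_dicts(item)


def find_empty_completions(lines: Iterable[str]) -> list[int]:
    """Return 1-based line numbers of task_complete events with no final message."""
    hits = []
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip partial or corrupt lines instead of aborting the scan
        for obj in iter_dicts(record):
            # Assumed shape: {"type": "task_complete", "last_agent_message": ...}
            is_completion = obj.get("type") == "task_complete"
            if is_completion and not obj.get("last_agent_message"):
                hits.append(lineno)
                break  # one hit per log line is enough
    return hits
```

To scan a real session file, pass an open file handle, for example `find_empty_completions(open(path, encoding="utf-8"))`. A hit only shows that a turn ended without a final message; it does not prove that a safety gate was the cause, since the logs do not expose the classifier verdict.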

Uploaded thread reference from the issue URL:

Uploaded thread: 0ce8e98a-7678-47c4-997b-ef2a9209d861

What steps can reproduce the bug?

I found the following instances in local Codex session logs. Session IDs are included so the uploaded thread/log bundle can be correlated server-side.

| Time (UTC) | Session | Evidence in JSONL | Task context | Why the stop was wrong |
| --- | --- | --- | --- | --- |
| 2026-05-01 17:30-17:33 | `019d3aa6-6fbe-7543-8abc-2a48fa2cd0b9` | Empty `task_complete` at 17:30:02 and 17:32:36, followed by user: “you keep triggering a ... flag that stops the conversation” | User-authorized property-search portal research. The agent was waiting for pages loaded by the user to settle, then planning passive DOM structure capture into minimized JSON fixtures. | This was QA/instrumentation for the user’s own repo. It was not auth bypass, scraping at scale, or evasion. The safe path was explicitly passive capture from a user-loaded page, with no raw page storage. |
| 2026-05-11 00:01-00:07 | `019e1403-b2f7-78f1-97e8-ae61596db57d` | Empty `task_complete` at 00:01:59 and 00:02:14, followed by user asking why the chat was flagged | Rapid Web Agent repo work. The agent was adding local synthetic browser-runtime tests: cookie banner fixture, blocker page fixture, redaction checks, and needs-input artifacts. | This was mostly local fixture testing. The code added human-intervention and redaction safeguards; it did not target third-party systems or request misuse. |
| 2026-05-11 08:54-09:08 | `019e1634-7321-7ee1-ac99-586344009604` | Empty `task_complete` at 08:54:26; user asks why flagged. Another empty `task_complete` at 09:08:49; user says the classifier prevented continuation. | Rapid Web Agent acceptance work across retail, automotive, property, and lead-form flows. The task used dummy data only and stopped before login, account creation, payment, or irreversible submission. | This is legitimate QA automation with explicit non-destructive boundaries and acceptance proof. The classifier appears to overweight “web agent”, “real sites”, “forms”, and “checkout” without honoring the repo context and hard stops. |
| 2026-05-11 09:43-09:49 | `019e1634-7312-7ee1-ac99-586344009604` / same thread family | User reports that both the original and a cleaned continuation prompt were still flagged. | The agent tried to provide a safer continuation prompt for the same repo task, including guardrails like no authentication, no accounts, no payment, and dummy data. | Guardrail language should reduce risk, but here it seems to have increased the chance of a false positive by repeating words like validation, proof, dummy data, and forbidden actions. |
| 2026-05-11 12:17-12:25 | `019e166c-16b8-7842-811f-7955c141504c` | Empty `task_complete` at 12:17:24 and 12:19:06, followed by user asking why it was flagged | Rapid Web Agent PRD implementation: OpenRouter model gateway, role telemetry, dummy persona, hard blockers, and artifacts such as `models.jsonl`, `agent-loop.json`, and observability logs. | This was repo-local implementation and telemetry. The hard blockers were designed to prevent unsafe behavior; nothing requested credential theft, unauthorized access, destructive actions, or payment. |
| 2026-05-11 18:39 | `019e1716-73fc-73f2-bac6-625bc32b37e8` | Empty `task_complete` at 18:39:27. The previous assistant messages were about classifying repeated 403/site-protection pages and adding benchmark tests; user immediately said the previous task did not finish because the classifier tripped. | Production-readiness benchmark work. The agent was distinguishing a harness bug from a repeated 403/site-protection failure class, then adding tests and dashboard aggregation. | Classifying a QA failure mode is defensive reliability engineering, not misuse. The task was to avoid mislabeling failures and to surface honest benchmark stats. |
| 2026-05-11 18:52 | `019e1716-73fc-73f2-bac6-625bc32b37e8` | Empty `task_complete` at 18:52:02 after the user asked for free OpenRouter model-role benchmarking and dashboard visibility | The agent was building a model-role scorecard: pass rate, latency, token use, provider failures, evaluator disagreement, and dashboard explanation. It also used the project’s own secret resolver and did not print keys. | This is normal LLM benchmarking and product telemetry. It should not be blocked as a security-risk task. |

Concrete reproduction shape:

1. In Codex App, work in an owned repo that performs browser QA/acceptance testing.
2. Ask Codex to run user-authorized browser proof against public product pages with guardrails: dummy data only, no login, no account creation, no payment, no irreversible submit.
3. Ask Codex to classify failures such as 403/site-protection pages separately from selector bugs and surface the stats in a dashboard.
4. The session may terminate mid-turn with an empty completion event instead of allowing Codex to continue or ask for clarification.
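For context, the failure classification requested in the third step is ordinary reliability bookkeeping. A minimal sketch of what such a classifier might look like follows; the `RunResult` type, the label names, and the matching rules are hypothetical and not taken from the affected repo. Treating HTTP 429 alongside 403 is an added assumption to cover the rate-limit case.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    url: str
    http_status: Optional[int]  # None when no response was received
    error: Optional[str]        # None when the step passed


def classify(result: RunResult) -> str:
    """Map one benchmark result to a coarse, hypothetical failure class."""
    if result.error is None:
        return "pass"
    if result.http_status in (403, 429):
        # The target refused or throttled the request, so this is not a
        # defect in the harness selectors and should be counted separately.
        return "site_protection"
    if "selector" in result.error.lower():
        return "selector_bug"
    return "other_failure"


def summarize(results: list[RunResult]) -> dict[str, int]:
    """Count results per class, ready to feed a dashboard widget."""
    return dict(Counter(classify(r) for r in results))
```

The labels only describe what happened to a request. Nothing in the sketch retries, rotates identities, or works around the block, which is the distinction the report asks the safety gate to recognize.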

What is the expected behavior?

Codex should not stop benign software-engineering sessions just because the repo uses browser automation, acceptance proof, model-role benchmarking, or failure classification language.

Expected handling:

- Use repo and conversation context, including explicit user authorization and hard stop boundaries.
- Treat guardrails such as “no login”, “no accounts”, “no payment”, and “dummy data only” as risk-reducing signals, not risk-increasing signals.
- Allow defensive QA labels for site-protection or rate-limit failures in benchmark dashboards.
- If the classifier is uncertain, ask a targeted clarification or downgrade capabilities for that turn rather than killing the session.
- Preserve a user-visible error reason in the JSONL/logs instead of only an empty `task_complete` marker.
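To make the last point concrete, here is one possible shape for a completion event that carries a reason. This is purely illustrative: `stop_reason` and `stop_detail` are hypothetical fields that do not exist in current Codex logs as far as this report can tell, and the value `"safety_gate"` is a made-up example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TaskComplete:
    type: str
    last_agent_message: Optional[str]
    stop_reason: Optional[str] = None  # hypothetical field, e.g. "safety_gate"
    stop_detail: Optional[str] = None  # hypothetical user-visible explanation


def to_jsonl(event: TaskComplete) -> str:
    """Serialize one event as a single JSONL line."""
    return json.dumps(asdict(event), sort_keys=True)


def describe(line: str) -> str:
    """Render a short, user-visible summary of a completion event."""
    event = json.loads(line)
    if event.get("last_agent_message"):
        return "completed normally"
    reason = event.get("stop_reason") or "unknown"
    return f"stopped early: {reason}"
```

With a field like this, an empty `last_agent_message` would no longer be ambiguous: a log reader could tell a gated turn from a crash or a cancelled request without asking the user what happened.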

Additional information

The repeated false positives caused substantial duplicate work: the user had to start new threads and re-prompt with less precise language. Worse, safer guardrail language appears to increase false positives, which pushes users to provide less safety context.

No secrets were included in this issue. Local dotenv discovery in the affected sessions reported only paths/source availability and did not print key values.


