codex: Agent weakens regression tests and reports nonexistent validation/changes as complete during refactor task

What version of Codex CLI is running?

codex-cli 0.133.0

What subscription do you have?

ChatGPT Plus

Which model were you using?

gpt-5.4-mini (reasoning high, summaries auto)

What platform is your computer?

Windows 64 bit

What terminal emulator and version are you using (if applicable)?

Codex CLI was launched from Git Bash / MINGW64 in a Windows PowerShell terminal. No tmux/screen/zellij multiplexer was in use.

Codex doctor report

{
  "schemaVersion": 1,
  "generatedAt": "1779979246s since unix epoch",
  "overallStatus": "fail",
  "codexVersion": "0.133.0",
  "checks": {
    "app_server.status": {
      "id": "app_server.status",
      "category": "app-server",
      "status": "ok",
      "summary": "background server is not running",
      "details": {
        "control socket": "C:\\Users\\manur\\.codex\\app-server-control\\app-server-control.sock",
        "daemon state dir": "C:\\Users\\manur\\.codex\\app-server-daemon",
        "mode": "ephemeral",
        "pid file": "C:\\Users\\manur\\.codex\\app-server-daemon\\app-server.pid (missing)",
        "settings": "C:\\Users\\manur\\.codex\\app-server-daemon\\settings.json (missing)",
        "status": "not running",
        "update-loop pid file": "C:\\Users\\manur\\.codex\\app-server-daemon\\app-server-updater.pid (missing)"
      },
      "remediation": null,
      "durationMs": 0
    },
    "auth.credentials": {
      "id": "auth.credentials",
      "category": "auth",
      "status": "ok",
      "summary": "auth is configured",
      "details": {
        "auth file": "C:\\Users\\manur\\.codex\\auth.json",
        "auth storage mode": "File",
        "stored API key": "false",
        "stored ChatGPT tokens": "true",
        "stored agent identity": "false",
        "stored auth mode": "chatgpt"
      },
      "remediation": null,
      "durationMs": 0
    },
    "config.load": {
      "id": "config.load",
      "category": "config",
      "status": "ok",
      "summary": "config loaded",
      "details": {
        "CODEX_HOME": "C:\\Users\\manur\\.codex",
        "config.toml": "C:\\Users\\manur\\.codex\\config.toml",
        "config.toml parse": "ok",
        "cwd": "F:\\codex\\composer-workbench-v2",
        "enabled feature flags": "shell_tool, shell_snapshot, terminal_resize_reflow, sqlite, memories, hooks, enable_request_compression, multi_agent, apps, tool_suggest, plugins, plugin_hooks, in_app_browser, browser_use, browser_use_external, computer_use, plugin_sharing, image_generation, skill_mcp_dependency_install, steer, guardian_approval, goals, collaboration_modes, tool_call_mcp_elicitation, personality, fast_mode, tui_app_server, prevent_idle_sleep, workspace_dependencies",
        "feature flag overrides": "memories=true, prevent_idle_sleep=true",
        "feature flags enabled": "29",
        "log dir": "C:\\Users\\manur\\.codex\\log",
        "mcp servers": "2",
        "model": "gpt-5.4-mini",
        "model provider": "openai",
        "sqlite home": "C:\\Users\\manur\\.codex"
      },
      "remediation": null,
      "durationMs": 0
    },
    "installation": {
      "id": "installation",
      "category": "install",
      "status": "ok",
      "summary": "installation looks consistent",
      "details": {
        "PATH codex #1": "C:\\Users\\manur\\AppData\\Roaming\\npm\\codex",
        "PATH codex #2": "C:\\Users\\manur\\AppData\\Roaming\\npm\\codex.cmd",
        "PATH codex entries": "2",
        "current executable": "C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex\\node_modules\\@openai\\codex-win32-x64\\vendor\\x86_64-pc-windows-msvc\\bin\\codex.exe",
        "install context": "npm (package C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex\\node_modules\\@openai\\codex-win32-x64\\vendor\\x86_64-pc-windows-msvc, bin C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex\\node_modules\\@openai\\codex-win32-x64\\vendor\\x86_64-pc-windows-msvc\\bin, resources C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex\\node_modules\\@openai\\codex-win32-x64\\vendor\\x86_64-pc-windows-msvc\\codex-resources, path C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex\\node_modules\\@openai\\codex-win32-x64\\vendor\\x86_64-pc-windows-msvc\\codex-path)",
        "managed by bun": "false",
        "managed by npm": "true",
        "managed package root": "C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex",
        "npm update target": "C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex"
      },
      "remediation": null,
      "durationMs": 1342
    },
    "mcp.config": {
      "id": "mcp.config",
      "category": "mcp",
      "status": "ok",
      "summary": "MCP configuration is locally consistent",
      "details": {
        "configured servers": "2",
        "disabled servers": "0",
        "stdio servers": "1",
        "streamable_http servers": "1"
      },
      "remediation": null,
      "durationMs": 44
    },
    "network.env": {
      "id": "network.env",
      "category": "network",
      "status": "ok",
      "summary": "network-related environment looks readable",
      "details": {
        "proxy env vars": "none"
      },
      "remediation": null,
      "durationMs": 0
    },
    "network.provider_reachability": {
      "id": "network.provider_reachability",
      "category": "reachability",
      "status": "fail",
      "summary": "one or more required provider endpoints are unreachable over HTTP",
      "details": {
        "ChatGPT base URL": "https://chatgpt.com/backend-api/ request timed out (required)",
        "reachability mode": "ChatGPT auth"
      },
      "remediation": "Check proxy, VPN, firewall, DNS, and custom CA configuration.",
      "durationMs": 3007
    },
    "network.websocket_reachability": {
      "id": "network.websocket_reachability",
      "category": "websocket",
      "status": "ok",
      "summary": "Responses WebSocket handshake succeeded",
      "details": {
        "DNS": "2 IPv4, 0 IPv6, first IPv4",
        "auth mode": "chatgpt",
        "connect timeout": "15000 ms",
        "endpoint": "wss://chatgpt.com/backend-api/<redacted>",
        "handshake result": "HTTP 101 Switching Protocols",
        "model provider": "openai",
        "models etag present": "true",
        "provider name": "OpenAI",
        "proxy env vars": "none",
        "reasoning header": "false",
        "server model present": "false",
        "supports websockets": "true",
        "wire API": "responses"
      },
      "remediation": null,
      "durationMs": 12582
    },
    "runtime.provenance": {
      "id": "runtime.provenance",
      "category": "runtime",
      "status": "ok",
      "summary": "running npm on windows-x86_64",
      "details": {
        "commit": "unknown",
        "current executable": "C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex\\node_modules\\@openai\\codex-win32-x64\\vendor\\x86_64-pc-windows-msvc\\bin\\codex.exe",
        "install method": "npm (package C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex\\node_modules\\@openai\\codex-win32-x64\\vendor\\x86_64-pc-windows-msvc, bin C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex\\node_modules\\@openai\\codex-win32-x64\\vendor\\x86_64-pc-windows-msvc\\bin, resources C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex\\node_modules\\@openai\\codex-win32-x64\\vendor\\x86_64-pc-windows-msvc\\codex-resources, path C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex\\node_modules\\@openai\\codex-win32-x64\\vendor\\x86_64-pc-windows-msvc\\codex-path)",
        "platform": "windows-x86_64",
        "version": "0.133.0"
      },
      "remediation": null,
      "durationMs": 0
    },
    "runtime.search": {
      "id": "runtime.search",
      "category": "search",
      "status": "ok",
      "summary": "search is OK (bundled)",
      "details": {
        "search command": "C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex\\node_modules\\@openai\\codex-win32-x64\\vendor\\x86_64-pc-windows-msvc\\codex-path\\rg.exe",
        "search command readiness": "file exists",
        "search provider": "bundled"
      },
      "remediation": null,
      "durationMs": 0
    },
    "sandbox.helpers": {
      "id": "sandbox.helpers",
      "category": "sandbox",
      "status": "ok",
      "summary": "sandbox configuration is readable",
      "details": {
        "approval policy": "OnRequest",
        "codex-linux-sandbox helper": "none",
        "execve wrapper helper": "none",
        "filesystem sandbox": "restricted",
        "network sandbox": "restricted"
      },
      "remediation": null,
      "durationMs": 0
    },
    "state.paths": {
      "id": "state.paths",
      "category": "state",
      "status": "ok",
      "summary": "state paths and databases are inspectable",
      "details": {
        "CODEX_HOME": "C:\\Users\\manur\\.codex (dir)",
        "active rollout files": "346 files, 615320524 total bytes, 1778383 average bytes",
        "archived rollout files": "39 files, 22587994 total bytes, 579179 average bytes",
        "goals DB": "C:\\Users\\manur\\.codex\\goals_1.sqlite (file)",
        "goals DB integrity": "ok",
        "log DB": "C:\\Users\\manur\\.codex\\logs_2.sqlite (file)",
        "log DB integrity": "ok",
        "log dir": "C:\\Users\\manur\\.codex\\log (dir)",
        "sqlite home": "C:\\Users\\manur\\.codex (dir)",
        "state DB": "C:\\Users\\manur\\.codex\\state_5.sqlite (file)",
        "state DB integrity": "ok"
      },
      "remediation": null,
      "durationMs": 26246
    },
    "terminal.env": {
      "id": "terminal.env",
      "category": "terminal",
      "status": "ok",
      "summary": "terminal metadata was detected",
      "details": {
        "WT_SESSION": "present",
        "color output": "enabled",
        "stderr is terminal": "true",
        "stdin is terminal": "true",
        "stdout is terminal": "true",
        "terminal": "Windows Terminal",
        "terminal size": "148x36"
      },
      "remediation": null,
      "durationMs": 6
    },
    "updates.status": {
      "id": "updates.status",
      "category": "updates",
      "status": "ok",
      "summary": "update configuration is locally consistent",
      "details": {
        "cached latest version": "0.134.0",
        "check for update on startup": "true",
        "dismissed version": "0.134.0",
        "last checked at": "2026-05-28T03:28:54.438999500Z",
        "latest version": "0.134.0",
        "latest version status": "newer version is available",
        "npm update target": "C:\\Users\\manur\\AppData\\Roaming\\npm\\node_modules\\@openai\\codex",
        "update action": "npm install -g @openai/codex",
        "version cache": "C:\\Users\\manur\\.codex\\version.json"
      },
      "remediation": null,
      "durationMs": 2177
    }
  }
}

What issue are you seeing?

This is a high-severity agent reliability issue in Codex CLI rather than a crash.

During a bounded TypeScript/React refactor task, Codex correctly identified several non-negotiable preservation requirements before editing:

  • preserve an existing async stale-navigation regression guard;
  • keep positive route-transition behavior intact;
  • move existing behavior rather than drop it during extraction;
  • remove five specific temporary lint baseline entries after removing their corresponding imports;
  • leave one downstream baseline entry untouched;
  • run validation from the final implementation and post completion evidence.

After implementation, Codex produced a commit and posted a READY/completion record through an MCP-connected issue tracker. Review of the committed revision found that Codex had:

  1. Claimed it removed five lint baseline entries, while the committed diff did not modify the lint file at all and the five entries remained present.
  2. Claimed the guardrail/lint validation passed, even though the committed tree necessarily violates the checker’s stale-baseline/exactness rules.
  3. Removed a required behavior: successful save no longer reset adjacent decision/snapshot state.
  4. Left feature-specific API orchestration in the eager shell after claiming the runtime behavior had been moved behind a lazy surface boundary.
  5. Modified a positive navigation regression test so that a test still named as proving navigation into a draft route instead asserted that navigation did not occur.
  6. Made a stale-request regression test vacuous by adding two mock branches for the same API request path; the first immediate-return branch made the deferred branch unreachable.
  7. Left another stale-request test wired to mocked controller state that was no longer used by the implementation, allowing the test to pass without exercising the claimed race condition.
  8. Appeared to introduce duplicate reads for an existing-draft transition: a pre-navigation read in the parent shell followed by another read after the lazy surface route was admitted.

The most concerning behavior is not simply that Codex introduced bugs. It is that it weakened or invalidated existing regression tests to obtain a green test run, then posted completion evidence contradicted by its own committed file inventory.

The session also consumed unusually large token volume for this bounded task:

  • input: 880,219
  • cached input: 19,996,032
  • output: 187,227
  • reasoning output: 134,615
  • total shown by Codex: 1,067,446

The session underwent a context compaction during implementation, after the main behavioral constraints had already been identified.

A full Codex thread transcript has been uploaded with this issue: 019e6dcf-f135-7f13-b558-917d668916a6

PII and repository-specific names have been omitted from this public description.

What steps can reproduce the bug?

Uploaded thread: 019e6dcf-f135-7f13-b558-917d668916a6

This occurred in a real repository task involving extraction of feature-owned React runtime logic from an eager shell component into a lazy-loaded surface component.

Preconditions

The repository had:

  1. A parent component containing:

    • route admission and shell rendering;
    • an async "Begin" transition into a draft route;
    • an async "Continue" transition into an existing draft route;
    • a synchronous parent-owned attempt token used to prevent stale async completions from navigating after abandonment;
    • feature controllers and direct-open hydration logic that were intended to be extracted behind a lazy surface boundary.
  2. Existing regression tests including:

    • a positive test proving Begin navigates into /workspace/drafts/<id>;
    • a stale Begin test where an in-flight create request resolves immediately after abandoning the source surface;
    • a stale Continue test where an in-flight open request resolves immediately after abandoning the source surface.
  3. A lint checker containing temporary baseline entries that had to be removed once corresponding eager imports were removed.
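The synchronous attempt token from the preconditions can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `createAttemptGuard`, `beginTransition`, and their parameters are hypothetical names, and only the `/workspace/drafts/<id>` route shape comes from the report.

```typescript
// Hypothetical names throughout; the real implementation is not shown in the issue.
type AttemptToken = number;

interface AttemptGuard {
  begin(): AttemptToken; // start a new attempt, invalidating older ones
  abandon(): void; // the user left the source surface
  isCurrent(token: AttemptToken): boolean;
}

function createAttemptGuard(): AttemptGuard {
  let current: AttemptToken = 0;
  return {
    begin: () => ++current,
    abandon: () => {
      current += 1; // any token handed out earlier is now stale
    },
    isCurrent: (token) => token === current,
  };
}

// Navigate only if the attempt that started the request is still current.
async function beginTransition(
  guard: AttemptGuard,
  createDraft: () => Promise<{ id: string }>,
  navigate: (path: string) => void,
): Promise<boolean> {
  const token = guard.begin(); // taken synchronously, before any await
  const draft = await createDraft();
  if (!guard.isCurrent(token)) {
    return false; // stale completion: drop it without navigating
  }
  navigate(`/workspace/drafts/${draft.id}`);
  return true;
}
```

The token is read before the first `await`, so a completion that arrives after `abandon()` cannot navigate. A regression test for this guard only proves something if the request is still pending at the moment of abandonment.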

Prompt pattern

Codex was instructed to:

  • perform a read-only orientation pass first;
  • preserve the stale-navigation token boundary;
  • create a lazy-loaded feature surface;
  • move the feature controllers and route hydration logic behind that surface;
  • keep only a minimal transition adapter in the eager parent shell;
  • remove exactly five resolved lint baseline entries;
  • retain one downstream baseline entry;
  • run focused tests, lint, build, Storybook build, and diff checks;
  • commit the result and post READY evidence including the changed-file inventory and validation results.

Example of original required behavior

Before the edit, an existing positive regression test contained the equivalent of:

await clickBeginFormulation();
await settleAsyncWork();

expect(window.location.pathname).toBe("/workspace/drafts/draft-1");
expect(activePanel()).toBe("draft");
expect(screenText()).toContain("Loading saved draft...");

Codex-modified behavior

After the edit, the same test title remained, but Codex changed the assertion to the equivalent of:

await clickBeginFormulation();
await settleAsyncWork();

expect(window.location.pathname).toBe("/workspace/briefs/req-1");
expect(activePanel()).toBe("");
expect(persistedBriefPanel()).not.toBeNull();

In other words, a test still claiming to prove successful navigation was changed to assert that navigation did not happen.
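A reviewer could flag this kind of inversion mechanically. The sketch below is a rough, line-based heuristic, not a parser and not an existing Codex feature: `parseAssertions`, `findAssertionDrift`, and the `AssertionDrift` type are hypothetical, and the regular expression only handles single-line `expect(subject).matcher(args)` calls like the ones quoted above.

```typescript
// Hypothetical heuristic: compare single-line expect(...) calls before and after an edit.
type AssertionDrift =
  | { kind: "changed"; subject: string; before: string; after: string }
  | { kind: "removed"; subject: string; before: string };

// Captures: 1 = subject, 2 = optional "not.", 3 = matcher name, 4 = matcher arguments.
const EXPECT_LINE = /^\s*expect\((.+)\)\.(not\.)?(\w+)\((.*)\);?\s*$/;

// Map each asserted subject to the text of its expectation (last one wins).
function parseAssertions(source: string): Map<string, string> {
  const found = new Map<string, string>();
  for (const line of source.split(/\r?\n/)) {
    const m = EXPECT_LINE.exec(line);
    if (m) {
      found.set(m[1], `${m[2] ?? ""}${m[3]}(${m[4]})`);
    }
  }
  return found;
}

// Report every original assertion that was rewritten or dropped.
function findAssertionDrift(before: string, after: string): AssertionDrift[] {
  const prev = parseAssertions(before);
  const next = parseAssertions(after);
  const drift: AssertionDrift[] = [];
  prev.forEach((expected, subject) => {
    const now = next.get(subject);
    if (now === undefined) {
      drift.push({ kind: "removed", subject, before: expected });
    } else if (now !== expected) {
      drift.push({ kind: "changed", subject, before: expected, after: now });
    }
  });
  return drift;
}
```

Run against the two snippets above, this would report the `window.location.pathname` and `activePanel()` expectations as changed and the `screenText()` expectation as removed. It cannot tell whether a change is legitimate; it only surfaces edits to existing assertions for human review.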

Vacuous stale-request regression proof

Codex also modified the stale Begin regression setup into the equivalent of:

if (path === "/formulation/drafts") {
  return { draft: immediateDraftResult };
}

if (path === "/formulation/drafts") {
  return await pendingCreateRequest.promise;
}

The deferred branch is unreachable, so resolving pendingCreateRequest after abandonment no longer proves stale-completion protection.
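For contrast, a mock that keeps the race meaningful has exactly one branch for the path, and that branch awaits the deferred request. The following is a minimal sketch under assumptions: `createDeferred`, `createMockApi`, and `DraftResponse` are hypothetical helpers, while `pendingCreateRequest` and the `/formulation/drafts` path are taken from the snippet above.

```typescript
// Minimal deferred helper (hypothetical; the original test's helper is not shown).
interface Deferred<T> {
  promise: Promise<T>;
  resolve(value: T): void;
}

function createDeferred<T>(): Deferred<T> {
  let resolve!: (value: T) => void;
  const promise = new Promise<T>((res) => {
    resolve = res;
  });
  return { promise, resolve };
}

type DraftResponse = { draft: { id: string } };

// One branch per path: the create request stays pending until the test resolves it.
function createMockApi(pendingCreateRequest: Deferred<DraftResponse>) {
  return async function request(path: string): Promise<DraftResponse> {
    if (path === "/formulation/drafts") {
      return await pendingCreateRequest.promise;
    }
    // Fail loudly instead of silently answering unexpected requests.
    throw new Error(`Unexpected request path: ${path}`);
  };
}
```

With this shape, the test can abandon the source surface while the request is still in flight, then call `pendingCreateRequest.resolve(...)` and assert that no navigation follows.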

False completion evidence

Codex committed a revision whose changed-file list omitted the lint-checker file, then posted completion evidence claiming that five entries had been removed from that lint-checker file and that validation had passed.
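A mismatch like this can be caught by comparing the claimed files against the commit's actual changed-file list before any READY record is posted. The sketch below assumes a hypothetical `CompletionClaim` shape and a hypothetical `findUnbackedClaims` helper; the real completion-record format is not shown in this report. The committed file list could come from something like `git diff --name-only` between the parent commit and the new one.

```typescript
// Hypothetical shape for one line of completion evidence.
interface CompletionClaim {
  file: string; // file the agent says it changed
  description: string; // what it says it did there
}

// Normalize Windows separators so paths compare equal across shells.
function normalizePath(path: string): string {
  return path.replace(/\\/g, "/");
}

// Return every claim whose file is absent from the commit's changed-file list.
function findUnbackedClaims(
  claims: CompletionClaim[],
  committedFiles: string[],
): CompletionClaim[] {
  const committed = new Set(committedFiles.map(normalizePath));
  return claims.filter((claim) => !committed.has(normalizePath(claim.file)));
}
```

Any non-empty result means the evidence describes a change the commit does not contain, which is a reason to stop and report a blocker rather than post completion.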

Thread identifier

Uploaded thread:
019e6dcf-f135-7f13-b558-917d668916a6

What is the expected behavior?

Codex should not obtain green validation by weakening, reversing, or accidentally neutralizing existing regression tests whose behavior it was explicitly instructed to preserve.

For this class of refactor task, expected behavior is:

1. Preserve existing positive behavior tests unless the same behavior is explicitly and correctly relocated to an equivalent integration proof.
2. Preserve race-condition regression tests so they still control a real pending async request and exercise the same failure mode after refactoring.
3. Treat removed behavior as requiring a new owner and corresponding proof, rather than silently dropping it.
4. Never post completion evidence claiming changes that are absent from the committed diff.
5. Never claim validation passed for the committed tree when committed files contradict the claimed guardrail cleanup.
6. When a test fails after an architectural extraction, diagnose whether the implementation violated the contract rather than changing assertions to encode the broken behavior.
7. Prefer stopping with a clear failure/blocker report over posting READY evidence for an untrustworthy implementation.

A useful product safeguard would be for Codex to detect and prominently warn when it changes assertions in existing tests that cover behavior explicitly marked as preserved in the task prompt, especially when those changes invert the asserted outcome or eliminate deferred/race behavior.

Additional information

This report is intentionally focused on Codex CLI agent behavior and completion-evidence reliability.

The run included two MCP integrations:

  • a read-only code-topology tool used during initial orientation;
  • an issue-tracker integration used to post completion evidence.

The topology tool did not make source edits. Codex correctly identified the important constraints before implementation. The failures occurred during Codex's subsequent code and test edits, validation interpretation, and completion reporting.

Relevant observations:

  • Codex performed a correct initial orientation and explicitly identified the stale-navigation guard and lint entries before editing.
  • A context compaction occurred during implementation.
  • After encountering failing tests, Codex repeatedly increased async settling windows and then weakened/remapped assertions instead of preserving the original behavior contract.
  • Codex attempted a commit before staging had completed, then retried sequentially.
  • The final commit changed five files but did not include the lint-checker file that its READY evidence claimed had been modified.
  • The session used very high token volume for a bounded extraction task:
    • 880,219 input tokens
    • 19,996,032 cached input tokens
    • 187,227 output tokens
    • 134,615 reasoning output tokens

I am happy to provide a further-redacted before/after diff or the specific invariant/test matrix if useful to maintainers.
