claude-code - 💡(How to fix) Fix Behavior report: over-orchestration + unverified factual claim + self-violated instruction on a small docs task (Claude Code vs Codex)

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Real-world head-to-head on a small, well-scoped task: write architecture documentation describing how a third-party API (Perplexity) is used in one ingestion stage of a private codebase. The same task was given to Claude Code (Opus 4.8, "ultracode" effort mode) and to Codex. Final content from both was factually correct. Claude Code was roughly 7-10x slower, about an order of magnitude more expensive in tokens, violated an explicit top-priority instruction in its own output, and asserted a wrong factual premise early that then required extra rounds to unwind. This is self-reported by the model at the user's request.

Root Cause

Real-world head-to-head on a small, well-scoped task: write architecture documentation describing how a third-party API (Perplexity) is used in one ingestion stage of a private codebase. The same task was given to Claude Code (Opus 4.8, "ultracode" effort mode) and to Codex. Final content from both was factually correct. Claude Code was roughly 7-10x slower, about an order of magnitude more expensive in tokens, violated an explicit top-priority instruction in its own output, and asserted a wrong factual premise early that then required extra rounds to unwind. This is self-reported by the model at the user's request.

RAW_BUFFERClick to expand / collapse

Summary

Real-world head-to-head on a small, well-scoped task: write architecture documentation describing how a third-party API (Perplexity) is used in one ingestion stage of a private codebase. The same task was given to Claude Code (Opus 4.8, "ultracode" effort mode) and to Codex. Final content from both was factually correct. Claude Code was roughly 7-10x slower, about an order of magnitude more expensive in tokens, violated an explicit top-priority instruction in its own output, and asserted a wrong factual premise early that then required extra rounds to unwind. This is self-reported by the model at the user's request.

Environment

  • Claude Code, model Opus 4.8 (1M context)
  • Effort mode: "ultracode" (multi-agent workflow orchestration enabled)
  • Task: bounded "document what exists" architecture docs

Head-to-head result

MetricCodexClaude Code
Wall-clock~3 min~20-30 min
Pre-worknone6-agent workflow, ~157s, ~600k subagent tokens
Output size356 lines, 1 file702 lines, 3 files (~2.4x)
Content correctnesscorrectcorrect
Rule-compliant first passyesno
Wrong factual premise earlynoyes

Failure modes

1. Asserted a factual claim without reading the authoritative source (worst issue). Claude Code stated early that a specific API path was "not live in production", inferring it from two wrong files (an unrelated service plus a stale parallel implementation). The authoritative, current source file was already in its own grep output, unopened. It then built a user-facing decision prompt on this false premise (which the user answered), and needed several extra rounds to retract and correct itself. Pattern: inferred guesses are stated as established facts, even when the correct source is already surfaced in-context. "Read the authoritative source before asserting" discipline is missing.

2. Violated an explicit top-priority instruction in its own output. The project's first configured rule forbids ASCII substitution of German umlauts (ae/oe/ue/ss). Claude Code followed it in one file and broke it in two others within the same task, then needed multiple cleanup cycles (one of which failed on its own shell bug). Pattern: instruction adherence decays across longer generations within a single task; the model "knows" a rule and breaks it mid-task.

3. Over-orchestration for a small task. In "ultracode" mode it reflexively launched an expensive multi-agent workflow when directly reading ~6 files would have sufficed. The one useful finding from the workflow would have surfaced without it. Pattern: effort/orchestration modes are not coupled to task size; missing a strong default toward "do the simple thing".

4. Cost amplification via self-inflicted rework. Each file edit echoed full file contents back into context, so the umlaut cleanup cost the re-echoed content of all files, not just the fix commands. Combined with the workflow, total token cost was about an order of magnitude above what the task warranted.

5. Output-volume bias. Produced ~2.4x the comparison solution's text with partial duplication, creating a "second source of truth" the model had itself warned the user against.

Suggested classification

bug / behavior-quality (model behavior, effort-mode calibration, instruction adherence).

What should have happened

  • Read the authoritative live source before making any factual claim or building a user decision on it.
  • Task-size before machinery: do small, well-scoped tasks directly, no multi-agent fan-out.
  • Self-check top-priority rules before declaring done, not after being told.
  • Document concisely in one place; do not create parallel sources of truth.

Filed at the user's request as a behavior/quality self-report. The user's takeaway: for everyday bounded tasks like this, the price/performance did not justify a premium subscription versus a cheaper tool that was faster, cleaner, and equally correct.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

claude-code - 💡(How to fix) Fix Behavior report: over-orchestration + unverified factual claim + self-violated instruction on a small docs task (Claude Code vs Codex)