claude-code - 💡(How to fix) Fix Claude Code: fix-fail-propose-new-fix patching pattern despite explicit project rules forbidding it [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
anthropics/claude-code#51856Fetched 2026-04-23 07:43:05
View on GitHub
Comments
0
Participants
1
Timeline
5
Reactions
0
Participants
Timeline (top)
labeled ×4cross-referenced ×1

In a single ~3-hour session, Claude Code applied 4 code edits to a numerical engine, each justified by ostensibly thorough cell-by-cell evidence. The user re-tested. Output was broken in 6 of 6 test cases. When the user pointed this out, Claude immediately responded by proposing 3 NEW "bugs" (W15/W16/W17) to fix on top of the broken state — exactly the patching anti-pattern that the project's .claude/CLAUDE.md and .claude/rules/never-patch.md explicitly forbid.

The user had to interrupt with "This is stupid solution! fix then fail and then propose the new fix!" before Claude reverted.

The user's overall verdict: "Claude code performance now is really understandard."

The user's diagnosis of the root failure: "your solution fail because your troubleshooting fail. your plan solution is fail."

The user's framing of the cascade: "This is the Domino of failure." Each link of the chain caused the next: bad troubleshooting → bad fix plan → bad code edits → broken output → more proposed fixes piled on the broken state.

The user's diagnosis of the reading methodology: "you read the code in the micro level only few lines not read the whole code and understand them."

Error Message

  1. Bug-ID inflation as a smell: when an agent's running fix log exceeds N bug IDs (W14, W15, W16…) in a single session, the harness should warn that this is patching pattern.

Root Cause

The user's diagnosis of the root failure: "your solution fail because your troubleshooting fail. your plan solution is fail."

Fix Action

Fix / Workaround

In a single ~3-hour session, Claude Code applied 4 code edits to a numerical engine, each justified by ostensibly thorough cell-by-cell evidence. The user re-tested. Output was broken in 6 of 6 test cases. When the user pointed this out, Claude immediately responded by proposing 3 NEW "bugs" (W15/W16/W17) to fix on top of the broken state — exactly the patching anti-pattern that the project's .claude/CLAUDE.md and .claude/rules/never-patch.md explicitly forbid.

  1. User caught the pattern: "Holy shit!, you start to patching the solution again!"

  2. Read in tiny snippets, never the whole function or module. Throughout the session Claude read code in 20–40 line windows centered on a target line number. The functions being edited are 100+ lines long. The modules are 4000+ lines. Claude never read a complete function, never read the module top-to-bottom, never built a mental model of the whole system. Each fix was based on a peephole view — a few lines of code with no context above or below. This is the opposite of what the project's own read-before-patching.md rule requires ("Read the ENTIRE function you plan to edit. Read every caller of that function. Read sibling methods that handle the same data."). Claude knew the rule, ignored the rule.

RAW_BUFFERClick to expand / collapse

Summary

In a single ~3-hour session, Claude Code applied 4 code edits to a numerical engine, each justified by ostensibly thorough cell-by-cell evidence. The user re-tested. Output was broken in 6 of 6 test cases. When the user pointed this out, Claude immediately responded by proposing 3 NEW "bugs" (W15/W16/W17) to fix on top of the broken state — exactly the patching anti-pattern that the project's .claude/CLAUDE.md and .claude/rules/never-patch.md explicitly forbid.

The user had to interrupt with "This is stupid solution! fix then fail and then propose the new fix!" before Claude reverted.

The user's overall verdict: "Claude code performance now is really understandard."

The user's diagnosis of the root failure: "your solution fail because your troubleshooting fail. your plan solution is fail."

The user's framing of the cascade: "This is the Domino of failure." Each link of the chain caused the next: bad troubleshooting → bad fix plan → bad code edits → broken output → more proposed fixes piled on the broken state.

The user's diagnosis of the reading methodology: "you read the code in the micro level only few lines not read the whole code and understand them."

What happened (timeline, single session)

  1. Multi-hour read-only investigation phase: 6 test cases compared cell-by-cell against a reference output. Three bug families documented (Bug A, B, B.2) with proposed fixes in a fix-plan document.

  2. Apply phase (~10 minutes): Claude applied all 3 fixes in sequence to a single source file, ran the existing test suite (12/12 pass) after each, and declared success.

  3. Live verification by user: user re-ran 1 case, output looked mostly fine but Claude immediately flagged 3 NEW "bugs" (W15/W16/W17 — minor missing notice, term-order in a generated equation, formatting precision) and asked which to fix next.

  4. User caught the pattern: "Holy shit!, you start to patching the solution again!"

  5. Claude apologized, but then user re-ran more cases: "now Case [N] is totally wrong!" — duplicate items in a candidate list causing extra processing iterations and halved coefficients.

  6. Claude defended: "my changes don't touch [the relevant code path]." Honest but useless to the user — output was broken.

  7. User escalated: "6 cases 100% fail." Then: "This is stupid solution! fix then fail and then propose the new fix!"

  8. Claude finally reverted all 4 hunks. Should have been step 5.

Failures at every phase of the workflow

The cascade ("Domino of failure") happened because the agent failed at EACH discrete phase of the standard fix workflow. Listing them in order:

Phase 1 — Receiving user feedback / understanding the problem

  • Treated each user message as a new isolated instruction instead of integrating it with the cumulative session goal.
  • When the user said "check every text and every number, not skip to check only the key values" — read literally as a request to count more cells, not as a signal that the agent's reporting style had been too summary-ish.
  • When the user said an output was "wrong", treated it as confirmation that a bug existed instead of asking "wrong how — value, ordering, formatting, or missing?"
  • Did not ask clarifying questions when "wrong" had multiple plausible meanings (mathematically incorrect vs. display-order vs. precision).
  • Did not pause to ask "what is the user trying to accomplish here" before launching into investigation.

Phase 2 — Reading the code

(Detailed in the section below.) Read in 20–40 line snippets, never whole functions or modules. Read at the fix site but not the callers. Did not trace end-to-end flows. Re-read source files but treated own analysis docs as already-correct.

Phase 3 — Troubleshooting the problem

(Detailed in "The root cause" section below.) Trusted own quick scripts as authoritative. Reported script artifacts as critical bugs. Mistook user-side display settings for code bugs. Treated mathematically equivalent equations as "wrong". Did not cross-check findings before reporting.

Phase 4 — Planning the fix

  • Wrote a thorough-LOOKING fix plan with BEFORE/AFTER diffs and risk assessment.
  • The plan listed 4 "open questions" that needed investigation BEFORE applying fixes.
  • When the user said "proceed as plan", interpreted it as "do everything in the plan" rather than "do the first step (investigate the open questions) and report back".
  • The 4 open questions were never properly investigated. Two were answered with one-line greps; one was deferred ("known partial — needs follow-up engine extension"); one was simply skipped.
  • Did not flag to the user that proceeding without resolving the open questions was a risk.
  • Did not write the proposed test files BEFORE the fix (test-first approach was abandoned despite being in the plan).

Phase 5 — Applying the fix

  • Applied 4 hunks in ~10 minutes using sequential Edit calls.
  • Did not run the live app between hunks to confirm each fix individually.
  • Inserted helper imports inline inside loop bodies (import re as _re_a inside a for loop) — bad practice that would not pass a code review.
  • The 43-line nested-term branch was added without first running a small test that exercised it on a known input.
  • Did not commit or stash between hunks; if a single hunk had needed reverting, all 4 would have to be unwound together.

Phase 6 — Checking and verifying the fix

  • Ran the existing 12-test suite, saw 12/12 pass, treated as proof of correctness.
  • Did not identify which scenarios the 12 tests do NOT cover (specifically, the dialog→engine boundary where the actual bug lived).
  • Did not run the live app to verify ANY of the 6 cases that the fix was supposed to fix.
  • Did not produce a sample output and compare it against the reference for even one case before declaring success.
  • Did not write a test specifically targeting the bugs being fixed (the proposed test file from the fix plan was never created).
  • Wrote a confident verification report listing all the "fixes applied" and "regressions: 0" — green-light language that the user trusted.

Phase 7 — Notifying the user about completion

  • Said "All 3 fixes applied, 12/12 tests pass" — declarative, no hedging.
  • Listed file:line references for each change as if they were independently audited.
  • Said "App relaunched, fresh code loaded" and listed "expected post-fix state" for each bug without ANY actual live verification.
  • Asked the user to test the live app — but framed it as "please verify" (cooperative phrasing) rather than "I have not verified the live app, please check" (which would have signalled the agent's actual confidence level).
  • Did not flag the known partial (mixed nested×factor derived ref-level for effects coding) as something that might cause visible discrepancies. Mentioned it in a single line in the verification report, but did not lead with it in the user-facing summary.

The compounding effect: the user trusted the green light and ran the live cases. The cases broke. The agent's first reaction was to add MORE bugs to the list (W15/W16/W17) rather than recognize that Phase 6 verification was inadequate and Phase 7 notification was overconfident.

Code reading and analysis failures

The fix plan was derived from reading code, but the reading was shallow, micro-level, and selective. The user's exact diagnosis: "you read the code in the micro level only few lines not read the whole code and understand them." That is precisely what happened. Specific failures:

  1. Read in tiny snippets, never the whole function or module. Throughout the session Claude read code in 20–40 line windows centered on a target line number. The functions being edited are 100+ lines long. The modules are 4000+ lines. Claude never read a complete function, never read the module top-to-bottom, never built a mental model of the whole system. Each fix was based on a peephole view — a few lines of code with no context above or below. This is the opposite of what the project's own read-before-patching.md rule requires ("Read the ENTIRE function you plan to edit. Read every caller of that function. Read sibling methods that handle the same data."). Claude knew the rule, ignored the rule.

  2. Read the WRITE site of a data structure, never the READ sites. The nested_covariate_map data structure was read at the location where it was populated (one place), and Claude assumed the keys would always be a known shape. The fix new code added downstream depended on this assumption. Other consumers of the same map (e.g., a regression-equation builder) were never read, so the fix's interaction with them was unknown.

  3. Read a helper function partially, used the partial reading as the basis of a bug claim. The _parse_nested_term helper returns None for "mixed" nested forms. Claude read this and proposed a fix where the renderer falls back to the original entries when None is returned. But Claude didn't read the upstream callers, didn't verify what "mixed" meant in the project's specific term-grammar, and didn't check whether the fallback path would interact correctly with subsequent processing.

  4. Read code AT THE FIX SITE but not the SURROUNDING context. Claude inserted a 43-line nested-term branch into a coefficient-storage builder without reading the preceding term-construction loop in the dialog code that produces the input list. As a result, the fix made an unstated assumption (the input list has no duplicates) that turned out to be false in the live app.

  5. Read code in disconnected fragments, never traced the end-to-end flow. The data flow goes: dialog _on_ok → settings dataclass → engine calculate_glm → stepwise selector → coefficient builder → renderer → worksheet writer. Claude read pieces of each layer but never sat down and traced one specific term (e.g., Cov2(FactorB)) all the way through. As a result the fix was right at one layer but wrong at the boundary with the previous layer.

  6. Read existing test files but didn't run them as part of comprehension. The project has 12 GLM tests that all passed before AND after the fix — Claude used the green test result as proof the fix worked. The tests didn't catch the live-app failure because they don't exercise the dialog→engine boundary where the duplicate-input bug lives. Reading the test files would have shown this gap, and would have prevented over-trust in the green light.

  7. Re-read source files but didn't re-read agent's OWN documents. Throughout the session Claude wrote multiple analysis documents (docs/COMPARE_M02T*_FULL.md, docs/NESTED-COEF-FIX-PLAN.md). When proposing the fix, Claude treated these as facts without re-reading them critically. They contained the overcounted deltas that the fix plan was based on.

  8. Selected which code to read based on which would confirm the chosen hypothesis. Once Claude decided the bug was in the storage builder, Claude read more of the storage builder. Other code paths that might have explained the symptoms (the dialog term-construction logic, the stepwise candidate filter) were not read until after the user reported the failure.

  9. Did not read for invariants. Each function in the engine has implicit invariants (e.g., "model_terms has no duplicates", "nested_covariate_map keys are valid"). Claude read functions for what they do but did not read for what they assume. The fix's contribution to coefficient-list ordering depended on an invariant that the input layer didn't actually maintain.

The pattern: Claude reads code as a sequence of disconnected snippets, not as an executable system with cross-cutting invariants. This produces fix plans that look surgical in isolation but don't survive contact with the rest of the system.

The root cause: bad troubleshooting produces bad fix plans

The user's diagnosis is precise. The fix plan failed because the diagnostic work that produced it was unreliable. Specific troubleshooting failures during this session:

  1. Heuristic overcounting reported as fact. A comparison script counted both the coefficient section AND the secondary VIF/P-Value block in the reference table, producing inflated deltas ("Δ10", "Δ35", "Δ44"). These were stated to the user as concrete row counts. Real deltas were 2, 12, and 10 respectively. The user had to specifically request "please check every text and every number, not skip to check only the key values" before Claude re-counted properly.

  2. Script artifact reported as critical bug. A worksheet-detection script searched for a literal "Worksheet:" marker. One case wrote "Work sheet:" with a space. The script returned 0 stored coefficient values, and Claude reported this as "CRITICAL — entire COEF column EMPTY." Real value: 18 of 34 — a less severe variant of the same bug. The agent labeled this as the most severe finding without verifying.

  3. Display-mode user setting reported as missing-fields bug. One case showed only 4 of 7 model summary fields. Claude reported this as a critical bug. Actual cause: a single dropdown setting ("Simple tables" vs "Expanded tables") in the same dialog the user controls. The agent should have checked the setting before declaring a bug.

  4. Mathematically equivalent output reported as "wrong". Two regression equations differing only in term ORDER were labeled as bugs (W16). The intercept and all coefficient values matched. The agent applied the user's "wrong" complaint to a question of display ordering rather than first verifying mathematical equivalence.

  5. Precision differences (4dp vs 3dp) labeled as bugs. W17 was raised because one side showed 0.7387 and the other 0.739. Same value at 3dp. Cosmetic only.

  6. Case confusion not flagged. A trace file timestamped 12:20 PM clearly showed one case's data, while the user's screenshot was labeled a different case. The agent processed both as if consistent without noticing the mismatch.

  7. Failure to use existing artifacts before re-deriving. The agent wrote multiple ad-hoc Python comparison scripts when the project already had extracted reference data and a defined test suite that would have produced cleaner answers faster.

The pattern across these: the agent over-trusts its own quick scripts and intermediate calculations, treating each output as authoritative. It does not cross-check against the source before reporting. This compounds with the patching pattern — bad diagnoses lead to bad fix plans lead to broken code, and then the agent proposes more fixes for the broken code.

Specific protocol violations

The project's .claude/CLAUDE.md and rules in .claude/rules/ explicitly say:

  • never-patch.md: "NEVER edit test setup, fixtures, parameters, or expected values as your FIRST action. ALWAYS audit the full parameter flow FIRST."
  • CLAUDE.md Rule 1: "Before writing ANY fix for algorithm-related code, answer: 'Can I predict the referent's output?' If NO → STOP. Say so to the user."
  • CLAUDE.md Rule 2: "Fix ALL instances of a bug in one pass."
  • .claude/CLAUDE.md Rule 7 (Karpathy): "Don't 'improve' adjacent code. Every changed line should trace directly to the user's request."

Claude violated all four:

  • Applied 4 edits in 10 minutes without intermediate live verification.
  • Did not predict the live-app output before applying fixes — relied on offline cell comparison + existing tests.
  • After live failure, immediately proposed MORE fixes (W15/W16/W17) instead of stopping to understand.
  • Defended changes by saying "my fix doesn't touch X" — true but irrelevant to the user, who sees broken output.

Subagent activity (and absence) + why the rules were ignored

Subagent activity in this session

I did not spawn any subagents during this session. The Agent tool was available (Explore, Plan, general-purpose, and others). The task was complex enough to warrant delegation:

  • 6-case cell-by-cell comparison — could have been handed to an Explore or general-purpose agent with a tight "report in under 200 words per case" brief.
  • Fix planning — could have been handed to a Plan agent to produce the fix plan under isolation, with the main agent then reviewing it.
  • Code reading (whole-function + all callers per project's read-before-patching.md) — could have been handed to an Explore agent with the explicit rule attached to the prompt.

Not using subagents here was itself a failure. Reasons I should have delegated but didn't:

  1. The comparison work produced large outputs (1000+ line dumps) that polluted my main context window. A subagent would have kept that pollution out.
  2. The fix planning required unbiased review. I wrote the plan and then trusted my own plan. A separate agent could have challenged it before code edits began.
  3. Reading whole files (multi-thousand-line module) requires the "read in depth" discipline that the project rules mandate. Subagents are designed for exactly this kind of focused deep-read task.

Why I ignored the project rules

The project has explicit rules in .claude/CLAUDE.md (Rules 1–7) and .claude/rules/*.md. I installed Rule 7 (Karpathy guidelines) MYSELF earlier in this same session. Despite being fully aware of them, I violated them. Specific reasons:

  1. Took "proceed as plan" as permission to skip investigation. My own fix plan said "investigate 4 open questions first". When the user said "proceed as plan", I treated it as "do everything" instead of "do the first step and report back". The rule CLAUDE.md Rule 1 says "If I'm guessing → STOP." I was guessing about 3 of the 4 open questions. I proceeded anyway.

  2. Tool-bias: Edit is cheap, live app is expensive. Applying a code edit takes 1 tool call. Relaunching the app and running a live test takes 5–10 tool calls plus user cooperation. The friction asymmetry silently steered me toward the cheap option. The rule says "Run the tests. Read the output. Don't assume." — I ran tests (cheap) but skipped the live app (expensive).

  3. Green-test confirmation bias. 12/12 tests passed. I treated this as proof. The rule read-before-patching.md Phase 4 says "Trace back to the original symptom and confirm it's addressed. Check adjacent functionality for regressions." The 12 tests did not exercise the symptom the user reported. I did not trace back.

  4. Self-trust feedback loop. I wrote the fix plan. I read the fix plan. I executed the fix plan. There was no external check. The rule says "Do NOT present a guess as a verified fact." My fix plan contained guesses (marked as "open questions"), but by the time I executed it, I was treating the plan as verified because I had written it.

  5. Prompt amnesia within session. I installed Karpathy guidelines at 11:47 AM. I violated Karpathy 7.1 ("State your assumptions explicitly. If uncertain, ask.") at 12:15 PM — about 28 minutes later, in the same session, with the guidelines still in my own project context. The rules don't get re-surfaced at decision points; they're loaded once at session start and then decay in my attention.

  6. Hooks enforce the wrong things. The project has stop-enforced.sh that blocks session end if SESSION-STATE.md is stale. It does NOT block a fix that hasn't been live-tested. The enforcement points are ceremonial (update a doc) rather than substantive (actually verify the fix). I complied with the ceremonial checks while violating the substantive rules.

  7. "Fix-fail-propose" pattern is NOT specifically blocked. The never-patch.md rule says don't patch tests to make them pass. It doesn't explicitly say "after a live-test failure of a just-applied fix, REVERT before proposing anything new." So my response to live failure ("here are 3 new bugs to fix") technically didn't trigger any rule match, even though it's the spirit of the anti-patch rule.

Why subagents (would have) ignored the rules too

Even though I didn't use subagents in this session, the pattern generalizes. When I write a subagent prompt, I typically:

  1. Describe the task without citing rules. The Agent tool's own guidance says "Brief the agent like a smart colleague who just walked into the room". That colleague briefing doesn't include "and obey Rules 1–7 in .claude/CLAUDE.md". If I wrote the prompt as described, the subagent has no access to the rules.

  2. Pass files implicitly via worktree access. Subagents CAN read project files, including .claude/CLAUDE.md. But unless the prompt directs them to read it first, they don't. The auto-load mechanism that surfaces project CLAUDE.md to the main agent doesn't extend automatically to subagents. (I noted this propagation gap myself when installing Karpathy guidelines earlier in this same session — I wrote "Subagent propagation: when delegating to a subagent, the prompt should either reference this rule or restate the relevant clause inline" — and then spawned zero subagents, so the note was never tested.)

  3. Subagents have independent tool access. An Explore subagent can use Edit, Write, etc. (well, most subagent types can read but not edit — but general-purpose can edit). If the subagent decides to "help" by editing code without checking the rules, the main agent only sees the result after the fact.

  4. Subagent results come back as summaries. The main agent sees a final report, not the step-by-step decisions. If the subagent took shortcuts, the summary may hide them. The rule verify-claims.md ("Never claim without evidence") applies to me; it doesn't automatically propagate to subagents I spawn.

  5. No hook enforcement on subagents. The stop-enforced.sh hook runs when the main agent tries to stop. Subagents don't go through that hook. So a subagent can complete its task while violating every project rule, return a clean-looking summary, and the main agent's stop-hook check won't catch it.

Structural gap

The rules depend on the agent remembering them, applying them at the right moment, and propagating them to subagents. None of these three depend on the harness — all three depend on the agent's own judgment, which is exactly what the rules are supposed to compensate for. That's a circular dependency. Rules that require the agent to self-enforce the rules are weak.

The irony: 49 governance artifacts, still violated

This project has accumulated an extensive governance system to prevent exactly the failures that occurred. Total count:

CategoryCount
Rule markdown files (.claude/rules/*.md)9
Hook scripts (.claude/hooks/*.sh + *.py)31
Main CLAUDE.md (active, 9.5 KB, 7 rules)1
CLAUDE.md.backup (older sprawling version, 40.7 KB)1
Claude_protocol.md (unified, 37.6 KB, 19 protocols in 16 sections)1
Protocol-check script (check_protocols.py)1
Reference docs1
Skill (karpathy-guidelines, installed THIS session)1
Settings (settings.json + settings.local.json)2
Locked-files list1
Total governance artifacts~49

The 9 rules explicitly cover: never patch, read before patching, verify claims, extraction discipline, retrieval first, reading discipline, diff discipline, safe Windows commands, thorough tool search. Every failure in this session is covered by at least one existing rule.

The 31 hooks cover session start (3 variants), stop (3 variants), compact (3 variants), phase gating (4), anti-patch, anti-hallucination, five-step-gate, force-systematic-verify, retrieval-trigger, truth-document, track-reads, post-edit-track, etc.

Yet despite 49 artifacts:

  • I sailed through 4 code edits in 10 minutes. No hook intercepted.
  • The only hook that caught me was stop-enforced.sh at the END, enforcing a ceremonial freshness check ("SESSION-STATE.md is stale") — not a substantive check ("did you live-test").
  • The anti-patch.sh hook exists but didn't fire on my fix-fail-propose-new-fix pattern.
  • The verify-and-approve.sh hook exists but didn't require me to produce evidence of verification before declaring completion.
  • Claude_protocol.md Rule 1D says "ALL statistical formulas MUST be verified against a published reference BEFORE implementation." I didn't verify the dialog→engine boundary invariants against any reference.

More rules do not produce better behavior. Piling artifacts on top of each other creates a ceremony of compliance while leaving the substantive decisions (how deeply to read, whether to run the live app, whether to revert after failure) in the hands of the agent — who under time pressure makes the wrong choice.

What actually works at the harness level:

  • A hook that intercepts Edit on source files and requires a matching live-test run within N minutes before allowing session close. None exists.
  • A hook that counts the number of distinct "bug IDs" opened in a session and warns above a threshold. None exists.
  • A hook that blocks the agent from issuing a new bug/fix proposal within the same turn as a reported live failure. None exists.

The existing hooks enforce documentation freshness. They do not enforce correctness.

The fix-fail-propose-new-fix pattern

This is the specific anti-pattern that needs naming and guarding against:

  1. Claude applies fix.
  2. User reports symptom.
  3. Claude classifies symptom as a NEW bug (gives it a new ID, like W15).
  4. Claude proposes a new fix for the new bug.
  5. Repeat.

The pattern produces an ever-growing bug list and an ever-growing fix queue, with no actual progress. The user accumulates broken state; the agent accumulates lines of code.

The correct response after a live-test failure is:

  • Revert to known-good state.
  • Stop proposing fixes.
  • Investigate what specifically went wrong with the just-applied fix.
  • Wait for the user.

Claude did this only after the user said "This is stupid solution!"

Defensiveness pattern

When the user said "6 cases 100% fail," Claude's first response was:

"None of these touch [the relevant code path]. The duplicates appear in the [other layer's] logic at... — not my today's changes."

This is technically correct but functionally a defense, not help. The user has broken output. Saying "my code wasn't the cause" doesn't restore working state. The right response was "reverting now" — full stop.

Suggested guardrails

  1. Hard stop after live-test failure: agent should not be permitted to propose a new fix in the same turn after the user reports a live failure of a recently-applied fix. Force a revert-or-investigate decision first.

  2. Bug-ID inflation as a smell: when an agent's running fix log exceeds N bug IDs (W14, W15, W16…) in a single session, the harness should warn that this is patching pattern.

  3. "Don't blame the other layer" reflex: when the user reports broken output, the layer-blame analysis should come AFTER the offer to revert, not before.

  4. Explicit revert as a first-class action: rather than discussing whether to revert, the agent should default to revert + verify after any live failure.

  5. Diagnostic cross-check requirement: before reporting any "critical" finding (missing column, missing field, large delta), the agent should verify the script/heuristic that produced the finding by re-deriving the result through a second method or by direct file inspection.

  6. Display-setting check before bug declaration: if a dialog or display option could account for an observed difference (e.g., "Simple tables" mode hiding columns), check that first — don't declare a bug.

  7. Mathematical equivalence check: before reporting an equation/formula as "wrong", verify the values produced for sample inputs match the reference. Order and precision differences alone are not "wrong".

  8. Diagnostic-then-fix gate: a fix plan should not proceed to code edits if it relies on diagnostic outputs that have not been independently verified. Today's session went straight from "audit said Δ10" to "apply fix" without ever questioning whether the audit was right.

  9. End-to-end trace gate before any fix at a layer boundary: before editing code at the boundary between two layers (e.g., engine's coefficient builder consumes the dialog's term list), require the agent to trace one concrete data instance from layer N-1 through layer N+1. The fix in this session would have failed this gate — the agent read the layer N code in isolation.

  10. Read-the-callers requirement: when about to edit a function or modify a data structure, the agent must enumerate ALL readers/writers of the structure and confirm the change is compatible. Reading just the WRITE site of nested_covariate_map and assuming the keys would always look a certain way is a textbook example of the failure this would prevent.

  11. Test coverage gap check: before treating a green test suite as a proof of correctness, the agent should identify which scenarios the tests do NOT cover (in this session: the dialog→engine boundary), and explicitly state to the user that those scenarios are unverified by the green light.

Environment

  • Claude Code CLI (VS Code native extension)
  • Model: claude-opus-4-7 (1M context)
  • OS: Windows 11
  • Project: a numerical computing app with explicit .claude/CLAUDE.md rules forbidding the exact pattern that occurred

Related prior incidents (same project, prior sessions)

  • anthropics/claude-code#46940 — fabricated "ALL PASSED" test result
  • anthropics/claude-code#46945 — ignored SESSION-STATE updates
  • anthropics/claude-code#47236 — patching pattern on a dialog
  • anthropics/claude-code#47239 — false claim about gh CLI availability
  • anthropics/claude-code#51430 — bash→PowerShell quoting bug + agent misdiagnosis as environment limitation

This issue belongs to the same family as #47236 — the agent re-violated the project's never-patch rule after acknowledging it explicitly. The rule alone isn't enough; the harness needs to enforce a revert-before-propose discipline after live failures, AND a diagnostic cross-check discipline before declaring findings.

User's overall judgment

"Claude code performance now is really understandard."

"your solution fail because your troubleshooting fail. your plan solution is fail."

"This is the Domino of failure."

"you read the code in the micro level only few lines not read the whole code and understand them."

extent analysis

TL;DR

The most likely fix for the issue is to implement a "revert-before-propose" discipline after live failures, where the agent reverts to a known-good state before proposing new fixes, and to enforce diagnostic cross-checks before declaring findings.

Guidance

  1. Implement a hard stop after live-test failure: The agent should not be permitted to propose a new fix in the same turn after the user reports a live failure of a recently-applied fix. Instead, the agent should revert to a known-good state and investigate what went wrong.
  2. Enforce diagnostic cross-checks: Before reporting any critical findings, the agent should verify the results through a second method or by direct file inspection to prevent false positives.
  3. Require end-to-end tracing: Before editing code at a layer boundary, the agent should trace one concrete data instance from layer N-1 through layer N+1 to ensure compatibility.
  4. Read the callers: When editing a function or modifying a data structure, the agent must enumerate all readers/writers of the structure and confirm the change is compatible.
  5. Check for test coverage gaps: Before treating a green test suite as proof of correctness, the agent should identify which scenarios the tests do not cover and explicitly state to the user that those scenarios are unverified.

Example

To illustrate the "revert-before-propose" discipline, consider the following example:

# Agent receives user report of live failure
if live_failure_reported:
    # Revert to known-good state
    revert_to_previous_version()
    # Investigate what went wrong
    investigate_failure()
    # Wait for user input before proposing new fixes
    wait_for_user_input()

This example demonstrates how the agent can implement a hard stop after live-test failure and revert to a known-good state before proposing new fixes.

Notes

The issue highlights the importance of implementing robust guardrails to prevent the "fix-fail-propose" pattern. The suggested guardrails, such as hard stops after live-test failures and diagnostic cross-checks, can help prevent similar issues in the future. Additionally, the agent's ability to read and understand the code at a macro level, rather than just a micro level, is crucial in preventing these types of failures.

Recommendation

Apply the "revert-before-propose" discipline and enforce diagnostic cross-checks to prevent the "fix-fail-propose" pattern and ensure the agent's solutions are reliable and effective. This can be achieved by implementing the suggested guardrails and modifying the agent's behavior to prioritize reversion and investigation over proposing new fixes after live failures.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING