claude-code - 💡(How to fix) Fix Claude Code cascades during calculation-layer fixes despite strict harness and TDD workflow [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
anthropics/claude-code#51967Fetched 2026-04-23 07:40:07
View on GitHub
Comments
1
Participants
2
Timeline
5
Reactions
0
Timeline (top)
labeled ×4commented ×1

Following up on #51856. The fix-fail-propose-new-fix cascade pattern persists even under a strict conformance harness, explicit TDD workflow, and one-fix- per-commit discipline. Three hours of a session produced a grid with only the baseline test passing — same as before the session started — meaning the session effectively made no progress while consuming significant time and tokens.

Root Cause

The conformance harness reduces the problem to its irreducible form: the grid shows exactly which tests pass and which fail, with exact failing paths. There is no ambiguity about whether progress was made. When the grid is unchanged after three hours, zero progress was made regardless of how many edits were applied.

Two specific failure modes:

  1. Revert-on-regression is not applied reliably. When a fix causes a previously passing test to regress, the correct action per the rule is immediate revert followed by re-analysis. What happens instead is a second fix attempting to address the regression, which either compounds the problem or masks it.

  2. Without human intervention to enforce stop points, the session does not self-terminate when the grid stops improving. A human checking in every 30-60 minutes catches this quickly; a human leaving it to run for hours does not.

Fix Action

Fix / Workaround

Infrastructure built specifically to prevent cascades:

  • Deterministic conformance harness: 16 reference cases, JSON diff with exact failing paths, per-test mismatch counts.
  • One rule in CLAUDE.md: no commit unless full suite runs clean of regressions.
  • One hook: blocks session close if source was edited without a fresh suite run.
  • Explicit five-phase TDD workflow: A run suite, B classify failures, B.5 group and propose fix order, C apply fixes one group at a time with full-suite verification and revert-not-patch on regression, D final verification.
  • Previous session under the same setup closed 8 groups across 9 commits with zero regressions. The approach demonstrably works when discipline holds.

This is the exact pattern #51856 documented. The harness caught each regression, but Claude Code did not follow the revert-not-patch rule consistently. The one-group-per-commit discipline broke down once the work moved from shape-layer fixes (where evidence is visual and obvious) to calculation-layer fixes (where evidence is a small numeric delta).

  • #51856 fix-fail-propose-new-fix patching pattern (original report)
  • #46940, #46945, #47236, #47239, #51430 (related earlier issues)
RAW_BUFFERClick to expand / collapse

Summary

Following up on #51856. The fix-fail-propose-new-fix cascade pattern persists even under a strict conformance harness, explicit TDD workflow, and one-fix- per-commit discipline. Three hours of a session produced a grid with only the baseline test passing — same as before the session started — meaning the session effectively made no progress while consuming significant time and tokens.

Setup

Project: reverse-engineering a closed-source reference application to match its output across a 16-case test matrix. Target module ~4700 lines.

Infrastructure built specifically to prevent cascades:

  • Deterministic conformance harness: 16 reference cases, JSON diff with exact failing paths, per-test mismatch counts.
  • One rule in CLAUDE.md: no commit unless full suite runs clean of regressions.
  • One hook: blocks session close if source was edited without a fresh suite run.
  • Explicit five-phase TDD workflow: A run suite, B classify failures, B.5 group and propose fix order, C apply fixes one group at a time with full-suite verification and revert-not-patch on regression, D final verification.
  • Previous session under the same setup closed 8 groups across 9 commits with zero regressions. The approach demonstrably works when discipline holds.

What happened this session

Target: calculation-layer drift in 5 tests. Previous session's end state: 1 PASS, 14 FAIL, 0 CRASH, 1 NO_REF. Hypothesised root cause per handoff doc: coding scheme or pivot in the computation path.

Session ran for 2-3 hours. End state: 1 PASS, 14 FAIL. Identical grid.

The cascade shape that reappeared:

  • Fix attempted
  • Previously passing internal assertions regressed
  • Revert not performed
  • New fix attempted on broken state
  • Repeat

This is the exact pattern #51856 documented. The harness caught each regression, but Claude Code did not follow the revert-not-patch rule consistently. The one-group-per-commit discipline broke down once the work moved from shape-layer fixes (where evidence is visual and obvious) to calculation-layer fixes (where evidence is a small numeric delta).

Why this matters

The conformance harness reduces the problem to its irreducible form: the grid shows exactly which tests pass and which fail, with exact failing paths. There is no ambiguity about whether progress was made. When the grid is unchanged after three hours, zero progress was made regardless of how many edits were applied.

Two specific failure modes:

  1. Revert-on-regression is not applied reliably. When a fix causes a previously passing test to regress, the correct action per the rule is immediate revert followed by re-analysis. What happens instead is a second fix attempting to address the regression, which either compounds the problem or masks it.

  2. Without human intervention to enforce stop points, the session does not self-terminate when the grid stops improving. A human checking in every 30-60 minutes catches this quickly; a human leaving it to run for hours does not.

Observations

  • The same model with the same codebase under the same harness produced 8 clean commits in the prior session. The difference was the class of fix: shape vs calculation. Calculation fixes involve more exploratory code changes and are harder to localise, which apparently exceeds the point where the current governance holds.

  • Switching to a long-context chat interface and feeding the entire relevant codebase plus reference set produces a coherent whole-module rewrite in one pass. This suggests the bottleneck is Claude Code's incremental tool-use loop, not the underlying model's capability. For specification-heavy work across many cases, the agent's context management is the limit.

Requests

  1. Consider a built-in "suite must not regress" gate that agents cannot override, not implemented as a user-installed hook. The project-side hook works but depends on the agent cooperating. A gate at the tool layer would be more reliable.

  2. A session-level progress tracker. If N consecutive tool calls produce no change in a named metric (like a test suite's pass count), the agent should be forced to stop and report rather than continue editing.

  3. Better primitives for the revert-on-regression pattern. Currently the agent has to self-enforce "if suite regresses, git revert and retry." This is exactly the kind of discipline that breaks down under pressure. A first-class tool like `try_fix(command, verify_suite)` that atomically applies-or-reverts would remove the discretion.

  4. Documentation guidance: when specification work shifts from shape-layer to calculation-layer, sessions should be split into shorter blocks with mandatory grid checks between them, not run as continuous 2-3 hour sessions.

Environment

  • Claude Code on Windows (Thai locale, cp874 default codec).
  • Python 3.12, standard scientific stack.
  • Harness, rules, hooks, and reference JSONs available if needed for reproduction.

Related

  • #51856 fix-fail-propose-new-fix patching pattern (original report)
  • #46940, #46945, #47236, #47239, #51430 (related earlier issues)

extent analysis

TL;DR

Implement a "suite must not regress" gate and a session-level progress tracker to prevent the fix-fail-propose-new-fix cascade pattern.

Guidance

  • Identify and enforce a strict revert-on-regression policy using a tool like try_fix(command, verify_suite) to atomically apply or revert changes.
  • Split sessions into shorter blocks with mandatory grid checks between them, especially when working on calculation-layer fixes.
  • Consider implementing a session-level progress tracker to stop and report when no change is made in a named metric after N consecutive tool calls.
  • Review and refine the current governance and discipline to ensure it holds under pressure, especially when working on complex or exploratory code changes.

Example

def try_fix(command, verify_suite):
    # Apply the fix
    apply_fix(command)
    # Verify the suite
    if verify_suite():
        # If the suite passes, commit the changes
        commit_changes()
    else:
        # If the suite regresses, revert the changes
        revert_changes()

Notes

The current issue is specific to the Claude Code tool and its interaction with the conformance harness and governance rules. The suggested solutions focus on improving the tool's behavior and the user's workflow to prevent the cascade pattern.

Recommendation

Apply a workaround by implementing a "suite must not regress" gate and a session-level progress tracker, as these can be implemented without modifying the underlying codebase. This will help prevent the fix-fail-propose-new-fix cascade pattern and ensure progress is made during sessions.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING