claude-code - 💡(How to fix) Fix Claude Code cascades during calculation-layer fixes despite strict harness and TDD workflow [1 comments, 2 participants]

Mig-Sornrakrit · 2026-04-22T12:57:20Z

GitHub stats

anthropics/claude-code#51967 • Fetched 2026-04-23 07:40:07
Author: Mig-Sornrakrit
Participants: github-actions[bot], Mig-Sornrakrit
Timeline: labeled ×4, commented ×1

Following up on #51856. The fix-fail-propose-new-fix cascade pattern persists even under a strict conformance harness, explicit TDD workflow, and one-fix-per-commit discipline. Three hours of a session produced a grid with only the baseline test passing, the same as before the session started, so the session effectively made no progress while consuming significant time and tokens.

Root Cause

The conformance harness reduces the problem to its irreducible form: the grid shows exactly which tests pass and which fail, with exact failing paths. There is no ambiguity about whether progress was made. When the grid is unchanged after three hours, zero progress was made regardless of how many edits were applied.

Two specific failure modes:

1. Revert-on-regression is not applied reliably. When a fix causes a previously passing test to regress, the correct action per the rule is immediate revert followed by re-analysis. What happens instead is a second fix attempting to address the regression, which either compounds the problem or masks it.
2. Without human intervention to enforce stop points, the session does not self-terminate when the grid stops improving. A human checking in every 30-60 minutes catches this quickly; a human leaving it to run for hours does not.
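
The revert-on-regression rule in the first failure mode comes down to a mechanical comparison of two grids. A minimal sketch, assuming each grid is a dict mapping a test name to a status string such as PASS or FAIL; the function name `find_regressions` and the test names are hypothetical and not taken from the reported harness:

```python
def find_regressions(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return tests that passed in `before` but do not pass in `after`."""
    return sorted(
        name
        for name, status in before.items()
        if status == "PASS" and after.get(name) != "PASS"
    )


# Hypothetical grids: one test regresses after an attempted fix.
before = {"case_01": "PASS", "case_02": "FAIL", "case_03": "PASS"}
after = {"case_01": "PASS", "case_02": "FAIL", "case_03": "FAIL"}

regressed = find_regressions(before, after)
if regressed:
    print("Regression detected, revert before re-analysis:", regressed)
```

A non-empty result is the signal to revert rather than patch. A test that disappears from the later grid is also counted as a regression here, which is a design choice of this sketch.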

Fix / Workaround

Infrastructure built specifically to prevent cascades:

- Deterministic conformance harness: 16 reference cases, JSON diff with exact failing paths, per-test mismatch counts.
- One rule in CLAUDE.md: no commit unless full suite runs clean of regressions.
- One hook: blocks session close if source was edited without a fresh suite run.
- Explicit five-phase TDD workflow: A run suite, B classify failures, B.5 group and propose fix order, C apply fixes one group at a time with full-suite verification and revert-not-patch on regression, D final verification.
- Previous session under the same setup closed 8 groups across 9 commits with zero regressions. The approach demonstrably works when discipline holds.
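
The issue does not include the harness source. As an illustration of what "JSON diff with exact failing paths" and "per-test mismatch counts" could look like, here is a minimal sketch; the function `diff_paths`, the `$`-rooted path notation, and the sample data are all assumptions:

```python
def diff_paths(expected, actual, path="$"):
    """Return the paths at which `actual` differs from `expected`."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        mismatches = []
        for key in sorted(set(expected) | set(actual)):
            child = f"{path}.{key}"
            if key not in expected or key not in actual:
                mismatches.append(child)  # key present on one side only
            else:
                mismatches.extend(diff_paths(expected[key], actual[key], child))
        return mismatches
    if isinstance(expected, list) and isinstance(actual, list):
        mismatches = []
        if len(expected) != len(actual):
            mismatches.append(f"{path}.length")
        # Compare the overlapping elements position by position.
        for i, (e, a) in enumerate(zip(expected, actual)):
            mismatches.extend(diff_paths(e, a, f"{path}[{i}]"))
        return mismatches
    # Scalars (or mismatched container types): exact comparison.
    return [] if expected == actual else [path]


# Hypothetical reference case and module output with one numeric delta.
reference = {"rows": [{"id": 1, "total": 10.5}, {"id": 2, "total": 7.25}]}
output = {"rows": [{"id": 1, "total": 10.5}, {"id": 2, "total": 7.3}]}

paths = diff_paths(reference, output)
print(f"{len(paths)} mismatch(es): {paths}")
```

The length of the returned list serves as the per-test mismatch count. Scalars are compared exactly, which suits a deterministic harness; a tolerance-based comparison would be a different design.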

The cascade seen in this session is the exact pattern #51856 documented. The harness caught each regression, but Claude Code did not follow the revert-not-patch rule consistently. The one-group-per-commit discipline broke down once the work moved from shape-layer fixes (where evidence is visual and obvious) to calculation-layer fixes (where evidence is a small numeric delta).

Setup

Project: reverse-engineering a closed-source reference application to match its output across a 16-case test matrix. Target module ~4700 lines.

What happened this session

Target: calculation-layer drift in 5 tests. Previous session's end state: 1 PASS, 14 FAIL, 0 CRASH, 1 NO_REF. Hypothesised root cause per handoff doc: coding scheme or pivot in the computation path.

Session ran for 2-3 hours. End state: 1 PASS, 14 FAIL. Identical grid.

The cascade shape that reappeared:

- Fix attempted
- Previously passing internal assertions regressed
- Revert not performed
- New fix attempted on broken state
- Repeat

Observations

- The same model with the same codebase under the same harness produced 8 clean commits in the prior session. The difference was the class of fix: shape vs calculation. Calculation fixes involve more exploratory code changes and are harder to localise, which apparently pushes the work past the point where the current governance holds.
- Switching to a long-context chat interface and feeding the entire relevant codebase plus reference set produces a coherent whole-module rewrite in one pass. This suggests the bottleneck is Claude Code's incremental tool-use loop, not the underlying model's capability. For specification-heavy work across many cases, the agent's context management is the limit.

Requests

- Consider a built-in "suite must not regress" gate that agents cannot override, not implemented as a user-installed hook. The project-side hook works but depends on the agent cooperating. A gate at the tool layer would be more reliable.
- A session-level progress tracker. If N consecutive tool calls produce no change in a named metric (like a test suite's pass count), the agent should be forced to stop and report rather than continue editing.
- Better primitives for the revert-on-regression pattern. Currently the agent has to self-enforce "if suite regresses, git revert and retry." This is exactly the kind of discipline that breaks down under pressure. A first-class tool like `try_fix(command, verify_suite)` that atomically applies-or-reverts would remove the discretion.
- Documentation guidance: when specification work shifts from shape-layer to calculation-layer, sessions should be split into shorter blocks with mandatory grid checks between them, not run as continuous 2-3 hour sessions.
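
The session-level progress tracker requested above could be approximated on the project side. A minimal sketch, assuming the named metric is an integer pass count sampled after each suite run; the class `ProgressTracker` and its `patience` parameter are hypothetical names, and this version treats "no change" as "no improvement", so a drop in the metric also counts toward the stop condition:

```python
class ProgressTracker:
    """Signal a stop when a metric has not improved for `patience` checks."""

    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.best = None  # best metric value seen so far
        self.stalled = 0  # consecutive checks without improvement

    def record(self, value: int) -> bool:
        """Record a new metric value; return True if the session should stop."""
        if self.best is None or value > self.best:
            self.best = value
            self.stalled = 0
        else:
            self.stalled += 1
        return self.stalled >= self.patience


# Hypothetical session: the pass count stays at 1 across four suite runs.
tracker = ProgressTracker(patience=3)
for pass_count in [1, 1, 1, 1]:
    if tracker.record(pass_count):
        print("No improvement in 3 checks: stop and report")
        break
```

As with the project-side hook, a tracker like this only helps if something outside the agent's discretion acts on the stop signal.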

Environment

- Claude Code on Windows (Thai locale, cp874 default codec).
- Python 3.12, standard scientific stack.
- Harness, rules, hooks, and reference JSONs available if needed for reproduction.

Related

- #51856 fix-fail-propose-new-fix patching pattern (original report)
- #46940, #46945, #47236, #47239, #51430 (related earlier issues)

extent analysis

TL;DR

Implement a "suite must not regress" gate and a session-level progress tracker to prevent the fix-fail-propose-new-fix cascade pattern.

Guidance

- Identify and enforce a strict revert-on-regression policy using a tool like `try_fix(command, verify_suite)` to atomically apply or revert changes.
- Split sessions into shorter blocks with mandatory grid checks between them, especially when working on calculation-layer fixes.
- Consider implementing a session-level progress tracker to stop and report when no change is made in a named metric after N consecutive tool calls.
- Review and refine the current governance and discipline to ensure it holds under pressure, especially when working on complex or exploratory code changes.

Example

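import subprocess

def try_fix(command, verify_suite):
    """Run a fix command, then commit if the suite passes or revert if it regresses."""
    # Assumption: runs inside a git working tree; `command` is an argument list
    # that edits tracked files, and `verify_suite` is a callable returning True
    # when the full suite shows no regressions.
    subprocess.run(command, check=True)
    if verify_suite():
        # Suite is clean: commit the changes to tracked files
        subprocess.run(["git", "commit", "-am", "Apply verified fix"], check=True)
        return True
    # Suite regressed: discard uncommitted changes to tracked files
    # (files newly created by the fix are left untouched)
    subprocess.run(["git", "checkout", "--", "."], check=True)
    return False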

Notes

The current issue is specific to the Claude Code tool and its interaction with the conformance harness and governance rules. The suggested solutions focus on improving the tool's behavior and the user's workflow to prevent the cascade pattern.

Recommendation

Apply a workaround by implementing a "suite must not regress" gate and a session-level progress tracker, as these can be implemented without modifying the underlying codebase. This will help prevent the fix-fail-propose-new-fix cascade pattern and ensure progress is made during sessions.
