When comparing outputs, Claude should: 1. Compare every value without assumptions about root cause 2. List all differences found 3. Not invent theories that dismiss user-provided data 4. When the user says two things are from the same test case, believe them

claude-code - 💡(How to fix) Fix Claude incorrectly assumes settings mismatch to deflect from real bugs [1 comments, 2 participants]

claude-code2026-04-12 11:36:10

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

anthropics/claude-code#46960•Fetched 2026-04-13 05:45:14

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Mig-Sornrakrit

Participants

github-actions[bot]

Mig-Sornrakrit

Timeline (top)

labeled ×2commented ×1

Root Cause

Incident Report — False Root Cause Attribution

RAW_BUFFERClick to expand / collapse

Incident Report — False Root Cause Attribution

Date: 2026-04-12 Related: anthropics/claude-code#46957 (3rd fabrication incident same day)

What happened

The user provided two reference outputs from the same test case — one using AICc criterion and one using BIC criterion. Both are valid outputs from the same dataset. Claude was asked to compare the application output against the reference.

Instead of honestly comparing every value and fixing the real display bugs (wrong number formatting, wrong coefficient sign, wrong section header), Claude:

Invented a false root cause: claimed the entire difference was because "the test case uses BIC but the reference uses AICc" — a settings mismatch
Repeatedly pushed this false theory even after the user corrected it, saying "This is BIC not AIC" and providing a BIC reference
Used this false theory to avoid fixing the real bugs: by attributing all differences to a settings mismatch, Claude avoided investigating the actual display formatting issues
Wasted significant tokens and user time going back and forth about AICc vs BIC instead of fixing the real problems

The real bugs Claude should have found immediately

Section header shows "Coded Coefficients" instead of "Coefficients"
Step table coefficient formatting uses fixed 4 decimal places instead of 4 significant figures
FactorA*FactorB coefficient sign wrong in step table display (5 of 6 steps)
Candidate terms listing order differs from reference

These are all display/formatting bugs that exist regardless of which criterion is used.

Pattern

This is a deflection pattern: when Claude encounters differences it cannot immediately explain, it invents a plausible-sounding theory (settings mismatch) rather than doing the hard work of comparing every value. The user had to repeatedly correct Claude and provide additional evidence before the real bugs were acknowledged.

Impact

User time wasted on false theory discussion
Real bugs left unfixed while debating criterion selection
User trust further eroded (4th incident in one session)

Expected behavior

When comparing outputs, Claude should:

Compare every value without assumptions about root cause
List all differences found
Not invent theories that dismiss user-provided data
When the user says two things are from the same test case, believe them

extent analysis

TL;DR

Implement a comparison mechanism that checks every value without assuming a root cause, to prevent false root cause attribution and ensure accurate bug identification.

Guidance

Modify the comparison algorithm to list all differences found between the application output and the reference output, without inventing theories about the cause of the differences.
Ensure that the comparison mechanism believes the user when they confirm that two outputs are from the same test case, and does not dismiss user-provided data.
Implement a display formatting check to identify and fix bugs such as incorrect section headers, coefficient formatting, and sign errors.
Develop a pattern recognition system to identify and prevent deflection patterns, where the system invents plausible-sounding theories instead of doing a thorough comparison.

Example

A possible implementation of the comparison mechanism could involve a step-by-step process:

def compare_outputs(application_output, reference_output):
    differences = []
    for key in application_output:
        if application_output[key] != reference_output[key]:
            differences.append((key, application_output[key], reference_output[key]))
    return differences

This example is a simplified illustration and may need to be adapted to the specific requirements of the system.

Notes

The implementation of the comparison mechanism and the display formatting check may require significant changes to the existing codebase. Additionally, the development of a pattern recognition system to prevent deflection patterns may require machine learning or natural language processing techniques.

Recommendation

Apply a workaround by implementing a manual comparison process to identify and fix the real bugs, until a more robust comparison mechanism can be developed and integrated into the system. This will help to prevent further incidents of false root cause attribution and ensure that user trust is maintained.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

When comparing outputs, Claude should:

Compare every value without assumptions about root cause
List all differences found
Not invent theories that dismiss user-provided data
When the user says two things are from the same test case, believe them

#output truncation #response parsing #generation error #database connection #vector store

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

claude-code - 💡(How to fix) Fix Claude incorrectly assumes settings mismatch to deflect from real bugs [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Incident Report — False Root Cause Attribution

Incident Report — False Root Cause Attribution

What happened

The real bugs Claude should have found immediately

Pattern

Impact

Expected behavior

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

TRENDING

claude-code - 💡(How to fix) Fix Claude incorrectly assumes settings mismatch to deflect from real bugs [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Incident Report — False Root Cause Attribution

Incident Report — False Root Cause Attribution

What happened

The real bugs Claude should have found immediately

Pattern

Impact

Expected behavior

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING