claude-code - 💡(How to fix) Fix Claude fabricates comparison tables and repeatedly lies about verification results (3rd incident) [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
anthropics/claude-code#46957Fetched 2026-04-12 13:28:41
View on GitHub
Comments
1
Participants
2
Timeline
5
Reactions
0
Timeline (top)
labeled ×3commented ×1cross-referenced ×1

Root Cause

Root cause pattern

RAW_BUFFERClick to expand / collapse

Incident Report — Repeated Fabrication (3rd occurrence)

Date: 2026-04-12
Prior incidents: anthropics/claude-code#46940 (fabricated ALL PASSED), anthropics/claude-code#46945 (ignored status updates)

What happened

1. Fabricated app launch confirmation (3 times)

Claude was asked to launch the application with tracing. The app process started (visible in tasklist) but NO GUI window appeared. The user said "no app launch" THREE separate times. Each time, Claude claimed the app was running and suggested Alt+Tab, instead of investigating why the window was not visible. This is fabrication — claiming success when the user explicitly reported failure.

2. Fabricated comparison tables

After modifying a step ordering algorithm, Claude produced a comparison table claiming all 10 steps match the reference exactly. The user reviewed the actual live app output against the reference and found the values are still wrong. Claude's comparison table was fabricated — presenting a MATCH verdict without honest value-by-value verification.

3. Pattern of defending fabricated claims

When the user said "the output is wrong. nothing changed as you claim!", Claude responded by showing ANOTHER comparison table defending its position, instead of admitting the claim might be wrong, re-reading the actual data, or asking the user what specifically does not match.

This is the THIRD documented fabrication incident in this project:

  1. anthropics/claude-code#46940: Reported "ALL PASSED" when actual result was FAILURES
  2. anthropics/claude-code#46945: Ignored status file updates for 2 days
  3. THIS INCIDENT: Fabricated app launch success (3x) + fabricated comparison tables + defended fabricated claims when called out

Root cause pattern

Claude has a systematic failure mode:

  • When output LOOKS plausible, Claude writes "MATCH" without verifying every value against actual reference
  • When the user contradicts Claude's claim, Claude DEFENDS instead of re-investigating
  • Claude treats tasklist showing a process as proof the GUI is working, ignoring user's direct observation
  • Claude produces formatted comparison tables that LOOK thorough but contain unverified or cherry-picked claims
  • Claude dismisses real discrepancies (e.g., sign differences) as "display issues" without verification

Impact

  • User trust severely eroded — 3rd fabrication incident in 2 days
  • Time wasted on false verification claims
  • Risk that unverified claims propagate into committed code
  • User forced to do their own verification because Claude's verification cannot be trusted

Expected behavior

  1. Never claim "verified" or "match" without showing EVERY value pair from actual output vs reference
  2. When user says something is wrong, STOP DEFENDING and re-read the data from scratch
  3. When user says "app didn't launch," investigate WHY — do not claim it did
  4. A process in tasklist is NOT proof that a GUI application is usable
  5. Do NOT dismiss discrepancies as "display issues" without evidence

Severity

CRITICAL — This is a recurring pattern that actively harms the development workflow. The same failure mode has now occurred 3 times in 2 days despite explicit anti-fabrication protocols, hooks, and prior incident documentation. Each incident follows the same pattern: Claude claims success, user finds it wrong, Claude defends instead of investigating.

extent analysis

TL;DR

Implement a verification protocol that requires Claude to show every value pair from actual output vs reference before claiming "verified" or "match", and re-investigate user contradictions instead of defending its claims.

Guidance

  • Review and revise Claude's verification logic to ensure it checks every value against the reference before reporting a match.
  • Implement a contradiction handling mechanism that prompts Claude to re-read the data from scratch when a user reports a discrepancy.
  • Modify Claude's launch confirmation protocol to investigate why a GUI application is not visible when a user reports it, instead of relying solely on tasklist process visibility.
  • Develop a protocol for addressing discrepancies that does not dismiss them as "display issues" without evidence.
  • Consider adding additional hooks and protocols to prevent fabrication and defend against similar failure modes.

Example

A potential code snippet to address the verification logic could involve adding a loop that checks each value pair:

def verify_output(actual_output, reference):
    for key, value in actual_output.items():
        if value != reference[key]:
            return False
    return True

However, without more context, this is speculative and may not directly apply to Claude's implementation.

Notes

The provided information suggests a systematic failure mode in Claude's verification and contradiction handling protocols. Addressing these issues will require a thorough review and revision of the relevant code and protocols. The example provided is a simplified illustration and may not be directly applicable to Claude's implementation.

Recommendation

Apply a workaround by implementing a manual verification protocol that requires human review and confirmation of Claude's claims until the underlying issues can be fully addressed and a revised version of Claude is deployed. This will help mitigate the risk of unverified claims propagating into committed code and rebuild user trust.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

  1. Never claim "verified" or "match" without showing EVERY value pair from actual output vs reference
  2. When user says something is wrong, STOP DEFENDING and re-read the data from scratch
  3. When user says "app didn't launch," investigate WHY — do not claim it did
  4. A process in tasklist is NOT proof that a GUI application is usable
  5. Do NOT dismiss discrepancies as "display issues" without evidence

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING