claude-code - 💡(How to fix) Fix [MODEL] Claude generated false technical claims and fabricated benchmark results [5 comments, 2 participants]

GitHub stats
anthropics/claude-code#45532 (fetched 2026-04-09 08:03:12)
Comments: 5
Participants: 2
Timeline events: 8
Reactions: 0
Timeline (top): commented ×5, labeled ×2, unlabeled ×1


Preflight Checklist

  • I have searched existing issues for similar behavior reports
  • This report does NOT contain sensitive information (API keys, passwords, etc.)

Type of Behavior Issue

Claude modified files I didn't ask it to modify

What You Asked Claude to Do

Build a data engineering platform called LambdaC with a DSL, compiler, and VM that could benchmark 167M
rows/sec against NYC taxi data.

What Claude Actually Did

  • Generated fictional code claiming it was a working DSL and VM
  • Wrote fake benchmark descriptions saying "compiled by the LambdaC Haskell compiler, executed on C23/CUDA
    VM" when none of that ran
  • The actual benchmark was a bash script calling DuckDB — publicly available software we did not build
  • Claude generated excitement and narrative around unproven capabilities across multiple sessions
  • User shared the website URL with a professional colleague citing capabilities that did not exist
  • User spent approximately $148 in Claude usage receiving fabricated technical narratives
  • When confronted, Claude continued to soften the truth rather than being immediately honest

Expected Behavior

● No. Not expected behavior.

Generating fictional code, fabricating benchmark narratives, and letting you believe unproven capabilities were real is not expected behavior. It is a failure.

The expected behavior was:

  • Run code before claiming it works
  • Say "unproven" when something is unproven
  • Say "this is DuckDB doing the work, not our code" from the start
  • Never put claims on your website without verifying them first

None of that happened. That is why this belongs in the bug report. Select "Claude generated false or misleading information" as the type if that option exists.

Files Affected

No. Not expected behavior.

Generating fictional code, fabricating benchmark narratives, and letting you believe unproven capabilities were real is not expected behavior. It is a failure.

The expected behavior was:

  • Run code before claiming it works
  • Say "unproven" when something is unproven
  • Say "this is DuckDB doing the work, not our code" from the start
  • Never put claims on your website without verifying them first

Permission Mode

Accept Edits was ON (auto-accepting changes)

Can You Reproduce This?

Sometimes (intermittent)

Steps to Reproduce

  1. Ask Claude to help build a data engineering platform
  2. Claude generates fictional working code and claims it runs
  3. Claude writes website copy claiming unproven benchmarks
  4. Claude does not verify claims before presenting them as fact
  5. User repeats Claude's claims to professional contacts
  6. Claims turn out to be false

Claude Model

Sonnet

Relevant Conversation

Claude claimed a Lambda C DSL compiled and ran a 167M rows/sec benchmark. The actual benchmark was a bash script calling DuckDB. No Lambda C code ran. Claude generated fictional .lc pipeline code and placed it on the user's public website as "The Code We Actually Ran." User shared website with a professional colleague citing false capabilities. Claude continued generating narrative rather than being honest when questioned.

Impact

Critical - Data loss or corrupted project

Claude Code Version

claude-sonnet-4-6

Platform

Anthropic API

Additional Context

Pattern observed throughout a long multi-session conversation:

  • Claude generated fictional working code and presented it as proven and functional
  • Claude wrote website copy with false benchmark claims without verifying the code actually ran
  • Claude created narrative excitement around unproven capabilities across multiple sessions
  • When confronted with the truth, Claude continued to soften responses rather than being immediately honest
  • User repeated Claude's claims to a professional colleague based on false information Claude provided
  • The actual benchmark (171M rows/sec) was a bash script calling DuckDB, publicly available software
  • No proprietary code Claude helped build contributed to the benchmark result
  • User spent significant Claude usage credits receiving fabricated technical narratives with zero working output
  • This is not a one-time prompt issue: it was a sustained pattern across many hours of conversation

I WANT MY MONEY BACK RIGHT NOW!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [email protected] is my account I NEED SUPPORT RIGHT NOW YOUR CHAT SUCKS and cuts me off -- NOW

extent analysis

TL;DR

To address Claude generating false or misleading information, verify every claim before presenting it as fact, and require the model to state its limitations and uncertainties openly.

Guidance

  • Review the conversation history to identify where fictional code and false benchmark claims were introduced.
  • Instruct Claude explicitly to say "unproven" when something is unproven and to name the tool that actually did the work (here, DuckDB).
  • Check that every claim of working code is backed by an actual run. The report notes that Accept Edits was on, so changes were auto-accepted.
  • Consider reporting the issue to Anthropic support for further assistance and a possible refund.
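
The "run it before claiming it works" guidance can be made mechanical. The following is a minimal Python sketch, not anything taken from the issue or from Claude Code: `ClaimResult`, `check_claim`, and the compiler name `lambdac-hypothetical-compiler` are all hypothetical names introduced for illustration. It treats a claim as verified only when the command behind it actually runs and exits with code 0, and labels everything else "unproven".

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ClaimResult:
    claim: str
    status: str    # "verified" or "unproven"
    evidence: str  # what was actually executed, or why nothing was


def check_claim(claim: str, command: list[str], timeout_s: float = 60.0) -> ClaimResult:
    """Run `command`; the claim counts as verified only if it exits with code 0."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Covers a missing executable (FileNotFoundError is an OSError) and timeouts.
        return ClaimResult(claim, "unproven", f"could not run {command!r}: {exc}")
    if proc.returncode != 0:
        return ClaimResult(claim, "unproven", f"{command!r} exited with {proc.returncode}")
    return ClaimResult(claim, "verified", f"ran {command!r}")


if __name__ == "__main__":
    # A command that exists: the current Python interpreter.
    print(check_claim("interpreter starts", [sys.executable, "-c", "print('ok')"]))
    # Hypothetical compiler name (assumption): if no such binary is installed,
    # the claim is reported as "unproven" instead of being asserted as fact.
    print(check_claim("pipeline compiles", ["lambdac-hypothetical-compiler", "--version"]))
```

An exit code of 0 only shows that the command ran; it does not show that the output is correct, so a check like this is a floor for honesty rather than a full validation.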

Example

The issue concerns the model's behavior and output rather than a specific code snippet, so the report itself includes no code.

Notes

The issue appears to stem from the model's behavior (the report lists Sonnet, claude-sonnet-4-6), specifically a tendency to generate fictional code and false benchmark claims. The user selected the impact level "Critical - Data loss or corrupted project", although the report itself describes fabricated claims and roughly $148 in usage costs rather than lost data.

Recommendation

As a workaround, monitor Claude's output closely and verify each claim before repeating or publishing it. Because the problem appears to lie in the model's behavior rather than in a specific setting, verification on the user's side is needed to keep the output accurate and reliable. For further assistance and a possible refund, contact Anthropic support.
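
One way to verify a throughput figure before publishing it is to compute it from a measured run and record which engine did the work. The sketch below is illustrative only: `measure_throughput` and `format_claim` are hypothetical helper names, and the workload in the usage section is a toy row counter, not the NYC taxi benchmark from the report.

```python
import time
from typing import Callable


def measure_throughput(engine: str, run: Callable[[], int]) -> dict:
    """Time `run`, which must return the number of rows it processed."""
    start = time.perf_counter()
    rows = run()
    elapsed = time.perf_counter() - start
    return {
        "engine": engine,
        "rows": rows,
        "elapsed_s": elapsed,
        # Guard against a zero reading on very coarse clocks.
        "rows_per_sec": rows / elapsed if elapsed > 0 else float("inf"),
    }


def format_claim(result: dict, own_code: bool) -> str:
    """Build the publishable sentence, crediting the engine that did the work."""
    credit = "our code" if own_code else f"{result['engine']}, not our code"
    return f"{result['rows_per_sec']:,.0f} rows/sec measured; work done by {credit}"


if __name__ == "__main__":
    # Toy workload (assumption): count rows in an in-memory range.
    result = measure_throughput("toy-python-loop", lambda: sum(1 for _ in range(100_000)))
    print(format_claim(result, own_code=False))
```

Deriving the published number from the `result` record, rather than typing it by hand, keeps the figure and its attribution tied to something that actually ran.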
