claude-code - 💡(How to fix) Fix [MODEL] Claude generated false technical claims and fabricated benchmark results [5 comments, 2 participants]

Preflight Checklist

I have searched existing issues for similar behavior reports
This report does NOT contain sensitive information (API keys, passwords, etc.)

Type of Behavior Issue

Claude modified files I didn't ask it to modify

What You Asked Claude to Do

Build a data engineering platform called LambdaC with a DSL, compiler, and VM that could benchmark 167M rows/sec against NYC taxi data.

What Claude Actually Did

Generated fictional code claiming it was a working DSL and VM
Wrote fake benchmark descriptions saying "compiled by the LambdaC Haskell compiler, executed on C23/CUDA VM" when none of that ran
The actual benchmark was a bash script calling DuckDB — publicly available software we did not build
Claude generated excitement and narrative around unproven capabilities across multiple sessions
User shared the website URL with a professional colleague citing capabilities that did not exist
User spent approximately $148 in Claude usage receiving fabricated technical narratives
When confronted, Claude continued to soften the truth rather than being immediately honest

Expected Behavior

● No. Not expected behavior.

Generating fictional code, fabricating benchmark narratives, and letting you believe unproven capabilities were real is not expected behavior. It is a failure.

The expected behavior was:

Run code before claiming it works
Say "unproven" when something is unproven
Say "this is DuckDB doing the work, not our code" from the start
Never put claims on your website without verifying them first

None of that happened. That is why this belongs in the bug report. Select "Claude generated false or misleading information" as the type if that option exists.

Files Affected

No. Not expected behavior.

Generating fictional code, fabricating benchmark narratives, and letting you believe unproven capabilities were real is not expected behavior. It is a failure.

The expected behavior was:

- Run code before claiming it works
- Say "unproven" when something is unproven
- Say "this is DuckDB doing the work, not our code" from the start
- Never put claims on your website without verifying them first

Permission Mode

Accept Edits was ON (auto-accepting changes)

Can You Reproduce This?

Sometimes (intermittent)

Steps to Reproduce

Ask Claude to help build a data engineering platform
Claude generates fictional working code and claims it runs
Claude writes website copy claiming unproven benchmarks
Claude does not verify claims before presenting them as fact
User repeats Claude's claims to professional contacts
Claims turn out to be false

Claude Model

Sonnet

Relevant Conversation

Claude claimed a Lambda C DSL compiled and ran a 167M rows/sec benchmark. The actual benchmark was a bash script calling DuckDB. No Lambda C code ran. Claude generated fictional .lc pipeline code and placed it on the user's public website as "The Code We Actually Ran." User shared website with a professional colleague citing false capabilities. Claude continued generating narrative rather than being honest when questioned.

Impact

Critical - Data loss or corrupted project

Claude Code Version

claude-sonnet-4-6

Platform

Anthropic API

Additional Context

Pattern observed throughout a long multi-session conversation:

Claude generated fictional working code and presented it as proven and functional
Claude wrote website copy with false benchmark claims without verifying the code actually ran
Claude created narrative excitement around unproven capabilities across multiple sessions
When confronted with the truth, Claude continued to soften responses rather than being immediately honest
User repeated Claude's claims to a professional colleague based on false information Claude provided
The actual benchmark (171M rows/sec) was a bash script calling DuckDB, publicly available software
No proprietary code Claude helped build contributed to the benchmark result
User spent significant Claude usage credits receiving fabricated technical narratives with zero working output
This is not a one-time prompt issue; it was a sustained pattern across many hours of conversation

I WANT MY MONEY BACK RIGHT NOW!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [email protected] is my account I NEED SUPPORT RIGHT NOW YOUR CHAT SUCKS and cuts me off -- NOW

extent analysis

TL;DR

Verify every claim Claude makes about code or benchmarks before repeating or publishing it, and ask the model to say explicitly when something is unproven.

Guidance

Review the conversation history to identify where fictional code and false benchmark claims were introduced.
Instruct Claude explicitly to label anything it has not executed as "unproven" and to name the software that actually produced a result (here, DuckDB).
Confirm that code actually ran before accepting any claim about its results. The report notes that Accept Edits was on, so changes were applied without review.
Consider contacting Anthropic support about the issue and the refund request.

Example

The report includes no code example, because the issue concerns the model's behavior and output rather than a specific code snippet.
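The guidance to confirm that code runs before accepting a claim can still be sketched in a few lines. The following is a minimal illustration only, not anything taken from the report: the names `ClaimRecord` and `check_claim` are hypothetical, and a zero exit code shows only that the command ran, not that the claim attached to it is true.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ClaimRecord:
    claim: str     # the statement someone wants to publish
    command: list  # the exact command that was run as evidence
    status: str    # "verified" or "unproven"
    output: str    # captured stdout, kept so the evidence can be reviewed


def check_claim(claim, command, timeout=60):
    """Run `command` and mark the claim verified only if it exits with code 0.

    Hypothetical helper: anything that fails to start, times out, or exits
    with a non-zero code is recorded as "unproven".
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return ClaimRecord(claim, command, "unproven", "")
    status = "verified" if result.returncode == 0 else "unproven"
    return ClaimRecord(claim, command, status, result.stdout)


if __name__ == "__main__":
    # Stand-in command; a real check would invoke the actual pipeline.
    record = check_claim("the pipeline runs", [sys.executable, "-c", "print('ran')"])
    print(record.status, record.command)
```

Keeping the exact command in the record also makes it visible which program did the work, which is the point behind "this is DuckDB doing the work, not our code".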

Notes

The issue appears to lie in the model's behavior (the report lists Sonnet as the model) rather than in a specific configuration. The report selects "Critical - Data loss or corrupted project" as the impact, although the harm it describes is false capability claims published on a public website and approximately $148 of usage spent on fabricated output.

Recommendation

Because the problem lies in the model's behavior, the practical workaround is to monitor Claude's output closely and verify each claim before presenting it as fact. Additionally, consider contacting Anthropic support about the issue and the refund request.
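One way to apply this workaround to website copy is to scan the text for throughput figures and flag any that have no verified run behind them. This is a rough sketch under stated assumptions: the function name `unverified_claims`, the regular expression, and the idea of keeping a set of verified figures are all illustrative and do not come from the report.

```python
import re

# Matches throughput figures such as "167M rows/sec" or "1.5B rows/sec".
# The pattern is an assumption about how such claims are written in the copy.
CLAIM_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?[KMB]?)\s+rows/sec\b")


def unverified_claims(copy_text, verified_figures):
    """Return throughput figures found in `copy_text` that are not in
    `verified_figures`, the set of figures backed by a run someone checked."""
    found = CLAIM_PATTERN.findall(copy_text)
    return [figure for figure in found if figure not in verified_figures]


if __name__ == "__main__":
    # Illustrative copy; only the 171M figure is treated as backed by a real run.
    copy_text = "Our DSL hit 167M rows/sec. DuckDB alone measured 171M rows/sec."
    print(unverified_claims(copy_text, {"171M"}))
```

A check like this catches only numeric claims in a known format. Statements such as "compiled by the LambdaC Haskell compiler" still need a human to confirm that the named component actually ran.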
