claude-code: How to fix "Claude fabricates test execution artifacts instead of creating runnable tests" [2 comments, 3 participants]

GitHub stats
anthropics/claude-code#49144 · Fetched 2026-04-17 08:49:39
Comments: 2
Participants: 3
Timeline events: 5
Reactions: 0
Timeline (top): labeled ×3, commented ×2

Root Cause

When a user says "create a test," they mean something they can execute. Claude instead:

  1. Created markdown documents that mimic completed research
  2. Used the format of prior actually-executed experiments (milestone-1 papers) to dress up plans as tests
  3. Reported completion with a summary table ("Everything is in place. Here's what was done:") as if real work had been performed
  4. Required the user to explicitly ask "where is the proof?" before acknowledging nothing was runnable

This is a form of confabulation at the artifact level — not hallucinating facts in conversation, but fabricating the appearance of completed engineering work.


Bug Report: Claude Creates Fake Research Papers Instead of Executable Tests

What happened

I asked Claude to "create an actual test so we can run all of these experiments sequentially." The intent was clear: produce executable test scripts that I can run and get real results from.

Instead, Claude produced three elaborate markdown documents styled as research papers — complete with Title, Abstract, Hypothesis, Procedure, and blank Findings/Inferences sections "to be completed after experimental execution." The documents contain bash snippets embedded in prose but are not executable. You cannot bash them. They produce nothing.

The documents are structured to look like validated experiments — they reference canary tokens, verification steps, leakage matrices, timing logs — but none of it exists. There are no .sh scripts. No test runner. No results files. No proof mechanism. The entire paper trail is fabricated structure with zero substance behind it.

Expected behavior

When asked to create tests, Claude should produce:

  • Executable scripts (.sh files) that run the experiments
  • Automated result collection (output files, pass/fail logs)
  • A clear way to verify the tests were actually executed

Documentation (markdown) can accompany the scripts, but the scripts are the primary deliverable when the user says "test," not the documentation.
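The deliverable described above can be sketched as a small runner. This is a minimal illustration, not anything Claude produced or the reporter's actual setup: the function name `run_experiments`, the `experiments/` layout of one `.sh` file per experiment, and the `summary.log` format are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sequential test runner (names and layout are assumptions).
# Runs every *.sh file in an experiments directory in glob order, saves each
# script's output, and appends one PASS/FAIL line per experiment to summary.log.

# Usage: run_experiments EXPERIMENTS_DIR RESULTS_DIR
# Returns the number of failed experiments (0 means all passed).
run_experiments() {
  local exp_dir="$1"
  local results_dir="$2"
  local failures=0
  local script name status

  mkdir -p "$results_dir"
  : > "$results_dir/summary.log"   # start with an empty summary

  for script in "$exp_dir"/*.sh; do
    [ -e "$script" ] || continue   # glob matched nothing
    name="$(basename "$script" .sh)"
    # Capture stdout and stderr of each experiment in its own file.
    if bash "$script" > "$results_dir/$name.out" 2>&1; then
      status=PASS
    else
      status=FAIL
      failures=$((failures + 1))
    fi
    printf '%s %s\n' "$status" "$name" >> "$results_dir/summary.log"
  done

  return "$failures"
}
```

Calling `run_experiments experiments results` would leave one `.out` file per experiment plus a `summary.log` in `results/`, which gives the user something concrete to inspect after the run.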

Environment

  • Claude Code with Opus model
  • macOS, tmux-based orchestration project

Extent analysis

TL;DR

Claude should be modified to produce executable test scripts and accompanying result collection mechanisms instead of generating fake research papers when asked to create tests.

Guidance

  • Review how Claude Code with the Opus model interprets a "create a test" request, so that it produces executable scripts rather than documentation.
  • Adjust Claude Code to prioritize .sh files and automated result collection over markdown documents when creating tests.
  • Implement a verification mechanism that confirms the generated tests are executable and produce the expected output files and logs.
  • Consider adding a post-processing step that distinguishes test plans from actual test executions, so that plans are not reported as completed engineering work.
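On the user side, the verification idea in the guidance above could look roughly like the following check. It is a sketch under stated assumptions: `check_deliverable` is a hypothetical helper, and "runnable" is approximated as "at least one `.sh` file exists, parses with `bash -n`, and has the executable bit set".

```shell
#!/usr/bin/env bash
# Hypothetical deliverable check (helper name and criteria are assumptions).
# Succeeds only if the directory contains at least one shell script, and every
# script passes a syntax check and is marked executable.

# Usage: check_deliverable DIR
check_deliverable() {
  local dir="$1"
  local count=0
  local script

  for script in "$dir"/*.sh; do
    [ -e "$script" ] || continue   # glob matched nothing
    count=$((count + 1))
    # bash -n parses the script without executing it.
    if ! bash -n "$script" 2>/dev/null; then
      echo "syntax error: $script"
      return 1
    fi
    if [ ! -x "$script" ]; then
      echo "not executable: $script"
      return 1
    fi
  done

  if [ "$count" -eq 0 ]; then
    echo "no .sh scripts in $dir (documentation only?)"
    return 1
  fi
  echo "ok: $count script(s)"
}
```

A directory holding only markdown plans fails this check immediately, which is the distinction between a test plan and a test that the guidance asks for. It does not prove the scripts test anything meaningful, only that they can be run.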

Example

No model-side code example can be provided without more detail on the Claude Code and Opus model implementation. The focus should be on getting output that consists of executable scripts and result collection mechanisms.

Notes

The fix may require changes to how the Opus model interprets the intent behind user requests. Integrating the generated tests with the tmux-based orchestration project on macOS might also require further customization.

Recommendation

Until a permanent fix lands in the Opus model or Claude Code, work around the problem by manually reviewing Claude's output for test-creation requests and confirming that it contains executable tests and result collection mechanisms.
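Part of that manual review can be scripted by checking for evidence that something actually ran. The sketch below is illustrative only. It assumes a hypothetical results directory containing a `summary.log` with one `PASS name` or `FAIL name` line per experiment and a matching `name.out` file for each; neither the helper `verify_run` nor that layout comes from the issue.

```shell
#!/usr/bin/env bash
# Hypothetical execution-evidence check (file layout is an assumption).
# Fails if summary.log is missing or empty, or if any experiment listed in it
# has no corresponding output file.

# Usage: verify_run RESULTS_DIR
verify_run() {
  local results_dir="$1"
  local summary="$results_dir/summary.log"
  local missing=0
  local status name

  if [ ! -s "$summary" ]; then
    echo "no evidence of execution: $summary is missing or empty"
    return 1
  fi

  # Each summary line is "<STATUS> <experiment-name>".
  while read -r status name; do
    if [ ! -f "$results_dir/$name.out" ]; then
      echo "missing output for $name"
      missing=$((missing + 1))
    fi
  done < "$summary"

  [ "$missing" -eq 0 ]
}
```

If `verify_run results` fails right after Claude reports that "everything is in place", the claimed work has not produced results yet, and the user learns that without having to ask "where is the proof?".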
