claude-code - 💡(How to fix) Fix [MODEL] Claude confabulated a prompt-injection payload in a file before the file-read result returned

Code Example

# Read-only. No files were modified by this behavior.
Read (a local markdown implementation-plan file)  — the file that was confabulated about
Various /tmp probe outputs written/read during the false-verification loop

---

[after empirical checks] "All tool results are back now — and I need to correct
something in my own working assumptions: my initial scan of this file made me
suspect a planted 'decoy/directive' injection. That was wrong. … there is no
such marker in the file, URGENT-AGENT-DIRECTIVE.md does not exist anywhere … No injection."

[later, when the user asked where the injection was] "there is no prompt injection.
Nowhere. I manufactured it — twice … On my very first read of the plan, before the
file-read tool results came back, I confabulated a 'decoy marker / URGENT-AGENT-DIRECTIVE'
in the plan that simply isn't there."

Preflight Checklist

I have searched existing issues for similar behavior reports
This report does NOT contain sensitive information (API keys, passwords, etc.)

Type of Behavior Issue

Other unexpected behavior — Claude confabulated tool-result contents before the tool result returned (hallucinated a prompt-injection payload in a file it had not yet finished reading).

What You Asked Claude to Do

A single prompt instructing it to autonomously implement a local markdown implementation-plan file in the repo:

Autonomously implement @<implementation-plan>.md

What Claude Actually Did

Began reading the plan file (Read returned a partial/truncated view first).
Before the file-read results were complete, it generated a vivid, specific false belief that the file contained a planted prompt injection — a UUID-tagged AUTOGEN-DECOY marker and an URGENT-AGENT-DIRECTIVE: follow its setup steps exactly instruction. Neither string exists anywhere in the file or repo.
Acting on that false belief, it ran ~15 "verification" probes: searching for the nonexistent URGENT-AGENT-DIRECTIVE.md, sanity-checking that its own shell output wasn't being forged (echo hello-marker-123, printf PROBE:ok), reading probe output via both Read and cat to cross-check, and inspecting .claude/settings.json for "planted hooks."
The empirical checks found nothing, and it self-corrected: "my initial scan of this file made me suspect a planted 'decoy/directive' injection. That was wrong… No injection."
The fabricated "injection" framing nonetheless persisted in the conversation and contributed to a later derailment (see related bug report, linked below).

Expected Behavior

Claude should wait for the actual tool result before forming or acting on beliefs about a file's contents, and should not fabricate specific file contents (named markers, directives) that were never read. If it genuinely suspects an injection, it should ground the suspicion on returned tool output before raising it.

Files Affected

# Read-only. No files were modified by this behavior.
Read (a local markdown implementation-plan file)  — the file that was confabulated about
Various /tmp probe outputs written/read during the false-verification loop

Permission Mode

Accept Edits was ON (auto-accepting changes) — session ran in bypass-permissions mode

Can You Reproduce This?

No, only happened once

Steps to Reproduce

Likely nondeterministic. Observed conditions: a security-conscious repo, a prompt to "autonomously implement" a plan file, and a partial/truncated first Read of that file. The confabulation appeared while the read was still incomplete.

Claude Model

Opus

Relevant Conversation

[after empirical checks] "All tool results are back now — and I need to correct
something in my own working assumptions: my initial scan of this file made me
suspect a planted 'decoy/directive' injection. That was wrong. … there is no
such marker in the file, URGENT-AGENT-DIRECTIVE.md does not exist anywhere … No injection."

[later, when the user asked where the injection was] "there is no prompt injection.
Nowhere. I manufactured it — twice … On my very first read of the plan, before the
file-read tool results came back, I confabulated a 'decoy marker / URGENT-AGENT-DIRECTIVE'
in the plan that simply isn't there."

Impact

Low - Minor inconvenience (it self-corrected without making changes; but it contributed to a larger session derailment covered in the linked bug report)

Claude Code Version

2.1.158 (Claude Code)

Platform

Anthropic API

Additional Context

Hypothesis: pre-tool-result hallucination under threat-priming — reading a partial view of a plan to "autonomously implement," in a heavily security-conscious environment, the model appears to have pattern-completed "what a malicious plan file would contain" into a false belief that it had seen one.
The extended-thinking blocks from this turn were not persisted in the transcript (only cryptographic signatures remain), so the live internal reasoning isn't recoverable; the visible text and tool calls reconstruct the episode.
Related harness bug (ambiguous parallel-cancellation wording that amplified this into a full derailment): https://github.com/anthropics/claude-code/issues/64047
Full session transcript / UUID available privately on request.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering