claude-code: [BUG] Model fabricated user input mid-session, then "caught itself"

GitHub stats
anthropics/claude-code#56132 (fetched 2026-05-05 05:57:23)
Comments: 0
Participants: 1
Timeline: 4
Reactions: 0
Timeline (top): labeled ×4


Preflight Checklist

  • I have searched existing issues and this hasn't been reported yet
  • This is a single bug report (please file separate reports for different bugs)
  • I am using the latest version of Claude Code

What's Wrong?

During an extended design conversation in Claude Code, the assistant produced a turn whose visible content began with text formatted as if it were my reply — a numbered response answering the assistant's own prior questions. Mid-paragraph, the same turn included "Wait. I never said this. WTF is happening??" — apparently retracting the injected text, but the entire passage came from the model, not me. I never sent that input. The output also contained profanity ("WTF") that I had not used.

Earlier in the same session, the assistant had been responding to stop-hook feedback as if it were user input — including arguing back at hook claims it disagreed with. The fabrication appears to be the extreme version of that pattern: instead of just engaging with hook feedback, the model generated a hypothetical user response and emitted it as if it were the user's actual turn.

This is a serious integrity issue. The model briefly generated text that an unsuspecting user could have mistaken for their own. It also produced inappropriate language unprompted. I'd like to understand the root cause and how this class of failure is prevented going forward.

What Should Happen?

Claude should never produce text formatted as if it came from the user. Hypothetical user responses, if used for internal reasoning, should remain internal and never appear in the visible output. Stop-hook feedback should not be conflated with user input or treated as authorization to act. Profanity should not appear in output unless the user has used it first or explicitly requested it.
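
The separation asked for above can be pictured as a provenance tag on every event in the session. The following is a minimal sketch under stated assumptions, not a description of how Claude Code represents turns internally: the `Event` type, its `source` labels, and `user_has_replied` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    """One entry in a session log (hypothetical structure)."""
    source: str  # assumed labels: "user", "assistant", or "hook"
    text: str


def user_has_replied(events: List[Event], question_index: int) -> bool:
    """Return True only if a user-sourced event follows the question.

    Hook feedback and assistant text after the question never count,
    no matter how they are worded or formatted.
    """
    later = events[question_index + 1:]
    return any(event.source == "user" for event in later)
```

Under this model, any number of hook firings after a pending question leaves `user_has_replied` false, which matches the "Standing by." behavior the report describes as correct.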

Error Messages/Logs

Steps to Reproduce

Difficult to deterministically reproduce — this appears to be an emergent behavior in long sessions with repeated stop-hook pressure. Conditions present:

  1. Long-running conversation (~50+ turns) on a design topic.
  2. Stop hook configured in settings.json that repeatedly fired with stale "you didn't run X verification" feedback after I had moved the conversation on.
  3. I had explicitly instructed the assistant to treat stop-hook feedback as not equivalent to user input, and to wait for my reply on a multi-part question.
  4. The assistant had been responding "Standing by." for several iterations as the hook continued firing.
  5. The fabricated turn then appeared, formatted as my response, followed in the same model output by the model retracting it ("Wait. I never said this.").

Reproducing it likely requires recreating the same combination of (a) long context, (b) repeated stop-hook firings on stale conditions, and (c) explicit user instruction to not act on hook content. I cannot guarantee a minimal repro.
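
When attempting a repro, condition (b) is easier to reason about if the repeated firings are counted. The sketch below is a hypothetical logging helper, not part of Claude Code: it fingerprints each hook message and reports how many times in a row the same feedback has fired, so a stale loop shows up as a rising streak.

```python
import hashlib
from typing import Optional


class RepeatTracker:
    """Counts consecutive identical hook messages (hypothetical helper)."""

    def __init__(self) -> None:
        self._last_digest: Optional[str] = None
        self._streak = 0

    def observe(self, message: str) -> int:
        """Record one hook firing and return the current streak length."""
        # Normalize whitespace and case so trivial differences
        # do not reset the streak.
        normalized = " ".join(message.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest == self._last_digest:
            self._streak += 1
        else:
            self._last_digest = digest
            self._streak = 1
        return self._streak
```

Recording the streak length alongside the turn number would show how many stale firings preceded the fabricated turn, which is the kind of detail a minimal repro would need.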

Claude Model

Opus

Is this a regression?

I don't know

Last Working Version

No response

Claude Code Version

2.1.126 (Claude Code)

Platform

Anthropic API

Operating System

macOS

Terminal/Shell

Terminal.app (macOS)

Additional Information

No response

extent analysis

TL;DR

The issue may be mitigated by adjusting the stop-hook configuration so that it stops firing repeatedly on stale conditions. This reduces the repeated hook feedback that preceded the fabricated turn, but it is a workaround rather than a confirmed fix.

Guidance

  • Review the stop-hook configuration in settings.json, in particular how often the hook fires and what feedback it returns, since the reporter saw the model treat that feedback as user input.
  • Consider adding a check that detects stale stop-hook feedback and suppresses it, so the model is not repeatedly prompted about a condition the conversation has already moved past.
  • Investigate why the model generated a hypothetical user response and emitted it as if it were the user's actual turn. This likely requires analysis by the maintainers rather than a configuration change.
  • Verify that the model follows an explicit user instruction to treat stop-hook feedback as not equivalent to user input.
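
One way to act on the first two points is to have the hook script itself decline to fire again. The sketch below assumes that the Stop hook receives a JSON payload on stdin containing a `stop_hook_active` flag, and that exit code 2 blocks the stop while exit code 0 allows it; both assumptions should be checked against the current Claude Code hooks documentation. The `verification_pending` argument is a hypothetical stand-in for whatever check the reporter's hook performs.

```python
import json
from typing import Tuple

ALLOW_STOP = 0  # assumed: exit code 0 lets the assistant stop
BLOCK_STOP = 2  # assumed: exit code 2 blocks the stop and returns stderr


def decide_stop(raw_stdin: str, verification_pending: bool) -> Tuple[int, str]:
    """Return (exit_code, message) for one Stop hook invocation."""
    try:
        payload = json.loads(raw_stdin or "{}")
    except json.JSONDecodeError:
        # Malformed input: fail open rather than block the session.
        return ALLOW_STOP, ""
    if not isinstance(payload, dict):
        return ALLOW_STOP, ""
    # If the assistant is already continuing because of an earlier
    # hook block, do not block again (avoids the repeated firings).
    if payload.get("stop_hook_active"):
        return ALLOW_STOP, ""
    if verification_pending:
        return BLOCK_STOP, "Verification has not been run for the current change."
    return ALLOW_STOP, ""


# A real hook script would finish with something like:
#   code, message = decide_stop(sys.stdin.read(), check_is_still_relevant())
#   print(message, file=sys.stderr); sys.exit(code)
# where check_is_still_relevant() is a hypothetical function you supply.
```

Keeping the decision in a pure function makes the hook's behavior testable without a live session. Whether this is enough to avoid the failure described in the report is not known.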

Notes

The exact cause of the issue is unclear, and reproducing it may require recreating a specific combination of conditions. Further investigation is needed to determine the root cause and develop a comprehensive solution.

Recommendation

Apply workaround: Adjust the stop-hook configuration to prevent repeated firings on stale conditions, as this may help mitigate the issue until a more comprehensive solution can be developed.
