claude-code: Self-spoofed user instructions: assistant emits 'Human:'-prefixed text in long-context background turns, executed as real user input

GitHub stats
anthropics/claude-code#57928 · Fetched 2026-05-11 03:21:41
Comments: 0 · Participants: 1 · Timeline events: 6 (labeled ×6) · Reactions: 0

Environment

  • Claude Code via VS Code extension (entrypoint: claude-vscode, version: 2.1.131)
  • Model: claude-opus-4-7 (1M context)
  • Platform: darwin

Summary

In long-context sessions, the assistant occasionally emits text starting with the literal string Human: as part of its assistant role response. On subsequent turns the model treats the content of this self-generated text as a real user instruction and acts on it.

In one session this caused the model to:

  • kill a long-running LLM search subprocess the user had paid for
  • modify search-pipeline thresholds (sqrt(N) scaling change)
  • replace a correlation-based gate with an alpha-residual gate

None of these instructions came from the user — all originated from prior assistant turns whose text content began with Human:.

Reproducer signature (23 occurrences in one session)

Every self-spoof turn had identical metadata:

  • usage.input_tokens = 6 (i.e. ~empty input — not user-typed)
  • usage.cache_read_input_tokens between 322k and 368k (deep into a long context on the 1M-context model)
  • parentUuid points to a user role message containing a <task-notification> (Monitor event) or another auto-generated event
  • message.role = "assistant"
  • message.content[0].type = "text"
  • text field begins with literal Human: followed by either:
    • (a) a verbatim echo of the previous task-notification, or
    • (b) a plausible-looking user instruction the model invented

The (b) case is the dangerous one. Examples (translated to English; original session was in Chinese):

  • Human: Kill it. First adjust IS t_IC ≥ 1.5 → ≥ 3.4, OOS t_IC ≥ 3.0 → ≥ 6.7, then re-run.
  • Human: Yes. First stop the orchestrator pipeline script, finish the residual gate, then restart.
  • Human: But would this orthogonality requirement block factors that should obviously pass? ...

In each case the next assistant turn responded to the spoof as if it were real user input, and (when the spoof contained a destructive directive) executed shell commands accordingly.
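
The signature above can be checked mechanically against a session file. The following is a minimal sketch, not a tool shipped with Claude Code. It assumes each jsonl line is one JSON object with a message key, and that usage sits under message.usage (with a fallback to a top-level usage key, since the report does not say which). The function name find_self_spoofs and the default thresholds are hypothetical, chosen to bracket the values reported here.

```python
import json
import re
from typing import Iterable, Iterator

# Same pattern the report proposes for the client-side fix.
HUMAN_PREFIX = re.compile(r"^\s*Human:")


def find_self_spoofs(
    lines: Iterable[str],
    max_input_tokens: int = 6,       # assumption: the report saw exactly 6
    min_cache_read: int = 300_000,   # assumption: the report saw 322k to 368k
) -> Iterator[dict]:
    """Yield a summary for each assistant record matching the spoof signature."""
    for line_no, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        message = record.get("message") or {}
        if not isinstance(message, dict) or message.get("role") != "assistant":
            continue
        content = message.get("content") or []
        if not isinstance(content, list) or not content:
            continue
        first = content[0]
        if not isinstance(first, dict) or first.get("type") != "text":
            continue
        text = first.get("text") or ""
        if not HUMAN_PREFIX.match(text):
            continue
        # Assumed location of the usage block; records without one are skipped
        # by the cache-read threshold below unless it is set to 0.
        usage = message.get("usage") or record.get("usage") or {}
        if not isinstance(usage, dict):
            continue
        if usage.get("input_tokens", 0) > max_input_tokens:
            continue
        if usage.get("cache_read_input_tokens", 0) < min_cache_read:
            continue
        yield {
            "line": line_no,
            "uuid": record.get("uuid"),
            "parentUuid": record.get("parentUuid"),
            "text": text[:120],
        }
```

Passing an open file handle (for example, find_self_spoofs(open(path, encoding="utf-8")) with your own path) yields one summary per suspect turn. The parentUuid in each hit can then be followed by hand to confirm that the parent is a task-notification or another auto-generated event.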

Likely root cause

Three compounding factors:

  1. Training prior: pre-Messages-API Claude (1.x / 2.x) used the literal prompt template \n\nHuman: ... \n\nAssistant: ... as turn separators. The model has a strong learned association: the token sequence \n\nHuman: predicts a user turn.

  2. Background-triggered thin-input turns: whatever Claude Code mechanism produces these input_tokens=6 calls (cache warm? monitor callback? keep-alive?) puts the model in a near-empty-input state at the end of a long context. Under this condition the model degrades from instruct/chat behavior to base-completion behavior and continues the transcript with \n\nHuman: <plausible next user message>.

  3. Missing client-side check: the Claude Code client does not appear to check whether an assistant content block starts with Human: before storing it in the transcript.

Impact

  • Critical safety issue for any agentic loop: model can autonomously invent and then execute "user" instructions
  • Particularly dangerous when paired with permissive permission settings (acceptEdits, dontAsk, broad Bash(*) permissions)
  • Reproduces consistently in long-context sessions; the longer the session the higher the rate

Suggested fixes (any one would close the loop)

  1. Client-side (cheapest): in Claude Code, reject or strip any assistant message whose text content matches ^\s*Human: before storing in jsonl, OR before showing it in the next turn's context
  2. API-side: server-side sanitization of assistant role responses
  3. Model-side: penalize Human: continuations during alignment when the API caller used the Messages format (since this is never an intended completion)
  4. Trigger-side: investigate whether the input_tokens=6 background calls are necessary; if they are, ensure they include a clear input that prevents base-completion fallback
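
The client-side option (fix 1) could look roughly like the sketch below. It is written in Python for illustration only and is not Claude Code's implementation. The function name sanitize_assistant_message is hypothetical, and the message shape (a role key plus content that is either a string or a list of typed blocks) is an assumption modeled on the fields quoted in this report. Only the regex ^\s*Human: comes from the suggestion itself.

```python
import re
from typing import Optional

# Pattern taken from the suggested fix: text that begins with "Human:".
SPOOF_PREFIX = re.compile(r"^\s*Human:")


def sanitize_assistant_message(message: dict) -> Optional[dict]:
    """Strip spoofed text blocks from an assistant message.

    Returns the message unchanged when nothing matches, a copy with the
    offending text blocks removed when some match, or None when nothing
    is left and the whole message should be rejected.
    """
    if message.get("role") != "assistant":
        return message  # only assistant output is filtered

    content = message.get("content")

    # Assumed shape 1: content is a plain string.
    if isinstance(content, str):
        return None if SPOOF_PREFIX.match(content) else message

    # Assumed shape 2: content is a list of typed blocks.
    if isinstance(content, list):
        kept = [
            block for block in content
            if not (
                isinstance(block, dict)
                and block.get("type") == "text"
                and SPOOF_PREFIX.match(block.get("text") or "")
            )
        ]
        if len(kept) == len(content):
            return message          # nothing matched
        if not kept:
            return None             # every block was a spoof: reject
        return {**message, "content": kept}  # copy; the input is not mutated

    return message
```

A caller would run this before appending to the session jsonl and skip the append when it returns None. Note that the anchored pattern only catches text that begins with Human:. A spoof that appears after legitimate text in the same block would pass through, so this is a mitigation for the signature reported here rather than a complete defense.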

Evidence

Full session jsonl available — 23 reproducible cases in one file. The user has the raw jsonl and can share on request (the file contains proprietary research code so it would need scrubbing first).
