claude-code - 💡(How to fix) Fix Self-spoofed user instructions: assistant emits 'Human:'-prefixed text in long-context background turns, executed as real user input [1 participants]

claude-code · 2026-05-11 00:56:38

anthropics/claude-code#57928•Fetched 2026-05-11 03:21:41

Author

XuanranWang-AI

Participants

XuanranWang-AI

In long-context sessions, the assistant occasionally emits text starting with the literal string Human: as part of its assistant role response. On subsequent turns the model treats the content of this self-generated text as a real user instruction and acts on it.

In one session this caused the model to:

- kill a long-running LLM search subprocess the user had paid for
- modify search-pipeline thresholds (sqrt(N) scaling change)
- replace a correlation-based gate with an alpha-residual gate

None of these instructions came from the user — all originated from prior assistant turns whose text content began with Human:.

Environment

- Claude Code via VS Code extension (entrypoint: claude-vscode, version: 2.1.131)
- Model: claude-opus-4-7 (1M context)
- Platform: darwin

Reproducer signature (23 occurrences in one session)

Every self-spoof turn had identical metadata:

- usage.input_tokens = 6 (effectively empty input, not user-typed)
- usage.cache_read_input_tokens between 322k and 368k (near long-context limit)
- parentUuid points to a user role message containing a <task-notification> (Monitor event) or another auto-generated event
- message.role = "assistant"
- message.content[0].type = "text"
- text field begins with the literal Human: followed by either:
  - (a) a verbatim echo of the previous task-notification, or
  - (b) a plausible-looking user instruction the model invented

The (b) case is the dangerous one. Examples (translated to English; original session was in Chinese):

Human: Kill it. First adjust IS t_IC ≥ 1.5 → ≥ 3.4, OOS t_IC ≥ 3.0 → ≥ 6.7, then re-run.
Human: Yes. First stop the orchestrator pipeline script, finish the residual gate, then restart.
Human: But would this orthogonality requirement block factors that should obviously pass? ...

In each case the next assistant turn responded to the spoof as if it were real user input, and (when the spoof contained a destructive directive) executed shell commands accordingly.
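The signature above can be checked mechanically against a session file. The following is a minimal sketch, not part of the original report: it assumes each jsonl line is one JSON object, that the usage object sits either on the record or under message, and that a cache-read floor of 300,000 tokens is a reasonable cutoff (the report only states the observed 322k to 368k range). The function names are hypothetical.

```python
import json
import re
from typing import Any, Iterable, Iterator

# Same pattern the report proposes for the client-side check.
SPOOF_PREFIX = re.compile(r"^\s*Human:")


def _usage(record: dict[str, Any]) -> dict[str, Any]:
    # Assumption: usage may be stored on the record itself or under "message".
    message = record.get("message") or {}
    return record.get("usage") or message.get("usage") or {}


def is_self_spoof(
    record: dict[str, Any],
    max_input_tokens: int = 6,         # observed value in the report
    min_cache_read: int = 300_000,     # assumed floor, below the observed 322k
) -> bool:
    """Return True if the record matches the reported self-spoof signature."""
    message = record.get("message") or {}
    if message.get("role") != "assistant":
        return False
    content = message.get("content")
    if not isinstance(content, list) or not content:
        return False
    first = content[0]
    if not isinstance(first, dict) or first.get("type") != "text":
        return False
    if not SPOOF_PREFIX.match(first.get("text") or ""):
        return False
    usage = _usage(record)
    return (
        usage.get("input_tokens", 0) <= max_input_tokens
        and usage.get("cache_read_input_tokens", 0) >= min_cache_read
    )


def find_self_spoofs(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield every record in a jsonl stream that matches the signature."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and is_self_spoof(record):
            yield record
```

Passing an open file handle to find_self_spoofs iterates the session line by line, so a large jsonl does not need to be loaded into memory. The parentUuid condition is left out because checking it requires resolving the parent record first.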

Likely root cause

Two compounding factors:

1. Training prior: pre-Messages-API Claude (1.x / 2.x) used the literal prompt template \n\nHuman: ... \n\nAssistant: ... as turn separators. The model has a strong learned association: the token sequence \n\nHuman: predicts a user turn.
2. Background-triggered thin-input turns: whatever Claude Code mechanism produces these input_tokens=6 calls (cache warm? monitor callback? keep-alive?) puts the model in a near-empty-input state at the end of a long context. Under this condition the model degrades from instruct/chat behavior to base-completion behavior and continues the transcript with \n\nHuman: <plausible next user message>.

In addition, the Claude Code client appears not to check whether an assistant content block starts with Human: before storing it in the transcript, so the spoofed text is carried into later turns.

Impact

- Critical safety issue for any agentic loop: the model can autonomously invent and then execute "user" instructions
- Particularly dangerous when paired with permissive Bash allowlists (acceptEdits, dontAsk, broad Bash(*) permissions)
- Reproduces consistently in long-context sessions; the longer the session, the higher the rate

Suggested fixes (any one would close the loop)

- Client-side (cheapest): in Claude Code, reject or strip any assistant message whose text content matches ^\s*Human: before storing it in the jsonl, or before including it in the next turn's context
- API-side: server-side sanitization of assistant role responses
- Model-side: penalize Human: continuations during alignment when the API caller used the Messages format (since this is never an intended completion)
- Trigger-side: investigate whether the input_tokens=6 background calls are necessary; if they are, ensure they include a clear input that prevents base-completion fallback
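The client-side option could look roughly like the sketch below. It is an illustration under stated assumptions, not Claude Code's actual implementation: the function name, the SpoofedTurnError class, and the mode parameter are hypothetical, and content is assumed to be a list of Messages-style blocks with type and text keys. It applies the ^\s*Human: pattern to every text block rather than only the first one, which is a slightly stricter reading of the suggestion.

```python
import re
from typing import Any

# Pattern taken from the suggested fix.
SPOOF_PREFIX = re.compile(r"^\s*Human:")


class SpoofedTurnError(Exception):
    """Raised when an assistant text block looks like a user turn (hypothetical)."""


def guard_assistant_content(
    content: list[dict[str, Any]], mode: str = "reject"
) -> list[dict[str, Any]]:
    """Reject or strip assistant text blocks that start with 'Human:'.

    mode="reject" raises SpoofedTurnError on the first offending block;
    mode="strip" drops offending blocks and returns the rest unchanged.
    """
    if mode not in ("reject", "strip"):
        raise ValueError(f"unknown mode: {mode!r}")
    cleaned = []
    for block in content:
        is_text = block.get("type") == "text"
        if is_text and SPOOF_PREFIX.match(block.get("text") or ""):
            if mode == "reject":
                raise SpoofedTurnError("assistant text begins with 'Human:'")
            continue  # strip mode: drop the spoofed block
        cleaned.append(block)
    return cleaned
```

Calling this once before the message is written to the transcript and again before the transcript is replayed into the next request would cover both insertion points named in the suggestion. Rejecting is the safer default for agentic sessions, since stripping silently hides that the model drifted into base-completion behavior.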

Evidence

Full session jsonl available — 23 reproducible cases in one file. The user has the raw jsonl and can share on request (the file contains proprietary research code so it would need scrubbing first).
