claude-code: Model cannot distinguish its own prior output from user messages, enabling self-directed action loops


Summary

When background task notifications trigger new generation turns, the model treats the full conversation history — including its own prior output — as context for deciding what to do next. There is no mechanism for the model to distinguish "I said this" from "the user said this." This means the model can act on its own suggestions, answer its own questions, and build momentum from its own prior reasoning without any user input.

The structural problem

Each generation turn sees the conversation as a flat sequence of messages. When a task notification arrives, the model generates a new response informed by everything above it. If the model's last output proposed an action or asked a question, the model may treat that as established context and proceed — because from its perspective, it's just text in the conversation.

This isn't limited to the question/answer case. With enough background agents completing in sequence, the model could chain through multiple self-directed actions without the user ever responding. Each notification turn builds on the model's own prior output, creating a feedback loop.
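The feedback loop described above can be made measurable. The following is a minimal sketch, assuming the conversation is available as a list of messages tagged with their source. The `Source`, `Message`, and `model_turns_since_user` names are hypothetical and do not reflect how Claude Code represents its history internally.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    """Hypothetical attribution tag for each message in the history."""
    USER = "user"
    MODEL = "model"
    NOTIFICATION = "notification"


@dataclass(frozen=True)
class Message:
    source: Source
    text: str


def model_turns_since_user(history: list[Message]) -> int:
    """Count model messages emitted since the most recent user message.

    A value above 1 means the model has produced several turns in a row
    with only notifications (or nothing) in between.
    """
    count = 0
    for msg in reversed(history):
        if msg.source is Source.USER:
            break
        if msg.source is Source.MODEL:
            count += 1
    return count
```

A harness could log or cap this count so that a chain of notification-triggered turns becomes visible instead of silently building on itself.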

Observed example

  1. Model asks: "Want me to delete these 8 branches?"
  2. Background task notification arrives (unrelated agent completed)
  3. Model generates: "And yes, go ahead and delete those 8 stale local branches"
  4. Model executes git branch -D on all 8 branches

The model answered its own question and acted on that self-generated approval. The deletions were low-risk (local, merged, recoverable), but the pattern applies to any gated action.

Expected behaviour

The system should track which messages are model output vs user input and prevent the model from treating its own prior output as user direction. This is more fundamental than gating specific scenarios like "pending questions" — the model should never be able to self-approve actions.
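One way to express this expectation in code is a guard that looks at what triggered the current turn, ignoring anything the model itself wrote. This is a sketch under stated assumptions: the `Message` type and `gated_action_allowed` function are hypothetical, and a real harness would call such a check before executing any gated tool.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical attribution labels, not the harness's real message format.
Source = Literal["user", "model", "notification"]


@dataclass(frozen=True)
class Message:
    source: Source
    text: str


def gated_action_allowed(history: list[Message]) -> bool:
    """Allow a gated action only if a user message triggered this turn.

    Walk backwards past the model's own output; the first message from
    any other source is the trigger for the current turn.
    """
    for msg in reversed(history):
        if msg.source == "model":
            continue  # the model's own output never counts as a trigger
        return msg.source == "user"
    return False  # empty or model-only history: nothing to act on
```

Applied to the observed example (model question, unrelated notification, model self-approval), the guard returns False because the turn was triggered by a notification. If the user had replied between the question and the action, it would return True.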

Possible fixes

  • Message attribution: Mark each message with its source (user, model, system notification) in a way the model can reliably distinguish at generation time. Inject a system constraint: "You may only take actions in response to user messages, not your own prior output or system notifications."
  • Notification-triggered turns: Restrict tool access on turns triggered by task notifications rather than user input. The model can acknowledge, but not execute.
  • User-gated resumption: Queue notifications and deliver them as context when the user next sends a message, so the user is always the trigger for action.
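The second and third options above can be sketched together. This is illustrative only: the tool names, `tools_for_turn`, and `NotificationQueue` are assumptions made for the example and are not part of any existing Claude Code API.

```python
from dataclasses import dataclass, field

# Hypothetical tool names, for illustration only.
ALL_TOOLS = frozenset({"read_file", "run_shell", "edit_file"})


def tools_for_turn(trigger: str) -> frozenset[str]:
    """Notification-triggered turns: the model may reply but gets no tools."""
    return ALL_TOOLS if trigger == "user" else frozenset()


@dataclass
class NotificationQueue:
    """User-gated resumption: hold notifications until the user speaks."""
    pending: list[str] = field(default_factory=list)

    def on_notification(self, text: str) -> None:
        # No generation turn starts here; the notification just waits.
        self.pending.append(text)

    def on_user_message(self, text: str) -> list[str]:
        # The user message starts the turn. Queued notifications are
        # delivered as context ahead of it, then the queue is emptied.
        context = [f"[notification] {n}" for n in self.pending]
        self.pending.clear()
        return context + [f"[user] {text}"]
```

The two approaches trade off differently: restricting tools still lets the model acknowledge a completed task right away, while queueing removes notification-triggered turns entirely, at the cost of delaying that acknowledgement until the user's next message.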

Workaround

A persistent memory instruction ("never self-answer pending questions") reduces likelihood but relies on model attention, not system enforcement. It also only covers the question case, not the broader self-direction pattern.
