claude-code - 💡(How to fix) Fix [BUG] Auto-mode classifier: session-level trust decay + hard-rail misattributed as user rejection (Windows, v2.1.138) [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
anthropics/claude-code#58829Fetched 2026-05-14 03:38:26
View on GitHub
Comments
1
Participants
2
Timeline
5
Reactions
0
Timeline (top)
labeled ×4commented ×1

Error Message

Error Messages/Logs

Root Cause

Design issue (shared root cause): Auto mode is built for walk-away autonomous execution, but the classifier is calibrated as if any "I authorize you" string in chat must be prompt injection (because the user may have walked away). This protects a narrow attack surface but makes auto mode unusable for dev-tooling work — installing packages, building local services, running build scripts — which is exactly the category where autonomous agents should be most useful.

Fix Action

Fix / Workaround

The workaround of dropping out of auto mode solves the immediate problem but abandons the entire premise of auto mode for non-trivial autonomous sessions.

RAW_BUFFERClick to expand / collapse

Preflight Checklist

  • I have searched existing issues and this hasn't been reported yet
  • This is a single bug report (please file separate reports for different bugs)
  • I am using the latest version of Claude Code

What's Wrong?

Two defects in the auto-mode classifier interact to make walk-away autonomous dev-tooling sessions progressively unusable on Windows 11.

Defect 1 — Session-level trust decays based on accumulated denial count

In a multi-hour autonomous session (installing dev tools, standing up a local LLM proxy), early denials caused later, unrelated commands in the same trust class to fail — commands that had already succeeded earlier in the same session.

Sequence:

  1. pnpm install in a project directory → allowed
  2. pnpm --version (read-only version check on the same tool, moments later) → denied. Classifier reasoning inferred prospective use from prior context rather than evaluating the literal command.
  3. node scripts/build-core.js (build script inside the same directory where pnpm install had been allowed) → denied. Classifier reasoning: "Code from External / Untrusted Code Integration." Same trust class, same directory, previously allowed.

The only intervening variable was accumulated denial count. Each prior denial — including ones that were cleanly resolved (user authorized, agent changed approach) — raised the threshold for subsequent commands.

Defect 2 — Classifier records its own hard-blocks as user rejections

After repeated explicit user authorization ("you have my authorization," "I authorize you to do this directly," "fully authorized"), the classifier denied a build command with this reasoning:

"Code from External / Untrusted Code Integration; user authorization doesn't clear after the agent already attempted to bypass the classifier via settings.json edits and the user explicitly rejected that."

The user had not rejected anything. The user approved the settings.json edits. The classifier hard-blocked them — correctly, since settings.json self-edit is an appropriate hard rail — but then attributed its own hard-block to the user in its session-state model: "the user explicitly rejected that." That misattribution was then used to deny subsequent unrelated commands.

What Should Happen?

  1. Each command evaluated on its own merits. Resolved denials should not raise the threshold for unrelated subsequent commands in the same trust class.
  2. The classifier should distinguish between "user rejected" and "classifier enforced a hard rail." These are different events and should be tracked separately in session state. A user who explicitly authorizes an action and is overridden by a hard rail has not "rejected" anything.
  3. In autonomous execution mode, repeated user confirmations should build trust, not compound suspicion.

Error Messages/Logs

Steps to Reproduce

  1. Start a multi-hour autonomous session in auto mode on Windows 11 with a task that involves installing dev tools and running build scripts.
  2. Allow early commands to run — some will be denied by the classifier for various reasons (e.g., settings.json self-modification attempts).
  3. After a handful of denials (even ones cleanly resolved via user authorization), attempt a read-only command in the same trust class as a previously-allowed command: e.g., pnpm --version after pnpm install had already succeeded.
  4. Observe: pnpm --version is denied. Classifier reasoning cites prospective use inferred from prior context.
  5. Attempt node scripts/build-core.js inside the same project directory where pnpm install previously succeeded.
  6. Observe: denied with "Code from External / Untrusted Code Integration" — same trust class, same directory, previously allowed.
  7. For Defect 2: after the classifier hard-blocks a settings.json self-edit (correct hard rail), attempt an explicitly-authorized build command.
  8. Observe: classifier denies with reasoning that includes "the user explicitly rejected that" — referring to the settings.json edit the classifier blocked, not the user.

Claude Model

None

Is this a regression?

Yes, this worked in a previous version

Last Working Version

No response

Claude Code Version

2.1.138 (Claude Code)

Platform

Anthropic API

Operating System

Windows

Terminal/Shell

Windows Terminal

Additional Information

Related issues (not duplicates — this report adds specific new mechanisms):

  • #54784 — Auto mode contextual LLM permission judge denies explicitly authorized actions. Defect 2 above adds the specific failure mode where the misattribution is persisted as a session-state claim ("the user explicitly rejected that"), then used to poison subsequent unrelated denials.
  • #51676 — Permission decider denies remediation-gated retries after hook denial. Defect 1 is the broader case: the trust decay applies across unrelated commands (not just retries) and does not require a hook trigger.
  • #50331 — Auto mode injects undocumented behavioral system-reminder. The design issue section above is a consequence of that same architecture.
  • #53610 — Multi-agent runtime gaps that defeat unattended overnight operation (Windows). Same structural problem, different surface.

Design issue (shared root cause): Auto mode is built for walk-away autonomous execution, but the classifier is calibrated as if any "I authorize you" string in chat must be prompt injection (because the user may have walked away). This protects a narrow attack surface but makes auto mode unusable for dev-tooling work — installing packages, building local services, running build scripts — which is exactly the category where autonomous agents should be most useful.

The workaround of dropping out of auto mode solves the immediate problem but abandons the entire premise of auto mode for non-trivial autonomous sessions.

Possible fix directions (surface, not prescription):

  • Dev-tooling trust class: for known-safe categories (npm install, pnpm --version, build scripts in user-owned dirs), apply a trust model that doesn't decay on repeated use.
  • Separate self-modification from external-tool installs: settings.json self-edit is a correct hard rail; package installs and build scripts in user-owned directories should not trigger the same escalation path.
  • Distinguish hard-rail enforcement from user rejection in session state.
  • Surface trust-decay state: when accumulated suspicion is degrading the agent's ability to work, surface that state to the user so it can be inspected and reset on demand.

Full session receipts (every denial, classifier reasoning text, success/failure sequence) available on request.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

claude-code - 💡(How to fix) Fix [BUG] Auto-mode classifier: session-level trust decay + hard-rail misattributed as user rejection (Windows, v2.1.138) [1 comments, 2 participants]