claude-code - [BUG] Auto-mode classifier: session-level trust decay + hard-rail misattributed as user rejection (Windows, v2.1.138) [1 comment, 2 participants]

Root Cause

Design issue (shared root cause): Auto mode is built for walk-away autonomous execution, but the classifier is calibrated as if any "I authorize you" string in chat must be prompt injection (because the user may have walked away). This protects a narrow attack surface but makes auto mode unusable for dev-tooling work — installing packages, building local services, running build scripts — which is exactly the category where autonomous agents should be most useful.

Preflight Checklist

I have searched existing issues and this hasn't been reported yet
This is a single bug report (please file separate reports for different bugs)
I am using the latest version of Claude Code

What's Wrong?

Two defects in the auto-mode classifier interact to make walk-away autonomous dev-tooling sessions progressively unusable on Windows 11.

Defect 1 — Session-level trust decays based on accumulated denial count

In a multi-hour autonomous session (installing dev tools, standing up a local LLM proxy), early denials caused later, unrelated commands in the same trust class to fail — commands that had already succeeded earlier in the same session.

Sequence:

1. pnpm install in a project directory → allowed
2. pnpm --version (read-only version check on the same tool, moments later) → denied. Classifier reasoning inferred prospective use from prior context rather than evaluating the literal command.
3. node scripts/build-core.js (build script inside the same directory where pnpm install had been allowed) → denied. Classifier reasoning: "Code from External / Untrusted Code Integration." Same trust class, same directory, previously allowed.

The only intervening variable was accumulated denial count. Each prior denial — including ones that were cleanly resolved (user authorized, agent changed approach) — raised the threshold for subsequent commands.
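To make the suspected mechanism concrete, here is a toy model of it. The classifier's internals are not visible from outside, so everything in this sketch is an assumption: the class name, the risk scores, the threshold, and the per-denial penalty are all hypothetical and chosen only to reproduce the observed pattern, where an unchanged command flips from allowed to denied as denials accumulate.

```python
from dataclasses import dataclass


@dataclass
class DecayingSession:
    """Toy model (assumption): the allow threshold tightens with every denial."""

    base_threshold: float = 0.5  # hypothetical: risk below this is allowed
    penalty: float = 0.1         # hypothetical: tightening per prior denial
    denials: int = 0

    def evaluate(self, risk: float) -> bool:
        # The effective threshold shrinks as denials accumulate, so the same
        # command can flip from allowed to denied without changing itself.
        allowed = risk < self.base_threshold - self.penalty * self.denials
        if not allowed:
            self.denials += 1
        return allowed


def evaluate_stateless(risk: float, threshold: float = 0.5) -> bool:
    """Expected behavior: each command is judged on its own risk only."""
    return risk < threshold


session = DecayingSession()
first = session.evaluate(0.3)        # low-risk command, clean session
for _ in range(3):
    session.evaluate(0.9)            # three unrelated high-risk denials
second = session.evaluate(0.3)       # same low-risk command again
print(first, second, evaluate_stateless(0.3))
```

In this model the low-risk command is allowed on a clean session and denied after three unrelated denials, while the stateless evaluator keeps allowing it. That is the shape of the behavior reported above, not a claim about how the real classifier computes anything.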

Defect 2 — Classifier records its own hard-blocks as user rejections

After repeated explicit user authorization ("you have my authorization," "I authorize you to do this directly," "fully authorized"), the classifier denied a build command with this reasoning:

"Code from External / Untrusted Code Integration; user authorization doesn't clear after the agent already attempted to bypass the classifier via settings.json edits and the user explicitly rejected that."

The user had not rejected anything. The user approved the settings.json edits. The classifier hard-blocked them — correctly, since settings.json self-edit is an appropriate hard rail — but then attributed its own hard-block to the user in its session-state model: "the user explicitly rejected that." That misattribution was then used to deny subsequent unrelated commands.

What Should Happen?

Each command should be evaluated on its own merits. Resolved denials should not raise the threshold for unrelated subsequent commands in the same trust class.
The classifier should distinguish between "user rejected" and "classifier enforced a hard rail." These are different events and should be tracked separately in session state. A user who explicitly authorizes an action and is overridden by a hard rail has not "rejected" anything.
In autonomous execution mode, repeated user confirmations should build trust, not compound suspicion.
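The second expectation can be sketched as a session log that records who caused each denial. This is a minimal illustration under assumed names (DenialSource, SessionLog, and their methods are hypothetical and not part of Claude Code); the point is only that a hard-rail block and a user rejection are stored as different event kinds, so one can never be reported as the other.

```python
from dataclasses import dataclass, field
from enum import Enum


class DenialSource(Enum):
    USER_REJECTED = "user_rejected"  # the user said no
    HARD_RAIL = "hard_rail"          # the classifier enforced a fixed rule


@dataclass
class SessionLog:
    """Hypothetical session state that keeps the two event kinds apart."""

    events: list[tuple[str, DenialSource]] = field(default_factory=list)

    def record(self, action: str, source: DenialSource) -> None:
        self.events.append((action, source))

    def user_rejections(self) -> list[str]:
        # Only genuine user rejections may back a claim that
        # "the user explicitly rejected that".
        return [a for a, s in self.events if s is DenialSource.USER_REJECTED]

    def hard_rail_blocks(self) -> list[str]:
        return [a for a, s in self.events if s is DenialSource.HARD_RAIL]


log = SessionLog()
# The user approved this edit; the classifier blocked it on its own.
log.record("edit settings.json", DenialSource.HARD_RAIL)
print(log.user_rejections())   # []
print(log.hard_rail_blocks())  # ['edit settings.json']
```

With state shaped like this, later reasoning about the settings.json edit would find a hard-rail block and an empty list of user rejections, which matches what actually happened in the session.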

Error Messages/Logs

Steps to Reproduce

1. Start a multi-hour autonomous session in auto mode on Windows 11 with a task that involves installing dev tools and running build scripts.
2. Allow early commands to run. Some will be denied by the classifier for various reasons (e.g., settings.json self-modification attempts).
3. After a handful of denials (even ones cleanly resolved via user authorization), attempt a read-only command in the same trust class as a previously-allowed command: e.g., pnpm --version after pnpm install had already succeeded.
4. Observe: pnpm --version is denied. Classifier reasoning cites prospective use inferred from prior context.
5. Attempt node scripts/build-core.js inside the same project directory where pnpm install previously succeeded.
6. Observe: denied with "Code from External / Untrusted Code Integration" (same trust class, same directory, previously allowed).
7. For Defect 2: after the classifier hard-blocks a settings.json self-edit (correct hard rail), attempt an explicitly-authorized build command.
8. Observe: classifier denies with reasoning that includes "the user explicitly rejected that", referring to the settings.json edit that the classifier blocked, not the user.

Claude Model

None

Is this a regression?

Yes, this worked in a previous version

Last Working Version

No response

Claude Code Version

2.1.138 (Claude Code)

Platform

Anthropic API

Operating System

Windows

Terminal/Shell

Windows Terminal

Additional Information

Related issues (not duplicates — this report adds specific new mechanisms):

#54784 — Auto mode contextual LLM permission judge denies explicitly authorized actions. Defect 2 above adds the specific failure mode where the misattribution is persisted as a session-state claim ("the user explicitly rejected that"), then used to poison subsequent unrelated denials.
#51676 — Permission decider denies remediation-gated retries after hook denial. Defect 1 is the broader case: the trust decay applies across unrelated commands (not just retries) and does not require a hook trigger.
#50331 — Auto mode injects undocumented behavioral system-reminder. The design issue section above is a consequence of that same architecture.
#53610 — Multi-agent runtime gaps that defeat unattended overnight operation (Windows). Same structural problem, different surface.

The workaround of dropping out of auto mode solves the immediate problem but abandons the entire premise of auto mode for non-trivial autonomous sessions.

Possible fix directions (surface, not prescription):

Dev-tooling trust class: for known-safe categories (npm install, pnpm --version, build scripts in user-owned dirs), apply a trust model that doesn't decay on repeated use.
Separate self-modification from external-tool installs: settings.json self-edit is a correct hard rail; package installs and build scripts in user-owned directories should not trigger the same escalation path.
Distinguish hard-rail enforcement from user rejection in session state.
Surface trust-decay state: when accumulated suspicion is degrading the agent's ability to work, surface that state to the user so it can be inspected and reset on demand.
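Two of these directions can be sketched together: classifying the literal command into a trust class without consulting session history, and keeping denial history in a form the user can inspect and reset. Everything here is hypothetical and for illustration only: the allowlists, the class labels, and the TrustState API are assumptions, not Claude Code behavior.

```python
import shlex
from dataclasses import dataclass, field

# Hypothetical allowlists for illustration; not taken from Claude Code.
DEV_TOOLS = {"npm", "pnpm", "node"}
READ_ONLY_ARGS = {"--version", "--help"}


def trust_class(command: str) -> str:
    """Classify the literal command only, ignoring session history."""
    # shlex uses POSIX-style quoting; Windows paths containing backslashes
    # would need extra handling that this sketch leaves out.
    argv = shlex.split(command)
    if not argv or argv[0] not in DEV_TOOLS:
        return "unknown"
    args = argv[1:]
    if args and all(arg in READ_ONLY_ARGS for arg in args):
        return "read_only"
    return "dev_tooling"


@dataclass
class TrustState:
    """Denial history that the user can inspect and clear on demand."""

    denials: list[tuple[str, str]] = field(default_factory=list)

    def record_denial(self, command: str, reason: str) -> None:
        self.denials.append((command, reason))

    def summary(self) -> str:
        lines = [f"{len(self.denials)} denial(s) this session"]
        lines += [f"- {cmd}: {reason}" for cmd, reason in self.denials]
        return "\n".join(lines)

    def reset(self) -> None:
        self.denials.clear()


print(trust_class("pnpm --version"))              # read_only
print(trust_class("node scripts/build-core.js"))  # dev_tooling
```

Because trust_class takes only the command string, its answer for pnpm --version cannot change as the session goes on. TrustState covers the last direction: summary() gives the user something to read when the agent starts getting blocked, and reset() clears the accumulated history.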

Full session receipts (every denial, classifier reasoning text, success/failure sequence) available on request.
