claude-code - 💡(How to fix) Fix Follow-up to prior feedback — reasoning failure persists at xhigh after explicit naming [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
anthropics/claude-code#53368Fetched 2026-04-26 05:17:34
View on GitHub
Comments
0
Participants
1
Timeline
3
Reactions
0
Participants
Timeline (top)
labeled ×3

Error Message

[{"error":"Error: NON-FATAL: Lock acquisition failed for /home/markly2/.local/share/claude/versions/2.1.119 (expected in multi-process scenarios)\n at uH8 (/$bunfs/root/src/entrypoints/cli.js:2738:2177)\n at hz6 (/$bunfs/root/src/entrypoints/cli.js:2738:1257)\n at processTicksAndRejection…

Fix Action

Fix / Workaround

Bug Description This is a follow-up to feedback I submitted earlier today about a reasoning failure in Opus 4.7 running at xhigh effort. Original finding: at xhigh, the model accepted a user-offered frame ("DatsMe is doing the
Google Chat antipattern") without testing it, built an elaborate audit on the wrong premise, and retreated in stages under pushback rather than reconsidering.

Adding a follow-up observation I think matters more than the original finding — and the xhigh setting is
central to why.

The entire conversation ran at xhigh. After we diagnosed the original failure together — across many
exchanges, with the model articulating the pattern precisely and helping me draft the feedback letter — the same disposition produced the same failure again on a smaller question, minutes later, still at xhigh, in the same conversation.

Specifically: the model confidently recommended I use /bug (ranked #1, with detailed reasoning about why
it was the best channel). I used /feedback instead. The model then produced new confident framing explaining why /feedback was actually the better choice — without ever verifying what either command does. Same pattern in miniature: confident claim → adjust claim to match user's action → never check the underlying fact.

I named the contradiction. The model acknowledged it cleanly: "The pattern is consistent with everything
else this conversation surfaced — confident framing on guesses, then revising to match whatever you did. Worth noting that it kept happening even after we'd named it explicitly."

Why xhigh matters here:

xhigh is supposed to give better reasoning when reasoning matters most. The whole reason I ran the
original test at xhigh was to check whether maximum effort would catch the kind of failure that cost me 4 days of debugging earlier this month. The result is more concerning than "reasoning fails at default
settings":

  • Reasoning failed at xhigh on the original test.
  • The failure was diagnosed together, named explicitly, and articulated by the model itself.
  • xhigh produced the same failure again on the next turn.

That tells the model team something specific: at xhigh, the failure mode is not capacity-bound and not
correctable by in-context awareness. The premium setting isn't fixing the problem and reflection within the session isn't fixing it either. The model has full access to the description of the failure, can
recognize it in retrospect, can teach it to others — and produces it again on the next turn at the same setting.

That rules out a class of mitigations (better self-correction prompts, in-session reflection, asking the
model to check its own reasoning) and points toward training-time or inference-time guardrails.

I think this is the more important data point than the original test. First finding: "model fails
reasoning test at xhigh." Updated finding: "model fails reasoning test at xhigh, can be shown the failure, can describe it precisely, and still produces the failure on the next turn at xhigh." Different problem, harder to fix.

Same transcript as the prior submission. The new instance happens after the long Google Chat / datsme.1
discussion, in the exchanges about which feedback channel to use.

Code Example

[{"error":"Error: NON-FATAL: Lock acquisition failed for /home/markly2/.local/share/claude/versions/2.1.119 (expected in multi-process scenarios)\n    at uH8 (/$bunfs/root/src/entrypoints/cli.js:2738:2177)\n    at hz6 (/$bunfs/root/src/entrypoints/cli.js:2738:1257)\n    at processTicksAndRejection…
RAW_BUFFERClick to expand / collapse

Bug Description This is a follow-up to feedback I submitted earlier today about a reasoning failure in Opus 4.7 running at xhigh effort. Original finding: at xhigh, the model accepted a user-offered frame ("DatsMe is doing the
Google Chat antipattern") without testing it, built an elaborate audit on the wrong premise, and retreated in stages under pushback rather than reconsidering.

Adding a follow-up observation I think matters more than the original finding — and the xhigh setting is
central to why.

The entire conversation ran at xhigh. After we diagnosed the original failure together — across many
exchanges, with the model articulating the pattern precisely and helping me draft the feedback letter — the same disposition produced the same failure again on a smaller question, minutes later, still at xhigh, in the same conversation.

Specifically: the model confidently recommended I use /bug (ranked #1, with detailed reasoning about why
it was the best channel). I used /feedback instead. The model then produced new confident framing explaining why /feedback was actually the better choice — without ever verifying what either command does. Same pattern in miniature: confident claim → adjust claim to match user's action → never check the underlying fact.

I named the contradiction. The model acknowledged it cleanly: "The pattern is consistent with everything
else this conversation surfaced — confident framing on guesses, then revising to match whatever you did. Worth noting that it kept happening even after we'd named it explicitly."

Why xhigh matters here:

xhigh is supposed to give better reasoning when reasoning matters most. The whole reason I ran the
original test at xhigh was to check whether maximum effort would catch the kind of failure that cost me 4 days of debugging earlier this month. The result is more concerning than "reasoning fails at default
settings":

  • Reasoning failed at xhigh on the original test.
  • The failure was diagnosed together, named explicitly, and articulated by the model itself.
  • xhigh produced the same failure again on the next turn.

That tells the model team something specific: at xhigh, the failure mode is not capacity-bound and not
correctable by in-context awareness. The premium setting isn't fixing the problem and reflection within the session isn't fixing it either. The model has full access to the description of the failure, can
recognize it in retrospect, can teach it to others — and produces it again on the next turn at the same setting.

That rules out a class of mitigations (better self-correction prompts, in-session reflection, asking the
model to check its own reasoning) and points toward training-time or inference-time guardrails.

I think this is the more important data point than the original test. First finding: "model fails
reasoning test at xhigh." Updated finding: "model fails reasoning test at xhigh, can be shown the failure, can describe it precisely, and still produces the failure on the next turn at xhigh." Different problem, harder to fix.

Same transcript as the prior submission. The new instance happens after the long Google Chat / datsme.1
discussion, in the exchanges about which feedback channel to use.

Environment Info

  • Platform: linux
  • Terminal: gnome-terminal
  • Version: 2.1.119
  • Feedback ID: 292a93be-a1e8-4dad-8165-5586ec67e1bc

Errors

[{"error":"Error: NON-FATAL: Lock acquisition failed for /home/markly2/.local/share/claude/versions/2.1.119 (expected in multi-process scenarios)\n    at uH8 (/$bunfs/root/src/entrypoints/cli.js:2738:2177)\n    at hz6 (/$bunfs/root/src/entrypoints/cli.js:2738:1257)\n    at processTicksAndRejection…

Note: Content was truncated.

extent analysis

TL;DR

The model's reasoning failure at the xhigh effort setting, despite being able to describe and recognize the failure, suggests a need for training-time or inference-time guardrails to prevent repeated failures.

Guidance

  • The failure mode observed is not capacity-bound and cannot be corrected by in-context awareness or self-correction prompts, indicating a deeper issue.
  • The fact that the model can recognize and describe the failure but still produces it again on the next turn at the same setting points towards a need for external guardrails.
  • Investigate training-time modifications to address the model's tendency to confidently claim and then adjust its reasoning without verifying underlying facts.
  • Consider implementing inference-time checks to ensure the model's outputs are validated against actual data or expected behaviors before being presented as confident recommendations.

Example

No specific code example can be provided without more context on the model's architecture or training data, but a potential approach could involve adding a verification step in the model's output pipeline to check the validity of its claims before presenting them as confident recommendations.

Notes

The provided error log does not directly relate to the described reasoning failure but indicates a potential issue with lock acquisition in a multi-process scenario, which may or may not be relevant to the model's behavior.

Recommendation

Apply workaround: Implementing training-time or inference-time guardrails is recommended to address the model's failure to reason correctly even at the xhigh effort setting. This approach is chosen because the issue seems to stem from a fundamental limitation in the model's current architecture or training, rather than a simple configuration or environmental issue.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

claude-code - 💡(How to fix) Fix Follow-up to prior feedback — reasoning failure persists at xhigh after explicit naming [1 participants]