claude-code: Quality regression in Opus reasoning/judgment across sessions (user-reported)

GitHub stats
anthropics/claude-code#46129 (fetched 2026-04-11 06:28:19)
Comments: 0
Participants: 1
Timeline: 4
Reactions: 0
Timeline (top): labeled ×3, closed ×1


Summary

A long-term power user (287 sessions, 1196 memories, complex multi-project workspace) reports a noticeable quality regression in Claude Opus reasoning and judgment compared with sessions from 1-2 weeks earlier.

Observed Behaviors (Session 287, April 9-10 2026)

Verification failures:

  • Repeatedly speculated about root causes instead of running one command to check (e.g., said "that's likely a permissions issue" about a crash without reading the script)
  • Declared a broken parser "worked correctly" when the output showed 0 findings with a "NEEDS WORK" verdict -- an obvious contradiction
  • Diagnosed "stale pickle" for an auth failure when the actual bug was passing a full file path where the library expected a filename suffix
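
The parser contradiction in the second bullet (zero findings alongside a "NEEDS WORK" verdict) is the kind of mismatch that can be checked mechanically before declaring success. A minimal sketch, assuming a hypothetical report shape with a `verdict` string and a `findings` list; neither name comes from the issue:

```python
from dataclasses import dataclass, field


@dataclass
class AuditReport:
    """Hypothetical shape of a parser/audit result; field names are assumptions."""
    verdict: str
    findings: list = field(default_factory=list)


def is_self_consistent(report: AuditReport) -> bool:
    """Return False when the verdict and the findings contradict each other."""
    needs_work = report.verdict.strip().upper() == "NEEDS WORK"
    has_findings = len(report.findings) > 0
    # A "NEEDS WORK" verdict should be backed by at least one finding,
    # and a passing verdict should not carry findings.
    return needs_work == has_findings


suspicious = AuditReport(verdict="NEEDS WORK", findings=[])
print(is_self_consistent(suspicious))  # False
```

A check like this does not say which side is wrong (the parser or the verdict logic), only that the output should not be reported as working.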

Judgment failures:

  • Built a financial advisory ensemble (4 personas) that unanimously recommended the user double down on 0-day-to-expiration options based on ONE day of trading data (26 trades)
  • No persona dissented or flagged the small sample size
  • User had to catch the survivorship bias himself: "you guys are practically advising me to do 0DTE options on gut feelings because it worked out for one day"
  • Compared pre-system trades to post-system trades and drew invalid conclusions about strategy effectiveness
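
To make the sample-size concern concrete, the uncertainty around a win rate estimated from 26 trades can be computed directly. The sketch below uses the Wilson score interval; the trade count comes from the issue, but the number of winning trades is a made-up figure for illustration, since the issue does not report one:

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 26 trades is from the issue; 18 wins is a hypothetical figure.
low, high = wilson_interval(wins=18, n=26)
print(f"win rate 95% interval: {low:.2f} to {high:.2f}")
```

With these illustrative numbers the interval runs from about 0.5 to a little above 0.8, so the data cannot distinguish a coin flip from a strong edge. The interval also assumes independent trades; trades placed on a single day share market conditions, so the real uncertainty is likely wider.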

Process failures:

  • Bypassed a built skill (/Pithwaddle audit) with a manual agent when the skill's script crashed, instead of fixing the script
  • Asked the user to relay tool output instead of capturing it
  • Asked permission to fix obvious bugs 3+ times in one session (an anti-pattern the user has corrected dozens of times before)
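
Capturing tool output, rather than asking the user to relay it, needs only a few lines. A minimal sketch using Python's standard `subprocess` module; the failing child process below is a stand-in for the crashing script, whose real name and arguments are not given in the issue:

```python
import subprocess
import sys


def run_and_capture(cmd: list[str], timeout: float = 60.0) -> dict:
    """Run a command and return its exit code, stdout, and stderr."""
    result = subprocess.run(
        cmd,
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        timeout=timeout,
        check=False,          # a non-zero exit is data to inspect, not an exception
    )
    return {
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }


# Stand-in for the crashing script: a child Python process that fails loudly.
outcome = run_and_capture(
    [sys.executable, "-c", "import sys; print('boom', file=sys.stderr); sys.exit(3)"]
)
print(outcome["returncode"], outcome["stderr"].strip())
```

Reading `stderr` and the exit code first also addresses the speculation problem listed under verification failures: the traceback usually names the failing line before any hypothesis is needed.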

Context

This user has an extensive memory system (1196 memories), learned rules, pain logs, and anti-patterns documented across 287 sessions. Many of these failure modes have been explicitly stored as "never do this again" memories -- yet they recurred in a single session. The user noted: "last session, before your depreciation issues you guys definitely would have caught that."

Environment

  • Model: claude-opus-4-6 (1M context)
  • Platform: Windows 11, Claude Code CLI
  • Session type: Long session (~3 hours), complex MCP server work + financial data analysis

Expected vs Actual

  • Expected: Opus-level reasoning -- verify before concluding, challenge own assumptions, catch obvious contradictions, provide dissenting perspectives in advisory contexts
  • Actual: Pattern of confident-but-wrong conclusions, groupthink in multi-persona outputs, failure to apply stored lessons from prior sessions

User Impact

  • User trust eroded within a single session
  • Incorrect financial advice given (caught by user before acting on it)
  • Multiple bugs shipped that should have been caught pre-commit

The user explicitly requested this issue be filed.

extent analysis

TL;DR

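The analysis's proposed fix is to retrain or fine-tune the Claude Opus model on the user's extensive memory system and learned rules to improve its reasoning and judgment. Retraining would have to happen on the vendor side; it is not something a user can apply from the Claude Code CLI, so it is a direction for investigation rather than an immediate workaround.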

Guidance

  • Review the user's memory system and learned rules to identify potential gaps or inconsistencies that may be contributing to the model's poor performance.
  • Consider retraining the model with a larger dataset that includes the user's memories and rules to improve its ability to verify information and challenge assumptions.
  • Evaluate the model's performance on a set of test cases that simulate the user's complex MCP server work and financial data analysis to identify areas where it may be failing.
  • Investigate the possibility of overfitting or degradation of the model over time, which may be causing it to produce confident-but-wrong conclusions.

Example

No specific code snippet is provided, but the user's memories and rules could be used to create a custom dataset for retraining the model. For example, the memories could be used to generate test cases that simulate real-world scenarios, and the rules could be used to evaluate the model's performance on those test cases.
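
The idea of turning stored rules into test cases can be tried without any retraining by expressing each rule as a check over a session transcript. The sketch below is illustrative only: the rule names, the regular expressions, and the transcript format (a list of assistant turns as strings) are all assumptions, loosely modeled on the failure modes listed in the issue.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A stored 'never do this again' lesson expressed as a regex (hypothetical)."""
    name: str
    forbidden: re.Pattern


# Illustrative rules; real rules would come from the user's own memory system.
RULES = [
    Rule("no-permission-for-obvious-fixes",
         re.compile(r"\b(should|shall|may|can) I (fix|go ahead)\b", re.IGNORECASE)),
    Rule("no-speculation-without-checking",
         re.compile(r"\b(that's|that is|it's|it is) (likely|probably)\b", re.IGNORECASE)),
    Rule("no-relaying-tool-output",
         re.compile(r"\b(paste|send|share) (me )?the output\b", re.IGNORECASE)),
]


def violations(assistant_turns: list[str]) -> dict[str, int]:
    """Count how many assistant turns trip each rule."""
    counts = {rule.name: 0 for rule in RULES}
    for turn in assistant_turns:
        for rule in RULES:
            if rule.forbidden.search(turn):
                counts[rule.name] += 1
    return counts


sample = [
    "That's likely a permissions issue.",
    "Should I fix the parser?",
    "I ran the script and read the traceback; the path argument is wrong.",
]
print(violations(sample))
```

Pattern matching of this kind is crude and will miss paraphrases, but per-rule counts compared across sessions give a reproducible signal that can be attached to a regression report instead of an impression.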

Notes

The issue may be related to the model's inability to effectively utilize the user's extensive memory system and learned rules, which could be due to a variety of factors such as overfitting, degradation, or insufficient training data. Further investigation is needed to determine the root cause of the issue.

Recommendation

The recommended direction is vendor-side retraining or fine-tuning of the model with the user's memory system and learned rules, as this may help to improve its reasoning and judgment capabilities. The rationale given is that it targets the specific issues reported by the user and may be more effective than simply upgrading to a new version of the model.
