claude-code - 💡(How to fix) Fix Quality regression in Opus reasoning/judgment across sessions (user-reported) [1 participants]

claude-code · 2026-04-10 08:02:16

GitHub stats

anthropics/claude-code#46129 • Fetched 2026-04-11 06:28:19

Author

Wittlesus

Participants

Wittlesus

Timeline (top)

labeled ×3, closed ×1

Long-term power user (287 sessions, 1196 memories, complex multi-project workspace) reporting noticeable quality regression in Claude Opus reasoning and judgment compared to sessions from 1-2 weeks ago.

Summary

Observed Behaviors (Session 287, April 9-10 2026)

Verification failures:

Repeatedly speculated about root causes instead of running one command to check (e.g., said "that's likely a permissions issue" about a crash without reading the script)
Declared a broken parser "worked correctly" when the output showed 0 findings with a "NEEDS WORK" verdict -- an obvious contradiction
Diagnosed "stale pickle" for an auth failure when the actual bug was passing a full file path where the library expected a filename suffix
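
The parser contradiction above (0 findings alongside a "NEEDS WORK" verdict) is the kind of mismatch a one-line sanity check can surface before anyone declares success. The following is a minimal sketch, not the actual audit script: the AuditResult type, its field names, and the rule that a clean verdict implies zero findings are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    """Hypothetical shape of a parser/audit result (not from the issue)."""
    findings: int  # number of issues the parser reported
    verdict: str   # overall verdict string, e.g. "PASS" or "NEEDS WORK"


def is_consistent(result: AuditResult) -> bool:
    """Return False when the verdict and the findings count contradict each other.

    Assumed rule: "NEEDS WORK" must be backed by at least one finding,
    and any other verdict must come with zero findings.
    """
    needs_work = result.verdict.strip().upper() == "NEEDS WORK"
    return needs_work == (result.findings > 0)


if __name__ == "__main__":
    # The case described in the report: no findings, yet a failing verdict.
    suspicious = AuditResult(findings=0, verdict="NEEDS WORK")
    if not is_consistent(suspicious):
        print("Contradiction: verdict and findings disagree; inspect the parser.")
```

A check like this does not say which side is wrong; it only flags that the output should not be reported as "worked correctly" without a closer look.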

Judgment failures:

Built a financial advisory ensemble (4 personas) that unanimously recommended the user double down on 0-day-to-expiration options based on ONE day of trading data (26 trades)
No persona dissented or flagged the small sample size
User had to catch the survivorship bias himself: "you guys are practically advising me to do 0DTE options on gut feelings because it worked out for one day"
Compared pre-system trades to post-system trades and drew invalid conclusions about strategy effectiveness
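
To make the sample-size concern concrete, here is a rough sketch that computes a Wilson score interval for a win rate measured over 26 trades. The issue does not state how many of the 26 trades were winners, so the figure of 18 wins below is purely hypothetical. The interval also assumes independent trades, which trades placed on a single day are unlikely to be, so the true uncertainty is probably larger than what this shows.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical: 18 winning trades out of the 26 mentioned in the report.
    low, high = wilson_interval(wins=18, n=26)
    print(f"observed win rate: {18 / 26:.2f}")
    print(f"approx. 95% interval: {low:.2f} to {high:.2f}")
```

With a sample this small the interval spans more than 30 percentage points, which is the sort of caveat a dissenting persona could have raised.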

Process failures:

Bypassed a built skill (/Pithwaddle audit) with a manual agent when the skill's script crashed, instead of fixing the script
Asked the user to relay tool output instead of capturing it
Asked permission to fix obvious bugs 3+ times in one session (anti-pattern the user has corrected dozens of times before)
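
As an illustration of capturing tool output directly rather than asking the user to relay it, the sketch below runs a command and collects its exit code, stdout, and stderr with Python's standard subprocess module. The run_and_capture helper and the example command are assumptions for illustration; they are not part of Claude Code or of the user's skill.

```python
import subprocess
import sys


def run_and_capture(cmd: list[str], timeout: float = 60.0) -> tuple[int, str, str]:
    """Run a command and return (exit code, stdout, stderr) as text."""
    completed = subprocess.run(
        cmd,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode bytes to str
        timeout=timeout,      # avoid hanging forever on a stuck tool
    )
    return completed.returncode, completed.stdout, completed.stderr


if __name__ == "__main__":
    # Stand-in command: a tiny Python one-liner in place of the real tool.
    code, out, err = run_and_capture([sys.executable, "-c", "print('hello')"])
    print(f"exit={code} stdout={out.strip()!r} stderr={err.strip()!r}")
```

Keeping stderr and the exit code alongside stdout means a crash, such as the skill's failing script, can be diagnosed from the captured output instead of being guessed at.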

Context

This user has an extensive memory system (1196 memories), learned rules, pain logs, and anti-patterns documented across 287 sessions. Many of these failure modes have been explicitly stored as "never do this again" memories -- yet they recurred in a single session. The user noted: "last session, before your depreciation issues you guys definitely would have caught that."

Environment

Model: claude-opus-4-6 (1M context)
Platform: Windows 11, Claude Code CLI
Session type: Long session (~3 hours), complex MCP server work + financial data analysis

Expected vs Actual

Expected: Opus-level reasoning -- verify before concluding, challenge own assumptions, catch obvious contradictions, provide dissenting perspectives in advisory contexts
Actual: Pattern of confident-but-wrong conclusions, groupthink in multi-persona outputs, failure to apply stored lessons from prior sessions

User Impact

User trust eroded within a single session
Incorrect financial advice given (caught by user before acting on it)
Multiple bugs shipped that should have been caught pre-commit

The user explicitly requested this issue be filed.

extent analysis

TL;DR

The most likely fix or workaround is to retrain or fine-tune the Claude Opus model with the user's extensive memory system and learned rules to improve its reasoning and judgment.

Guidance

Review the user's memory system and learned rules to identify potential gaps or inconsistencies that may be contributing to the model's poor performance.
Consider retraining the model with a larger dataset that includes the user's memories and rules to improve its ability to verify information and challenge assumptions.
Evaluate the model's performance on a set of test cases that simulate the user's complex MCP server work and financial data analysis to identify areas where it may be failing.
Investigate the possibility of overfitting or degradation of the model over time, which may be causing it to produce confident-but-wrong conclusions.

Example

The issue includes no code snippet. The user's memories and rules could, however, serve as a custom dataset for retraining the model: the memories could be used to generate test cases that simulate real-world scenarios, and the rules could be used to score the model's performance on those test cases.

Notes

The issue may stem from the model failing to use the user's memory system and learned rules effectively, possibly because of overfitting, degradation, or insufficient training data. Further investigation is needed to determine the root cause.

Recommendation

Apply a workaround by retraining or fine-tuning the model with the user's memory system and learned rules, as this may help to improve its reasoning and judgment capabilities. This approach is recommended because it is a targeted solution that addresses the specific issues reported by the user, and it may be more effective than simply upgrading to a new version of the model.
