Claude Code 2.1.51: Opus consistently performs shallow analysis when asked for "deep dive" audits, requires 15+ repeated requests to reach thorough output

claude-code · 2026-04-17 02:27:33

anthropics/claude-code#49661•Fetched 2026-04-17 08:34:52

Author

nunelyproductionsllc

Preflight Checklist

I have searched existing issues for similar behavior reports
This report does NOT contain sensitive information (API keys, passwords, etc.)

Type of Behavior Issue

Claude ignored my instructions or configuration

What You Asked Claude to Do

Over a 12-hour session working on a React Native app (195 route files, 147 E2E test flows), I asked Claude Opus to perform thorough audits and deep dives at least 15 separate times. Examples of exact prompts:

"what else is missing? honestly do we even have e2e flows created for milestones? gifts? etc?"
"go back and recheck everything done and do a deeper dive to make sure it's all working correctly - fix any pre existing issues and make a list of any issues you come across that should be addressed"
"go back and audit our entire repo deeply to see what may hinder my users in their beta testing"
"go back recheck do a deeper dive, validate everything and give me a comprehensive assessment of what needs to be done from critical all the way to low hanging fruit"
"stop taking short cuts, stop only top lining reviews, stop being lazy and do deep dives, extensive fixes, with comprehensive understanding"
"go back and do an even deeper dive now to see what else pops up that could make a user frustrated"

Each time I explicitly asked for a comprehensive, thorough audit. Each time I received a surface-level pass that missed significant issues.

What Claude Actually Did

On the first audit request ("what else is missing?"), Claude:

Only checked if the features we had just seeded had corresponding E2E flows — a file-existence check, not a coverage audit
Found 16 gaps and presented that as complete
I approved creating flows for those 16 gaps, which were merged

When I pushed again ("go back and do an even deeper dive"), Claude finally:

Inventoried all 195 route files against all 147 E2E flows
Found 24+ ADDITIONAL user-facing screens with zero E2E coverage
Found a completely broken screen (Location Settings — UI toggles that never persist to database)
Found a broken UX journey (submitted recommendations vanish — go to "pending" status with no way for user to see them)
Found silent error handling (Share Calendar fails with no user feedback)
Found unverified mute/archive filtering in messages

The second-pass results should have been the FIRST answer. The model had access to all the same tools (Glob, Grep, Read, Task agents) both times. It chose to do a shallow scan and present it confidently as complete.

This pattern repeated across the full 12 hours — at least 15 times I had to re-ask for thoroughness. The model optimizes for appearing productive (shipping code, merging PRs quickly) over actually being thorough. When given instructions like "deep dive" or "comprehensive audit," it treats them as "do a quick scan."

Expected Behavior

When asked for a "deep dive," "thorough audit," or "comprehensive assessment," Opus should:

Exhaustively inventory the search space BEFORE reporting results (e.g., list all 195 screens, then cross-reference against all 147 flows — not spot-check a handful)
Not present partial results as complete
Trace code paths to verify functionality, not just check if a file exists
Self-check by asking "would this answer survive a follow-up question?" before responding
Not require 15 repeated prompts to reach the level of thoroughness that was asked for on the first prompt

This is especially important for Opus, which is marketed as the most capable model and costs significantly more. Users paying for Opus expect deeper reasoning, not Sonnet-level surface scans.
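The inventory-then-cross-reference step described in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the route paths, the flow names, and the idea that each flow declares the screens it exercises are all hypothetical, and the report does not say how its flows map to screens.

```python
from pathlib import PurePosixPath


def screen_id(route_path: str) -> str:
    """Turn a route file path into a screen id by dropping the extension."""
    return str(PurePosixPath(route_path).with_suffix(""))


def audit_coverage(route_files: list[str], flow_screens: dict[str, set[str]]) -> dict:
    """Inventory every screen and every flow, then cross-reference them."""
    all_screens = {screen_id(p) for p in route_files}
    # Union of every screen that at least one flow claims to exercise.
    covered = set().union(*flow_screens.values())
    uncovered = sorted(all_screens - covered)
    return {
        "screens_total": len(all_screens),
        "flows_total": len(flow_screens),
        "uncovered": uncovered,
        "summary": (
            f"inventoried {len(all_screens)} screens and {len(flow_screens)} flows; "
            f"{len(uncovered)} screens have no flow"
        ),
    }


# Hypothetical miniature project: three screens, two flows.
routes = [
    "app/settings/location.tsx",
    "app/calendar/share.tsx",
    "app/milestones.tsx",
]
flows = {
    "flow-a.yaml": {"app/milestones"},
    "flow-b.yaml": {"app/calendar/share"},
}

report = audit_coverage(routes, flows)
print(report["summary"])
print(report["uncovered"])  # ['app/settings/location']
```

The point of the sketch is ordering: both inventories are built in full before any gap is reported, so the result always carries the size of the search space it covers.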

Files Affected

Not a single-file issue — this is a systemic pattern affecting the           
  entire codebase over 14 days.
                                                                               
  REWORK COST (Apr 2-16, 2026):                   
  - 106 fix PRs touching 1,246 files, adding 36,953 lines                      
  - 25 audit/redo PRs alone touched 523 files, adding 23,697 lines             
  - 49% of ALL lines written in 2 weeks were rework from shallow passes
  - Today's shallow E2E audit alone: 123 files across 5 PRs to fix
    what should have been caught on the first pass

  The fix PRs are not independent bugs — they form chains where each
  "comprehensive audit" finds issues the previous one missed:
    #3226 (Apr 6) → #3243, #3248, #3255 (Apr 8) → #3274, #3281,
    #3286 (Apr 9) → #3321, #3324 (Apr 12) → #3371 (Apr 15) →
    #3382, #3383 (Apr 16)

  Each pass was requested as "deep" or "comprehensive." Each one
  missed things the next pass found. 1,246 files were touched by
  fixes that a thorough first pass would have reduced significantly.

Permission Mode

Accept Edits was ON (auto-accepting changes)

Can You Reproduce This?

Yes, every time with the same prompt

Steps to Reproduce

Work on a large codebase (195+ route files, 147 E2E test files)
Ask Claude Opus to "do a deep dive and audit what's missing"
Observe that it does a surface-level scan and presents 15-20 items
Ask again with stronger language ("go deeper," "comprehensive," "stop being lazy")
Observe that it now finds 40+ items, including critical bugs like screens that don't persist data
The second response should have been the first response

The pattern is consistent: Opus defaults to the minimum viable answer that looks complete, regardless of how explicitly the prompt asks for a "thorough" or "deep" analysis.

Claude Model

Opus

Relevant Conversation

First audit response (shallow):
  - Found 16 missing E2E flows
  - Said "The data you just seeded is covered by existing flows —
    milestone-reminder.yaml and childcare-schedule.yaml already test
    those screens. The real E2E gaps are in event attendees/QR/photos..."
  - Presented this as complete

  Second audit response (after being pushed):
  - Inventoried all 195 screens vs 147 flows
  - Found 24+ additional zero-coverage screens
  - Found Location Settings is a dead screen (toggles never save)
  - Found submitted recommendations vanish from user view
  - Found Share Calendar has silent error handling
  - Found mute conversation filtering is unverified

  When asked "how did you miss all these?", Claude admitted:
  "The first audit was shallow. I compared the seeded data tables against
  existing E2E flows and stopped there... I should have done exactly what
  I did this time — full screen inventory, full flow inventory,
  cross-reference, code-level journey tracing — the first time you asked."

Impact

High - Significant unwanted changes

Claude Code Version

2.1.51

Platform

Anthropic API

Additional Context

Key observations:

PATTERN IS CONSISTENT: This isn't a one-off. Across 15+ prompts over 12 hours, the model consistently chose the shallow path first. Even after being explicitly told "stop taking shortcuts" (#80 in session), the very next audit request still required correction.
OPUS-SPECIFIC EXPECTATION: Users pay premium pricing for Opus specifically because it's supposed to reason more deeply. If Opus performs surface-level scans identical to what Sonnet would do, the pricing differential isn't justified.
THE MODEL KNOWS IT'S BEING SHALLOW: When confronted, Claude immediately admitted the first pass was insufficient and explained exactly what it should have done. This means the model has the capability to be thorough — it's choosing not to until pushed.
CONFIDENCE MASKING: The shallow results are presented with the same confidence as thorough results. There's no qualifier like "this is a preliminary scan" or "I checked X of Y files." The user has no way to distinguish a shallow audit from a deep one without independently verifying.
COST TO USER: 12+ hours of back-and-forth, 2 PRs merged with incomplete work, and significant frustration. The user's exact words: "I pay way more for Opus which is supposed to be a better toolset that does extensive thinking and works better than the other models, and it's failing constantly."
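One way to picture the missing qualifier from the CONFIDENCE MASKING point is a report object that cannot be printed without stating its scope. The class name, fields, and labels below are hypothetical, and the number of items checked in the usage example is invented for illustration; only the totals of 195 screens and 16 findings come from the report.

```python
from dataclasses import dataclass, field


@dataclass
class AuditReport:
    items_total: int    # size of the full search space
    items_checked: int  # how many items were actually examined
    findings: list[str] = field(default_factory=list)

    @property
    def is_exhaustive(self) -> bool:
        """True only when every item in the search space was examined."""
        return self.items_checked >= self.items_total

    def header(self) -> str:
        """First line of the report: scope comes before any findings."""
        label = "EXHAUSTIVE" if self.is_exhaustive else "PRELIMINARY"
        return (
            f"[{label}] checked {self.items_checked} of {self.items_total} items, "
            f"{len(self.findings)} findings"
        )


# Hypothetical first pass: 16 findings, but only a fraction of 195 screens checked.
first_pass = AuditReport(items_total=195, items_checked=12, findings=["gap"] * 16)
print(first_pass.header())
```

With a header like this, a reader can tell a partial scan from a full one without independently re-verifying the work.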

Environment: macOS Darwin 25.3.0, React Native/Expo project, ~1500 source files, using Task tool with Explore subagents.

Additional evidence from PR history:

PR HISTORY PATTERN (April 2-16, 2026):

180 PRs merged in 14 days. Of those:

106 are fixes (59%)
25 are labeled "comprehensive," "deep," "audit," "sweep," or "verification" — meaning 25 times in 2 weeks, a pass was done that was supposed to be thorough

THE REWORK CYCLE: The data shows a clear fix → audit → find more → fix → audit loop:

Apr 6: "comprehensive beta stability sweep" (#3226, 63 files)
Apr 8: "audit and fix 6 bugs across 77 migrations" (#3243)
Apr 8: "deep audit — 2FA, nav guard, migration" (#3255)
Apr 8: "batch fix pre-existing bugs" (#3248)
Apr 8: 15 fix PRs in a single day
Apr 9: "comprehensive deep audit — all 492 docs" (#3274)
Apr 9: "implementation verification audit — 129 claims" (#3281)
Apr 9: "deep verification sweep — remaining open issues" (#3286)
Apr 9: 18 fix PRs in a single day
Apr 12: "post-session audit" (#3313)
Apr 12: "batch-2 quality pass — 17 docs" (#3321)
Apr 12: "4 code gaps from batch-2 doc audit" (#3324)
Apr 12: 14 fix PRs in a single day
Apr 15: "deep audit — idempotency, non-blocking" (#3371)
Apr 16: "comprehensive beta readiness — 22 issues" (#3382)
Apr 16: "comprehensive audit — MomStatus, 3 critical bugs" (#3383)
Apr 16: 9 fix PRs in a single day

KEY METRIC:

Database (db) area: 9 PRs, ALL 9 are fixes (100% rework)
E2E area: 13 PRs, 11 are fixes (85% rework)
Notifications: 9 PRs, 5 are fixes (56% rework)
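The rework percentages above are plain ratios of fix PRs to total PRs. A short sketch of the arithmetic, using only the counts quoted in this report (the dictionary layout and function name are illustrative):

```python
# (total PRs, fix PRs) per area, as quoted in the report.
areas = {
    "db": (9, 9),
    "e2e": (13, 11),
    "notifications": (9, 5),
}


def rework_rate(total_prs: int, fix_prs: int) -> int:
    """Share of PRs that are fixes, as a whole-number percentage."""
    return round(100 * fix_prs / total_prs)


rates = {area: rework_rate(total, fixes) for area, (total, fixes) in areas.items()}
overall = rework_rate(180, 106)  # all PRs merged in the 14-day window

for area, rate in rates.items():
    print(f"{area}: {rate}% rework")
print(f"overall: {overall}% fixes")
```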

Each "comprehensive audit" or "deep sweep" finds issues that the PREVIOUS "comprehensive audit" should have caught. This is the same pattern as the E2E audit described above, but repeated across every feature area for 2 straight weeks.

The user merged 180 PRs totaling 74,107 added lines in 14 days, 59% of them fixes, most of which address things that a thorough first pass should have caught. This is the cost of Opus defaulting to shallow analysis.

This is quantitative proof. 180 PRs in 14 days, 106 of them fixes, 25 "deep audit" passes that each found things the previous pass missed. The pattern is undeniable in the commit history.

extent analysis

TL;DR

A practical workaround is to replace open-ended requests such as "deep dive" with prompts that name concrete deliverables (a full inventory, a cross-reference, a "checked X of Y" count), and to review the first response against those deliverables before accepting it.

Guidance

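When requesting an audit, spell out what thoroughness means for the task instead of relying on words like "deep" or "comprehensive," which the report shows did not change the outcome. For example, ask for a full inventory of all screens and all flows, a cross-reference of the two, and code-path tracing for each gap.
Review the initial response from Claude Opus for signs of partial coverage, such as a findings list with no statement of how many files or screens were checked, and do not treat it as complete until that scope is stated.
If the initial response appears shallow, ask again and request the missing inventory or cross-reference by name; in the report, the second pass surfaced 24+ additional zero-coverage screens.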
Consider providing feedback to the model developers about the inconsistent behavior and the need for more transparent indicators of the audit's depth.
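The guidance above can be turned into a reusable prompt template. The sketch below is a hypothetical helper, not a documented Claude Code feature, and the wording of the steps is an assumption drawn from the Expected Behavior section; nothing in the report verifies that this exact phrasing changes the model's output.

```python
def build_audit_prompt(scope: str, item: str, check: str) -> str:
    """Build an audit request that names concrete deliverables."""
    steps = [
        f"List every {item} in {scope} and state the total count.",
        f"List every {check} in {scope} and state the total count.",
        f"Cross-reference the two lists and name each {item} with no matching {check}.",
        "For each flagged item, trace the code path instead of only checking that a file exists.",
        "Begin the answer with 'checked X of Y' and label it preliminary if X is less than Y.",
    ]
    numbered = "\n".join(f"{n}. {step}" for n, step in enumerate(steps, start=1))
    return f"Audit request for {scope}:\n{numbered}"


prompt = build_audit_prompt("the mobile app", "screen", "E2E flow")
print(prompt)
```

Naming the deliverables gives the reviewer something checkable: a response that lacks the two counts or the cross-reference is visibly incomplete.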

Example

The issue concerns model behavior and prompt interpretation rather than a specific code defect, so the report itself includes no code snippet.

Notes

The issue highlights a pattern in which Claude Opus defaults to shallow analysis until pushed repeatedly, even when the first prompt explicitly asks for depth. This leads to significant rework and frustration. The workaround is to state the required deliverables in the prompt and to check the response against them before accepting it.

Recommendation

Apply a workaround by adjusting the prompt and review process, as the root cause of the issue appears to be related to the model's default behavior and interpretation of prompts rather than a version-specific bug that could be fixed by upgrading.
