claude-code - 💡(How to fix) Fix Categorized regression analysis: Opus 4.7



Technical Feedback: Claude Opus 4.7 — Categorized Regression Analysis

Date: May 12, 2026
Profile: IT architect and systems engineer, Claude MAX. Building a multi-agent AI infrastructure (170+ files, Docker) and a trading platform (~49K LOC, Rust/Python/React). Six months of intensive daily work on Opus 4.6.
Purpose: Structured technical feedback with independent evidence, submitted by a committed user who wants Claude to improve.


This is not a prompting issue — and that narrative needs to stop

Before covering the regressions, I need to address the framing that has emerged around Opus 4.7: that users experiencing problems should adjust their prompts. I have done exactly that — extensively. I followed Anthropic's official best practices. I had 4.7 itself audit my operational procedures built with 4.6; it rated them over 95% correct and said I would need less reinforcement. I ran hundreds of attempts with different approaches, phrasings, and effort levels. Nothing resolved the problems. In several cases following the official advice degraded the output further.

The data backs this up. MRCR v2 long-context retrieval dropped from 91.9% to 59.2% — a 32.7 percentage point collapse that has nothing to do with how anyone writes a prompt. τ²-Bench multi-step tasks regressed 3.5 points. BrowseComp dropped 4.4 points. SonarQube found Blocker and Critical vulnerabilities increasing in generated code. An AMD senior engineer analyzed 6,852 Claude Code sessions and concluded the model cannot be relied on for complex engineering work.
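The MRCR v2 figure above is an absolute difference in percentage points; measured relative to the starting score, the decline is larger. A minimal sketch of that arithmetic, using only the two scores cited in this report (the function names are illustrative, not from any benchmark tooling):

```python
def point_drop(before: float, after: float) -> float:
    """Absolute change in percentage points (pp)."""
    return round(before - after, 1)


def relative_drop(before: float, after: float) -> float:
    """Change relative to the starting score, in percent."""
    return round((before - after) / before * 100, 1)


# MRCR v2 long-context retrieval scores as cited in this report.
mrcr_before, mrcr_after = 91.9, 59.2

print(point_drop(mrcr_before, mrcr_after))     # 32.7
print(relative_drop(mrcr_before, mrcr_after))  # 35.6
```

Either way of stating the change is valid, but the two should not be mixed when comparing benchmarks.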

As one user put it: "Anthropic dominated previously because it made users not required to do the prompt engineering step, and then they suddenly punish poor prompting? It's a worse model, hands down." Even Boris Cherny, Head of Claude Code, admitted publicly that he needed a few days to learn to work with it. If the product's lead engineer needs days to adapt, the problem is in the product.

This report documents what I have lived through daily for a month — weeks of prior work lost, months of recovery ahead. The research I cite grew from that experience, not from academic interest.


1. Code quality — Critical

This is the most damaging regression. Opus 4.7 produces code that doesn't run, misses edge cases 4.6 handled, and introduces security vulnerabilities at higher rates. SonarQube analysis across 336,000+ lines found vulnerability density at 0.29/kLOC, with Blocker and Critical categories increasing versus 4.6 — crypto misconfigurations at 57/MLOC, hard-coded credentials at 45/MLOC, path traversal at 24/MLOC.
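The densities above mix two units (per kLOC and per MLOC). As a rough sketch, they can be put on one scale and turned into implied finding counts. This assumes every density was measured over the same 336,000-line corpus, which the cited analysis may or may not have done; the helper names are hypothetical.

```python
# Figures quoted from the cited SonarQube analysis.
LINES_ANALYZED = 336_000   # reported as "336,000+" lines, so counts are lower bounds
TOTAL_PER_KLOC = 0.29      # overall vulnerability density
NAMED_PER_MLOC = {
    "crypto misconfigurations": 57,
    "hard-coded credentials": 45,
    "path traversal": 24,
}


def per_kloc_to_per_mloc(density: float) -> float:
    """Convert a per-thousand-lines density to per-million-lines."""
    return density * 1_000


def implied_findings(per_mloc: float, lines: int = LINES_ANALYZED) -> float:
    """Findings implied by a per-MLOC density over `lines` lines of code."""
    return per_mloc * lines / 1_000_000


total_per_mloc = per_kloc_to_per_mloc(TOTAL_PER_KLOC)
print(f"overall: {total_per_mloc:.0f}/MLOC, "
      f"about {implied_findings(total_per_mloc):.0f} findings")
for name, density in NAMED_PER_MLOC.items():
    print(f"{name}: about {implied_findings(density):.0f} findings")
```

On these figures the overall density is 290/MLOC, which makes it directly comparable with the per-category numbers.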

Worse than the new code is what it does to existing code. When asked to extend or modify working modules written by 4.6, the model refactors things nobody asked it to touch, changes logic that was stable, introduces regressions, and disorganizes file structures. Months of carefully maintained architecture get disrupted in a single session. This turns the assistant from a tool into a liability.

Implementations are frequently incomplete: the model stops mid-task, delivers partial solutions, or declares completion with critical pieces missing. One controlled study (hyperdev) found that 4.6 wrote all source files correctly in a single pass with zero edits, while 4.7 needed 5 additional Edit calls to fix its own output, used 2.9x more output tokens, and cost 3.6x more for the same result.

Cognitive complexity per line of code rose 29.5% (171/kLOC vs 132/kLOC) — shorter code but denser and harder to maintain. The net result: more expensive, more vulnerable, harder to review, no more functional.


2. Instruction following — Critical

The model demands rigorous, specific input — then ignores every constraint, guardrail, and format requirement you gave it. It is rigid where flexibility is needed and anarchic where discipline is required.

My procedures were validated at 95%+ by 4.7 itself and never followed once. Not from the first session. The model never completed any task to usable standard — not even simple ones. Multi-step instruction chains break down by step 3-4 of sequences that 4.6 executed reliably to completion.

Decision-making is unstable: the model proposes approach A, starts executing B, then suggests reverting to C, often within a single response. Combined with sycophantic agreement ("Great approach!", "Absolutely right!") while doing something completely different, this creates an illusion of alignment that masks total non-compliance. You think you are collaborating; the model is off doing its own thing.


3. Reasoning — Severe

The extended thinking process has become the opposite of what it should be: voluminous but shallow. The chain-of-thought generates enormous blocks of text that circle the same points without converging on anything actionable. The model loses track of what it was analyzing mid-reasoning, forgets constraints, drifts into tangential analysis, and regularly contradicts in its output what it concluded in its thinking.

The most disruptive behavior is what I call single-prompt spiraling: even on a simple, direct question, the model generates pages of circular, dispersive reasoning — losing coherence within a single turn, producing output that takes more effort to parse than doing the work yourself. Research studies confirm that on straightforward tasks, longer reasoning actively hurts performance. The model applies deep reasoning indiscriminately, including where a direct answer would be faster and more accurate.

The replacement of the manual Extended Thinking toggle with Adaptive Thinking removed user control over reasoning depth — cutting off the one lever professional users had to manage cost and quality.


4. Verbosity — Severe

Output volume runs 2-5x higher than equivalent 4.6 responses with no proportional increase in useful content. Everything gets over-formatted with headers, bullets, nested lists, and tables — even when you explicitly ask for plain text. As one reviewer noted: "4.6 is the sommelier who hands you the glass. 4.7 is the sommelier who walks you through the terroir."

Every response explains what it will do, explains what it is doing, then explains what it just did. The actual work — the code, the decision, the analysis — is buried inside paragraphs of scaffolding.

This is not a style preference. Combined with the tokenizer change (12-45% more tokens per input) and iterative self-correction cycles, it produces a 2-4x real-world cost increase per unit of useful output. Subscription limits that lasted a full cycle under 4.6 now exhaust in a fraction of the time, even on simple tasks.


5. Safety classifier — Severe

The AUP classifier has shifted from context-aware evaluation to keyword matching. Standard software engineering terminology — security, encryption, shell, injection (SQL) — triggers refusals on legitimate professional work. When it doesn't block outright, it distorts outputs by softening technical accuracy or omitting critical details, producing code that is paradoxically less secure than what 4.6 delivered freely.
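To illustrate why keyword matching misfires on ordinary engineering requests, here is a toy filter. It is not Anthropic's classifier, whose internals are not public; the term list is simply the four words named above, and `keyword_flag` is a hypothetical function written for this example.

```python
import re

# The four terms this report says trigger refusals; used here only as a toy list.
FLAGGED_TERMS = {"security", "encryption", "shell", "injection"}


def keyword_flag(text: str) -> bool:
    """Flag a request if it contains any listed term, ignoring all context."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & FLAGGED_TERMS)


# Routine, defensive requests that a context-blind filter still flags.
requests = [
    "Review this parameterized query so it is safe against SQL injection.",
    "Write a shell script that rotates the nginx logs nightly.",
    "Enable encryption at rest for the Postgres volume.",
]

for request in requests:
    print(keyword_flag(request), "-", request)
```

All three requests are flagged even though each is defensive or routine work. A term list cannot separate intent from vocabulary, which is the failure mode described in this section.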

An LSU Cyber Center director, paying $200+/month, was refused help proofreading exercises from his own published cybersecurity textbook. GitHub issues documenting false positives have multiplied since launch (#48442, #49679, #49751, #50916, #50795, #51352, #51794, #52086). The Register published a detailed investigation on April 23.

There is no effective appeal mechanism. Refusals are binary and final. Combined with the other regressions, a single refused-then-retried task can burn 5-10x the tokens it would have cost under 4.6.


6. Research and retrieval — Moderate

Web search and information synthesis have degraded. Results come back in avalanche format: large volumes of loosely related content without prioritization, requiring heavy manual filtering. The MRCR v2 regression (-32.7pp on long-context retrieval) explains part of this at the model level. Research tasks consume dramatically more tokens while delivering less useful output, so the cost per useful finding has risen sharply.


7. Real-world cost — Severe

Anthropic states the tokenizer may increase token count by up to 35%. That figure covers only the tokenizer in isolation and understates the real impact significantly.

Independent measurements tell a different story. OpenRouter, analyzing over 1 million real requests, found tokenizer inflation of 32-34% on prompts above 10K tokens and 42-45% on prompts under 2K tokens. But that is just one component. The hyperdev controlled study measured 2.9x more output tokens per task (behavioral, not tokenizer), 4.8x more cache read tokens (extended internal reasoning), and 3.6x total cost. Artificial Analysis found 4.7 generating 110 million tokens versus a 36 million average for comparable models, roughly three times the market norm. Finout documented overnight production cost jumps from $500 to $675/day (+35%).
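The way these components combine into one cost multiplier can be sketched as a price-weighted sum. Everything below except the three multipliers is a placeholder: the per-token prices and the baseline token mix are invented for illustration, and applying tokenizer inflation to input tokens only is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_tokens: float
    output_tokens: float
    cache_read_tokens: float


# Placeholder prices in dollars per million tokens; NOT real list prices.
PRICES = {"input": 10.0, "output": 50.0, "cache_read": 1.0}


def task_cost(usage: Usage) -> float:
    """Dollar cost of one task under the placeholder prices."""
    return (
        usage.input_tokens * PRICES["input"]
        + usage.output_tokens * PRICES["output"]
        + usage.cache_read_tokens * PRICES["cache_read"]
    ) / 1_000_000


# Hypothetical baseline token mix for a single task.
baseline = Usage(input_tokens=20_000, output_tokens=5_000, cache_read_tokens=200_000)

# Multipliers taken from the figures cited above.
inflated = Usage(
    input_tokens=baseline.input_tokens * 1.33,            # tokenizer, within the 32-34% range
    output_tokens=baseline.output_tokens * 2.9,           # behavioral output growth
    cache_read_tokens=baseline.cache_read_tokens * 4.8,   # cache read growth
)

multiplier = task_cost(inflated) / task_cost(baseline)
print(f"total cost multiplier: {multiplier:.2f}x")
```

The total multiplier is a cost-weighted average of the component multipliers, so it always falls between the smallest and the largest of them. Where it lands depends on the workload's token mix, which is why a tokenizer-only figure understates the effect for output-heavy or cache-heavy tasks.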

In my real-world usage, token consumption has run 2-4x higher on every type of task, reaching 4x on 1-million-token context windows. Users report hitting subscription limits within 1-3 prompts. Sessions exhausted by a single prompt have been documented. You pay 2-4x more and get the same functional results — with more vulnerabilities in the code.


Recommendations

R1 — Model version pinning. The most requested feature from the developer community. Professional users building on Claude as infrastructure need to lock a working version. The "upgrade breaks everything, no rollback" cycle is unsustainable.

R2 — Keep Opus 4.6 available. Do not deprecate it until 4.7 demonstrably matches it on real-world quality — code reliability, instruction following, reasoning coherence, cost efficiency. Not on benchmarks. On actual work.

R3 — Context-aware safety. Move the classifier from keyword matching to context evaluation. A cybersecurity professor editing his own textbook and a malicious actor are not the same thing. The current system cannot tell the difference.

R4 — Verbosity controls. Give users explicit parameters they can set and the model actually respects. The current adaptive approach removes user agency precisely where it matters most.

R5 — Transparent cost communication. When a tokenizer change multiplies costs, say so clearly — including the behavioral multipliers, not just the raw tokenizer inflation. Users accept justified increases. They do not accept undisclosed ones.

R6 — Real-workflow regression testing. Benchmark scores are necessary but insufficient. Maintain a test suite of real multi-session engineering workflows that measure what professional users depend on: instruction adherence, code reliability, reasoning coherence, cost efficiency.
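A minimal sketch of the kind of gate R6 describes: run the same workflow under a baseline and a candidate model, then fail on regressions in completion, self-correction, or token use. The `RunMetrics` fields, the thresholds, and the sample numbers are all hypothetical; a real suite would fill them from logged sessions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    tasks_completed: int
    tasks_total: int
    edit_calls: int      # self-correction edits after the first pass
    total_tokens: int


def regressions(
    baseline: RunMetrics,
    candidate: RunMetrics,
    max_token_ratio: float = 1.25,   # hypothetical tolerance
    max_extra_edits: int = 1,        # hypothetical tolerance
) -> list[str]:
    """Return the list of regression checks that the candidate fails."""
    problems = []
    base_rate = baseline.tasks_completed / baseline.tasks_total
    cand_rate = candidate.tasks_completed / candidate.tasks_total
    if cand_rate < base_rate:
        problems.append("completion rate dropped")
    if candidate.edit_calls - baseline.edit_calls > max_extra_edits:
        problems.append("self-correction edits increased")
    if candidate.total_tokens > baseline.total_tokens * max_token_ratio:
        problems.append("token usage increased")
    return problems


# Made-up runs shaped like the hyperdev comparison (0 vs 5 edits, 2.9x tokens).
old_run = RunMetrics(tasks_completed=10, tasks_total=10, edit_calls=0, total_tokens=100_000)
new_run = RunMetrics(tasks_completed=10, tasks_total=10, edit_calls=5, total_tokens=290_000)

print(regressions(old_run, new_run))
```

A gate like this passes a candidate that finishes the same tasks at similar cost and flags one that only reaches the same result through extra edits and tokens, a difference that benchmark scores alone do not surface.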


Independent evidence summary

| What | Who measured it | Result |
| --- | --- | --- |
| Output tokens per task | hyperdev (controlled) | 2.9x increase |
| Cache tokens per task | hyperdev | 4.8x increase |
| Total cost per task | hyperdev | 3.6x increase |
| Execution time | hyperdev | 2.3x slower |
| Self-correction cycles | hyperdev | +5 Edit calls (4.6 needed 0) |
| Total tokens generated | Artificial Analysis | 110M vs 36M average (3x) |
| Tokenizer inflation, 10K+ tokens | OpenRouter (1M+ requests) | +32-34% |
| Tokenizer inflation, <2K tokens | OpenRouter | +42-45% |
| Long-context retrieval | MRCR v2 @ 256K | 91.9% → 59.2% (-32.7pp) |
| Multi-step tasks | τ²-Bench | -3.5pp |
| Web research | BrowseComp | -4.4pp |
| Blocker/Critical vulns | SonarQube (336K lines) | Increased vs 4.6 |
| Code complexity/line | SonarQube | +29.5% (171 vs 132/kLOC) |
| Crypto misconfigurations | SonarQube | 57/MLOC |
| Hard-coded credentials | SonarQube | 45/MLOC |
| Daily production cost | Finout | $500 → $675 (+35%) |
| Sessions analyzed | AMD senior engineer | 6,852; concluded unreliable |

All data from independent third parties.


I am filing this as someone who chose Claude as the foundation of his professional work and wants to keep it there. I hold Anthropic in high regard and I want its products to be the best available. But the trajectory of Opus 4.7 — where benchmark scores go up while real-world quality goes down, where costs multiply without disclosure, where safety blocks legitimate work, and where user complaints are answered with prompt engineering tips instead of model fixes — cannot continue.

I do not want to leave. I want Anthropic to fix this. This report, alongside thousands of similar ones from the community, contains what is needed to do so.
