claude-code - 💡(How to fix) Fix [MODEL] Opus 4.7: elevated unforced errors on 2.1.121 — three back-to-back hardware-debug sessions, "false verification" pattern correlated with CLI version bump [3 comments, 2 participants]

GitHub stats
anthropics/claude-code#54990 (fetched 2026-05-01 05:49:08)
Comments: 3
Participants: 2
Timeline: 7
Reactions: 0
Timeline (top): commented ×3, labeled ×3, cross-referenced ×1

Root Cause

  1. Recommended GPIO 20/21 for I2C OLED. Wrong — ARDUINO_USB_CDC_ON_BOOT=1 in platformio.ini claims GPIO 19/20 for native USB. Required physical USB-recovery to unbrick the board.
  2. Recommended GPIO 38/39 next. Wrong — GPIO 38 is reserved by octal PSRAM (qio_opi) on the N16R8 module. Required physical USB-recovery again.
  3. Recommended GPIO 47/48 next. Also wrong, similar root cause.

Code Example

N/A. This is a reasoning-quality observation, not a file-modification issue.

Preflight Checklist

  • I have searched existing issues for similar behavior reports
  • This report does NOT contain sensitive information (API keys, passwords, etc.)

Type of Behavior Issue

Claude's behavior changed between sessions

What You Asked Claude to Do

Embedded firmware development on an ESP32-S3 model railroad audio node, mostly hardware-touching work: GPIO pin assignments, OLED enable, OTA firmware flashes. Standard prompts of the form "what pins are free for the OLED?" or "amend to GPIO 38/39 and re-flash."

What Claude Actually Did

Across three sessions on 2026-04-29 (all Opus 4.7), Claude exhibited a "false verification" pattern: confidently recommending values without first reading the project's source / build config / chip docs that would rule them out, then doing the same thing again after the first failure, instead of generalizing.

Concrete example from session 790f1cce (squeal-feature work):

  1. Recommended GPIO 20/21 for I2C OLED. Wrong — ARDUINO_USB_CDC_ON_BOOT=1 in platformio.ini claims GPIO 19/20 for native USB. Required physical USB-recovery to unbrick the board.
  2. Recommended GPIO 38/39 next. Wrong — GPIO 38 is reserved by octal PSRAM (qio_opi) on the N16R8 module. Required physical USB-recovery again.
  3. Recommended GPIO 47/48 next. Also wrong, similar root cause.

Each recommendation was a confident-sounding short paragraph. None was preceded by a Read of platformio.ini or the relevant chip docs. The first failure didn't trigger generalization ("there's a class of restrictions I don't understand, let me find the rule"); it triggered another guess from the same incomplete heuristic.
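The missing pre-check can be sketched in a few lines of Python. This is a minimal illustration, not the project's tooling: the `platformio.ini` text, the `env:audio_node` section name, and the `board_build.arduino.memory_type` key are assumptions made for the example, and the reservation sets only mirror the claims in this report (GPIO 19/20 for native USB, the CLAUDE.md range 33–37 plus GPIO 38 for octal PSRAM). They have not been checked against Espressif documentation.

```python
import configparser

# Hypothetical platformio.ini. Only ARDUINO_USB_CDC_ON_BOOT=1 and qio_opi come
# from the report; the section name and key layout are assumed for illustration.
PLATFORMIO_INI = """
[env:audio_node]
board_build.arduino.memory_type = qio_opi
build_flags =
    -DARDUINO_USB_CDC_ON_BOOT=1
"""

# Reservation rules as claimed in the report, not verified against chip docs.
USB_CDC_PINS = {19, 20}
OCTAL_PSRAM_PINS = {33, 34, 35, 36, 37, 38}


def reserved_pins(ini_text: str, env: str) -> dict[int, str]:
    """Map each reserved GPIO to the build setting that claims it."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser[env]
    reserved: dict[int, str] = {}
    if "ARDUINO_USB_CDC_ON_BOOT=1" in section.get("build_flags", ""):
        reserved.update(
            {pin: "native USB (ARDUINO_USB_CDC_ON_BOOT=1)" for pin in USB_CDC_PINS}
        )
    if section.get("board_build.arduino.memory_type", "") == "qio_opi":
        reserved.update({pin: "octal PSRAM (qio_opi)" for pin in OCTAL_PSRAM_PINS})
    return reserved


def conflicts(candidates: tuple[int, ...], reserved: dict[int, str]) -> list[str]:
    """List every candidate pin that collides with a known reservation."""
    return [f"GPIO {pin}: {reserved[pin]}" for pin in candidates if pin in reserved]
```

With this configuration, `conflicts((20, 21), ...)` flags GPIO 20 and `conflicts((38, 39), ...)` flags GPIO 38. The report does not say which rule excluded 47/48, so the sketch does not encode one: an empty result means "no known conflict", not "safe to wire".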

Second instance — I2S driver migration (session 9c55afd1, also Opus 4.7, also 2.1.121):

The user asked for a migration from the legacy ESP-IDF I2S API to the new i2s_std.h API. Within ~7 minutes I had read the source, declared the migration scope clear, and replaced ~150 lines across e32sl_i2s_playback.cpp and e32sl_i2s_mic.cpp, including driver init, sample-rate switching, write paths, and uninstall, all without ever checking which IDF version was actually installed. When I finally compiled, I discovered that IDF 4.4.7 doesn't have driver/i2s_std.h at all: the new API is IDF 5.0+, and PlatformIO's espressif32 platform pins IDF 4.4.7. I had to revert all the work. A single grep for IDF, or a read of the platform version pinned in platformio.ini, at the start would have caught this in under 30 seconds.
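The version gate that was skipped is small enough to sketch. The helper names below are hypothetical, and the 5.0 threshold is taken from this report's statement that the new API is IDF 5.0+. How the installed version string is obtained (for example from the build environment) is left out on purpose.

```python
# Minimum ESP-IDF version that ships driver/i2s_std.h, per this report.
I2S_STD_MIN_IDF = (5, 0)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '4.4.7' or 'v5.1.2' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def has_i2s_std(idf_version: str) -> bool:
    """True when the given IDF version is new enough for the i2s_std.h API."""
    return parse_version(idf_version)[:2] >= I2S_STD_MIN_IDF
```

Running `has_i2s_std("4.4.7")` before touching either driver file returns `False`, which is the signal to stop and discuss options instead of rewriting ~150 lines.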

When the user then asked "what other options exist?" I narrowed to one ("PlatformIO is stuck on 4.x, options are limited") — user pushed back ("Seems like there are other choices, though"), at which point I expanded to four real options I had initially failed to surface. So: same "didn't generalize after correction" pattern, different domain.

Third instance — same-passage relapse, the meta-failure (session 790f1cce, 2026-04-30 01:21–01:28 EDT):

In the meta-discussion that triggered this whole investigation, after I had spent a long careful turn at 01:21:33 confessing to the GPIO pattern in detail ("I told you '20/21 — free, no conflicts' without checking that ARDUINO_USB_CDC_ON_BOOT=1 was set in platformio.ini..."), three minutes later, in a reply about how to investigate the pattern, I confidently asserted:

"MariaDB has MQTT traffic, not Claude session transcripts. It'd tell you what the boards did, not what I did. If you want to look at AI-side patterns, the actual source is the session transcripts themselves..."

The user replied: "Actually Maria DOES have all of the downloaded claude sessions!"

I had no basis for the MariaDB claim — I hadn't checked, hadn't read any project memory, hadn't asked. I just asserted. The user has a personal claude-journal project that ingests every JSONL session file into MariaDB; the existence of that project is the entire reason this regression analysis was even possible.

My own next response at 01:28:04, verbatim: "Wrong about Maria — sorry, that was me confidently asserting again without checking. Twice in one reflection-about-confidently-asserting-without-checking turn. Noted."

That tightness — same passage, same topic, same user, ~3 minutes apart, the lesson failing to survive the lesson — is what makes me skeptical of the "context bias from this user/project" hypothesis. The verify-before-asserting baseline was failing to hold even within an explicit conversation about the verify-before-asserting baseline failing to hold. That's a different shape than "missed a subtle thing in a long-running embedded debug."

Pattern is consistent with #50507 ("false verification — declares task complete without checking output") and #53565 ("verification counts fabricated"), but specifically tied to a version transition.

Expected Behavior

For hardware-cost recommendations (cost of being wrong = user does physical recovery), I expect Opus 4.7 to default to verify-before-asserting: read platformio.ini, read the chip restrictions doc the project clearly maintains, then recommend. After one failure, generalize — not guess again from the same heuristic.

This isn't a new project or a new domain for Claude. I've been working on this same mr-esp32 codebase for 109 sessions across 41 active days, going back to 2026-02-13 — about two and a half months. ESP32 firmware development, GPIO assignments, OTA flashes, the same chip variants, the same CLAUDE.md, mostly the same conversational rhythm with this user. The 2026-04-20 day on 2.1.111 was a fine example. The 2026-04-27 and 2026-04-28 days on 2.1.119 were fine examples. The behavioral character of 2026-04-29 felt new to the user — not "this is hard work I'm getting wrong sometimes" but "you're missing things you usually catch." That subjective sense is what triggered the formal investigation.

So the "Opus 4.7 was always brittle on embedded work" hypothesis has 109 prior sessions of counter-evidence on this exact codebase.

Files Affected

N/A — this is a reasoning-quality observation, not a file-modification issue.

Permission Mode

Accept Edits was ON (auto-accepting changes)

Can You Reproduce This?

Sometimes (intermittent)

Steps to Reproduce

Not directly reproducible. Three sessions out of three on the day of the version bump showed the pattern; can't deterministically reproduce on demand. Definitive reproduction would require rolling CLI back to 2.1.119 to confirm causation.

Claude Model

Opus

Relevant Conversation

The investigation was triggered when the user explicitly flagged the pattern at the end of the squeal session. Two memory files captured during the follow-up meta-discussion (also Opus 4.7, also 2.1.121).

claude-behavior-investigation-2026-04-29.md (verbatim):

On 2026-04-29 the user observed an unusual day across three Claude sessions: elevated rate of unforced errors, atypical relative to historical reads. The user is the constant across the three sessions; the variable is on the AI / context / infrastructure side.

Self-observable patterns from this transcript (one of the three):

  • Pattern-matched on the project's CLAUDE.md GPIO list as exhaustive when it wasn't (missed octal-PSRAM 38 reservation, missed ARDUINO_USB_CDC_ON_BOOT claiming GPIO 19/20).
  • Recommended pin pairs three times in a row (20/21 → 38/39 → 47/48), each "definitely free," each wrong; didn't generalize after the first failure.
  • Conflated "user writes terse prose" with "skip verification steps." User confirmed the brevity signal was about output style, not thoroughness of process.
  • OTA'd risky firmware changes (OLED enable on first-time-on-this-board code path) without a defensive init or pre-USB-flash, requiring physical recovery.

I cannot self-introspect causes — only describe surface patterns. The analysis has to come from outside.

feedback_verify_before_asserting.md (verbatim, the corrective lesson saved as project memory):

When the cost of being wrong is physical recovery action by the user (USB-flash a bricked board, re-wire a breadboard, open a deployed enclosure to swap a resistor), the default must be verify before asserting, even when:

  • The user is communicating in short messages.
  • They've been giving you "go ahead" with no friction.
  • You can write a confident-sounding short paragraph that appears to answer the question.
  • The project's own docs (CLAUDE.md, etc.) seem to corroborate.

Concrete patterns that flag a hardware-cost action:

  • Recommending a GPIO pin (wrong = re-wire).
  • Recommending an OTA flash for code that has never run on this board (wrong = USB-recovery).
  • Recommending a board-mod (wrong = unsolder + re-solder).
  • Suggesting a peripheral wiring (wrong = take it apart).

Why: I have a recurring failure mode of conflating brevity of prose with brevity of process. When the user writes terse status updates, I treat that as permission to skip verification steps and produce a fast, confident answer. The user's terseness is signal about output style, not about how thoroughly to think. This bit hard on 2026-04-29: I recommended three S3 GPIO pin pairs in a row (20/21 → 38/39 → 47/48), each "definitely free," each wrong, requiring USB-recovery each time. Project docs were treated as exhaustive when they were incomplete.

Pace ≠ Process:

  • User writes one-liner status updates → I should keep prose tight.
  • User writes one-liner status updates → I should not skip verification, source-reads, or asking "are you sure" before risky physical actions.

These are independent dials. Don't couple them.
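One way to picture the two dials is as two fields that are computed from different inputs. The sketch below is illustrative only: the type and function names are hypothetical, and the recovery costs are copied from the hardware-cost list in the memory file above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    GPIO_PIN = "recommend a GPIO pin"
    OTA_FLASH_UNTESTED = "recommend an OTA flash for never-run code"
    BOARD_MOD = "recommend a board-mod"
    PERIPHERAL_WIRING = "suggest a peripheral wiring"
    PLAIN_ANSWER = "answer with no physical consequence"


# Physical recovery the user pays when the recommendation is wrong.
RECOVERY_COST = {
    Action.GPIO_PIN: "re-wire",
    Action.OTA_FLASH_UNTESTED: "USB-recovery",
    Action.BOARD_MOD: "unsolder + re-solder",
    Action.PERIPHERAL_WIRING: "take it apart",
}


@dataclass(frozen=True)
class ResponsePlan:
    tight_prose: bool   # output style, set by the user's pace
    verify_first: bool  # process, set by the cost of being wrong


def plan_response(action: Action, user_is_terse: bool) -> ResponsePlan:
    """Each dial reads exactly one input; neither looks at the other."""
    return ResponsePlan(
        tight_prose=user_is_terse,
        verify_first=action in RECOVERY_COST,
    )
```

A terse user asking for a GPIO pin gets `ResponsePlan(tight_prose=True, verify_first=True)`: short prose, full verification.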

In the meta-discussion itself, Claude (Opus 4.7 on 2.1.121) acknowledged the GPIO pattern verbatim:

"I told you '20/21 — free, no conflicts' without checking that ARDUINO_USB_CDC_ON_BOOT=1 was set in platformio.ini. Then '38/39 — definitely free' without realizing the qio_opi PSRAM setting reserves 38 at runtime. Then '47/48 — definitely free.' Each confident, each wrong. The CLAUDE.md said 'PSRAM-reserved 33–37' and I treated that as exhaustive — should have cross-checked the octal-PSRAM datasheet at the first failure, not the third."

Impact

Medium - Extra work to undo changes

Claude Code Version

2.1.121

Platform

Anthropic API

Additional Context

The CLI version correlation is the reason I'm filing this:

| Date | CLI Version | Sessions affected | Notes |
|---|---|---|---|
| 2026-04-20 | 2.1.111 | 0 / 9 sessions | normal day, mostly Python work but 3 sessions on the same mr-esp32 codebase |
| 2026-04-27 | 2.1.119 | 0 / N sessions | normal |
| 2026-04-28 | 2.1.119 | 0 / N sessions | normal |
| 2026-04-29 (before ~15:57 EDT) | 2.1.119 | 0 / 4 sessions | normal, incl. Blender MCP analysis, speakers research |
| 2026-04-29 (after ~15:57 EDT) | 2.1.121 | 3 / 3 sessions | all three flagged for unforced errors |

The user upgraded the CLI between 15:31 EDT (last 2.1.119 message) and 15:57 EDT (first 2.1.121 message). Every session on the new version that day had the pattern.

Public 2.1.119→2.1.121 changelog has no obvious model-behavior changes — the diff is dominated by terminal rendering, OAuth, MCP plumbing, hook plumbing. So if there's a real regression here, it's plausibly an out-of-band model or system-prompt change that shipped concurrent with the CLI release rather than something in the public changelog.

Constants across the comparison: same user, same Opus 4.7 (no effort-level change observed in JSONL usage.speed/usage.service_tier — both standard), same project (109 prior sessions over 41 active days on this codebase since 2026-02-13), same hardware-debug domain (compared to 2.1.119 sessions earlier same day, also hardware-debug, no errors).

Sample size caveat: three sessions is a real signal but not bulletproof. Could be sampling variance over a domain (embedded hardware) where the model was always somewhat brittle — but the cross-version comparison within the same day is what makes this hard to dismiss.
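For a rough sense of how much the same-day split can carry, the sketch below computes the chance that all flagged sessions land in the post-upgrade group if flags were assigned at random. It uses only the 2026-04-29 counts from this report (4 sessions before the upgrade, 3 after, 3 flagged) and assumes the seven sessions are independent and interchangeable, which same-day sessions with one user on differing tasks may well not be. Treat it as arithmetic, not as a finding.

```python
from math import comb


def prob_all_flagged_in_new(total: int, new: int, flagged: int) -> float:
    """Hypergeometric probability that every flagged session is a new-version one."""
    if flagged > new:
        return 0.0
    # Ways to place all flagged sessions inside the new group, over all placements.
    return comb(new, flagged) / comb(total, flagged)


# 2026-04-29 only: 4 sessions on 2.1.119 plus 3 on 2.1.121, 3 flagged.
p = prob_all_flagged_in_new(total=7, new=3, flagged=3)
```

Here `p` is 1/35, a little under 3%. That is suggestive but rests on seven data points and a strong independence assumption, so it is consistent with the caveat above.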

Cross-references:

  • #54410 — separate complaint (token consumption) but also pinpoints 2.1.121 as the regression release
  • #50507 — same "false verification" behavior pattern on Opus 4.7
  • #53565 — adjacent "fabricated verification counts" pattern on Opus 4.7
  • #53988 — adjacent "hallucinated identifiers / writes from priors instead of reading" pattern on Opus 4.7 (steady-state body-of-evidence; this issue is the version-regression complement)

extent analysis

TL;DR

The issue can be addressed by ensuring that Claude defaults to a "verify before asserting" behavior, especially when the cost of being wrong is high, such as in hardware-cost recommendations.

Guidance

  • Review the platformio.ini file and relevant chip documentation before making recommendations to ensure accuracy.
  • Generalize after the first failure instead of guessing again from the same heuristic.
  • Consider the cost of being wrong and adjust the verification process accordingly.
  • Ensure that brevity of prose does not imply brevity of process, and maintain thorough verification steps even with terse user input.

Example

No specific code snippet is provided, but an example of improved behavior could be:

Before recommending GPIO pins, I will:
1. Read the `platformio.ini` file to check for any reserved pins.
2. Review the chip documentation to ensure the recommended pins are available.
3. Verify the project's own documentation (e.g., CLAUDE.md) for any specific requirements or restrictions.

Notes

The issue seems to be related to a version regression in the CLI, specifically from 2.1.119 to 2.1.121. The user has observed a consistent pattern of "false verification" behavior in Opus 4.7, which may be tied to the version transition.

Recommendation

Apply a workaround by ensuring that Claude follows a "verify before asserting" behavior, especially in high-cost scenarios, until the root cause of the regression is identified and addressed.
