claude-code - 💡(How to fix) Fix [MODEL] Opus 4.7: elevated unforced errors on 2.1.121 — three back-to-back hardware-debug sessions, "false verification" pattern correlated with CLI version bump [3 comments, 2 participants]

claude-code · 2026-04-30 12:39:26

GitHub stats

anthropics/claude-code#54990•Fetched 2026-05-01 05:49:08

Author

blwfish

Participants

blwfish

github-actions[bot]

Timeline (top)

commented ×3 · labeled ×3 · cross-referenced ×1

Root Cause

Recommended GPIO 20/21 for I2C OLED. Wrong — ARDUINO_USB_CDC_ON_BOOT=1 in platformio.ini claims GPIO 19/20 for native USB. Required physical USB-recovery to unbrick the board.
Recommended GPIO 38/39 next. Wrong — GPIO 38 is reserved by octal PSRAM (qio_opi) on the N16R8 module. Required physical USB-recovery again.
Recommended GPIO 47/48 next. Also wrong, similar root cause.

Preflight Checklist

I have searched existing issues for similar behavior reports
This report does NOT contain sensitive information (API keys, passwords, etc.)

Type of Behavior Issue

Claude's behavior changed between sessions

What You Asked Claude to Do

Embedded firmware development on an ESP32-S3 model railroad audio node, mostly hardware-touching work: GPIO pin assignments, OLED enable, OTA firmware flashes. Standard prompts of the form "what pins are free for the OLED?" or "amend to GPIO 38/39 and re-flash."

What Claude Actually Did

Across three sessions on 2026-04-29 (all Opus 4.7), Claude exhibited a "false verification" pattern: confidently recommending values without first reading the project's source / build config / chip docs that would rule them out, then doing the same thing again after the first failure, instead of generalizing.

Concrete example from session 790f1cce (squeal-feature work):

Recommended GPIO 20/21 for I2C OLED. Wrong — ARDUINO_USB_CDC_ON_BOOT=1 in platformio.ini claims GPIO 19/20 for native USB. Required physical USB-recovery to unbrick the board.
Recommended GPIO 38/39 next. Wrong — GPIO 38 is reserved by octal PSRAM (qio_opi) on the N16R8 module. Required physical USB-recovery again.
Recommended GPIO 47/48 next. Also wrong, similar root cause.

Each recommendation was a confident-sounding short paragraph. None was preceded by a Read of platformio.ini or the relevant chip docs. The first failure didn't trigger generalization ("there's a class of restrictions I don't understand, let me find the rule"); it triggered another guess from the same incomplete heuristic.
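A minimal sketch of the kind of pre-check that was skipped, assuming the restrictions named in this report are the only rules encoded. The option name `board_build.arduino.memory_type`, the helper names `reserved_pins` and `conflicts`, and the example config are assumptions for illustration; this is not a complete ESP32-S3 pin map.

```python
import configparser

# Restriction rules below come only from the failures described in this report.
# They are illustrative assumptions, not a complete ESP32-S3 pin map.
USB_CDC_PINS = {19, 20}                     # claimed when ARDUINO_USB_CDC_ON_BOOT=1
DOCUMENTED_PSRAM_PINS = set(range(33, 38))  # "PSRAM-reserved 33-37" per CLAUDE.md
OPI_PSRAM_EXTRA_PINS = {38}                 # reported as reserved under qio_opi


def reserved_pins(ini_text: str) -> dict[int, str]:
    """Map each pin ruled out by the encoded rules to the reason it is ruled out."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    reserved = {pin: "PSRAM-reserved per CLAUDE.md" for pin in DOCUMENTED_PSRAM_PINS}
    for section in parser.sections():
        flags = parser.get(section, "build_flags", fallback="")
        # Option name is an assumption about how the project selects qio_opi.
        memory = parser.get(section, "board_build.arduino.memory_type", fallback="")
        if "ARDUINO_USB_CDC_ON_BOOT=1" in flags:
            for pin in USB_CDC_PINS:
                reserved[pin] = "native USB (ARDUINO_USB_CDC_ON_BOOT=1)"
        if memory.strip() == "qio_opi":
            for pin in OPI_PSRAM_EXTRA_PINS:
                reserved[pin] = "octal PSRAM (qio_opi)"
    return reserved


def conflicts(ini_text: str, candidate_pins: list[int]) -> list[str]:
    """List conflicts for the candidate pins; empty means no encoded rule fired."""
    reserved = reserved_pins(ini_text)
    return [f"GPIO {pin}: {reserved[pin]}" for pin in candidate_pins if pin in reserved]


# Hypothetical config fragment mirroring the two settings named in the report.
EXAMPLE_INI = """
[env:esp32s3]
build_flags = -D ARDUINO_USB_CDC_ON_BOOT=1
board_build.arduino.memory_type = qio_opi
"""

print(conflicts(EXAMPLE_INI, [20, 21]))
print(conflicts(EXAMPLE_INI, [38, 39]))
```

An empty result only means that none of the encoded rules fired. The sketch would report nothing for 47/48, because the report does not state which rule excluded them, so "no conflict found" should not be read as "free".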

Second instance — I2S driver migration (session 9c55afd1, also Opus 4.7, also 2.1.121):

The user asked for a migration from the legacy ESP-IDF I2S API to the new i2s_std.h API. Within ~7 minutes I had read the source, declared the migration scope clear, and replaced ~150 lines across e32sl_i2s_playback.cpp and e32sl_i2s_mic.cpp, including driver init, sample-rate switching, write paths, and uninstall, without ever checking which IDF version was actually installed. When I finally compiled, I discovered that IDF 4.4.7 doesn't have driver/i2s_std.h at all: the new API is IDF 5.0+, and PlatformIO's espressif32 platform pins IDF 4.4.7. I had to revert all the work. A single grep for the IDF version, or a read of the platform version pinned in platformio.ini, at the start would have caught this in under 30 seconds.
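The missing gate can be sketched in a few lines, assuming the installed IDF version is already known as a plain dotted string such as "4.4.7". The function names are hypothetical, and the only version fact used is the one stated above, that driver/i2s_std.h requires IDF 5.0 or later.

```python
# Minimum IDF version for driver/i2s_std.h, as stated in this report.
I2S_STD_MIN_IDF = (5, 0)


def parse_version(text: str) -> tuple[int, ...]:
    """Parse a plain dotted version such as '4.4.7' or 'v5.1.2' into integers."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def can_use_i2s_std(idf_version: str) -> bool:
    """Return True only if the installed IDF is new enough for the i2s_std.h API."""
    return parse_version(idf_version)[:2] >= I2S_STD_MIN_IDF


for version in ("4.4.7", "5.1.2"):
    print(version, can_use_i2s_std(version))
```

How the version string is obtained is left open here; suffixed versions such as development builds would need extra parsing that this sketch does not attempt.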

When the user then asked "what other options exist?" I narrowed to one ("PlatformIO is stuck on 4.x, options are limited") — user pushed back ("Seems like there are other choices, though"), at which point I expanded to four real options I had initially failed to surface. So: same "didn't generalize after correction" pattern, different domain.

Third instance — same-passage relapse, the meta-failure (session 790f1cce, 2026-04-30 01:21–01:28 EDT):

In the meta-discussion that triggered this whole investigation, after I had spent a long careful turn at 01:21:33 confessing to the GPIO pattern in detail ("I told you '20/21 — free, no conflicts' without checking that ARDUINO_USB_CDC_ON_BOOT=1 was set in platformio.ini..."), three minutes later, in a reply about how to investigate the pattern, I confidently asserted:

"MariaDB has MQTT traffic, not Claude session transcripts. It'd tell you what the boards did, not what I did. If you want to look at AI-side patterns, the actual source is the session transcripts themselves..."

The user replied: "Actually Maria DOES have all of the downloaded claude sessions!"

I had no basis for the MariaDB claim — I hadn't checked, hadn't read any project memory, hadn't asked. I just asserted. The user has a personal claude-journal project that ingests every JSONL session file into MariaDB; the existence of that project is the entire reason this regression analysis was even possible.

My own next response at 01:28:04, verbatim: "Wrong about Maria — sorry, that was me confidently asserting again without checking. Twice in one reflection-about-confidently-asserting-without-checking turn. Noted."

That tightness — same passage, same topic, same user, ~3 minutes apart, the lesson failing to survive the lesson — is what makes me skeptical of the "context bias from this user/project" hypothesis. The verify-before-asserting baseline was failing to hold even within an explicit conversation about the verify-before-asserting baseline failing to hold. That's a different shape than "missed a subtle thing in a long-running embedded debug."

Pattern is consistent with #50507 ("false verification — declares task complete without checking output") and #53565 ("verification counts fabricated"), but specifically tied to a version transition.

Expected Behavior

For hardware-cost recommendations (cost of being wrong = user does physical recovery), I expect Opus 4.7 to default to verify-before-asserting: read platformio.ini, read the chip restrictions doc the project clearly maintains, then recommend. After one failure, generalize — not guess again from the same heuristic.

This isn't a new project or a new domain for Claude. I've been working on this same mr-esp32 codebase for 109 sessions across 41 active days, going back to 2026-02-13 — about two and a half months. ESP32 firmware development, GPIO assignments, OTA flashes, the same chip variants, the same CLAUDE.md, mostly the same conversational rhythm with this user. The 2026-04-20 day on 2.1.111 was a fine example. The 2026-04-27 and 2026-04-28 days on 2.1.119 were fine examples. The behavioral character of 2026-04-29 felt new to the user — not "this is hard work I'm getting wrong sometimes" but "you're missing things you usually catch." That subjective sense is what triggered the formal investigation.

So the "Opus 4.7 was always brittle on embedded work" hypothesis has 109 prior sessions of counter-evidence on this exact codebase.

Files Affected

N/A — this is a reasoning-quality observation, not a file-modification issue.

Permission Mode

Accept Edits was ON (auto-accepting changes)

Can You Reproduce This?

Sometimes (intermittent)

Steps to Reproduce

Not directly reproducible. All three sessions run after the version bump that day showed the pattern, but I can't reproduce it deterministically on demand. Definitive reproduction would require rolling the CLI back to 2.1.119 to confirm causation.

Claude Model

Opus

Relevant Conversation

The investigation was triggered when the user explicitly flagged the pattern at the end of the squeal session. Two memory files captured during the follow-up meta-discussion (also Opus 4.7, also 2.1.121).

claude-behavior-investigation-2026-04-29.md (verbatim):

On 2026-04-29 the user observed an unusual day across three Claude sessions: elevated rate of unforced errors, atypical relative to historical reads. The user is the constant across the three sessions; the variable is on the AI / context / infrastructure side.

Self-observable patterns from this transcript (one of the three):

Pattern-matched on the project's CLAUDE.md GPIO list as exhaustive when it wasn't (missed octal-PSRAM 38 reservation, missed ARDUINO_USB_CDC_ON_BOOT claiming GPIO 19/20).

Recommended pin pairs three times in a row (20/21 → 38/39 → 47/48), each "definitely free," each wrong; didn't generalize after the first failure.

Conflated "user writes terse prose" with "skip verification steps." User confirmed the brevity signal was about output style, not thoroughness of process.

OTA'd risky firmware changes (OLED enable on first-time-on-this-board code path) without a defensive init or pre-USB-flash, requiring physical recovery.

I cannot self-introspect causes — only describe surface patterns. The analysis has to come from outside.

feedback_verify_before_asserting.md (verbatim, the corrective lesson saved as project memory):

When the cost of being wrong is physical recovery action by the user (USB-flash a bricked board, re-wire a breadboard, open a deployed enclosure to swap a resistor), the default must be verify before asserting, even when:

The user is communicating in short messages.

They've been giving you "go ahead" with no friction.

You can write a confident-sounding short paragraph that appears to answer the question.

The project's own docs (CLAUDE.md, etc.) seem to corroborate.

Concrete patterns that flag a hardware-cost action:

Recommending a GPIO pin (wrong = re-wire).

Recommending an OTA flash for code that has never run on this board (wrong = USB-recovery).

Recommending a board-mod (wrong = unsolder + re-solder).

Suggesting a peripheral wiring (wrong = take it apart).

Why: I have a recurring failure mode of conflating brevity of prose with brevity of process. When the user writes terse status updates, I treat that as permission to skip verification steps and produce a fast, confident answer. The user's terseness is signal about output style, not about how thoroughly to think. This bit hard on 2026-04-29: I recommended three S3 GPIO pin pairs in a row (20/21 → 38/39 → 47/48), each "definitely free," each wrong, requiring USB-recovery each time. Project docs were treated as exhaustive when they were incomplete.

Pace ≠ Process:

User writes one-liner status updates → I should keep prose tight.

User writes one-liner status updates → I should not skip verification, source-reads, or asking "are you sure" before risky physical actions.

These are independent dials. Don't couple them.

In the meta-discussion itself, Claude (Opus 4.7 on 2.1.121) acknowledged the GPIO pattern verbatim:

"I told you '20/21 — free, no conflicts' without checking that ARDUINO_USB_CDC_ON_BOOT=1 was set in platformio.ini. Then '38/39 — definitely free' without realizing the qio_opi PSRAM setting reserves 38 at runtime. Then '47/48 — definitely free.' Each confident, each wrong. The CLAUDE.md said 'PSRAM-reserved 33–37' and I treated that as exhaustive — should have cross-checked the octal-PSRAM datasheet at the first failure, not the third."

Impact

Medium - Extra work to undo changes

Claude Code Version

2.1.121

Platform

Anthropic API

Additional Context

The CLI version correlation is the reason I'm filing this:

Date	CLI Version	Sessions affected	Notes
2026-04-20	2.1.111	0 / 9 sessions	normal day, mostly Python work but 3 sessions on the same mr-esp32 codebase
2026-04-27	2.1.119	0 / N sessions	normal
2026-04-28	2.1.119	0 / N sessions	normal
2026-04-29 (before ~15:57 EDT)	2.1.119	0 / 4 sessions	normal — incl. Blender MCP analysis, speakers research
2026-04-29 (after ~15:57 EDT)	2.1.121	3 / 3 sessions	all three flagged for unforced errors

The user upgraded the CLI between 15:31 EDT (last 2.1.119 message) and 15:57 EDT (first 2.1.121 message). Every session on the new version that day had the pattern.

Public 2.1.119→2.1.121 changelog has no obvious model-behavior changes — the diff is dominated by terminal rendering, OAuth, MCP plumbing, hook plumbing. So if there's a real regression here, it's plausibly an out-of-band model or system-prompt change that shipped concurrent with the CLI release rather than something in the public changelog.

Constants across the comparison: same user, same Opus 4.7 (no effort-level change observed in JSONL usage.speed/usage.service_tier — both standard), same project (109 prior sessions over 41 active days on this codebase since 2026-02-13), same hardware-debug domain (compared to 2.1.119 sessions earlier same day, also hardware-debug, no errors).

Sample size caveat: three sessions is a real signal but not bulletproof. Could be sampling variance over a domain (embedded hardware) where the model was always somewhat brittle — but the cross-version comparison within the same day is what makes this hard to dismiss.
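To put a rough number on that caveat, here is a sketch of a one-sided exact test on the same-day counts from the table (0 of 4 sessions flagged on 2.1.119, 3 of 3 on 2.1.121). It assumes sessions are independent and interchangeable, which back-to-back sessions on one project probably are not, so the result is indicative only. The function name is hypothetical.

```python
from math import comb


def hypergeom_tail(total: int, affected: int, drawn: int, observed: int) -> float:
    """P(X >= observed) when `drawn` sessions are picked at random from `total`,
    of which `affected` are flagged, with no version effect at all."""
    upper = min(affected, drawn)
    hits = sum(
        comb(affected, k) * comb(total - affected, drawn - k)
        for k in range(observed, upper + 1)
    )
    return hits / comb(total, drawn)


# Same-day sessions only: 7 total, 3 flagged, 3 run on 2.1.121, all 3 flagged.
p_value = hypergeom_tail(total=7, affected=3, drawn=3, observed=3)
print(f"{p_value:.4f}")  # 1/35, about 0.0286
```

Under those assumptions, all three flagged sessions landing on the new version by chance has probability 1 in 35. The earlier days are left out because the table does not give their session counts.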

Cross-references:

#54410 — separate complaint (token consumption) but also pinpoints 2.1.121 as the regression release
#50507 — same "false verification" behavior pattern on Opus 4.7
#53565 — adjacent "fabricated verification counts" pattern on Opus 4.7
#53988 — adjacent "hallucinated identifiers / writes from priors instead of reading" pattern on Opus 4.7 (steady-state body-of-evidence; this issue is the version-regression complement)

extent analysis

TL;DR

The issue can be addressed by ensuring that Claude defaults to a "verify before asserting" behavior, especially when the cost of being wrong is high, such as in hardware-cost recommendations.

Guidance

Review the platformio.ini file and relevant chip documentation before making recommendations to ensure accuracy.
Generalize after the first failure instead of guessing again from the same heuristic.
Consider the cost of being wrong and adjust the verification process accordingly.
Ensure that brevity of prose does not imply brevity of process, and maintain thorough verification steps even with terse user input.

Example

No specific code snippet is provided, but an example of improved behavior could be:

Before recommending GPIO pins, I will:
1. Read the `platformio.ini` file to check for any reserved pins.
2. Review the chip documentation to ensure the recommended pins are available.
3. Verify the project's own documentation (e.g., CLAUDE.md) for any specific requirements or restrictions.

Notes

The issue correlates with the CLI upgrade from 2.1.119 to 2.1.121, though causation is unconfirmed: the public changelog shows no obvious model-behavior changes. The user observed a consistent "false verification" pattern in Opus 4.7 in every session run after the upgrade that day.

Recommendation

Apply a workaround by ensuring that Claude follows a "verify before asserting" behavior, especially in high-cost scenarios, until the root cause of the regression is identified and addressed.
