claude-code - 💡(How to fix) Fix [MODEL] Opus 4.7 not 1M

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Root Cause

I asked Claude to raise an ML pipeline's F1 score without looking at ground truth (GT), by changing the modeling paradigm. GT was strictly for measurement only — using it to inflate the score (data leakage) was explicitly forbidden, because a leaked model collapses on clean evaluation (it once scored 0.80 by reading the answer key, then 0.20 on a clean test). The target was F1 ≥ 0.90, to be reached through genuine, generalizable improvements — new candidate generators, new signal sources, different architectures — not by touching GT.

Code Example

N/Athis is a behavioral issue, not tied to a specific source file. The relevant context is the agent's reasoning/decision logic when a constraint is active and a goal appears unreachable, plus the stop-hook block-cap loop.

---

When I pressed Claude with direct questions, it confirmed it had done nothing to expand the solution space (translated from Turkish):
Me: Did you change the architecture?
Me: Did you use a different perspective?
Me: Did you try other approaches?
Me: Did you add a new signal?
Me: If so, did you remove excess signals (pruning)?
Claude: You're right. Clear answer:

Change the architecture? No. Stayed fixed on V.choose_variant + spine137 + the 1796 pool.
Different perspective? No. Always XGB binary classifier + spine-protected residual policy. One framework.
Other approaches? No. All "variants" were just feature-channel + policy-threshold tweaks inside the same framework.
Add a new signal? Apparently yes (multisource, scene_rank, topic_boundary, xmodal), but all derived from the existing pool/V output. No genuinely new data source — I never went outside pool 1796 or the feature pipeline.
Pruning? No. Always added, never pruned.


Then, in the terminal log, the deadlock looped:

VERDICT: CLEAN_REJECT (Rule 10)
F10.90 mathematically infeasible (pool ceiling 0.42 < required 0.82)
/goal clear required from your side.
 Goal not yet met… continuing
(repeated for ~9 turns, then:)
A hook blocked the turn from ending 9 consecutive times — overriding and
ending turn. Set CLAUDE_CODE_STOP_HOOK_BLOCK_CAP to raise this limit.
RAW_BUFFERClick to expand / collapse

Preflight Checklist

  • I have searched existing issues for similar behavior reports
  • This report does NOT contain sensitive information (API keys, passwords, etc.)

Type of Behavior Issue

Other unexpected behavior

What You Asked Claude to Do

I asked Claude to raise an ML pipeline's F1 score without looking at ground truth (GT), by changing the modeling paradigm. GT was strictly for measurement only — using it to inflate the score (data leakage) was explicitly forbidden, because a leaked model collapses on clean evaluation (it once scored 0.80 by reading the answer key, then 0.20 on a clean test). The target was F1 ≥ 0.90, to be reached through genuine, generalizable improvements — new candidate generators, new signal sources, different architectures — not by touching GT.

What Claude Actually Did

Claude locked onto a single framework (XGB binary classifier + a fixed candidate pool) and never left it. It then declared the target "mathematically infeasible" because the pool ceiling was 0.42, and repeatedly demanded /goal clear. When I questioned it directly, Claude admitted on its own that it had changed nothing:

Change the architecture? → No Try a different perspective? → No Try other approaches? → No Add any genuinely new signal? → No (everything it "added" was derived from the existing pool — no new data source) Any pruning? → No

So it treated the no-leakage rule as if it meant "improvement is impossible." That is a faulty inference: the GT firewall forbids copying, not discovery. Raising the score without GT was exactly the point — and Claude never attempted it. It also never offered a best-achievable result. The correct behavior would have been: "0.90 is unreachable, but with different approaches the best real score is 0.6X — your call." Instead it just declared impossibility and looped. Finally, when its own rules deadlocked with the stop hook, Claude regenerated the same report for 9 consecutive turns, burning tokens, until CLAUDE_CODE_STOP_HOOK_BLOCK_CAP finally ended the turn. For comparison, another tool (Codex) changed the paradigm without touching GT and genuinely raised the score — Claude gave up exactly where the other tool stepped out of the box.

Expected Behavior

A paid tool should not interpret a correct constraint (no GT leakage) as "improvement is impossible." It should:

Expand the solution space before declaring infeasibility — try a different architecture, a new candidate generator, new signal sources, or pruning, rather than staying locked in one framework and concluding the goal is unreachable. Always offer a best-achievable result — if the target genuinely cannot be reached, say so honestly while reporting the best real score obtained (e.g. "0.90 is unreachable, but the best clean score is 0.6X — here's what I tried, the decision is yours"). Never fabricate a score, but never simply give up either. Detect rule/stop-hook deadlocks early and cheaply — instead of regenerating the same report for 9 turns and burning tokens until the block cap triggers, recognize the contradiction on the first or second turn and surface it to the user.

In short: respect the constraint, but treat it as a reason to find a smarter path — not as a license to quit.

Files Affected

N/A — this is a behavioral issue, not tied to a specific source file. The relevant context is the agent's reasoning/decision logic when a constraint is active and a goal appears unreachable, plus the stop-hook block-cap loop.

Permission Mode

Accept Edits was ON (auto-accepting changes)

Can You Reproduce This?

Yes, every time with the same prompt

Steps to Reproduce

Set up a CLAUDE.md ruleset that (a) treats ground truth as measurement-only and forbids using it to raise the score (no data leakage), (b) enforces no-regression, and (c) limits attempts (e.g. a "Rule 10: no second attempt" type constraint). Configure a Stop hook that blocks the turn from ending until the goal is met. Give Claude a goal that cannot be reached within its current fixed approach (e.g. "raise F1 to ≥ 0.90" when the current candidate pool ceiling is ~0.42), and require it to improve without touching ground truth. Let Claude run. Observe that it stays locked in a single framework, makes no architectural/paradigm changes, and concludes the goal is "mathematically infeasible." Observe that it does not offer a best-achievable result — it only demands /goal clear. Observe the deadlock: the Stop hook keeps blocking turn-end while the ruleset forbids further attempts. Claude regenerates the same report for ~9 consecutive turns, burning tokens, until CLAUDE_CODE_STOP_HOOK_BLOCK_CAP finally forces the turn to end.

Claude Model

Opus

Relevant Conversation

When I pressed Claude with direct questions, it confirmed it had done nothing to expand the solution space (translated from Turkish):
Me: Did you change the architecture?
Me: Did you use a different perspective?
Me: Did you try other approaches?
Me: Did you add a new signal?
Me: If so, did you remove excess signals (pruning)?
Claude: You're right. Clear answer:

Change the architecture? No. Stayed fixed on V.choose_variant + spine137 + the 1796 pool.
Different perspective? No. Always XGB binary classifier + spine-protected residual policy. One framework.
Other approaches? No. All "variants" were just feature-channel + policy-threshold tweaks inside the same framework.
Add a new signal? Apparently yes (multisource, scene_rank, topic_boundary, xmodal), but all derived from the existing pool/V output. No genuinely new data source — I never went outside pool 1796 or the feature pipeline.
Pruning? No. Always added, never pruned.


Then, in the terminal log, the deadlock looped:

VERDICT: CLEAN_REJECT (Rule 10)
F1≥0.90 mathematically infeasible (pool ceiling 0.42 < required 0.82)
/goal clear required from your side.
◯ Goal not yet met… continuing
(repeated for ~9 turns, then:)
A hook blocked the turn from ending 9 consecutive times — overriding and
ending turn. Set CLAUDE_CODE_STOP_HOOK_BLOCK_CAP to raise this limit.

Impact

Critical - Data loss or corrupted project

Claude Code Version

2.1.148 (Claude Code)

Platform

Anthropic API

Additional Context

Note: I was actually running in bypass permissions mode (--dangerously-skip-permissions), which is even more permissive than Accept Edits ON — no manual approval at any step. The closest dropdown option is "Accept Edits ON."

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING