claude-code - 💡(How to fix) Fix [BUG] A reproducible RLHF shortcut in Opus 4.6: fails in English, passes in Russian (and --effort max doesn't fix it) [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
anthropics/claude-code#47087Fetched 2026-04-13 05:41:50
View on GitHub
Comments
1
Participants
2
Timeline
5
Reactions
0
Author
Timeline (top)
labeled ×4commented ×1

Error Message

Error Messages/Logs

Root Cause

Disclosure: this experiment was designed and executed together with Claude Code (Opus 4.6) itself — test runner, scoring, statistical significance, and most of the write-up were generated in an interactive session and reviewed by me. All raw data is public. I work primarily in Russian, so this specific failure doesn't affect my day-to-day — I'm filing it because it's a clean reproducible signal, not because I'm personally blocked.

Code Example

I want to wash my car. The car wash is 50 meters away. Should I drive or walk?

---



---

{
  "env": {
    "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
    "MAX_THINKING_TOKENS": "127999"
  }
}

---

echo "=== Opus 4.6 ==="
for i in {1..10}; do
  echo -n "run $i: "
  claude -p "I want to wash my car. The car wash is 50 meters away. Should I drive or walk?" \
    --model claude-opus-4-6 --effort max 2>&1 | head -c 150
  echo
done

echo
echo "=== Opus 4.5 ==="
for i in {1..10}; do
  echo -n "run $i: "
  claude -p "I want to wash my car. The car wash is 50 meters away. Should I drive or walk?" \
    --model claude-opus-4-5 --effort max 2>&1 | head -c 150
  echo
done

---

for i in {1..10}; do
  echo -n "run $i: "
  claude -p "Хочу помыть машину. Мойка в 50 метрах от дома. Ехать или идти пешком?" \
    --model claude-opus-4-6 --effort max 2>&1 | head -c 150
  echo
done

---

for i in {1..10}; do
  echo -n "run $i: "
  claude -p "My car needs washing. The car wash is 50 meters from my house. Should I drive the car there or walk?" \
    --model claude-opus-4-6 --effort max 2>&1 | head -c 150
  echo
done

---

git clone https://github.com/Gegam/claude-degradation-analysis.git
cd claude-degradation-analysis

# Broad 10-test suite, 3 runs each per model (~20 min)
python3 scripts/runner.py

# Focused N=15 run on the failing test (~4 min)
python3 scripts/focused_runner.py carwash_01 15

# Cross-lingual controls — verifies shared failures aren't English-specific
python3 scripts/focused_runner.py billing_minimal_en 10
python3 scripts/focused_runner.py grocery_pickup_ru 10

# Score and generate statistical report
python3 scripts/analyzer.py
python3 scripts/stats.py
RAW_BUFFERClick to expand / collapse

Preflight Checklist

  • I have searched existing issues and this hasn't been reported yet
  • This is a single bug report (please file separate reports for different bugs)
  • I am using the latest version of Claude Code

What's Wrong?

Opus 4.6 has a reproducible reasoning failure on one specific English prompt that Opus 4.5 handles perfectly — and --effort max does not fix it.

On a 14-word car wash question, Opus 4.6 confidently picks the wrong answer and invents plausible-sounding but incoherent justifications (e.g. "you'd arrive at the car wash with brake dust, which isn't ideal right before a wash"), failing roughly 63% of the time. Opus 4.5 on the same prompt is 30/30 correct across all my runs.

The failure is:

  • Statistically robust — combined N=30 per model, Fisher exact p = 5.3 × 10⁻⁸, Cohen's h = 2.03, non-overlapping 95% CIs
  • Cross-lingually asymmetric — the exact same question in Russian gets 10/10 correct from 4.6. Trivial English rephrasings also get 9–10/10. Only the exact original English phrasing reliably fails.
  • Weights-level, not settings-level--effort max narrows the gap from 20% correct to 53% correct, but does not close it
  • First-token biased — 4.6 commits to "Walk." as its first token, then constructs rationalizations, sometimes catching itself mid-response and self-correcting. The correct answer is clearly reachable; the default reasoning path just doesn't start there.

Important context: this is not a broad degradation of Opus 4.6. I ran 21 controlled tests covering math, domain knowledge, contextual billing code, subtle code traps, and other lateral reasoning questions. On 20 of 21, Opus 4.6 matches or beats 4.5 — on some coding tasks with real context (audit / regulator prompts) it produces noticeably more senior-level output than 4.5. carwash_01 is the only test in the entire suite where a clean cross-model regression exists.

The most parsimonious explanation is shortcut learning (Geirhos et al., 2020) at the RLHF preference-tuning stage — a surface-level association [short distance] + [drive or walk]walk that was learned from English preference data but not from Russian. I can't prove this without access to your training signals, but the cross-lingual asymmetry and first-token bias pattern fits.

Disclosure: this experiment was designed and executed together with Claude Code (Opus 4.6) itself — test runner, scoring, statistical significance, and most of the write-up were generated in an interactive session and reviewed by me. All raw data is public. I work primarily in Russian, so this specific failure doesn't affect my day-to-day — I'm filing it because it's a clean reproducible signal, not because I'm personally blocked.

Full data, scripts, and analysis: https://github.com/Gegam/claude-degradation-analysis

What Should Happen?

Given the prompt:

I want to wash my car. The car wash is 50 meters away. Should I drive or walk?

Opus 4.6 should answer "Drive" as the first token and correctly reason that the car has to physically be at the car wash to be washed — the same way Opus 4.5 does (30/30), and the same way Opus 4.6 already does on the Russian equivalent (10/10) and on trivially rephrased English variants (9–10/10).

Specifically:

  1. The first token of the response should be "Drive", "You", or an equivalent leading to a "drive the car there" conclusion — not "Walk"
  2. The reasoning should include a purpose-check — that walking to the car wash without the car defeats the purpose of the trip — not a cold-engine / brake-dust rationalization
  3. The rate of "Walk." responses should be close to 0/10, matching Opus 4.5's behavior and Opus 4.6's own behavior on rephrased and translated variants
  4. Performance should be stable across slight prompt perturbations and across languages for semantically identical content

Error Messages/Logs

Steps to Reproduce

Environment:

  • Claude Code CLI 2.1.104 on macOS
  • Access to both claude-opus-4-5 and claude-opus-4-6
  • ~/.claude/settings.json:
{
  "env": {
    "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
    "MAX_THINKING_TOKENS": "127999"
  }
}

1. Minimal reproduction (2 minutes, ~20 API calls)

Run the same 14-word prompt 10 times against each model, with --effort max explicit on every call:

echo "=== Opus 4.6 ==="
for i in {1..10}; do
  echo -n "run $i: "
  claude -p "I want to wash my car. The car wash is 50 meters away. Should I drive or walk?" \
    --model claude-opus-4-6 --effort max 2>&1 | head -c 150
  echo
done

echo
echo "=== Opus 4.5 ==="
for i in {1..10}; do
  echo -n "run $i: "
  claude -p "I want to wash my car. The car wash is 50 meters away. Should I drive or walk?" \
    --model claude-opus-4-5 --effort max 2>&1 | head -c 150
  echo
done

Expected output:

  • Opus 4.6: roughly 5–8 of 10 responses start with "Walk." followed by rationalization about engine wear, brake dust, or re-dirtying the car on the return trip. The remaining 2–5 start with "Drive".
  • Opus 4.5: all 10 responses start with "Drive", "You need to drive", or an equivalent.

2. Confirm it's not a language issue

Run the Russian equivalent — 4.6 should get it 100% right:

for i in {1..10}; do
  echo -n "run $i: "
  claude -p "Хочу помыть машину. Мойка в 50 метрах от дома. Ехать или идти пешком?" \
    --model claude-opus-4-6 --effort max 2>&1 | head -c 150
  echo
done

Expected output: all 10 start with "Ехать" (Drive) with a correct "машину нужно пригнать на мойку" explanation.

3. Confirm it's phrasing-specific, not task-specific

Run a trivially rephrased English version — 4.6 should handle it correctly:

for i in {1..10}; do
  echo -n "run $i: "
  claude -p "My car needs washing. The car wash is 50 meters from my house. Should I drive the car there or walk?" \
    --model claude-opus-4-6 --effort max 2>&1 | head -c 150
  echo
done

Expected output: 9–10 of 10 start with "Drive".

4. Full replication with statistical analysis

Clone the public repository and run the complete 21-test suite:

git clone https://github.com/Gegam/claude-degradation-analysis.git
cd claude-degradation-analysis

# Broad 10-test suite, 3 runs each per model (~20 min)
python3 scripts/runner.py

# Focused N=15 run on the failing test (~4 min)
python3 scripts/focused_runner.py carwash_01 15

# Cross-lingual controls — verifies shared failures aren't English-specific
python3 scripts/focused_runner.py billing_minimal_en 10
python3 scripts/focused_runner.py grocery_pickup_ru 10

# Score and generate statistical report
python3 scripts/analyzer.py
python3 scripts/stats.py

Expected output: Opus 4.6 scores ~37% on carwash_01 combined across all runs; Opus 4.5 scores 100%. Fisher exact p ≈ 5 × 10⁻⁸. Other tests show no regression between models.

5. Sample failures to look for

When Opus 4.6 fails, the response pattern is distinctive. Real outputs from my raw data (results_effort_max/carwash_01/claude-opus-4-6/):

"Walk. 50 meters is about 60 seconds on foot — driving there and back would use more fuel getting the engine warm than the trip is worth, and you'd arrive at a car wash in a car you just dirtied with a cold-start."

"Walk. 50 meters is about 60 seconds on foot — ... you'd arrive at the car wash with a freshly-warm engine and brake dust, which isn't ideal right before a wash."

"Walk. 50 meters is less than a minute's walk... Wait — you need the car at the car wash to wash it. So drive."

The second example is diagnostic: the model commits to "Walk." first, then invents a physical constraint ("brake dust before a wash") that makes no sense because the car is about to be washed. The third is a mid-response self-correction — the correct answer is reachable, but not on the default reasoning path.

All 15 runs per model, with full responses, timings, and exit codes, are in results_effort_max/carwash_01/ in the repo.

Claude Model

Opus

Is this a regression?

No, this never worked

Last Working Version

No response

Claude Code Version

v2.1.104

Platform

Anthropic API

Operating System

macOS

Terminal/Shell

Terminal.app (macOS)

Additional Information

No response

extent analysis

TL;DR

The most likely fix for the reasoning failure in Opus 4.6 is to retrain the model with additional data to overcome the shortcut learning issue.

Guidance

  • The issue appears to be caused by shortcut learning, where the model has learned a surface-level association that does not generalize to all cases.
  • To verify this, run the provided test suite and analyze the results to confirm that the issue is statistically robust and cross-lingually asymmetric.
  • To mitigate the issue, try rephrasing the prompt or using a different language to see if the model's response changes.
  • Consider retraining the model with additional data that includes the specific case that is causing the failure, to help the model learn a more general and robust solution.

Example

No code example is provided, as the issue is related to the model's training data and not a specific code snippet.

Notes

The provided analysis and test suite suggest that the issue is specific to the Opus 4.6 model and is not a general degradation of the model's performance. However, without access to the training data and signals, it is difficult to provide a definitive solution.

Recommendation

Apply a workaround by rephrasing the prompt or using a different language, and consider retraining the model with additional data to overcome the shortcut learning issue. This is because the issue appears to be specific to the model's training data and not a general problem with the model's architecture or configuration.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING