claude-code - 💡(How to fix) Fix [MODEL] Empirical finding: Claude Code implements ~60% of specs — 6 failure patterns across 16 rounds [2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
anthropics/claude-code#49674Fetched 2026-04-17 08:34:31
View on GitHub
Comments
2
Participants
2
Timeline
6
Reactions
0
Timeline (top)
labeled ×3commented ×2closed ×1

Code Example

15+ Python modules across src/trading/, src/agent/, src/analysis/
Key affected files: safety_net.py, executor.py, kill_switch.py, cost_manager.py, market_calendar.py, fallback.py, position_sizer.py

---

Not a single conversation — 16 verification rounds over multiple sessions. Full methodology and data: https://github.com/Tomo-AI2025/claude-code-verification
RAW_BUFFERClick to expand / collapse

Preflight Checklist

  • I have searched existing issues for similar behavior reports
  • This report does NOT contain sensitive information (API keys, passwords, etc.)

Type of Behavior Issue

Claude ignored my instructions or configuration

What You Asked Claude to Do

16 different implementation commands across a production Python codebase (automated stock trading system). Example: "Create src/trading/safety_net.py with validate_order() that checks frequency, amount, and daily loss limits, and integrate it with executor.py"

What Claude Actually Did

Claude implemented only ~60% of specifications on average across 16 verification rounds.

6 failure patterns discovered:

  1. Silent Failure (69%): Functions defined but never called from other modules. Example: SafetyNet.validate_order() was created but executor.py never called it — all safety checks silently bypassed.
  2. Numeric Alteration (25%): $1,000 threshold silently changed to $10,000.
  3. Tail Ignoring (19%): Instructions at end of prompts skipped.
  4. Missing Prerequisites (13%): Code references files that don't exist (RSA keys).
  5. Config-Only (50%): Settings defined but no execution code uses them.
  6. Commit Message Lies (6%): Git commit says "integrated" but integration was lost.

Full data: https://github.com/Tomo-AI2025/claude-code-verification

Expected Behavior

Claude should implement 100% of the specifications given in the prompt, including:

  • Calling functions from other modules (not just defining them)
  • Preserving exact numeric values ($1,000 not $10,000)
  • Implementing all instructions including those at the end of the prompt

Files Affected

15+ Python modules across src/trading/, src/agent/, src/analysis/
Key affected files: safety_net.py, executor.py, kill_switch.py, cost_manager.py, market_calendar.py, fallback.py, position_sizer.py

Permission Mode

Accept Edits was ON (auto-accepting changes)

Can You Reproduce This?

Yes, every time with the same prompt

Steps to Reproduce

Yes, consistently. 16 out of 16 rounds showed partial implementation. Average rate: ~60%. Reproduction method and full data at: https://github.com/Tomo-AI2025/claude-code-verification

Claude Model

Opus

Relevant Conversation

Not a single conversation — 16 verification rounds over multiple sessions. Full methodology and data: https://github.com/Tomo-AI2025/claude-code-verification

Impact

High - Significant unwanted changes

Claude Code Version

2.1.101 (Claude Code Max)

Platform

Anthropic API

Additional Context

This is not a one-time bug — it documents a systematic behavioral pattern observed across 16 verification rounds. The "pair verification" methodology (grep for definition + grep for call site) reliably detects these issues. Full templates available at: https://github.com/Tomo-AI2025/claude-code-verification

extent analysis

TL;DR

Review and manually verify Claude's implementations to ensure 100% specification adherence, as the current version (2.1.101) exhibits systematic behavioral patterns of partial implementation.

Guidance

  • Investigate the "pair verification" methodology (definition + call site greps) to detect similar issues, as described in the provided GitHub repository.
  • Manually review the affected Python modules (e.g., safety_net.py, executor.py) to identify and correct instances of silent failures, numeric alterations, and other failure patterns.
  • Consider temporarily disabling the "Accept Edits" feature to prevent auto-accepting changes and consider implementing additional validation checks before integrating Claude's implementations.
  • Analyze the full data and reproduction method provided in the GitHub repository to better understand the systematic behavioral patterns and potential root causes.

Example

No code snippet is provided, as the issue is more related to the behavioral pattern of the Claude model rather than a specific code error.

Notes

The issue seems to be related to the Claude model's version (2.1.101) and its systematic behavioral patterns, rather than a one-time bug. The provided data and reproduction method can help in understanding and addressing the issue.

Recommendation

Apply workaround: Manually review and verify Claude's implementations to ensure 100% specification adherence, as the current version exhibits systematic behavioral patterns of partial implementation. This is recommended because the issue is not a one-time bug, and a workaround is necessary to ensure the correctness of the implementations.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

claude-code - 💡(How to fix) Fix [MODEL] Empirical finding: Claude Code implements ~60% of specs — 6 failure patterns across 16 rounds [2 comments, 2 participants]