claude-code - 💡(How to fix) Fix [MODEL] Claude marks tasks done without testing — 12 days, 51 commits, still broken (Opus 4.x)

StepCodex · 2026-05-18T10:11:03Z




Root Cause

Claude wrote subtitle "buttons" as plain text emoji (⬆⬇) inside a Telegram message caption — not real InlineKeyboardButton objects wired to a callback. Then marked the task done. Pressing them did literally nothing because they were decorative text.
When told to deploy the fix and test, Claude tested the OLD code on production BEFORE deploying the new code. Obviously failed. Another hour wasted re-running the same cycle.
When told to cancel a blocking job and deploy, Claude said "I can't cancel without Mike's permission" — in a session that had explicitly authorized autonomous operation moments earlier.
Claude regularly asks "should I fix this?" while simultaneously saying "no action needed from you" — in the same message.

Fix Action

Fix / Workaround

Typical session-level prompts: 'deploy the patch and test job 32', 'fix the subtitle buttons in transcript gate', 'cancel the blocking gate-job and ship the new build', 'autonomous — don't ping me, just bring a working result'.

fix(handlers): new video auto-cancels active gate-job
fix(handlers): accept compressed sendVideo, warn parallel not reject
fix(transcribe): default lang='ru' fallback (job 23 en bug)
fix(cancel-regex): short-circuit OK/redo interpret on cancel words
fix(e2e): synth frame rect strictly inside safe-zones
fix(weak-smoke): drop last monkey-patch case 5
fix(weak-smoke): drop monkey-patch in case 4 too
fix(weak-smoke): drop dict.keys monkey-patch
fix(weak-gate): parse_mode=None + try/except (job 23 silent crash loop)
fix(gate): add send_transcript_gate_v2 wrapper — restores VPS bot crash



Preflight Checklist

- [x] I have searched existing issues for similar behavior reports
- [x] This report does NOT contain sensitive information (API keys, passwords, etc.)

Type of Behavior Issue

Claude ignored my instructions or configuration

What You Asked Claude to Do

Across ~50 sessions over 12 days, the umbrella task was to "finish, deploy and stabilize a Telegram bot": content-factory-engine (public repo mike-prokhorov/content-factory-engine), a Python pipeline that goes receive video → Whisper transcript → Claude script-cut → ffmpeg render → post. Three Claude Code sessions ran in parallel (named Sёма / Шнырь / supervisor), each working on its slice with shared state files. Operating mode: auto / accept-edits. CLAUDE.md and hooks explicitly instruct: 'test before claiming done', 'L0 (file exists) is not done', 'verify with curl/grep/SSH before reporting success'.

What Claude Actually Did

The repeating pattern across 12 days / 51 commits (mostly fix(...)) on a ~6,800 LOC core codebase (including handlers 791, gate 1022, worker 421, db 400, transcribe 198; ~12,500 LOC in the whole repo):

1. Claude says "done, ready to test" without actually running anything.
2. I test → it's broken → Claude says "oh I see the bug, fixing".
3. Claude says "fixed, ready to test", again without testing.
4. I test → a different thing is broken.
5. Repeat for 12 days.

Concrete examples from today (2026-05-18):

- Claude wrote the subtitle "buttons" as plain-text emoji (⬆⬇) inside a Telegram message caption, not as real InlineKeyboardButton objects wired to a callback, then marked the task done. Pressing them did nothing because they were decorative text.
- When told to deploy the fix and test, Claude tested the OLD code on production BEFORE deploying the new code. The test obviously failed, and another hour was wasted re-running the same cycle.
- When told to cancel a blocking job and deploy, Claude said "I can't cancel without Mike's permission" in a session that had explicitly authorized autonomous operation moments earlier.
- Claude regularly asks "should I fix this?" while saying "no action needed from you" in the same message.
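
For contrast with the decorative-emoji case, here is a minimal sketch of what a real inline keyboard looks like at the Telegram Bot API level. It is not the repo's code: the function names and the "sub:up:<job>" callback format are hypothetical, and the repo may well use a client library instead of raw payloads. What it relies on is the Bot API's documented shape, a `reply_markup` object whose `inline_keyboard` is a list of button rows, each button carrying `callback_data`.

```python
import json

MAX_CALLBACK_BYTES = 64  # Bot API limit for callback_data (1-64 bytes)


def build_subtitle_keyboard(job_id: int) -> dict:
    """Return a Bot API reply_markup object with two real inline buttons.

    The "sub:up:<job>" callback format is hypothetical, chosen for this sketch.
    """
    buttons = [
        {"text": "⬆", "callback_data": f"sub:up:{job_id}"},
        {"text": "⬇", "callback_data": f"sub:down:{job_id}"},
    ]
    for button in buttons:
        if len(button["callback_data"].encode("utf-8")) > MAX_CALLBACK_BYTES:
            raise ValueError("callback_data exceeds 64 bytes")
    # inline_keyboard is a list of rows; each row is a list of buttons.
    return {"inline_keyboard": [buttons]}


def build_send_message_payload(chat_id: int, text: str, job_id: int) -> dict:
    """JSON body for the Bot API sendMessage method."""
    return {
        "chat_id": chat_id,
        "text": text,
        "reply_markup": build_subtitle_keyboard(job_id),
    }


def parse_callback(data: str) -> tuple[str, int]:
    """Decode callback_data that arrives in a callback_query update."""
    prefix, direction, job_id = data.split(":")
    if prefix != "sub" or direction not in ("up", "down"):
        raise ValueError(f"unexpected callback_data: {data!r}")
    return direction, int(job_id)


if __name__ == "__main__":
    payload = build_send_message_payload(1, "Subtitles", 32)
    print(json.dumps(payload, ensure_ascii=False))
```

A button built this way produces a callback_query update when pressed, which a handler can decode with something like `parse_callback`. Emoji typed into the caption text produce no update at all, which matches the "pressing them did nothing" symptom.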

Git log evidence (last ~48h, public repo mike-prokhorov/content-factory-engine):

fix(handlers): new video auto-cancels active gate-job
fix(handlers): accept compressed sendVideo, warn parallel not reject
fix(transcribe): default lang='ru' fallback (job 23 en bug)
fix(cancel-regex): short-circuit OK/redo interpret on cancel words
fix(e2e): synth frame rect strictly inside safe-zones
fix(weak-smoke): drop last monkey-patch case 5
fix(weak-smoke): drop monkey-patch in case 4 too
fix(weak-smoke): drop dict.keys monkey-patch
fix(weak-gate): parse_mode=None + try/except (job 23 silent crash loop)
fix(gate): add send_transcript_gate_v2 wrapper — restores VPS bot crash

~70% of the 51 commits in the period are fix(...) of bugs introduced by the same model in prior commits.
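
The share of fix(...) commits can be rechecked from commit subjects. The sketch below assumes the subjects come from something like `git log --since=2026-05-06 --format=%s`, one per line; `fix_ratio` is a hypothetical helper name. It only counts subjects labelled as fixes and cannot tell who introduced the bug being fixed.

```python
import re

# Matches "fix: ..." and "fix(scope): ..." subjects, with an optional "!".
FIX_SUBJECT = re.compile(r"^fix(\([^)]*\))?!?:")


def fix_ratio(subjects: list[str]) -> float:
    """Return the fraction of non-empty commit subjects labelled as fixes."""
    cleaned = [s.strip() for s in subjects if s.strip()]
    if not cleaned:
        return 0.0
    fixes = sum(1 for s in cleaned if FIX_SUBJECT.match(s))
    return fixes / len(cleaned)


if __name__ == "__main__":
    sample = [
        "fix(handlers): new video auto-cancels active gate-job",
        "fix(transcribe): default lang='ru' fallback (job 23 en bug)",
    ]
    print(f"{fix_ratio(sample):.0%} of sampled subjects are fixes")
```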

Expected Behavior

1. TEST own code before saying "done." Not a syntax check. Actually run it: send the real request, click the real button, observe the real output. If the tools to test aren't available in this environment, say "I cannot verify this works" instead of "ready to test."
2. Do not mark tasks complete at L0 (file exists). File-on-disk ≠ feature-working. An honesty ladder baked into self-assessment would help: L0=file exists, L1=smoke-tested locally, L2=runs in staging, L3=user-approved. Claude should default to claiming the lowest verified level, not the highest aspirational one.
3. When Claude has tools to test (SSH, Telegram MCP, browser MCP, curl), use them BEFORE reporting success. Not after the user finds the bug.
4. Stop the "should I fix this?" pattern in autonomous mode. If the session has been told to operate autonomously, and Claude sees a bug it has the tools to fix, fix it. Don't ask permission to do the job that was just authorized.
5. Be consistent within a single message. Don't simultaneously ask "should I fix this?" and say "no action needed from you."
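
The honesty ladder in item 2 could be sketched roughly as below. This is one possible encoding, not an existing Claude Code feature: the `Level` enum, `claimed_level` and `status_line` are hypothetical names, and the rule implemented is that a level only counts when every level beneath it has been verified too.

```python
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    L0_FILE_EXISTS = 0
    L1_SMOKE_TESTED = 1
    L2_RUNS_IN_STAGING = 2
    L3_USER_APPROVED = 3


def claimed_level(evidence: dict[Level, bool]) -> Optional[Level]:
    """Return the highest level whose own and all lower checks passed."""
    reached = None
    for level in Level:  # iterates in definition order, L0 first
        if not evidence.get(level, False):
            break  # a gap invalidates every level above it
        reached = level
    return reached


def status_line(evidence: dict[Level, bool]) -> str:
    """Phrase the report so that it never overstates what was verified."""
    level = claimed_level(evidence)
    if level is None:
        return "I cannot verify this works"
    if level is Level.L0_FILE_EXISTS:
        return "L0: file exists, NOT verified working"
    return f"{level.name}: verified up to this level"


if __name__ == "__main__":
    # Staging evidence without a local smoke test still reports L0.
    print(status_line({Level.L0_FILE_EXISTS: True, Level.L2_RUNS_IN_STAGING: True}))
```

The design choice is deliberate: skipping a rung (for example claiming staging without a smoke test) falls back to the last rung with unbroken evidence.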

Files Affected

Public repo: https://github.com/mike-prokhorov/content-factory-engine

Core code touched repeatedly in 12 days:
- src/bot/handlers.py (791 LOC)
- src/bot/gate.py (1022 LOC)
- src/bot/worker.py (421 LOC)
- src/db/db.py (400 LOC)
- src/transcribe/openai_whisper.py (198 LOC)
- many .team/scripts/* harness/smoke files

Total: ~6,800 LOC core, ~12,500 LOC whole repo, 51 commits 2026-05-06 → 2026-05-18.

Permission Mode

Accept Edits was ON (auto-accepting changes)

Can You Reproduce This?

Yes, every time with the same prompt

Steps to Reproduce

No response

Claude Model

Opus

Relevant Conversation

Impact

High - Significant unwanted changes

Claude Code Version

2.1.143 (Claude Code) — model: Opus 4.7 (1M context)

Platform

Anthropic API

Additional Context

Stats (verified against git + state files):

- 12 days (first commit 2026-05-06, today 2026-05-18)
- 51 commits in the period, ~70% are fix(...)
- 3 parallel Claude Code sessions running on the same repo with shared state
- ~6,800 LOC core bot, ~12,500 LOC whole repo
- Bugs found BY ME (the human) that Claude should have caught: 8+
- Bugs introduced BY CLAUDE during fixes: 5+
- Times Claude said "ready to test" without actually testing: 10+
- Working product delivered after 12 days: 0

Patterns I've noticed:

- Worse in long sessions (after compaction)
- Worse when multiple parallel agents share state files: each agent claims completion without checking what the others verified
- Worse on deploy/test loops over SSH: Claude prefers to read local code and infer behavior rather than ssh && tail -f logs && curl endpoint
- The "ready to test" lie correlates strongly with steps that could have been verified with available tools but weren't
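
One way to guard the deploy/test loop against the "tested the old code" failure is to check which revision the server is running before any smoke test counts. The sketch below is an assumption-laden illustration: `verify_then_report` and its callables are hypothetical, and in practice the callables would wrap whatever SSH or HTTP checks the project already has.

```python
from typing import Callable


def verify_then_report(
    expected_rev: str,
    get_deployed_rev: Callable[[], str],
    run_smoke_test: Callable[[], bool],
) -> str:
    """Report success only if the expected revision is live and passes.

    get_deployed_rev and run_smoke_test are injected so the ordering logic
    can be exercised without a server (both are hypothetical hooks).
    """
    deployed = get_deployed_rev().strip()
    if deployed != expected_rev:
        # Testing now would exercise the old code, so stop before the test.
        return f"NOT VERIFIED: server runs {deployed}, expected {expected_rev}; deploy first"
    if not run_smoke_test():
        return f"FAILED: smoke test failed on {deployed}"
    return f"VERIFIED: smoke test passed on {deployed}"


if __name__ == "__main__":
    # Stand-in callables; real ones would query the server.
    print(verify_then_report("abc123", lambda: "old999\n", lambda: True))
```

Because the revision check comes first, a mismatch returns before the smoke test is ever called, so a pass or fail can only ever describe the build that was meant to be tested.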

Why I'm filing this:

Not to vent. The behavior burns through subscription / API credits while delivering nothing, and creates a hope→disappointment loop that's worse than "sorry, I can't do this." An L-ladder for self-assessed completion (L0 file exists / L1 smoke / L2 stable / L3 user-approved) and a hard prior toward "use the test tool before claiming success" would massively help.
