claude-code - 💡(How to fix) Fix [MODEL] Claude marks tasks done without testing — 12 days, 51 commits, still broken (Opus 4.x)




Preflight Checklist

  • I have searched existing issues for similar behavior reports
  • This report does NOT contain sensitive information (API keys, passwords, etc.)

Type of Behavior Issue

Claude ignored my instructions or configuration

What You Asked Claude to Do

Across ~50 sessions over 12 days, the umbrella task was to finish, deploy and stabilize a Telegram bot: content-factory-engine (public repo mike-prokhorov/content-factory-engine), a Python pipeline that runs receive video → Whisper transcript → Claude script-cut → ffmpeg render → post. Three Claude Code sessions ran in parallel (named Sёма / Шнырь / supervisor), each working on its slice with shared state files. Operating mode: auto / accept-edits. CLAUDE.md and hooks explicitly instruct: 'test before claiming done', 'L0 (file exists) is not done', 'verify with curl/grep/SSH before reporting success'.

Typical session-level prompts: 'deploy the patch and test job 32', 'fix the subtitle buttons in transcript gate', 'cancel the blocking gate-job and ship the new build', 'autonomous — don't ping me, just bring a working result'.

What Claude Actually Did

The repeating pattern across 12 days / 51 commits (mostly fix(...)) on a ~6,800 LOC core codebase (which includes handlers 791, gate 1022, worker 421, db 400 and transcribe 198 LOC; the whole repo is ~12,500 LOC):

  1. Claude says "done, ready to test" — without actually running anything.
  2. I test → it's broken → Claude says "oh I see the bug, fixing".
  3. Claude says "fixed, ready to test" — again without testing.
  4. I test → a different thing is broken.
  5. Repeat for 12 days.

Concrete examples from today (2026-05-18):

  • Claude wrote subtitle "buttons" as plain text emoji (⬆⬇) inside a Telegram message caption — not real InlineKeyboardButton objects wired to a callback. Then marked the task done. Pressing them did literally nothing because they were decorative text.
  • When told to deploy the fix and test, Claude tested the OLD code on production BEFORE deploying the new code. Obviously failed. Another hour wasted re-running the same cycle.
  • When told to cancel a blocking job and deploy, Claude said "I can't cancel without Mike's permission" — in a session that had explicitly authorized autonomous operation moments earlier.
  • Claude regularly asks "should I fix this?" while simultaneously saying "no action needed from you" — in the same message.
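For the first example, the difference between decorative caption text and working buttons shows up in the request payload itself. Below is a minimal sketch, assuming the bot sends raw payloads to the Telegram Bot API. The helper names and the `callback_data` values `sub_up` / `sub_down` are hypothetical and not taken from the repo.

```python
import json

def build_subtitle_keyboard():
    """Return an InlineKeyboardMarkup as a plain dict (Telegram Bot API shape).

    The callback_data strings are hypothetical placeholders.
    """
    return {
        "inline_keyboard": [[
            {"text": "⬆", "callback_data": "sub_up"},
            {"text": "⬇", "callback_data": "sub_down"},
        ]]
    }

def build_send_video_payload(chat_id, video_file_id, caption):
    """Payload for sendVideo: the caption is plain text, buttons live in reply_markup."""
    return {
        "chat_id": chat_id,
        "video": video_file_id,
        "caption": caption,
        # Serialized to JSON here, as needed for form-encoded requests.
        "reply_markup": json.dumps(build_subtitle_keyboard()),
    }

def has_real_buttons(payload):
    """True only if the payload carries at least one button with callback_data."""
    markup = payload.get("reply_markup")
    if not markup:
        return False
    rows = json.loads(markup).get("inline_keyboard", [])
    return any("callback_data" in btn for row in rows for btn in row)
```

A payload whose caption merely contains "⬆⬇" and has no `reply_markup` fails `has_real_buttons`. Passing the check is still only part of the job: the bot also needs a handler for the resulting `callback_query` updates, and only an actual button press confirms that.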

Git log evidence (last ~48h, public repo mike-prokhorov/content-factory-engine):

fix(handlers): new video auto-cancels active gate-job
fix(handlers): accept compressed sendVideo, warn parallel not reject
fix(transcribe): default lang='ru' fallback (job 23 en bug)
fix(cancel-regex): short-circuit OK/redo interpret on cancel words
fix(e2e): synth frame rect strictly inside safe-zones
fix(weak-smoke): drop last monkey-patch case 5
fix(weak-smoke): drop monkey-patch in case 4 too
fix(weak-smoke): drop dict.keys monkey-patch
fix(weak-gate): parse_mode=None + try/except (job 23 silent crash loop)
fix(gate): add send_transcript_gate_v2 wrapper — restores VPS bot crash

~70% of the 51 commits in the period are fix(...) commits for bugs introduced by the same model in prior commits.
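The fix(...) share can be recomputed from commit subjects, for instance the output of `git log --format=%s --since=2026-05-06`. A minimal sketch, assuming conventional-commit style subjects; the `feat(...)` line in the sample is a made-up non-fix subject for illustration only.

```python
import re

# Conventional-commit style subject: "fix(scope): message" or "fix: message".
FIX_SUBJECT = re.compile(r"^fix(\([^)]*\))?:")

def fix_share(subjects):
    """Fraction of commit subjects that are fix commits (0.0 for an empty list)."""
    if not subjects:
        return 0.0
    fixes = sum(1 for s in subjects if FIX_SUBJECT.match(s.strip()))
    return fixes / len(subjects)

sample = [
    "fix(handlers): new video auto-cancels active gate-job",
    "fix(transcribe): default lang='ru' fallback (job 23 en bug)",
    "feat(gate): hypothetical non-fix subject for illustration",
]
print(f"{fix_share(sample):.0%}")
```

This only counts subjects that start with `fix`. It does not establish which earlier commit introduced each bug; that part still needs a manual read of the history.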

Expected Behavior

  1. TEST own code before saying "done." Not syntax-check. Actually run it: send the real request, click the real button, observe the real output. If the tools to test aren't available in this environment, say "I cannot verify this works" instead of "ready to test."

  2. Do not mark tasks complete at L0 (file exists). File-on-disk ≠ feature-working. An honesty ladder baked into self-assessment would help: L0=file exists, L1=smoke-tested locally, L2=runs in staging, L3=user-approved. Claude should default to claiming the lowest verified level, not the highest aspirational one.

  3. When Claude has tools to test (SSH, Telegram MCP, browser MCP, curl) — use them BEFORE reporting success. Not after the user finds the bug.

  4. Stop the "should I fix this?" pattern in autonomous mode. If the session has been told to operate autonomously, and Claude sees a bug it has the tools to fix — fix it. Don't ask permission to do the job that was just authorized.

  5. Be consistent within a single message. Don't simultaneously ask "should I fix this?" and say "no action needed from you."
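The honesty ladder from point 2 could be sketched as follows. This illustrates the requested behavior and is not an existing Claude Code feature; the names `Level`, `claimed_level` and `report` are hypothetical.

```python
from enum import IntEnum

class Level(IntEnum):
    L0_FILE_EXISTS = 0
    L1_SMOKE_TESTED = 1
    L2_RUNS_IN_STAGING = 2
    L3_USER_APPROVED = 3

def claimed_level(evidence):
    """Highest level reachable without skipping an unverified rung.

    `evidence` maps Level -> bool (True only if that check was actually run
    and passed). Returns None if not even L0 is verified.
    """
    reached = None
    for level in Level:  # iterates in definition order, L0 first
        if not evidence.get(level, False):
            break
        reached = level
    return reached

def report(evidence):
    """Status line that never claims more than the lowest verified rung allows."""
    level = claimed_level(evidence)
    if level is None:
        return "I cannot verify this works"
    return f"verified up to {level.name}"
```

With evidence for L0 and L2 but not L1, `claimed_level` returns L0: a gap in the ladder caps the claim at the last rung that was verified without interruption.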

Files Affected

Public repo: https://github.com/mike-prokhorov/content-factory-engine

Core code touched repeatedly in 12 days:
- src/bot/handlers.py (791 LOC)
- src/bot/gate.py (1022 LOC)
- src/bot/worker.py (421 LOC)
- src/db/db.py (400 LOC)
- src/transcribe/openai_whisper.py (198 LOC)
- many .team/scripts/* harness/smoke files

Total: ~6,800 LOC core, ~12,500 LOC whole repo, 51 commits 2026-05-06 → 2026-05-18.

Permission Mode

Accept Edits was ON (auto-accepting changes)

Can You Reproduce This?

Yes, every time with the same prompt

Steps to Reproduce

No response

Claude Model

Opus

Relevant Conversation

Impact

High - Significant unwanted changes

Claude Code Version

2.1.143 (Claude Code) — model: Opus 4.7 (1M context)

Platform

Anthropic API

Additional Context

Stats (verified against git + state files):

  • 12 days (first commit 2026-05-06, today 2026-05-18)
  • 51 commits in the period, ~70% are fix(...)
  • 3 parallel Claude Code sessions running on the same repo with shared state
  • ~6,800 LOC core bot, ~12,500 LOC whole repo
  • Bugs found BY ME (the human) that Claude should have caught: 8+
  • Bugs introduced BY CLAUDE during fixes: 5+
  • Times Claude said "ready to test" without actually testing: 10+
  • Working product delivered after 12 days: 0

Patterns I've noticed:

  • Worse in long sessions (after compaction)
  • Worse when multiple parallel agents share state files — each agent claims completion without checking what the others verified
  • Worse on deploy/test loops over SSH: Claude prefers to read local code and infer behavior rather than run the checks (ssh in, tail the logs, curl the endpoint)
  • The "ready to test" lie correlates strongly with steps that could have been verified with available tools but weren't
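The deploy/test pattern above comes down to ordering: deploy, confirm which revision the server is running, and only then test. A minimal sketch with injected callables; all three are hypothetical placeholders that in practice might wrap an SSH command, a `git rev-parse HEAD` on the server, and a curl request.

```python
def deploy_then_test(expected_rev, deploy, get_deployed_rev, run_smoke_test):
    """Deploy, confirm the server runs expected_rev, and only then test.

    All three callables are injected placeholders (hypothetical): `deploy()`
    pushes the build, `get_deployed_rev()` returns the revision reported by
    the server, `run_smoke_test()` returns True on success.
    """
    deploy()
    deployed = get_deployed_rev()
    if deployed != expected_rev:
        # Testing now would exercise the old code, so stop and say so.
        return f"not tested: server runs {deployed}, expected {expected_rev}"
    if not run_smoke_test():
        return f"tested {deployed}: smoke test FAILED"
    return f"tested {deployed}: smoke test passed"
```

The returned string always names the revision that was exercised, so a "passed" report cannot refer to code that never reached the server.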

Why I'm filing this:

Not to vent. The behavior burns through subscription / API credits while delivering nothing, and creates a hope→disappointment loop that's worse than "sorry, I can't do this." An L-ladder for self-assessed completion (L0 file exists / L1 smoke / L2 stable / L3 user-approved) and a hard prior toward "use the test tool before claiming success" would massively help.
