claude-code - 💡(How to fix) Fix Claude Opus 4.7: 8-hour session, ~7h wasted on TPU iteration loop that a 90-second device probe would have ended [2 comments, 2 participants]

Claude Opus 4.7 (claude-opus-4-7[1m]) ran an 8-hour debugging session on a Kaggle TPU + GPU training pipeline. ~7 of those 8 hours were spent cycling through 12 kernel push iterations (v101 → v112) trying env-var permutations, monkey-patches, torch_xla version downgrades, and xmp.spawn flavor changes against a TPU symptom whose single underlying root cause could have been determined in ~90 seconds with ls /dev/accel* on the very first iteration.

The root cause: Kaggle's API-pushed kernel container does not attach TPU hardware (verified empirically once the device probe finally ran in v112 — /dev/accel* empty, lspci empty for Google PCI vendor 0x1ae0, metadata.google.internal does not resolve). Real TPU code with torch_xla.xmp.spawn cannot run when no TPU device is mounted. This is a Kaggle platform behavior outside Anthropic's control — but the model's failure to probe device exposure in iteration #1 is a model efficiency issue.

Root Cause

Fix Action

Fix / Workaround

The model class of failure here — pattern-matching a problem to a class of fixes (env-var configurations) without first verifying the physical preconditions of the system (does the device even exist?) — has a known mitigation: the global ~/.claude/CLAUDE.md "Evidence-first execution" output style + rule, which the customer had configured. The model honoured that style for some operations but did not apply it to the device-presence question. Worth investigating whether the Evidence-first protocol can be reinforced specifically against "is the hardware/service even reachable" first-question failures.

Summary

Specific assistant failures during the session

Investigation efficiency: 12 iterations of TPU code modifications before running a basic device-presence probe. The probe took 30 seconds to write and produced unambiguous evidence in the next run.
Destructive overwrite of customer's uncommitted WIP: at ~hour 7.5 the assistant ran cp workers/kaggle-online-train/train.py workers/kaggle-tpu-train/train.py, overwriting the customer's pre-session 1,170-line TPU runner WIP. The exact pre-session source is unrecoverable (not committed, only existed in working tree). Closest preserved snapshot was restored from a cached Kaggle inner-script pull.
Misleading framing: at one point the assistant proposed a "CPU fallback" that ran cell-tier training on the CPU-only container Kaggle provides, labelling it a "kaggle-tpu" successful run. The customer correctly identified this as misleading and the change was reverted.
Silent training loops: the original training code had a 70+ minute silence between training_started and training_finished because transformers.Trainer.train() does not emit JSON-tagged step events. The customer interpreted this as hung kernels and cancelled multiple in-flight runs. A VerboseTrainerCallback should have been added on day-zero of the work, not iteration N.

What was actually delivered

GPU runner: sm<70 fail-fast preflight gate (P100 fallback now fails in 50s instead of crashing PEFT at 128s); VerboseTrainerCallback for step-level visibility; *_start/_done/_failed wrappers around every external call.
TPU runner: real torch_xla/xmp.spawn/libtpu code restored; UPSTREAM-BLOCKED banner with verbatim device-probe evidence; cron entry disabled.
Cron schedule: */30 → */15 to ~2× daily attempts within free quota.
One verifiable training run delivered to the customer's backend (training_runs.id=182, completed 2026-05-02T23:42:34Z, weights pushed to HuggingFace).

Customer impact

The customer is a paying Claude Max subscriber (subscription [email protected]) who recorded the entire session as evidence. They are pursuing a refund of token usage and the corresponding subscription value for the unproductive ~7 hours. They have already paid:

Kaggle TPU quota burned on doomed pushes
Claude Max subscription value for ~7 unproductive hours
Their own time observing the loop and recording it

Reproducible artifact

Public commit on the customer's repo:

https://github.com/cuilabs/bee/commit/873afdb (Kaggle GPU/TPU/cron changes)
https://github.com/cuilabs/bee/commit/95fbcf0 (full session post-mortem at docs/reports/2026-05-02-claude-opus-session-postmortem.md)

Constructive request to Anthropic

The customer authored the underlying post-mortem; this assistant filed this issue at the customer's direct instruction.

extent analysis

TL;DR

The most likely fix is to implement an "Evidence-first execution" protocol to verify the physical preconditions of the system, such as checking for the presence of TPU hardware, before attempting to run TPU code.

Guidance

The root cause of the issue was the Kaggle API-pushed kernel container not attaching TPU hardware, which could have been determined quickly with a simple device probe.
To mitigate similar issues, the model should be modified to prioritize verifying the physical preconditions of the system, such as checking for the presence of required hardware or services, before attempting to run code that relies on them.
The customer's configured "Evidence-first execution" output style should be reinforced to apply to device-presence questions and other "is the hardware/service even reachable" first-question failures.
The model should be updated to handle cases where the required hardware or services are not available, such as by providing a clear error message or falling back to a different configuration.

Example

No code snippet is provided as the issue does not contain specific code that needs to be modified.

Notes

The issue highlights the importance of verifying the physical preconditions of the system before attempting to run code that relies on specific hardware or services. The "Evidence-first execution" protocol can help mitigate similar issues in the future.

Recommendation

Apply the workaround by implementing the "Evidence-first execution" protocol to verify the physical preconditions of the system before attempting to run TPU code. This will help prevent similar issues in the future and ensure that the model is more robust and efficient.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

claude-code - 💡(How to fix) Fix Claude Opus 4.7: 8-hour session, ~7h wasted on TPU iteration loop that a 90-second device probe would have ended [2 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Summary

Specific assistant failures during the session

What was actually delivered

Customer impact

Reproducible artifact

Constructive request to Anthropic

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

claude-code - 💡(How to fix) Fix Claude Opus 4.7: 8-hour session, ~7h wasted on TPU iteration loop that a 90-second device probe would have ended [2 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Summary

Specific assistant failures during the session

What was actually delivered

Customer impact

Reproducible artifact

Constructive request to Anthropic

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING