claude-code: [BUG] Claude deployed over ephemeral data without verifying backup, causing ~12 hours of processing and ~$50-100 in API costs lost

GitHub stats
anthropics/claude-code#48430 (fetched 2026-04-16 07:00:21)
Comments: 1
Participants: 2
Timeline events: 5 (labeled ×4, commented ×1)
Reactions: 0

Preflight Checklist

  • I have searched existing issues and this hasn't been reported yet
  • This is a single bug report (please file separate reports for different bugs)
  • I am using the latest version of Claude Code

What's Wrong?

I was running a long-running batch job on Fly.io that processed 140 YouTube videos through an analysis pipeline (each requiring Claude API calls for visual analysis, scoring, and synthesis). The batch took ~12 hours and completed successfully — 139/140 videos processed.

The batch results were written to /tmp on the Fly.io machine (primary storage), with Cloudflare R2 (S3-compatible object storage) as a backup. The R2 uploads were wrapped in a try/catch and treated as non-fatal, so failures were only logged as warnings and never surfaced.
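A non-fatal upload does not have to be an invisible one. The following is a minimal sketch, not the pipeline's actual code: `Uploader` is a hypothetical stand-in for the real R2 client call (which the report does not show), and `BackupLedger`, `backupResult`, and `assertBackupsHealthy` are illustrative names. It keeps the batch running on upload failure but records each failure so it can be checked later.

```typescript
// Hypothetical uploader signature; the real R2 client call is not shown in the report.
type Uploader = (key: string, body: string) => Promise<void>;

// Running record of which backups succeeded and which failed.
interface BackupLedger {
  succeeded: string[];
  failed: { key: string; reason: string }[];
}

// Attempt one backup. A failure does not stop the batch, but it is recorded
// in the ledger instead of only being written to the log.
async function backupResult(
  upload: Uploader,
  ledger: BackupLedger,
  key: string,
  body: string,
): Promise<void> {
  try {
    await upload(key, body);
    ledger.succeeded.push(key);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    ledger.failed.push({ key, reason });
  }
}

// Throw if any backup failed. Intended to be called at a checkpoint,
// for example at the end of the batch or before a deploy.
function assertBackupsHealthy(ledger: BackupLedger): void {
  if (ledger.failed.length > 0) {
    const first = ledger.failed[0];
    throw new Error(
      `${ledger.failed.length} backup upload(s) failed; first: ${first.key} (${first.reason})`,
    );
  }
}
```

Under these assumptions, calling `assertBackupsHealthy` after the first few uploads would likely have surfaced the 403 responses long before the deploy.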

When the batch completed, Claude immediately deployed a new version of the backend to Fly.io without:

  1. Checking whether R2 backups had actually succeeded
  2. Checking /tmp for the batch data
  3. Verifying the data was persisted anywhere before restarting the machines

The deploy wiped /tmp, and it turned out R2 had been returning 403 Forbidden the entire time — every single upload failed silently. All 139 batch results were lost.

Expected behavior

Before taking a destructive action (deploying, which restarts machines and wipes ephemeral storage), Claude should have:

  • Verified the batch data was safely persisted (checked R2 accessibility or at minimum tested a single download)
  • Warned me that deploying would wipe /tmp where the only copy of the data lived
  • Noticed the R2 403 errors in the Fly logs before proceeding

Impact

  • ~12 hours of compute time on a performance-4x Fly.io machine wasted
  • 139 Claude API calls for visual analysis (claude-sonnet, vision + long context) lost
  • 139 Claude API calls for scoring lost
  • 139 Claude API calls for strategic synthesis lost
  • Total: ~417 Claude API calls worth of tokens, plus Replicate (Demucs) and Fly.io compute costs

Estimated cost of lost work: $50-100 (primarily Claude API tokens for 278+ Sonnet calls including vision, plus Replicate, Fly.io compute, and proxy bandwidth for 12 hours of processing)

Context

Claude was monitoring the batch progress the entire time via Fly.io log streaming. It knew the batch was running, knew results were being written to /tmp, and had access to the R2 upload code showing the non-fatal try/catch pattern. The information to avoid this mistake was available throughout the conversation.

What Should Happen?

Expected behavior

Claude's own system instructions state: "Carefully consider the reversibility and blast radius of actions" and "for actions that are hard to reverse, affect shared systems beyond your local environment, or could otherwise be risky or destructive, check with the user before proceeding."

Deploying to Fly.io restarts machines and wipes ephemeral storage. Claude had full context to know this was destructive:

  1. It had read the batch processing code — it knew results were written to /tmp/batch_${id}.json as primary storage, with R2 as a non-fatal secondary backup
  2. It had been monitoring the batch for ~12 hours — it watched every single video complete via log streaming and knew exactly where the data lived
  3. It had read the R2 upload code — it could see the try/catch with console.warn that silently swallowed failures
  4. It had access to Fly logs — the R2 403 errors were in the logs the entire time

Before deploying, Claude should have:

  • Tested a single R2 download to verify the backup was intact (one curl command)
  • Checked Fly logs for R2 warnings before assuming data was persisted
  • Warned me that deploying would wipe /tmp and asked for confirmation
  • At minimum, SSH'd into the machine to confirm the batch files existed and were backed up

Instead, it deployed immediately with no verification, treating a destructive action as routine.
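The pre-deploy checks listed above can be thought of as a gate that either lets the deploy proceed or hands the decision back to the user. This is a rough sketch under stated assumptions: `PersistenceReport`, `DeployDecision`, and `decideDeploy` are hypothetical names, and how the report gets populated (SSH listing of /tmp, a test download, a log scan) is left out.

```typescript
// Hypothetical summary of what is known about the data before a deploy.
interface PersistenceReport {
  localFiles: string[];     // files found on ephemeral storage, e.g. under /tmp
  verifiedRemote: string[]; // files confirmed readable in backup storage
  logWarnings: string[];    // backup-related warnings found in the logs
}

type DeployDecision =
  | { action: "proceed" }
  | { action: "ask_user"; reasons: string[] };

// Decide whether a deploy that wipes ephemeral storage is safe to run
// without asking. Any unverified file or backup warning blocks it.
function decideDeploy(report: PersistenceReport): DeployDecision {
  const verified = new Set(report.verifiedRemote);
  const unbacked = report.localFiles.filter((file) => !verified.has(file));
  const reasons: string[] = [];

  if (unbacked.length > 0) {
    reasons.push(
      `${unbacked.length} of ${report.localFiles.length} local file(s) have no verified backup`,
    );
  }
  if (report.logWarnings.length > 0) {
    reasons.push(`${report.logWarnings.length} backup warning(s) found in logs`);
  }

  return reasons.length === 0 ? { action: "proceed" } : { action: "ask_user", reasons };
}
```

The gate deliberately defaults to asking: a file counts as safe only when its backup was positively verified, not when no failure happened to be noticed.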

Error Messages/Logs

Steps to Reproduce

Steps to reproduce

  1. Have a long-running batch job writing results to /tmp on a Fly.io machine, with a secondary backup to cloud storage (R2/S3) wrapped in a non-fatal try/catch
  2. Have the cloud storage credentials be invalid (returning 403), so all backups silently fail
  3. Monitor the batch with Claude over ~12 hours — Claude streams the logs and sees every video complete
  4. When the batch finishes, ask Claude to deploy new code to the same Fly.io app
  5. Claude will deploy immediately without checking whether the data survived, wiping /tmp and all results

How to reproduce the core issue more simply

  1. Have any process writing important data to ephemeral storage (/tmp on Fly.io, or any container that restarts on deploy)
  2. Ask Claude to deploy or restart the service
  3. Claude will not verify data persistence before taking the destructive action, even when it has full context about where the data lives and how it's backed up

Claude Model

Sonnet (default)

Is this a regression?

I don't know

Last Working Version

No response

Claude Code Version

claude-opus-4-6

Platform

Anthropic API

Operating System

macOS

Terminal/Shell

Terminal.app (macOS)

Additional Information

No response

extent analysis

TL;DR

To prevent data loss, Claude should verify that data is persisted before deploying new code to Fly.io: confirm that the R2 backups succeeded, and warn the user that the deploy will wipe ephemeral storage.

Guidance

  • Before deploying, Claude should test a single R2 download to verify the backup is intact.
  • Claude should check Fly logs for R2 warnings or errors to ensure data is persisted.
  • Claude should warn users that deploying will wipe ephemeral storage (/tmp) and ask for confirmation before proceeding.
  • Consider implementing a pre-deployment check to verify data safety, especially for long-running batch jobs with important data written to ephemeral storage.
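The backup test in the guidance above does not need to read every object; checking a few is usually enough to catch a blanket failure such as invalid credentials. A minimal sketch, assuming a hypothetical `Probe` function that resolves to the HTTP status code for one object key (for instance by issuing a HEAD request to a URL the user supplies); `spotCheckBackups` is an illustrative name:

```typescript
// Hypothetical probe: resolves to the HTTP status code for one object key.
type Probe = (key: string) => Promise<number>;

interface ProbeFailure {
  key: string;
  status: number;
}

// Probe the first few keys and return those that did not come back 2xx.
// An empty result means every sampled backup was readable.
async function spotCheckBackups(
  keys: string[],
  probe: Probe,
  sampleSize = 3,
): Promise<ProbeFailure[]> {
  const failures: ProbeFailure[] = [];
  for (const key of keys.slice(0, sampleSize)) {
    const status = await probe(key);
    if (status < 200 || status >= 300) {
      failures.push({ key, status });
    }
  }
  return failures;
}
```

In the incident described, every upload returned 403, so even a sample of one would have failed this check.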

Example

The issue itself does not include a code example. A single curl request for one R2 object, or a scan of the Fly logs for upload errors, is enough to verify whether the data was persisted.
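As an illustration of the log check, the sketch below filters log lines for ones that mention both a backup and a failure. The log format is an assumption (the report does not show the actual Fly log lines), the patterns are deliberately broad, and `findBackupFailures` is a hypothetical name.

```typescript
// Return log lines that mention both backup storage and a failure.
// Assumed input: plain-text log lines; the real log format is not shown in the report.
function findBackupFailures(logLines: string[]): string[] {
  const mentionsBackup = /\b(R2|S3|upload|backup)\b/i;
  const looksFailed = /\b(403|forbidden|failed|error)\b/i;
  return logLines.filter(
    (line) => mentionsBackup.test(line) && looksFailed.test(line),
  );
}

// Example with made-up log lines:
const sampleLines = [
  "[warn] R2 upload failed for batch_17.json: 403 Forbidden",
  "[info] video 17/140 complete",
  "[info] upload ok batch_18.json",
];
const suspicious = findBackupFailures(sampleLines);
// suspicious contains only the first line.
```

A non-empty result is a reason to stop and ask before deploying, not proof of data loss; the broad patterns will produce some false positives.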

Notes

This issue highlights the importance of careful consideration before taking destructive actions, especially when dealing with ephemeral storage and critical data. The solution involves adding verification steps to ensure data safety before deployment.

Recommendation

As a workaround, add pre-deployment checks for data persistence to the deployment workflow, such as a test download from R2 and a scan of the Fly logs for warnings, and have Claude run them before any deploy. This should help prevent similar data loss incidents.
