claude-code: [BUG] Claude deployed over ephemeral data without verifying backup, causing ~12 hours of processing and ~$50-100 in API costs lost [1 comment, 2 participants]

Preflight Checklist

I have searched existing issues and this hasn't been reported yet
This is a single bug report (please file separate reports for different bugs)
I am using the latest version of Claude Code

What's Wrong?

I was running a long batch job on Fly.io that processed 140 YouTube videos through an analysis pipeline (each requiring Claude API calls for visual analysis, scoring, and synthesis). The batch took ~12 hours and finished with 139 of 140 videos processed.

The batch results were written to /tmp on the Fly.io machine as primary storage, with Cloudflare R2 (S3-compatible object storage) as a backup. The R2 uploads were wrapped in a try/catch and treated as non-fatal, so failures were only logged as warnings and never surfaced.
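The non-fatal upload pattern described above can be kept while still making failures visible to whatever runs next. The following is a minimal sketch, not the code from this project: `UploadFn`, `UploadTracker`, `uploadWithTracking`, and `backupIsComplete` are hypothetical names, and the real R2 client call is assumed to be passed in as `upload`.

```typescript
// Hypothetical sketch: the backup upload stays non-fatal for the batch, but
// every failure is recorded so a later step can refuse to trust the backup.
type UploadFn = (key: string, body: string) => Promise<void>;

interface UploadTracker {
  succeeded: string[];
  failed: { key: string; reason: string }[];
}

function createTracker(): UploadTracker {
  return { succeeded: [], failed: [] };
}

async function uploadWithTracking(
  upload: UploadFn,
  tracker: UploadTracker,
  key: string,
  body: string,
): Promise<boolean> {
  try {
    await upload(key, body);
    tracker.succeeded.push(key);
    return true;
  } catch (err) {
    // The batch keeps running, but the failure is stored, not only logged.
    const reason = err instanceof Error ? err.message : String(err);
    tracker.failed.push({ key, reason });
    console.warn(`backup upload failed for ${key}: ${reason}`);
    return false;
  }
}

// True only when at least one upload happened and none of them failed.
function backupIsComplete(tracker: UploadTracker): boolean {
  return tracker.succeeded.length > 0 && tracker.failed.length === 0;
}
```

With something like this in place, a batch could report `tracker.failed.length` at the end of the run, so 139 failed uploads would show up as a summary rather than as scattered warnings.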

When the batch completed, Claude immediately deployed a new version of the backend to Fly.io without:

Checking whether R2 backups had actually succeeded
Checking /tmp for the batch data
Verifying the data was persisted anywhere before restarting the machines

The deploy wiped /tmp, and it turned out R2 had been returning 403 Forbidden the entire time — every single upload failed silently. All 139 batch results were lost.

Expected behavior

Before taking a destructive action (deploying, which restarts machines and wipes ephemeral storage), Claude should have:

Verified the batch data was safely persisted (checked R2 accessibility or at minimum tested a single download)
Warned me that deploying would wipe /tmp where the only copy of the data lived
Noticed the R2 403 errors in the Fly logs before proceeding

Impact

~12 hours of compute time on a performance-4x Fly.io machine wasted
139 Claude API calls for visual analysis (claude-sonnet, vision + long context) lost
139 Claude API calls for scoring lost
139 Claude API calls for strategic synthesis lost
Total: ~417 Claude API calls worth of tokens, plus Replicate (Demucs) and Fly.io compute costs

Estimated cost of lost work: $50-100 (primarily Claude API tokens for 278+ Sonnet calls including vision, plus Replicate, Fly.io compute, and proxy bandwidth for 12 hours of processing)

Context

Claude was monitoring the batch progress the entire time via Fly.io log streaming. It knew the batch was running, knew results were being written to /tmp, and had access to the R2 upload code showing the non-fatal try/catch pattern. The information to avoid this mistake was available throughout the conversation.

What Should Happen?

Expected behavior

Claude's own system instructions state: "Carefully consider the reversibility and blast radius of actions" and "for actions that are hard to reverse, affect shared systems beyond your local environment, or could otherwise be risky or destructive, check with the user before proceeding."

Deploying to Fly.io restarts machines and wipes ephemeral storage. Claude had full context to know this was destructive:

It had read the batch processing code — it knew results were written to /tmp/batch_${id}.json as primary storage, with R2 as a non-fatal secondary backup
It had been monitoring the batch for ~12 hours — it watched every single video complete via log streaming and knew exactly where the data lived
It had read the R2 upload code — it could see the try/catch with console.warn that silently swallowed failures
It had access to Fly logs — the R2 403 errors were in the logs the entire time
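A log check of the kind implied above could be as small as the following sketch. It assumes the log text has already been captured as a string; the function name `scanLogsForBackupFailures` and the default patterns are hypothetical and would need to match the warning text the application actually emits.

```typescript
// Hypothetical sketch: scan captured log text for signs that backup uploads
// failed. The default patterns are assumptions, not the app's real messages.
interface LogScanResult {
  totalLines: number;
  suspectLines: string[];
}

// No global flag, so RegExp.test() stays stateless across lines.
const DEFAULT_PATTERNS: RegExp[] = [/\b403\b/, /forbidden/i, /upload failed/i];

function scanLogsForBackupFailures(
  logText: string,
  patterns: RegExp[] = DEFAULT_PATTERNS,
): LogScanResult {
  // Drop blank lines so the count reflects real log entries.
  const lines = logText.split(/\r?\n/).filter((line) => line.trim() !== "");
  const suspectLines = lines.filter((line) =>
    patterns.some((pattern) => pattern.test(line)),
  );
  return { totalLines: lines.length, suspectLines };
}
```

Any non-empty `suspectLines` would be a reason to stop and look before restarting machines. A bare `403` pattern can also match unrelated numbers, so a stricter pattern is likely worth using in practice.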

Before deploying, Claude should have:

Tested a single R2 download to verify the backup was intact (one curl command)
Checked Fly logs for R2 warnings before assuming data was persisted
Warned me that deploying would wipe /tmp and asked for confirmation
At minimum, SSH'd into the machine to confirm the batch files existed and were backed up

Instead, it deployed immediately with no verification, treating a destructive action as routine.
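The "confirm the batch files existed and were backed up" step can be sketched as a pure comparison between the ids that should exist and two listings. This is an illustration only: `reconcileBatchOutputs` is a hypothetical name, the listings are assumed to be gathered separately (for example from a directory listing and a bucket listing), and the backup keys are assumed to use the same `batch_${id}.json` file name as the local files.

```typescript
// Hypothetical sketch: compare expected batch ids against a local file listing
// and a backup key listing before anything restarts.
interface ReconcileReport {
  missingLocally: string[];
  missingInBackup: string[];
  safeToRestart: boolean;
}

// Mirrors the /tmp/batch_${id}.json naming described in the report.
function batchFileName(id: string): string {
  return `batch_${id}.json`;
}

function reconcileBatchOutputs(
  expectedIds: string[],
  localFiles: string[],
  backupKeys: string[], // assumption: backup keys reuse the local file name
): ReconcileReport {
  const local = new Set(localFiles);
  const backup = new Set(backupKeys);
  const missingLocally = expectedIds.filter((id) => !local.has(batchFileName(id)));
  const missingInBackup = expectedIds.filter((id) => !backup.has(batchFileName(id)));
  // A restart wipes the local copies, so only the backup decides safety.
  const safeToRestart = expectedIds.length > 0 && missingInBackup.length === 0;
  return { missingLocally, missingInBackup, safeToRestart };
}
```

In the situation described here, the local listing would have been complete and the backup listing empty, so `safeToRestart` would have been `false` and the local files could still have been copied off the machine.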

Error Messages/Logs

Steps to Reproduce

Steps to reproduce

Have a long-running batch job writing results to /tmp on a Fly.io machine, with a secondary backup to cloud storage (R2/S3) wrapped in a non-fatal try/catch
Have the cloud storage credentials be invalid (returning 403), so all backups silently fail
Monitor the batch with Claude over ~12 hours — Claude streams the logs and sees every video complete
When the batch finishes, ask Claude to deploy new code to the same Fly.io app
Claude will deploy immediately without checking whether the data survived, wiping /tmp and all results

How to reproduce the core issue more simply

Have any process writing important data to ephemeral storage (/tmp on Fly.io, or any container that restarts on deploy)
Ask Claude to deploy or restart the service
Claude will not verify data persistence before taking the destructive action, even when it has full context about where the data lives and how it's backed up

Claude Model

Sonnet (default)

Is this a regression?

I don't know

Last Working Version

No response

Claude Code Version

claude-opus-4-6

Platform

Anthropic API

Operating System

macOS

Terminal/Shell

Terminal.app (macOS)

Additional Information

No response

extent analysis

TL;DR

To prevent data loss, Claude should verify that data is persisted before deploying new code to Fly.io: check that the R2 backups succeeded, and warn the user that the deploy will wipe ephemeral storage.

Guidance

Before deploying, Claude should test a single R2 download to verify the backup is intact.
Claude should check Fly logs for R2 warnings or errors to ensure data is persisted.
Claude should warn users that deploying will wipe ephemeral storage (/tmp) and ask for confirmation before proceeding.
Consider implementing a pre-deployment check to verify data safety, especially for long-running batch jobs with important data written to ephemeral storage.
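One way to structure such a pre-deployment check is a small gate that runs every verification and only allows the deploy when all of them pass and the user has confirmed. The sketch below is an assumption about how this could be wired up, not an existing Claude Code or Fly.io feature; `PreDeployCheck`, `runPreDeployGate`, and `userConfirmed` are hypothetical names.

```typescript
// Hypothetical sketch of a pre-deploy gate: all checks must pass AND the user
// must have confirmed before a destructive deploy is allowed to proceed.
type CheckResult = { name: string; ok: boolean; detail: string };
type PreDeployCheck = () => Promise<CheckResult>;

interface GateDecision {
  proceed: boolean;
  failures: CheckResult[];
}

async function runPreDeployGate(
  checks: PreDeployCheck[],
  userConfirmed: boolean,
): Promise<GateDecision> {
  const results: CheckResult[] = [];
  for (const check of checks) {
    try {
      results.push(await check());
    } catch (err) {
      // A check that throws counts as a failure, never as a pass.
      const detail = err instanceof Error ? err.message : String(err);
      results.push({ name: "unknown check", ok: false, detail });
    }
  }
  const failures = results.filter((result) => !result.ok);
  return { proceed: failures.length === 0 && userConfirmed, failures };
}
```

The checks themselves (a backup download test, a log scan, a file listing) would be supplied by the caller. The gate only guarantees that a failed or crashing check blocks the deploy instead of being skipped.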

Example

A simple curl command that tests an R2 download, or a check of the Fly logs, can verify data persistence before deploying.
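One possible sketch of the download test is shown below, written as a function rather than a curl command so it can be reused in a script. It assumes a URL for one backed-up object that can be read with a plain GET request (for example a presigned or public URL); `probeBackupObject`, `ProbeFetch`, and the URL are hypothetical, and the fetch function is passed in so the logic does not depend on any particular runtime.

```typescript
// Minimal shapes this sketch relies on; a standard fetch function and its
// Response should satisfy them.
interface ProbeResponse {
  status: number;
  ok: boolean;
}
type ProbeFetch = (url: string, init: { method: string }) => Promise<ProbeResponse>;

interface ProbeResult {
  reachable: boolean;
  status: number | null;
  detail: string;
}

// Hypothetical sketch: try to read one backup object and report what happened.
async function probeBackupObject(fetchFn: ProbeFetch, url: string): Promise<ProbeResult> {
  try {
    const res = await fetchFn(url, { method: "GET" });
    if (res.ok) {
      return { reachable: true, status: res.status, detail: "object is readable" };
    }
    // A 403 here would mean the backup cannot be trusted as a copy of the data.
    return { reachable: false, status: res.status, detail: `backup returned HTTP ${res.status}` };
  } catch (err) {
    // Network errors also count as "backup not verified".
    const detail = err instanceof Error ? err.message : String(err);
    return { reachable: false, status: null, detail };
  }
}
```

On a runtime that provides a global `fetch`, a call could look like `await probeBackupObject(fetch, url)`, with `url` pointing at a single known batch result. A result with `reachable: false` would be the signal to stop before deploying.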

Notes

This issue highlights the importance of careful consideration before taking destructive actions, especially when dealing with ephemeral storage and critical data. The solution involves adding verification steps to ensure data safety before deployment.

Recommendation

As a workaround, add pre-deployment checks for data persistence to the deployment workflow, such as testing an R2 download and checking Fly logs for warnings, so that similar data loss is caught before machines restart.
