transformers: [deepseek_v4] save_pretrained silently downcasts FP32 tensors to BF16 (hc_*, attn_sink, ffn.gate.bias, compressor.ape, indexer.compressor.ape)

save_pretrained silently downcasts 417 FP32 tensors to BF16 when saving a DeepSeek-V4 model loaded with torch_dtype=torch.bfloat16. No warning, no error. The downcast loses numerical precision on plumbing tensors that DeepSeek's release spec keeps at FP32 for a reason.

Affected tensor groups

Per DeepSeek's release at deepseek-ai/DeepSeek-V4-Flash, the following tensor groups are FP32 in the source safetensors:

| Pattern | Count per model | Role |
|---|---|---|
| layers.X.hc_attn_{base,fn,scale} + mtp.0.hc_attn_{base,fn,scale} | 3 × 44 = 132 | Hyper-connection attention plumbing |
| layers.X.hc_ffn_{base,fn,scale} + mtp.0.hc_ffn_{base,fn,scale} | 3 × 44 = 132 | Hyper-connection FFN plumbing |
| model.hc_head_{base,fn,scale} + mtp.0.hc_head_{base,fn,scale} | 6 | Top-level + MTP hc_head |
| layers.X.attn.attn_sink + mtp.0.attn.attn_sink | ~44 | Attention sink tokens |
| layers.X.ffn.gate.bias (was e_score_correction_bias) | 41 | MoE routing bias |
| layers.X.attn.compressor.ape (was position_bias) | 41 | Compressor positional encoding |
| layers.X.attn.indexer.compressor.ape (was position_bias) | 21 | Indexer positional encoding |
| Total | 417 | |

All 417 are saved as BF16 by save_pretrained when the model dtype is BF16.
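To give a sense of what the FP32 to BF16 cast discards, here is a small stdlib-only sketch. It assumes round-to-nearest-even (the usual rounding for this cast) and finite inputs; the helper name round_trip_bf16 is made up for illustration and is not part of transformers or torch.

```python
import struct


def round_trip_bf16(x: float) -> float:
    """Round an FP32 value to BF16 (round-to-nearest-even) and widen it back to FP32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # BF16 keeps only the top 16 bits of the FP32 pattern:
    # 1 sign bit, 8 exponent bits, 7 mantissa bits.
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even 16-bit pattern
    rounded = ((bits + bias) & 0xFFFFFFFF) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


for value in (1.0, 1.001, 1.004, 0.123456789):
    back = round_trip_bf16(value)
    print(f"{value!r} -> {back!r} (abs error {abs(value - back):.2e})")
```

Near 1.0 the BF16 grid spacing is 2^-7 = 0.0078125, so 1.001 collapses to 1.0 and 1.004 moves to 1.0078125. Small biases and scales stored in these tensors are perturbed in the same way.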

Repro (minimal)

from transformers import AutoModelForCausalLM
import torch, safetensors
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V4-Flash", torch_dtype=torch.bfloat16, trust_remote_code=True)
model.save_pretrained("/tmp/dsv4_resaved")
# Compare /tmp/dsv4_resaved/model-*.safetensors dtypes vs source
# Result: 417 keys that were FP32 in source are now BF16 in resaved
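The comparison mentioned in the comments can be done without loading any weights, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header that records each tensor's dtype. The following is a minimal sketch, not the script used for this report. It assumes the resaved keys have the same names as the source keys (the renames noted in the table may require a name mapping), and read_dtypes and find_downcasts are hypothetical helper names.

```python
import json
import struct
from pathlib import Path


def read_dtypes(checkpoint_dir):
    """Map tensor name -> dtype string ("F32", "BF16", ...) across all shards in a directory."""
    dtypes = {}
    for shard in sorted(Path(checkpoint_dir).glob("*.safetensors")):
        with open(shard, "rb") as f:
            (header_len,) = struct.unpack("<Q", f.read(8))
            header = json.loads(f.read(header_len))
        for name, info in header.items():
            if name != "__metadata__":  # optional free-form metadata, not a tensor
                dtypes[name] = info["dtype"]
    return dtypes


def find_downcasts(source_dir, resaved_dir):
    """Return the sorted keys that are F32 in the source but BF16 after resaving."""
    source = read_dtypes(source_dir)
    resaved = read_dtypes(resaved_dir)
    return sorted(k for k, d in source.items() if d == "F32" and resaved.get(k) == "BF16")


# Example call (the source path is a placeholder for a local snapshot of the release):
# downcast = find_downcasts("/path/to/source_snapshot", "/tmp/dsv4_resaved")
# print(len(downcast))
```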

Workaround (working in production)

Postprocess: after save_pretrained returns, walk the saved safetensors shards, read the affected keys from the BF16 source checkpoint, where they are still stored as FP32, and write them back in place (atomically per shard, via a .tmp file + os.replace). Working example: scripts/fixup_artifact.py in canada-quant/dsv4-flash-w4a16-fp8-mtp.
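The steps above can be sketched roughly as follows. This is not the contents of scripts/fixup_artifact.py. It is a dependency-free illustration that edits the safetensors container directly (8-byte header length, JSON header, contiguous data buffer) and reads each shard fully into memory. The function names collect_fp32 and restore_shard are hypothetical, and the sketch assumes source and saved key names match.

```python
import json
import os
import struct


def read_shard(path):
    """Return (header dict, raw data buffer) for one safetensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        data = f.read()
    return header, data


def collect_fp32(source_paths, wanted):
    """Gather (shape, raw bytes) for every wanted key stored as F32 in the source shards."""
    found = {}
    for path in source_paths:
        header, data = read_shard(path)
        for name, info in header.items():
            if name != "__metadata__" and name in wanted and info["dtype"] == "F32":
                begin, end = info["data_offsets"]
                found[name] = (info["shape"], data[begin:end])
    return found


def restore_shard(saved_path, fp32_source):
    """Rewrite one saved shard with FP32 bytes for keys in fp32_source; return how many were written."""
    header, data = read_shard(saved_path)
    names = sorted((n for n in header if n != "__metadata__"),
                   key=lambda n: header[n]["data_offsets"][0])
    if not any(n in fp32_source for n in names):
        return 0  # nothing to do, leave the shard untouched
    new_header = {k: header[k] for k in ("__metadata__",) if k in header}
    chunks, offset, restored = [], 0, 0
    for name in names:
        info = header[name]
        begin, end = info["data_offsets"]
        dtype, raw = info["dtype"], data[begin:end]
        if name in fp32_source:
            shape, source_raw = fp32_source[name]
            if shape != info["shape"]:
                raise ValueError(f"shape mismatch for {name}")
            dtype, raw = "F32", source_raw
            restored += 1
        # Offsets are recomputed because FP32 payloads are twice as large as BF16 ones.
        new_header[name] = {"dtype": dtype, "shape": info["shape"],
                            "data_offsets": [offset, offset + len(raw)]}
        chunks.append(raw)
        offset += len(raw)
    blob = json.dumps(new_header, separators=(",", ":")).encode("utf-8")
    blob += b" " * (-len(blob) % 8)  # pad the header with spaces to an 8-byte boundary
    tmp_path = saved_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)))
        f.write(blob)
        for chunk in chunks:
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, saved_path)  # atomic swap: readers see the old or the new shard
    return restored
```

Writing to a .tmp file and swapping it in with os.replace means an interrupted run leaves the original shard intact. If the checkpoint has a shard index that records total size, it may need updating as well, which this sketch does not do.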

Suggested fix

save_pretrained should preserve each tensor's dtype rather than coercing everything to the model's torch_dtype. The current behavior is right for most weights (you wanted BF16 for the W4A16 main model) but wrong for plumbing tensors that ship as FP32 for numerical-stability reasons. A whitelist of "always preserve source dtype" patterns per architecture would work; for DSv4-Flash the regex is roughly r".*\.(hc_(attn|ffn|head)_(base|fn|scale)|attn_sink|ffn\.gate\.bias|compressor\.ape|indexer\.compressor\.ape)$".
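As a quick check of such a whitelist, the sketch below tests a pattern against key names shaped like the ones in the table. The hc_ alternative is spelled out with its suffixes so that it can reach the end anchor, and compressor\.ape also covers indexer.compressor.ape. The constant and function names are hypothetical, and the last two example keys are invented ordinary weights used only as negative cases.

```python
import re

# Hypothetical per-architecture whitelist of keys that should keep their source dtype.
PRESERVE_SOURCE_DTYPE = re.compile(
    r".*\.(hc_(attn|ffn|head)_(base|fn|scale)|attn_sink|ffn\.gate\.bias|compressor\.ape)$"
)


def keeps_source_dtype(name: str) -> bool:
    """True if the tensor name matches the whitelist."""
    return PRESERVE_SOURCE_DTYPE.match(name) is not None


examples = [
    "model.hc_head_base",
    "layers.3.hc_attn_fn",
    "mtp.0.hc_ffn_scale",
    "layers.3.attn.attn_sink",
    "layers.3.ffn.gate.bias",
    "layers.3.attn.compressor.ape",
    "layers.3.attn.indexer.compressor.ape",
    "layers.3.ffn.gate.weight",  # invented ordinary weight, should not match
    "layers.3.attn.wq.weight",   # invented ordinary weight, should not match
]
for name in examples:
    print(f"{name}: {'preserve' if keeps_source_dtype(name) else 'cast'}")
```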

Alternatively, a per-parameter dtype hint in the saved metadata + opt-in flag on the model config would solve it generically.

Why it matters

Without the postprocess restore, the saved artifact has BF16 plumbing, which causes numerical drift in the gating math and the LM head logits and, in turn, a measurable quality regression. With the restore, our W4A16+FP8+MTP artifact at https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP hits 86.88% MMLU and 93.71% GSM8K (within SE of the target, for which no un-restored baseline could be produced). The sibling NVFP4 artifact at canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP applies the same restore postprocess.

This issue was filed by the canada-quant team during W4A16+FP8+MTP quantization work. See FINDINGS_FOR_SIBLING.md §C13 for the diagnosis trace.
