transformers - ✅ (Solved) CLIPTextModel / CLIPVisionModel fail to load old checkpoints after architecture flattening [1 pull request, 1 comment, 2 participants]

huggingface/transformers#45390 · Fetched 2026-04-15 06:19:48

After the recent refactoring that flattened CLIPTextModel (removed the self.text_model wrapper) and CLIPVisionModel (removed the self.vision_model wrapper), old checkpoints that were saved with the nested structure can no longer be loaded correctly.

All weights end up randomly initialized because the checkpoint keys (e.g. text_model.embeddings.token_embedding.weight) don't match the new model's state dict keys (e.g. embeddings.token_embedding.weight).

Error Message

Checkpoint key example: text_model.embeddings.token_embedding.weight
Expected token_embedding sum: -4.9096
Actual token_embedding sum:   -0.0497    # ← random init, not checkpoint value

AssertionError: Weights were NOT loaded! expected=-4.9096, got=-0.0497

Root Cause

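The old checkpoints store every tensor under the wrapper prefix (e.g. text_model.embeddings.token_embedding.weight), while the flattened model's state dict expects unprefixed keys (e.g. embeddings.token_embedding.weight). No checkpoint key matches a model key, so from_pretrained leaves every parameter at its random initialization.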
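To make the mismatch concrete, here is a minimal sketch that compares two sets of key names and reports which ones have no counterpart. The helper name diff_keys and the two short key lists are illustrative assumptions, not part of transformers; real state dicts contain many more entries.

```python
def diff_keys(checkpoint_keys, model_keys):
    """Return (missing, unexpected) key lists, sorted for stable output."""
    ckpt, model = set(checkpoint_keys), set(model_keys)
    missing = sorted(model - ckpt)      # model parameters with no checkpoint tensor
    unexpected = sorted(ckpt - model)   # checkpoint tensors with no model parameter
    return missing, unexpected


# Illustrative key names in the style quoted in the issue (not a full CLIP state dict).
old_checkpoint_keys = [
    "text_model.embeddings.token_embedding.weight",
    "text_model.embeddings.position_embedding.weight",
]
flattened_model_keys = [
    "embeddings.token_embedding.weight",
    "embeddings.position_embedding.weight",
]

missing, unexpected = diff_keys(old_checkpoint_keys, flattened_model_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

With a real checkpoint, the same comparison could be run on sd.keys() and te.state_dict().keys(); when every model key shows up as missing, nothing was loaded.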

Fix Action

Fixed

PR fix notes

PR #45361: Add CLIP-like models in conversion to VLMs

Description (problem / solution / changelog)

What does this PR do?

Fixes https://github.com/huggingface/trl/issues/5497, also fixes https://github.com/huggingface/transformers/issues/45390

TL;DR: the base model prefix is never appended if it is part of a bigger VLM, which was true for LLaVa. Loading a CLIP checkpoint is not affected though, which is why we missed it before merging.

Checked "load-save-load back" pipeline with several models at random: Llava, InternVL, CLIP, Siglip, T5Gemma2, Gemma3, GotOCR, AltClip, ClipSeg. I hope other models are saved in the hub in a similar way

cc @albertvillanova

Changed files

  • src/transformers/conversion_mapping.py (modified, +15/-0)
  • tests/models/altclip/test_modeling_altclip.py (modified, +9/-5)
  • tests/models/chinese_clip/test_modeling_chinese_clip.py (modified, +52/-0)
  • tests/test_modeling_common.py (modified, +22/-8)

Code Example

import torch
from huggingface_hub import hf_hub_download
from transformers import CLIPTextModel

# Any old-format CLIP checkpoint works; this one ships with diffusers tests
model_path = "hf-internal-testing/tiny-stable-diffusion-torch"

# Download the raw weights file (this also caches it); returns the local file path
ckpt_path = hf_hub_download(model_path, "text_encoder/pytorch_model.bin")

# Show the checkpoint has text_model.* keys
sd = torch.load(ckpt_path, map_location="cpu", weights_only=True)
print("Checkpoint key example:", list(sd.keys())[1])
# → text_model.embeddings.token_embedding.weight

# Reference value taken directly from the checkpoint file
expected_sum = sd["text_model.embeddings.token_embedding.weight"].sum().item()
print(f"Expected token_embedding sum: {expected_sum:.4f}")

# Load via from_pretrained
te = CLIPTextModel.from_pretrained(
    model_path, subfolder="text_encoder"
)
actual_sum = te.state_dict()["embeddings.token_embedding.weight"].sum().item()
print(f"Actual token_embedding sum:   {actual_sum:.4f}")

# The sums only agree if the checkpoint tensor was actually loaded
assert abs(expected_sum - actual_sum) < 1e-5, (
    f"Weights were NOT loaded! expected={expected_sum:.4f}, got={actual_sum:.4f}"
)

Output (failing):

Checkpoint key example: text_model.embeddings.token_embedding.weight
Expected token_embedding sum: -4.9096
Actual token_embedding sum:   -0.0497    # ← random init, not checkpoint value

AssertionError: Weights were NOT loaded! expected=-4.9096, got=-0.0497
Impact

This breaks any downstream code that loads CLIPTextModel or CLIPVisionModel from checkpoints saved with previous transformers versions — including all Stable Diffusion pipelines in diffusers.

Extent analysis

TL;DR

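Old checkpoints store weights under the text_model. / vision_model. prefixes that the flattened models no longer use, so the loader has to map the old keys onto the new ones. PR #45361 addresses this with additions to src/transformers/conversion_mapping.py.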

Guidance

  • The root cause is the mismatch between the old checkpoint keys (e.g., text_model.embeddings.token_embedding.weight) and the new model's state dict keys (e.g., embeddings.token_embedding.weight).
  • To verify the fix, compare the expected and actual sums of the token_embedding.weight tensor after loading the checkpoint.
  • A potential workaround is to manually update the checkpoint keys to match the new model structure before loading the checkpoint.
  • Consider adding a compatibility layer to handle old checkpoints and ensure a smooth transition to the new model structure.
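The verification step in the guidance can be sketched for all tensors at once instead of a single embedding. This is a rough check under stated assumptions: the helper name find_unloaded is hypothetical, the inputs are plain dicts of per-tensor sums, and the two values are the ones printed in the failing output.

```python
import math


def find_unloaded(expected_sums, actual_sums, prefix="text_model.", tol=1e-5):
    """Return flattened names of tensors whose loaded sum differs from the checkpoint sum.

    expected_sums: checkpoint key -> tensor sum (old, prefixed names)
    actual_sums:   model key -> tensor sum (new, flattened names)
    """
    bad = []
    for old_key, expected in expected_sums.items():
        # Translate the old name to the name the flattened model uses
        new_key = old_key[len(prefix):] if old_key.startswith(prefix) else old_key
        actual = actual_sums.get(new_key)
        # Absolute tolerance only, mirroring the assert in the reproducer
        if actual is None or not math.isclose(expected, actual, rel_tol=0.0, abs_tol=tol):
            bad.append(new_key)
    return bad


# Values taken from the failing output reported in the issue
expected = {"text_model.embeddings.token_embedding.weight": -4.9096}
actual = {"embeddings.token_embedding.weight": -0.0497}
print(find_unloaded(expected, actual))
```

With real models, the two dicts could be built as {k: v.sum().item() for k, v in sd.items()} and the same expression over te.state_dict(). A sum is a cheap fingerprint, not a proof of equality, so an empty result is strong evidence rather than a guarantee.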

Example

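# Manually strip the old wrapper prefix from the checkpoint keys.
# Assumes sd and model_path from the reproducer above.
prefix = "text_model."  # use "vision_model." for CLIPVisionModel checkpoints
updated_sd = {
    (key[len(prefix):] if key.startswith(prefix) else key): value
    for key, value in sd.items()
}

# Load with the remapped state dict instead of the weights file's own keys
te = CLIPTextModel.from_pretrained(
    model_path, subfolder="text_encoder", state_dict=updated_sd
)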
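The example above handles only the text_model. prefix. A slightly more general sketch of the "compatibility layer" idea covers both wrappers named in the issue and refuses to overwrite a key silently. The names LEGACY_PREFIXES and strip_legacy_prefix are hypothetical, and the demo values are placeholders standing in for tensors; this is not how transformers implements the fix.

```python
LEGACY_PREFIXES = ("text_model.", "vision_model.")  # wrappers named in the issue


def strip_legacy_prefix(state_dict, prefixes=LEGACY_PREFIXES):
    """Return a new dict with a leading legacy wrapper prefix removed from each key."""
    remapped = {}
    for key, value in state_dict.items():
        new_key = key
        for prefix in prefixes:
            if key.startswith(prefix):
                new_key = key[len(prefix):]
                break
        # Guard against two source keys mapping to the same target key
        if new_key in remapped:
            raise KeyError(f"Remapping {key!r} collides with existing key {new_key!r}")
        remapped[new_key] = value
    return remapped


# Placeholder integers stand in for tensors
demo = {
    "vision_model.embeddings.class_embedding": 1,
    "text_model.embeddings.token_embedding.weight": 2,
    "already.flat.weight": 3,
}
print(sorted(strip_legacy_prefix(demo)))
```

This is only meant for standalone CLIPTextModel / CLIPVisionModel checkpoints; applying it to a checkpoint where those prefixes are part of the expected key names would break loading rather than fix it.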

Notes

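This workaround assumes that the only change is the removal of the self.text_model and self.vision_model wrappers. The example strips only the text_model. prefix; CLIPVisionModel checkpoints need the same treatment for vision_model. If there are other structural changes, additional key updates may be necessary.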

Recommendation

The issue is marked as fixed by PR #45361, so the preferred route is a transformers version that includes that change. If upgrading is not yet possible, manually remap the checkpoint keys as in the example above so that existing checkpoints still load correctly.
