transformers - ✅(Solved) Fix CLIPTextModel / CLIPVisionModel fail to load old checkpoints after architecture flattening [1 pull requests, 1 comments, 2 participants]

transformers · 2026-04-13 04:50:21


GitHub stats

huggingface/transformers#45390 • Fetched 2026-04-15 06:19:48

Author: sayakpaul

Participants: Abineshabee, sayakpaul

Timeline (top): subscribed ×3, closed ×1, commented ×1, cross-referenced ×1

After the recent refactoring that flattened CLIPTextModel (removed the self.text_model wrapper) and CLIPVisionModel (removed the self.vision_model wrapper), old checkpoints that were saved with the nested structure can no longer be loaded correctly.

All weights end up randomly initialized because the checkpoint keys (e.g. text_model.embeddings.token_embedding.weight) don't match the new model's state dict keys (e.g. embeddings.token_embedding.weight).
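The mismatch can be illustrated without loading any model. The sketch below is only an illustration using the two key names quoted above (a real state dict has many more entries), and `overlap` is a hypothetical helper, not part of transformers. It shows that matching by name finds nothing until the old wrapper prefix is dropped.

```python
def overlap(checkpoint_keys, model_keys):
    """Return the key names present in both the checkpoint and the model."""
    return set(checkpoint_keys) & set(model_keys)


# Key names taken from the issue text above.
checkpoint_keys = ["text_model.embeddings.token_embedding.weight"]
model_keys = ["embeddings.token_embedding.weight"]

# No name matches, so a loader that matches by name has nothing to copy.
matched = overlap(checkpoint_keys, model_keys)
print(len(matched))

# Dropping the old wrapper prefix makes the names line up again.
stripped = [key.removeprefix("text_model.") for key in checkpoint_keys]
print(overlap(stripped, model_keys))
```

`str.removeprefix` requires Python 3.9 or newer.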

Error Message

Checkpoint key example: text_model.embeddings.token_embedding.weight
Expected token_embedding sum: -4.9096
Actual token_embedding sum:   -0.0497    # ← random init, not checkpoint value

AssertionError: Weights were NOT loaded! expected=-4.9096, got=-0.0497

Root Cause

Fix Action

Fixed

Fixed by PR: Add CLIP-like models in conversion to VLMs (https://github.com/huggingface/transformers/pull/45361)

PR fix notes

PR #45361: Add CLIP-like models in conversion to VLMs

Repository: huggingface/transformers
Author: zucchini-nlp
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45361

Description (problem / solution / changelog)

What does this PR do?

Fixes https://github.com/huggingface/trl/issues/5497, also fixes https://github.com/huggingface/transformers/issues/45390

TL;DR: the base model prefix is never appended if the model is part of a bigger VLM, which was the case for LLaVa. Loading a CLIP checkpoint is not affected though, which is why we missed it before merging.

Checked the "load-save-load back" pipeline with several models at random: Llava, InternVL, CLIP, Siglip, T5Gemma2, Gemma3, GotOCR, AltClip, ClipSeg. I hope other models are saved on the Hub in a similar way.

cc @albertvillanova

Changed files

src/transformers/conversion_mapping.py (modified, +15/-0)
tests/models/altclip/test_modeling_altclip.py (modified, +9/-5)
tests/models/chinese_clip/test_modeling_chinese_clip.py (modified, +52/-0)
tests/test_modeling_common.py (modified, +22/-8)


Description

Minimal reproducer

import torch
from huggingface_hub import hf_hub_download
from transformers import CLIPTextModel

# Any old-format CLIP checkpoint works; this one ships with diffusers tests
model_path = "hf-internal-testing/tiny-stable-diffusion-torch"

# Download the raw checkpoint file (this also caches it locally);
# hf_hub_download returns the path to the file, not a directory
ckpt_path = hf_hub_download(model_path, "text_encoder/pytorch_model.bin")

# Show the checkpoint has text_model.* keys
sd = torch.load(ckpt_path, map_location="cpu", weights_only=True)
print("Checkpoint key example:", list(sd.keys())[1])
# → text_model.embeddings.token_embedding.weight

expected_sum = sd["text_model.embeddings.token_embedding.weight"].sum().item()
print(f"Expected token_embedding sum: {expected_sum:.4f}")

# Load via from_pretrained
te = CLIPTextModel.from_pretrained(
    model_path, subfolder="text_encoder"
)
actual_sum = te.state_dict()["embeddings.token_embedding.weight"].sum().item()
print(f"Actual token_embedding sum:   {actual_sum:.4f}")

# The sums match only if the checkpoint weights were actually loaded
assert abs(expected_sum - actual_sum) < 1e-5, (
    f"Weights were NOT loaded! expected={expected_sum:.4f}, got={actual_sum:.4f}"
)

Output (failing):

Checkpoint key example: text_model.embeddings.token_embedding.weight
Expected token_embedding sum: -4.9096
Actual token_embedding sum:   -0.0497    # ← random init, not checkpoint value

AssertionError: Weights were NOT loaded! expected=-4.9096, got=-0.0497

Impact

This breaks any downstream code that loads CLIPTextModel or CLIPVisionModel from checkpoints saved with previous transformers versions — including all Stable Diffusion pipelines in diffusers.

extent analysis

TL;DR

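Old checkpoints store their weights under the removed wrapper prefix (text_model. or vision_model.), so their key names no longer match the flattened models. The linked PR #45361 addresses this with a change to src/transformers/conversion_mapping.py; until a release includes that change, the checkpoint keys can be remapped manually before loading.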

Guidance

The root cause is the mismatch between the old checkpoint keys (e.g., text_model.embeddings.token_embedding.weight) and the new model's state dict keys (e.g., embeddings.token_embedding.weight).
To verify the fix, compare the expected and actual sums of the token_embedding.weight tensor after loading the checkpoint.
A potential workaround is to manually update the checkpoint keys to match the new model structure before loading the checkpoint.
Consider adding a compatibility layer to handle old checkpoints and ensure a smooth transition to the new model structure.
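One possible shape for such a compatibility layer is sketched below. It is an assumption-laden illustration rather than the approach taken in transformers: `LEGACY_PREFIXES` and `remap_legacy_keys` are hypothetical names, and plain floats stand in for tensors so the snippet runs without torch. The function leaves already-flat keys untouched, so it can be applied to old and new checkpoints alike.

```python
# Hypothetical list of the wrapper prefixes removed by the flattening refactor.
LEGACY_PREFIXES = ("text_model.", "vision_model.")


def remap_legacy_keys(state_dict, prefixes=LEGACY_PREFIXES):
    """Return a new dict with any legacy wrapper prefix stripped from each key."""
    remapped = {}
    for key, value in state_dict.items():
        new_key = key
        for prefix in prefixes:
            if key.startswith(prefix):
                new_key = key[len(prefix):]
                break
        if new_key in remapped:
            # Two source keys would land on the same name; refuse to guess.
            raise ValueError(f"key collision after remapping: {new_key!r}")
        remapped[new_key] = value
    return remapped


# "already_flat.weight" is a made-up key that shows new-format entries pass through.
old = {
    "text_model.embeddings.token_embedding.weight": 1.0,
    "already_flat.weight": 2.0,
}
new = remap_legacy_keys(old)
print(sorted(new))
```

Raising on a collision is a deliberate choice here: silently overwriting one tensor with another would reintroduce the kind of quiet weight corruption this issue is about.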

Example

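# Continues from the minimal reproducer: reuses `sd` and `model_path`.
# Strip the old "text_model." wrapper prefix so keys match the flattened model.
prefix = "text_model."
updated_sd = {}
for key, value in sd.items():
    if key.startswith(prefix):
        updated_sd[key[len(prefix):]] = value
    else:
        updated_sd[key] = value

# Load with the remapped weights. This relies on from_pretrained using the
# passed state_dict in place of the checkpoint file; verify on your version.
te = CLIPTextModel.from_pretrained(
    model_path, subfolder="text_encoder", state_dict=updated_sd
)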

Notes

This fix assumes that the only change is the removal of the self.text_model and self.vision_model wrappers. If there are other structural changes, additional updates may be necessary.
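A quick way to test that assumption is to compare the remapped key names against the names the model expects. The helper below is a hypothetical sketch (`check_remapped_keys` is not a transformers function), and the key lists are illustrative; in practice the expected names would come from the flattened model's `state_dict().keys()`.

```python
def check_remapped_keys(remapped_keys, expected_keys):
    """Return (missing, unexpected) key lists after a rename pass."""
    remapped, expected = set(remapped_keys), set(expected_keys)
    missing = sorted(expected - remapped)     # model parameters with no checkpoint value
    unexpected = sorted(remapped - expected)  # checkpoint entries the model does not have
    return missing, unexpected


# Illustrative names only; "old_name.weight" is a made-up leftover key.
expected = ["embeddings.token_embedding.weight", "final_layer_norm.weight"]
remapped = ["embeddings.token_embedding.weight", "old_name.weight"]

missing, unexpected = check_remapped_keys(remapped, expected)
print("missing:", missing)
print("unexpected:", unexpected)
```

If both lists come back empty, prefix removal was enough; any entry in either list points to a further structural change that needs its own rename rule.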

Recommendation

The issue is marked as fixed by PR #45361, but the PR is listed above as open and not merged at fetch time. Until a transformers release includes the fix, work around the problem by remapping the checkpoint keys manually or by adding a compatibility layer for old checkpoints, so that existing checkpoints still load correctly.
