transformers - ✅(Solved) Fix Add V-JEPA 2.1 inference support [1 pull requests, 1 participants]

huggingface/transformers#45496 (fetched 2026-04-18 05:51:47)
Fix Action

Fix / Workaround

V-JEPA 2.1 introduces several architectural changes over 2.0: corrected RoPE implementation, learnable modality embeddings, hierarchical feature extraction with per-layer norms, separate image patch embedding, RoPE position interpolation, and a predictor context token projection. These require config and modeling extensions beyond what the current VJEPA2Model supports.

PR fix notes

PR #45497: Add V-JEPA 2.1 inference support

Description (problem / solution / changelog)

What does this PR do?

Adds inference support and checkpoint conversion for Meta V-JEPA 2.1 pretrained backbones within the existing vjepa2 model family.

V-JEPA 2.1 was released by Meta on 2026-03-16 with four pretrained encoders at 384 resolution: ViT-B (80M), ViT-L (300M), ViT-g (1B), and ViT-G (2B). This PR extends the existing VJEPA2Model with backward-compatible config fields and modeling changes to support loading these checkpoints.

Changes

Config (configuration_vjepa2.py): 8 new fields with backward-compatible defaults that preserve V-JEPA 2.0 behavior:

  • use_rope_interleave, use_modality_embeddings, interpolate_rope, return_all_tokens, img_temporal_dim_size
  • Predictor-only: teacher_embed_dim, n_output_distillation, hierarchical_layers
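The backward-compatibility idea behind these fields can be sketched with a plain dataclass. The field names come from the list above; the class name, the default values, and the `uses_v21_features` helper are assumptions made for illustration and are not taken from `configuration_vjepa2.py`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VJEPA21ConfigSketch:
    """Hypothetical stand-in for the new fields; not the real VJEPA2Config."""

    # Fields named in the PR. The defaults below are assumptions chosen so
    # that an untouched config keeps V-JEPA 2.0 behavior.
    use_rope_interleave: bool = False
    use_modality_embeddings: bool = False
    interpolate_rope: bool = False
    return_all_tokens: bool = False
    img_temporal_dim_size: Optional[int] = None

    # Predictor-only fields.
    teacher_embed_dim: Optional[int] = None
    n_output_distillation: int = 1
    hierarchical_layers: Optional[Tuple[int, ...]] = None

    def uses_v21_features(self) -> bool:
        """Return True if any field differs from the assumed 2.0 defaults."""
        return self != VJEPA21ConfigSketch()


# A config created without arguments stays on the 2.0 code path.
legacy = VJEPA21ConfigSketch()

# A 2.1-style config opts in explicitly (the values here are illustrative).
v21 = VJEPA21ConfigSketch(
    use_rope_interleave=True,
    use_modality_embeddings=True,
    n_output_distillation=4,
)
```

Keeping every new field optional means existing 2.0 configs deserialize unchanged, which is what the config defaults test in the PR is described as verifying.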

Modeling (modeling_vjepa2.py):

  • Corrected RoPE implementation (repeat_interleave vs repeat) with config toggle
  • Learnable modality embeddings (image/video) for both encoder and predictor, with top-level modality routing
  • Separate image patch embedding (tubelet_size=1) when img_temporal_dim_size is set
  • Hierarchical feature extraction with per-layer norms_block in encoder
  • Smart encoder output: concatenated features when predictor needs them (n_output_distillation > 1), single-norm output for get_vision_features() / skip_predictor=True
  • RoPE position interpolation for flexible resolution (encoder only, matches Meta's behavior)
  • Multi-layer predictor embed (Linear+GELU+Linear) for n_output_distillation > 1
  • Predictor context token projection (proj_context) for return_all_tokens
  • Teacher embedding dimension-based output projection sizing
  • VJEPA2ForVideoClassification guard against unsupported hierarchical configs
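To make the first bullet concrete, the following sketch shows how the two ways of expanding per-pair rotation angles differ. It uses plain Python lists rather than tensors, and the helper names are hypothetical; `tile` mimics the element order of `torch.Tensor.repeat` on the last dimension, and `interleave` mimics `torch.repeat_interleave`. How the actual model wires this to `use_rope_interleave` is an assumption here.

```python
from typing import List, Sequence


def tile(values: Sequence[float], times: int) -> List[float]:
    """Repeat the whole sequence, like Tensor.repeat: [a, b] -> [a, b, a, b]."""
    return list(values) * times


def interleave(values: Sequence[float], times: int) -> List[float]:
    """Repeat each element in place, like repeat_interleave: [a, b] -> [a, a, b, b]."""
    return [v for v in values for _ in range(times)]


def expand_angles(angles: Sequence[float], use_rope_interleave: bool) -> List[float]:
    """Expand one angle per channel pair to one angle per channel.

    With adjacent channel pairs (0, 1), (2, 3), ... both channels of a pair
    need the same angle, which only the interleaved layout provides.
    """
    if use_rope_interleave:
        return interleave(angles, 2)
    return tile(angles, 2)


angles = [0.1, 0.2]
legacy_layout = expand_angles(angles, use_rope_interleave=False)
fixed_layout = expand_angles(angles, use_rope_interleave=True)
```

In the tiled layout, channels 0 and 1 receive different angles, so a pairwise rotation over adjacent channels no longer rotates each pair by a single angle. The config toggle keeps that 2.0 behavior available so existing checkpoints still produce the outputs they were trained with.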

Converter (convert_vjepa2_to_hf.py):

  • Four new 2.1 model variants with correct architecture parameters
  • Key remappings for all new layers (modality embeddings, norms_block, patch_embed_img, proj_context)
  • Checkpoint key handling (ema_encoder for distilled models, target_encoder for self-supervised)
  • Updated test function to handle 2.1 tuple predictor returns
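The checkpoint key handling can be sketched roughly as follows. The top-level keys `ema_encoder` and `target_encoder` come from the bullet above; the function names and the `module.` / `backbone.` prefixes are assumptions for illustration and may not match what `convert_vjepa2_to_hf.py` actually does.

```python
from typing import Any, Dict, Mapping, Tuple


def select_encoder_state(checkpoint: Mapping[str, Any], distilled: bool) -> Mapping[str, Any]:
    """Pick the encoder weights: ema_encoder if distilled, else target_encoder."""
    key = "ema_encoder" if distilled else "target_encoder"
    if key not in checkpoint:
        raise KeyError(f"expected '{key}' in checkpoint, found {sorted(checkpoint)}")
    return checkpoint[key]


def strip_prefixes(name: str, prefixes: Tuple[str, ...] = ("module.", "backbone.")) -> str:
    """Remove wrapper prefixes (hypothetical names) from a parameter key."""
    stripped = True
    while stripped:
        stripped = False
        for prefix in prefixes:
            if name.startswith(prefix):
                name = name[len(prefix):]
                stripped = True
    return name


def clean_state_dict(state: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of the state dict with wrapper prefixes removed."""
    return {strip_prefixes(k): v for k, v in state.items()}


# Toy checkpoint with placeholder values instead of tensors.
toy_checkpoint = {"ema_encoder": {"module.backbone.patch_embed_img.proj.weight": 0}}
cleaned = clean_state_dict(select_encoder_state(toy_checkpoint, distilled=True))
```

Raising on a missing key, rather than silently falling back to the other encoder, keeps a mislabeled variant from converting with the wrong weights.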

Tests (test_modeling_vjepa2.py):

  • Config defaults test (verifies 2.0 backward compatibility)
  • Fast forward pass with shape assertions for n_output_distillation=1 (distilled path)
  • Fast forward pass with shape assertions for n_output_distillation=4 (multi-layer path)
  • All 90 existing tests pass unchanged

Docs (vjepa2.md): V-JEPA 2.1 section with checkpoint table and architecture notes.

Verification

Verified end-to-end against Meta's reference implementation on two checkpoints covering both architecture paths:

ViT-B/384 (80M, n_output_distillation=1, distilled):

  • All weight keys match (strict load)
  • Encoder: max diff 0.0001
  • Predictor: max diff 0.008

ViT-g/384 (1B, n_output_distillation=4, self-supervised):

  • All weight keys match (strict load)
  • Encoder: max diff 0.004
  • Predictor target: max diff 0.0002
  • Predictor context: max diff 0.002

All diffs are within SDPA floating-point precision.
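A parity check of this kind boils down to a maximum absolute difference between two sets of outputs. The sketch below works on flat Python sequences for simplicity; the function names and the tolerance passed to `within_tolerance` are illustrative assumptions, not the thresholds used in the PR.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def within_tolerance(reference: Sequence[float], candidate: Sequence[float], atol: float) -> bool:
    """True if every element differs from the reference by at most atol."""
    return max_abs_diff(reference, candidate) <= atol


# Toy values standing in for flattened encoder outputs.
reference_out = [1.0, 2.0, 3.0]
converted_out = [1.0, 2.5, 3.0]
diff = max_abs_diff(reference_out, converted_out)
```

With real models, the same comparison would be run on the flattened output tensors of the reference and converted models for identical inputs.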

2.1 checkpoint summary

Model | Params | Distilled | n_output_distillation | teacher_embed_dim
ViT-B/16, 384 | 80M | Yes (ViT-G) | 1 | 1664
ViT-L/16, 384 | 300M | Yes (ViT-G) | 1 | 1664
ViT-g/16, 384 | 1B | No | 4 | -
ViT-G/16, 384 | 2B | No | 4 | -

Fixes https://github.com/huggingface/transformers/issues/45496

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

AI assistance (Claude Code) was used for implementation. I reviewed and verified all changes, ran tests, and validated checkpoint conversion end-to-end on both architecture paths (ViT-B and ViT-g).

Before submitting

Who can review?

  • @yonigozlan: vision models reviewer and original V-JEPA 2 contributor
  • @molbap: vision models reviewer

Changed files

  • docs/source/en/model_doc/vjepa2.md (modified, +16/-1)
  • src/transformers/models/vjepa2/configuration_vjepa2.py (modified, +24/-0)
  • src/transformers/models/vjepa2/convert_vjepa2_to_hf.py (modified, +143/-7)
  • src/transformers/models/vjepa2/modeling_vjepa2.py (modified, +146/-39)
  • tests/models/vjepa2/test_modeling_vjepa2.py (modified, +107/-0)

Feature request

Meta released V-JEPA 2.1 on 2026-03-16 with four pretrained video encoders at 384 resolution (ViT-B 80M, ViT-L 300M, ViT-g 1B, ViT-G 2B). The existing vjepa2 model family in transformers supports V-JEPA 2.0 but not 2.1.

V-JEPA 2.1 introduces several architectural changes over 2.0: corrected RoPE implementation, learnable modality embeddings, hierarchical feature extraction with per-layer norms, separate image patch embedding, RoPE position interpolation, and a predictor context token projection. These require config and modeling extensions beyond what the current VJEPA2Model supports.

Paper: https://huggingface.co/papers/2603.14482
Code: https://github.com/facebookresearch/vjepa2 (see app/vjepa_2_1/)

Motivation

V-JEPA 2.1 checkpoints are currently only loadable through Meta's torch.hub interface. Adding transformers support would let users load these models via from_pretrained with standard HF APIs, consistent with the existing V-JEPA 2.0 integration.

There is also an open request from the HF team to Meta to upload the 2.1 weights to the Hub: facebookresearch/vjepa2#137.

Your contribution

I have a working implementation on a branch that extends the existing vjepa2 model family with backward-compatible config fields and modeling changes. Verified end-to-end against Meta's reference (ViT-B/384 checkpoint, encoder max diff 0.0001, predictor max diff 0.008). All existing tests pass. I will open a PR shortly.

extent analysis

TL;DR

Update the VJEPA2Model to support V-JEPA 2.1 by incorporating the necessary config and modeling extensions.

Guidance

  • Extend the existing vjepa2 model family to include backward-compatible config fields for the new architectural changes in V-JEPA 2.1.
  • Implement the required modeling changes, such as corrected RoPE implementation, learnable modality embeddings, and hierarchical feature extraction.
  • Verify the updated model against Meta's reference implementation to ensure consistency and accuracy.
  • Load the V-JEPA 2.1 checkpoints through the updated from_pretrained API to ensure seamless integration with the existing transformers support.

Notes

The issue itself does not include implementation details. The author reports a working implementation on a branch, which is linked from the issue as PR #45497.

Recommendation

Until V-JEPA 2.1 support ships in an official release, apply the changes from PR #45497, which extend the existing VJEPA2Model so that the new checkpoints can be loaded through the standard HF APIs.
