Fix Action

Fix / Workaround

V-JEPA 2.1 introduces several architectural changes over 2.0: corrected RoPE implementation, learnable modality embeddings, hierarchical feature extraction with per-layer norms, separate image patch embedding, RoPE position interpolation, and a predictor context token projection. These require config and modeling extensions beyond what the current VJEPA2Model supports.

PR fix notes

PR #45497: Add V-JEPA 2.1 inference support

davevanveen · 2026-04-17T20:59:54Z


Repository: huggingface/transformers
Author: davevanveen
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45497

Description (problem / solution / changelog)

What does this PR do?

Adds inference support and checkpoint conversion for Meta V-JEPA 2.1 pretrained backbones within the existing vjepa2 model family.

V-JEPA 2.1 was released by Meta on 2026-03-16 with four pretrained encoders at 384 resolution: ViT-B (80M), ViT-L (300M), ViT-g (1B), and ViT-G (2B). This PR extends the existing VJEPA2Model with backward-compatible config fields and modeling changes to support loading these checkpoints.

Changes

Config (configuration_vjepa2.py): 8 new fields with backward-compatible defaults that preserve V-JEPA 2.0 behavior:

use_rope_interleave, use_modality_embeddings, interpolate_rope, return_all_tokens, img_temporal_dim_size
Predictor-only: teacher_embed_dim, n_output_distillation, hierarchical_layers
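To make the shape of these fields concrete, here is a minimal sketch rather than the PR's actual code. The field names are the eight listed above. The class name `VJEPA21FieldsSketch`, the field types, the default values, and the helper `uses_2_1_features` are all assumptions made for illustration. The two example overrides use the `n_output_distillation` and `teacher_embed_dim` values from the checkpoint summary table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VJEPA21FieldsSketch:
    """Hypothetical stand-in for the new config fields; not the real VJEPA2Config."""

    # Fields named in the PR. Types and defaults below are assumptions,
    # chosen so that a default instance leaves every 2.1 feature switched off.
    use_rope_interleave: bool = False
    use_modality_embeddings: bool = False
    interpolate_rope: bool = False
    return_all_tokens: bool = False
    img_temporal_dim_size: Optional[int] = None
    # Predictor-only fields named in the PR (defaults are assumptions).
    teacher_embed_dim: Optional[int] = None
    n_output_distillation: int = 1
    hierarchical_layers: Optional[List[int]] = None

    def uses_2_1_features(self) -> bool:
        """True if any field departs from the assumed 2.0-compatible defaults."""
        return any(
            [
                self.use_rope_interleave,
                self.use_modality_embeddings,
                self.interpolate_rope,
                self.return_all_tokens,
                self.img_temporal_dim_size is not None,
                self.teacher_embed_dim is not None,
                self.n_output_distillation > 1,
                bool(self.hierarchical_layers),
            ]
        )


# A default instance models an unchanged V-JEPA 2.0 setup.
default_cfg = VJEPA21FieldsSketch()

# Values taken from the checkpoint summary table in this PR.
vit_b = VJEPA21FieldsSketch(teacher_embed_dim=1664, n_output_distillation=1)
vit_g = VJEPA21FieldsSketch(n_output_distillation=4)

print(default_cfg.uses_2_1_features(), vit_b.uses_2_1_features(), vit_g.uses_2_1_features())
```

Keeping every new field optional or off by default is what lets existing 2.0 configs load without modification, which is the property the PR's config defaults test checks.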

Modeling (modeling_vjepa2.py):

Corrected RoPE implementation (repeat_interleave vs repeat) with config toggle
Learnable modality embeddings (image/video) for both encoder and predictor, with top-level modality routing
Separate image patch embedding (tubelet_size=1) when img_temporal_dim_size is set
Hierarchical feature extraction with per-layer norms_block in encoder
Smart encoder output: concatenated features when predictor needs them (n_output_distillation > 1), single-norm output for get_vision_features() / skip_predictor=True
RoPE position interpolation for flexible resolution (encoder only, matches Meta's behavior)
Multi-layer predictor embed (Linear+GELU+Linear) for n_output_distillation > 1
Predictor context token projection (proj_context) for return_all_tokens
Teacher embedding dimension-based output projection sizing
VJEPA2ForVideoClassification guard against unsupported hierarchical configs
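The first item in the list above (`repeat_interleave` vs `repeat`) can be illustrated with a small pure-Python sketch. It is not the PR's code. It assumes the rotation pairs adjacent channels, that is (x0, x1), (x2, x3), and so on, which the text here does not confirm. The helper names are hypothetical. Under that assumption, the interleaved layout gives both channels of a pair the same angle, while the tiled layout gives them different angles.

```python
import math


def repeat_interleave(values, n):
    """[a, b] -> [a, a, b, b]; mirrors torch.repeat_interleave on the last dim."""
    return [v for v in values for _ in range(n)]


def tile(values, n):
    """[a, b] -> [a, b, a, b]; mirrors Tensor.repeat on the last dim."""
    return list(values) * n


def rotate_adjacent_pairs(x, angles):
    """Compute x * cos(angle) + rot(x) * sin(angle), where rot maps (a, b) to (-b, a)."""
    rot = []
    for i in range(0, len(x), 2):
        rot.extend([-x[i + 1], x[i]])
    return [xi * math.cos(t) + ri * math.sin(t) for xi, ri, t in zip(x, rot, angles)]


def pair_norms(x):
    """Euclidean norm of each adjacent channel pair."""
    return [math.hypot(x[i], x[i + 1]) for i in range(0, len(x), 2)]


freqs = [0.3, 1.1]          # one angle per channel pair (arbitrary example values)
x = [1.0, 2.0, 3.0, 4.0]    # four channels, i.e. two pairs

interleaved = repeat_interleave(freqs, 2)  # [0.3, 0.3, 1.1, 1.1]
tiled = tile(freqs, 2)                     # [0.3, 1.1, 0.3, 1.1]

y_interleaved = rotate_adjacent_pairs(x, interleaved)
y_tiled = rotate_adjacent_pairs(x, tiled)

# A true 2D rotation preserves the norm of each pair; only the interleaved layout does.
print(pair_norms(x), pair_norms(y_interleaved), pair_norms(y_tiled))
```

With the interleaved layout each pair undergoes a proper rotation, so its norm is unchanged. With the tiled layout the two channels of a pair see different angles and the result is no longer a rotation. The config toggle `use_rope_interleave` exists so that 2.0 checkpoints keep the behavior they were trained with.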

Converter (convert_vjepa2_to_hf.py):

Four new 2.1 model variants with correct architecture parameters
Key remappings for all new layers (modality embeddings, norms_block, patch_embed_img, proj_context)
Checkpoint key handling (ema_encoder for distilled models, target_encoder for self-supervised)
Updated test function to handle 2.1 tuple predictor returns
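The converter steps above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the contents of `convert_vjepa2_to_hf.py`. The top-level checkpoint keys `ema_encoder` and `target_encoder` and the layer names `patch_embed_img` and `norms_block` come from the PR text. The target key names on the right-hand side of each rule, the function names, and the toy checkpoint are hypothetical.

```python
import re

# Hypothetical rename rules. The source prefixes are layer names mentioned in
# the PR; the replacement names are invented for illustration only.
RENAME_RULES = [
    (re.compile(r"^patch_embed_img\."), "encoder.embeddings.patch_embeddings_img."),
    (re.compile(r"^norms_block\.(\d+)\."), r"encoder.norms_block.\1."),
]


def select_encoder_state(checkpoint, distilled):
    """Pick the encoder weights: `ema_encoder` if distilled, else `target_encoder`."""
    key = "ema_encoder" if distilled else "target_encoder"
    if key not in checkpoint:
        raise KeyError(f"checkpoint has no '{key}' entry; found {sorted(checkpoint)}")
    return checkpoint[key]


def remap_key(name):
    """Apply every rename rule in order; keys that match no rule pass through."""
    for pattern, replacement in RENAME_RULES:
        name = pattern.sub(replacement, name)
    return name


def convert(checkpoint, distilled):
    """Return a new state dict with remapped keys and unchanged values."""
    state = select_encoder_state(checkpoint, distilled)
    return {remap_key(name): tensor for name, tensor in state.items()}


# Toy checkpoint with placeholder values standing in for tensors.
toy_checkpoint = {
    "ema_encoder": {
        "patch_embed_img.proj.weight": 0,
        "norms_block.3.weight": 1,
        "blocks.0.attn.qkv.weight": 2,
    }
}

converted = convert(toy_checkpoint, distilled=True)
print(sorted(converted))
```

Raising on a missing top-level key, instead of silently falling back to another one, makes it obvious when a distilled checkpoint is converted with the self-supervised setting or the reverse.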

Tests (test_modeling_vjepa2.py):

Config defaults test (verifies 2.0 backward compatibility)
Fast forward pass with shape assertions for n_output_distillation=1 (distilled path)
Fast forward pass with shape assertions for n_output_distillation=4 (multi-layer path)
All 90 existing tests pass unchanged

Docs (vjepa2.md): V-JEPA 2.1 section with checkpoint table and architecture notes.

Verification

Verified end-to-end against Meta's reference implementation on two checkpoints covering both architecture paths:

ViT-B/384 (80M, n_output_distillation=1, distilled):

All weight keys match (strict load)
Encoder: max diff 0.0001
Predictor: max diff 0.008

ViT-g/384 (1B, n_output_distillation=4, self-supervised):

All weight keys match (strict load)
Encoder: max diff 0.004
Predictor target: max diff 0.0002
Predictor context: max diff 0.002

All diffs are within SDPA floating-point precision.
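The two checks reported above, matching weight keys and a maximum absolute difference between outputs, can be sketched in plain Python. This is not the PR's verification script. The helper names, the nested-list stand-ins for tensors, and the `TOLERANCE` value are assumptions for illustration.

```python
def key_mismatch(expected_keys, loaded_keys):
    """Return (missing, unexpected) key lists; both empty means a strict load would pass."""
    expected, loaded = set(expected_keys), set(loaded_keys)
    return sorted(expected - loaded), sorted(loaded - expected)


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two equally shaped nested lists."""
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError("shape mismatch between reference and port outputs")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


# Hypothetical tolerance; the PR reports diffs up to 0.008 but states no threshold.
TOLERANCE = 1e-2

# Made-up outputs standing in for reference and ported model activations.
reference_out = [[0.10, 0.20], [0.30, 0.40]]
port_out = [[0.1001, 0.20], [0.30, 0.398]]

missing, unexpected = key_mismatch(
    ["encoder.norm.weight", "encoder.norm.bias"],
    ["encoder.norm.bias", "encoder.norm.weight"],
)
diff = max_abs_diff(reference_out, port_out)
print(missing, unexpected, diff, diff <= TOLERANCE)
```

With real models the same comparison would run on tensors, using the same inputs for both implementations and the same attention backend where possible.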

2.1 checkpoint summary

Model	Params	Distilled	`n_output_distillation`	`teacher_embed_dim`
ViT-B/16, 384	80M	Yes (ViT-G)	1	1664
ViT-L/16, 384	300M	Yes (ViT-G)	1	1664
ViT-g/16, 384	1B	No	4	—
ViT-G/16, 384	2B	No	4	—

Fixes https://github.com/huggingface/transformers/issues/45496

Code Agent Policy

[x] I confirm that this is not a pure code agent PR.

AI assistance (Claude Code) was used for implementation. I reviewed and verified all changes, ran tests, and validated checkpoint conversion end-to-end on both architecture paths (ViT-B and ViT-g).

Before submitting

[ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[x] Did you read the contributor guideline, Pull Request section?
[x] Was this discussed/approved via a GitHub issue or the forum? https://github.com/huggingface/transformers/issues/45496
[x] Did you make sure to update the documentation with your changes?
[x] Did you write any new necessary tests?

Who can review?

@yonigozlan: vision models reviewer and original V-JEPA 2 contributor
@molbap: vision models reviewer

Changed files

docs/source/en/model_doc/vjepa2.md (modified, +16/-1)
src/transformers/models/vjepa2/configuration_vjepa2.py (modified, +24/-0)
src/transformers/models/vjepa2/convert_vjepa2_to_hf.py (modified, +143/-7)
src/transformers/models/vjepa2/modeling_vjepa2.py (modified, +146/-39)
tests/models/vjepa2/test_modeling_vjepa2.py (modified, +107/-0)

Feature request

Meta released V-JEPA 2.1 on 2026-03-16 with four pretrained video encoders at 384 resolution (ViT-B 80M, ViT-L 300M, ViT-g 1B, ViT-G 2B). The existing vjepa2 model family in transformers supports V-JEPA 2.0 but not 2.1.

Paper: https://huggingface.co/papers/2603.14482
Code: https://github.com/facebookresearch/vjepa2 (see app/vjepa_2_1/)

Motivation

V-JEPA 2.1 checkpoints are currently only loadable through Meta's torch.hub interface. Adding transformers support would let users load these models via from_pretrained with standard HF APIs, consistent with the existing V-JEPA 2.0 integration.

There is also an open request from the HF team to Meta to upload the 2.1 weights to the Hub: facebookresearch/vjepa2#137.

Your contribution

I have a working implementation on a branch that extends the existing vjepa2 model family with backward-compatible config fields and modeling changes. Verified end-to-end against Meta's reference (ViT-B/384 checkpoint, encoder max diff 0.0001, predictor max diff 0.008). All existing tests pass. I will open a PR shortly.

extent analysis

TL;DR

Update the VJEPA2Model to support V-JEPA 2.1 by incorporating the necessary config and modeling extensions.

Guidance

Extend the existing vjepa2 model family to include backward-compatible config fields for the new architectural changes in V-JEPA 2.1.
Implement the required modeling changes, such as corrected RoPE implementation, learnable modality embeddings, and hierarchical feature extraction.
Verify the updated model against Meta's reference implementation to ensure consistency and accuracy.
Load the V-JEPA 2.1 checkpoints through the updated from_pretrained API to ensure seamless integration with the existing transformers support.

Notes

The issue itself does not give implementation details. The working branch it mentions has since been opened as PR #45497, which is still open and not merged.

Recommendation

Until PR #45497 is merged, V-JEPA 2.1 checkpoints can be loaded only through Meta's torch.hub interface or from the PR branch, which extends the existing VJEPA2Model so that the new checkpoints load through the standard HF APIs.
