transformers - ✅(Solved) Fix add HyperCLOVA X SEED Vision Instruct 3B [1 pull requests, 11 comments, 4 participants]

transformers 2026-03-29 16:48:01


GitHub stats

huggingface/transformers#45099 • Fetched 2026-04-08 01:48:37


Timeline: commented ×11, subscribed ×11, mentioned ×10, cross-referenced ×2

Root Cause

Because HyperCLOVAX-SEED-Vision-Instruct-3B is not natively supported in transformers, it can only be loaded with trust_remote_code=True, and its remote HCXVisionConfig fails the @strict config validation introduced in transformers v5 when initialized with text_config=None. The failure was found during vLLM's transformers v5 compatibility work (vllm-project/vllm#38379); vLLM worked around it by vendoring the config (vllm-project/vllm#38447), but the lasting fix is to correct configuration_hyperclovax.py on the Hugging Face Hub and upstream the model into transformers.
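The issue does not say which specific check inside @strict rejects the config. As a hedged illustration only, the toy below mimics a strict type check on a sub-config field using plain dataclasses; it is not huggingface_hub's @strict decorator, and all class names are hypothetical. Storing None where a config object is expected is rejected, while substituting a default sub-config first passes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTextConfig:
    hidden_size: int = 768


@dataclass
class ToyStrictVisionConfig:
    """Toy stand-in for strict validation: text_config must be a real config object."""

    text_config: Optional[ToyTextConfig] = None

    def __post_init__(self):
        # Mimics a strict type check: None is not an acceptable sub-config
        if not isinstance(self.text_config, ToyTextConfig):
            raise TypeError(
                f"text_config must be ToyTextConfig, got {type(self.text_config).__name__}"
            )


@dataclass
class ToyFixedVisionConfig(ToyStrictVisionConfig):
    """Fix pattern: replace a missing sub-config with a default before validating."""

    def __post_init__(self):
        if self.text_config is None:
            self.text_config = ToyTextConfig()
        super().__post_init__()
```

Here ToyStrictVisionConfig() raises TypeError ("got NoneType"), while ToyFixedVisionConfig().text_config is ToyTextConfig(hidden_size=768). The design choice mirrors the usual composite-config pattern: resolve defaults before validation so that no field is ever left as None.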

PR fix notes

PR #44314: add HyperClovaX Vision

Repository: huggingface/transformers
Author: jp1924
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/44314

Description (problem / solution / changelog)

What does this PR do?

Hello, Transformers team!

This PR adds naver-hyperclovax/HyperCLOVAX-SEED-Think-32B (hereafter HCX), developed by the Korean IT company NAVER as part of the Korean government's national AI model project.

The HCX code was written against transformers 4.52.4, which leads to the following issues:

Because it targets an outdated transformers version, it cannot use the latest training optimizations supported by transformers 5.0.0 (e.g., sequence parallelism).
Some deprecated code or features it relies on may cause unexpected bugs in the latest transformers version.
The modeling code is overly complex, which makes debugging and development harder, and experimental code from the model's development was left in place.

Porting the code to transformers 5.0.0 significantly improved the readability and maintainability of the modeling code, and we aim to build on this to add HCX to transformers.

TODO list

Add docstrings

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@zucchini-nlp @yonigozlan @molbap

Changed files

docs/source/en/_toctree.yml (modified, +2/-0)
docs/source/en/model_doc/hyperclovax_vision_v2.md (added, +313/-0)
docs/source/en/model_doc/qwen2_5_vl.md (modified, +21/-12)
docs/source/ko/_toctree.yml (modified, +2/-0)
docs/source/ko/model_doc/hyperclovax_vision_v2.md (added, +313/-0)
src/transformers/conversion_mapping.py (modified, +14/-0)
src/transformers/models/__init__.py (modified, +1/-0)
src/transformers/models/auto/configuration_auto.py (modified, +13/-1)
src/transformers/models/auto/image_processing_auto.py (modified, +1/-0)
src/transformers/models/auto/modeling_auto.py (modified, +7/-0)
src/transformers/models/auto/processing_auto.py (modified, +1/-0)
src/transformers/models/auto/tokenization_auto.py (modified, +4/-1)
src/transformers/models/auto/video_processing_auto.py (modified, +1/-0)
src/transformers/models/hyperclovax_vision_v2/__init__.py (added, +28/-0)
src/transformers/models/hyperclovax_vision_v2/configuration_hyperclovax_vision_v2.py (added, +168/-0)
src/transformers/models/hyperclovax_vision_v2/modeling_hyperclovax_vision_v2.py (added, +1002/-0)
src/transformers/models/hyperclovax_vision_v2/modular_hyperclovax_vision_v2.py (added, +717/-0)
src/transformers/models/hyperclovax_vision_v2/processing_hyperclovax_vision_v2.py (added, +199/-0)
src/transformers/models/qwen2_5_vl/configuration_qwen2_5_vl.py (modified, +2/-2)
src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py (modified, +7/-1)
src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py (modified, +3/-1)
tests/models/hyperclovax_vision_v2/__init__.py (added, +0/-0)
tests/models/hyperclovax_vision_v2/test_modeling_hyperclovax_vision_v2.py (added, +424/-0)
tests/models/hyperclovax_vision_v2/test_processing_hyperclovax_vision_v2.py (added, +241/-0)


Model description

This is a lightweight Vision-Language Model designed to be accessible for researchers, while providing strong support for the Korean language. Its compact size lowers the barrier to entry for VLM research and experimentation, and its native Korean capability — including Korean VQA, chart/diagram understanding, and OCR-free processing — makes it a practical and valuable resource for the broader multilingual VLM research community.

Model Description

HyperCLOVAX-SEED-Vision-Instruct-3B is a Vision-Language Model developed by NAVER, built upon a LLaVA-based architecture. Key characteristics are as follows:

Architecture: LLaVA-based Vision-Language Model
- LLM Module: Transformer-based Dense Model
- Vision Encoder: SigLIP-based, 378×378px input resolution per grid
- Vision-Language Connector: C-Abstractor (Conv+Pooling) with AnyRes mechanism, supporting up to 9 grids and 1.29M total pixels
Parameter Count: 3.2B (LLM) + 0.43B (Vision)
Input/Output: Text + Image + Video / Text
Context Length: 16K
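As a consistency check on the numbers above: nine square 378×378 grids cover 9 × 142,884 = 1,285,956 pixels, which matches the stated 1.29M-pixel cap. The sketch below computes that budget and also enumerates the (rows, cols) tile layouts that fit within nine grids. The layout helper is hypothetical: the issue does not say which layouts the AnyRes mechanism actually considers, and the names here are illustrative.

```python
# Assumption: each AnyRes tile is a square 378x378 crop, as stated in the spec list.
GRID_SIDE_PX = 378
MAX_GRIDS = 9


def max_pixel_budget(grid_side: int = GRID_SIDE_PX, max_grids: int = MAX_GRIDS) -> int:
    """Total pixels covered when every allowed grid tile is used."""
    return max_grids * grid_side * grid_side


def candidate_layouts(max_grids: int = MAX_GRIDS) -> list[tuple[int, int]]:
    """Hypothetical helper: every (rows, cols) arrangement whose tile count fits the limit.

    LLaVA-style AnyRes implementations typically pick among layouts like these;
    the exact candidate list used by HCX is not given in the issue.
    """
    return [
        (rows, cols)
        for rows in range(1, max_grids + 1)
        for cols in range(1, max_grids + 1)
        if rows * cols <= max_grids
    ]


if __name__ == "__main__":
    print(max_pixel_budget())        # 1285956, i.e. about 1.29M pixels
    print(len(candidate_layouts()))  # 23
```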

Motivation

Practical issues caused by not being in transformers: The model can currently only be loaded with trust_remote_code=True, and it has been confirmed to fail the @strict config validation introduced in transformers v5. Specifically, during vLLM's transformers v5 compatibility work (vllm-project/vllm#38379), it was discovered that HCXVisionConfig fails strict validation when initialized with text_config=None. vLLM applied a temporary fix by vendoring the config (vllm-project/vllm#38447), but the full resolution should proceed in three steps: (1) vendoring, (2) fixing configuration_hyperclovax.py on the Hugging Face Hub, and (3) official upstreaming into transformers. Steps 1 and 2 are currently in progress, and this issue is opened to address step 3.
Novel architecture requiring new implementation: There is no structurally equivalent model in the current transformers codebase. The closest reference is llava_onevision, but the key differentiator is the use of C-Abstractor (Conv+Pooling based, HoneyBee paper) as the Vision-Language Connector. Therefore, this model addition is based on llava_onevision, but requires a new implementation of the C-Abstractor connector.

Regarding the Existing Related PR

I checked that no existing PR covers this model. There is, however, a related PR, #44314, which corresponds to HyperCLOVAX Vision V2 in internal model versioning; the model requested in this issue is the 3B model, which corresponds to V1.

From a code-management perspective, having V2 inherit from V1 could be the cleanest option. However, since the V2 PR is already open and appears close to merging, it may also make sense to merge V2 first and then have V1 inherit from V2. As the repository has been moving toward modular-based management, the maintainers' perspective matters most here, so I would appreciate feedback on whether adding V1 is considered necessary. If it is, I am happy to do that work alongside updating the code on the Hugging Face Hub, and will follow whichever direction is deemed most appropriate.

Open source status

The model implementation is available
The model weights are available

Provide useful links for the implementation

Hugging Face Hub: naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B
vLLM upstream: vllm-project/vllm#20931 (merged 2025-07-25)

Extended analysis

Fix Plan

To address the compatibility issue with transformers v5, we need to update the HCXVisionConfig to pass the strict validation.

Update configuration_hyperclovax.py to include a default text_config:

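from transformers import PretrainedConfig
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

class HCXVisionConfig(PretrainedConfig):
    # Sketch only: "hcx_vision" is a placeholder model_type, not necessarily the Hub value
    model_type = "hcx_vision"

    def __init__(self, text_config=None, vision_config=None, **kwargs):
        # Never store a sub-config as None; build a default config object instead.
        # Assumed defaults: a Llama-style decoder and a SigLIP vision tower.
        if text_config is None:
            text_config = CONFIG_MAPPING["llama"]()
        elif isinstance(text_config, dict):
            text_config = CONFIG_MAPPING[text_config.get("model_type", "llama")](**text_config)
        if vision_config is None:
            vision_config = CONFIG_MAPPING["siglip_vision_model"]()
        elif isinstance(vision_config, dict):
            vision_config = CONFIG_MAPPING[vision_config.get("model_type", "siglip_vision_model")](**vision_config)
        self.text_config = text_config
        self.vision_config = vision_config
        super().__init__(**kwargs)

Create a new implementation of the C-Abstractor connector (the rest of the model can follow llava_onevision). A simplified convolution + pooling sketch:

from torch import nn

class CAbstractor(nn.Module):
    # Simplified sketch: Honeybee's C-Abstractor stacks residual convolution blocks
    # around the pooling step; a single convolution stands in for them here.
    def __init__(self, vision_hidden_size, text_hidden_size, output_side=9):
        super().__init__()
        self.conv = nn.Conv2d(vision_hidden_size, vision_hidden_size, kernel_size=3, padding=1)
        self.pooling = nn.AdaptiveAvgPool2d(output_side)  # output_side is a hypothetical default
        self.proj = nn.Linear(vision_hidden_size, text_hidden_size)

    def forward(self, x):
        # x: (batch, num_patches, vision_hidden_size) from a square patch grid
        batch, num_patches, channels = x.shape
        side = int(num_patches ** 0.5)
        x = x.transpose(1, 2).reshape(batch, channels, side, side)
        x = self.pooling(self.conv(x))    # (batch, channels, output_side, output_side)
        x = x.flatten(2).transpose(1, 2)  # (batch, output_side**2, channels)
        return self.proj(x)               # (batch, output_side**2, text_hidden_size)

Update the HyperCLOVAX-SEED-Vision-Instruct-3B model to use the new C-Abstractor connector:

from transformers import AutoModel, PreTrainedModel

class HyperCLOVAXModel(PreTrainedModel):
    config_class = HCXVisionConfig

    def __init__(self, config):
        super().__init__(config)
        self.vision_model = AutoModel.from_config(config.vision_config)
        self.c_abstractor = CAbstractor(
            config.vision_config.hidden_size, config.text_config.hidden_size
        )
        self.language_model = AutoModel.from_config(config.text_config)
        self.post_init()

    def forward(self, pixel_values):
        # Encode image patches, then project them into the LLM embedding space
        patch_features = self.vision_model(pixel_values=pixel_values).last_hidden_state
        image_embeds = self.c_abstractor(patch_features)
        # Rest of the forward pass (omitted): merge image_embeds into the text
        # embeddings and run self.language_model
        return image_embeds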

Verification

To verify that the fix worked, you can run the following tests:

Load the HyperCLOVAX-SEED-Vision-Instruct-3B model with trust_remote_code=False and check that it passes the strict validation.
Run the model on a sample input and verify that its output matches the existing trust_remote_code=True implementation.
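The failing case in this issue is building the config without an explicit text_config, so a cheap pre-check before a full loading test is to construct the config class with no arguments and confirm that no sub-config was left as None. A minimal stdlib sketch; the helper names and the assumption that sub-configs live in attributes named text_config and vision_config are illustrative, not part of the transformers API.

```python
from typing import Any, Callable, Iterable

SUBCONFIG_NAMES = ("text_config", "vision_config")  # assumed attribute names


def missing_subconfigs(config: Any, names: Iterable[str] = SUBCONFIG_NAMES) -> list[str]:
    """Return the sub-config attribute names that are absent or None on `config`."""
    return [name for name in names if getattr(config, name, None) is None]


def check_default_construction(
    config_factory: Callable[[], Any], names: Iterable[str] = SUBCONFIG_NAMES
) -> list[str]:
    """Build a config with no arguments (the case that failed strict validation)
    and report any sub-configs left as None; an empty list means the check passed."""
    return missing_subconfigs(config_factory(), names)
```

Once a native config class exists, check_default_construction(HCXVisionConfig) would be expected to return [], and the same missing_subconfigs check can be applied to a config loaded from the Hub with trust_remote_code=False.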

Extra Tips

Make sure to update the configuration_hyperclovax.py file on the Hugging Face Hub to reflect the changes.
Consider sharing code between the V1 and V2 implementations (e.g., through modular inheritance) to avoid duplication.
Keep in mind that the C-Abstractor connector is a novel architecture and may require additional testing and validation to ensure its correctness.
