transformers - ✅(Solved) Fix add HyperCLOVA X SEED Vision Instruct 3B [1 pull requests, 11 comments, 4 participants]

GitHub stats
huggingface/transformers#45099 (fetched 2026-04-08 01:48:37)
Comments: 11
Participants: 4
Timeline: 35
Reactions: 1
Timeline (top): commented ×11, subscribed ×11, mentioned ×10, cross-referenced ×2

Root Cause

  • Practical issues caused by not being in transformers: The model currently can only be loaded with trust_remote_code=True, and it has been confirmed that it fails the @strict config validation introduced in transformers v5. Specifically, during vLLM's transformers v5 compatibility work (vllm-project/vllm#38379), it was discovered that HCXVisionConfig fails the strict validation when initialized with text_config=None. vLLM applied a temporary fix by vendoring the config (vllm-project/vllm#38447), but the fundamental resolution order would be: vendoring → fixing configuration_hyperclovax.py on HuggingFace Hub → official upstreaming into transformers. Steps 1 and 2 are currently in progress, and this issue is being opened to address step 3.
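The failure mode described above can be illustrated with a small, self-contained sketch. It does not use transformers and is not the library's actual @strict implementation; `StrictConfig`, `BrokenVisionConfig`, `FixedVisionConfig`, and `accepts` are hypothetical names used only to show why a `None` default trips type validation and how normalizing it before validation avoids the error.

```python
from dataclasses import dataclass, field, fields


@dataclass
class StrictConfig:
    """Toy base class (hypothetical): reject values that do not match the annotation."""

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, got {type(value).__name__}"
                )


@dataclass
class BrokenVisionConfig(StrictConfig):
    # Declared as dict but defaults to None, so validation fails on construction.
    text_config: dict = None


@dataclass
class FixedVisionConfig(StrictConfig):
    text_config: dict = field(default_factory=dict)

    def __post_init__(self):
        # Normalize None to an empty dict before the type check runs.
        if self.text_config is None:
            self.text_config = {}
        super().__post_init__()


def accepts(config_cls, **kwargs):
    """Return True if the config can be constructed with the given arguments."""
    try:
        config_cls(**kwargs)
    except TypeError:
        return False
    return True


print(accepts(BrokenVisionConfig))                   # validation rejects None
print(accepts(FixedVisionConfig, text_config=None))  # None is normalized first
```

The design point is only the ordering: the fallback has to be applied before validation, otherwise the declared type and the default value disagree.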

PR fix notes

PR #44314: add HyperClovaX Vision

Description (problem / solution / changelog)

What does this PR do?

Hello, Transformers team!

I submitted a PR to add naver-hyperclovax/HyperCLOVAX-SEED-Think-32B (hereafter HCX), developed by the Korean IT company Naver while executing the government's national AI model project.

The HCX code was written against Transformers 4.52.4, which leads to the following issues:

  1. Being based on an outdated Transformers version prevents the use of the latest training optimization techniques supported by Transformers 5.0.0 (e.g., sequence parallelism).
  2. The use of some deprecated code or features may cause unexpected bugs in the latest Transformers version.
  3. The modeling code was overly complex, which made debugging and development harder. Additionally, experimental code used during model creation had been left in place.

Moving to Transformers 5.0.0 significantly improved the readability and maintainability of the modeling code. We aim to leverage this to add the HCX model to transformers.

TODO list

  • Add docstrings

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp @yonigozlan @molbap


Changed files

  • docs/source/en/_toctree.yml (modified, +2/-0)
  • docs/source/en/model_doc/hyperclovax_vision_v2.md (added, +313/-0)
  • docs/source/en/model_doc/qwen2_5_vl.md (modified, +21/-12)
  • docs/source/ko/_toctree.yml (modified, +2/-0)
  • docs/source/ko/model_doc/hyperclovax_vision_v2.md (added, +313/-0)
  • src/transformers/conversion_mapping.py (modified, +14/-0)
  • src/transformers/models/__init__.py (modified, +1/-0)
  • src/transformers/models/auto/configuration_auto.py (modified, +13/-1)
  • src/transformers/models/auto/image_processing_auto.py (modified, +1/-0)
  • src/transformers/models/auto/modeling_auto.py (modified, +7/-0)
  • src/transformers/models/auto/processing_auto.py (modified, +1/-0)
  • src/transformers/models/auto/tokenization_auto.py (modified, +4/-1)
  • src/transformers/models/auto/video_processing_auto.py (modified, +1/-0)
  • src/transformers/models/hyperclovax_vision_v2/__init__.py (added, +28/-0)
  • src/transformers/models/hyperclovax_vision_v2/configuration_hyperclovax_vision_v2.py (added, +168/-0)
  • src/transformers/models/hyperclovax_vision_v2/modeling_hyperclovax_vision_v2.py (added, +1002/-0)
  • src/transformers/models/hyperclovax_vision_v2/modular_hyperclovax_vision_v2.py (added, +717/-0)
  • src/transformers/models/hyperclovax_vision_v2/processing_hyperclovax_vision_v2.py (added, +199/-0)
  • src/transformers/models/qwen2_5_vl/configuration_qwen2_5_vl.py (modified, +2/-2)
  • src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py (modified, +7/-1)
  • src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py (modified, +3/-1)
  • tests/models/hyperclovax_vision_v2/__init__.py (added, +0/-0)
  • tests/models/hyperclovax_vision_v2/test_modeling_hyperclovax_vision_v2.py (added, +424/-0)
  • tests/models/hyperclovax_vision_v2/test_processing_hyperclovax_vision_v2.py (added, +241/-0)

Model description

This is a lightweight Vision-Language Model designed to be accessible for researchers, while providing strong support for the Korean language. Its compact size lowers the barrier to entry for VLM research and experimentation, and its native Korean capability — including Korean VQA, chart/diagram understanding, and OCR-free processing — makes it a practical and valuable resource for the broader multilingual VLM research community.

Model Description

HyperCLOVAX-SEED-Vision-Instruct-3B is a Vision-Language Model developed by NAVER, built upon a LLaVA-based architecture. Key characteristics are as follows:

  • Architecture: LLaVA-based Vision-Language Model
    • LLM Module: Transformer-based Dense Model
    • Vision Encoder: SigLIP-based, 378×378px input resolution per grid
    • Vision-Language Connector: C-Abstractor (Conv+Pooling) with AnyRes mechanism, supporting up to 9 grids and 1.29M total pixels
  • Parameter Count: 3.2B (LLM) + 0.43B (Vision)
  • Input/Output: Text + Image + Video / Text
  • Context Length: 16K
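As a quick sanity check on the figures above, the 1.29M pixel budget matches 9 grids at 378×378 each. The sketch below assumes the total is simply the grid count times the per-grid resolution, and that the two listed parameter counts can be summed into a rough total; neither assumption is stated explicitly in the model description.

```python
GRID_SIDE_PX = 378  # input resolution per grid, from the description above
MAX_GRIDS = 9       # AnyRes upper bound, from the description above

pixels_per_grid = GRID_SIDE_PX * GRID_SIDE_PX    # 142,884
max_total_pixels = MAX_GRIDS * pixels_per_grid   # 1,285,956

llm_params_b = 3.2      # billions, LLM module
vision_params_b = 0.43  # billions, vision encoder
total_params_b = llm_params_b + vision_params_b  # rough total, assuming nothing else is counted

print(f"{max_total_pixels:,} pixels (~{max_total_pixels / 1e6:.2f}M)")
print(f"~{total_params_b:.2f}B parameters")
```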

Motivation

  • Practical issues caused by not being in transformers: The model currently can only be loaded with trust_remote_code=True, and it has been confirmed that it fails the @strict config validation introduced in transformers v5. Specifically, during vLLM's transformers v5 compatibility work (vllm-project/vllm#38379), it was discovered that HCXVisionConfig fails the strict validation when initialized with text_config=None. vLLM applied a temporary fix by vendoring the config (vllm-project/vllm#38447), but the fundamental resolution order would be: vendoring → fixing configuration_hyperclovax.py on HuggingFace Hub → official upstreaming into transformers. Steps 1 and 2 are currently in progress, and this issue is being opened to address step 3.

  • Novel architecture requiring new implementation: There is no structurally equivalent model in the current transformers codebase. The closest reference is llava_onevision, but the key differentiator is the use of C-Abstractor (Conv+Pooling based, HoneyBee paper) as the Vision-Language Connector. Therefore, this model addition is based on llava_onevision, but requires a new implementation of the C-Abstractor connector.

Regarding the Existing Related PR

I checked that no existing PR covers this model. However, there is a related PR #44314 which corresponds to HyperCLOVAX Vision V2 in terms of internal model versioning, while the model requested in this issue is the 3B model, corresponding to V1.

From a code management perspective, inheriting V2 from V1 could be a clean option. That said, given that the V2 PR is already open and appears to be close to merging, it may also make sense to merge V2 first and then have V1 inherit from V2. As the repository has been moving toward modular-centered management, the maintainers' perspective matters most here, so I would appreciate any feedback on whether adding V1 is considered necessary. If it is, I am happy to proceed with that work alongside updating the code on HuggingFace Hub, and will follow whichever direction is deemed most appropriate.

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

extent analysis

Fix Plan

To address the compatibility issue with transformers v5, we need to update the HCXVisionConfig to pass the strict validation.

  • Update configuration_hyperclovax.py so that a missing text_config falls back to a default instead of None:
from transformers import PretrainedConfig

class HCXVisionConfig(PretrainedConfig):
    def __init__(self, text_config=None, **kwargs):
        if text_config is None:
            # Placeholder values for illustration, not the model's real defaults
            text_config = {"num_layers": 12, "hidden_size": 768}
        self.text_config = text_config
        super().__init__(**kwargs)
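  • Create a new implementation of the C-Abstractor connector, using the llava_onevision model as the reference (simplified Conv+Pooling sketch; layer sizes are placeholders):
import torch.nn as nn

class CAbstractor(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, pool_size=2):
        super().__init__()
        # Padding keeps the spatial size unchanged before pooling
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, padding=kernel_size // 2)
        self.pooling = nn.MaxPool2d(pool_size)

    def forward(self, x):
        # x: (batch, channels, height, width) feature map from the vision encoder
        x = self.conv(x)
        x = self.pooling(x)
        return x
  • Update the HyperCLOVAX-SEED-Vision-Instruct-3B model to use the new C-Abstractor connector:
from transformers import PreTrainedModel

class HyperCLOVAXModel(PreTrainedModel):
    config_class = HCXVisionConfig

    def __init__(self, config):
        super().__init__(config)
        # connector_in_channels / connector_out_channels are hypothetical config fields
        self.c_abstractor = CAbstractor(config.connector_in_channels, config.connector_out_channels)

    def forward(self, x):
        x = self.c_abstractor(x)
        # rest of the forward pass
        return x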

Verification

To verify that the fix worked, you can run the following tests:

  • Load the HyperCLOVAX-SEED-Vision-Instruct-3B model with trust_remote_code=False and check that it passes the strict validation.
  • Test the model on a sample input and verify that it produces the expected output.
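The first check can be sketched as a small helper. This is an illustration under assumptions, not an official test: `loads_without_remote_code` is a hypothetical name, and the Hub ID in the usage comment is assumed from the model name in this issue. `AutoConfig.from_pretrained` with `trust_remote_code=False` is the only transformers call used, and it is imported lazily so the helper can also be exercised with a stand-in loader.

```python
def loads_without_remote_code(model_id, load_config=None):
    """Try to load a config with trust_remote_code=False.

    Returns (True, "") on success, or (False, "<ErrorType>: <message>") on failure.
    `load_config` defaults to transformers.AutoConfig.from_pretrained.
    """
    if load_config is None:
        # Imported lazily so the helper itself has no hard dependency.
        from transformers import AutoConfig

        load_config = AutoConfig.from_pretrained
    try:
        load_config(model_id, trust_remote_code=False)
    except Exception as exc:  # report any loading or validation error
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


# Usage (requires transformers and network access; Hub ID assumed):
# ok, error = loads_without_remote_code(
#     "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"
# )
# print(ok, error)
```

Returning the error text rather than raising makes it easier to tell a strict-validation failure apart from a missing model class or a network problem.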

Extra Tips

  • Make sure to update the configuration_hyperclovax.py file on the HuggingFace Hub to reflect the changes.
  • Consider merging the V1 and V2 models into a single implementation to avoid code duplication.
  • Keep in mind that the C-Abstractor connector is a novel architecture and may require additional testing and validation to ensure its correctness.
