transformers - Add optional learnable context tokens for CLIP text prompts [1 comment, 2 participants]

transformers · 2026-03-24 11:44:40

GitHub stats

huggingface/transformers#44969 • Fetched 2026-04-08 01:21:21

Author

anuj-aj

Participants

anuj-aj

Rocketknight1

Timeline (top)

commented ×1, labeled ×1

Feature request

Add optional support for learnable context tokens in the CLIP text encoder.

Prompt learning approaches replace fixed prompt text (e.g., "a photo of a") with learnable embedding vectors that are optimized during training. These context tokens are typically represented as:

[V1], [V2], ..., [VM] [CLASS]

where each Vi is a learnable embedding vector (nn.Parameter) and M is the number of context tokens.

These tokens are prepended to the input embeddings and improve alignment between image and text representations.
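The [V1], ..., [VM] [CLASS] layout can be sketched in plain PyTorch. This is a minimal illustration, not code from Transformers: the LearnableContext class, its argument names, and the 0.02 initialization scale are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class LearnableContext(nn.Module):
    """Hypothetical module holding M learnable context vectors [V1]..[VM]."""

    def __init__(self, num_context_tokens: int, embed_dim: int):
        super().__init__()
        # One trainable vector per context slot, shape (M, embed_dim).
        # The 0.02 scale is an arbitrary choice for this sketch.
        self.context = nn.Parameter(torch.randn(num_context_tokens, embed_dim) * 0.02)

    def forward(self, class_embeds: torch.Tensor) -> torch.Tensor:
        # class_embeds: (batch, class_len, embed_dim), the embedded [CLASS] tokens
        batch = class_embeds.shape[0]
        # Share the same context vectors across every item in the batch
        ctx = self.context.unsqueeze(0).expand(batch, -1, -1)
        # Result: (batch, M + class_len, embed_dim)
        return torch.cat([ctx, class_embeds], dim=1)


prompt = LearnableContext(num_context_tokens=4, embed_dim=8)
class_embeds = torch.zeros(2, 3, 8)
out = prompt(class_embeds)
```

Because the context is a single nn.Parameter shared across the batch, an optimizer updates the same M vectors for every class prompt, which is the shared-context setup the issue describes.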

Currently, implementing this in Transformers requires manual manipulation via inputs_embeds, along with handling attention masks and sequence constraints.

Proposed feature:

Allow optional learnable context tokens to be prepended to text inputs
Handle attention mask and positional alignment internally
Keep the feature optional and backward compatible
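One reading of "positional alignment" is that any index computed on the original token sequence has to shift once M vectors are prepended. The sketch below assumes right padding and an encoder that pools the hidden state at the EOS position, taken here to be the last attended token; how a given Transformers version actually locates EOS may differ. Both helper names are hypothetical.

```python
import torch


def shifted_eos_index(attention_mask: torch.Tensor, num_context_tokens: int) -> torch.Tensor:
    """Index of the EOS token after M context vectors are prepended.

    Assumes right padding and that EOS is the last attended token of each
    original sequence; both are assumptions of this sketch.
    """
    eos_index = attention_mask.sum(dim=1) - 1  # position in the original sequence
    return eos_index + num_context_tokens      # position in the extended sequence


def extended_position_ids(seq_len: int, num_context_tokens: int) -> torch.Tensor:
    """Position ids 0..M+L-1 covering context and text tokens, shape (1, M + L)."""
    return torch.arange(num_context_tokens + seq_len).unsqueeze(0)


mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])
eos = shifted_eos_index(mask, num_context_tokens=2)
pos = extended_position_ids(seq_len=4, num_context_tokens=2)
```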

Reference: Learning to Prompt for Vision-Language Models

Motivation

When working with CLIP, prompt design significantly affects performance. Fixed prompts often do not capture domain-specific context well.

Prompt learning methods show that replacing fixed prompts with learnable context tokens ([V1], [V2], ..., [VM]) improves performance across multiple datasets.

These learned tokens effectively replace manually designed phrases (e.g., "a photo of a") with optimized embedding vectors that encode domain-specific context.

Currently, implementing this requires:

manual embedding construction via inputs_embeds
manual attention mask updates
careful handling of sequence length constraints

This process is error-prone and not reusable. Providing native support would simplify these workflows and make them more robust.
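The three manual steps listed above can be sketched as a single helper. This is an illustration under stated assumptions: prepend_context is a hypothetical function, and the limit of 77 positions is the CLIP text config default, which should be read from the model config in real code.

```python
import torch

MAX_POSITIONS = 77  # default CLIP text length; read it from the model config in practice


def prepend_context(context, token_embeds, attention_mask, max_positions=MAX_POSITIONS):
    """Prepend (M, D) context vectors to (B, L, D) token embeddings and extend the mask."""
    batch, seq_len, _ = token_embeds.shape
    num_ctx = context.shape[0]
    # Sequence length constraint: context plus text must fit the position table
    if num_ctx + seq_len > max_positions:
        raise ValueError(
            f"{num_ctx} context tokens + {seq_len} text tokens exceed {max_positions} positions"
        )
    # Manual embedding construction: broadcast the context over the batch
    ctx = context.unsqueeze(0).expand(batch, -1, -1)
    embeds = torch.cat([ctx, token_embeds], dim=1)
    # Manual attention mask update: context positions are always attended
    ctx_mask = torch.ones(batch, num_ctx, dtype=attention_mask.dtype, device=attention_mask.device)
    mask = torch.cat([ctx_mask, attention_mask], dim=1)
    return embeds, mask


context = torch.zeros(4, 8)
token_embeds = torch.ones(2, 5, 8)
attention_mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
embeds, mask = prepend_context(context, token_embeds, attention_mask)
```

Raising on overflow rather than silently truncating keeps the EOS token from being dropped without notice; truncation is an alternative design if the caller prefers it.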

Your contribution

I would be happy to contribute a PR for this feature.

I can implement:

optional learnable context token support
attention mask and sequence handling
documentation and usage examples

I will follow the contribution guidelines and ensure backward compatibility.

extent analysis

Fix Plan

To add optional support for learnable context tokens in the CLIP text encoder, follow these steps:

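Extend the forward method of the text encoder to accept an additional argument context_tokens, which defaults to None. The class is called CLIPTextEncoder here for illustration only; in Transformers the text tower is CLIPTextModel, which wraps CLIPTextTransformer. Override forward rather than __call__ so that nn.Module hooks keep working.
If context_tokens is provided, prepend the corresponding embeddings to the input embeddings and extend the attention mask by the same number of positions.
Handle sequence length constraints by checking that the context tokens plus the input sequence do not exceed the model's maximum number of positions (max_position_embeddings in the config, 77 by default for CLIP).

Example code:

import torch
import torch.nn as nn

class CLIPTextEncoder(nn.Module):  # illustrative name, not a Transformers class
    def __init__(self, embed_dim, max_context_tokens, **kwargs):
        super().__init__()
        ...  # existing setup, including self.token_embedding
        self.context_token_embeddings = nn.ParameterList([nn.Parameter(torch.randn(embed_dim)) for _ in range(max_context_tokens)])

    def forward(self, input_ids, attention_mask, context_tokens=None):
        input_embeddings = self.token_embedding(input_ids)  # (batch, seq_len, embed_dim)
        if context_tokens is not None:
            # Stack the selected context vectors and broadcast them over the batch
            context = torch.stack([self.context_token_embeddings[i] for i in context_tokens])
            context = context.unsqueeze(0).expand(input_embeddings.shape[0], -1, -1)
            input_embeddings = torch.cat((context, input_embeddings), dim=1)

            # Extend the attention mask by one attended position per context token
            context_mask = torch.ones((input_ids.shape[0], len(context_tokens)), dtype=attention_mask.dtype, device=attention_mask.device)
            attention_mask = torch.cat((context_mask, attention_mask), dim=1)

        # Rest of the forward pass remains the same
        ...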

Add documentation and usage examples to demonstrate how to use the new feature.

Verification

To verify that the fix worked, test the following scenarios:

Provide a list of context tokens and verify that they are correctly prepended to the input embeddings.
Check that the attention mask is updated correctly when context tokens are provided.
Test the model with and without context tokens to ensure that the output is correct and the model is backward compatible.
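The three scenarios above could be checked along the following lines. ToyTextEncoder is a hypothetical stand-in that only returns embeddings and the mask, so the checks run without downloading a CLIP checkpoint; a real test would call the modified model instead.

```python
import torch
import torch.nn as nn


class ToyTextEncoder(nn.Module):
    """Hypothetical stand-in for a text encoder; returns embeddings and mask only."""

    def __init__(self, vocab_size=100, embed_dim=8, max_context_tokens=4):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.context = nn.Parameter(torch.randn(max_context_tokens, embed_dim))

    def forward(self, input_ids, attention_mask, num_context_tokens=0):
        embeds = self.token_embedding(input_ids)
        if num_context_tokens > 0:
            batch = input_ids.shape[0]
            ctx = self.context[:num_context_tokens].unsqueeze(0).expand(batch, -1, -1)
            embeds = torch.cat([ctx, embeds], dim=1)
            ones = attention_mask.new_ones(batch, num_context_tokens)
            attention_mask = torch.cat([ones, attention_mask], dim=1)
        return embeds, attention_mask


def run_checks() -> bool:
    torch.manual_seed(0)
    enc = ToyTextEncoder()
    ids = torch.randint(0, 100, (2, 5))
    mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])

    base_embeds, base_mask = enc(ids, mask)
    ctx_embeds, ctx_mask = enc(ids, mask, num_context_tokens=3)

    # 1. Context vectors sit at the front; original embeddings follow unchanged
    expected_ctx = enc.context[:3].unsqueeze(0).expand(2, -1, -1)
    assert torch.equal(ctx_embeds[:, :3], expected_ctx)
    assert torch.equal(ctx_embeds[:, 3:], base_embeds)
    # 2. Mask gains three leading ones; the original mask is preserved
    assert ctx_mask.shape == (2, 8)
    assert bool(ctx_mask[:, :3].all())
    assert torch.equal(ctx_mask[:, 3:], mask)
    # 3. Without context tokens the call matches the original path
    assert torch.equal(base_embeds, enc.token_embedding(ids))
    assert torch.equal(base_mask, mask)
    return True


checks_passed = run_checks()
```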

Extra Tips

Make sure to follow the contribution guidelines and ensure that the new feature is properly documented and tested.
Consider adding a max_context_tokens argument to the CLIPTextEncoder class to limit the number of context tokens that can be provided.
