transformers - 💡(How to fix) Fix Add optional learnable context tokens for CLIP text prompts [1 comments, 2 participants]

huggingface/transformers#44969 (fetched 2026-04-08 01:21:21)

Feature request

Add optional support for learnable context tokens in the CLIP text encoder.

Prompt learning approaches replace fixed prompt text (e.g., "a photo of a") with learnable embedding vectors that are optimized during training. These context tokens are typically represented as:

[V1], [V2], ..., [VM] [CLASS] Image

where each Vi is a learnable embedding vector (nn.Parameter) and M is the number of context tokens.

These tokens are prepended to the input embeddings and improve alignment between image and text representations.
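
As a minimal sketch of the structure described above (not the paper's or the library's implementation), the context vectors can be held in a single nn.Parameter and concatenated in front of the token embeddings. The LearnableContext class, the toy embedding table, and the small random initialization are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class LearnableContext(nn.Module):
    """Hypothetical helper: holds M learnable context vectors [V1], ..., [VM]."""

    def __init__(self, num_context_tokens: int, embed_dim: int):
        super().__init__()
        # One trainable row per context token; the small random init is an assumption.
        self.context = nn.Parameter(torch.randn(num_context_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim) embeddings of the class-name tokens
        batch_size = token_embeds.shape[0]
        # Share the same context vectors across every item in the batch.
        context = self.context.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([context, token_embeds], dim=1)


# Toy stand-in for the text encoder's token embedding table (sizes are arbitrary).
token_embedding = nn.Embedding(100, 8)
input_ids = torch.tensor([[1, 5, 2], [1, 7, 2]])
context_module = LearnableContext(num_context_tokens=4, embed_dim=8)
prompt_embeds = context_module(token_embedding(input_ids))
print(prompt_embeds.shape)  # torch.Size([2, 7, 8])
```

Only context_module.context needs to be passed to the optimizer if the rest of the encoder is kept frozen.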

Currently, implementing this in Transformers requires manual manipulation via inputs_embeds, along with handling attention masks and sequence constraints.

Proposed feature:

  • Allow optional learnable context tokens to be prepended to text inputs
  • Handle attention mask and positional alignment internally
  • Keep the feature optional and backward compatible

Reference: Learning to Prompt for Vision-Language Models

Motivation

When working with CLIP, prompt design significantly affects performance. Fixed prompts often do not capture domain-specific context well.

Prompt learning methods show that replacing fixed prompts with learnable context tokens ([V1], [V2], ..., [VM]) improves performance across multiple datasets.

These learned tokens effectively replace manually designed phrases (e.g., "a photo of a") with optimized embedding vectors that encode domain-specific context.

Currently, implementing this requires:

  • manual embedding construction via inputs_embeds
  • manual attention mask updates
  • careful handling of sequence length constraints

This process is error-prone and not reusable. Providing native support would simplify these workflows and make them more robust.
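
The three manual steps listed above can be sketched as one helper. This is an illustration under stated assumptions, not an existing Transformers function: prepend_context is a hypothetical name, and the limit of 77 positions is assumed to match the standard CLIP text encoder and should be read from the model config in practice.

```python
import torch

MAX_POSITIONS = 77  # assumption: context length of the standard CLIP text encoder


def prepend_context(token_embeds, attention_mask, context, max_positions=MAX_POSITIONS):
    """Hypothetical helper: prepend context vectors and extend the mask to match."""
    batch_size, seq_len, _ = token_embeds.shape
    num_context = context.shape[0]
    total_len = num_context + seq_len
    if total_len > max_positions:
        raise ValueError(
            f"{num_context} context tokens + {seq_len} text tokens = {total_len}, "
            f"which exceeds the {max_positions} available positions"
        )
    # Broadcast the shared context over the batch and concatenate along the sequence axis.
    context = context.unsqueeze(0).expand(batch_size, -1, -1)
    embeds = torch.cat([context, token_embeds], dim=1)
    # Context positions are always attended to, so their mask entries are 1.
    context_mask = torch.ones(
        (batch_size, num_context), dtype=attention_mask.dtype, device=attention_mask.device
    )
    mask = torch.cat([context_mask, attention_mask], dim=1)
    return embeds, mask


token_embeds = torch.randn(2, 5, 8)
attention_mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
context = torch.randn(4, 8)
embeds, mask = prepend_context(token_embeds, attention_mask, context)
```

Raising an error instead of silently truncating is a design choice here: truncation could drop the end-of-text token that the encoder relies on.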

Your contribution

I would be happy to contribute a PR for this feature.

I can implement:

  • optional learnable context token support
  • attention mask and sequence handling
  • documentation and usage examples

I will follow the contribution guidelines and ensure backward compatibility.

extent analysis

Fix Plan

To add optional support for learnable context tokens in the CLIP text encoder, follow these steps:

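  • Modify the forward method of the text encoder (the example below uses an illustrative CLIPTextEncoder class; in Transformers the text tower is CLIPTextModel) to accept an additional argument context_tokens, which defaults to None. Override forward rather than __call__ so that nn.Module hooks keep working.
  • If context_tokens is provided, prepend these tokens to the input embeddings and extend the attention mask by the same number of positions.
  • Handle sequence length constraints by checking that the number of context tokens plus the length of the input sequence does not exceed the model's maximum number of positions.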

Example code:

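import torch
import torch.nn as nn

class CLIPTextEncoder(nn.Module):  # illustrative name, not an existing Transformers class
    def __init__(self, ...):
        ...
        # One learnable (max_context_tokens, embed_dim) matrix shared across the batch
        self.context_token_embeddings = nn.Parameter(torch.randn(max_context_tokens, embed_dim))

    def forward(self, input_ids, attention_mask, context_tokens=None):
        input_embeddings = self.token_embedding(input_ids)  # (batch, seq_len, embed_dim)
        if context_tokens is not None:
            # context_tokens is a list of row indices into the context matrix
            context = self.context_token_embeddings[context_tokens]  # (num_context, embed_dim)
            # Add a batch dimension so the tensors can be concatenated along dim=1
            context = context.unsqueeze(0).expand(input_embeddings.shape[0], -1, -1)
            input_embeddings = torch.cat((context, input_embeddings), dim=1)

            # Extend the attention mask with ones, matching its dtype and device
            context_mask = torch.ones((attention_mask.shape[0], len(context_tokens)), dtype=attention_mask.dtype, device=attention_mask.device)
            attention_mask = torch.cat((context_mask, attention_mask), dim=1)

        # Rest of the forward pass remains the same
        ...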
  • Add documentation and usage examples to demonstrate how to use the new feature.
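
One detail the plan above does not spell out: CLIP's text encoder takes its pooled output from the hidden state at the end-of-text token, and that token's index is normally located from input_ids. Once M context vectors are prepended, the index has to be shifted by M. The following is a minimal sketch of that offset; pooled_eos_state is a hypothetical helper, and the token ids are toy values rather than real CLIP vocabulary ids.

```python
import torch


def pooled_eos_state(hidden_states, input_ids, eos_token_id, num_context):
    """Hypothetical helper: pick each sequence's EOS hidden state after a context prefix."""
    # Position of the first EOS token in the original (un-prefixed) ids.
    eos_positions = (input_ids == eos_token_id).int().argmax(dim=-1)
    batch_index = torch.arange(hidden_states.shape[0], device=hidden_states.device)
    # Shift by the number of prepended context vectors.
    return hidden_states[batch_index, eos_positions + num_context]


# Toy hidden states whose value equals their sequence position, so the pick is easy to read.
num_context = 3
input_ids = torch.tensor([[1, 5, 2, 0], [1, 7, 9, 2]])  # toy ids, EOS = 2
hidden_states = torch.arange(7, dtype=torch.float32).view(1, 7, 1).expand(2, 7, 2)
pooled = pooled_eos_state(hidden_states, input_ids, eos_token_id=2, num_context=num_context)
print(pooled[:, 0].tolist())  # [5.0, 6.0]
```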

Verification

To verify that the fix worked, test the following scenarios:

  • Provide a list of context tokens and verify that they are correctly prepended to the input embeddings.
  • Check that the attention mask is updated correctly when context tokens are provided.
  • Test the model with and without context tokens to ensure that the output is correct and the model is backward compatible.
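
The checks above could be sketched as follows. ToyTextEncoder is a small stand-in written only for this example (it is not the Transformers CLIP implementation), and the num_context argument is a hypothetical interface; the same assertions would need adapting to whatever API the actual PR exposes.

```python
import torch
import torch.nn as nn


class ToyTextEncoder(nn.Module):
    """Toy stand-in for a text encoder; NOT the Transformers CLIP implementation."""

    def __init__(self, vocab_size=100, embed_dim=8, max_context_tokens=4):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.context = nn.Parameter(torch.randn(max_context_tokens, embed_dim) * 0.02)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, input_ids, attention_mask, num_context=0):
        embeds = self.token_embedding(input_ids)
        if num_context > 0:
            context = self.context[:num_context].unsqueeze(0).expand(embeds.shape[0], -1, -1)
            embeds = torch.cat([context, embeds], dim=1)
            ones = attention_mask.new_ones((attention_mask.shape[0], num_context))
            attention_mask = torch.cat([ones, attention_mask], dim=1)
        # Masked mean pooling keeps the toy model sensitive to the attention mask.
        weights = attention_mask.unsqueeze(-1).to(embeds.dtype)
        pooled = (embeds * weights).sum(dim=1) / weights.sum(dim=1)
        return self.proj(pooled), attention_mask


torch.manual_seed(0)
model = ToyTextEncoder()
input_ids = torch.tensor([[1, 5, 2, 0], [1, 7, 9, 2]])
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])

# 1. Backward compatibility: without context, the output matches the plain computation.
baseline, _ = model(input_ids, attention_mask)
plain = model.token_embedding(input_ids)
w = attention_mask.unsqueeze(-1).float()
expected = model.proj((plain * w).sum(dim=1) / w.sum(dim=1))
assert torch.allclose(baseline, expected)

# 2. Attention mask: context positions are all ones, original entries are unchanged.
out, new_mask = model(input_ids, attention_mask, num_context=3)
assert new_mask.shape == (2, 7)
assert bool(new_mask[:, :3].all())
assert torch.equal(new_mask[:, 3:], attention_mask)

# 3. Learnability: gradients reach the context rows that were used, and only those.
model.zero_grad()
out.sum().backward()
grad = model.context.grad
assert grad[:3].abs().sum() > 0
assert bool(torch.all(grad[3:] == 0))
```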

Extra Tips

  • Make sure to follow the contribution guidelines and ensure that the new feature is properly documented and tested.
  • Consider adding a max_context_tokens argument to the CLIPTextEncoder class to limit the number of context tokens that can be provided.
