transformers - 💡(How to fix) Fix Qwen3.5 cannot generate normally with flash-attention [2 comments, 2 participants]

transformers · 2026-03-24 20:29:19


huggingface/transformers#44977•Fetched 2026-04-08 01:26:18


Author: yuyijiong

Participants: JJJYmmm, yuyijiong

Timeline (top): subscribed ×3 · commented ×2 · mentioned ×2 · closed ×1


System Info

transformers version: 5.3.0.dev0
Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
Python version: 3.12.11
Huggingface_hub version: 1.7.2
Safetensors version: 0.6.2
Accelerate version: 1.12.0
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
Using distributed or parallel set-up in script?: No
Using GPU in script?: Yes
GPU type: NVIDIA H20
Flash Attention: 2.8.3
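
A version list like the one above can be collected with a short script, which may help when comparing a working and a failing environment. This is a minimal sketch using only the standard library. `installed_versions` is a hypothetical helper, not part of any library, and the distribution names passed to it (for example `flash-attn`) are assumptions about how the packages were installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed PyPI distribution names for the packages listed in the report.
    names = ["transformers", "huggingface_hub", "safetensors", "accelerate", "torch", "flash-attn"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```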

Who can help?

@ArthurZucker @Cyrilvallez

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

import os

# Restrict the process to the first GPU; set before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

if __name__ == '__main__':
    # Swap in "/share/models/Qwen3-4B-Instruct-2507" to compare against Qwen3.
    model_path = "/share/models/Qwen3.5-4B"

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True,
                                              add_bos_token=False, add_eos_token=False)

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        dtype=torch.bfloat16,
        trust_remote_code=True,
        attn_implementation="flash_attention_2",  # the setting that triggers the bad output
        device_map="cuda",
    ).eval()

    prompt = "How are you today?"
    chat_prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

    # Greedy decoding; temperature and top_p are left at neutral values.
    generation_config = GenerationConfig(
        do_sample=False,
        temperature=1.0,
        top_p=1.0,
        num_return_sequences=1,
        max_new_tokens=100,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=True,
        return_dict_in_generate=True,
        output_logits=True,
    )

    chat_prompt_ids = tokenizer(chat_prompt, return_tensors="pt")["input_ids"].to(model.device)
    output = model.generate(input_ids=chat_prompt_ids, generation_config=generation_config)
    # Decode only the newly generated tokens, keeping special tokens visible.
    output_text = tokenizer.decode(output['sequences'][0][chat_prompt_ids.size(1):], skip_special_tokens=False)
    print(output_text)

In this script, when using Qwen3.5 with attn_implementation="flash_attention_2", the model outputs "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", which is abnormal.
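
A run of one repeated token under greedy decoding is sometimes a symptom of non-finite logits (NaN or inf). That is an assumption here, not something the report confirms. Because the script sets `output_logits=True` and `return_dict_in_generate=True`, the per-step logits are available on the generation output and can be checked. The helper below is a hypothetical sketch that accepts either tensors or plain lists of floats.

```python
import math


def _all_finite(step):
    """True if every value in one step's logits is finite."""
    if hasattr(step, "isfinite"):
        # torch.Tensor path: reduce the elementwise check to a single bool.
        return bool(step.isfinite().all())
    # Plain Python sequence of floats.
    return all(math.isfinite(v) for v in step)


def nonfinite_steps(step_logits):
    """Return indices of generation steps whose logits contain NaN or inf."""
    return [i for i, step in enumerate(step_logits) if not _all_finite(step)]
```

In the reproduction script this could be called as `nonfinite_steps(output.logits)`. An empty result would point away from NaN or inf logits as the cause.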

However, when attn_implementation is not set, Qwen3.5 generates normal output, and Qwen3-4B-Instruct with attn_implementation="flash_attention_2" is also normal. This suggests that the problem is specific to the combination of Qwen3.5 and flash-attention-2.
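
The comparison described above can be made more precise by generating with two attention backends and finding where the token sequences first differ. This is a rough sketch under assumptions: `greedy_ids` and `first_divergence` are hypothetical helpers, the loading arguments mirror the reproduction script, and `"sdpa"` is used only as an example of a reference backend. Small bfloat16 differences can make two correct backends diverge after some tokens, so an immediate divergence is more telling than a late one.

```python
def greedy_ids(model_path, prompt_ids, attn_implementation, max_new_tokens=20):
    """Load the model with one attention backend and return the new token ids."""
    # Imported lazily so first_divergence below is usable without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        dtype=torch.bfloat16,
        trust_remote_code=True,
        attn_implementation=attn_implementation,
        device_map="cuda",
    ).eval()
    input_ids = torch.tensor([prompt_ids], device=model.device)
    with torch.no_grad():
        out = model.generate(input_ids=input_ids, do_sample=False,
                             max_new_tokens=max_new_tokens)
    # Drop the prompt and keep only the generated continuation.
    return out[0, len(prompt_ids):].tolist()


def first_divergence(a, b):
    """Index of the first position where two token-id lists differ, else None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # Same common prefix: they differ only if one list is longer.
    return None if len(a) == len(b) else min(len(a), len(b))
```

For example, `first_divergence(greedy_ids(path, ids, "sdpa"), greedy_ids(path, ids, "flash_attention_2"))` returning 0 would indicate that the two backends disagree from the very first generated token.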

Expected behavior

Qwen3.5 should be compatible with flash-attention-2.

extent analysis

Fix Plan

The report does not identify a root cause, so the steps below are workarounds for the Qwen3.5 and Flash Attention 2 problem rather than a confirmed fix.

Remove attn_implementation to use the default attention implementation, or set it to a different one such as "sdpa" or "eager".
Alternatively, try updating the transformers library to the latest version, since the incompatibility may be resolved in a newer release.

Here's an example of how you can modify your code:

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             dtype=torch.bfloat16,
                                             trust_remote_code=True,
                                             # Remove or update attn_implementation
                                             # attn_implementation="flash_attention_2",
                                             device_map="cuda"
                                             ).eval()

If you still want to use Flash Attention 2, you can try upgrading the flash-attn package, although the report already uses Flash Attention 2.8.3, so an upgrade alone may not help:

pip install --upgrade flash-attn --no-build-isolation

Verification

To verify that the fix worked, run your script again with the modified model configuration. The output should be normal and not contain the abnormal "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" string.
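
The check described above can be automated with a simple heuristic. The function below is a hypothetical sketch, not part of transformers: it flags output in which a single character dominates, and the `threshold` and `min_length` defaults are arbitrary assumptions that may need tuning.

```python
from collections import Counter


def looks_degenerate(text, threshold=0.9, min_length=10):
    """Heuristic: True if one character makes up most of a reasonably long output."""
    stripped = text.strip()
    if len(stripped) < min_length:
        # Too short to judge; a brief "!!!" can be legitimate.
        return False
    _, count = Counter(stripped).most_common(1)[0]
    return count / len(stripped) >= threshold
```

In the reproduction script, `looks_degenerate(output_text)` could replace a visual inspection of the printed output.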

Extra Tips

Make sure to check the compatibility of your model and attention implementation before running your script.
If you're using a custom model or attention implementation, ensure that it's compatible with your PyTorch and Transformers versions.
You can also try resetting the attn_implementation to its default value by removing the attn_implementation argument or setting it to None.


