transformers - 💡(How to fix) Fallback to kernels-community/flash-attn2 is blocked by other checks when fa2 is not installed [1 comment, 2 participants]

GitHub stats
huggingface/transformers#45399 · Fetched 2026-04-15 06:19:44
Comments: 1 · Participants: 2 · Timeline events: 7 · Reactions: 0
Timeline (top): mentioned ×2, subscribed ×2, closed ×1, commented ×1

Error Message


ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package for FlashAttention2 doesn't seem to be installed.

Code Example

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("models/Llama-3.2-1B", torch_dtype="auto", attn_implementation="flash_attention_2", device_map="cpu")

---

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[4], line 2
      1 from transformers import AutoModelForCausalLM
----> 2 model = AutoModelForCausalLM.from_pretrained("models/Llama-3.2-1B", torch_dtype="auto", attn_implementation="flash_attention_2", device_map="cpu")

File ~/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:387, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    385     if model_class.config_class == config.sub_configs.get("text_config", None):
    386         config = config.get_text_config()
--> 387     return model_class.from_pretrained(
    388         pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    389     )
    390 raise ValueError(
    391     f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
    392     f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping)}."
    393 )

File ~/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/modeling_utils.py:4092, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, weights_only, *model_args, **kwargs)
   4090 config = copy.deepcopy(config)  # We do not want to modify the config inplace in from_pretrained.
   4091 with ContextManagers(model_init_context):
-> 4092     model = cls(config, *model_args, **model_kwargs)
   4093     patch_output_recorders(model)
   4095     if hf_quantizer is not None:  # replace module with quantized modules (does not touch weights)

File ~/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:435, in LlamaForCausalLM.__init__(self, config)
    434 def __init__(self, config):
--> 435     super().__init__(config)
    436     self.model = LlamaModel(config)
    437     self.vocab_size = config.vocab_size

File ~/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/modeling_utils.py:1255, in PreTrainedModel.__init__(self, config, *inputs, **kwargs)
   1251 self.name_or_path = config.name_or_path
   1253 # Check the attention implementation is supported, or set it if not yet set (on the internal attr, to avoid
   1254 # setting it recursively)
-> 1255 self.config._attn_implementation_internal = self._check_and_adjust_attn_implementation(
   1256     self.config._attn_implementation,
   1257     is_init_check=True,
   1258     # We need to use this constant that is set through context manager as it cannot be forwarded in the model's __init__
   1259     allow_all_kernels=hub_kernels.ALLOW_ALL_KERNELS,
   1260 )
   1261 # Check the experts implementation is supported, or set it if not yet set (on the internal attr, to avoid
   1262 # setting it recursively)
   1263 self.config._experts_implementation_internal = self._check_and_adjust_experts_implementation(
   1264     self.config._experts_implementation
   1265 )

File ~/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/modeling_utils.py:1865, in PreTrainedModel._check_and_adjust_attn_implementation(self, attn_implementation, is_init_check, allow_all_kernels)
   1863         raise e
   1864 else:
-> 1865     applicable_attn_implementation = self.get_correct_attn_implementation(
   1866         applicable_attn_implementation, is_init_check
   1867     )
   1869     # preload flash attention here to allow compile with fullgraph
   1870     if is_flash_attention_requested(requested_attention_implementation=applicable_attn_implementation):

File ~/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/modeling_utils.py:1912, in PreTrainedModel.get_correct_attn_implementation(self, requested_attention, is_init_check)
   1908 if is_flash_attention_requested(requested_attention_implementation=applicable_attention) and (
   1909     fa_matched := re.search(r"^flash_attention_(\d)$", applicable_attention)
   1910 ):
   1911     fa_version = int(fa_matched.group(1))  # last digit
-> 1912     self._flash_attn_can_dispatch(flash_attn_version=fa_version, is_init_check=is_init_check)
   1913 elif "flex_attention" in applicable_attention:
   1914     self._flex_attn_can_dispatch(is_init_check)

File ~/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/modeling_utils.py:1647, in PreTrainedModel._flash_attn_can_dispatch(self, flash_attn_version, is_init_check)
   1644     raise ValueError(f"Requested Flash Attention {flash_attn_version} which is not supported.")
   1646 # Check if we can even use the FA version based on the env of the user
-> 1647 self._flash_attn_import_error(**FLASH_ATTENTION_COMPATIBILITY_MATRIX[flash_attn_version])
   1649 # Check for attention dropout, which is incompatible with newer FA versions
   1650 # (many should not really care about dropout as it is not super effective, hence warning for now)
   1651 if flash_attn_version > 2:

File ~/anaconda3/envs/LLM/lib/python3.10/site-packages/transformers/modeling_utils.py:1602, in PreTrainedModel._flash_attn_import_error(self, flash_attn_version, general_availability_check, pkg_availability_check, supported_devices, custom_supported_devices, cuda_min_major_version)
   1600 # Can the package be seen in the import structure
   1601 if not pkg_availability_check():
-> 1602     raise ImportError(
   1603         f"{preface} the package for FlashAttention{flash_attn_version} doesn't seem to be installed."
   1604     )
   1605 # Minimum version (FA2 only)
   1606 elif flash_attn_version == 2 and not is_flash_attn_greater_or_equal("2.3.3"):

ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package for FlashAttention2 doesn't seem to be installed.

System Info

  • transformers version: 5.5.3
  • Platform: Linux-5.4.0-216-generic-x86_64-with-glibc2.31
  • Python version: 3.10.0
  • Huggingface_hub version: 1.10.1
  • Safetensors version: 0.5.2
  • Accelerate version: 1.6.0
  • Accelerate config:
      - compute_environment: LOCAL_MACHINE
      - distributed_type: MULTI_GPU
      - mixed_precision: no
      - use_cpu: False
      - debug: False
      - num_processes: 8
      - machine_rank: 0
      - num_machines: 1
      - rdzv_backend: static
      - same_network: False
      - main_training_function: main
      - enable_cpu_affinity: False
      - downcast_bf16: False
      - tpu_use_cluster: False
      - tpu_use_sudo: False
  • DeepSpeed version: 0.16.5
  • PyTorch version (accelerator?): 2.7.0+cu126 (CUDA)
  • Using distributed or parallel set-up in script?: <fill in>
  • Using GPU in script?: <fill in>
  • GPU type: NVIDIA L40

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("models/Llama-3.2-1B", torch_dtype="auto", attn_implementation="flash_attention_2", device_map="cpu")

ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package for FlashAttention2 doesn't seem to be installed.

Expected behavior

If fa2 is not installed, the code should fall back to kernels-community/flash-attn2 successfully.
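
The difference between the observed and the expected behavior can be sketched with a toy dispatcher. This is only an illustration and not the transformers implementation: `strict_dispatch`, `dispatch_with_fallback`, and their boolean parameters are hypothetical names made up for the sketch.

```python
HUB_KERNEL = "kernels-community/flash-attn2"


def strict_dispatch(requested: str, fa2_installed: bool) -> str:
    """Toy model of the observed behavior: the package check raises first."""
    if requested == "flash_attention_2" and not fa2_installed:
        raise ImportError(
            "the package for FlashAttention2 doesn't seem to be installed."
        )
    return requested


def dispatch_with_fallback(
    requested: str, fa2_installed: bool, kernels_installed: bool
) -> str:
    """Toy model of the expected behavior: fall back to the Hub kernel."""
    try:
        return strict_dispatch(requested, fa2_installed)
    except ImportError:
        # Hypothetical fallback: only possible if the Hub kernel can be loaded.
        if kernels_installed:
            return HUB_KERNEL
        raise
```

In the reported traceback the flow resembles `strict_dispatch`: the ImportError propagates to the caller and no fallback is attempted.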

extent analysis

TL;DR

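The ImportError is raised because the flash-attn package is not installed: _flash_attn_import_error fails at its package availability check, and the traceback shows no fallback to kernels-community/flash-attn2 being attempted. The reporter expected that fallback to happen automatically.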

Guidance

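  • Check whether FlashAttention2 is installed with pip show flash-attn (or pip list); if it is missing, install it with pip install flash-attn.
  • kernels-community/flash-attn2 is a kernel repository on the Hugging Face Hub, not a pip package, so it cannot be installed with pip install. To use it, install the kernels package (pip install kernels) and try passing attn_implementation="kernels-community/flash-attn2" instead of "flash_attention_2".
  • Retry the reproduction after either change.
  • If the issue persists, check that flash-attn is compatible with your PyTorch and transformers versions; the traceback shows a minimum-version check against flash-attn 2.3.3.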
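
As an alternative to reading pip list by hand, the installed versions can be collected from Python. The following is a minimal sketch using only the standard library; `installed_versions`, `release_tuple`, and `meets_minimum` are illustrative helper names, and the 2.3.3 threshold is the one that appears in the traceback.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def release_tuple(version):
    """Leading numeric components of a version string as a tuple of ints."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as "post1"
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum="2.3.3"):
    """True if the numeric release of `installed` is at least `minimum`."""
    return release_tuple(installed) >= release_tuple(minimum)


if __name__ == "__main__":
    report = installed_versions(["flash-attn", "kernels", "torch", "transformers"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
    if report["flash-attn"] is not None:
        print("flash-attn >= 2.3.3:", meets_minimum(report["flash-attn"]))
```

The comparison deliberately ignores pre-release and post-release suffixes; for anything stricter, a dedicated version parser would be a better fit.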

Notes

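The error message indicates that the FlashAttention2 package (flash-attn) is not installed, and attn_implementation="flash_attention_2" requires it. Installing the package works around the error, but it does not address the behavior the issue reports, namely that the fallback to kernels-community/flash-attn2 is never reached. Note also that the reproduction uses device_map="cpu", while the failing check takes supported_devices and cuda_min_major_version arguments, so device checks may still apply after installation.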

Recommendation

Apply workaround: install FlashAttention2 with pip install flash-attn, or install the kernels package (pip install kernels) and request the Hub kernel explicitly with attn_implementation="kernels-community/flash-attn2".
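
Until the fallback works automatically, one option is to do it by hand in user code. The sketch below is a generic helper and not part of transformers: `load_with_fallback` is a hypothetical name, and the commented usage assumes that "kernels-community/flash-attn2" and "sdpa" are accepted values for attn_implementation in your installed version.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def load_with_fallback(loader: Callable[[str], T], implementations: Sequence[str]) -> T:
    """Call `loader` with each implementation name until one does not raise ImportError."""
    last_error = None
    for impl in implementations:
        try:
            return loader(impl)
        except ImportError as err:
            # Remember why this candidate failed and try the next one.
            last_error = err
    raise RuntimeError(
        f"none of {list(implementations)} could be used"
    ) from last_error


# Possible usage with the reproduction from this issue (not executed here):
#
# from transformers import AutoModelForCausalLM
# model = load_with_fallback(
#     lambda impl: AutoModelForCausalLM.from_pretrained(
#         "models/Llama-3.2-1B", attn_implementation=impl
#     ),
#     ["flash_attention_2", "kernels-community/flash-attn2", "sdpa"],
# )
```

Only ImportError is caught, since that is the error shown in the traceback; other failures still surface immediately.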
