transformers - 💡(How to fix) Fix [CB] PagedAttentionCache crashes with "Invalid group type: linear_attention" on Qwen3.5 models [7 comments, 3 participants]

GitHub stats
huggingface/transformers#44530 · Fetched 2026-04-08 00:27:52
Comments: 7 · Participants: 3 · Timeline events: 22 · Reactions: 0
Timeline (top): commented ×7, mentioned ×6, subscribed ×6, labeled ×2

Error Message

Error in generation loop: Invalid group type: linear_attention
Traceback (most recent call last):
  File "C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\transformers\generation\continuous_batching\continuous_api.py", line 767, in _run_generation_loop
    paged_attention_cache = PagedAttentionCache(
        self.model.config,
    ...<4 lines>...
        allow_block_sharing=self._allow_block_sharing,
    )
  File "C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\transformers\generation\continuous_batching\cache.py", line 240, in __init__
    raise ValueError(f"Invalid group type: {group_type}")
ValueError: Invalid group type: linear_attention
Generation thread terminated unexpectedly.
Batch processor was not initialized.
Request req_0 not found in results.
Request req_1 not found in results.
Request req_2 not found in results.
Request req_3 not found in results.
Request req_4 not found in results.
CB generation took: 1.01 seconds for 0 tokens. 0.00tok/s

Root Cause

The traceback ends in PagedAttentionCache.__init__ (cache.py, line 240), which raises ValueError for any group type it does not recognize. Qwen3.5 uses a hybrid architecture that mixes linear attention layers with standard attention layers, so its config produces a "linear_attention" group. The handler rejects that group, and the generation thread terminates before any request is processed.


System Info

  • transformers version: 5.2.0
  • Platform: Windows-11-10.0.26200-SP0
  • Python version: 3.13.3
  • Huggingface_hub version: 1.5.0
  • Safetensors version: 0.5.3
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.10.0+cu128 (CUDA)
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: Yes
  • GPU type: NVIDIA GeForce RTX 4060 Ti

Who can help?

@remi-or @ArthurZucker @McPatate

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Reproduces with Qwen/Qwen3.5-0.8B using the official continuous batching example script. Run with: python examples/pytorch/continuous_batching_simple.py --samples 5

The error occurs inside PagedAttentionCache.__init__() when building the KV cache group map. Qwen3.5 uses a hybrid architecture with linear attention layers alongside standard attention layers. The group type handler in cache.py does not recognize "linear_attention" as a valid group type, causing an immediate crash before any generation begins.
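A pre-flight check can surface this mismatch before the generation thread starts. The sketch below is an illustration, not the library's code. It assumes the model config exposes a `layer_types` list and that continuous batching accepts only `full_attention` and `sliding_attention` groups, which is inferred from the error rather than confirmed by this report. `SUPPORTED_GROUP_TYPES` and `unsupported_layer_types` are hypothetical names.

```python
from types import SimpleNamespace

# Assumption: group types the paged cache accepts. Inferred from the error,
# not taken from the transformers source; adjust for your installed version.
SUPPORTED_GROUP_TYPES = frozenset({"full_attention", "sliding_attention"})


def unsupported_layer_types(config, supported=SUPPORTED_GROUP_TYPES):
    """Return the sorted layer types in `config` that are not in `supported`.

    `config` is any object with an optional `layer_types` attribute
    (a list of strings, one per layer). A missing or empty attribute is
    treated as "nothing to flag".
    """
    layer_types = getattr(config, "layer_types", None) or []
    return sorted(set(layer_types) - set(supported))


# Stand-in for a hybrid config; a real one would come from
# AutoConfig.from_pretrained(...), possibly nested under a text config.
hybrid_config = SimpleNamespace(
    layer_types=["linear_attention", "linear_attention", "full_attention"]
)
print(unsupported_layer_types(hybrid_config))  # ['linear_attention']
```

A non-empty result suggests the model would hit the same ValueError, so the caller can skip continuous batching for that model instead of losing every request in the batch.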

Full traceback:

C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\torch\cuda\__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
The fast path is not available because one of the required library is not installed. Falling back to torch implementation. To install follow https://github.com/fla-org/flash-linear-attention#installation and https://github.com/Dao-AILab/causal-conv1d
Loading weights: 100%|█████████████████████████████████████████████| 320/320 [00:00<00:00, 553.59it/s, Materializing param=model.norm.weight]
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Error in generation loop: Invalid group type: linear_attention
Traceback (most recent call last):
  File "C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\transformers\generation\continuous_batching\continuous_api.py", line 767, in _run_generation_loop
    paged_attention_cache = PagedAttentionCache(
        self.model.config,
    ...<4 lines>...
        allow_block_sharing=self._allow_block_sharing,
    )
  File "C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\transformers\generation\continuous_batching\cache.py", line 240, in __init__
    raise ValueError(f"Invalid group type: {group_type}")
ValueError: Invalid group type: linear_attention
Generation thread terminated unexpectedly.
Solving 5 requests:   0%|                                                                                         | 0/5 [00:01<?, ?request/s]Returning results of generate_batch despite unexpected termination.
Solving 5 requests:   0%|                                                                                         | 0/5 [00:01<?, ?request/s] 

Batch processor was not initialized.
Request req_0 not found in results.
Request req_1 not found in results.
Request req_2 not found in results.
Request req_3 not found in results.
Request req_4 not found in results.
--- Running CB Generation Example ---
Error in generation loop: Invalid group type: linear_attention                                                                                
Traceback (most recent call last):
  File "C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\transformers\generation\continuous_batching\continuous_api.py", line 767, in _run_generation_loop
    paged_attention_cache = PagedAttentionCache(
        self.model.config,
    ...<4 lines>...
        allow_block_sharing=self._allow_block_sharing,
    )
  File "C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\transformers\generation\continuous_batching\cache.py", line 240, in __init__
    raise ValueError(f"Invalid group type: {group_type}")
ValueError: Invalid group type: linear_attention
Generation thread terminated unexpectedly.                                                                                                    
Solving 5 requests:   0%|                                                                                         | 0/5 [00:01<?, ?request/s]Returning results of generate_batch despite unexpected termination.
Solving 5 requests:   0%|                                                                                         | 0/5 [00:01<?, ?request/s] 

Batch processor was not initialized.
Request req_0 not found in results.
Request req_1 not found in results.
Request req_2 not found in results.
Request req_3 not found in results.
Request req_4 not found in results.
Done with batch generation.
--------------------
--- Finished CB Generation Example ---

CB generation took: 1.01 seconds for 0 tokens. 0.00tok/s

Expected behavior

Running examples/pytorch/continuous_batching_simple.py --samples 5 (with Qwen/Qwen3.5-0.8B) should complete successfully and return generated text for all 5 requests, consistent with how it behaves for Qwen/Qwen3-4B-Instruct-2507 (the model used in the official docs example).

PagedAttentionCache should handle linear_attention as a known group type. The crash prevents generate_batch from being usable with any hybrid linear-attention model.


Fix Plan

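Upgrade transformers to a release whose PagedAttentionCache supports hybrid linear-attention models. The issue was reported against 5.2.0, and this report does not identify which release, if any, contains the fix, so treat the upgrade as something to verify rather than a guaranteed remedy.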

Step 1: Update transformers library

pip install transformers --upgrade

Step 2: Check transformers library version

import transformers
print(transformers.__version__)

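The output should be a version later than 5.2.0, the version the issue was reported against. A newer version number alone does not confirm the fix; confirm it by re-running the example script.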
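The comparison against 5.2.0 can also be done programmatically. This is a minimal sketch using only the standard library; `release_tuple` and `is_newer_than` are hypothetical helper names, and pre-release suffixes such as `.dev0` are ignored, so a stricter parser (for example `packaging.version.Version`) may be preferable where that distinction matters.

```python
import re
from importlib import metadata


def release_tuple(version):
    """Extract the leading numeric release, e.g. '5.3.0.dev0' -> (5, 3, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_newer_than(installed, baseline="5.2.0"):
    """True if `installed` has a strictly higher release number than `baseline`."""
    a, b = release_tuple(installed), release_tuple(baseline)
    width = max(len(a), len(b))
    # Pad with zeros so that '5.2' and '5.2.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a > b


try:
    installed = metadata.version("transformers")
    print(installed, "newer than 5.2.0:", is_newer_than(installed))
except metadata.PackageNotFoundError:
    print("transformers is not installed in this environment")
```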

Verify the fix

Step 1: Run the example script again

python examples/pytorch/continuous_batching_simple.py --samples 5

Step 2: Check the output

The script should complete successfully and return generated text for all 5 requests.
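Checking "all 5 requests" by eye is easy to get wrong on larger batches. The sketch below assumes the batch results behave like a mapping keyed by request ids of the form `req_<n>`, as the "Request req_0 not found in results" log lines suggest; `missing_request_ids` is a hypothetical helper, not part of transformers.

```python
def missing_request_ids(results, expected_count, prefix="req_"):
    """Return the expected request ids that are absent from `results`."""
    expected = [f"{prefix}{i}" for i in range(expected_count)]
    return [request_id for request_id in expected if request_id not in results]


# Stand-in for the failing run, where no request produced a result.
failed_run = {}
print(missing_request_ids(failed_run, expected_count=5))
# ['req_0', 'req_1', 'req_2', 'req_3', 'req_4']
```

An empty list means every expected request came back; any remaining ids point to the same failure reported here.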

Extra Tips

  • Upgrade huggingface_hub and safetensors together with transformers so the installed versions stay compatible.
  • If you're using a virtual environment, make sure to activate it before running the script.
  • If the error persists after upgrading, recreate the environment and reinstall to rule out a stale transformers installation, then re-run the script.
