transformers - 💡(How to fix) Fix [CB] PagedAttentionCache crashes with "Invalid group type: linear_attention" on Qwen3.5 models [7 comments, 3 participants]

Q: Expected behavior

Running `examples/pytorch/continuous_batching_simple.py --samples 5` (with `Qwen/Qwen3.5-0.8B`) should complete successfully and return generated text for all 5 requests, consistent with how it behaves for `Qwen/Qwen3-4B-Instruct-2507` (the model used in the official docs example). `PagedAttentionCache` should handle `linear_attention` as a known group type. The crash prevents `generate_batch` from being usable with any hybrid linear-attention model.

transformers · 2026-03-08 18:49:55

huggingface/transformers#44530 • Fetched 2026-04-08 00:27:52

Error Message

Error in generation loop: Invalid group type: linear_attention
Traceback (most recent call last):
  File "C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\transformers\generation\continuous_batching\continuous_api.py", line 767, in _run_generation_loop
    paged_attention_cache = PagedAttentionCache(
        self.model.config,
    ...<4 lines>...
        allow_block_sharing=self._allow_block_sharing,
    )
  File "C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\transformers\generation\continuous_batching\cache.py", line 240, in __init__
    raise ValueError(f"Invalid group type: {group_type}")
ValueError: Invalid group type: linear_attention
Generation thread terminated unexpectedly.
Returning results of generate_batch despite unexpected termination.

Batch processor was not initialized.
Request req_0 not found in results.
Request req_1 not found in results.
Request req_2 not found in results.
Request req_3 not found in results.
Request req_4 not found in results.
Done with batch generation.

CB generation took: 1.01 seconds for 0 tokens. 0.00tok/s

System Info

transformers version: 5.2.0
Platform: Windows-11-10.0.26200-SP0
Python version: 3.13.3
Huggingface_hub version: 1.5.0
Safetensors version: 0.5.3
Accelerate version: 1.12.0
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.10.0+cu128 (CUDA)
Using distributed or parallel set-up in script?: No
Using GPU in script?: Yes
GPU type: NVIDIA GeForce RTX 4060 Ti

Who can help?

@remi-or @ArthurZucker @McPatate

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

The crash reproduces with Qwen/Qwen3.5-0.8B using the official continuous batching example script. Run:

python examples/pytorch/continuous_batching_simple.py --samples 5

The error is raised inside PagedAttentionCache.__init__() while it builds the KV cache group map. Qwen3.5 uses a hybrid architecture that mixes linear attention layers with standard attention layers. The group type handling in cache.py does not recognize "linear_attention" as a valid group type, so it raises a ValueError before any generation begins.
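The mismatch described above can be sketched as a small pre-flight check. This is only an illustration: the set of supported group types below is an assumption inferred from the error message (it is not copied from cache.py), and `unsupported_layer_types` is a hypothetical helper, not part of transformers.

```python
# Assumption: the paged cache only knows these group types. The set is
# inferred from the error message in this report, not copied from cache.py.
ASSUMED_SUPPORTED_GROUP_TYPES = frozenset({"full_attention", "sliding_attention"})


def unsupported_layer_types(layer_types, supported=ASSUMED_SUPPORTED_GROUP_TYPES):
    """Return the sorted layer types that are not in `supported` (hypothetical helper)."""
    return sorted(set(layer_types) - set(supported))


# Hypothetical layouts, for illustration only.
hybrid_layers = ["linear_attention"] * 3 + ["full_attention"]
dense_layers = ["full_attention"] * 4

print(unsupported_layer_types(hybrid_layers))  # ['linear_attention']
print(unsupported_layer_types(dense_layers))   # []
```

With a real model, the list of layer types would have to come from the model config. Where exactly it is stored for Qwen3.5 is not stated in this report, so that lookup is left out of the sketch.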

Full traceback:

C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\torch\cuda\__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
The fast path is not available because one of the required library is not installed. Falling back to torch implementation. To install follow https://github.com/fla-org/flash-linear-attention#installation and https://github.com/Dao-AILab/causal-conv1d
Loading weights: 100%|█████████████████████████████████████████████| 320/320 [00:00<00:00, 553.59it/s, Materializing param=model.norm.weight]
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Error in generation loop: Invalid group type: linear_attention
Traceback (most recent call last):
  File "C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\transformers\generation\continuous_batching\continuous_api.py", line 767, in _run_generation_loop
    paged_attention_cache = PagedAttentionCache(
        self.model.config,
    ...<4 lines>...
        allow_block_sharing=self._allow_block_sharing,
    )
  File "C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\transformers\generation\continuous_batching\cache.py", line 240, in __init__
    raise ValueError(f"Invalid group type: {group_type}")
ValueError: Invalid group type: linear_attention
Generation thread terminated unexpectedly.
Solving 5 requests:   0%|                                                                                         | 0/5 [00:01<?, ?request/s]Returning results of generate_batch despite unexpected termination.
Solving 5 requests:   0%|                                                                                         | 0/5 [00:01<?, ?request/s] 

Batch processor was not initialized.
Request req_0 not found in results.
Request req_1 not found in results.
Request req_2 not found in results.
Request req_3 not found in results.
Request req_4 not found in results.
--- Running CB Generation Example ---
Error in generation loop: Invalid group type: linear_attention                                                                                
Traceback (most recent call last):
  File "C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\transformers\generation\continuous_batching\continuous_api.py", line 767, in _run_generation_loop
    paged_attention_cache = PagedAttentionCache(
        self.model.config,
    ...<4 lines>...
        allow_block_sharing=self._allow_block_sharing,
    )
  File "C:\Users\maxch\AppData\Local\Programs\Python\Python313\Lib\site-packages\transformers\generation\continuous_batching\cache.py", line 240, in __init__
    raise ValueError(f"Invalid group type: {group_type}")
ValueError: Invalid group type: linear_attention
Generation thread terminated unexpectedly.                                                                                                    
Solving 5 requests:   0%|                                                                                         | 0/5 [00:01<?, ?request/s]Returning results of generate_batch despite unexpected termination.
Solving 5 requests:   0%|                                                                                         | 0/5 [00:01<?, ?request/s] 

Batch processor was not initialized.
Request req_0 not found in results.
Request req_1 not found in results.
Request req_2 not found in results.
Request req_3 not found in results.
Request req_4 not found in results.
Done with batch generation.
--------------------
--- Finished CB Generation Example ---

CB generation took: 1.01 seconds for 0 tokens. 0.00tok/s

Expected behavior

Running examples/pytorch/continuous_batching_simple.py --samples 5 (with Qwen/Qwen3.5-0.8B) should complete successfully and return generated text for all 5 requests, consistent with how it behaves for Qwen/Qwen3-4B-Instruct-2507 (the model used in the official docs example).

PagedAttentionCache should handle linear_attention as a known group type. The crash prevents generate_batch from being usable with any hybrid linear-attention model.

extent analysis

Fix Plan

Update transformers library to a version that supports hybrid attention models

Step 1: Update transformers library

pip install transformers --upgrade

Step 2: Check transformers library version

import transformers
print(transformers.__version__)

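The printed version should be newer than 5.2.0, the version in which this crash was reported. A newer version number alone does not confirm that linear_attention support is included, so rerun the example script to verify.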
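The version check can also be done programmatically. The following is a minimal sketch that compares only the leading numeric part of a version string; `release_tuple` and `is_newer_than_reported` are hypothetical helper names, and 5.2.0 is taken from the System Info of this report.

```python
import re

REPORTED_VERSION = "5.2.0"  # version listed in the System Info of this report


def release_tuple(version):
    """Parse the leading numeric components, e.g. '5.3.0.dev0' -> (5, 3, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_newer_than_reported(installed, reported=REPORTED_VERSION):
    """True if `installed` has a higher release number than `reported`."""
    return release_tuple(installed) > release_tuple(reported)


print(is_newer_than_reported("5.2.0"))  # False
print(is_newer_than_reported("5.3.0"))  # True
```

In practice the argument would be `transformers.__version__`. A True result only shows that the installed release is newer than the one in the report, not that it contains a fix.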

Verify the fix

Step 1: Run the example script again

python examples/pytorch/continuous_batching_simple.py --samples 5

Step 2: Check the output

The script should complete successfully and return generated text for all 5 requests.
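One way to check this from code, rather than by reading the log, is to look for missing request IDs. The sketch below assumes the results are available as a mapping keyed by request ID and that IDs follow the req_0, req_1, ... pattern seen in the log; both are assumptions, and `missing_request_ids` is a hypothetical helper.

```python
def missing_request_ids(results, expected_count, prefix="req_"):
    """Return the expected request IDs that are absent from `results`.

    Assumes `results` is a mapping keyed by request ID (hypothetical shape).
    """
    expected = [f"{prefix}{i}" for i in range(expected_count)]
    return [request_id for request_id in expected if request_id not in results]


# Illustrative data only: an empty result set like the failed run in this
# report, and a complete one.
failed_run = {}
successful_run = {f"req_{i}": f"generated text {i}" for i in range(5)}

print(missing_request_ids(failed_run, 5))
# ['req_0', 'req_1', 'req_2', 'req_3', 'req_4']
print(missing_request_ids(successful_run, 5))  # []
```

An empty list means every expected request produced a result.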

Extra Tips

Make sure to update all dependencies, including transformers, huggingface_hub, and safetensors.
If you're using a virtual environment, make sure to activate it before running the script.
If you're still experiencing issues, try resetting the environment and re-running the script.
