pytorch - [release 2.11][vllm] Kernels Attention Test - test_mha_attn_varlen_forward_flashinfer error [6 comments, 3 participants]

pytorch · 2026-03-10 15:09:10

GitHub stats

pytorch/pytorch#177030 • Fetched 2026-04-08 00:22:36

Author

atalman

Participants

atalman

eqy

zou3519

Timeline (top)

subscribed ×21, mentioned ×11, commented ×6, labeled ×6

Error Message

[2026-03-10T09:05:47Z] E cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89

[2026-03-10T09:05:47Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError
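
The message says the cuDNN THD (ragged offset) layout needs Hopper or newer, and the trailing `89` is most likely the detected compute capability (8.9, an Ada-generation GPU). A minimal sketch of a guard that a test or caller could use is shown below. It assumes that "Hopper and above" means compute capability 9.0 or higher; the helper names `supports_thd_ragged_offset` and `current_cuda_capability` are hypothetical and are not part of vLLM, FlashInfer, or cuDNN.

```python
from typing import Optional, Tuple

# Assumption: "Hopper and above" corresponds to compute capability >= 9.0.
HOPPER_CAPABILITY: Tuple[int, int] = (9, 0)


def supports_thd_ragged_offset(capability: Tuple[int, int]) -> bool:
    """Return True if (major, minor) is Hopper (9.0) or newer."""
    return tuple(capability) >= HOPPER_CAPABILITY


def current_cuda_capability() -> Optional[Tuple[int, int]]:
    """Return the active device's (major, minor), or None without CUDA or PyTorch."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    cap = current_cuda_capability()
    if cap is None:
        print("No CUDA device visible")
    else:
        print(cap, "THD ragged offset supported:", supports_thd_ragged_offset(cap))
```

One possible use, under the same assumption, is to pass the result to `pytest.mark.skipif` so the FlashInfer varlen test is skipped on pre-Hopper runners instead of failing at graph build time. Whether skipping or falling back to another backend is the right fix is for the issue participants to decide.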

Code Example

[2026-03-10T13:06:50Z] =================================== FAILURES ===================================
--
[2026-03-10T13:06:50Z] ______ test_mha_attn_varlen_forward_flashinfer[cuda-dtype0-var_seq_len0] _______
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] default_vllm_config = None, var_seq_len = [2, 2], dtype = torch.bfloat16
[2026-03-10T13:06:50Z] device = 'cuda'
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("var_seq_len", VAR_SEQ_LENS)
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize(
[2026-03-10T13:06:50Z]         "dtype",
[2026-03-10T13:06:50Z]         [torch.bfloat16, torch.half],
[2026-03-10T13:06:50Z]     )
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("device", CUDA_DEVICES)
[2026-03-10T13:06:50Z]     def test_mha_attn_varlen_forward_flashinfer(
[2026-03-10T13:06:50Z]         default_vllm_config,
[2026-03-10T13:06:50Z]         var_seq_len: list[int],
[2026-03-10T13:06:50Z]         dtype: torch.dtype,
[2026-03-10T13:06:50Z]         device: str,
[2026-03-10T13:06:50Z]     ):
[2026-03-10T13:06:50Z]         """Test MMEncoderAttention varlen forward with FLASHINFER backend (head_size=72).
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         Exercises the path that uses --mm-encoder-attn-backend=FLASHINFER with
[2026-03-10T13:06:50Z]         recomputed cu_seqlens, max_seqlen, and sequence_lengths as in qwen3_vl
[2026-03-10T13:06:50Z]         vision encoder.
[2026-03-10T13:06:50Z]         """
[2026-03-10T13:06:50Z]         pytest.importorskip("flashinfer")
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         num_heads = 16
[2026-03-10T13:06:50Z]         head_size = 72
[2026-03-10T13:06:50Z]         set_random_seed(0)
[2026-03-10T13:06:50Z]         torch.set_default_device(device)
[2026-03-10T13:06:50Z]         torch.set_default_dtype(dtype)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         # Override vllm config so get_vit_attn_backend returns FLASHINFER (simulates
[2026-03-10T13:06:50Z]         # --mm-encoder-attn-backend=FLASHINFER).
[2026-03-10T13:06:50Z]         vllm_config = get_current_vllm_config()
[2026-03-10T13:06:50Z]         old_model_config = getattr(vllm_config, "model_config", None)
[2026-03-10T13:06:50Z]         minimal_model_config = type(
[2026-03-10T13:06:50Z]             "MinimalModelConfig",
[2026-03-10T13:06:50Z]             (),
[2026-03-10T13:06:50Z]             {
[2026-03-10T13:06:50Z]                 "multimodal_config": MultiModalConfig(
[2026-03-10T13:06:50Z]                     mm_encoder_attn_backend=AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]                 ),
[2026-03-10T13:06:50Z]             },
[2026-03-10T13:06:50Z]         )()
[2026-03-10T13:06:50Z]         vllm_config.model_config = minimal_model_config
[2026-03-10T13:06:50Z]         try:
[2026-03-10T13:06:50Z]             total_len = sum(var_seq_len)
[2026-03-10T13:06:50Z]             # Stride of second dim = 3 * num_heads * head_size (same as qwen2_5_vl
[2026-03-10T13:06:50Z]             # after qkv rearrange and unbind: qkv shape (b, s, 3, head, head_dim)).
[2026-03-10T13:06:50Z]             qkv = torch.randn(1, total_len, 3, num_heads, head_size)
[2026-03-10T13:06:50Z]             q, k, v = qkv.unbind(dim=2)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = np.array(
[2026-03-10T13:06:50Z]                 [0] + list(itertools.accumulate(var_seq_len)), dtype=np.int32
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             hidden_size = num_heads * head_size
[2026-03-10T13:06:50Z]             tp_size = 1
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             sequence_lengths_np = MMEncoderAttention.maybe_compute_sequence_lengths(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             sequence_lengths = torch.from_numpy(sequence_lengths_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             max_seqlen_val = MMEncoderAttention.compute_max_seqlen(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             max_seqlen = torch.tensor(max_seqlen_val, device=device, dtype=torch.int32)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = MMEncoderAttention.maybe_recompute_cu_seqlens(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER,
[2026-03-10T13:06:50Z]                 cu_seqlens_np,
[2026-03-10T13:06:50Z]                 hidden_size,
[2026-03-10T13:06:50Z]                 tp_size,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             cu_seqlens = torch.from_numpy(cu_seqlens_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             scale = 1.0 / head_size**0.5
[2026-03-10T13:06:50Z]             attn = MMEncoderAttention(
[2026-03-10T13:06:50Z]                 num_heads,
[2026-03-10T13:06:50Z]                 head_size,
[2026-03-10T13:06:50Z]                 scale=scale,
[2026-03-10T13:06:50Z]                 num_kv_heads=num_heads,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             assert attn.attn_backend == AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] >           output = attn(
[2026-03-10T13:06:50Z]                 q,
[2026-03-10T13:06:50Z]                 k,
[2026-03-10T13:06:50Z]                 v,
[2026-03-10T13:06:50Z]                 cu_seqlens=cu_seqlens,
[2026-03-10T13:06:50Z]                 max_seqlen=max_seqlen,
[2026-03-10T13:06:50Z]                 sequence_lengths=sequence_lengths,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] kernels/attention/test_mha_attn.py:331:
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1779: in _wrapped_call_impl
[2026-03-10T13:06:50Z]     return self._call_impl(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1790: in _call_impl
[2026-03-10T13:06:50Z]     return forward_call(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py:129: in forward
[2026-03-10T13:06:50Z]     return self._forward_method(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:389: in forward_cuda
[2026-03-10T13:06:50Z]     return self._forward_flashinfer(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:351: in _forward_flashinfer
[2026-03-10T13:06:50Z]     return vit_flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:356: in vit_flashinfer_wrapper
[2026-03-10T13:06:50Z]     return torch.ops.vllm.flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/utils/_device.py:116: in __torch_function__
[2026-03-10T13:06:50Z]     return func(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:302: in flashinfer_wrapper
[2026-03-10T13:06:50Z]     output, _ = cudnn_batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:664: in cudnn_batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     return _batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:488: in _batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     graph, tensors = _build_prefill_graph(
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] args = ()
[2026-03-10T13:06:50Z] kwargs = {'actual_seq_lens_kv': tensor([[[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] ...[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]]], device='cuda:0', dtype=torch.int32), ...}
[2026-03-10T13:06:50Z] g = {
[2026-03-10T13:06:50Z]     "context": {
[2026-03-10T13:06:50Z]         "compute_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "intermediate_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "io_data_t...E",
[2026-03-10T13:06:50Z]             "stride": [1179648,72,3456,1],
[2026-03-10T13:06:50Z]             "uid": 3,
[2026-03-10T13:06:50Z]             "uid_assigned": true
[2026-03-10T13:06:50Z]         }
[2026-03-10T13:06:50Z]     }
[2026-03-10T13:06:50Z] }
[2026-03-10T13:06:50Z] tensors = [[{"data_type":"BFLOAT16","dim":[8,16,1024,72],"is_pass_by_value":false,"is_virtual":false,"name":"q","pass_by_value":...:"actual_seq_lens_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":101,"uid_assigned":true}]]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @wraps(func)
[2026-03-10T13:06:50Z]     def wrapper(*args, **kwargs):
[2026-03-10T13:06:50Z]         g, tensors = func(*args, **kwargs)  # Get the result
[2026-03-10T13:06:50Z]         if g.get_execution_plan_count() <= 0:
[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError
[2026-03-10T13:06:50Z] ______ test_mha_attn_varlen_forward_flashinfer[cuda-dtype0-var_seq_len1] _______
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] default_vllm_config = None, var_seq_len = [2, 3, 4], dtype = torch.bfloat16
[2026-03-10T13:06:50Z] device = 'cuda'
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("var_seq_len", VAR_SEQ_LENS)
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize(
[2026-03-10T13:06:50Z]         "dtype",
[2026-03-10T13:06:50Z]         [torch.bfloat16, torch.half],
[2026-03-10T13:06:50Z]     )
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("device", CUDA_DEVICES)
[2026-03-10T13:06:50Z]     def test_mha_attn_varlen_forward_flashinfer(
[2026-03-10T13:06:50Z]         default_vllm_config,
[2026-03-10T13:06:50Z]         var_seq_len: list[int],
[2026-03-10T13:06:50Z]         dtype: torch.dtype,
[2026-03-10T13:06:50Z]         device: str,
[2026-03-10T13:06:50Z]     ):
[2026-03-10T13:06:50Z]         """Test MMEncoderAttention varlen forward with FLASHINFER backend (head_size=72).
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         Exercises the path that uses --mm-encoder-attn-backend=FLASHINFER with
[2026-03-10T13:06:50Z]         recomputed cu_seqlens, max_seqlen, and sequence_lengths as in qwen3_vl
[2026-03-10T13:06:50Z]         vision encoder.
[2026-03-10T13:06:50Z]         """
[2026-03-10T13:06:50Z]         pytest.importorskip("flashinfer")
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         num_heads = 16
[2026-03-10T13:06:50Z]         head_size = 72
[2026-03-10T13:06:50Z]         set_random_seed(0)
[2026-03-10T13:06:50Z]         torch.set_default_device(device)
[2026-03-10T13:06:50Z]         torch.set_default_dtype(dtype)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         # Override vllm config so get_vit_attn_backend returns FLASHINFER (simulates
[2026-03-10T13:06:50Z]         # --mm-encoder-attn-backend=FLASHINFER).
[2026-03-10T13:06:50Z]         vllm_config = get_current_vllm_config()
[2026-03-10T13:06:50Z]         old_model_config = getattr(vllm_config, "model_config", None)
[2026-03-10T13:06:50Z]         minimal_model_config = type(
[2026-03-10T13:06:50Z]             "MinimalModelConfig",
[2026-03-10T13:06:50Z]             (),
[2026-03-10T13:06:50Z]             {
[2026-03-10T13:06:50Z]                 "multimodal_config": MultiModalConfig(
[2026-03-10T13:06:50Z]                     mm_encoder_attn_backend=AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]                 ),
[2026-03-10T13:06:50Z]             },
[2026-03-10T13:06:50Z]         )()
[2026-03-10T13:06:50Z]         vllm_config.model_config = minimal_model_config
[2026-03-10T13:06:50Z]         try:
[2026-03-10T13:06:50Z]             total_len = sum(var_seq_len)
[2026-03-10T13:06:50Z]             # Stride of second dim = 3 * num_heads * head_size (same as qwen2_5_vl
[2026-03-10T13:06:50Z]             # after qkv rearrange and unbind: qkv shape (b, s, 3, head, head_dim)).
[2026-03-10T13:06:50Z]             qkv = torch.randn(1, total_len, 3, num_heads, head_size)
[2026-03-10T13:06:50Z]             q, k, v = qkv.unbind(dim=2)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = np.array(
[2026-03-10T13:06:50Z]                 [0] + list(itertools.accumulate(var_seq_len)), dtype=np.int32
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             hidden_size = num_heads * head_size
[2026-03-10T13:06:50Z]             tp_size = 1
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             sequence_lengths_np = MMEncoderAttention.maybe_compute_sequence_lengths(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             sequence_lengths = torch.from_numpy(sequence_lengths_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             max_seqlen_val = MMEncoderAttention.compute_max_seqlen(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             max_seqlen = torch.tensor(max_seqlen_val, device=device, dtype=torch.int32)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = MMEncoderAttention.maybe_recompute_cu_seqlens(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER,
[2026-03-10T13:06:50Z]                 cu_seqlens_np,
[2026-03-10T13:06:50Z]                 hidden_size,
[2026-03-10T13:06:50Z]                 tp_size,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             cu_seqlens = torch.from_numpy(cu_seqlens_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             scale = 1.0 / head_size**0.5
[2026-03-10T13:06:50Z]             attn = MMEncoderAttention(
[2026-03-10T13:06:50Z]                 num_heads,
[2026-03-10T13:06:50Z]                 head_size,
[2026-03-10T13:06:50Z]                 scale=scale,
[2026-03-10T13:06:50Z]                 num_kv_heads=num_heads,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             assert attn.attn_backend == AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] >           output = attn(
[2026-03-10T13:06:50Z]                 q,
[2026-03-10T13:06:50Z]                 k,
[2026-03-10T13:06:50Z]                 v,
[2026-03-10T13:06:50Z]                 cu_seqlens=cu_seqlens,
[2026-03-10T13:06:50Z]                 max_seqlen=max_seqlen,
[2026-03-10T13:06:50Z]                 sequence_lengths=sequence_lengths,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] kernels/attention/test_mha_attn.py:331:
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1779: in _wrapped_call_impl
[2026-03-10T13:06:50Z]     return self._call_impl(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1790: in _call_impl
[2026-03-10T13:06:50Z]     return forward_call(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py:129: in forward
[2026-03-10T13:06:50Z]     return self._forward_method(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:389: in forward_cuda
[2026-03-10T13:06:50Z]     return self._forward_flashinfer(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:351: in _forward_flashinfer
[2026-03-10T13:06:50Z]     return vit_flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:356: in vit_flashinfer_wrapper
[2026-03-10T13:06:50Z]     return torch.ops.vllm.flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/utils/_device.py:116: in __torch_function__
[2026-03-10T13:06:50Z]     return func(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:302: in flashinfer_wrapper
[2026-03-10T13:06:50Z]     output, _ = cudnn_batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:664: in cudnn_batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     return _batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:488: in _batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     graph, tensors = _build_prefill_graph(
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] args = ()
[2026-03-10T13:06:50Z] kwargs = {'actual_seq_lens_kv': tensor([[[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[3]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] ...368]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[10368]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[10368]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[10368]]]], device='cuda:0', dtype=torch.int32), ...}
[2026-03-10T13:06:50Z] g = {
[2026-03-10T13:06:50Z]     "context": {
[2026-03-10T13:06:50Z]         "compute_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "intermediate_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "io_data_t...E",
[2026-03-10T13:06:50Z]             "stride": [1179648,72,3456,1],
[2026-03-10T13:06:50Z]             "uid": 3,
[2026-03-10T13:06:50Z]             "uid_assigned": true
[2026-03-10T13:06:50Z]         }
[2026-03-10T13:06:50Z]     }
[2026-03-10T13:06:50Z] }
[2026-03-10T13:06:50Z] tensors = [[{"data_type":"BFLOAT16","dim":[8,16,1024,72],"is_pass_by_value":false,"is_virtual":false,"name":"q","pass_by_value":...:"actual_seq_lens_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":101,"uid_assigned":true}]]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @wraps(func)
[2026-03-10T13:06:50Z]     def wrapper(*args, **kwargs):
[2026-03-10T13:06:50Z]         g, tensors = func(*args, **kwargs)  # Get the result
[2026-03-10T13:06:50Z]         if g.get_execution_plan_count() <= 0:
[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError
[2026-03-10T13:06:50Z] ______ test_mha_attn_varlen_forward_flashinfer[cuda-dtype1-var_seq_len0] _______
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] default_vllm_config = None, var_seq_len = [2, 2], dtype = torch.float16
[2026-03-10T13:06:50Z] device = 'cuda'
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("var_seq_len", VAR_SEQ_LENS)
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize(
[2026-03-10T13:06:50Z]         "dtype",
[2026-03-10T13:06:50Z]         [torch.bfloat16, torch.half],
[2026-03-10T13:06:50Z]     )
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("device", CUDA_DEVICES)
[2026-03-10T13:06:50Z]     def test_mha_attn_varlen_forward_flashinfer(
[2026-03-10T13:06:50Z]         default_vllm_config,
[2026-03-10T13:06:50Z]         var_seq_len: list[int],
[2026-03-10T13:06:50Z]         dtype: torch.dtype,
[2026-03-10T13:06:50Z]         device: str,
[2026-03-10T13:06:50Z]     ):
[2026-03-10T13:06:50Z]         """Test MMEncoderAttention varlen forward with FLASHINFER backend (head_size=72).
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         Exercises the path that uses --mm-encoder-attn-backend=FLASHINFER with
[2026-03-10T13:06:50Z]         recomputed cu_seqlens, max_seqlen, and sequence_lengths as in qwen3_vl
[2026-03-10T13:06:50Z]         vision encoder.
[2026-03-10T13:06:50Z]         """
[2026-03-10T13:06:50Z]         pytest.importorskip("flashinfer")
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         num_heads = 16
[2026-03-10T13:06:50Z]         head_size = 72
[2026-03-10T13:06:50Z]         set_random_seed(0)
[2026-03-10T13:06:50Z]         torch.set_default_device(device)
[2026-03-10T13:06:50Z]         torch.set_default_dtype(dtype)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         # Override vllm config so get_vit_attn_backend returns FLASHINFER (simulates
[2026-03-10T13:06:50Z]         # --mm-encoder-attn-backend=FLASHINFER).
[2026-03-10T13:06:50Z]         vllm_config = get_current_vllm_config()
[2026-03-10T13:06:50Z]         old_model_config = getattr(vllm_config, "model_config", None)
[2026-03-10T13:06:50Z]         minimal_model_config = type(
[2026-03-10T13:06:50Z]             "MinimalModelConfig",
[2026-03-10T13:06:50Z]             (),
[2026-03-10T13:06:50Z]             {
[2026-03-10T13:06:50Z]                 "multimodal_config": MultiModalConfig(
[2026-03-10T13:06:50Z]                     mm_encoder_attn_backend=AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]                 ),
[2026-03-10T13:06:50Z]             },
[2026-03-10T13:06:50Z]         )()
[2026-03-10T13:06:50Z]         vllm_config.model_config = minimal_model_config
[2026-03-10T13:06:50Z]         try:
[2026-03-10T13:06:50Z]             total_len = sum(var_seq_len)
[2026-03-10T13:06:50Z]             # Stride of second dim = 3 * num_heads * head_size (same as qwen2_5_vl
[2026-03-10T13:06:50Z]             # after qkv rearrange and unbind: qkv shape (b, s, 3, head, head_dim)).
[2026-03-10T13:06:50Z]             qkv = torch.randn(1, total_len, 3, num_heads, head_size)
[2026-03-10T13:06:50Z]             q, k, v = qkv.unbind(dim=2)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = np.array(
[2026-03-10T13:06:50Z]                 [0] + list(itertools.accumulate(var_seq_len)), dtype=np.int32
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             hidden_size = num_heads * head_size
[2026-03-10T13:06:50Z]             tp_size = 1
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             sequence_lengths_np = MMEncoderAttention.maybe_compute_sequence_lengths(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             sequence_lengths = torch.from_numpy(sequence_lengths_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             max_seqlen_val = MMEncoderAttention.compute_max_seqlen(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             max_seqlen = torch.tensor(max_seqlen_val, device=device, dtype=torch.int32)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = MMEncoderAttention.maybe_recompute_cu_seqlens(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER,
[2026-03-10T13:06:50Z]                 cu_seqlens_np,
[2026-03-10T13:06:50Z]                 hidden_size,
[2026-03-10T13:06:50Z]                 tp_size,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             cu_seqlens = torch.from_numpy(cu_seqlens_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             scale = 1.0 / head_size**0.5
[2026-03-10T13:06:50Z]             attn = MMEncoderAttention(
[2026-03-10T13:06:50Z]                 num_heads,
[2026-03-10T13:06:50Z]                 head_size,
[2026-03-10T13:06:50Z]                 scale=scale,
[2026-03-10T13:06:50Z]                 num_kv_heads=num_heads,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             assert attn.attn_backend == AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] >           output = attn(
[2026-03-10T13:06:50Z]                 q,
[2026-03-10T13:06:50Z]                 k,
[2026-03-10T13:06:50Z]                 v,
[2026-03-10T13:06:50Z]                 cu_seqlens=cu_seqlens,
[2026-03-10T13:06:50Z]                 max_seqlen=max_seqlen,
[2026-03-10T13:06:50Z]                 sequence_lengths=sequence_lengths,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] kernels/attention/test_mha_attn.py:331:
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1779: in _wrapped_call_impl
[2026-03-10T13:06:50Z]     return self._call_impl(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1790: in _call_impl
[2026-03-10T13:06:50Z]     return forward_call(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py:129: in forward
[2026-03-10T13:06:50Z]     return self._forward_method(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:389: in forward_cuda
[2026-03-10T13:06:50Z]     return self._forward_flashinfer(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:351: in _forward_flashinfer
[2026-03-10T13:06:50Z]     return vit_flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:356: in vit_flashinfer_wrapper
[2026-03-10T13:06:50Z]     return torch.ops.vllm.flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/utils/_device.py:116: in __torch_function__
[2026-03-10T13:06:50Z]     return func(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:302: in flashinfer_wrapper
[2026-03-10T13:06:50Z]     output, _ = cudnn_batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:664: in cudnn_batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     return _batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:488: in _batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     graph, tensors = _build_prefill_graph(
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] args = ()
[2026-03-10T13:06:50Z] kwargs = {'actual_seq_lens_kv': tensor([[[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] ...[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]]], device='cuda:0', dtype=torch.int32), ...}
[2026-03-10T13:06:50Z] g = {
[2026-03-10T13:06:50Z]     "context": {
[2026-03-10T13:06:50Z]         "compute_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "intermediate_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "io_data_t...E",
[2026-03-10T13:06:50Z]             "stride": [1179648,72,3456,1],
[2026-03-10T13:06:50Z]             "uid": 3,
[2026-03-10T13:06:50Z]             "uid_assigned": true
[2026-03-10T13:06:50Z]         }
[2026-03-10T13:06:50Z]     }
[2026-03-10T13:06:50Z] }
[2026-03-10T13:06:50Z] tensors = [[{"data_type":"HALF","dim":[8,16,1024,72],"is_pass_by_value":false,"is_virtual":false,"name":"q","pass_by_value":null...:"actual_seq_lens_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":101,"uid_assigned":true}]]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @wraps(func)
[2026-03-10T13:06:50Z]     def wrapper(*args, **kwargs):
[2026-03-10T13:06:50Z]         g, tensors = func(*args, **kwargs)  # Get the result
[2026-03-10T13:06:50Z]         if g.get_execution_plan_count() <= 0:
[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError


09.03.2026 Failure: https://buildkite.com/vllm/ci/builds/55400#019cd7be-8424-4732-8642-71b2ead336fb

[2026-03-10T09:05:47Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
--
[2026-03-10T09:05:47Z]
[2026-03-10T09:05:47Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError


[2026-03-10T13:06:50Z] =================================== FAILURES ===================================
--
[2026-03-10T13:06:50Z] ______ test_mha_attn_varlen_forward_flashinfer[cuda-dtype0-var_seq_len0] _______
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] default_vllm_config = None, var_seq_len = [2, 2], dtype = torch.bfloat16
[2026-03-10T13:06:50Z] device = 'cuda'
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("var_seq_len", VAR_SEQ_LENS)
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize(
[2026-03-10T13:06:50Z]         "dtype",
[2026-03-10T13:06:50Z]         [torch.bfloat16, torch.half],
[2026-03-10T13:06:50Z]     )
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("device", CUDA_DEVICES)
[2026-03-10T13:06:50Z]     def test_mha_attn_varlen_forward_flashinfer(
[2026-03-10T13:06:50Z]         default_vllm_config,
[2026-03-10T13:06:50Z]         var_seq_len: list[int],
[2026-03-10T13:06:50Z]         dtype: torch.dtype,
[2026-03-10T13:06:50Z]         device: str,
[2026-03-10T13:06:50Z]     ):
[2026-03-10T13:06:50Z]         """Test MMEncoderAttention varlen forward with FLASHINFER backend (head_size=72).
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         Exercises the path that uses --mm-encoder-attn-backend=FLASHINFER with
[2026-03-10T13:06:50Z]         recomputed cu_seqlens, max_seqlen, and sequence_lengths as in qwen3_vl
[2026-03-10T13:06:50Z]         vision encoder.
[2026-03-10T13:06:50Z]         """
[2026-03-10T13:06:50Z]         pytest.importorskip("flashinfer")
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         num_heads = 16
[2026-03-10T13:06:50Z]         head_size = 72
[2026-03-10T13:06:50Z]         set_random_seed(0)
[2026-03-10T13:06:50Z]         torch.set_default_device(device)
[2026-03-10T13:06:50Z]         torch.set_default_dtype(dtype)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         # Override vllm config so get_vit_attn_backend returns FLASHINFER (simulates
[2026-03-10T13:06:50Z]         # --mm-encoder-attn-backend=FLASHINFER).
[2026-03-10T13:06:50Z]         vllm_config = get_current_vllm_config()
[2026-03-10T13:06:50Z]         old_model_config = getattr(vllm_config, "model_config", None)
[2026-03-10T13:06:50Z]         minimal_model_config = type(
[2026-03-10T13:06:50Z]             "MinimalModelConfig",
[2026-03-10T13:06:50Z]             (),
[2026-03-10T13:06:50Z]             {
[2026-03-10T13:06:50Z]                 "multimodal_config": MultiModalConfig(
[2026-03-10T13:06:50Z]                     mm_encoder_attn_backend=AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]                 ),
[2026-03-10T13:06:50Z]             },
[2026-03-10T13:06:50Z]         )()
[2026-03-10T13:06:50Z]         vllm_config.model_config = minimal_model_config
[2026-03-10T13:06:50Z]         try:
[2026-03-10T13:06:50Z]             total_len = sum(var_seq_len)
[2026-03-10T13:06:50Z]             # Stride of second dim = 3 * num_heads * head_size (same as qwen2_5_vl
[2026-03-10T13:06:50Z]             # after qkv rearrange and unbind: qkv shape (b, s, 3, head, head_dim)).
[2026-03-10T13:06:50Z]             qkv = torch.randn(1, total_len, 3, num_heads, head_size)
[2026-03-10T13:06:50Z]             q, k, v = qkv.unbind(dim=2)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = np.array(
[2026-03-10T13:06:50Z]                 [0] + list(itertools.accumulate(var_seq_len)), dtype=np.int32
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             hidden_size = num_heads * head_size
[2026-03-10T13:06:50Z]             tp_size = 1
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             sequence_lengths_np = MMEncoderAttention.maybe_compute_sequence_lengths(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             sequence_lengths = torch.from_numpy(sequence_lengths_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             max_seqlen_val = MMEncoderAttention.compute_max_seqlen(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             max_seqlen = torch.tensor(max_seqlen_val, device=device, dtype=torch.int32)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = MMEncoderAttention.maybe_recompute_cu_seqlens(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER,
[2026-03-10T13:06:50Z]                 cu_seqlens_np,
[2026-03-10T13:06:50Z]                 hidden_size,
[2026-03-10T13:06:50Z]                 tp_size,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             cu_seqlens = torch.from_numpy(cu_seqlens_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             scale = 1.0 / head_size**0.5
[2026-03-10T13:06:50Z]             attn = MMEncoderAttention(
[2026-03-10T13:06:50Z]                 num_heads,
[2026-03-10T13:06:50Z]                 head_size,
[2026-03-10T13:06:50Z]                 scale=scale,
[2026-03-10T13:06:50Z]                 num_kv_heads=num_heads,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             assert attn.attn_backend == AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] >           output = attn(
[2026-03-10T13:06:50Z]                 q,
[2026-03-10T13:06:50Z]                 k,
[2026-03-10T13:06:50Z]                 v,
[2026-03-10T13:06:50Z]                 cu_seqlens=cu_seqlens,
[2026-03-10T13:06:50Z]                 max_seqlen=max_seqlen,
[2026-03-10T13:06:50Z]                 sequence_lengths=sequence_lengths,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] kernels/attention/test_mha_attn.py:331:
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1779: in _wrapped_call_impl
[2026-03-10T13:06:50Z]     return self._call_impl(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1790: in _call_impl
[2026-03-10T13:06:50Z]     return forward_call(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py:129: in forward
[2026-03-10T13:06:50Z]     return self._forward_method(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:389: in forward_cuda
[2026-03-10T13:06:50Z]     return self._forward_flashinfer(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:351: in _forward_flashinfer
[2026-03-10T13:06:50Z]     return vit_flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:356: in vit_flashinfer_wrapper
[2026-03-10T13:06:50Z]     return torch.ops.vllm.flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/utils/_device.py:116: in __torch_function__
[2026-03-10T13:06:50Z]     return func(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:302: in flashinfer_wrapper
[2026-03-10T13:06:50Z]     output, _ = cudnn_batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:664: in cudnn_batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     return _batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:488: in _batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     graph, tensors = _build_prefill_graph(
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] args = ()
[2026-03-10T13:06:50Z] kwargs = {'actual_seq_lens_kv': tensor([[[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] ...[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]]], device='cuda:0', dtype=torch.int32), ...}
[2026-03-10T13:06:50Z] g = {
[2026-03-10T13:06:50Z]     "context": {
[2026-03-10T13:06:50Z]         "compute_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "intermediate_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "io_data_t...E",
[2026-03-10T13:06:50Z]             "stride": [1179648,72,3456,1],
[2026-03-10T13:06:50Z]             "uid": 3,
[2026-03-10T13:06:50Z]             "uid_assigned": true
[2026-03-10T13:06:50Z]         }
[2026-03-10T13:06:50Z]     }
[2026-03-10T13:06:50Z] }
[2026-03-10T13:06:50Z] tensors = [[{"data_type":"BFLOAT16","dim":[8,16,1024,72],"is_pass_by_value":false,"is_virtual":false,"name":"q","pass_by_value":...:"actual_seq_lens_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":101,"uid_assigned":true}]]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @wraps(func)
[2026-03-10T13:06:50Z]     def wrapper(*args, **kwargs):
[2026-03-10T13:06:50Z]         g, tensors = func(*args, **kwargs)  # Get the result
[2026-03-10T13:06:50Z]         if g.get_execution_plan_count() <= 0:
[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError
[2026-03-10T13:06:50Z] ______ test_mha_attn_varlen_forward_flashinfer[cuda-dtype0-var_seq_len1] _______
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] default_vllm_config = None, var_seq_len = [2, 3, 4], dtype = torch.bfloat16
[2026-03-10T13:06:50Z] device = 'cuda'
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("var_seq_len", VAR_SEQ_LENS)
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize(
[2026-03-10T13:06:50Z]         "dtype",
[2026-03-10T13:06:50Z]         [torch.bfloat16, torch.half],
[2026-03-10T13:06:50Z]     )
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("device", CUDA_DEVICES)
[2026-03-10T13:06:50Z]     def test_mha_attn_varlen_forward_flashinfer(
[2026-03-10T13:06:50Z]         default_vllm_config,
[2026-03-10T13:06:50Z]         var_seq_len: list[int],
[2026-03-10T13:06:50Z]         dtype: torch.dtype,
[2026-03-10T13:06:50Z]         device: str,
[2026-03-10T13:06:50Z]     ):
[2026-03-10T13:06:50Z]         """Test MMEncoderAttention varlen forward with FLASHINFER backend (head_size=72).
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         Exercises the path that uses --mm-encoder-attn-backend=FLASHINFER with
[2026-03-10T13:06:50Z]         recomputed cu_seqlens, max_seqlen, and sequence_lengths as in qwen3_vl
[2026-03-10T13:06:50Z]         vision encoder.
[2026-03-10T13:06:50Z]         """
[2026-03-10T13:06:50Z]         pytest.importorskip("flashinfer")
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         num_heads = 16
[2026-03-10T13:06:50Z]         head_size = 72
[2026-03-10T13:06:50Z]         set_random_seed(0)
[2026-03-10T13:06:50Z]         torch.set_default_device(device)
[2026-03-10T13:06:50Z]         torch.set_default_dtype(dtype)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         # Override vllm config so get_vit_attn_backend returns FLASHINFER (simulates
[2026-03-10T13:06:50Z]         # --mm-encoder-attn-backend=FLASHINFER).
[2026-03-10T13:06:50Z]         vllm_config = get_current_vllm_config()
[2026-03-10T13:06:50Z]         old_model_config = getattr(vllm_config, "model_config", None)
[2026-03-10T13:06:50Z]         minimal_model_config = type(
[2026-03-10T13:06:50Z]             "MinimalModelConfig",
[2026-03-10T13:06:50Z]             (),
[2026-03-10T13:06:50Z]             {
[2026-03-10T13:06:50Z]                 "multimodal_config": MultiModalConfig(
[2026-03-10T13:06:50Z]                     mm_encoder_attn_backend=AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]                 ),
[2026-03-10T13:06:50Z]             },
[2026-03-10T13:06:50Z]         )()
[2026-03-10T13:06:50Z]         vllm_config.model_config = minimal_model_config
[2026-03-10T13:06:50Z]         try:
[2026-03-10T13:06:50Z]             total_len = sum(var_seq_len)
[2026-03-10T13:06:50Z]             # Stride of second dim = 3 * num_heads * head_size (same as qwen2_5_vl
[2026-03-10T13:06:50Z]             # after qkv rearrange and unbind: qkv shape (b, s, 3, head, head_dim)).
[2026-03-10T13:06:50Z]             qkv = torch.randn(1, total_len, 3, num_heads, head_size)
[2026-03-10T13:06:50Z]             q, k, v = qkv.unbind(dim=2)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = np.array(
[2026-03-10T13:06:50Z]                 [0] + list(itertools.accumulate(var_seq_len)), dtype=np.int32
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             hidden_size = num_heads * head_size
[2026-03-10T13:06:50Z]             tp_size = 1
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             sequence_lengths_np = MMEncoderAttention.maybe_compute_sequence_lengths(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             sequence_lengths = torch.from_numpy(sequence_lengths_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             max_seqlen_val = MMEncoderAttention.compute_max_seqlen(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             max_seqlen = torch.tensor(max_seqlen_val, device=device, dtype=torch.int32)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = MMEncoderAttention.maybe_recompute_cu_seqlens(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER,
[2026-03-10T13:06:50Z]                 cu_seqlens_np,
[2026-03-10T13:06:50Z]                 hidden_size,
[2026-03-10T13:06:50Z]                 tp_size,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             cu_seqlens = torch.from_numpy(cu_seqlens_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             scale = 1.0 / head_size**0.5
[2026-03-10T13:06:50Z]             attn = MMEncoderAttention(
[2026-03-10T13:06:50Z]                 num_heads,
[2026-03-10T13:06:50Z]                 head_size,
[2026-03-10T13:06:50Z]                 scale=scale,
[2026-03-10T13:06:50Z]                 num_kv_heads=num_heads,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             assert attn.attn_backend == AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] >           output = attn(
[2026-03-10T13:06:50Z]                 q,
[2026-03-10T13:06:50Z]                 k,
[2026-03-10T13:06:50Z]                 v,
[2026-03-10T13:06:50Z]                 cu_seqlens=cu_seqlens,
[2026-03-10T13:06:50Z]                 max_seqlen=max_seqlen,
[2026-03-10T13:06:50Z]                 sequence_lengths=sequence_lengths,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] kernels/attention/test_mha_attn.py:331:
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1779: in _wrapped_call_impl
[2026-03-10T13:06:50Z]     return self._call_impl(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1790: in _call_impl
[2026-03-10T13:06:50Z]     return forward_call(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py:129: in forward
[2026-03-10T13:06:50Z]     return self._forward_method(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:389: in forward_cuda
[2026-03-10T13:06:50Z]     return self._forward_flashinfer(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:351: in _forward_flashinfer
[2026-03-10T13:06:50Z]     return vit_flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:356: in vit_flashinfer_wrapper
[2026-03-10T13:06:50Z]     return torch.ops.vllm.flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/utils/_device.py:116: in __torch_function__
[2026-03-10T13:06:50Z]     return func(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:302: in flashinfer_wrapper
[2026-03-10T13:06:50Z]     output, _ = cudnn_batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:664: in cudnn_batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     return _batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:488: in _batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     graph, tensors = _build_prefill_graph(
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] args = ()
[2026-03-10T13:06:50Z] kwargs = {'actual_seq_lens_kv': tensor([[[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[3]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] ...368]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[10368]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[10368]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[10368]]]], device='cuda:0', dtype=torch.int32), ...}
[2026-03-10T13:06:50Z] g = {
[2026-03-10T13:06:50Z]     "context": {
[2026-03-10T13:06:50Z]         "compute_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "intermediate_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "io_data_t...E",
[2026-03-10T13:06:50Z]             "stride": [1179648,72,3456,1],
[2026-03-10T13:06:50Z]             "uid": 3,
[2026-03-10T13:06:50Z]             "uid_assigned": true
[2026-03-10T13:06:50Z]         }
[2026-03-10T13:06:50Z]     }
[2026-03-10T13:06:50Z] }
[2026-03-10T13:06:50Z] tensors = [[{"data_type":"BFLOAT16","dim":[8,16,1024,72],"is_pass_by_value":false,"is_virtual":false,"name":"q","pass_by_value":...:"actual_seq_lens_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":101,"uid_assigned":true}]]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @wraps(func)
[2026-03-10T13:06:50Z]     def wrapper(*args, **kwargs):
[2026-03-10T13:06:50Z]         g, tensors = func(*args, **kwargs)  # Get the result
[2026-03-10T13:06:50Z]         if g.get_execution_plan_count() <= 0:
[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError
[2026-03-10T13:06:50Z] ______ test_mha_attn_varlen_forward_flashinfer[cuda-dtype1-var_seq_len0] _______
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] default_vllm_config = None, var_seq_len = [2, 2], dtype = torch.float16
[2026-03-10T13:06:50Z] device = 'cuda'
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("var_seq_len", VAR_SEQ_LENS)
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize(
[2026-03-10T13:06:50Z]         "dtype",
[2026-03-10T13:06:50Z]         [torch.bfloat16, torch.half],
[2026-03-10T13:06:50Z]     )
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("device", CUDA_DEVICES)
[2026-03-10T13:06:50Z]     def test_mha_attn_varlen_forward_flashinfer(
[2026-03-10T13:06:50Z]         default_vllm_config,
[2026-03-10T13:06:50Z]         var_seq_len: list[int],
[2026-03-10T13:06:50Z]         dtype: torch.dtype,
[2026-03-10T13:06:50Z]         device: str,
[2026-03-10T13:06:50Z]     ):
[2026-03-10T13:06:50Z]         """Test MMEncoderAttention varlen forward with FLASHINFER backend (head_size=72).
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         Exercises the path that uses --mm-encoder-attn-backend=FLASHINFER with
[2026-03-10T13:06:50Z]         recomputed cu_seqlens, max_seqlen, and sequence_lengths as in qwen3_vl
[2026-03-10T13:06:50Z]         vision encoder.
[2026-03-10T13:06:50Z]         """
[2026-03-10T13:06:50Z]         pytest.importorskip("flashinfer")
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         num_heads = 16
[2026-03-10T13:06:50Z]         head_size = 72
[2026-03-10T13:06:50Z]         set_random_seed(0)
[2026-03-10T13:06:50Z]         torch.set_default_device(device)
[2026-03-10T13:06:50Z]         torch.set_default_dtype(dtype)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         # Override vllm config so get_vit_attn_backend returns FLASHINFER (simulates
[2026-03-10T13:06:50Z]         # --mm-encoder-attn-backend=FLASHINFER).
[2026-03-10T13:06:50Z]         vllm_config = get_current_vllm_config()
[2026-03-10T13:06:50Z]         old_model_config = getattr(vllm_config, "model_config", None)
[2026-03-10T13:06:50Z]         minimal_model_config = type(
[2026-03-10T13:06:50Z]             "MinimalModelConfig",
[2026-03-10T13:06:50Z]             (),
[2026-03-10T13:06:50Z]             {
[2026-03-10T13:06:50Z]                 "multimodal_config": MultiModalConfig(
[2026-03-10T13:06:50Z]                     mm_encoder_attn_backend=AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]                 ),
[2026-03-10T13:06:50Z]             },
[2026-03-10T13:06:50Z]         )()
[2026-03-10T13:06:50Z]         vllm_config.model_config = minimal_model_config
[2026-03-10T13:06:50Z]         try:
[2026-03-10T13:06:50Z]             total_len = sum(var_seq_len)
[2026-03-10T13:06:50Z]             # Stride of second dim = 3 * num_heads * head_size (same as qwen2_5_vl
[2026-03-10T13:06:50Z]             # after qkv rearrange and unbind: qkv shape (b, s, 3, head, head_dim)).
[2026-03-10T13:06:50Z]             qkv = torch.randn(1, total_len, 3, num_heads, head_size)
[2026-03-10T13:06:50Z]             q, k, v = qkv.unbind(dim=2)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = np.array(
[2026-03-10T13:06:50Z]                 [0] + list(itertools.accumulate(var_seq_len)), dtype=np.int32
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             hidden_size = num_heads * head_size
[2026-03-10T13:06:50Z]             tp_size = 1
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             sequence_lengths_np = MMEncoderAttention.maybe_compute_sequence_lengths(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             sequence_lengths = torch.from_numpy(sequence_lengths_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             max_seqlen_val = MMEncoderAttention.compute_max_seqlen(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             max_seqlen = torch.tensor(max_seqlen_val, device=device, dtype=torch.int32)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = MMEncoderAttention.maybe_recompute_cu_seqlens(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER,
[2026-03-10T13:06:50Z]                 cu_seqlens_np,
[2026-03-10T13:06:50Z]                 hidden_size,
[2026-03-10T13:06:50Z]                 tp_size,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             cu_seqlens = torch.from_numpy(cu_seqlens_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             scale = 1.0 / head_size**0.5
[2026-03-10T13:06:50Z]             attn = MMEncoderAttention(
[2026-03-10T13:06:50Z]                 num_heads,
[2026-03-10T13:06:50Z]                 head_size,
[2026-03-10T13:06:50Z]                 scale=scale,
[2026-03-10T13:06:50Z]                 num_kv_heads=num_heads,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             assert attn.attn_backend == AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] >           output = attn(
[2026-03-10T13:06:50Z]                 q,
[2026-03-10T13:06:50Z]                 k,
[2026-03-10T13:06:50Z]                 v,
[2026-03-10T13:06:50Z]                 cu_seqlens=cu_seqlens,
[2026-03-10T13:06:50Z]                 max_seqlen=max_seqlen,
[2026-03-10T13:06:50Z]                 sequence_lengths=sequence_lengths,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] kernels/attention/test_mha_attn.py:331:
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1779: in _wrapped_call_impl
[2026-03-10T13:06:50Z]     return self._call_impl(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1790: in _call_impl
[2026-03-10T13:06:50Z]     return forward_call(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py:129: in forward
[2026-03-10T13:06:50Z]     return self._forward_method(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:389: in forward_cuda
[2026-03-10T13:06:50Z]     return self._forward_flashinfer(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:351: in _forward_flashinfer
[2026-03-10T13:06:50Z]     return vit_flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:356: in vit_flashinfer_wrapper
[2026-03-10T13:06:50Z]     return torch.ops.vllm.flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/utils/_device.py:116: in __torch_function__
[2026-03-10T13:06:50Z]     return func(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:302: in flashinfer_wrapper
[2026-03-10T13:06:50Z]     output, _ = cudnn_batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:664: in cudnn_batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     return _batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:488: in _batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     graph, tensors = _build_prefill_graph(
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] args = ()
[2026-03-10T13:06:50Z] kwargs = {'actual_seq_lens_kv': tensor([[[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] ...[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]]], device='cuda:0', dtype=torch.int32), ...}
[2026-03-10T13:06:50Z] g = {
[2026-03-10T13:06:50Z]     "context": {
[2026-03-10T13:06:50Z]         "compute_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "intermediate_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "io_data_t...E",
[2026-03-10T13:06:50Z]             "stride": [1179648,72,3456,1],
[2026-03-10T13:06:50Z]             "uid": 3,
[2026-03-10T13:06:50Z]             "uid_assigned": true
[2026-03-10T13:06:50Z]         }
[2026-03-10T13:06:50Z]     }
[2026-03-10T13:06:50Z] }
[2026-03-10T13:06:50Z] tensors = [[{"data_type":"HALF","dim":[8,16,1024,72],"is_pass_by_value":false,"is_virtual":false,"name":"q","pass_by_value":null...:"actual_seq_lens_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":101,"uid_assigned":true}]]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @wraps(func)
[2026-03-10T13:06:50Z]     def wrapper(*args, **kwargs):
[2026-03-10T13:06:50Z]         g, tensors = func(*args, **kwargs)  # Get the result
[2026-03-10T13:06:50Z]         if g.get_execution_plan_count() <= 0:
[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @seemethere @malfet @pytorch/pytorch-dev-infra @mruberry

extent analysis

Fix Plan

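The error message cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89 is raised by cuDNN's graph API when flashinfer's cudnn_batch_prefill_with_kv_cache calls g.build(). It is unrelated to CUDA graphs (torch.cuda.graphs). cuDNN rejects the attention graph because the THD (ragged offset) layout used for variable-length sequences requires a Hopper-class GPU. The trailing 89 most likely reports the detected compute capability, 8.9 (Ada Lovelace generation), which is below Hopper's 9.0.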
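The trailing number in the message can be decoded with a few lines of Python. This is a minimal sketch that assumes the number after the last colon is the compute capability with major and minor digits concatenated (so 89 means 8.9); the function name parse_arch_from_error is hypothetical and not part of cuDNN or PyTorch.

```python
import re


def parse_arch_from_error(message):
    """Extract (major, minor) from a trailing ': NN' in the cuDNN error text."""
    match = re.search(r":\s*(\d{2,3})\s*$", message)
    if match is None:
        return None
    code = int(match.group(1))
    # Assumption: the last digit is the minor version, the rest is the major.
    return divmod(code, 10)


msg = (
    "cudnn._compiled_module.cudnnGraphNotSupportedError: "
    "THD (ragged offset) is only supported in Hopper and above : 89"
)
print(parse_arch_from_error(msg))  # (8, 9)
```

A result below (9, 0) would be consistent with the graph being rejected on a pre-Hopper GPU.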

To fix this issue, you can try the following steps:

Run on a Hopper or Newer GPU: The message states that the THD layout is supported only on Hopper and above, so the test can be expected to pass only on GPUs with compute capability 9.0 or higher.
Skip or Reroute on Older GPUs: On pre-Hopper hardware, skip test_mha_attn_varlen_forward_flashinfer or select an attention path that does not go through cudnn_batch_prefill_with_kv_cache.
Do Not Rely on a Driver Update: The check is on GPU architecture, so a newer driver on the same GPU is unlikely to change the result.

CUDA graph settings do not apply here. torch.cuda.graphs is a module, not a callable switch, so torch.cuda.graphs(False) raises a TypeError, and CUDA_GRAPH is not a documented PyTorch environment variable.

Here is an example of how to check whether the current GPU meets the Hopper requirement:

import torch

# Compute capability as (major, minor); Hopper is (9, 0), and 89 in the error corresponds to (8, 9)
major, minor = torch.cuda.get_device_capability()
print(f"sm_{major}{minor}:", "Hopper or newer" if major >= 9 else "pre-Hopper, THD layout rejected")

Note that skipping the test only hides the failure on unsupported hardware, so this should be used as a temporary workaround until a more permanent solution can be found.
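One way to keep the test from failing on pre-Hopper GPUs is to gate it on compute capability. The following is a minimal sketch, not the vLLM or PyTorch fix: the helper names supports_thd, current_capability, and skip_reason are hypothetical, and the threshold of (9, 0) simply follows the "Hopper and above" wording of the error message. The torch import is guarded so the helpers also load on machines without PyTorch.

```python
import importlib.util

HOPPER = (9, 0)  # compute capability of Hopper (sm_90)


def supports_thd(capability):
    """Return True when (major, minor) is Hopper or newer."""
    return tuple(capability) >= HOPPER


def current_capability():
    """Return the active GPU's (major, minor), or None without torch/CUDA."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


def skip_reason():
    """Return a human-readable reason to skip, or None when the test can run."""
    cap = current_capability()
    if cap is None:
        return "no CUDA device available"
    if not supports_thd(cap):
        return f"cuDNN THD layout needs sm_90+, found sm_{cap[0]}{cap[1]}"
    return None


print(skip_reason())
```

In a test module, the result could feed pytest.mark.skipif(skip_reason() is not None, reason=str(skip_reason())), so that unsupported hardware reports a skip instead of a failure.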

Verification

To verify that the fix worked, rerun test_mha_attn_varlen_forward_flashinfer and check that cudnnGraphNotSupportedError no longer appears in the output. On a Hopper or newer GPU the test should pass; on an older GPU it should be reported as skipped rather than failed.
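When the test runs in CI, the check can be done on the captured log. This is a small sketch that assumes the log uses the same [timestamp] prefix as the excerpt above; the function name find_thd_failures is hypothetical.

```python
import re

TIMESTAMP = re.compile(r"^\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z\]\s?")
SIGNATURE = "cudnnGraphNotSupportedError"


def find_thd_failures(log_text):
    """Return log lines (timestamps stripped) that mention the cuDNN error."""
    hits = []
    for line in log_text.splitlines():
        body = TIMESTAMP.sub("", line).strip()
        if SIGNATURE in body:
            hits.append(body)
    return hits


sample = """[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
"""
print(len(find_thd_failures(sample)))  # 1
```

An empty result after the rerun means the error signature is gone from the log; it does not by itself show that the test passed, so the pytest summary should still be checked.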

Extra Tips

Check the versions of the packages that appear in the traceback (vllm, flashinfer, cudnn, and torch), as a newer release may gate or replace the cuDNN path on pre-Hopper GPUs.
Note that the failing kernel is not a custom CUDA kernel: the traceback ends in cuDNN's graph build, reached through flashinfer's cudnn_batch_prefill_with_kv_cache.
Follow pytorch/pytorch#177030, where this failure is already tracked, before filing a new report.
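To record which versions are installed before comparing against a newer release, the standard library's importlib.metadata can be used. This is a sketch under one assumption: the distribution names in PACKAGES (in particular flashinfer-python and nvidia-cudnn-frontend) are guesses at how the modules in the traceback are packaged and may need adjusting for a given environment.

```python
from importlib import metadata

# Assumed distribution names; adjust them to match the environment.
PACKAGES = ["torch", "vllm", "flashinfer-python", "nvidia-cudnn-frontend"]


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Including this output in a bug report makes it easier to tell whether the failure comes from a specific combination of releases.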
