pytorch - 💡(How to fix) Fix [release 2.11][vllm] Kernels Attention Test - test_mha_attn_varlen_forward_flashinfer error [6 comments, 3 participants]

GitHub stats
pytorch/pytorch#177030 · Fetched 2026-04-08 00:22:36
Comments: 6 · Participants: 3 · Timeline: 48 · Reactions: 0
Timeline (top): subscribed ×21, mentioned ×11, commented ×6, labeled ×6

Error Message

[2026-03-10T09:05:47Z] E cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89

[2026-03-10T09:05:47Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError
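
The trailing `89` in the message most likely denotes the GPU's compute capability (8.9, the Ada Lovelace generation), which is below Hopper (9.0). Below is a minimal sketch of a check that a test could use to skip on such hardware, assuming that reading of the message is right. The helper names `sm_code_to_capability` and `supports_thd_ragged_offset` are hypothetical and are not part of cuDNN, FlashInfer, or vLLM.

```python
# Compute capability of the Hopper generation (sm_90).
HOPPER = (9, 0)


def sm_code_to_capability(sm_code: int) -> tuple[int, int]:
    """Split a two- or three-digit code such as 89 into (major, minor) = (8, 9)."""
    return divmod(sm_code, 10)


def supports_thd_ragged_offset(capability: tuple[int, int]) -> bool:
    """Assumption taken from the error text: THD (ragged offset) needs Hopper or newer."""
    return tuple(capability) >= HOPPER


if __name__ == "__main__":
    # 89 is the code printed at the end of the cuDNN error message.
    cap = sm_code_to_capability(89)
    print(cap, supports_thd_ragged_offset(cap))
```

In a real test the tuple could come from `torch.cuda.get_device_capability()`, with `pytest.skip(...)` called when the check returns `False`. Whether skipping is the right fix, rather than falling back to another attention backend, is a decision for the maintainers.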

Code Example

[2026-03-10T09:05:47Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
--
[2026-03-10T09:05:47Z]
[2026-03-10T09:05:47Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError

---

[2026-03-10T13:06:50Z] =================================== FAILURES ===================================
--
[2026-03-10T13:06:50Z] ______ test_mha_attn_varlen_forward_flashinfer[cuda-dtype0-var_seq_len0] _______
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] default_vllm_config = None, var_seq_len = [2, 2], dtype = torch.bfloat16
[2026-03-10T13:06:50Z] device = 'cuda'
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("var_seq_len", VAR_SEQ_LENS)
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize(
[2026-03-10T13:06:50Z]         "dtype",
[2026-03-10T13:06:50Z]         [torch.bfloat16, torch.half],
[2026-03-10T13:06:50Z]     )
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("device", CUDA_DEVICES)
[2026-03-10T13:06:50Z]     def test_mha_attn_varlen_forward_flashinfer(
[2026-03-10T13:06:50Z]         default_vllm_config,
[2026-03-10T13:06:50Z]         var_seq_len: list[int],
[2026-03-10T13:06:50Z]         dtype: torch.dtype,
[2026-03-10T13:06:50Z]         device: str,
[2026-03-10T13:06:50Z]     ):
[2026-03-10T13:06:50Z]         """Test MMEncoderAttention varlen forward with FLASHINFER backend (head_size=72).
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         Exercises the path that uses --mm-encoder-attn-backend=FLASHINFER with
[2026-03-10T13:06:50Z]         recomputed cu_seqlens, max_seqlen, and sequence_lengths as in qwen3_vl
[2026-03-10T13:06:50Z]         vision encoder.
[2026-03-10T13:06:50Z]         """
[2026-03-10T13:06:50Z]         pytest.importorskip("flashinfer")
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         num_heads = 16
[2026-03-10T13:06:50Z]         head_size = 72
[2026-03-10T13:06:50Z]         set_random_seed(0)
[2026-03-10T13:06:50Z]         torch.set_default_device(device)
[2026-03-10T13:06:50Z]         torch.set_default_dtype(dtype)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         # Override vllm config so get_vit_attn_backend returns FLASHINFER (simulates
[2026-03-10T13:06:50Z]         # --mm-encoder-attn-backend=FLASHINFER).
[2026-03-10T13:06:50Z]         vllm_config = get_current_vllm_config()
[2026-03-10T13:06:50Z]         old_model_config = getattr(vllm_config, "model_config", None)
[2026-03-10T13:06:50Z]         minimal_model_config = type(
[2026-03-10T13:06:50Z]             "MinimalModelConfig",
[2026-03-10T13:06:50Z]             (),
[2026-03-10T13:06:50Z]             {
[2026-03-10T13:06:50Z]                 "multimodal_config": MultiModalConfig(
[2026-03-10T13:06:50Z]                     mm_encoder_attn_backend=AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]                 ),
[2026-03-10T13:06:50Z]             },
[2026-03-10T13:06:50Z]         )()
[2026-03-10T13:06:50Z]         vllm_config.model_config = minimal_model_config
[2026-03-10T13:06:50Z]         try:
[2026-03-10T13:06:50Z]             total_len = sum(var_seq_len)
[2026-03-10T13:06:50Z]             # Stride of second dim = 3 * num_heads * head_size (same as qwen2_5_vl
[2026-03-10T13:06:50Z]             # after qkv rearrange and unbind: qkv shape (b, s, 3, head, head_dim)).
[2026-03-10T13:06:50Z]             qkv = torch.randn(1, total_len, 3, num_heads, head_size)
[2026-03-10T13:06:50Z]             q, k, v = qkv.unbind(dim=2)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = np.array(
[2026-03-10T13:06:50Z]                 [0] + list(itertools.accumulate(var_seq_len)), dtype=np.int32
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             hidden_size = num_heads * head_size
[2026-03-10T13:06:50Z]             tp_size = 1
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             sequence_lengths_np = MMEncoderAttention.maybe_compute_sequence_lengths(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             sequence_lengths = torch.from_numpy(sequence_lengths_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             max_seqlen_val = MMEncoderAttention.compute_max_seqlen(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             max_seqlen = torch.tensor(max_seqlen_val, device=device, dtype=torch.int32)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = MMEncoderAttention.maybe_recompute_cu_seqlens(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER,
[2026-03-10T13:06:50Z]                 cu_seqlens_np,
[2026-03-10T13:06:50Z]                 hidden_size,
[2026-03-10T13:06:50Z]                 tp_size,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             cu_seqlens = torch.from_numpy(cu_seqlens_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             scale = 1.0 / head_size**0.5
[2026-03-10T13:06:50Z]             attn = MMEncoderAttention(
[2026-03-10T13:06:50Z]                 num_heads,
[2026-03-10T13:06:50Z]                 head_size,
[2026-03-10T13:06:50Z]                 scale=scale,
[2026-03-10T13:06:50Z]                 num_kv_heads=num_heads,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             assert attn.attn_backend == AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] >           output = attn(
[2026-03-10T13:06:50Z]                 q,
[2026-03-10T13:06:50Z]                 k,
[2026-03-10T13:06:50Z]                 v,
[2026-03-10T13:06:50Z]                 cu_seqlens=cu_seqlens,
[2026-03-10T13:06:50Z]                 max_seqlen=max_seqlen,
[2026-03-10T13:06:50Z]                 sequence_lengths=sequence_lengths,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] kernels/attention/test_mha_attn.py:331:
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1779: in _wrapped_call_impl
[2026-03-10T13:06:50Z]     return self._call_impl(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1790: in _call_impl
[2026-03-10T13:06:50Z]     return forward_call(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py:129: in forward
[2026-03-10T13:06:50Z]     return self._forward_method(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:389: in forward_cuda
[2026-03-10T13:06:50Z]     return self._forward_flashinfer(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:351: in _forward_flashinfer
[2026-03-10T13:06:50Z]     return vit_flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:356: in vit_flashinfer_wrapper
[2026-03-10T13:06:50Z]     return torch.ops.vllm.flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/utils/_device.py:116: in __torch_function__
[2026-03-10T13:06:50Z]     return func(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:302: in flashinfer_wrapper
[2026-03-10T13:06:50Z]     output, _ = cudnn_batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:664: in cudnn_batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     return _batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:488: in _batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     graph, tensors = _build_prefill_graph(
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] args = ()
[2026-03-10T13:06:50Z] kwargs = {'actual_seq_lens_kv': tensor([[[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] ...[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]]], device='cuda:0', dtype=torch.int32), ...}
[2026-03-10T13:06:50Z] g = {
[2026-03-10T13:06:50Z]     "context": {
[2026-03-10T13:06:50Z]         "compute_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "intermediate_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "io_data_t...E",
[2026-03-10T13:06:50Z]             "stride": [1179648,72,3456,1],
[2026-03-10T13:06:50Z]             "uid": 3,
[2026-03-10T13:06:50Z]             "uid_assigned": true
[2026-03-10T13:06:50Z]         }
[2026-03-10T13:06:50Z]     }
[2026-03-10T13:06:50Z] }
[2026-03-10T13:06:50Z] tensors = [[{"data_type":"BFLOAT16","dim":[8,16,1024,72],"is_pass_by_value":false,"is_virtual":false,"name":"q","pass_by_value":...:"actual_seq_lens_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":101,"uid_assigned":true}]]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @wraps(func)
[2026-03-10T13:06:50Z]     def wrapper(*args, **kwargs):
[2026-03-10T13:06:50Z]         g, tensors = func(*args, **kwargs)  # Get the result
[2026-03-10T13:06:50Z]         if g.get_execution_plan_count() <= 0:
[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError
[2026-03-10T13:06:50Z] ______ test_mha_attn_varlen_forward_flashinfer[cuda-dtype0-var_seq_len1] _______
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] default_vllm_config = None, var_seq_len = [2, 3, 4], dtype = torch.bfloat16
[2026-03-10T13:06:50Z] device = 'cuda'
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("var_seq_len", VAR_SEQ_LENS)
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize(
[2026-03-10T13:06:50Z]         "dtype",
[2026-03-10T13:06:50Z]         [torch.bfloat16, torch.half],
[2026-03-10T13:06:50Z]     )
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("device", CUDA_DEVICES)
[2026-03-10T13:06:50Z]     def test_mha_attn_varlen_forward_flashinfer(
[2026-03-10T13:06:50Z]         default_vllm_config,
[2026-03-10T13:06:50Z]         var_seq_len: list[int],
[2026-03-10T13:06:50Z]         dtype: torch.dtype,
[2026-03-10T13:06:50Z]         device: str,
[2026-03-10T13:06:50Z]     ):
[2026-03-10T13:06:50Z]         """Test MMEncoderAttention varlen forward with FLASHINFER backend (head_size=72).
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         Exercises the path that uses --mm-encoder-attn-backend=FLASHINFER with
[2026-03-10T13:06:50Z]         recomputed cu_seqlens, max_seqlen, and sequence_lengths as in qwen3_vl
[2026-03-10T13:06:50Z]         vision encoder.
[2026-03-10T13:06:50Z]         """
[2026-03-10T13:06:50Z]         pytest.importorskip("flashinfer")
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         num_heads = 16
[2026-03-10T13:06:50Z]         head_size = 72
[2026-03-10T13:06:50Z]         set_random_seed(0)
[2026-03-10T13:06:50Z]         torch.set_default_device(device)
[2026-03-10T13:06:50Z]         torch.set_default_dtype(dtype)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         # Override vllm config so get_vit_attn_backend returns FLASHINFER (simulates
[2026-03-10T13:06:50Z]         # --mm-encoder-attn-backend=FLASHINFER).
[2026-03-10T13:06:50Z]         vllm_config = get_current_vllm_config()
[2026-03-10T13:06:50Z]         old_model_config = getattr(vllm_config, "model_config", None)
[2026-03-10T13:06:50Z]         minimal_model_config = type(
[2026-03-10T13:06:50Z]             "MinimalModelConfig",
[2026-03-10T13:06:50Z]             (),
[2026-03-10T13:06:50Z]             {
[2026-03-10T13:06:50Z]                 "multimodal_config": MultiModalConfig(
[2026-03-10T13:06:50Z]                     mm_encoder_attn_backend=AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]                 ),
[2026-03-10T13:06:50Z]             },
[2026-03-10T13:06:50Z]         )()
[2026-03-10T13:06:50Z]         vllm_config.model_config = minimal_model_config
[2026-03-10T13:06:50Z]         try:
[2026-03-10T13:06:50Z]             total_len = sum(var_seq_len)
[2026-03-10T13:06:50Z]             # Stride of second dim = 3 * num_heads * head_size (same as qwen2_5_vl
[2026-03-10T13:06:50Z]             # after qkv rearrange and unbind: qkv shape (b, s, 3, head, head_dim)).
[2026-03-10T13:06:50Z]             qkv = torch.randn(1, total_len, 3, num_heads, head_size)
[2026-03-10T13:06:50Z]             q, k, v = qkv.unbind(dim=2)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = np.array(
[2026-03-10T13:06:50Z]                 [0] + list(itertools.accumulate(var_seq_len)), dtype=np.int32
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             hidden_size = num_heads * head_size
[2026-03-10T13:06:50Z]             tp_size = 1
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             sequence_lengths_np = MMEncoderAttention.maybe_compute_sequence_lengths(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             sequence_lengths = torch.from_numpy(sequence_lengths_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             max_seqlen_val = MMEncoderAttention.compute_max_seqlen(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             max_seqlen = torch.tensor(max_seqlen_val, device=device, dtype=torch.int32)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = MMEncoderAttention.maybe_recompute_cu_seqlens(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER,
[2026-03-10T13:06:50Z]                 cu_seqlens_np,
[2026-03-10T13:06:50Z]                 hidden_size,
[2026-03-10T13:06:50Z]                 tp_size,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             cu_seqlens = torch.from_numpy(cu_seqlens_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             scale = 1.0 / head_size**0.5
[2026-03-10T13:06:50Z]             attn = MMEncoderAttention(
[2026-03-10T13:06:50Z]                 num_heads,
[2026-03-10T13:06:50Z]                 head_size,
[2026-03-10T13:06:50Z]                 scale=scale,
[2026-03-10T13:06:50Z]                 num_kv_heads=num_heads,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             assert attn.attn_backend == AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] >           output = attn(
[2026-03-10T13:06:50Z]                 q,
[2026-03-10T13:06:50Z]                 k,
[2026-03-10T13:06:50Z]                 v,
[2026-03-10T13:06:50Z]                 cu_seqlens=cu_seqlens,
[2026-03-10T13:06:50Z]                 max_seqlen=max_seqlen,
[2026-03-10T13:06:50Z]                 sequence_lengths=sequence_lengths,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] kernels/attention/test_mha_attn.py:331:
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1779: in _wrapped_call_impl
[2026-03-10T13:06:50Z]     return self._call_impl(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1790: in _call_impl
[2026-03-10T13:06:50Z]     return forward_call(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py:129: in forward
[2026-03-10T13:06:50Z]     return self._forward_method(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:389: in forward_cuda
[2026-03-10T13:06:50Z]     return self._forward_flashinfer(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:351: in _forward_flashinfer
[2026-03-10T13:06:50Z]     return vit_flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:356: in vit_flashinfer_wrapper
[2026-03-10T13:06:50Z]     return torch.ops.vllm.flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/utils/_device.py:116: in __torch_function__
[2026-03-10T13:06:50Z]     return func(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:302: in flashinfer_wrapper
[2026-03-10T13:06:50Z]     output, _ = cudnn_batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:664: in cudnn_batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     return _batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:488: in _batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     graph, tensors = _build_prefill_graph(
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] args = ()
[2026-03-10T13:06:50Z] kwargs = {'actual_seq_lens_kv': tensor([[[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[3]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] ...368]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[10368]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[10368]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[10368]]]], device='cuda:0', dtype=torch.int32), ...}
[2026-03-10T13:06:50Z] g = {
[2026-03-10T13:06:50Z]     "context": {
[2026-03-10T13:06:50Z]         "compute_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "intermediate_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "io_data_t...E",
[2026-03-10T13:06:50Z]             "stride": [1179648,72,3456,1],
[2026-03-10T13:06:50Z]             "uid": 3,
[2026-03-10T13:06:50Z]             "uid_assigned": true
[2026-03-10T13:06:50Z]         }
[2026-03-10T13:06:50Z]     }
[2026-03-10T13:06:50Z] }
[2026-03-10T13:06:50Z] tensors = [[{"data_type":"BFLOAT16","dim":[8,16,1024,72],"is_pass_by_value":false,"is_virtual":false,"name":"q","pass_by_value":...:"actual_seq_lens_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":101,"uid_assigned":true}]]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @wraps(func)
[2026-03-10T13:06:50Z]     def wrapper(*args, **kwargs):
[2026-03-10T13:06:50Z]         g, tensors = func(*args, **kwargs)  # Get the result
[2026-03-10T13:06:50Z]         if g.get_execution_plan_count() <= 0:
[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError
[2026-03-10T13:06:50Z] ______ test_mha_attn_varlen_forward_flashinfer[cuda-dtype1-var_seq_len0] _______
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] default_vllm_config = None, var_seq_len = [2, 2], dtype = torch.float16
[2026-03-10T13:06:50Z] device = 'cuda'
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("var_seq_len", VAR_SEQ_LENS)
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize(
[2026-03-10T13:06:50Z]         "dtype",
[2026-03-10T13:06:50Z]         [torch.bfloat16, torch.half],
[2026-03-10T13:06:50Z]     )
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("device", CUDA_DEVICES)
[2026-03-10T13:06:50Z]     def test_mha_attn_varlen_forward_flashinfer(
[2026-03-10T13:06:50Z]         default_vllm_config,
[2026-03-10T13:06:50Z]         var_seq_len: list[int],
[2026-03-10T13:06:50Z]         dtype: torch.dtype,
[2026-03-10T13:06:50Z]         device: str,
[2026-03-10T13:06:50Z]     ):
[2026-03-10T13:06:50Z]         """Test MMEncoderAttention varlen forward with FLASHINFER backend (head_size=72).
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         Exercises the path that uses --mm-encoder-attn-backend=FLASHINFER with
[2026-03-10T13:06:50Z]         recomputed cu_seqlens, max_seqlen, and sequence_lengths as in qwen3_vl
[2026-03-10T13:06:50Z]         vision encoder.
[2026-03-10T13:06:50Z]         """
[2026-03-10T13:06:50Z]         pytest.importorskip("flashinfer")
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         num_heads = 16
[2026-03-10T13:06:50Z]         head_size = 72
[2026-03-10T13:06:50Z]         set_random_seed(0)
[2026-03-10T13:06:50Z]         torch.set_default_device(device)
[2026-03-10T13:06:50Z]         torch.set_default_dtype(dtype)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         # Override vllm config so get_vit_attn_backend returns FLASHINFER (simulates
[2026-03-10T13:06:50Z]         # --mm-encoder-attn-backend=FLASHINFER).
[2026-03-10T13:06:50Z]         vllm_config = get_current_vllm_config()
[2026-03-10T13:06:50Z]         old_model_config = getattr(vllm_config, "model_config", None)
[2026-03-10T13:06:50Z]         minimal_model_config = type(
[2026-03-10T13:06:50Z]             "MinimalModelConfig",
[2026-03-10T13:06:50Z]             (),
[2026-03-10T13:06:50Z]             {
[2026-03-10T13:06:50Z]                 "multimodal_config": MultiModalConfig(
[2026-03-10T13:06:50Z]                     mm_encoder_attn_backend=AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]                 ),
[2026-03-10T13:06:50Z]             },
[2026-03-10T13:06:50Z]         )()
[2026-03-10T13:06:50Z]         vllm_config.model_config = minimal_model_config
[2026-03-10T13:06:50Z]         try:
[2026-03-10T13:06:50Z]             total_len = sum(var_seq_len)
[2026-03-10T13:06:50Z]             # Stride of second dim = 3 * num_heads * head_size (same as qwen2_5_vl
[2026-03-10T13:06:50Z]             # after qkv rearrange and unbind: qkv shape (b, s, 3, head, head_dim)).
[2026-03-10T13:06:50Z]             qkv = torch.randn(1, total_len, 3, num_heads, head_size)
[2026-03-10T13:06:50Z]             q, k, v = qkv.unbind(dim=2)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = np.array(
[2026-03-10T13:06:50Z]                 [0] + list(itertools.accumulate(var_seq_len)), dtype=np.int32
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             hidden_size = num_heads * head_size
[2026-03-10T13:06:50Z]             tp_size = 1
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             sequence_lengths_np = MMEncoderAttention.maybe_compute_sequence_lengths(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             sequence_lengths = torch.from_numpy(sequence_lengths_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             max_seqlen_val = MMEncoderAttention.compute_max_seqlen(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             max_seqlen = torch.tensor(max_seqlen_val, device=device, dtype=torch.int32)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = MMEncoderAttention.maybe_recompute_cu_seqlens(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER,
[2026-03-10T13:06:50Z]                 cu_seqlens_np,
[2026-03-10T13:06:50Z]                 hidden_size,
[2026-03-10T13:06:50Z]                 tp_size,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             cu_seqlens = torch.from_numpy(cu_seqlens_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             scale = 1.0 / head_size**0.5
[2026-03-10T13:06:50Z]             attn = MMEncoderAttention(
[2026-03-10T13:06:50Z]                 num_heads,
[2026-03-10T13:06:50Z]                 head_size,
[2026-03-10T13:06:50Z]                 scale=scale,
[2026-03-10T13:06:50Z]                 num_kv_heads=num_heads,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             assert attn.attn_backend == AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] >           output = attn(
[2026-03-10T13:06:50Z]                 q,
[2026-03-10T13:06:50Z]                 k,
[2026-03-10T13:06:50Z]                 v,
[2026-03-10T13:06:50Z]                 cu_seqlens=cu_seqlens,
[2026-03-10T13:06:50Z]                 max_seqlen=max_seqlen,
[2026-03-10T13:06:50Z]                 sequence_lengths=sequence_lengths,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] kernels/attention/test_mha_attn.py:331:
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1779: in _wrapped_call_impl
[2026-03-10T13:06:50Z]     return self._call_impl(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1790: in _call_impl
[2026-03-10T13:06:50Z]     return forward_call(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py:129: in forward
[2026-03-10T13:06:50Z]     return self._forward_method(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:389: in forward_cuda
[2026-03-10T13:06:50Z]     return self._forward_flashinfer(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:351: in _forward_flashinfer
[2026-03-10T13:06:50Z]     return vit_flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:356: in vit_flashinfer_wrapper
[2026-03-10T13:06:50Z]     return torch.ops.vllm.flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/utils/_device.py:116: in __torch_function__
[2026-03-10T13:06:50Z]     return func(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:302: in flashinfer_wrapper
[2026-03-10T13:06:50Z]     output, _ = cudnn_batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:664: in cudnn_batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     return _batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:488: in _batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     graph, tensors = _build_prefill_graph(
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] args = ()
[2026-03-10T13:06:50Z] kwargs = {'actual_seq_lens_kv': tensor([[[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] ...[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]]], device='cuda:0', dtype=torch.int32), ...}
[2026-03-10T13:06:50Z] g = {
[2026-03-10T13:06:50Z]     "context": {
[2026-03-10T13:06:50Z]         "compute_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "intermediate_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "io_data_t...E",
[2026-03-10T13:06:50Z]             "stride": [1179648,72,3456,1],
[2026-03-10T13:06:50Z]             "uid": 3,
[2026-03-10T13:06:50Z]             "uid_assigned": true
[2026-03-10T13:06:50Z]         }
[2026-03-10T13:06:50Z]     }
[2026-03-10T13:06:50Z] }
[2026-03-10T13:06:50Z] tensors = [[{"data_type":"HALF","dim":[8,16,1024,72],"is_pass_by_value":false,"is_virtual":false,"name":"q","pass_by_value":null...:"actual_seq_lens_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":101,"uid_assigned":true}]]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @wraps(func)
[2026-03-10T13:06:50Z]     def wrapper(*args, **kwargs):
[2026-03-10T13:06:50Z]         g, tensors = func(*args, **kwargs)  # Get the result
[2026-03-10T13:06:50Z]         if g.get_execution_plan_count() <= 0:
[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError

09.03.2026 Failure: https://buildkite.com/vllm/ci/builds/55400#019cd7be-8424-4732-8642-71b2ead336fb

[2026-03-10T09:05:47Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
--
[2026-03-10T09:05:47Z]
[2026-03-10T09:05:47Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError
<details>

[2026-03-10T13:06:50Z] =================================== FAILURES ===================================
--
[2026-03-10T13:06:50Z] ______ test_mha_attn_varlen_forward_flashinfer[cuda-dtype0-var_seq_len0] _______
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] default_vllm_config = None, var_seq_len = [2, 2], dtype = torch.bfloat16
[2026-03-10T13:06:50Z] device = 'cuda'
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("var_seq_len", VAR_SEQ_LENS)
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize(
[2026-03-10T13:06:50Z]         "dtype",
[2026-03-10T13:06:50Z]         [torch.bfloat16, torch.half],
[2026-03-10T13:06:50Z]     )
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("device", CUDA_DEVICES)
[2026-03-10T13:06:50Z]     def test_mha_attn_varlen_forward_flashinfer(
[2026-03-10T13:06:50Z]         default_vllm_config,
[2026-03-10T13:06:50Z]         var_seq_len: list[int],
[2026-03-10T13:06:50Z]         dtype: torch.dtype,
[2026-03-10T13:06:50Z]         device: str,
[2026-03-10T13:06:50Z]     ):
[2026-03-10T13:06:50Z]         """Test MMEncoderAttention varlen forward with FLASHINFER backend (head_size=72).
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         Exercises the path that uses --mm-encoder-attn-backend=FLASHINFER with
[2026-03-10T13:06:50Z]         recomputed cu_seqlens, max_seqlen, and sequence_lengths as in qwen3_vl
[2026-03-10T13:06:50Z]         vision encoder.
[2026-03-10T13:06:50Z]         """
[2026-03-10T13:06:50Z]         pytest.importorskip("flashinfer")
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         num_heads = 16
[2026-03-10T13:06:50Z]         head_size = 72
[2026-03-10T13:06:50Z]         set_random_seed(0)
[2026-03-10T13:06:50Z]         torch.set_default_device(device)
[2026-03-10T13:06:50Z]         torch.set_default_dtype(dtype)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         # Override vllm config so get_vit_attn_backend returns FLASHINFER (simulates
[2026-03-10T13:06:50Z]         # --mm-encoder-attn-backend=FLASHINFER).
[2026-03-10T13:06:50Z]         vllm_config = get_current_vllm_config()
[2026-03-10T13:06:50Z]         old_model_config = getattr(vllm_config, "model_config", None)
[2026-03-10T13:06:50Z]         minimal_model_config = type(
[2026-03-10T13:06:50Z]             "MinimalModelConfig",
[2026-03-10T13:06:50Z]             (),
[2026-03-10T13:06:50Z]             {
[2026-03-10T13:06:50Z]                 "multimodal_config": MultiModalConfig(
[2026-03-10T13:06:50Z]                     mm_encoder_attn_backend=AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]                 ),
[2026-03-10T13:06:50Z]             },
[2026-03-10T13:06:50Z]         )()
[2026-03-10T13:06:50Z]         vllm_config.model_config = minimal_model_config
[2026-03-10T13:06:50Z]         try:
[2026-03-10T13:06:50Z]             total_len = sum(var_seq_len)
[2026-03-10T13:06:50Z]             # Stride of second dim = 3 * num_heads * head_size (same as qwen2_5_vl
[2026-03-10T13:06:50Z]             # after qkv rearrange and unbind: qkv shape (b, s, 3, head, head_dim)).
[2026-03-10T13:06:50Z]             qkv = torch.randn(1, total_len, 3, num_heads, head_size)
[2026-03-10T13:06:50Z]             q, k, v = qkv.unbind(dim=2)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = np.array(
[2026-03-10T13:06:50Z]                 [0] + list(itertools.accumulate(var_seq_len)), dtype=np.int32
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             hidden_size = num_heads * head_size
[2026-03-10T13:06:50Z]             tp_size = 1
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             sequence_lengths_np = MMEncoderAttention.maybe_compute_sequence_lengths(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             sequence_lengths = torch.from_numpy(sequence_lengths_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             max_seqlen_val = MMEncoderAttention.compute_max_seqlen(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             max_seqlen = torch.tensor(max_seqlen_val, device=device, dtype=torch.int32)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = MMEncoderAttention.maybe_recompute_cu_seqlens(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER,
[2026-03-10T13:06:50Z]                 cu_seqlens_np,
[2026-03-10T13:06:50Z]                 hidden_size,
[2026-03-10T13:06:50Z]                 tp_size,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             cu_seqlens = torch.from_numpy(cu_seqlens_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             scale = 1.0 / head_size**0.5
[2026-03-10T13:06:50Z]             attn = MMEncoderAttention(
[2026-03-10T13:06:50Z]                 num_heads,
[2026-03-10T13:06:50Z]                 head_size,
[2026-03-10T13:06:50Z]                 scale=scale,
[2026-03-10T13:06:50Z]                 num_kv_heads=num_heads,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             assert attn.attn_backend == AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] >           output = attn(
[2026-03-10T13:06:50Z]                 q,
[2026-03-10T13:06:50Z]                 k,
[2026-03-10T13:06:50Z]                 v,
[2026-03-10T13:06:50Z]                 cu_seqlens=cu_seqlens,
[2026-03-10T13:06:50Z]                 max_seqlen=max_seqlen,
[2026-03-10T13:06:50Z]                 sequence_lengths=sequence_lengths,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] kernels/attention/test_mha_attn.py:331:
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1779: in _wrapped_call_impl
[2026-03-10T13:06:50Z]     return self._call_impl(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1790: in _call_impl
[2026-03-10T13:06:50Z]     return forward_call(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py:129: in forward
[2026-03-10T13:06:50Z]     return self._forward_method(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:389: in forward_cuda
[2026-03-10T13:06:50Z]     return self._forward_flashinfer(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:351: in _forward_flashinfer
[2026-03-10T13:06:50Z]     return vit_flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:356: in vit_flashinfer_wrapper
[2026-03-10T13:06:50Z]     return torch.ops.vllm.flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/utils/_device.py:116: in __torch_function__
[2026-03-10T13:06:50Z]     return func(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:302: in flashinfer_wrapper
[2026-03-10T13:06:50Z]     output, _ = cudnn_batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:664: in cudnn_batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     return _batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:488: in _batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     graph, tensors = _build_prefill_graph(
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] args = ()
[2026-03-10T13:06:50Z] kwargs = {'actual_seq_lens_kv': tensor([[[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] ...[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]]], device='cuda:0', dtype=torch.int32), ...}
[2026-03-10T13:06:50Z] g = {
[2026-03-10T13:06:50Z]     "context": {
[2026-03-10T13:06:50Z]         "compute_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "intermediate_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "io_data_t...E",
[2026-03-10T13:06:50Z]             "stride": [1179648,72,3456,1],
[2026-03-10T13:06:50Z]             "uid": 3,
[2026-03-10T13:06:50Z]             "uid_assigned": true
[2026-03-10T13:06:50Z]         }
[2026-03-10T13:06:50Z]     }
[2026-03-10T13:06:50Z] }
[2026-03-10T13:06:50Z] tensors = [[{"data_type":"BFLOAT16","dim":[8,16,1024,72],"is_pass_by_value":false,"is_virtual":false,"name":"q","pass_by_value":...:"actual_seq_lens_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":101,"uid_assigned":true}]]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @wraps(func)
[2026-03-10T13:06:50Z]     def wrapper(*args, **kwargs):
[2026-03-10T13:06:50Z]         g, tensors = func(*args, **kwargs)  # Get the result
[2026-03-10T13:06:50Z]         if g.get_execution_plan_count() <= 0:
[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError
[2026-03-10T13:06:50Z] ______ test_mha_attn_varlen_forward_flashinfer[cuda-dtype0-var_seq_len1] _______
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] default_vllm_config = None, var_seq_len = [2, 3, 4], dtype = torch.bfloat16
[2026-03-10T13:06:50Z] device = 'cuda'
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("var_seq_len", VAR_SEQ_LENS)
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize(
[2026-03-10T13:06:50Z]         "dtype",
[2026-03-10T13:06:50Z]         [torch.bfloat16, torch.half],
[2026-03-10T13:06:50Z]     )
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("device", CUDA_DEVICES)
[2026-03-10T13:06:50Z]     def test_mha_attn_varlen_forward_flashinfer(
[2026-03-10T13:06:50Z]         default_vllm_config,
[2026-03-10T13:06:50Z]         var_seq_len: list[int],
[2026-03-10T13:06:50Z]         dtype: torch.dtype,
[2026-03-10T13:06:50Z]         device: str,
[2026-03-10T13:06:50Z]     ):
[2026-03-10T13:06:50Z]         """Test MMEncoderAttention varlen forward with FLASHINFER backend (head_size=72).
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         Exercises the path that uses --mm-encoder-attn-backend=FLASHINFER with
[2026-03-10T13:06:50Z]         recomputed cu_seqlens, max_seqlen, and sequence_lengths as in qwen3_vl
[2026-03-10T13:06:50Z]         vision encoder.
[2026-03-10T13:06:50Z]         """
[2026-03-10T13:06:50Z]         pytest.importorskip("flashinfer")
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         num_heads = 16
[2026-03-10T13:06:50Z]         head_size = 72
[2026-03-10T13:06:50Z]         set_random_seed(0)
[2026-03-10T13:06:50Z]         torch.set_default_device(device)
[2026-03-10T13:06:50Z]         torch.set_default_dtype(dtype)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         # Override vllm config so get_vit_attn_backend returns FLASHINFER (simulates
[2026-03-10T13:06:50Z]         # --mm-encoder-attn-backend=FLASHINFER).
[2026-03-10T13:06:50Z]         vllm_config = get_current_vllm_config()
[2026-03-10T13:06:50Z]         old_model_config = getattr(vllm_config, "model_config", None)
[2026-03-10T13:06:50Z]         minimal_model_config = type(
[2026-03-10T13:06:50Z]             "MinimalModelConfig",
[2026-03-10T13:06:50Z]             (),
[2026-03-10T13:06:50Z]             {
[2026-03-10T13:06:50Z]                 "multimodal_config": MultiModalConfig(
[2026-03-10T13:06:50Z]                     mm_encoder_attn_backend=AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]                 ),
[2026-03-10T13:06:50Z]             },
[2026-03-10T13:06:50Z]         )()
[2026-03-10T13:06:50Z]         vllm_config.model_config = minimal_model_config
[2026-03-10T13:06:50Z]         try:
[2026-03-10T13:06:50Z]             total_len = sum(var_seq_len)
[2026-03-10T13:06:50Z]             # Stride of second dim = 3 * num_heads * head_size (same as qwen2_5_vl
[2026-03-10T13:06:50Z]             # after qkv rearrange and unbind: qkv shape (b, s, 3, head, head_dim)).
[2026-03-10T13:06:50Z]             qkv = torch.randn(1, total_len, 3, num_heads, head_size)
[2026-03-10T13:06:50Z]             q, k, v = qkv.unbind(dim=2)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = np.array(
[2026-03-10T13:06:50Z]                 [0] + list(itertools.accumulate(var_seq_len)), dtype=np.int32
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             hidden_size = num_heads * head_size
[2026-03-10T13:06:50Z]             tp_size = 1
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             sequence_lengths_np = MMEncoderAttention.maybe_compute_sequence_lengths(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             sequence_lengths = torch.from_numpy(sequence_lengths_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             max_seqlen_val = MMEncoderAttention.compute_max_seqlen(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             max_seqlen = torch.tensor(max_seqlen_val, device=device, dtype=torch.int32)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = MMEncoderAttention.maybe_recompute_cu_seqlens(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER,
[2026-03-10T13:06:50Z]                 cu_seqlens_np,
[2026-03-10T13:06:50Z]                 hidden_size,
[2026-03-10T13:06:50Z]                 tp_size,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             cu_seqlens = torch.from_numpy(cu_seqlens_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             scale = 1.0 / head_size**0.5
[2026-03-10T13:06:50Z]             attn = MMEncoderAttention(
[2026-03-10T13:06:50Z]                 num_heads,
[2026-03-10T13:06:50Z]                 head_size,
[2026-03-10T13:06:50Z]                 scale=scale,
[2026-03-10T13:06:50Z]                 num_kv_heads=num_heads,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             assert attn.attn_backend == AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] >           output = attn(
[2026-03-10T13:06:50Z]                 q,
[2026-03-10T13:06:50Z]                 k,
[2026-03-10T13:06:50Z]                 v,
[2026-03-10T13:06:50Z]                 cu_seqlens=cu_seqlens,
[2026-03-10T13:06:50Z]                 max_seqlen=max_seqlen,
[2026-03-10T13:06:50Z]                 sequence_lengths=sequence_lengths,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] kernels/attention/test_mha_attn.py:331:
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1779: in _wrapped_call_impl
[2026-03-10T13:06:50Z]     return self._call_impl(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1790: in _call_impl
[2026-03-10T13:06:50Z]     return forward_call(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py:129: in forward
[2026-03-10T13:06:50Z]     return self._forward_method(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:389: in forward_cuda
[2026-03-10T13:06:50Z]     return self._forward_flashinfer(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:351: in _forward_flashinfer
[2026-03-10T13:06:50Z]     return vit_flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:356: in vit_flashinfer_wrapper
[2026-03-10T13:06:50Z]     return torch.ops.vllm.flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/utils/_device.py:116: in __torch_function__
[2026-03-10T13:06:50Z]     return func(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:302: in flashinfer_wrapper
[2026-03-10T13:06:50Z]     output, _ = cudnn_batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:664: in cudnn_batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     return _batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:488: in _batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     graph, tensors = _build_prefill_graph(
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] args = ()
[2026-03-10T13:06:50Z] kwargs = {'actual_seq_lens_kv': tensor([[[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[3]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] ...368]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[10368]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[10368]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[10368]]]], device='cuda:0', dtype=torch.int32), ...}
[2026-03-10T13:06:50Z] g = {
[2026-03-10T13:06:50Z]     "context": {
[2026-03-10T13:06:50Z]         "compute_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "intermediate_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "io_data_t...E",
[2026-03-10T13:06:50Z]             "stride": [1179648,72,3456,1],
[2026-03-10T13:06:50Z]             "uid": 3,
[2026-03-10T13:06:50Z]             "uid_assigned": true
[2026-03-10T13:06:50Z]         }
[2026-03-10T13:06:50Z]     }
[2026-03-10T13:06:50Z] }
[2026-03-10T13:06:50Z] tensors = [[{"data_type":"BFLOAT16","dim":[8,16,1024,72],"is_pass_by_value":false,"is_virtual":false,"name":"q","pass_by_value":...:"actual_seq_lens_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":101,"uid_assigned":true}]]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @wraps(func)
[2026-03-10T13:06:50Z]     def wrapper(*args, **kwargs):
[2026-03-10T13:06:50Z]         g, tensors = func(*args, **kwargs)  # Get the result
[2026-03-10T13:06:50Z]         if g.get_execution_plan_count() <= 0:
[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError
[2026-03-10T13:06:50Z] ______ test_mha_attn_varlen_forward_flashinfer[cuda-dtype1-var_seq_len0] _______
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] default_vllm_config = None, var_seq_len = [2, 2], dtype = torch.float16
[2026-03-10T13:06:50Z] device = 'cuda'
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("var_seq_len", VAR_SEQ_LENS)
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize(
[2026-03-10T13:06:50Z]         "dtype",
[2026-03-10T13:06:50Z]         [torch.bfloat16, torch.half],
[2026-03-10T13:06:50Z]     )
[2026-03-10T13:06:50Z]     @pytest.mark.parametrize("device", CUDA_DEVICES)
[2026-03-10T13:06:50Z]     def test_mha_attn_varlen_forward_flashinfer(
[2026-03-10T13:06:50Z]         default_vllm_config,
[2026-03-10T13:06:50Z]         var_seq_len: list[int],
[2026-03-10T13:06:50Z]         dtype: torch.dtype,
[2026-03-10T13:06:50Z]         device: str,
[2026-03-10T13:06:50Z]     ):
[2026-03-10T13:06:50Z]         """Test MMEncoderAttention varlen forward with FLASHINFER backend (head_size=72).
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         Exercises the path that uses --mm-encoder-attn-backend=FLASHINFER with
[2026-03-10T13:06:50Z]         recomputed cu_seqlens, max_seqlen, and sequence_lengths as in qwen3_vl
[2026-03-10T13:06:50Z]         vision encoder.
[2026-03-10T13:06:50Z]         """
[2026-03-10T13:06:50Z]         pytest.importorskip("flashinfer")
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         num_heads = 16
[2026-03-10T13:06:50Z]         head_size = 72
[2026-03-10T13:06:50Z]         set_random_seed(0)
[2026-03-10T13:06:50Z]         torch.set_default_device(device)
[2026-03-10T13:06:50Z]         torch.set_default_dtype(dtype)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         # Override vllm config so get_vit_attn_backend returns FLASHINFER (simulates
[2026-03-10T13:06:50Z]         # --mm-encoder-attn-backend=FLASHINFER).
[2026-03-10T13:06:50Z]         vllm_config = get_current_vllm_config()
[2026-03-10T13:06:50Z]         old_model_config = getattr(vllm_config, "model_config", None)
[2026-03-10T13:06:50Z]         minimal_model_config = type(
[2026-03-10T13:06:50Z]             "MinimalModelConfig",
[2026-03-10T13:06:50Z]             (),
[2026-03-10T13:06:50Z]             {
[2026-03-10T13:06:50Z]                 "multimodal_config": MultiModalConfig(
[2026-03-10T13:06:50Z]                     mm_encoder_attn_backend=AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]                 ),
[2026-03-10T13:06:50Z]             },
[2026-03-10T13:06:50Z]         )()
[2026-03-10T13:06:50Z]         vllm_config.model_config = minimal_model_config
[2026-03-10T13:06:50Z]         try:
[2026-03-10T13:06:50Z]             total_len = sum(var_seq_len)
[2026-03-10T13:06:50Z]             # Stride of second dim = 3 * num_heads * head_size (same as qwen2_5_vl
[2026-03-10T13:06:50Z]             # after qkv rearrange and unbind: qkv shape (b, s, 3, head, head_dim)).
[2026-03-10T13:06:50Z]             qkv = torch.randn(1, total_len, 3, num_heads, head_size)
[2026-03-10T13:06:50Z]             q, k, v = qkv.unbind(dim=2)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = np.array(
[2026-03-10T13:06:50Z]                 [0] + list(itertools.accumulate(var_seq_len)), dtype=np.int32
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             hidden_size = num_heads * head_size
[2026-03-10T13:06:50Z]             tp_size = 1
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             sequence_lengths_np = MMEncoderAttention.maybe_compute_sequence_lengths(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             sequence_lengths = torch.from_numpy(sequence_lengths_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             max_seqlen_val = MMEncoderAttention.compute_max_seqlen(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER, cu_seqlens_np
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             max_seqlen = torch.tensor(max_seqlen_val, device=device, dtype=torch.int32)
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             cu_seqlens_np = MMEncoderAttention.maybe_recompute_cu_seqlens(
[2026-03-10T13:06:50Z]                 AttentionBackendEnum.FLASHINFER,
[2026-03-10T13:06:50Z]                 cu_seqlens_np,
[2026-03-10T13:06:50Z]                 hidden_size,
[2026-03-10T13:06:50Z]                 tp_size,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             cu_seqlens = torch.from_numpy(cu_seqlens_np).to(
[2026-03-10T13:06:50Z]                 device, dtype=torch.int32, non_blocking=True
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]             scale = 1.0 / head_size**0.5
[2026-03-10T13:06:50Z]             attn = MMEncoderAttention(
[2026-03-10T13:06:50Z]                 num_heads,
[2026-03-10T13:06:50Z]                 head_size,
[2026-03-10T13:06:50Z]                 scale=scale,
[2026-03-10T13:06:50Z]                 num_kv_heads=num_heads,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]             assert attn.attn_backend == AttentionBackendEnum.FLASHINFER
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] >           output = attn(
[2026-03-10T13:06:50Z]                 q,
[2026-03-10T13:06:50Z]                 k,
[2026-03-10T13:06:50Z]                 v,
[2026-03-10T13:06:50Z]                 cu_seqlens=cu_seqlens,
[2026-03-10T13:06:50Z]                 max_seqlen=max_seqlen,
[2026-03-10T13:06:50Z]                 sequence_lengths=sequence_lengths,
[2026-03-10T13:06:50Z]             )
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] kernels/attention/test_mha_attn.py:331:
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1779: in _wrapped_call_impl
[2026-03-10T13:06:50Z]     return self._call_impl(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1790: in _call_impl
[2026-03-10T13:06:50Z]     return forward_call(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py:129: in forward
[2026-03-10T13:06:50Z]     return self._forward_method(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:389: in forward_cuda
[2026-03-10T13:06:50Z]     return self._forward_flashinfer(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py:351: in _forward_flashinfer
[2026-03-10T13:06:50Z]     return vit_flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:356: in vit_flashinfer_wrapper
[2026-03-10T13:06:50Z]     return torch.ops.vllm.flashinfer_wrapper(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/utils/_device.py:116: in __torch_function__
[2026-03-10T13:06:50Z]     return func(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/torch/_ops.py:1269: in __call__
[2026-03-10T13:06:50Z]     return self._op(*args, **kwargs)
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/vit_attn_wrappers.py:302: in flashinfer_wrapper
[2026-03-10T13:06:50Z]     output, _ = cudnn_batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:664: in cudnn_batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     return _batch_prefill_with_kv_cache(
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/flashinfer/cudnn/prefill.py:488: in _batch_prefill_with_kv_cache
[2026-03-10T13:06:50Z]     graph, tensors = _build_prefill_graph(
[2026-03-10T13:06:50Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] args = ()
[2026-03-10T13:06:50Z] kwargs = {'actual_seq_lens_kv': tensor([[[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[2]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[0]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] ...[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]],
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]         [[[4608]]]], device='cuda:0', dtype=torch.int32), ...}
[2026-03-10T13:06:50Z] g = {
[2026-03-10T13:06:50Z]     "context": {
[2026-03-10T13:06:50Z]         "compute_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "intermediate_data_type": "FLOAT",
[2026-03-10T13:06:50Z]         "io_data_t...E",
[2026-03-10T13:06:50Z]             "stride": [1179648,72,3456,1],
[2026-03-10T13:06:50Z]             "uid": 3,
[2026-03-10T13:06:50Z]             "uid_assigned": true
[2026-03-10T13:06:50Z]         }
[2026-03-10T13:06:50Z]     }
[2026-03-10T13:06:50Z] }
[2026-03-10T13:06:50Z] tensors = [[{"data_type":"HALF","dim":[8,16,1024,72],"is_pass_by_value":false,"is_virtual":false,"name":"q","pass_by_value":null...:"actual_seq_lens_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":101,"uid_assigned":true}]]
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z]     @wraps(func)
[2026-03-10T13:06:50Z]     def wrapper(*args, **kwargs):
[2026-03-10T13:06:50Z]         g, tensors = func(*args, **kwargs)  # Get the result
[2026-03-10T13:06:50Z]         if g.get_execution_plan_count() <= 0:
[2026-03-10T13:06:50Z] >           g.build(heur_modes)  # Build the graph
[2026-03-10T13:06:50Z] E           cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89
[2026-03-10T13:06:50Z]
[2026-03-10T13:06:50Z] /usr/local/lib/python3.12/dist-packages/cudnn/graph.py:68: cudnnGraphNotSupportedError

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @seemethere @malfet @pytorch/pytorch-dev-infra @mruberry

extent analysis

Fix Plan

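The error message cudnn._compiled_module.cudnnGraphNotSupportedError: THD (ragged offset) is only supported in Hopper and above : 89 is raised by cuDNN while it builds its operation graph (the g.build(heur_modes) call in the traceback). It is not related to CUDA graphs. The message says that the ragged THD tensor layout requested by FlashInfer's cudnn_batch_prefill_with_kv_cache requires a Hopper-class GPU or newer. The trailing 89 is most likely the detected compute capability, 8.9, which is below Hopper's 9.0.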

To fix this issue, you can try the following steps:

  • Confirm the GPU Architecture: Check the compute capability of the GPU that runs the test. A value below 9.0 (for example 8.9) means cuDNN cannot build the THD graph on that device, and updating the GPU driver will not change this.
  • Use a Different GPU Architecture: If possible, run the test on a GPU with a Hopper or newer architecture, which the error message names as the requirement for the ragged-offset layout.
  • Skip or Reroute on Older GPUs: On pre-Hopper hardware, skip this FlashInfer/cuDNN test case or use an attention backend that does not go through cudnn_batch_prefill_with_kv_cache. Disabling CUDA graphs does not help, because the failure comes from building a cuDNN graph, which is a separate mechanism.

Here is an example of how to check the compute capability with PyTorch:

import torch

# Returns (major, minor), e.g. (8, 9); the THD path needs major >= 9
major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")

Alternatively, on recent NVIDIA drivers you can query it from the command line:

nvidia-smi --query-gpu=compute_cap --format=csv

Note that skipping the test only avoids the failure on unsupported hardware, so this should be used as a temporary workaround until the test or the wrapper selects a supported code path on pre-Hopper GPUs.
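
Because the error states that the ragged THD layout is only supported on Hopper and above, one option is to guard the test with a compute-capability check. The following is a minimal sketch, not the fix adopted upstream. The helper names supports_thd_ragged and current_capability are hypothetical, and the threshold of major version 9 is an assumption based on reading "Hopper and above" as compute capability 9.0 or higher.

```python
from typing import Optional, Tuple

# Assumption: "Hopper and above" corresponds to compute capability >= 9.0.
HOPPER_MAJOR = 9


def supports_thd_ragged(capability: Optional[Tuple[int, int]]) -> bool:
    """Return True if a (major, minor) compute capability is Hopper or newer.

    None (no usable GPU) is treated as unsupported.
    """
    return capability is not None and capability[0] >= HOPPER_MAJOR


def current_capability() -> Optional[Tuple[int, int]]:
    """Return the active CUDA device's (major, minor), or None if unavailable."""
    try:
        import torch  # imported lazily so the helper also loads without torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    cap = current_capability()
    print("capability:", cap, "THD ragged supported:", supports_thd_ragged(cap))
```

In a test module, the check could be attached to the failing test with pytest.mark.skipif(not supports_thd_ragged(current_capability()), reason="cuDNN THD layout needs Hopper or newer"), so that pre-Hopper runners report the case as skipped instead of failed.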

Verification

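To verify that the fix worked, re-run test_mha_attn_varlen_forward_flashinfer and check that cudnnGraphNotSupportedError no longer appears in the output. On a Hopper or newer GPU the test should run to completion; if you added a skip for older GPUs, the test should be reported there as skipped rather than failed.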

Extra Tips

  • The traceback passes through vLLM (vit_attn_wrappers.py), FlashInfer (flashinfer/cudnn/prefill.py), and the cuDNN Python frontend (cudnn/graph.py), so check the installed versions of all three, not only PyTorch.
  • If you are using a custom attention wrapper, ensure that the backend it selects is supported on your GPU architecture.
  • If the error still occurs on a Hopper or newer GPU, consider adding the GPU model and the package versions to the existing report (pytorch/pytorch#177030) so the maintainers can narrow it down.
