vllm - 💡(How to fix) Fix [Bug]: Vision encoder crashes on SM100 (Jetson Thor) — `_vllm_fa2_C` compiled for SM80-only, no override available for vision encoder [5 comments, 4 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#38411Fetched 2026-04-08 01:41:32
View on GitHub
Comments
5
Participants
4
Timeline
11
Reactions
0
Author
Timeline (top)
commented ×5subscribed ×2cross-referenced ×1labeled ×1

Error Message

File ".../vllm/model_executor/layers/attention/mm_encoder_attention.py", line 402, in forward_cuda return self._forward_fa(query, key, value, cu_seqlens, max_seqlen) File ".../vllm/v1/attention/ops/vit_attn_wrappers.py", line 51, in flash_attn_maxseqlen_wrapper output = flash_attn_varlen_func(...) File ".../vllm/vllm_flash_attn/flash_attn_interface.py", line 300, in flash_attn_varlen_func out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(...) torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain. (cudaErrorUnsupportedPtxVersion)

Root Cause

The issue is narrow and reproducible: we know that language model path works on this hardware because the setting --attention-backend TRITON_ATTN hides the SM80-compiled _vllm_fa2_C library from running. However, the vision encoder path has no equivalent escape, so any multimodal model that uses MMEncoderAttention crashes unconditionally at startup on SM100 hardware during the profiling pass before the server even starts accepting requests.

Fix Action

Fix / Workaround

Workaround

  • The NGC vLLM 26.02 container (nvcr.io/nvidia/vllm:26.02-py3) does not support qwen3_5_moe architecture (transformers version too old), so switching containers is not a viable workaround.
  • The NVIDIA NIM container (nvcr.io/nim/qwen/qwen3.5-35b-a3b:1.7.0-variant) hits a different but related issue: ptxas-blackwell does not recognize sm_110a, suggesting the Thor sm variant is undertested across the ecosystem.
  • SGLang's FA4 backend explicitly supports SM100 without this issue and may be the longer-term solution, but vLLM is the stable serving stack for this hardware configuration.

Code Example

File ".../vllm/model_executor/layers/attention/mm_encoder_attention.py", line 402, in forward_cuda
    return self._forward_fa(query, key, value, cu_seqlens, max_seqlen)
File ".../vllm/v1/attention/ops/vit_attn_wrappers.py", line 51, in flash_attn_maxseqlen_wrapper
    output = flash_attn_varlen_func(...)
File ".../vllm/vllm_flash_attn/flash_attn_interface.py", line 300, in flash_attn_varlen_func
    out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(...)
torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.
(cudaErrorUnsupportedPtxVersion)

---

vllm serve Qwen/Qwen3.5-35B-A3B \
  --attention-backend TRITON_ATTN \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image":4}'
RAW_BUFFERClick to expand / collapse

Your current environment

Human Context

I'm running vLLM on an NVIDIA AGX Thor Developer Kit (built on the Blackwell architecture, SM110a, and listed as a supported production target in NVIDIA's NGC container release notes.

The issue is narrow and reproducible: we know that language model path works on this hardware because the setting --attention-backend TRITON_ATTN hides the SM80-compiled _vllm_fa2_C library from running. However, the vision encoder path has no equivalent escape, so any multimodal model that uses MMEncoderAttention crashes unconditionally at startup on SM100 hardware during the profiling pass before the server even starts accepting requests.

I've verified this across three different container environments (jetson-containers vLLM 0.19.0, NGC vLLM 26.02, and the NVIDIA NIM container), and the failure is consistent and deterministic. The fix appears conceptually straightforward — extend the attention backend override to MMEncoderAttention the same way it already works for the language model — though I recognize I may be missing architectural constraints that make this harder than it looks.

AI

Environment

  • vLLM version: 0.19.0
  • Hardware: NVIDIA AGX Thor Developer Kit (SM110a / Blackwell)
  • CUDA: 13.2
  • OS: Ubuntu 24.04 aarch64 SBSA / L4T r39.0 / JetPack 7.2
  • Model: Qwen/Qwen3.5-35B-A3B (multimodal, qwen3_5_moe architecture)
  • Container: jetson-containers vLLM build

Bug description

The language model path serves correctly on SM100 using --attention-backend TRITON_ATTN, which bypasses the SM80-compiled _vllm_fa2_C library. However, the vision encoder (MMEncoderAttention) unconditionally calls _vllm_fa2_C.varlen_fwd during the KV cache profiling pass (profile_run → embed_multimodal), with no equivalent backend override available. This causes a hard crash at startup on any SM100 device.

Error

File ".../vllm/model_executor/layers/attention/mm_encoder_attention.py", line 402, in forward_cuda
    return self._forward_fa(query, key, value, cu_seqlens, max_seqlen)
File ".../vllm/v1/attention/ops/vit_attn_wrappers.py", line 51, in flash_attn_maxseqlen_wrapper
    output = flash_attn_varlen_func(...)
File ".../vllm/vllm_flash_attn/flash_attn_interface.py", line 300, in flash_attn_varlen_func
    out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(...)
torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.
(cudaErrorUnsupportedPtxVersion)

Root cause

_vllm_fa2_C is a pre-compiled flash attention library targeting SM80. On SM100 it fails with cudaErrorUnsupportedPtxVersion. The language model path avoids this via --attention-backend TRITON_ATTN, which routes through Triton JIT instead. MMEncoderAttention has no equivalent escape hatch — forward_cuda calls _forward_fa unconditionally regardless of the configured attention backend.

Workaround

--language-model-only disables vision entirely and allows the model to serve text successfully. This is not acceptable for multimodal use cases.

Expected behavior

MMEncoderAttention should respect a backend override (e.g. --mm-attention-backend triton or falling back to Triton when _vllm_fa2_C is unavailable for the target architecture), consistent with how the language model attention path handles SM100.

Steps to reproduce

vllm serve Qwen/Qwen3.5-35B-A3B \
  --attention-backend TRITON_ATTN \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image":4}'

Crashes during profile_run before serving begins. Reproducible on every startup on SM110a.

Additional context

  • The NGC vLLM 26.02 container (nvcr.io/nvidia/vllm:26.02-py3) does not support qwen3_5_moe architecture (transformers version too old), so switching containers is not a viable workaround.
  • The NVIDIA NIM container (nvcr.io/nim/qwen/qwen3.5-35b-a3b:1.7.0-variant) hits a different but related issue: ptxas-blackwell does not recognize sm_110a, suggesting the Thor sm variant is undertested across the ecosystem.
  • SGLang's FA4 backend explicitly supports SM100 without this issue and may be the longer-term solution, but vLLM is the stable serving stack for this hardware configuration.

🐛 Describe the bug

Human Context

I'm running vLLM on an NVIDIA AGX Thor Developer Kit (built on the Blackwell architecture, SM110a, and listed as a supported production target in NVIDIA's NGC container release notes.

The issue is narrow and reproducible: we know that language model path works on this hardware because the setting --attention-backend TRITON_ATTN hides the SM80-compiled _vllm_fa2_C library from running. However, the vision encoder path has no equivalent escape, so any multimodal model that uses MMEncoderAttention crashes unconditionally at startup on SM100 hardware during the profiling pass before the server even starts accepting requests.

I've verified this across three different container environments (jetson-containers vLLM 0.19.0, NGC vLLM 26.02, and the NVIDIA NIM container), and the failure is consistent and deterministic. The fix appears conceptually straightforward — extend the attention backend override to MMEncoderAttention the same way it already works for the language model — though I recognize I may be missing architectural constraints that make this harder than it looks.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To resolve the issue, we need to extend the attention backend override to MMEncoderAttention similar to how it works for the language model. Here are the steps:

  • Modify the mm_encoder_attention.py file to respect the --attention-backend flag.
  • Add a conditional check to use Triton JIT when _vllm_fa2_C is unavailable for the target architecture.

Code Changes

# mm_encoder_attention.py

def forward_cuda(self, query, key, value, cu_seqlens, max_seqlen):
    if self.config.attention_backend == 'TRITON_ATTN':
        # Use Triton JIT
        return self._forward_triton(query, key, value, cu_seqlens, max_seqlen)
    else:
        # Use _vllm_fa2_C
        return self._forward_fa(query, key, value, cu_seqlens, max_seqlen)
# Add a new flag to the config
class Config:
    def __init__(self, attention_backend='DEFAULT'):
        self.attention_backend = attention_backend
# Update the command line argument parser
parser.add_argument('--mm-attention-backend', type=str, default='DEFAULT')

Verification

To verify the fix, run the following command:

vllm serve Qwen/Qwen3.5-35B-A3B \
  --attention-backend TRITON_ATTN \
  --mm-attention-backend TRITON_ATTN \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image":4}'

If the model serves correctly without crashing, the fix is successful.

Extra Tips

  • Make sure to update the documentation to reflect the new --mm-attention-backend flag.
  • Consider adding a fallback mechanism to use Triton JIT when _vllm_fa2_C is unavailable for the target architecture.
  • Test the fix across different container environments to ensure consistency.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING