vllm - [Bug]: CUBLAS_STATUS_EXECUTION_FAILED during CUDA graph compilation of BF16 vision encoder on NVIDIA Jetson AGX Thor (vLLM 0.19.0 regression)

vllm-project/vllm#40661 (fetched 2026-04-23 07:23:34)

Your current environment

  • Container image: ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor (tag r38.2.arm64-sbsa-cu130-24.04)
  • vLLM version: 0.19.0
  • Hardware: NVIDIA Jetson AGX Thor, Blackwell GPU (SM 90+), 122 GB unified LPDDR5x memory, CUDA 13.0
  • OS: Ubuntu 24.04 (ARM64 SBSA)
  • Model: Qwen/Qwen3.6-35B-A3B-FP8 (qwen3_5_moe architecture with vision encoder)
  • Python: 3.12.12

🐛 Describe the bug

vLLM 0.19.0 (Jetson Thor container) crashes with CUBLAS_STATUS_EXECUTION_FAILED during CUDA graph compilation of the vision encoder when loading Qwen3.6-35B-A3B-FP8. This is a regression — the same model, same hardware, and same launch flags work correctly on the previous image (ghcr.io/nvidia-ai-iot/vllm:0.16.0-g15d76f74e-r38.2-arm64-sbsa-cu130-24.04, vLLM 0.16.0rc2).

The crash occurs after model weights are fully loaded, during the CUDA graph compilation phase (Compiling a graph for compile range (1, 2048)). The failing operation is a BF16 GEMM (cublasGemmEx with CUDA_R_16BF) called from the inductor-compiled vision encoder graph.

Reproduction steps

docker run --runtime nvidia \
  -v /path/to/Qwen3.6-35B-A3B-FP8:/model:ro \
  -p 8001:8001 \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor \
  vllm serve /model \
    --served-model-name Qwen3.6-35B-A3B-FP8 \
    --host 0.0.0.0 --port 8001 \
    --max-model-len 32768 \
    --max-num-seqs 2 \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code \
    --dtype auto

Also reproduced with:

  • --dtype bfloat16 (explicit)
  • --dtype float16 (different error path, also fails at startup due to memory not freed from crashed container)
  • --enforce-eager (same crash, happens before CUDA graph capture, during profiling warmup)

Expected behavior

Model loads and serves successfully, as it does on vLLM 0.16.0rc2 with identical flags.

Actual behavior — error log

(EngineCore_DP0 pid=111) INFO ... Initializing a V1 LLM engine (v0.19.0)
...
(EngineCore_DP0 pid=110) INFO [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
...
(EngineCore_DP0 pid=110) INFO [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=110) INFO [backends.py:390] Compiling a graph for compile range (1, 2048) takes 48.39 s
(EngineCore_DP0 pid=110) ERROR [core.py:1108]   File ".../vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore_DP0 pid=110) ERROR [core.py:1108]     return super().__torch_function__(func, types, args, kwargs)
(EngineCore_DP0 pid=110) ERROR [core.py:1108] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Full failing call site:

extern_kernels.mm(buf1, reinterpret_tensor(arg6_1, (2048, 64), (1, 2048), 0), out=buf10)

Working configuration (for comparison)

# Same flags, same model, same hardware — works with old image:
docker run --runtime nvidia \
  -v /path/to/Qwen3.6-35B-A3B-FP8:/model:ro \
  ghcr.io/nvidia-ai-iot/vllm:0.16.0-g15d76f74e-r38.2-arm64-sbsa-cu130-24.04 \
  vllm serve /model --dtype auto --max-model-len 32768 ...
# → Application startup complete. ✅

Additional context

  • The crash occurs during the vision encoder CUDA graph compilation (compile range (1, 2048) for the MM encoder), not in the MoE language model layers.
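  • Based on the reinterpret_tensor call, the right-hand operand of the failing BF16 GEMM is a (2048, 64) view with strides (1, 2048), i.e. a transposed view of a contiguous (64, 2048) tensor. The shape of the left operand (buf1) is not shown in the log.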
  • The --enforce-eager flag does not prevent the crash — it fails at the same point (vision encoder warmup profiling), confirming this is not purely a CUDA graph capture issue.
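  • Jetson Thor uses a Blackwell GPU. The report labels it SM 90a, although that designation normally refers to Hopper, so the actual compute capability should be confirmed. The CUBLAS_STATUS_EXECUTION_FAILED on BF16 GEMM may be related to inductor-compiled code generating a GEMM shape or alignment not supported by cuBLAS 13.0 on this architecture.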
  • A non-fatal warning also appears during 0.19.0 startup and may be related: Not enough SMs to use max_autotune_gemm mode
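
The failing call could be exercised outside vLLM to check whether cuBLAS itself rejects this operand layout. The sketch below is a reconstruction under stated assumptions, not the code vLLM runs. In inductor-generated code, reinterpret_tensor(base, size, stride, offset) describes a strided view, so the right-hand operand is taken to be a (2048, 64) view with strides (1, 2048). The row count m of the left operand is not in the log, so the default of 2048 (the upper end of the compile range) is a guess. The helper names contiguous_strides, is_transposed_view, and run_gemm_repro are hypothetical.

```python
"""Standalone sketch of the failing BF16 GEMM (assumptions flagged inline)."""

# From the log: reinterpret_tensor(arg6_1, (2048, 64), (1, 2048), 0)
B_SIZE = (2048, 64)
B_STRIDE = (1, 2048)


def contiguous_strides(size):
    """Return row-major strides for a tensor of the given size."""
    strides, step = [], 1
    for dim in reversed(size):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_transposed_view(size, stride):
    """True if (size, stride) is the transpose of a contiguous 2-D tensor."""
    return tuple(reversed(stride)) == contiguous_strides(tuple(reversed(size)))


def run_gemm_repro(m=2048):
    """Run mm(a, b) with b laid out as in the log.

    m is an assumption: the left operand's row count is not in the log.
    Requires PyTorch with a CUDA device.
    """
    import torch  # imported lazily so the helpers above work without PyTorch

    a = torch.randn(m, B_SIZE[0], dtype=torch.bfloat16, device="cuda")
    weight = torch.randn(B_SIZE[1], B_SIZE[0], dtype=torch.bfloat16, device="cuda")
    b = torch.as_strided(weight, B_SIZE, B_STRIDE, 0)  # same view as in the log
    out = torch.mm(a, b)
    torch.cuda.synchronize()  # surface asynchronous CUDA errors at this point
    return tuple(out.shape)
```

Calling run_gemm_repro() inside both the 0.16.0 and 0.19.0 containers would indicate whether the failure reproduces with a plain torch.mm or only through the inductor-compiled graph. If the plain call succeeds in both, the operand layout alone is probably not the trigger.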

Workaround

Pinning to the old image (ghcr.io/nvidia-ai-iot/vllm:0.16.0-g15d76f74e-r38.2-arm64-sbsa-cu130-24.04) restores functionality. The model runs correctly and serves inference at ~29–52 tokens/sec depending on prompt length.
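
To keep the pin explicit in deployment scripts, the image tag can be held in one place and the command built from it. This is a minimal sketch, assuming the flags from the reproduction steps apply unchanged to the pinned image (the report says the flags are identical, but its working command is abbreviated). The names PINNED_IMAGE, FAILING_IMAGE, and docker_command are hypothetical, and the model path is the placeholder from the report.

```python
import shlex

# Image tags copied from the report.
PINNED_IMAGE = "ghcr.io/nvidia-ai-iot/vllm:0.16.0-g15d76f74e-r38.2-arm64-sbsa-cu130-24.04"
FAILING_IMAGE = "ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor"


def docker_command(image, model_dir="/path/to/Qwen3.6-35B-A3B-FP8", port=8001):
    """Build the docker argv from the reproduction steps for a given image."""
    return [
        "docker", "run", "--runtime", "nvidia",
        "-v", f"{model_dir}:/model:ro",
        "-p", f"{port}:{port}",
        image,
        "vllm", "serve", "/model",
        "--served-model-name", "Qwen3.6-35B-A3B-FP8",
        "--host", "0.0.0.0", "--port", str(port),
        "--max-model-len", "32768",
        "--max-num-seqs", "2",
        "--gpu-memory-utilization", "0.85",
        "--trust-remote-code",
        "--dtype", "auto",
    ]


# Print a copy-pasteable shell command that uses the pinned image.
print(shlex.join(docker_command(PINNED_IMAGE)))
```

Passing FAILING_IMAGE instead produces the command from the reproduction steps, so the two runs differ only in the image tag.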

extent analysis

TL;DR

No root cause has been confirmed. The leading hypothesis is that inductor-compiled code issues a BF16 GEMM whose shape or alignment cuBLAS 13.0 does not handle on this GPU, which would point to updating cuBLAS or changing the GEMM layout. The only verified remedy so far is pinning to the 0.16.0 image.

Guidance

  • Verify whether the issue is specific to Jetson Thor's Blackwell GPU and cuBLAS 13.0 by testing on different hardware or with a different cuBLAS version.
  • Investigate modifying the GEMM dimensions or alignment to be compatible with cuBLAS 13.0 on the Blackwell GPU architecture.
  • Consider updating to a newer version of cuBLAS that may support the required GEMM shapes and alignments.
  • As a temporary workaround, pinning to the old image (ghcr.io/nvidia-ai-iot/vllm:0.16.0-g15d76f74e-r38.2-arm64-sbsa-cu130-24.04) restores functionality.
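
Before comparing hardware or cuBLAS versions, it helps to record what each container actually reports. The following is a minimal sketch, assuming PyTorch with CUDA support is importable inside the container. The helper names sm_name and probe_gpu are hypothetical, and only standard torch.cuda queries are used.

```python
def sm_name(capability):
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def probe_gpu(device=0):
    """Collect facts worth attaching to the bug report.

    Requires PyTorch with a visible CUDA device.
    """
    import torch  # lazy import: sm_name() stays usable without PyTorch

    props = torch.cuda.get_device_properties(device)
    return {
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,
        "device": props.name,
        "arch": sm_name(torch.cuda.get_device_capability(device)),
        "sm_count": props.multi_processor_count,
        "bf16_supported": torch.cuda.is_bf16_supported(),
    }
```

Running probe_gpu() in both the 0.16.0 and 0.19.0 images would show whether the PyTorch or CUDA runtime versions differ between them. The reported SM count is also relevant to the max_autotune_gemm warning seen at startup.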

Notes

The issue appears to be related to the inductor-compiled code generating a GEMM shape or alignment not supported by cuBLAS 13.0 on the Blackwell GPU architecture. Further investigation is needed to determine the root cause and develop a permanent fix.

Recommendation

Apply the workaround by pinning to the old image (ghcr.io/nvidia-ai-iot/vllm:0.16.0-g15d76f74e-r38.2-arm64-sbsa-cu130-24.04) until a permanent fix is available. This will allow the model to run correctly and serve inference while the issue is being investigated and resolved.
