vllm - [Bug]: CUBLAS_STATUS_EXECUTION_FAILED during CUDA graph compilation of BF16 vision encoder on NVIDIA Jetson AGX Thor (vLLM 0.19.0 regression) [1 participant]

vllm • 2026-04-23 00:14:26


vllm-project/vllm#40661•Fetched 2026-04-23 07:23:34

Author

hvaniya5

Participants

hvaniya5

Error Message

(EngineCore_DP0 pid=111) INFO ... Initializing a V1 LLM engine (v0.19.0)
...
(EngineCore_DP0 pid=110) INFO [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
...
(EngineCore_DP0 pid=110) INFO [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=110) INFO [backends.py:390] Compiling a graph for compile range (1, 2048) takes 48.39 s
(EngineCore_DP0 pid=110) ERROR [core.py:1108]   File ".../vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore_DP0 pid=110) ERROR [core.py:1108]     return super().__torch_function__(func, types, args, kwargs)
(EngineCore_DP0 pid=110) ERROR [core.py:1108] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Fix Action

Workaround

Pinning to the old image (ghcr.io/nvidia-ai-iot/vllm:0.16.0-g15d76f74e-r38.2-arm64-sbsa-cu130-24.04) restores functionality. The model runs correctly and serves inference at ~29–52 tokens/sec depending on prompt length.

Your current environment

Container image: ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor (tag r38.2.arm64-sbsa-cu130-24.04)
vLLM version: 0.19.0
Hardware: NVIDIA Jetson AGX Thor, Blackwell GPU (SM 90+), 122 GB unified LPDDR5x memory, CUDA 13.0
OS: Ubuntu 24.04 (ARM64 SBSA)
Model: Qwen/Qwen3.6-35B-A3B-FP8 (qwen3_5_moe architecture with vision encoder)
Python: 3.12.12

🐛 Describe the bug

vLLM 0.19.0 (Jetson Thor container) crashes with CUBLAS_STATUS_EXECUTION_FAILED during CUDA graph compilation of the vision encoder when loading Qwen3.6-35B-A3B-FP8. This is a regression — the same model, same hardware, and same launch flags work correctly on the previous image (ghcr.io/nvidia-ai-iot/vllm:0.16.0-g15d76f74e-r38.2-arm64-sbsa-cu130-24.04, vLLM 0.16.0rc2).

The crash occurs after model weights are fully loaded, during the CUDA graph compilation phase (Compiling a graph for compile range (1, 2048)). The failing operation is a BF16 GEMM (cublasGemmEx with CUDA_R_16BF) called from the inductor-compiled vision encoder graph.

Reproduction steps

docker run --runtime nvidia \
  -v /path/to/Qwen3.6-35B-A3B-FP8:/model:ro \
  -p 8001:8001 \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor \
  vllm serve /model \
    --served-model-name Qwen3.6-35B-A3B-FP8 \
    --host 0.0.0.0 --port 8001 \
    --max-model-len 32768 \
    --max-num-seqs 2 \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code \
    --dtype auto

Also reproduced with:

--dtype bfloat16 (explicit)
--dtype float16 (different error path, also fails at startup due to memory not freed from crashed container)
--enforce-eager (same crash, happens before CUDA graph capture, during profiling warmup)

Expected behavior

Model loads and serves successfully, as it does on vLLM 0.16.0rc2 with identical flags.

Actual behavior — error log

(EngineCore_DP0 pid=111) INFO ... Initializing a V1 LLM engine (v0.19.0)
...
(EngineCore_DP0 pid=110) INFO [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
...
(EngineCore_DP0 pid=110) INFO [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=110) INFO [backends.py:390] Compiling a graph for compile range (1, 2048) takes 48.39 s
(EngineCore_DP0 pid=110) ERROR [core.py:1108]   File ".../vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore_DP0 pid=110) ERROR [core.py:1108]     return super().__torch_function__(func, types, args, kwargs)
(EngineCore_DP0 pid=110) ERROR [core.py:1108] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Full failing call site:

extern_kernels.mm(buf1, reinterpret_tensor(arg6_1, (2048, 64), (1, 2048), 0), out=buf10)
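To check whether this GEMM fails outside vLLM, the call site above can be approximated with plain PyTorch. This is a minimal sketch under stated assumptions, not the reporter's code: the row count `m` of `buf1` does not appear in the issue, so `m=16` is a placeholder, and `gemm_shapes` and `run_repro` are hypothetical helper names. The second operand is built as a transposed view of a contiguous (64, 2048) tensor, which gives size (2048, 64) and stride (1, 2048) as in the `reinterpret_tensor` call.

```python
K, N = 2048, 64  # from reinterpret_tensor(arg6_1, (2048, 64), (1, 2048), 0)


def gemm_shapes(m):
    """Return (lhs, rhs, out) shapes for mm(buf1, rhs_view, out=buf10)."""
    return (m, K), (K, N), (m, N)


def run_repro(m=16):
    """Run one BF16 matmul shaped like the failing call (needs CUDA PyTorch)."""
    import torch  # imported lazily so gemm_shapes works without PyTorch

    lhs_shape, rhs_shape, out_shape = gemm_shapes(m)
    lhs = torch.randn(lhs_shape, device="cuda", dtype=torch.bfloat16)
    # Contiguous (N, K) storage viewed as (K, N): the stride becomes (1, K).
    weight = torch.randn((N, K), device="cuda", dtype=torch.bfloat16)
    rhs = weight.t()
    assert tuple(rhs.shape) == rhs_shape and rhs.stride() == (1, K)
    out = torch.empty(out_shape, device="cuda", dtype=torch.bfloat16)
    torch.mm(lhs, rhs, out=out)
    torch.cuda.synchronize()  # surface asynchronous CUDA errors here
    return out
```

Calling `run_repro()` inside both the 0.16.0 and 0.19.0 containers would show whether the failure follows the image. A clean run would only suggest that this shape and layout alone are not the trigger; the Inductor-generated graph may differ in alignment or surrounding state.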

Working configuration (for comparison)

# Same flags, same model, same hardware — works with old image:
docker run --runtime nvidia \
  -v /path/to/Qwen3.6-35B-A3B-FP8:/model:ro \
  ghcr.io/nvidia-ai-iot/vllm:0.16.0-g15d76f74e-r38.2-arm64-sbsa-cu130-24.04 \
  vllm serve /model --dtype auto --max-model-len 32768 ...
# → Application startup complete. ✅

Additional context

The crash occurs during the vision encoder CUDA graph compilation (compile range (1, 2048) for the MM encoder), not in the MoE language model layers.
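The second operand of the failing BF16 GEMM is a (2048, 64) view with strides (1, 2048), that is, a transposed view of a contiguous (64, 2048) tensor, based on the reinterpret_tensor call. With mm semantics this makes the product (M, 2048) × (2048, 64); the row count M of buf1 is not shown in the log.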
The --enforce-eager flag does not prevent the crash — it fails at the same point (vision encoder warmup profiling), confirming this is not purely a CUDA graph capture issue.
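Jetson Thor uses a Blackwell GPU (reported here as SM 90a, although SM 90 is the Hopper designation and NVIDIA lists Jetson Thor as compute capability 11.0). The CUBLAS_STATUS_EXECUTION_FAILED on BF16 GEMM may be related to Inductor-compiled code generating a GEMM shape or alignment not supported by cuBLAS 13.0 on this architecture.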
A non-fatal warning also appears during 0.19.0 startup and may be related: Not enough SMs to use max_autotune_gemm mode
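The operand layout mentioned in the list above can be read directly from the size and stride arguments of `reinterpret_tensor`. The following is a small sketch in plain Python, with no PyTorch required; `describe_layout` and `mm_dims` are hypothetical helpers written for illustration, not PyTorch or Inductor APIs.

```python
def describe_layout(size, stride):
    """Classify a 2-D (size, stride) pair as row-major, column-major, or other."""
    rows, cols = size
    if stride == (cols, 1):
        return "row-major"
    if stride == (1, rows):
        # Same memory as a contiguous (cols, rows) matrix, viewed transposed.
        return "column-major"
    return "other"


def mm_dims(rhs_size):
    """For mm(lhs, rhs), rhs is (K, N); lhs must be (M, K) and out is (M, N)."""
    k, n = rhs_size
    return {"K": k, "N": n}


# Values taken from reinterpret_tensor(arg6_1, (2048, 64), (1, 2048), 0).
rhs_size, rhs_stride = (2048, 64), (1, 2048)
layout = describe_layout(rhs_size, rhs_stride)
dims = mm_dims(rhs_size)
print(layout, dims)
```

Read this way, the second operand is a column-major (2048, 64) matrix, so the reduction dimension is 2048 and the output has 64 columns.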

extent analysis

TL;DR

No fix is confirmed yet. The most likely directions for resolving the CUBLAS_STATUS_EXECUTION_FAILED error in the vision encoder are a cuBLAS update or a change to the GEMM dimensions or alignment that Inductor emits, so that the call is compatible with the Jetson Thor GPU. Until then, pinning to the 0.16.0 image works.

Guidance

Check whether the failure is specific to Jetson Thor and the cuBLAS shipped with CUDA 13.0 by running the same model on different hardware or against a different cuBLAS build.
Investigate whether changing the GEMM dimensions or operand alignment avoids the failing cuBLAS path.
Consider a newer cuBLAS release, which may handle the required GEMM shapes and alignments.
As a temporary workaround, pin to the old image (ghcr.io/nvidia-ai-iot/vllm:0.16.0-g15d76f74e-r38.2-arm64-sbsa-cu130-24.04), which restores functionality.
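When comparing the two images or different hardware as suggested above, it helps to record what PyTorch reports in each environment. The sketch below uses only standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.get_device_name`, `torch.cuda.get_device_capability`); `collect_gpu_env` is a hypothetical helper name, and the snippet does not report the cuBLAS version itself.

```python
def collect_gpu_env():
    """Collect the PyTorch/CUDA facts relevant to this report, if available."""
    info = {"torch": None, "cuda_runtime": None, "device": None, "capability": None}
    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed in this environment
    info["torch"] = torch.__version__
    info["cuda_runtime"] = torch.version.cuda  # None on CPU-only builds
    if torch.cuda.is_available():
        info["device"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)  # (major, minor)
    return info


env = collect_gpu_env()
for key, value in env.items():
    print(f"{key}: {value}")
```

Running this in both the working and the failing container gives a side-by-side record of the PyTorch build, CUDA runtime, and compute capability, which narrows down what changed between the images.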

Notes

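The issue appears to be related to Inductor-compiled code generating a GEMM shape or alignment that cuBLAS 13.0 does not support on Jetson Thor, but this is unconfirmed. Both the working and the failing image carry the cu130 tag, so the CUDA major version did not change between them; the regression may therefore lie in vLLM, PyTorch, or the generated code rather than in cuBLAS alone. Further investigation is needed to determine the root cause and develop a permanent fix.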

Recommendation

Pin to the old image (ghcr.io/nvidia-ai-iot/vllm:0.16.0-g15d76f74e-r38.2-arm64-sbsa-cu130-24.04) until a permanent fix is available, so the model can keep serving inference while the issue is investigated.
