vllm: [Bug] NVCC compilation error when launching DeepSeek-V4-Flash on H100

Error Message

WorkerProc hit an exception. The traceback, reproduced in full further down, ends in these frames:

   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
     tf32_hc_prenorm_gemm(
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/utils/deep_gemm.py", line 477, in tf32_hc_prenorm_gemm
     return _tf32_hc_prenorm_gemm_impl(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 RuntimeError: Assertion error (csrc/apis/../jit_kernels/impls/../../jit/compiler.hpp:228): false and "NVCC compilation failed"

Fix / Workaround

Error:

 WorkerProc hit an exception.
 Traceback (most recent call last):
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
     output = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
     return func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 392, in determine_available_memory
     self.model_runner.profile_run()
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5948, in profile_run
     hidden_states, last_hidden_states = self._dummy_run(
                                         ^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
     return func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5616, in _dummy_run
     outputs = self.model(
               ^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
     return self.runnable(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
     return forward_call(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v4.py", line 1669, in forward
     hidden_states = self.model(
                     ^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 573, in __call__
     output = self.aot_compiled_fn(self, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
     return self.fn(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v4.py", line 1383, in forward
     def forward(
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/compilation/caching.py", line 217, in __call__
     return self.optimized_call(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "<string>", line 177, in execution_fn
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
     return self.runnable(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/compilation/piecewise_backend.py", line 380, in __call__
     return range_entry.runnable(*args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/compilation/compiler_interface.py", line 437, in compiled_graph_wrapper
     graph_output = inductor_compiled_graph(*args)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_inductor/standalone_compile.py", line 122, in __call__
     return self._compiled_fn(*args)
            ^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_inductor/standalone_compile.py", line 236, in <lambda>
     return CacheCompiledArtifact(lambda *args: compiled_fn(list(args)), None)
                                                ^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 580, in runtime_wrapper
     all_outs = call_func_at_runtime_with_args(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
     out = normalize_as_list(f(args))
                             ^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper
     return compiled_fn(runtime_args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_inductor/output_code.py", line 656, in __call__
     return self.current_callable(inputs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_inductor/utils.py", line 3401, in run
     out = model(new_inputs)
           ^^^^^^^^^^^^^^^^^
   File "/root/.cache/vllm/torch_compile_cache/torch_aot_compile/c265206b500283b2290ad76b210e7e274fdb119d03bee3238a72d6a5b78f8eb6/inductor_cache/qc/cqc2lfclyqeoilaasbba2jrlvd27sdkt2fwisoz4mscdf726lo6w.py", line 560, in call
     buf4 = torch.ops.vllm.mhc_pre.default(buf3, arg3_1, arg4_1, arg5_1, 1e-06, 1e-06, 1e-06, 2.0, 20, n_splits=1)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_ops.py", line 865, in __call__
     return self._op(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_compile.py", line 54, in inner
     return disable_fn(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
     return fn(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 409, in __torch_dispatch__
     res = func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_ops.py", line 865, in __call__
     return self._op(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
     tf32_hc_prenorm_gemm(
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/utils/deep_gemm.py", line 477, in tf32_hc_prenorm_gemm
     return _tf32_hc_prenorm_gemm_impl(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 RuntimeError: Assertion error (csrc/apis/../jit_kernels/impls/../../jit/compiler.hpp:228): false and "NVCC compilation failed"
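
The assertion comes from a `jit/compiler.hpp` reached through `vllm/utils/deep_gemm.py`, which suggests a kernel was being compiled with NVCC at run time when the failure occurred. The report does not say why the compilation failed, so the following is only a diagnostic sketch, not a confirmed fix: it checks whether an `nvcc` binary is discoverable and prints its version. The lookup order (`CUDA_HOME`, then `CUDA_PATH`, then `PATH`) and the helper names `find_nvcc` and `nvcc_version` are assumptions made for illustration; the JIT compiler may locate NVCC differently.

```python
import os
import shutil
import subprocess


def find_nvcc(env=None):
    """Return a path to an nvcc binary, or None if none is found.

    Assumed lookup order (hypothetical, for illustration only):
    $CUDA_HOME/bin/nvcc, $CUDA_PATH/bin/nvcc, then the PATH search.
    """
    env = os.environ if env is None else env
    for var in ("CUDA_HOME", "CUDA_PATH"):
        root = env.get(var)
        if root:
            candidate = os.path.join(root, "bin", "nvcc")
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
    # Fall back to a regular PATH lookup.
    return shutil.which("nvcc", path=env.get("PATH"))


def nvcc_version(nvcc_path):
    """Run `nvcc --version` and return whatever text it printed."""
    result = subprocess.run(
        [nvcc_path, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or result.stderr.strip()


if __name__ == "__main__":
    path = find_nvcc()
    if path is None:
        print("nvcc not found via CUDA_HOME, CUDA_PATH, or PATH")
    else:
        print(f"nvcc: {path}")
        print(nvcc_version(path))
```

Running this inside the same environment that launches the server (here, the `xx` conda environment) shows which toolkit, if any, a run-time compile step could see. A missing or unexpectedly old `nvcc` would be worth mentioning in the issue alongside the traceback.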

Code Example

(Environment report not provided: the `python collect_env.py` output was left as the issue template placeholder.)

---

root@glusterfs-07:~# pip list | grep vllm
vllm                                     0.21.0
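
Only the vLLM version is reported above. Since the failing frames also pass through PyTorch and a DeepGEMM wrapper, a fuller version listing may help whoever triages the issue. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical, and the distribution name `deep_gemm` is an assumption that may not match how the package is actually published.

```python
from importlib import metadata


def installed_versions(packages=("vllm", "torch", "deep_gemm")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed under this distribution name.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```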

---

python -m vllm.entrypoints.openai.api_server \
  --model /mnt/nvme1n1/weights/DeepSeek-V4-Flash \
  --served-model-name auto \
  --port 8006 \
  -tp 4 \
  -dp 2 \
  --max-num-seqs 96 \
  --max-model-len 40960 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --kernel-config '{"moe_backend":"auto"}' \
  --kv-cache-dtype fp8 \
  --enable_expert_parallel


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

vllm version:

root@glusterfs-07:~# pip list | grep vllm
vllm                                     0.21.0

Server launch command

python -m vllm.entrypoints.openai.api_server \
  --model /mnt/nvme1n1/weights/DeepSeek-V4-Flash \
  --served-model-name auto \
  --port 8006 \
  -tp 4 \
  -dp 2 \
  --max-num-seqs 96 \
  --max-model-len 40960 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --kernel-config '{"moe_backend":"auto"}' \
  --kv-cache-dtype fp8 \
  --enable_expert_parallel

Error:

 WorkerProc hit an exception.
 Traceback (most recent call last):
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
     output = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
     return func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 392, in determine_available_memory
     self.model_runner.profile_run()
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5948, in profile_run
     hidden_states, last_hidden_states = self._dummy_run(
                                         ^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
     return func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5616, in _dummy_run
     outputs = self.model(
               ^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
     return self.runnable(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
     return forward_call(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v4.py", line 1669, in forward
     hidden_states = self.model(
                     ^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 573, in __call__
     output = self.aot_compiled_fn(self, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
     return self.fn(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v4.py", line 1383, in forward
     def forward(
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/compilation/caching.py", line 217, in __call__
     return self.optimized_call(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "<string>", line 177, in execution_fn
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
     return self.runnable(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/compilation/piecewise_backend.py", line 380, in __call__
     return range_entry.runnable(*args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/compilation/compiler_interface.py", line 437, in compiled_graph_wrapper
     graph_output = inductor_compiled_graph(*args)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_inductor/standalone_compile.py", line 122, in __call__
     return self._compiled_fn(*args)
            ^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_inductor/standalone_compile.py", line 236, in <lambda>
     return CacheCompiledArtifact(lambda *args: compiled_fn(list(args)), None)
                                                ^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 580, in runtime_wrapper
     all_outs = call_func_at_runtime_with_args(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
     out = normalize_as_list(f(args))
                             ^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper
     return compiled_fn(runtime_args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_inductor/output_code.py", line 656, in __call__
     return self.current_callable(inputs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_inductor/utils.py", line 3401, in run
     out = model(new_inputs)
           ^^^^^^^^^^^^^^^^^
   File "/root/.cache/vllm/torch_compile_cache/torch_aot_compile/c265206b500283b2290ad76b210e7e274fdb119d03bee3238a72d6a5b78f8eb6/inductor_cache/qc/cqc2lfclyqeoilaasbba2jrlvd27sdkt2fwisoz4mscdf726lo6w.py", line 560, in call
     buf4 = torch.ops.vllm.mhc_pre.default(buf3, arg3_1, arg4_1, arg5_1, 1e-06, 1e-06, 1e-06, 2.0, 20, n_splits=1)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_ops.py", line 865, in __call__
     return self._op(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_compile.py", line 54, in inner
     return disable_fn(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
     return fn(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 409, in __torch_dispatch__
     res = func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/torch/_ops.py", line 865, in __call__
     return self._op(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
     tf32_hc_prenorm_gemm(
   File "/data/miniconda3/envs/xx/lib/python3.11/site-packages/vllm/utils/deep_gemm.py", line 477, in tf32_hc_prenorm_gemm
     return _tf32_hc_prenorm_gemm_impl(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 RuntimeError: Assertion error (csrc/apis/../jit_kernels/impls/../../jit/compiler.hpp:228): false and "NVCC compilation failed"
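
Only the last few frames of the traceback above carry the signal: `mhc_pre` calls `tf32_hc_prenorm_gemm`, and the call fails on a JIT compiler assertion. As a reading aid, here is a minimal Python sketch that pulls the innermost frames and the final exception line out of a log like this one. The helper name `innermost_frames`, the regular expression, and the shortened `/x/...` paths in the sample are illustrative assumptions, not part of vLLM or DeepGEMM.

```python
import re

# Matches CPython traceback frame lines such as:
#   File "/path/to/file.py", line 477, in tf32_hc_prenorm_gemm
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def innermost_frames(log_text: str, count: int = 3):
    """Return the last `count` frames of the first traceback in `log_text`,
    plus the exception line that ends it (or None if no such line is found)."""
    frames = []
    for raw in log_text.splitlines():
        line = raw.strip()
        match = FRAME_RE.search(line)
        if match:
            frames.append((match["path"], int(match["line"]), match["func"]))
        elif frames and re.match(r"[A-Za-z_][\w.]*(Error|Exception)\b", line):
            # The first exception line after at least one frame ends the traceback.
            return frames[-count:], line
    return frames[-count:], None


if __name__ == "__main__":
    # Shortened, hypothetical paths; the frame names come from the report.
    sample = '''
  File "/x/torch/_ops.py", line 865, in __call__
    return self._op(*args, **kwargs)
  File "/x/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
    tf32_hc_prenorm_gemm(
  File "/x/vllm/utils/deep_gemm.py", line 477, in tf32_hc_prenorm_gemm
    return _tf32_hc_prenorm_gemm_impl(
RuntimeError: Assertion error (csrc/apis/../jit_kernels/impls/../../jit/compiler.hpp:228): false and "NVCC compilation failed"
'''
    frames, error = innermost_frames(sample)
    for path, lineno, func in frames:
        print(f"{func}  ({path}:{lineno})")
    print(error)
```

Stopping at the first exception line keeps the summary short even when a worker log prints the same traceback more than once.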

I found some deployment details for DeepSeek-V4-Flash on vLLM in DeepWiki. However, when I launch the model with the command above, startup fails during the profiling run with the "NVCC compilation failed" error shown in the traceback. Is DeepSeek-V4-Flash expected to work on H100 GPUs? If so, what might be causing the NVCC compilation error?
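
The assertion comes from a JIT compile step, so one thing worth ruling out is whether a working `nvcc` is visible to the worker processes at all. A pip-installed PyTorch typically ships CUDA runtime libraries but not the `nvcc` compiler. The sketch below uses only the Python standard library. The search order (`CUDA_HOME`, `CUDA_PATH`, then `PATH`) is an assumption about common practice, not the confirmed lookup logic of DeepGEMM or vLLM, and the function names are hypothetical. The report does not include environment output, so a missing or mismatched `nvcc` is a possible cause here, not an established one.

```python
import os
import shutil
import subprocess


def nvcc_candidates(env=None, which=shutil.which):
    """List nvcc locations a JIT compiler could plausibly pick up.

    Assumed search order: CUDA_HOME, CUDA_PATH, then PATH."""
    env = os.environ if env is None else env
    found = []
    for var in ("CUDA_HOME", "CUDA_PATH"):
        root = env.get(var)
        if root:
            found.append((var, os.path.join(root, "bin", "nvcc")))
    on_path = which("nvcc")
    if on_path:
        found.append(("PATH", on_path))
    return found


def nvcc_version(path):
    """Return the output of `nvcc --version`, or None if it cannot be run."""
    try:
        result = subprocess.run([path, "--version"], capture_output=True,
                                text=True, timeout=30, check=False)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    candidates = nvcc_candidates()
    if not candidates:
        print("No nvcc found via CUDA_HOME, CUDA_PATH, or PATH.")
    for source, path in candidates:
        print(f"[{source}] {path}")
        print(nvcc_version(path) or "  (could not execute)")
```

Run it in the same conda environment and shell that launch the server, since the workers inherit that environment. If `nvcc` is present, comparing its reported CUDA release with the CUDA version PyTorch was built against is a reasonable next check.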

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
