vllm - ✅(Solved) Fix [Bug]: RMSNormGated input_guard breaks torch.compile dynamo tracing [1 pull request, 1 comment, 1 participant]

vllm • 2026-04-26 14:33:11

GitHub stats

vllm-project/vllm#40919•Fetched 2026-04-27 05:29:19

Author

izhuhaoran

Participants

izhuhaoran

Timeline (top)

commented ×1, cross-referenced ×1, labeled ×1

Error Message

WorkerProc hit an exception.
Traceback (most recent call last):
  File "/workspace/zhr/vllm_tmp/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
    output = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/zhr/vllm_tmp/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
    self.model_runner.profile_run()
  File "/workspace/zhr/vllm_tmp/vllm/v1/worker/gpu_model_runner.py", line 5840, in profile_run
    hidden_states, last_hidden_states = self._dummy_run(
                                        ^^^^^^^^^^^^^^^^
  File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/zhr/vllm_tmp/vllm/v1/worker/gpu_model_runner.py", line 5529, in _dummy_run
    outputs = self.model(
              ^^^^^^^^^^^
  File "/workspace/zhr/vllm_tmp/vllm/compilation/cuda_graph.py", line 254, in __call__
    return self.runnable(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/zhr/vllm_tmp/vllm/model_executor/models/qwen3_5.py", line 695, in forward
    hidden_states = self.language_model.model(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/zhr/vllm_tmp/vllm/compilation/decorators.py", line 638, in __call__
    output = TorchCompileWithNoGuardsWrapper.__call__(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/zhr/vllm_tmp/vllm/compilation/wrapper.py", line 197, in __call__
    return self._call_with_optional_nvtx_range(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/zhr/vllm_tmp/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
    return callable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1034, in compile_wrapper
    raise e.with_traceback(None) from e.__cause__  # User compiler error
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Attempted to inline function marked as skipped
  Explanation: Dynamo developers have intentionally marked that the function device_index.__init__ should not be traced.
  Hint: Avoid calling the function device_index.__init__.
  Hint: Apply @torch._dynamo.dont_skip_tracing to the function device_index.__init__ to force tracing into the function. More graph breaks may occur as a result of attempting to trace into the function.
  Hint: Please file an issue to PyTorch.

  Developer debug context: qualname: device_index.__init__, name: __init__, filename: /home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/accelerator/__init__.py, skip reason: skipped according trace_rules.lookup MOD_SKIPLIST

  For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0008.html

from user code:
  File "/workspace/zhr/vllm_tmp/vllm/model_executor/models/qwen3_next.py", line 518, in forward
    hidden_states, residual = layer(
  File "/workspace/zhr/vllm_tmp/vllm/model_executor/models/qwen3_next.py", line 408, in forward
    self.linear_attn(
  File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 517, in forward
    self._forward_method(hidden_states, output)
  File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 592, in forward_cuda
    core_attn_out = self.norm(core_attn_out, z)
  File "/workspace/zhr/vllm_tmp/vllm/model_executor/custom_op.py", line 136, in forward
    return self._forward_method(*args, **kwargs)
  File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/layernorm.py", line 510, in forward_cuda
    return rmsnorm_fn(
  File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/fla/ops/utils.py", line 110, in wrapper
    ctx = torch.accelerator.device_index(tensor.device.index)

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
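The two environment variables named at the end of the error message can be set from a small launcher before re-running the failing command. The following is only a sketch: the helper name `dynamo_debug_env` is hypothetical, and only the variable names and values are taken from the message above.

```python
import os


def dynamo_debug_env(base=None):
    """Return a copy of the environment with Dynamo's extra diagnostics enabled.

    `base` defaults to the current process environment; the original mapping
    is never modified.
    """
    env = dict(os.environ if base is None else base)
    env["TORCHDYNAMO_VERBOSE"] = "1"  # ask Dynamo for its internal stack trace
    env["TORCH_LOGS"] = "+dynamo"     # additional developer-level Dynamo logs
    return env


if __name__ == "__main__":
    env = dynamo_debug_env()
    # Show only the two variables this helper is responsible for.
    print({k: env[k] for k in ("TORCHDYNAMO_VERBOSE", "TORCH_LOGS")})
```

The returned mapping can be passed as the `env=` argument of `subprocess.run` when relaunching the server, so the current shell's environment stays untouched.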

Root Cause

This appears to be caused by input_guard calling torch.accelerator.device_index(...), which Dynamo cannot trace.

Fix Action

Fix proposed (not yet merged)

Proposed in PR: Bugfix: fix RMSNormGated input_guard torch.compile dynamo tracing on CUDA (https://github.com/vllm-project/vllm/pull/40921). The PR was still open and unmerged when this page was fetched.

PR fix notes

PR #40921: Bugfix: fix RMSNormGated input_guard torch.compile dynamo tracing on CUDA

Repository: vllm-project/vllm
Author: izhuhaoran
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40921

Description (problem / solution / changelog)

Purpose

This PR fixes #40919, a Dynamo failure at startup when serving Qwen3.5.

vllm/model_executor/layers/fla/ops/utils.py::input_guard currently uses:

torch.accelerator.device_index(tensor.device.index)

On PyTorch 2.11, Dynamo cannot trace device_index.__init__ in fullgraph mode, causing engine initialization to fail during the profiling run.

The fix keeps the generic accelerator path for non-CUDA devices (like tpu), but uses the original CUDA-specific context manager for CUDA tensors.
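The branching described above can be sketched as follows. This is not the PR's diff, which is not reproduced on this page. The factory `make_input_guard` and its two parameters are hypothetical, introduced so the dispatch can be exercised without a GPU; the sketch also omits anything else the real `input_guard` may do to its arguments.

```python
import contextlib
import functools
from typing import Any, Callable


def make_input_guard(cuda_ctx: Callable[[Any], Any], accel_ctx: Callable[[Any], Any]):
    """Build an input_guard-style decorator (hypothetical factory, not vLLM's API).

    cuda_ctx:  factory returning the CUDA-specific device context manager.
    accel_ctx: factory returning the generic accelerator context manager.
    """

    def input_guard(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # The first tensor-like argument decides which device to activate.
            candidates = (*args, *kwargs.values())
            tensor = next((a for a in candidates if hasattr(a, "device")), None)
            if tensor is None:
                ctx = contextlib.nullcontext()
            elif tensor.device.type == "cuda":
                # CUDA tensors: use the CUDA-specific context manager.
                ctx = cuda_ctx(tensor.device.index)
            else:
                # Other devices: keep the generic accelerator path.
                ctx = accel_ctx(tensor.device.index)
            with ctx:
                return fn(*args, **kwargs)

        return wrapper

    return input_guard
```

With PyTorch, the two factories would presumably be `torch.cuda.device` and `torch.accelerator.device_index`. That pairing is an assumption based on the PR description, not something confirmed from its changes.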

Changed files

vllm/model_executor/layers/fla/ops/utils.py (modified, +4/-1)

Code Example

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.2 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.31.6
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu129
Is debug build               : False
CUDA used to build PyTorch   : 12.9
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-5.10.134-013.5.kangaroo.al8.x86_64-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : 
GPU 0: NVIDIA H20-3e
GPU 1: NVIDIA H20-3e
GPU 2: NVIDIA H20-3e
GPU 3: NVIDIA H20-3e
GPU 4: NVIDIA H20-3e
GPU 5: NVIDIA H20-3e
GPU 6: NVIDIA H20-3e
GPU 7: NVIDIA H20-3e

Nvidia driver version        : 570.133.20
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.10.2
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             160
On-line CPU(s) list:                0-159
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Processor
CPU family:                         6
Model:                              207
Thread(s) per core:                 1
Core(s) per socket:                 80
Socket(s):                          2
Stepping:                           2
BogoMIPS:                           5600.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd avx512vbmi umip pku waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          3.8 MiB (80 instances)
L1i cache:                          2.5 MiB (80 instances)
L2 cache:                           160 MiB (80 instances)
L3 cache:                           640 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-79
NUMA node1 CPU(s):                  80-159
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:           Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Vulnerable
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.3.5
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.17.1.4
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-cufile-cu12==1.14.1.1
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.28.9
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+cu129
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu129
[pip3] torchvision==0.26.0+cu129
[pip3] transformers==5.6.2
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.2rc1.dev195+g21792520e (git sha: 21792520e)
vLLM Build Flags:
  CUDA Archs: 8.0;8.6;8.9;9.0;10.0;10.1;10.3;12.0; ROCm: Disabled; XPU: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	PIX	PHB	SYS	SYS	0-79	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	PXB	PHB	SYS	SYS	0-79	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	PHB	PIX	SYS	SYS	0-79	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	PHB	PXB	SYS	SYS	0-79	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	PIX	PHB	80-159	1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	PXB	PHB	80-159	1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	PHB	PIX	80-159	1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	PHB	PXB	80-159	1		N/A
NIC0	PIX	PXB	PHB	PHB	SYS	SYS	SYS	SYS	 X 	PHB	SYS	SYS				
NIC1	PHB	PHB	PIX	PXB	SYS	SYS	SYS	SYS	PHB	 X 	SYS	SYS				
NIC2	SYS	SYS	SYS	SYS	PIX	PXB	PHB	PHB	SYS	SYS	 X 	PHB				
NIC3	SYS	SYS	SYS	SYS	PHB	PHB	PIX	PXB	SYS	SYS	PHB	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NCCL_IB_TC=136
CUBLAS_VERSION=12.9.1.4
NVIDIA_REQUIRE_CUDA=cuda>=9.0
NCCL_MIN_NCHANNELS=4
NCCL_NET_PLUGIN=none
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
TORCH_CUDA_ARCH_LIST=8.0;8.6;8.9;9.0;10.0;10.1;10.3;12.0
NCCL_VERSION=2.27.3
NCCL_SOCKET_IFNAME=net
VLLM_CACHE_ROOT=/root/.cache/vllm
TORCH_INC=/usr/local/lib/python3.12/dist-packages/torch/include
CUDA_INC=/usr/local/cuda/include
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
TORCH_NCCL_USE_COMM_NONBLOCKING=0
NCCL_DEBUG=INFO
CUDA_ARCH_LIST=7.5 8.0 8.6 9.0 10.0 12.0
NCCL_IB_HCA=mlx5
NVIDIA_PRODUCT_NAME=PyTorch
NCCL_IB_GID_INDEX=3
VLLM_USE_FLASHINFER_SAMPLER=0
NCCL_IB_SPLIT_DATA_ON_QPS=1
CUDA_VERSION=12.9
PYTORCH_VERSION=2.8.0a0+5228986
TORCH_PATH=/usr/local/lib/python3.12/dist-packages/torch
PYTORCH_BUILD_NUMBER=0
CUDA_BIN=/usr/local/cuda/bin
TORCH_VERSION_NUM=280
CUBLASMP_VERSION=0.4.0.789
CUDA_X86_64_LIB=/usr/local/cuda/targets/x86_64-linux/lib/
CUDA_LIB=/usr/local/cuda/lib64
CUDNN_FRONTEND_VERSION=1.12.0
NCCL_IB_QPS_PER_CONNECTION=8
CUDA_TOOLKIT_PATH=/usr/local/cuda
CUDA_VERSION_NUM=129
CUDA_PATH=/usr/local/cuda
VLLM_DO_NOT_TRACK=1
NCCL_IB_TIMEOUT=22
CUDNN_VERSION=9.10.2.21
NCCL_IB_SL=5
TORCH_LIB=/usr/local/lib/python3.12/dist-packages/torch/lib
PYTORCH_HOME=/opt/pytorch/pytorch
LD_LIBRARY_PATH=/usr/local/cuda-12.9/targets/x86_64-linux/lib/:/usr/local/cuda/compat/lib:/usr/local/lib/python3.12/dist-packages/aquila_core:/usr/local/lib/python3.12/dist-packages/sniper_codec/lib:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/ffmpeg-7.1/lib:/usr/local/cuda/targets/x86_64-linux/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/local/ffmpeg-7.1/lib64:/usr/local/ffmpeg-7.1/lib:/usr/local/tbb-2022.3.0/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/lib64:/usr/local/lib:/usr/lib/x86_64-linux-gnu:/usr/lib64:/usr/lib
NVIDIA_BUILD_ID=177567386
VLLM_NO_USAGE_STATS=1
CUDA_COMPUTE_CAPABILITIES=8.0,8.6,8.9,9.0,10.0,10.1,10.3,12.0
CUDA_DRIVER_VERSION=575.57.08
PYTORCH_BUILD_VERSION=2.8.0a0+5228986
CUDA_HOME=/usr/local/cuda
CUDA_HOME=/usr/local/cuda
CUDA_MODULE_LOADING=LAZY
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=
NVIDIA_PYTORCH_VERSION=25.06
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
TORCH_VERSION=2.8.0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

---

python -m vllm.entrypoints.cli.main Qwen/Qwen3.5-35B-A3B-FP8 --gpu-memory-utilization 0.9 --tensor-parallel-size 2 --max-num-seqs 512 --no-enable-prefix-caching --host 127.0.0.1 --port 8000 --no-async-scheduling --compilation-config='{"backend": "eager", "cudagraph_mode": "FULL_AND_PIECEWISE"}' --speculative_config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'  --reasoning-parser qwen3 --language-model-only

---

WorkerProc hit an exception.
 Traceback (most recent call last):
   File "/workspace/zhr/vllm_tmp/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
     output = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
   File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
     return func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
     self.model_runner.profile_run()
   File "/workspace/zhr/vllm_tmp/vllm/v1/worker/gpu_model_runner.py", line 5840, in profile_run
     hidden_states, last_hidden_states = self._dummy_run(
                                         ^^^^^^^^^^^^^^^^
   File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
     return func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/v1/worker/gpu_model_runner.py", line 5529, in _dummy_run
     outputs = self.model(
               ^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/compilation/cuda_graph.py", line 254, in __call__
     return self.runnable(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
     return forward_call(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/models/qwen3_5.py", line 695, in forward
     hidden_states = self.language_model.model(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/compilation/decorators.py", line 638, in __call__
     output = TorchCompileWithNoGuardsWrapper.__call__(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/compilation/wrapper.py", line 197, in __call__
     return self._call_with_optional_nvtx_range(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
     return callable_fn(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1034, in compile_wrapper
     raise e.with_traceback(None) from e.__cause__  # User compiler error
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 torch._dynamo.exc.Unsupported: Attempted to inline function marked as skipped
   Explanation: Dynamo developers have intentionally marked that the function `device_index.__init__` should not be traced.
   Hint: Avoid calling the function `device_index.__init__`.
   Hint: Apply `@torch._dynamo.dont_skip_tracing` to the function `device_index.__init__` to force tracing into the function. More graph breaks may occur as a result of attempting to trace into the function.
   Hint: Please file an issue to PyTorch.
 
   Developer debug context: qualname: device_index.__init__, name: __init__, filename: `/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/accelerator/__init__.py`, skip reason: skipped according trace_rules.lookup MOD_SKIPLIST
 
  For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0008.html
 
 from user code:
    File "/workspace/zhr/vllm_tmp/vllm/model_executor/models/qwen3_next.py", line 518, in forward
     hidden_states, residual = layer(
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/models/qwen3_next.py", line 408, in forward
     self.linear_attn(
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 517, in forward
     self._forward_method(hidden_states, output)
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 592, in forward_cuda
     core_attn_out = self.norm(core_attn_out, z)
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/custom_op.py", line 136, in forward
     return self._forward_method(*args, **kwargs)
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/layernorm.py", line 510, in forward_cuda
     return rmsnorm_fn(
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/fla/ops/utils.py", line 110, in wrapper
     ctx = torch.accelerator.device_index(tensor.device.index)
 
 Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"



🐛 Describe the bug

When serving Qwen3.5:

python -m vllm.entrypoints.cli.main Qwen/Qwen3.5-35B-A3B-FP8 --gpu-memory-utilization 0.9 --tensor-parallel-size 2 --max-num-seqs 512 --no-enable-prefix-caching --host 127.0.0.1 --port 8000 --no-async-scheduling --compilation-config='{"backend": "eager", "cudagraph_mode": "FULL_AND_PIECEWISE"}' --speculative_config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'  --reasoning-parser qwen3 --language-model-only

startup fails because Dynamo (under torch.compile) raises an Unsupported error during the profiling run:

 WorkerProc hit an exception.
 Traceback (most recent call last):
   File "/workspace/zhr/vllm_tmp/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
     output = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
   File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
     return func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
     self.model_runner.profile_run()
   File "/workspace/zhr/vllm_tmp/vllm/v1/worker/gpu_model_runner.py", line 5840, in profile_run
     hidden_states, last_hidden_states = self._dummy_run(
                                         ^^^^^^^^^^^^^^^^
   File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
     return func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/v1/worker/gpu_model_runner.py", line 5529, in _dummy_run
     outputs = self.model(
               ^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/compilation/cuda_graph.py", line 254, in __call__
     return self.runnable(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
     return forward_call(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/models/qwen3_5.py", line 695, in forward
     hidden_states = self.language_model.model(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/compilation/decorators.py", line 638, in __call__
     output = TorchCompileWithNoGuardsWrapper.__call__(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/compilation/wrapper.py", line 197, in __call__
     return self._call_with_optional_nvtx_range(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/zhr/vllm_tmp/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
     return callable_fn(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1034, in compile_wrapper
     raise e.with_traceback(None) from e.__cause__  # User compiler error
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 torch._dynamo.exc.Unsupported: Attempted to inline function marked as skipped
   Explanation: Dynamo developers have intentionally marked that the function `device_index.__init__` should not be traced.
   Hint: Avoid calling the function `device_index.__init__`.
   Hint: Apply `@torch._dynamo.dont_skip_tracing` to the function `device_index.__init__` to force tracing into the function. More graph breaks may occur as a result of attempting to trace into the function.
   Hint: Please file an issue to PyTorch.
 
   Developer debug context: qualname: device_index.__init__, name: __init__, filename: `/home/admin/venvs/vllm_tmp/lib/python3.12/site-packages/torch/accelerator/__init__.py`, skip reason: skipped according trace_rules.lookup MOD_SKIPLIST
 
  For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0008.html
 
 from user code:
    File "/workspace/zhr/vllm_tmp/vllm/model_executor/models/qwen3_next.py", line 518, in forward
     hidden_states, residual = layer(
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/models/qwen3_next.py", line 408, in forward
     self.linear_attn(
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 517, in forward
     self._forward_method(hidden_states, output)
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 592, in forward_cuda
     core_attn_out = self.norm(core_attn_out, z)
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/custom_op.py", line 136, in forward
     return self._forward_method(*args, **kwargs)
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/layernorm.py", line 510, in forward_cuda
     return rmsnorm_fn(
   File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/fla/ops/utils.py", line 110, in wrapper
     ctx = torch.accelerator.device_index(tensor.device.index)
 
 Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

This appears to be caused by input_guard calling torch.accelerator.device_index(...), which Dynamo cannot trace.
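The innermost frame of the "from user code:" section is what points at input_guard. As a minimal sketch of how that frame could be pulled out of a saved log, the helper below parses that section with a regular expression. The function names and the `SAMPLE` text are illustrative (the sample is abridged from the traceback above), and the parser assumes nothing but user-code frames follows the marker.

```python
import re

# Matches frame lines such as:
#   File "/path/to/file.py", line 110, in wrapper
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def user_code_frames(log_text):
    """Return (path, line, func) tuples found after the 'from user code:' marker."""
    marker = "from user code:"
    start = log_text.find(marker)
    if start == -1:
        return []
    section = log_text[start + len(marker):]
    return [
        (m.group("path"), int(m.group("line")), m.group("func"))
        for m in FRAME_RE.finditer(section)
    ]


def innermost_user_frame(log_text):
    """Return the last user-code frame, i.e. the call Dynamo refused to trace."""
    frames = user_code_frames(log_text)
    return frames[-1] if frames else None


# Abridged from the traceback in this issue.
SAMPLE = '''
 from user code:
    File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/layernorm.py", line 510, in forward_cuda
     return rmsnorm_fn(
    File "/workspace/zhr/vllm_tmp/vllm/model_executor/layers/fla/ops/utils.py", line 110, in wrapper
     ctx = torch.accelerator.device_index(tensor.device.index)

 Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace
'''

print(innermost_user_frame(SAMPLE))
```

For the sample above, the reported frame is the `wrapper` function in `fla/ops/utils.py` at line 110, which is the input_guard wrapper.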

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Dynamo refuses to trace torch.accelerator.device_index, which its trace rules mark as skipped (MOD_SKIPLIST), and input_guard calls it inside the compiled region. Potential fixes are to avoid that call on the compiled path or to apply @torch._dynamo.dont_skip_tracing to force tracing, as the error hint suggests.

Guidance

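Identify the problematic code: Locate the torch.accelerator.device_index call. The "from user code" section of the traceback places it in the input_guard wrapper at vllm/model_executor/layers/fla/ops/utils.py, line 110, reached from forward_cuda in gdn_linear_attn.py through self.norm and rmsnorm_fn.
Apply @torch._dynamo.dont_skip_tracing: The error message suggests using @torch._dynamo.dont_skip_tracing to force Dynamo to trace into device_index.__init__. The same hint warns that more graph breaks may occur as a result.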
Modify the code to avoid device_index: If possible, refactor the code to avoid using the device_index function, which is marked as skipped by Dynamo developers.
Set TORCHDYNAMO_VERBOSE=1: Enable verbose mode for Dynamo to get more detailed internal stack traces, which can help with debugging.
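For the last step, the two debug variables named in the error output (`TORCHDYNAMO_VERBOSE=1` and `TORCH_LOGS="+dynamo"`) can be set on a copy of the environment before relaunching the server. This is a minimal sketch; the helper name `dynamo_debug_env` is hypothetical.

```python
import os


def dynamo_debug_env(base_env=None):
    """Return a copy of the environment with Dynamo debug logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["TORCHDYNAMO_VERBOSE"] = "1"  # include the internal Dynamo stack trace
    env["TORCH_LOGS"] = "+dynamo"     # extra developer-level Dynamo logging
    return env


env = dynamo_debug_env({"PATH": "/usr/bin"})
print(env["TORCHDYNAMO_VERBOSE"], env["TORCH_LOGS"])
```

The returned mapping can be passed as `env=` to `subprocess.run` together with the serve command from the report, so the calling shell's own environment stays unchanged.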

Example

import torch

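# device_index is the context manager from torch.accelerator named in the error
from torch.accelerator import device_index

# Apply @torch._dynamo.dont_skip_tracing to a wrapper to force tracing,
# as the error hint suggests. Untested: more graph breaks may occur.
@torch._dynamo.dont_skip_tracing
def custom_device_index(device):
    return device_index(device)

# Placeholder tensor; in input_guard this is the first tensor argument
tensor = torch.empty(1)

# Use custom_device_index instead of torch.accelerator.device_index
ctx = custom_device_index(tensor.device.index)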

Notes

The provided solution is based on the error message and may require further modifications to work correctly.
It's recommended to file an issue with PyTorch, as suggested in the error message, to get more guidance on resolving this issue.

Recommendation

Apply the @torch._dynamo.dont_skip_tracing decorator around the device_index call to force tracing, as the error message suggests.
