vllm - 💡(How to fix) Fix [Bug]: Qwen3 + DeepGEMM + dummy-load Cannot access data pointer of Tensor that doesn't have storage [2 comments, 3 participants]

GitHub stats
vllm-project/vllm#37486 · Fetched 2026-04-08 00:58:27


Your current environment

module file: /root/venv/lib/python3.12/site-packages/deep_gemm/__init__.py
version: 2.3.0
has fp8_gemm_nt: True
has transform_sf_into_required_layout: True
has get_mk_alignment_for_contiguous_layout: True

main branch
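The environment report above reads like the output of a small diagnostic. A minimal sketch of how such a report could be produced is shown below. The helper name `describe_module` is hypothetical, and it assumes the package exposes a `__version__` attribute (it prints `None` otherwise); the three attribute names are taken from the report itself.

```python
import importlib


def describe_module(name, attrs):
    """Return file path, version and attribute availability for a module.

    Returns None when the module cannot be imported.
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None  # module is not installed in this environment
    info = {
        "module file": getattr(module, "__file__", None),
        # Assumption: the package publishes __version__; None if it does not.
        "version": getattr(module, "__version__", None),
    }
    for attr in attrs:
        info[f"has {attr}"] = hasattr(module, attr)
    return info


if __name__ == "__main__":
    report = describe_module(
        "deep_gemm",
        [
            "fp8_gemm_nt",
            "transform_sf_into_required_layout",
            "get_mk_alignment_for_contiguous_layout",
        ],
    )
    if report is None:
        print("deep_gemm is not importable")
    else:
        for key, value in report.items():
            print(f"{key}: {value}")
```

Running it inside the same virtual environment as vLLM shows which DeepGEMM build the workers would import.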

🐛 Describe the bug

cat > /tmp/qwen3_async_tp_dump.py <<'PY'
from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig, CompilationMode, CUDAGraphMode, PassConfig

def main():
    compilation_config = CompilationConfig(
        cudagraph_mode=CUDAGraphMode.NONE,
        mode=CompilationMode.VLLM_COMPILE,
        use_inductor_graph_partition=False,
        inductor_compile_config={"force_disable_caches": True},
        custom_ops=["+quant_fp8", "+rms_norm"],
        pass_config=PassConfig(
            fuse_norm_quant=True,
            fuse_act_quant=True,
            fuse_attn_quant=True,
            enable_qk_norm_rope_fusion=True,
            enable_sp=True,
            fuse_gemm_comms=True,
            fuse_allreduce_rms=False,
            sp_min_token_num=512,
        ),
    )

    llm = LLM(
        model="Qwen/Qwen3-30B-A3B-FP8",
        tensor_parallel_size=2,
        kv_cache_dtype="fp8",
        attention_config={"backend": "FLASHINFER"},
        compilation_config=compilation_config,
        kernel_config={"enable_flashinfer_autotune": False},
        disable_custom_all_reduce=True,
        load_format="dummy",
        hf_overrides={"num_hidden_layers": 4},
        max_model_len=1024,
    )

    outputs = llm.generate(
        [
            "Hello, my name is",
            "The capital of France is",
            "The future of AI is",
            "One short test prompt",
        ],
        SamplingParams(temperature=0),
    )

    for out in outputs:
        print(out.outputs[0].text)

if __name__ == "__main__":
    main()
PY
rm -rf /tmp/vllm_dump_async_tp_qwen3
rm -rf /tmp/torchinductor_"$USER"

CUDA_VISIBLE_DEVICES=4,5 \
VLLM_DEBUG_DUMP_PATH=/tmp/vllm_dump_async_tp_qwen3 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_PATTERN_MATCH_DEBUG=1 \
VLLM_LOGGING_LEVEL=DEBUG \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
python /tmp/qwen3_async_tp_dump.py 2>&1 | tee /tmp/qwen3_async_tp.log

res: https://paste.ubuntu.com/p/4w4FzFnjxv/


extent analysis

Fix Plan

The proposed workaround is to disable, in compilation_config, the compilation passes suspected of triggering the error.

  • Set fuse_gemm_comms to False in the PassConfig:

```python
compilation_config = CompilationConfig(
    # ... other configurations ...
    pass_config=PassConfig(
        # ... other pass configurations ...
        fuse_gemm_comms=False,  # Disable fuse_gemm_comms
    ),
)
```

  • Additionally, try setting enable_sp to False to disable sequence parallelism:

```python
compilation_config = CompilationConfig(
    # ... other configurations ...
    pass_config=PassConfig(
        # ... other pass configurations ...
        enable_sp=False,  # Disable sequence parallelism
    ),
)
```
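To find out which of the two flags actually matters, the variants can be tried one at a time. The sketch below only builds the keyword dictionaries and does not import vLLM. `BASE_PASS_KWARGS` copies the PassConfig values from the reproduction script, while `bisect_variants` and `SUSPECT_FLAGS` are hypothetical names introduced for illustration. Each resulting dictionary would be passed as `PassConfig(**kwargs)`. The issue does not say whether vLLM accepts `fuse_gemm_comms=True` together with `enable_sp=False`, so a configuration error in that run should be treated as inconclusive.

```python
from itertools import combinations

# PassConfig values copied from the reproduction script.
BASE_PASS_KWARGS = {
    "fuse_norm_quant": True,
    "fuse_act_quant": True,
    "fuse_attn_quant": True,
    "enable_qk_norm_rope_fusion": True,
    "enable_sp": True,
    "fuse_gemm_comms": True,
    "fuse_allreduce_rms": False,
    "sp_min_token_num": 512,
}

# Flags the fix plan proposes to disable.
SUSPECT_FLAGS = ("fuse_gemm_comms", "enable_sp")


def bisect_variants(base, suspects):
    """Yield (label, kwargs) for every non-empty subset of suspects set to False.

    Smaller subsets come first, so the least invasive change is tried first.
    """
    for size in range(1, len(suspects) + 1):
        for subset in combinations(suspects, size):
            kwargs = dict(base)  # copy so the base config is never mutated
            for flag in subset:
                kwargs[flag] = False
            yield "+".join(subset) + "=False", kwargs


if __name__ == "__main__":
    for label, kwargs in bisect_variants(BASE_PASS_KWARGS, SUSPECT_FLAGS):
        print(label, kwargs)
```

Rerunning the reproduction once per variant, with a separate log file for each, shows the smallest change that makes the error disappear.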

Verification

To verify the change, rerun the modified script and confirm that the error "Cannot access data pointer of Tensor that doesn't have storage" no longer appears in the output or in the log at /tmp/qwen3_async_tp.log.
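The log check can be automated with a short script. This is a minimal sketch: the function name `find_error_lines` is hypothetical, the error text is taken from the issue title, and the log path is the one written by `tee` in the reproduction command.

```python
from pathlib import Path

# Error text from the issue title.
ERROR_TEXT = "Cannot access data pointer of Tensor that doesn't have storage"


def find_error_lines(log_path, needle=ERROR_TEXT):
    """Return (line_number, line) pairs from the log that contain needle."""
    hits = []
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with Path(log_path).open(errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    log = Path("/tmp/qwen3_async_tp.log")
    if not log.exists():
        print(f"{log} not found; run the reproduction first")
    else:
        hits = find_error_lines(log)
        if hits:
            for number, line in hits:
                print(f"{number}: {line}")
        else:
            print("error text not found in log")
```

An empty result only shows that this specific message is absent, so the generated text printed by the script is still worth checking.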

Extra Tips

  • Make sure to clean up any temporary files and directories created during the debugging process.
  • Keep VLLM_DISABLE_COMPILE_CACHE=1 while testing the workaround so that every run recompiles with the changed passes. Re-enabling the compile cache (VLLM_DISABLE_COMPILE_CACHE=0) is unlikely to resolve the error by itself.
  • Refer to the vLLM documentation for more information on configuration options and troubleshooting guides.
