vllm - 💡(How to fix) Fix [Bug]: Qwen3 + DeepGEMM + dummy-load Cannot access data pointer of Tensor that doesn't have storage [2 comments, 3 participants]

vllm · 2026-03-18 22:10:44

vllm-project/vllm#37486 • Fetched 2026-04-08 00:58:27

Your current environment

module file: /root/venv/lib/python3.12/site-packages/deep_gemm/__init__.py
version: 2.3.0
has fp8_gemm_nt: True
has transform_sf_into_required_layout: True
has get_mk_alignment_for_contiguous_layout: True
main branch
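
The environment lines above read like the output of a small import probe. A minimal sketch of such a probe, using only the standard library: `describe_module` is a hypothetical helper name, not part of vLLM or DeepGEMM, and the attribute names are the ones listed in the report. It only checks that a symbol is exported, not that it works.

```python
import importlib


def describe_module(name, attrs):
    """Return report lines for a module: file, version, and attribute presence."""
    try:
        mod = importlib.import_module(name)
    except ImportError as exc:
        return [f"{name}: not importable ({exc})"]
    lines = [
        f"module file: {getattr(mod, '__file__', 'unknown')}",
        f"version: {getattr(mod, '__version__', 'unknown')}",
    ]
    # hasattr only confirms the symbol is exported, not that it runs correctly.
    lines += [f"has {attr}: {hasattr(mod, attr)}" for attr in attrs]
    return lines


if __name__ == "__main__":
    report = describe_module(
        "deep_gemm",
        [
            "fp8_gemm_nt",
            "transform_sf_into_required_layout",
            "get_mk_alignment_for_contiguous_layout",
        ],
    )
    for line in report:
        print(line)
```

Running it in the same virtual environment as vLLM shows which DeepGEMM build is being picked up.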

🐛 Describe the bug

cat > /tmp/qwen3_async_tp_dump.py <<'PY'
from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig, CompilationMode, CUDAGraphMode, PassConfig

def main():
    compilation_config = CompilationConfig(
        cudagraph_mode=CUDAGraphMode.NONE,
        mode=CompilationMode.VLLM_COMPILE,
        use_inductor_graph_partition=False,
        inductor_compile_config={"force_disable_caches": True},
        custom_ops=["+quant_fp8", "+rms_norm"],
        pass_config=PassConfig(
            fuse_norm_quant=True,
            fuse_act_quant=True,
            fuse_attn_quant=True,
            enable_qk_norm_rope_fusion=True,
            enable_sp=True,
            fuse_gemm_comms=True,
            fuse_allreduce_rms=False,
            sp_min_token_num=512,
        ),
    )

    llm = LLM(
        model="Qwen/Qwen3-30B-A3B-FP8",
        tensor_parallel_size=2,
        kv_cache_dtype="fp8",
        attention_config={"backend": "FLASHINFER"},
        compilation_config=compilation_config,
        kernel_config={"enable_flashinfer_autotune": False},
        disable_custom_all_reduce=True,
        load_format="dummy",
        hf_overrides={"num_hidden_layers": 4},
        max_model_len=1024,
    )

    outputs = llm.generate(
        [
            "Hello, my name is",
            "The capital of France is",
            "The future of AI is",
            "One short test prompt",
        ],
        SamplingParams(temperature=0),
    )

    for out in outputs:
        print(out.outputs[0].text)

if __name__ == "__main__":
    main()
PY

rm -rf /tmp/vllm_dump_async_tp_qwen3
rm -rf /tmp/torchinductor_"$USER"

CUDA_VISIBLE_DEVICES=4,5 \
VLLM_DEBUG_DUMP_PATH=/tmp/vllm_dump_async_tp_qwen3 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_PATTERN_MATCH_DEBUG=1 \
VLLM_LOGGING_LEVEL=DEBUG \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
python /tmp/qwen3_async_tp_dump.py 2>&1 | tee /tmp/qwen3_async_tp.log

Result: https://paste.ubuntu.com/p/4w4FzFnjxv/

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

The suggested fix is to modify compilation_config so that the passes suspected of causing the error are disabled.

*   Set `fuse_gemm_comms` to `False` in the `PassConfig`:
    ```python
    compilation_config = CompilationConfig(
        # ... other configurations ...
        pass_config=PassConfig(
            # ... other pass configurations ...
            fuse_gemm_comms=False,  # Disable fuse_gemm_comms
        ),
    )
    ```
*   Additionally, try setting `enable_sp` to `False` to disable sequence parallelism:
    ```python
    compilation_config = CompilationConfig(
        # ... other configurations ...
        pass_config=PassConfig(
            # ... other pass configurations ...
            enable_sp=False,  # Disable sequence parallelism
        ),
    )
    ```
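
To see which of the two passes is actually involved, the suspect flags can be disabled one at a time rather than together. A minimal sketch in plain Python: `single_flag_variants` is a hypothetical helper, and the flag names and values are copied from the reproduction script. Each resulting dict is assumed to be passed on as `PassConfig(**flags)`; the sketch does not do that itself so that it runs without vLLM installed. Whether vLLM accepts every combination (for example `fuse_gemm_comms` without `enable_sp`) is not checked here.

```python
def single_flag_variants(base_flags, suspects):
    """Yield (name, flags) pairs, each with exactly one suspect flag set to False."""
    for name in suspects:
        if name not in base_flags:
            raise KeyError(f"unknown flag: {name}")
        variant = dict(base_flags)  # copy so the baseline is never mutated
        variant[name] = False
        yield name, variant


# Baseline copied from the PassConfig arguments in the reproduction script.
BASE_FLAGS = {
    "fuse_norm_quant": True,
    "fuse_act_quant": True,
    "fuse_attn_quant": True,
    "enable_qk_norm_rope_fusion": True,
    "enable_sp": True,
    "fuse_gemm_comms": True,
    "fuse_allreduce_rms": False,
    "sp_min_token_num": 512,
}

if __name__ == "__main__":
    suspects = ["fuse_gemm_comms", "enable_sp"]
    for name, flags in single_flag_variants(BASE_FLAGS, suspects):
        # Assumption: each variant would be used as PassConfig(**flags) in one run.
        print(f"run with {name}=False:", flags)
```

If the error disappears in only one of the runs, that run's flag is the more likely trigger.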

Verification

To verify the change, rerun the modified script and confirm that the `Cannot access data pointer of Tensor that doesn't have storage` error from the issue title no longer appears in the output or in /tmp/qwen3_async_tp.log.
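
The log check can be scripted. A minimal sketch, assuming the log path from the reproduction command and the error text from the issue title; `find_error_lines` is a hypothetical helper name.

```python
ERROR_TEXT = "Cannot access data pointer of Tensor that doesn't have storage"
LOG_PATH = "/tmp/qwen3_async_tp.log"  # path used by the reproduction command


def find_error_lines(log_text, needle=ERROR_TEXT):
    """Return (line_number, line) pairs for every log line containing needle."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if needle in line
    ]


if __name__ == "__main__":
    try:
        with open(LOG_PATH, encoding="utf-8", errors="replace") as handle:
            hits = find_error_lines(handle.read())
    except OSError as exc:
        print(f"could not read {LOG_PATH}: {exc}")
    else:
        for number, line in hits:
            print(f"{LOG_PATH}:{number}: {line}")
        print("error still present" if hits else "error not found in log")
```

An empty result only shows that this particular message is absent; other failures in the log still need a manual look.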

Extra Tips

*   Clean up the temporary files and directories created while debugging, such as /tmp/vllm_dump_async_tp_qwen3 and /tmp/torchinductor_"$USER".
*   If the issue persists, try setting the VLLM_DISABLE_COMPILE_CACHE environment variable to 0 to re-enable compile caching.
*   Refer to the vLLM documentation for configuration options and troubleshooting guides.
