vllm - 💡(How to fix) Fix [Bug]: vLLM 0.17.1 (docker) crash with Qwen 3.5 27B-FP8 in BatchPrefillWithPagedKVCache [5 comments, 4 participants]

GitHub stats
vllm-project/vllm#36828 · Fetched 2026-04-08 00:34:25
Comments: 5 · Participants: 4 · Timeline events: 10 · Reactions: 1
Timeline (top): commented ×5, cross-referenced ×2, closed ×1, labeled ×1

Error Message

vllm  | (EngineCore_DP0 pid=142)   File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 666, in paged_run
vllm  | (EngineCore_DP0 pid=142)     paged_run_func(
vllm  | (EngineCore_DP0 pid=142)   File "python/tvm_ffi/cython/function.pxi", line 929, in tvm_ffi.core.Function.__call__
vllm  | (EngineCore_DP0 pid=142)   File "<unknown>", line 0, in __tvm_ffi_paged_run
vllm  | (EngineCore_DP0 pid=142)   File "/workspace/build/aot/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_256_head_dim_vo_256_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cu", line 330, in BatchPrefillWithPagedKVCacheRun(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Array<long int>, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, int64_t, int64_t, int64_t, bool, tvm::ffi::Optional<tvm::ffi::Tensor>, tvm::ffi::Optional<tvm::ffi::Tensor>, tvm::ffi::Optional<tvm::ffi::Tensor>, tvm::ffi::Optional<tvm::ffi::Tensor>, tvm::ffi::Optional<tvm::ffi::Tensor>, tvm::ffi::Optional<tvm::ffi::Tensor>, double, double, double, double, int64_t)::<lambda()>
vllm  | (EngineCore_DP0 pid=142) RuntimeError: Check failed: (status == cudaSuccess) is false: BatchPrefillWithPagedKVCache failed with error an illegal memory access was encountered
vllm  | [rank0]:[W311 21:24:30.003274841 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
This was run in the official 0.17.1 Docker image, so most of the collect_env.py output would be irrelevant and misleading.
$ nvidia-smi 
Wed Mar 11 18:15:11 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:0C:00.0  On |                  Off |
|  0%   31C    P8             34W /  450W |    1513MiB /  24564MiB |     26%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  |   00000000:0D:00.0 Off |                    0 |
| N/A   31C    P0             41W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

docker-compose config contained:
    image: vllm/vllm-openai:latest
    environment:
      - CUDA_VISIBLE_DEVICES=1
      - TORCH_CUDA_ARCH_LIST=8.0
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      - VLLM_SLEEP_WHEN_IDLE=1
</details>

🐛 Describe the bug

The bug occurs after the first inference attempt. The server was started with CUDA_VISIBLE_DEVICES=1 (the A100 80GB).

command: /models/Qwen_Qwen3.5-27B-FP8 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":5}' --served-model-name qwen-3.5-27b home gpt-3.5-turbo default --gpu-memory-utilization=0.97 --enable-prefix-caching --attention-backend FLASHINFER

container logs: https://gist.github.com/matatonic/f08eb26807b7ecb7a63ef7aaad7fd476

Ends with:

vllm  | (EngineCore_DP0 pid=142)   File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 666, in paged_run
vllm  | (EngineCore_DP0 pid=142)     paged_run_func(
vllm  | (EngineCore_DP0 pid=142)   File "python/tvm_ffi/cython/function.pxi", line 929, in tvm_ffi.core.Function.__call__
vllm  | (EngineCore_DP0 pid=142)   File "<unknown>", line 0, in __tvm_ffi_paged_run
vllm  | (EngineCore_DP0 pid=142)   File "/workspace/build/aot/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_256_head_dim_vo_256_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cu", line 330, in BatchPrefillWithPagedKVCacheRun(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Array<long int>, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, int64_t, int64_t, int64_t, bool, tvm::ffi::Optional<tvm::ffi::Tensor>, tvm::ffi::Optional<tvm::ffi::Tensor>, tvm::ffi::Optional<tvm::ffi::Tensor>, tvm::ffi::Optional<tvm::ffi::Tensor>, tvm::ffi::Optional<tvm::ffi::Tensor>, tvm::ffi::Optional<tvm::ffi::Tensor>, double, double, double, double, int64_t)::<lambda()>
vllm  | (EngineCore_DP0 pid=142) RuntimeError: Check failed: (status == cudaSuccess) is false: BatchPrefillWithPagedKVCache failed with error an illegal memory access was encountered
vllm  | [rank0]:[W311 21:24:30.003274841 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
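
The generated directory name in the traceback appears to encode the compile-time parameters of the FlashInfer kernel that failed. The sketch below splits that name into key/value pairs so the configuration is easier to read. The key list and the key_value layout are assumptions inferred from this single path, not taken from FlashInfer documentation.

```python
import re

# Directory name copied from the traceback above.
KERNEL_DIR = (
    "batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_"
    "dtype_idx_i32_head_dim_qk_256_head_dim_vo_256_posenc_0_use_swa_False_"
    "use_logits_cap_False_f16qk_False"
)

# Keys in the order they appear in the name. Assumption: each key is
# followed by exactly one value that contains no underscore.
KEYS = [
    "dtype_q", "dtype_kv", "dtype_o", "dtype_idx",
    "head_dim_qk", "head_dim_vo", "posenc",
    "use_swa", "use_logits_cap", "f16qk",
]


def parse_kernel_dir(name: str) -> dict[str, str]:
    """Split the generated directory name into key -> value strings."""
    pattern = "_".join(f"{re.escape(k)}_(?P<{k}>[A-Za-z0-9]+)" for k in KEYS)
    match = re.search(pattern, name)
    if match is None:
        raise ValueError(f"unrecognized kernel directory name: {name!r}")
    return match.groupdict()


config = parse_kernel_dir(KERNEL_DIR)
for key, value in config.items():
    print(f"{key:>15} = {value}")
```

Read this way, the failing kernel uses bf16 for queries, KV cache, and output, with a head dimension of 256 and no sliding-window attention. That is useful detail to include when searching for or filing related FlashInfer reports.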


extent analysis

Fix Plan

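The steps below are mitigations to try, not a confirmed fix. An illegal memory access means a kernel read or wrote an invalid address; it is a different fault from running out of GPU memory, so memory settings alone may not resolve it.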

  • Step 1: Reduce GPU Memory Utilization
    • Lower the --gpu-memory-utilization value, for example from 0.97 to 0.8, to leave headroom on the GPU. This rules out memory pressure as a contributing factor, but it may not change the outcome because the reported error is not an out-of-memory error.
  • Step 2: Check CUDA_VISIBLE_DEVICES
    • Confirm that the variable selects the intended GPU. Here it is set to 1, which nvidia-smi lists as the A100 80GB. CUDA's default device ordering can differ from the nvidia-smi ordering unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set, so check the container logs for the device that was actually used.
  • Step 3: Review PYTORCH_ALLOC_CONF
    • The configuration already sets expandable_segments:True, and no change is needed. Do not add a cache_size key: it is not among the allocator options that PyTorch documents, and an unrecognized key may be rejected at startup.
  • Step 4: Log the Error and Restart
    • A try-except block around the failing call can log the RuntimeError more clearly, but it cannot recover from it. After an illegal memory access the CUDA context stays in an error state and later CUDA calls in the same process fail too, so the process has to be restarted.
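
To narrow down which option triggers the crash, one approach is to rerun the server with a single option removed each time. The sketch below builds those command variants from the arguments in the report. The choice of candidate options is an assumption made for illustration; the issue does not establish which of them, if any, is responsible.

```python
import shlex

# Arguments copied from the command in the report.
BASE_TOKENS = [
    "/models/Qwen_Qwen3.5-27B-FP8",
    "--reasoning-parser", "qwen3",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "qwen3_coder",
    "--speculative-config",
    '{"method":"qwen3_next_mtp","num_speculative_tokens":5}',
    "--served-model-name", "qwen-3.5-27b", "home", "gpt-3.5-turbo", "default",
    "--gpu-memory-utilization=0.97",
    "--enable-prefix-caching",
    "--attention-backend", "FLASHINFER",
]

# Assumption: these are plausible suspects, not confirmed causes.
CANDIDATES = [
    "--attention-backend",
    "--speculative-config",
    "--enable-prefix-caching",
]


def drop_option(tokens: list[str], option: str) -> list[str]:
    """Return tokens without `option` and without the values that follow it."""
    kept, skipping = [], False
    for token in tokens:
        if token.startswith("--"):
            # A new flag starts; skip it only if it is the one being dropped.
            skipping = token.split("=", 1)[0] == option
        if not skipping:
            kept.append(token)
    return kept


variants = {
    option: shlex.join(drop_option(BASE_TOKENS, option))
    for option in CANDIDATES
}

for option, command in variants.items():
    print(f"# without {option}\n{command}\n")
```

Each printed line can replace the command entry in the compose file for one test run. If the crash disappears with exactly one option removed, that option is the place to look first.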

Example code snippet for error handling:

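import logging

def run_and_log(paged_run_func):
    """Call the given function; log a RuntimeError, then re-raise it."""
    try:
        return paged_run_func()
    except RuntimeError as e:
        # After an illegal memory access the CUDA context is unusable,
        # so log for diagnosis and let the process exit and restart.
        logging.error("RuntimeError caught: %s", e)
        raise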

Verification

To verify that the fix worked:

  1. Restart the container with the adjusted --gpu-memory-utilization value.
  2. Send an inference request, since the crash appeared after the first inference attempt, and check the container logs for "illegal memory access".
  3. Repeat with several different prompts and prompt lengths to confirm that the server keeps running.
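
The log check in step 2 can be automated with a small scanner. This is a minimal sketch that assumes the logs have been saved as text (for example from docker compose logs) and that each line carries the same "vllm  | (EngineCore_DP0 pid=...)" prefix seen in the report; the function name find_faults is hypothetical.

```python
import re

# Prefix added to each line in the report, e.g.
# "vllm  | (EngineCore_DP0 pid=142) ". The parenthesized part is optional.
PREFIX = re.compile(r"^\S+\s*\|\s*(?:\([^)]*\)\s?)?")
FAULT = "an illegal memory access was encountered"


def find_faults(log_text: str) -> list[str]:
    """Return the log lines (prefix stripped) that report the CUDA fault."""
    hits = []
    for line in log_text.splitlines():
        if FAULT in line:
            hits.append(PREFIX.sub("", line).strip())
    return hits


sample = (
    "vllm  | (EngineCore_DP0 pid=142)     paged_run_func(\n"
    "vllm  | (EngineCore_DP0 pid=142) RuntimeError: Check failed: "
    "(status == cudaSuccess) is false: BatchPrefillWithPagedKVCache "
    "failed with error an illegal memory access was encountered\n"
)

faults = find_faults(sample)
print(f"{len(faults)} fault line(s) found")
for fault in faults:
    print(fault)
```

An empty result after several requests suggests the change under test avoided the crash for those inputs; it does not prove the underlying fault is gone.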

Extra Tips

  • The compose file uses the latest tag while the report names 0.17.1. Pinning an explicit version tag makes it clear which vLLM, PyTorch, and FlashInfer builds are in use when comparing against a newer release.
  • Retrying inside the process does not help after this fault. Handle recovery at the container level instead, for example with a Docker restart policy.
