vllm - 💡(How to fix) Fix [Bug]: agrs Workspace Buffer Sizing Overflow at Large EP [1 participants]

vllm-project/vllm#41858 (fetched 2026-05-07 03:32:22)



Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.28.3
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA H200
Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  192
On-line CPU(s) list:                     0-191
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) Platinum 8488C
CPU family:                              6
Model:                                   143
Thread(s) per core:                      2
Core(s) per socket:                      48
Socket(s):                               2
Stepping:                                8
BogoMIPS:                                4800.00
L1d cache:                               4.5 MiB (96 instances)
L1i cache:                               3 MiB (96 instances)
L2 cache:                                192 MiB (96 instances)
L3 cache:                                210 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-47,96-143
NUMA node1 CPU(s):                       48-95,144-191

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.3.5
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.11.0+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0

==============================
         vLLM Info
==============================
vLLM Version                 : 0.20.0
vLLM Build Flags:
  CUDA Archs: 8.0 8.6 8.9 9.0 10.0; ROCm: Disabled; XPU: Disabled
GPU Topology:
  GPU0: NVIDIA H200 (141 GiB)
  8 GPUs per node connected via NVLink (NV18)
  Cross-node: 16× EFA v2 (100 Gbps each, 1.6 Tbps total)

==============================
     Environment Variables
==============================
CUDA_VERSION=13.0.3
NCCL_VERSION=2.28.3-1
TORCH_CUDA_ARCH_LIST=8.0 8.6 8.9 9.0 10.0
VLLM_TARGET_DEVICE=cuda
CUDA_HOME=/usr/local/cuda
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
</details>

🐛 Describe the bug

Summary

The allgather_reducescatter MoE workspace buffer is sized during the warmup/profile run (which uses a small batch), then locked. During CUDA graph capture (which uses max batch size), the workspace is too small, causing a runtime error.

Severity

High: blocks agrs (allgather_reducescatter) at EP≥32 with CUDA graphs.

Environment

| Component | Version |
|---|---|
| vLLM | 0.20.0 |
| PyTorch | 2.11.0+cu130 |
| CUDA | 13.0.3 |
| NVIDIA Driver | 580.105.08 |
| NCCL | 2.30.4-1 |

Hardware

  • Instance: 8× AWS p5en.48xlarge
  • GPU: 64× NVIDIA H200

collect_env.py Output

PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
CMake version                : version 3.28.3
Python version               : 3.12.3 (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.39
Is CUDA available            : True
CUDA runtime version         : 13.0.88
GPU models and configuration : GPU 0-7: NVIDIA H200 (141 GiB each)
Nvidia driver version        : 580.126.09
vLLM Version                 : 0.20.0
vLLM Build Flags             : CUDA Archs: 8.0 8.6 8.9 9.0 10.0; ROCm: Disabled; XPU: Disabled
NCCL_VERSION                 : 2.28.3-1
TORCH_CUDA_ARCH_LIST         : 8.0 8.6 8.9 9.0 10.0

[pip3] flashinfer-python==0.6.8.post1
[pip3] torch==2.11.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvshmem-cu13==3.4.5

Steps to Reproduce

  1. Deploy DeepSeek-V3-0324 across 8 nodes with agrs + CUDA graphs:

       VLLM_USE_DEEP_GEMM=0 vllm serve deepseek-ai/DeepSeek-V3-0324 \
         --tensor-parallel-size 4 \
         --data-parallel-size 16 \
         --enable-expert-parallel \
         --gpu-memory-utilization 0.85 \
         --max-model-len 32768 \
         --max-num-seqs 128 \
         --trust-remote-code

  2. Model loads, warmup completes (small batch → small workspace allocated).
  3. CUDA graph capture begins (max batch → needs larger workspace).
  4. Crash: workspace too small.
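
As a quick sanity check on the layout in step 1, the parallel sizes in the launch command can be multiplied out against the hardware listed above. This is only a sketch: it assumes that with --enable-expert-parallel the expert-parallel group spans all TP × DP ranks, which is how the EP=64 figure in this report would arise.

```python
# Values copied from the launch command and the Hardware section of this report.
tensor_parallel_size = 4
data_parallel_size = 16
nodes = 8
gpus_per_node = 8

world_size = tensor_parallel_size * data_parallel_size
total_gpus = nodes * gpus_per_node

# Assumption: the EP group covers every TP x DP rank when expert parallel is on.
expert_parallel_size = world_size

print(f"ranks requested: {world_size}, GPUs available: {total_gpus}")
print(f"assumed EP size: {expert_parallel_size}")
```

Under that assumption the command uses every GPU in the cluster as a single EP group of 64.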

Error Messages

With default KV blocks (~50K):

RuntimeError: Workspace requires 3598.00 MB, current size is 1806.00 MB

With reduced KV blocks (30K):

RuntimeError: Workspace requires 224.00 MB, current size is 112.00 MB

Root Cause

The workspace allocation logic appears to be:

  1. During warmup/profile: allocate workspace for warmup_batch_size tokens
  2. Lock workspace size
  3. During CUDA graph capture: attempt to use workspace for max_cudagraph_capture_size tokens
  4. Workspace too small → crash
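
The four steps above can be mimicked with a toy model. The sketch below is not vLLM code: `LockedWorkspace`, `run_startup`, the per-token cost, and the token counts are all hypothetical, chosen only so the message has the same shape as the one reported.

```python
class LockedWorkspace:
    """Toy stand-in for a buffer that may grow until it is locked."""

    def __init__(self) -> None:
        self.size_mb = 0.0
        self.locked = False

    def reserve(self, required_mb: float) -> None:
        # Requests that already fit need no action.
        if required_mb <= self.size_mb:
            return
        # Once locked, a larger request can no longer be satisfied.
        if self.locked:
            raise RuntimeError(
                f"Workspace requires {required_mb:.2f} MB, "
                f"current size is {self.size_mb:.2f} MB"
            )
        self.size_mb = required_mb

    def lock(self) -> None:
        self.locked = True


MB_PER_TOKEN = 0.5  # hypothetical cost, for illustration only


def run_startup(warmup_tokens: int, capture_tokens: int) -> str:
    ws = LockedWorkspace()
    ws.reserve(warmup_tokens * MB_PER_TOKEN)       # step 1: size for warmup
    ws.lock()                                      # step 2: lock
    try:
        ws.reserve(capture_tokens * MB_PER_TOKEN)  # step 3: capture asks for more
    except RuntimeError as exc:
        return str(exc)                            # step 4: too small
    return "ok"


print(run_startup(warmup_tokens=224, capture_tokens=448))
```

Swapping the two arguments, so the buffer is first sized for the larger batch, makes the same sequence succeed.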

The ratio is consistently ~2×:

  • 3598 / 1806 ≈ 2.0
  • 224 / 112 = 2.0

This suggests the workspace is sized for the profile batch but CUDA graph capture needs 2× larger batches.
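
The ratios can be recomputed directly from the two error messages. The snippet below uses only the numbers quoted in this report; the dictionary keys are just labels.

```python
# (required_mb, current_mb) pairs copied from the two RuntimeError messages.
errors = {
    "default KV blocks": (3598.0, 1806.0),
    "reduced KV blocks": (224.0, 112.0),
}

ratios = {
    name: required / current
    for name, (required, current) in errors.items()
}

for name, ratio in ratios.items():
    print(f"{name}: required/current = {ratio:.3f}")
```

The second pair is exactly 2.0, while the first is slightly under 2.0, so "~2×" is an approximation for the larger configuration.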

Expected Behavior

The workspace should be pre-allocated for the largest batch it will ever serve (the maximum CUDA graph capture size) rather than the profile batch size, or it should remain resizable after the profile run.

Workaround

  • Use --enforce-eager, which disables CUDA graphs and so skips the capture-time workspace request. At EP=64 this then hits a Gloo timeout.
  • Reduce --max-cudagraph-capture-size (may work, but batches above the new limit would no longer run under CUDA graphs)
  • Use smaller EP groups (EP=32 works fine)

Suggested Fix

In the workspace allocation code, use max(profile_batch_size, max_cudagraph_capture_size) when determining workspace size, or make workspace resizable.
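
A minimal sketch of the first option, assuming the workspace scales linearly with token count. The function name and the `bytes_per_token` parameter are hypothetical and do not correspond to vLLM's actual allocation code.

```python
def workspace_size_bytes(
    profile_batch_size: int,
    max_cudagraph_capture_size: int,
    bytes_per_token: int,  # hypothetical linear cost per token
) -> int:
    """Size the workspace for the largest batch that will ever use it."""
    if min(profile_batch_size, max_cudagraph_capture_size, bytes_per_token) <= 0:
        raise ValueError("all arguments must be positive")
    # Take the worst case so locking after the profile run stays safe.
    worst_case_tokens = max(profile_batch_size, max_cudagraph_capture_size)
    return worst_case_tokens * bytes_per_token


# Illustrative values only: capture needs twice the profile batch.
size = workspace_size_bytes(
    profile_batch_size=256,
    max_cudagraph_capture_size=512,
    bytes_per_token=1024,
)
print(size)
```

Sizing for the maximum up front keeps the lock-after-profile behavior intact; the resizable alternative would instead need the buffer to grow before capture begins, since allocations made during CUDA graph capture are constrained.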

cc: @nkumaraws

