vllm - ✅(Solved) Fix [Bug]: ROCm: tries to allocate 192GB VRAM for Qwen3.5 0.8B [1 pull requests, 7 comments, 7 participants]

GitHub stats
vllm-project/vllm#36890 (fetched 2026-04-08 00:43:46)
Comments: 7
Participants: 7
Timeline events: 27
Reactions: 0
Timeline (top): commented ×7, subscribed ×7, mentioned ×5, cross-referenced ×2

Error Message

torch.OutOfMemoryError: HIP out of memory. Tried to allocate 192.00 GiB. GPU 0 has a total capacity of 31.98 GiB of which 27.67 GiB is free. Of the allocated memory 3.92 GiB is allocated by PyTorch, and 128.17 MiB is reserved by PyTorch but unallocated.
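For a sense of scale, the sketch below computes how much memory a dense attention score matrix of shape (batch, heads, N, N) would need. The PR notes on this page attribute the oversized request to such a matrix in the vision encoder. The sequence length, head counts, and element sizes used here are hypothetical values picked only to show how quickly N² grows; they are not read from the Qwen3.5 configuration.

```python
def attention_scores_bytes(seq_len: int, num_heads: int,
                           bytes_per_elem: int, batch: int = 1) -> int:
    """Bytes needed to materialize a dense (batch, heads, N, N) score matrix."""
    return batch * num_heads * seq_len * seq_len * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical shapes, for illustration only (not taken from the model config).
fp32_case = attention_scores_bytes(seq_len=65536, num_heads=12, bytes_per_elem=4)
fp16_case = attention_scores_bytes(seq_len=65536, num_heads=16, bytes_per_elem=2)

print(f"{fp32_case / GIB:.2f} GiB")  # 192.00 GiB
print(f"{fp16_case / GIB:.2f} GiB")  # 128.00 GiB
```

Doubling N quadruples the result, which is why a 0.8B-parameter model can still ask for far more memory than its weights occupy.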

Fix Action

Fix / Workaround

Workaround (from the issue report): starting the server with --language-model-only does not trigger the allocation.

Fix: PR #38334 adds a Triton attention fallback for the vision encoder on ROCm, so that fp16/bf16 runs no longer reach the Torch SDPA path that requests the oversized buffer. Details are in the PR fix notes that follow.

PR fix notes

PR #38334: [ROCm] Use Triton attention fallback for ViT to avoid SDPA OOM

Description (problem / solution / changelog)

Summary

  • On ROCm devices without efficient SDPA backends (e.g. gfx906), running multimodal models (Qwen3.5, etc.) triggers PyTorch's SDPA "math" backend, which allocates the full N×N attention matrix for the vision encoder. For typical vision frames this requires ~128-192GB, causing OOM even on datacenter GPUs.
  • This adds Triton prefill attention as a fallback before TORCH_SDPA in get_vit_attn_backend(). The Triton kernel uses tiled/blocked computation that avoids the N² memory allocation.
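The difference between the two approaches can be sketched in plain Python. This is not the Triton kernel from the PR; it is a minimal single-head illustration of one common tiling technique (online softmax), assumed here only to show how a blocked computation can produce the same result while holding one block of scores at a time. The function names and the block size are made up for the example.

```python
import math
import random


def naive_attention(q, k, v):
    """Reference version: builds the full N x N score matrix first."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    out = []
    for row in scores:
        m = max(row)
        w = [math.exp(s - m) for s in row]
        z = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / z
                    for d in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block=4):
    """Blocked version: keeps a running max, running sum, and output per query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        run_max, run_sum, acc = -math.inf, 0.0, [0.0] * dim_v
        for start in range(0, len(k), block):
            k_blk, v_blk = k[start:start + block], v[start:start + block]
            s_blk = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(run_max, max(s_blk))
            # Rescale everything accumulated under the previous running max.
            correction = math.exp(run_max - new_max)
            run_sum *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(s_blk, v_blk):
                w = math.exp(s - new_max)
                run_sum += w
                for d in range(dim_v):
                    acc[d] += w * vj[d]
            run_max = new_max
        out.append([a / run_sum for a in acc])
    return out


random.seed(0)
n, dim = 10, 8
q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

ref = naive_attention(q, k, v)
got = tiled_attention(q, k, v, block=4)
max_diff = max(abs(a - b) for r, g in zip(ref, got) for a, b in zip(r, g))
print(max_diff < 1e-9)  # True
```

The blocked version never stores more than `block` scores per query, so its working memory grows with N rather than N².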

Fallback chain (updated)

  1. AITER Flash Attention (gfx9 with AITER)
  2. Flash Attention (gfx9 with flash_attn, fp16/bf16)
  3. Flash Attention Triton (gfx1x RDNA3/4, fp16/bf16)
  4. Triton Attention (fp16/bf16) ← NEW
  5. Torch SDPA (final fallback, non-fp16/bf16 only)
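The ordering above can be read as a simple first-match selection. The sketch below is an illustration of that reading only: `DeviceInfo`, `pick_vit_backend`, and the returned labels are hypothetical names, not vLLM's actual classes or enum values, and the real logic in vllm/platforms/rocm.py may check additional conditions.

```python
from dataclasses import dataclass

HALF_DTYPES = {"float16", "bfloat16"}


@dataclass
class DeviceInfo:
    """Hypothetical summary of what the selection logic needs to know."""
    is_gfx9: bool
    is_gfx1x: bool
    has_aiter: bool
    has_flash_attn: bool
    dtype: str  # e.g. "float16", "bfloat16", "float32"


def pick_vit_backend(dev: DeviceInfo) -> str:
    """Return the first backend in the chain whose conditions hold."""
    half = dev.dtype in HALF_DTYPES
    if dev.is_gfx9 and dev.has_aiter:
        return "AITER_FLASH_ATTN"
    if dev.is_gfx9 and dev.has_flash_attn and half:
        return "FLASH_ATTN"
    if dev.is_gfx1x and half:
        return "FLASH_ATTN_TRITON"
    if half:
        return "TRITON_ATTN"  # the step added by the PR
    return "TORCH_SDPA"


# Assumed setup for illustration: a gfx9 card without AITER or flash_attn, bf16.
gfx906_like = DeviceInfo(is_gfx9=True, is_gfx1x=False,
                         has_aiter=False, has_flash_attn=False, dtype="bfloat16")
print(pick_vit_backend(gfx906_like))  # TRITON_ATTN
```

Under this reading, a device like the one assumed above previously fell through to Torch SDPA; with the new step it stops at Triton attention, and only non-fp16/bf16 runs still reach SDPA.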

Fixes #36890
Related: #27706

Test plan

  • Verify Qwen3.5-0.8B starts without OOM on gfx906
  • Verify existing ViT attention tests still pass on MI300X/MI325X
  • Verify RDNA3/4 devices still use Flash Attention Triton (not the new fallback)

🤖 Generated with Claude Code

Changed files

  • vllm/platforms/rocm.py (modified, +10/-0)

Code Example

Collecting environment information...
uv is set
==============================
        System Info
==============================
OS                           : Manjaro Linux (x86_64)
GCC version                  : (GCC) 15.2.1 20260209
Clang version                : 21.1.8
CMake version                : version 4.2.3
Libc version                 : glibc-2.43

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0
Is debug build               : False
CUDA used to build PyTorch   : N/A
ROCM used to build PyTorch   : 7.2.26043

==============================
      Python Environment
==============================
Python version               : 3.14.3 (main, Feb 13 2026, 15:31:44) [GCC 15.2.1 20260209] (64-bit runtime)
Python platform              : Linux-6.12.73-1-MANJARO-x86_64-with-glibc2.43

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.1.115
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : AMD Radeon Graphics (gfx906:sramecc+:xnack-)
Nvidia driver version        : Could not collect
cuDNN version                : Could not collect
HIP runtime version          : 7.2.26043
MIOpen runtime version       : 3.5.1
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  36
On-line CPU(s) list:                     0-35
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) CPU E5-2696 v3 @ 2.30GHz
CPU family:                              6
Model:                                   63
Thread(s) per core:                      2
Core(s) per socket:                      18
Socket(s):                               1
Stepping:                                2
CPU(s) scaling MHz:                      47%
CPU max MHz:                             3800.0000
CPU min MHz:                             1200.0000
BogoMIPS:                                4591.62
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization:                          VT-x
L1d cache:                               576 KiB (18 instances)
L1i cache:                               576 KiB (18 instances)
L2 cache:                                4.5 MiB (18 instances)
L3 cache:                                45 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-35
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             KVM: Mitigation: Split huge pages
Vulnerability L1tf:                      Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:                       Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:                  Mitigation; PTI
Vulnerability Mmio stale data:           Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] conch-triton-kernels==1.2.1
[pip3] numpy==2.4.3
[pip3] pyzmq==27.1.0
[pip3] transformers==4.57.6
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : 7.2.26043-9999
vLLM Version                 : 0.17.1rc1.dev100+g9e19f8338.d20260312 (git sha: 9e19f8338, date: 20260312)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  ============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         
GPU0   0            

================================= Hops between two GPUs ==================================
       GPU0         
GPU0   0            

=============================== Link Type between two GPUs ===============================
       GPU0         
GPU0   0            

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 0
GPU[0]          : (Topology) Numa Affinity: 0
================================== End of ROCm SMI Log ===================================

==============================
     Environment Variables
==============================
CUDA_PATH=/opt/cuda
PYTORCH_ROCM_ARCH=gfx906
VLLM_TARGET_DEVICE=rocm
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_lygod

---

vllm serve Qwen/Qwen3.5-0.8B --max-model-len 2621

---

torch.OutOfMemoryError: HIP out of memory. Tried to allocate 192.00 GiB. GPU 0 has a total capacity of 31.98 GiB of which 27.67 GiB is free. Of the allocated memory 3.92 GiB is allocated by PyTorch, and 128.17 MiB is reserved by PyTorch but unallocated.

🐛 Describe the bug

Running vLLM with Qwen/Qwen3.5-0.8B on ROCm causes the engine to fail during initialization.

vllm serve Qwen/Qwen3.5-0.8B --max-model-len 2621

vLLM attempts to allocate ~192GB of GPU memory, even though the GPU only has 32GB VRAM and the model itself is only 0.8B parameters.

The full log is attached:

<details> <summary>Full debug log</summary>

vllm_serve.log

Key log line:

torch.OutOfMemoryError: HIP out of memory. Tried to allocate 192.00 GiB. GPU 0 has a total capacity of 31.98 GiB of which 27.67 GiB is free. Of the allocated memory 3.92 GiB is allocated by PyTorch, and 128.17 MiB is reserved by PyTorch but unallocated.
</details>

Extra: running with --language-model-only does not trigger the VRAM issue.


Extended analysis

Fix Plan

According to PR #38334, the 192 GiB request comes from the vision encoder, not from the language model's size. On ROCm devices without an efficient SDPA backend (for example gfx906), the encoder falls back to PyTorch's SDPA "math" backend, which materializes the full N×N attention matrix. Using a smaller model or a GPU with more VRAM does not address that. The options that do:

  • Upgrade to a vLLM build that includes PR #38334. It places Triton attention ahead of Torch SDPA in the ViT backend fallback chain for fp16/bf16.
  • If image input is not needed, pass --language-model-only. The reporter confirmed that this avoids the allocation.
  • Lowering --max-model-len is unlikely to help on its own: the reporter already ran with --max-model-len 2621 and still hit the error.
  • TORCHINDUCTOR_COMPILE_THREADS and TORCHINDUCTOR_CACHE_DIR configure compilation threads and the compile cache location; changing them is not expected to affect this allocation.
  • Training-time techniques such as gradient checkpointing or pruning do not apply, because the failure happens while serving for inference.

Workaround command:

vllm serve Qwen/Qwen3.5-0.8B --max-model-len 2621 --language-model-only

Verification

To verify the fix, rerun the original command on a build that contains the PR (or run the workaround command above on the current build):

vllm serve Qwen/Qwen3.5-0.8B --max-model-len 2621

The engine should finish initialization without torch.OutOfMemoryError. Monitor GPU memory during startup and confirm that usage stays within the card's 32 GB.
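One way to check a captured server log for the error is sketched below. It is a small helper written for this page, not part of vLLM; the function name and the regular expression are assumptions based only on the error line quoted in the issue.

```python
import re

# Matches the HIP OOM line quoted in the issue and captures the requested size.
OOM_PATTERN = re.compile(r"HIP out of memory\. Tried to allocate ([\d.]+) (MiB|GiB)")


def find_oom_requests(log_text: str) -> list[float]:
    """Return the size, in GiB, of every failed HIP allocation found in the log."""
    sizes = []
    for amount, unit in OOM_PATTERN.findall(log_text):
        value = float(amount)
        sizes.append(value if unit == "GiB" else value / 1024)
    return sizes


failing_log = (
    "torch.OutOfMemoryError: HIP out of memory. Tried to allocate 192.00 GiB. "
    "GPU 0 has a total capacity of 31.98 GiB of which 27.67 GiB is free."
)
print(find_oom_requests(failing_log))        # [192.0]
print(find_oom_requests("startup complete"))  # []
```

An empty list after a full startup suggests that the oversized request no longer occurs.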

Extra Tips

  • Clearing temporary files or cache directories does not free GPU memory and will not change this error.
  • On ROCm, use rocm-smi to monitor GPU memory; nvidia-smi applies only to NVIDIA GPUs.
  • Moving to a larger GPU or cloud instance is not a practical fix: the failing request is 192 GiB against a 32 GB card, and the PR notes report OOM even on datacenter GPUs.
