vllm - ✅(Solved) Fix [Bug] V1 engine hangs on encoder cache profiling on AMD gfx1151 (MIOpen missing solver DB) [2 pull requests, 2 comments, 3 participants]

GitHub stats
vllm-project/vllm#37472 · Fetched 2026-04-08 00:58:29
Comments: 2
Participants: 3
Timeline: 18
Reactions: 0
Timeline (top): mentioned ×4, subscribed ×4, commented ×2, cross-referenced ×2

vLLM V1 engine hangs indefinitely during initialization when serving any model with a vision encoder on AMD gfx1151 (Strix Halo / Radeon 8060S). The hang occurs at encoder cache profiling where embed_multimodal() triggers MIOpen convolution operations that never complete.

Error Message

MIOpen(HIP): Error [...] Could not open metadata file: .../gfx1151_ConvHipImplicitGemm3DGroupFwdXdlops_metadata.tn.model

Root Cause

_maybe_initialize_encoder_cache() in gpu_model_runner.py calls self.model.embed_multimodal() with dummy inputs, triggering MIOpen convolution operations. MIOpen has no pre-compiled solver database for gfx1151, causing exhaustive kernel search that either hangs or takes hours.

Env vars MIOPEN_DEBUG_DISABLE_FIND_DB=1, MIOPEN_FIND_ENFORCE=NONE, MIOPEN_DISABLE_CACHE=1 do NOT prevent the hang — the convolution kernel itself blocks.

Fix Action

Workaround

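Comment out lines 5509-5525 in gpu_model_runner.py. These line numbers come from the reporter's build and will differ between vLLM versions, so confirm that they cover the encoder profiling block in your installed copy before running the command: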

sed -i '5509,5525s/^/#/' $(find . -name gpu_model_runner.py -path "*/vllm/v1/worker/*")

This disables vision encoder profiling. Text-only inference works normally afterward.
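The sed one-liner edits the file in place with no backup. A minimal Python sketch of the same edit is shown here, assuming you would rather keep a copy of the original file. The function names `comment_out_lines` and `patch_file` and the `.bak` suffix are illustrative choices, not part of vLLM or the issue.

```python
import shutil
from pathlib import Path


def comment_out_lines(text: str, start: int, end: int, prefix: str = "#") -> str:
    """Prefix lines start..end (1-indexed, inclusive) with `prefix`."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    lines = text.splitlines(keepends=True)
    # Line numbers past the end of the file are ignored.
    for i in range(start - 1, min(end, len(lines))):
        lines[i] = prefix + lines[i]
    return "".join(lines)


def patch_file(path: str, start: int, end: int) -> Path:
    """Save a .bak copy, then comment out the range in place.

    Returns the path of the backup file.
    """
    target = Path(path)
    backup = target.with_name(target.name + ".bak")
    shutil.copy2(target, backup)
    target.write_text(comment_out_lines(target.read_text(), start, end))
    return backup
```

Commenting out a range that ends in the middle of a Python block can leave an `if` or `with` header without a body, so it is worth importing the patched module once before restarting the server.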

PR fix notes

PR #38455: [ROCm] Add RDNA 3.5/4 device IDs (gfx1150, gfx1151, gfx1201)

Description (problem / solution / changelog)

Summary

Adds 3 missing entries to _ROCM_DEVICE_ID_NAME_MAP in vllm/platforms/rocm.py:

Device ID   Name                  Architecture   Hardware
0x150e      AMD_Radeon_890M       gfx1150        Strix Point APU
0x1586      AMD_Radeon_8060S      gfx1151        Strix Halo APU
0x7550      AMD_Radeon_RX9070XT   gfx1201        Navi 48 discrete

Without these entries, get_device_name() falls back to amdsmi["market_name"] which returns the generic string "AMD Radeon Graphics" for APU devices, causing downstream name-based logic to misbehave.
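The lookup-with-fallback behaviour described above can be sketched as follows. This is a simplified illustration, not the code in vllm/platforms/rocm.py: the map holds only the three entries from the PR description, and `resolve_device_name` with its string arguments is a hypothetical stand-in for get_device_name().

```python
# Only the three entries listed in the PR description; the real map is larger.
ROCM_DEVICE_ID_NAME_MAP = {
    "0x150e": "AMD_Radeon_890M",
    "0x1586": "AMD_Radeon_8060S",
    "0x7550": "AMD_Radeon_RX9070XT",
}


def resolve_device_name(device_id: str, market_name: str) -> str:
    """Return the mapped name for a device ID, else the amdsmi market name.

    `device_id` is a hex string such as "0x1586"; `market_name` is whatever
    amdsmi reported, which for APUs may be the generic "AMD Radeon Graphics".
    """
    return ROCM_DEVICE_ID_NAME_MAP.get(device_id.lower(), market_name)
```

With the entry present, a Strix Halo APU resolves to a specific name. Without it, the generic market name is passed through, and any downstream logic that keys on the name sees "AMD Radeon Graphics".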

Related issues: #36615, #37151, #37472, #32180

Device ID sources (high confidence)

  • 0x150e — ROCm issue #5433, lspci reports from Strix Point users
  • 0x1586 — PyTorch forums (Chip ID 5510/0x1586), GPU spec databases, multiple ROCm issue reports
  • 0x7550 — PyTorch forums (Chip ID 0x7550), ROCm TheRock issue #745

Note: The base RX 9070 (non-XT) may share device ID 0x7550 with the XT variant, so the name for this entry is best-effort until it is verified on hardware.

Tests added

New file tests/rocm/test_device_id_map.py with 3 offline unit tests:

  • test_rocm_device_id_map_format — validates all keys are lowercase hex, values non-empty/no-spaces
  • test_rocm_device_id_map_known_entries — spot-checks new and existing entries
  • test_rocm_device_id_map_no_duplicate_keys — ensures no duplicate device IDs
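The PR's test file is not reproduced on this page, so the following is only a sketch of what checks like these could look like. The helper names are hypothetical. The duplicate check works on source text because a Python dict literal silently keeps the last value for a repeated key, which makes duplicates invisible at runtime.

```python
import ast
import re
from collections import Counter

_HEX_KEY = re.compile(r"0x[0-9a-f]+")


def find_map_problems(device_map: dict) -> list:
    """Report keys that are not lowercase hex and names that are empty or contain whitespace."""
    problems = []
    for key, name in device_map.items():
        if not _HEX_KEY.fullmatch(key):
            problems.append(f"key {key!r} is not lowercase hex")
        if not name or any(ch.isspace() for ch in name):
            problems.append(f"name for {key!r} is empty or contains whitespace")
    return problems


def duplicate_literal_keys(source: str) -> list:
    """Return constant keys that appear more than once in any dict literal in `source`."""
    dupes = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            keys = [k.value for k in node.keys if isinstance(k, ast.Constant)]
            dupes += [k for k, count in Counter(keys).items() if count > 1]
    return dupes
```

Neither helper touches a GPU, which matches the PR's note that the tests run offline.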

Duplicate-work check

gh pr list --repo vllm-project/vllm --state open --search "gfx1151 device id"

No overlapping PRs. PR #37189 adds amdsmi WSL2 fallback (different scope).

Test commands and results

pre-commit run --all-files  # All hooks passed (ruff, mypy, typos, SPDX)

Tests are offline (no GPU required).

AI assistance disclosure

This PR was developed with AI assistance (Claude). All changes were reviewed by a human and independently verified by a separate reviewer with fresh context.

Co-authored-by: Claude

Changed files

  • vllm/platforms/rocm.py (modified, +6/-1)

PR #38555: [ROCm] Skip encoder cache profiling on consumer RDNA GPUs

Description (problem / solution / changelog)

Summary

Skip the encoder cache profiling pass during profile_run() on consumer RDNA 3/3.5 GPUs to prevent an indefinite hang caused by missing MIOpen solver databases.

Problem

When serving multimodal/vision models (e.g., Qwen3-VL) on consumer RDNA GPUs, the V1 engine hangs forever on startup. During profile_run(), _maybe_initialize_encoder_cache() calls embed_multimodal() with dummy inputs, which triggers MIOpen convolution. Datacenter GPUs (MI300/MI350) have pre-compiled solver databases so this is instant. Consumer RDNA GPUs (gfx1100-gfx1103, gfx1150-gfx1151) lack these databases, causing MIOpen to start an exhaustive autotuning search that never completes.

Affects: RX 7900 XTX/XT/GRE (gfx1100), RX 7800 XT (gfx1101), RX 7600 XT/7600 (gfx1102), Radeon 780M iGPU (gfx1103), Radeon 890M (gfx1150), Radeon 8060S/Strix Halo (gfx1151).

Fix

  1. Add on_consumer_rdna() detection helper to vllm/platforms/rocm.py that identifies consumer RDNA arches
  2. In gpu_model_runner.py, guard the encoder profiling block - skip only the encoder warm-up pass on consumer RDNA GPUs
  3. Log a warning explaining the limitation and suggesting alternatives

The rest of profile_run() (text decoder dummy run, sampler/pooler profiling, encoder_cache.clear(), gc.collect()) still executes so memory profiling is not disrupted.
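A rough sketch of the detection helper and guard described in the Fix steps is given below. The arch list is taken from the Problem section. Everything else is an assumption made for illustration: the signature of `on_consumer_rdna`, the colon-separated arch string format, and `profile_run_steps`, which only lists step names instead of running anything.

```python
CONSUMER_RDNA_ARCHES = frozenset(
    {"gfx1100", "gfx1101", "gfx1102", "gfx1103", "gfx1150", "gfx1151"}
)


def on_consumer_rdna(gcn_arch_name: str) -> bool:
    """True if the arch string names one of the consumer RDNA 3/3.5 targets.

    Assumes the string may carry feature suffixes after a colon,
    for example "gfx1151:xnack-", and compares only the base name.
    """
    base = gcn_arch_name.split(":", 1)[0].strip().lower()
    return base in CONSUMER_RDNA_ARCHES


def profile_run_steps(gcn_arch_name: str, has_encoder: bool) -> list:
    """List the profiling steps that would run, skipping only the encoder pass."""
    steps = []
    if has_encoder:
        if on_consumer_rdna(gcn_arch_name):
            steps.append("warn: encoder profiling skipped")
        else:
            steps.append("encoder profiling")
    # These steps run on every GPU so memory profiling is not disrupted.
    steps += [
        "decoder dummy run",
        "sampler/pooler profiling",
        "encoder_cache.clear()",
        "gc.collect()",
    ]
    return steps
```

The point of the structure is that the guard wraps only the encoder warm-up. The remaining steps stay outside the condition, which is how the PR describes keeping memory profiling intact.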

Why this is not duplicating an existing PR

  • PR #37370 (Add Encoder Dummy Run) is a DRAFT with an empty description, tagged needs-rebase, and addresses a different scope (Model Runner V2 architecture). No functional overlap.
  • Searched for open PRs with "encoder cache hang", "MIOpen RDNA", "encoder profiling rocm" - none found.

Test commands run

pre-commit run --files vllm/platforms/rocm.py vllm/v1/worker/gpu_model_runner.py
# Result: All 14 hooks passed (ruff, mypy, typos, SPDX, forbidden imports, etc.)

Hardware verification requires a consumer RDNA GPU (gfx1103 available for testing).

AI assistance disclosure

AI assistance was used (Claude) for implementation. All changed lines reviewed by human submitter.

Closes #37472

Changed files

  • vllm/platforms/rocm.py (modified, +32/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +67/-32)

Code Example

vllm serve Qwen/Qwen3.5-35B-A3B --enforce-eager --dtype float16 --trust-remote-code

---

INFO: Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
MIOpen(HIP): Error [...] Could not open metadata file: .../gfx1151_ConvHipImplicitGemm3DGroupFwdXdlops_metadata.tn.model


Description

vLLM V1 engine hangs indefinitely during initialization when serving any model with a vision encoder on AMD gfx1151 (Strix Halo / Radeon 8060S). The hang occurs at encoder cache profiling where embed_multimodal() triggers MIOpen convolution operations that never complete.

Environment

  • GPU: AMD Radeon 8060S (gfx1151, RDNA 3.5 iGPU, 128GB unified LPDDR5X)
  • vLLM: 0.17.1rc1.dev169 and 0.17.2rc1.dev71
  • PyTorch: 2.10-2.12 (TheRock nightlies for gfx1151)
  • ROCm: TheRock 7.11-7.13 nightlies
  • OS: Fedora 43

Reproduction

vllm serve Qwen/Qwen3.5-35B-A3B --enforce-eager --dtype float16 --trust-remote-code

Server logs show:

INFO: Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
MIOpen(HIP): Error [...] Could not open metadata file: .../gfx1151_ConvHipImplicitGemm3DGroupFwdXdlops_metadata.tn.model

Then hangs forever. Health endpoint never returns 200.
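Because the hang produces no further output, a startup check with a deadline is one way to tell "still loading" apart from "stuck". The sketch below polls a health URL until it returns 200 or the deadline passes. The URL `http://localhost:8000/health` assumes the server's default host and port, and the function names and timing values are illustrative.

```python
import http.client
import time
import urllib.request


def http_health_probe(url: str = "http://localhost:8000/health",
                      timeout_s: float = 2.0) -> bool:
    """Return True only if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts and non-2xx responses all count as unhealthy.
        return False


def wait_until_healthy(probe, deadline_s: float, interval_s: float = 5.0,
                       clock=time.monotonic, sleep=time.sleep) -> bool:
    """Call `probe` repeatedly until it succeeds or `deadline_s` seconds elapse."""
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A call such as `wait_until_healthy(http_health_probe, deadline_s=900)` returns False after the deadline instead of waiting indefinitely. Choose a deadline that allows for normal model loading time on your hardware.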

Suggested Fix

Add a check in _maybe_initialize_encoder_cache() to skip profiling when:

  1. No multimodal inputs are expected (--limit-mm-per-prompt or text-only serving)
  2. MIOpen solver DB is missing for the current GPU architecture
  3. A new flag like --skip-encoder-profiling is set
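Condition 2 could be approximated by looking for per-architecture metadata files of the kind named in the error message. The sketch below assumes that the presence of a `<arch>_*_metadata.tn.model` file is a usable signal. The issue does not confirm this, and the directory is elided in the log, so it is passed in as a parameter. `has_miopen_metadata` is a hypothetical helper, not a MIOpen or vLLM API.

```python
from pathlib import Path


def has_miopen_metadata(db_dir: str, arch: str) -> bool:
    """True if `db_dir` contains at least one '<arch>_*_metadata.tn.model' file.

    The file name pattern is taken from the error message in the issue;
    treating it as a proxy for "solver DB present" is an assumption.
    """
    root = Path(db_dir)
    if not root.is_dir():
        return False
    return any(root.glob(f"{arch}_*_metadata.tn.model"))
```

A guard built on this would skip encoder profiling when the function returns False for the current architecture, rather than relying on a fixed list of GPU names.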

Related Issues

  • #32180 (V1 engine crash on gfx1151)
  • #37151 (HSA segfault on gfx1151)

Additional Context

This affects ALL Qwen3.5 MoE models (including -Base variants) because they all include a vision encoder. The gfx1151 (Strix Halo) is increasingly popular for local LLM hosting due to its 128GB unified memory.

extent analysis

Fix Plan

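This plan follows the issue's Suggested Fix: modify the _maybe_initialize_encoder_cache() function in gpu_model_runner.py to skip profiling under certain conditions. The --skip-encoder-profiling flag used below is a proposal from the issue. The notes for PR #38555 above describe a different approach (architecture detection) and do not mention this flag. The proposed steps are: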

  • Add a new flag --skip-encoder-profiling to the vllm serve command.
  • Modify the _maybe_initialize_encoder_cache() function to check for the following conditions:
    • No multimodal inputs are expected (--limit-mm-per-prompt or text-only serving)
    • MIOpen solver DB is missing for the current GPU architecture
    • The --skip-encoder-profiling flag is set
  • If any of these conditions are met, skip the profiling step.

Example code:

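def _maybe_initialize_encoder_cache(self):
    # ... existing code ...

    # Illustrative only: self.args, text_only, skip_encoder_profiling and
    # _has_miopen_solver_db() are hypothetical names, not existing vLLM APIs.
    limits = self.args.limit_mm_per_prompt or {}
    # Multimodal input is ruled out only if every per-prompt limit is zero.
    no_mm_expected = self.args.text_only or (
        bool(limits) and all(n == 0 for n in limits.values()))
    if (no_mm_expected
            or not self._has_miopen_solver_db()
            or self.args.skip_encoder_profiling):
        return  # skip encoder profiling

    # ... existing code ...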

If the proposed --skip-encoder-profiling flag were implemented, the vllm serve command would look like this:

vllm serve Qwen/Qwen3.5-35B-A3B --enforce-eager --dtype float16 --trust-remote-code --skip-encoder-profiling

Verification

To verify a fix, start vllm serve on the affected GPU with a build that contains it, then check that startup proceeds past the "Encoder cache will be initialized" log line, that the health endpoint returns 200, and that the server responds to requests.

Extra Tips

  • Upgrade to a vLLM build that includes PR #38555, which closes this issue.
  • If you are using a custom gpu_model_runner.py file, make sure to apply the changes to that file as well.
  • Setting MIOPEN_DEBUG_DISABLE_FIND_DB=1, MIOPEN_FIND_ENFORCE=NONE, and MIOPEN_DISABLE_CACHE=1 is not a fix: the issue reporter found that none of them prevents the hang, because the convolution kernel itself blocks.
