vllm - [Bug]: runai_streamer + MTP drafter fails to load weights from model_streamer local cache

StepCodex · 2026-05-08T11:17:05Z



Error Message

RuntimeError: Cannot find any safetensors model weights with /root/.cache/vllm/assets/model_streamer/<hash>

Root Cause

Not confirmed. The reporter's theory: the main model streams from S3 directly, but for the MTP drafter vLLM points the runai_streamer loader at a local cache directory (/root/.cache/vllm/assets/model_streamer/<hash>). The loader's list_safetensors() is non-recursive and looks only for top-level *.safetensors files, and the cache layout does not expose the weights at that level, so the drafter load finds nothing and the workers abort.

Fix / Workaround

None is given in the report. The reporter asks whether this is a known bug, a misconfiguration, or something with an existing workaround.



Your current environment

docker image vllm-openai:v0.20.1

🐛 Describe the bug

Model: s3://models/Qwen_Qwen3.6-35B-A3B-FP8/
Loader: runai_streamer
All model files present in S3 (see listing below).

Behavior:

If I start vllm (vllm-openai:v0.20.1) without MTP (speculative decoding) enabled, the model loads and serves correctly from S3 via runai_streamer.
If I add speculative_config={"method": "mtp", ...} (e.g., for Qwen-3 MTP), the engine fails with:

RuntimeError: Cannot find any safetensors model weights with `/root/.cache/vllm/assets/model_streamer/<hash>`

(See full log excerpt below.)

Review of the S3 path shows all expected safetensors (layers-*.safetensors, mtp.safetensors, outside.safetensors, model.safetensors.index.json, etc.) are present.
The error only happens when MTP is enabled, and only during the drafter model load phase.
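The two launch variants described above can be written down as command lines for side-by-side comparison. This is only a sketch: `build_serve_argv` is a hypothetical helper, the report does not show the actual launch command, and the speculative config contains only the `method` key because the other keys were elided in the report.

```python
import json
import shlex


def build_serve_argv(model_uri: str, enable_mtp: bool) -> list[str]:
    """Assemble a `vllm serve` command line for the two cases in the report."""
    argv = ["vllm", "serve", model_uri, "--load-format", "runai_streamer"]
    if enable_mtp:
        # Only "method" is known from the report; its other keys were elided
        # there and are deliberately not guessed here.
        spec = {"method": "mtp"}
        argv += ["--speculative-config", json.dumps(spec)]
    return argv


MODEL = "s3://models/Qwen_Qwen3.6-35B-A3B-FP8/"
working = build_serve_argv(MODEL, enable_mtp=False)  # reported to load fine
failing = build_serve_argv(MODEL, enable_mtp=True)   # reported to fail in drafter load

print(shlex.join(working))
print(shlex.join(failing))
```

The only difference between the two commands is the `--speculative-config` argument, which matches the report's claim that the failure appears only once MTP is enabled.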

Theory / Tracing:

vLLM loads the main model weights from S3 directly via runai_streamer.
For the speculative MTP drafter, vLLM attempts to reload the weights from a local cache path such as /root/.cache/vllm/assets/model_streamer/<hash>, using the runai_streamer loader's list_safetensors() function.
This function (by design) is non-recursive and only searches for top-level *.safetensors files.
The model_streamer cache structure does NOT expose the weights at the root, so the loader finds none, even though the S3 source is correct.
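The theory above can be illustrated without vLLM at all. The sketch below is not the loader's actual code: it uses plain `glob` to stand in for a non-recursive listing, and the nested `subdir` layout is a made-up example, since the report does not show what the cache directory really contains.

```python
import glob
import os
import tempfile


def list_top_level(path: str) -> list[str]:
    # Non-recursive: matches only files directly inside `path`.
    return sorted(glob.glob(os.path.join(path, "*.safetensors")))


def list_recursive(path: str) -> list[str]:
    # Recursive: "**" matches zero or more nested directories.
    pattern = os.path.join(path, "**", "*.safetensors")
    return sorted(glob.glob(pattern, recursive=True))


with tempfile.TemporaryDirectory() as cache_dir:
    # Hypothetical layout: weights one level below the directory being listed.
    nested = os.path.join(cache_dir, "subdir")
    os.makedirs(nested)
    for name in ("mtp.safetensors", "outside.safetensors"):
        open(os.path.join(nested, name), "wb").close()
    open(os.path.join(cache_dir, "config.json"), "w").close()

    top_level_hits = [os.path.relpath(p, cache_dir) for p in list_top_level(cache_dir)]
    recursive_hits = [os.path.relpath(p, cache_dir) for p in list_recursive(cache_dir)]

print("top-level:", top_level_hits)
print("recursive:", recursive_hits)
```

With weights one level down, the top-level listing comes back empty while the recursive one finds both files, which is the kind of mismatch that would produce a "Cannot find any safetensors model weights" error.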

Log excerpt (abridged):

(Worker_TP1 pid=170) ERROR ... RuntimeError: Cannot find any safetensors model weights with `/root/.cache/vllm/assets/model_streamer/d7905c16`
(Worker_TP0 pid=165) ERROR ... RuntimeError: Cannot find any safetensors model weights with `/root/.cache/vllm/assets/model_streamer/d7905c16`
(EngineCore pid=150) ERROR ... Exception: WorkerProc initialization failed due to an exception in a background process ...
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above.

Sample S3 path listing (abridged):

layers-0.safetensors
layers-1.safetensors
...
layers-39.safetensors
mtp.safetensors
outside.safetensors
model.safetensors.index.json
config.json
README.md
...

Summary:

Model works with runai_streamer loader unless MTP is enabled.
With MTP, drafter load fails due to not finding weights in the local streamer cache (used by vLLM for the draft model).
The root cause seems to be vLLM pointing the loader at a local cache dir whose layout doesn't expose safetensors at the root, which list_safetensors() does not recurse into.
Possibly a vLLM cache layout or loader usage bug, or a need for recursive search in the loader for this scenario.
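One way to check the summary's hypothesis on an affected machine is to look at where, if anywhere, *.safetensors files sit under the cache directory named in the error. The helper below is a diagnostic sketch only; `safetensors_by_dir` and `format_report` are hypothetical names and nothing here touches vLLM itself.

```python
import os
from collections import Counter


def safetensors_by_dir(root: str) -> Counter:
    """Count *.safetensors files per directory, keyed by path relative to root.

    The key "." means the files sit directly in `root`, which is where a
    non-recursive listing would look.
    """
    counts: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        n = sum(1 for name in filenames if name.endswith(".safetensors"))
        if n:
            counts[os.path.relpath(dirpath, root)] = n
    return counts


def format_report(root: str) -> str:
    counts = safetensors_by_dir(root)
    if not counts:
        return f"{root}: no *.safetensors files at any depth"
    lines = [f"{root}:"]
    for rel, n in sorted(counts.items()):
        lines.append(f"  {rel}: {n} file(s)")
    return "\n".join(lines)
```

Running `print(format_report("/root/.cache/vllm/assets/model_streamer/d7905c16"))` inside the container would show whether the directory is empty, holds weights only in subdirectories, or holds them at the top level. Only the second outcome would support the non-recursive listing theory.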

Would appreciate insight if this is a known bug or wrong configuration, or if a workaround exists.

