vllm - 💡(How to fix) Fix [Usage]: Run:ai S3 streamer crashing when loading model from an S3 compatible object storage [1 comments, 1 participants]

vllm · 2026-04-09 08:07:29

GitHub stats

vllm-project/vllm#39396 • Fetched 2026-04-10 03:40:51

Author: SherifWaly

Participants: SherifWaly

Timeline (top): commented ×1, labeled ×1, mentioned ×1, subscribed ×1

Your current environment

I'm using k8s with the vllm/vllm-openai:v0.17.0 docker image

How would you like to use vllm

I'm serving kimi-k2.5 with vLLM on H200/B200 nodes, using the Run:ai Model Streamer after uploading the model to CoreWeave S3-compatible object storage.

vllm serve s3://hf-cache/kimi-k2.5 \
    --port "8000" \
    --host 0.0.0.0 \
    --load-format runai_streamer \
    --model-loader-extra-config '{"distributed":true,"memory_limit":10000000000}' \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --decode-context-parallel-size 8

I'm also setting AWS_EC2_METADATA_DISABLED: true

It usually works well, reaching about 1 GB/s per GPU, so around 8 GB/s in total. The issue occurs when we launch 10+ servers at the same time. S3 throughput per replica naturally drops to around 400 MB/s per GPU, but some servers hit the following error while streaming the model and then hang forever without crashing:

(Worker pid=1254) (Worker_TP0_DCP0 pid=1254) Loading safetensors using Runai Model Streamer:   6% Completed | 12127/208550 [00:13<03:16, 1001.31it/s]
(Worker pid=1254) (Worker_TP0_DCP0 pid=1254) Loading safetensors using Runai Model Streamer:   7% Completed | 14247/208550 [00:15<03:20, 970.14it/s]
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260) [2026-04-09 08:00:15] ERROR distributed_streamer.py:446: [RunAI Streamer][Distributed] rank 6 error: Could not receive runai_call response
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260) Traceback (most recent call last):
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/distributed_streamer/distributed_streamer.py", line 414, in get_chunks
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)     current_data_size, chunk_count_in_batch = self.prefill(
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/distributed_streamer/distributed_streamer.py", line 469, in prefill
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)     chunk_item = next(chunk_gen)
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/distributed_streamer/distributed_streamer.py", line 403, in chunk_generator
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)     yield from ready_chunks_iterator
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/file_streamer/file_streamer.py", line 127, in get_chunks
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)     yield from self.request_ready_chunks()
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/file_streamer/file_streamer.py", line 148, in request_ready_chunks
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)     file_relative_index, chunk_relative_index = runai_response(self.streamer)
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/libstreamer/libstreamer.py", line 89, in runai_response
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)     raise ValueError(f"Could not receive runai_response from libstreamer due to: {error_msg}")
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260) ValueError: Could not receive runai_response from libstreamer due to: b'File access error'

(Worker pid=1258) (Worker_TP4_DCP4 pid=1258) [2026-04-09 08:00:15] ERROR distributed_streamer.py:446: [RunAI Streamer][Distributed] rank 4 error: Could not receive runai_call response
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258) Traceback (most recent call last):
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/distributed_streamer/distributed_streamer.py", line 414, in get_chunks
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)     current_data_size, chunk_count_in_batch = self.prefill(
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/distributed_streamer/distributed_streamer.py", line 469, in prefill
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)     chunk_item = next(chunk_gen)
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/distributed_streamer/distributed_streamer.py", line 403, in chunk_generator
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)     yield from ready_chunks_iterator
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/file_streamer/file_streamer.py", line 127, in get_chunks
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)     yield from self.request_ready_chunks()
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/file_streamer/file_streamer.py", line 148, in request_ready_chunks
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)     file_relative_index, chunk_relative_index = runai_response(self.streamer)
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/libstreamer/libstreamer.py", line 89, in runai_response
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)     raise ValueError(f"Could not receive runai_response from libstreamer due to: {error_msg}")
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258) ValueError: Could not receive runai_response from libstreamer due to: b'File access error'

Loading safetensors using Runai Model Streamer:   7% Completed | 14247/208550 [00:30<03:20, 970.14it/s]

We are planning to serve models more widely from S3, so we need loading to work even if load times increase when more servers load at the same time. Any idea whether there are changes that could be applied?

Thanks
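Because the affected servers hang instead of exiting, one possible stopgap is a small wrapper that watches the server's output and exits with a non-zero code when the streamer error appears, so that an orchestrator such as Kubernetes can restart the pod. The sketch below is an assumption and not part of vLLM or the Run:ai streamer: the names `FATAL_MARKER`, `is_fatal`, and `run_with_watchdog` are hypothetical, and the marker string is copied from the traceback above. It does not clean up worker processes spawned by the server.

```python
import subprocess
import sys
from typing import Sequence

# Substring copied from the traceback in this issue; adjust if the message differs.
FATAL_MARKER = "Could not receive runai_response from libstreamer"


def is_fatal(line: str) -> bool:
    """Return True when a log line reports the streamer failure."""
    return FATAL_MARKER in line


def run_with_watchdog(cmd: Sequence[str]) -> int:
    """Run cmd, echo its output, and kill it on the first fatal line.

    Returns the child's exit code, or 1 if the watchdog killed it.
    """
    proc = subprocess.Popen(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so one loop sees every line
        text=True,
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        sys.stdout.write(line)
        if is_fatal(line):
            proc.kill()  # stop the hung server instead of waiting forever
            proc.wait()
            return 1
    return proc.wait()

# Hypothetical usage inside the container entrypoint:
#   sys.exit(run_with_watchdog(["vllm", "serve", "s3://hf-cache/kimi-k2.5", ...]))
```

This only turns the hang into a crash. It does not address why the object store returned a file access error in the first place.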

extent analysis

TL;DR

Consider adjusting the --tensor-parallel-size and --decode-context-parallel-size parameters to reduce the load on S3 when launching multiple servers simultaneously.

Guidance

Review the S3 bucket configuration to ensure it can handle the increased load when multiple servers are launched at the same time.
Experiment with reducing the --tensor-parallel-size and --decode-context-parallel-size parameters to decrease the parallelism and alleviate the load on S3.
Monitor the S3 throughput and adjust the parameters accordingly to find a balance between performance and reliability.
Consider staggering server launches, for example with a queue or a limit on how many replicas start at once, so that simultaneous loads do not overwhelm the S3 bucket.
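One simple form of queueing is to give each server a different start delay, so that the servers do not all begin streaming from the bucket at the same moment. The sketch below is a minimal illustration under stated assumptions: `startup_delay` is a hypothetical helper, the 30-second slot is an arbitrary value, and how a server learns its own index (for example a pod ordinal) is left open.

```python
import random


def startup_delay(replica_index, slot_seconds, max_jitter_seconds=0.0, rng=None):
    """Seconds a replica should wait before it starts loading.

    Replica i waits i * slot_seconds plus a random jitter in
    [0, max_jitter_seconds], which spreads launches out over time.
    """
    if replica_index < 0 or slot_seconds < 0 or max_jitter_seconds < 0:
        raise ValueError("arguments must be non-negative")
    rng = rng or random.Random()
    return replica_index * slot_seconds + rng.uniform(0.0, max_jitter_seconds)


# Hypothetical values: 10 replicas, one new loader every 30 seconds, no jitter.
delays = [startup_delay(i, slot_seconds=30.0) for i in range(10)]
print(delays)
```

Each server would sleep for its delay before running `vllm serve`. The trade-off is a longer total rollout in exchange for fewer concurrent readers.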

Example

The issue does not include a fix, but the serve command can be modified to lower the parallelism parameters, for example:

vllm serve s3://hf-cache/kimi-k2.5 \
    --port "8000" \
    --host 0.0.0.0 \
    --load-format runai_streamer \
    --model-loader-extra-config '{"distributed":true,"memory_limit":10000000000}' \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --decode-context-parallel-size 4
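A separate knob worth checking is the `concurrency` key of --model-loader-extra-config. The vLLM documentation for the Run:ai Model Streamer describes it as controlling how many reader threads (and, for S3, client instances) each host opens. Whether lowering it avoids the error in this issue is not confirmed, and the value 8 below is an arbitrary illustration that should be checked against the installed version. The sketch only builds the command line and omits the other flags from the original command.

```python
import json
import shlex

# Values from the issue, plus "concurrency" as the one added key.
# The value 8 is an arbitrary illustration, not a tested recommendation.
extra_config = {
    "distributed": True,
    "memory_limit": 10_000_000_000,
    "concurrency": 8,
}

args = [
    "vllm", "serve", "s3://hf-cache/kimi-k2.5",
    "--load-format", "runai_streamer",
    "--model-loader-extra-config",
    json.dumps(extra_config, separators=(",", ":")),
]

# shlex.join quotes the JSON so the line can be pasted into a shell.
print(shlex.join(args))
```

Building the JSON with `json.dumps` avoids hand-quoting mistakes when more keys are added to the loader config.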

Notes

The errors appear when multiple servers load from S3 simultaneously and per-GPU throughput drops from about 1 GB/s to about 400 MB/s. Adjusting the parallelism parameters and reviewing the S3 bucket configuration may help, but the issue does not confirm either as a fix, and lowering --tensor-parallel-size also changes how the model is sharded across GPUs.
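The throughput figures reported in the issue can be turned into a rough aggregate estimate of what the object store is asked to deliver. The sketch below assumes 8 GPUs per server (from --tensor-parallel-size 8), exactly 10 servers as the lower bound of "10+", and 1 GB = 1000 MB. The function name is hypothetical.

```python
GPUS_PER_SERVER = 8  # --tensor-parallel-size 8 in the issue


def aggregate_gb_per_s(servers, per_gpu_gb_per_s, gpus_per_server=GPUS_PER_SERVER):
    """Total read throughput across all servers, in GB/s."""
    return servers * gpus_per_server * per_gpu_gb_per_s


single = aggregate_gb_per_s(1, 1.0)  # one server at about 1 GB/s per GPU
burst = aggregate_gb_per_s(10, 0.4)  # ten servers at about 400 MB/s per GPU
print(single, burst, burst / single)
```

Under these assumptions, per-GPU speed falls while the total served by the bucket still rises to roughly four times the single-server figure, which is consistent with the store being pushed much harder during simultaneous launches.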

Recommendation

Apply a workaround by adjusting the --tensor-parallel-size and --decode-context-parallel-size parameters to reduce the load on S3, and monitor the performance to find the optimal balance between parallelism and reliability.
