vllm - [Usage]: Run:ai S3 streamer crashing when loading model from an S3 compatible object storage [1 comment, 1 participant]

vllm-project/vllm#39396 (fetched 2026-04-10 03:40:51)


Your current environment

I'm using Kubernetes (k8s) with the vllm/vllm-openai:v0.17.0 Docker image.

How would you like to use vllm

I'm serving Kimi-K2.5 with vLLM using the Run:ai Model Streamer, after uploading the model to CoreWeave's S3-compatible object storage, on H200/B200 nodes:

vllm serve s3://hf-cache/kimi-k2.5 \
    --port "8000" \
    --host 0.0.0.0 \
    --load-format runai_streamer \
    --model-loader-extra-config '{"distributed":true,"memory_limit":10000000000}' \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --decode-context-parallel-size 8

I'm also setting AWS_EC2_METADATA_DISABLED: true.

It usually works well, loading at about 1 GB/s per GPU (around 8 GB/s total). The problem appears when we launch 10+ servers at the same time: S3 throughput per replica naturally drops to around 400 MB/s per GPU, and some servers then fail while streaming the model and hang forever instead of crashing, with the following error:

(Worker pid=1254) (Worker_TP0_DCP0 pid=1254) Loading safetensors using Runai Model Streamer:   6% Completed | 12127/208550 [00:13<03:16, 1001.31it/s]
(Worker pid=1254) (Worker_TP0_DCP0 pid=1254) Loading safetensors using Runai Model Streamer:   7% Completed | 14247/208550 [00:15<03:20, 970.14it/s]
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260) [2026-04-09 08:00:15] ERROR distributed_streamer.py:446: [RunAI Streamer][Distributed] rank 6 error: Could not receive runai_call response
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260) Traceback (most recent call last):
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/distributed_streamer/distributed_streamer.py", line 414, in get_chunks
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)     current_data_size, chunk_count_in_batch = self.prefill(
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/distributed_streamer/distributed_streamer.py", line 469, in prefill
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)     chunk_item = next(chunk_gen)
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/distributed_streamer/distributed_streamer.py", line 403, in chunk_generator
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)     yield from ready_chunks_iterator
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/file_streamer/file_streamer.py", line 127, in get_chunks
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)     yield from self.request_ready_chunks()
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/file_streamer/file_streamer.py", line 148, in request_ready_chunks
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)     file_relative_index, chunk_relative_index = runai_response(self.streamer)
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/libstreamer/libstreamer.py", line 89, in runai_response
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260)     raise ValueError(f"Could not receive runai_response from libstreamer due to: {error_msg}")
(Worker pid=1260) (Worker_TP6_DCP6 pid=1260) ValueError: Could not receive runai_response from libstreamer due to: b'File access error'

(Worker pid=1258) (Worker_TP4_DCP4 pid=1258) [2026-04-09 08:00:15] ERROR distributed_streamer.py:446: [RunAI Streamer][Distributed] rank 4 error: Could not receive runai_call response
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258) Traceback (most recent call last):
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/distributed_streamer/distributed_streamer.py", line 414, in get_chunks
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)     current_data_size, chunk_count_in_batch = self.prefill(
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/distributed_streamer/distributed_streamer.py", line 469, in prefill
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)     chunk_item = next(chunk_gen)
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/distributed_streamer/distributed_streamer.py", line 403, in chunk_generator
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)     yield from ready_chunks_iterator
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/file_streamer/file_streamer.py", line 127, in get_chunks
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)     yield from self.request_ready_chunks()
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/file_streamer/file_streamer.py", line 148, in request_ready_chunks
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)     file_relative_index, chunk_relative_index = runai_response(self.streamer)
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)   File "/usr/local/lib/python3.12/dist-packages/runai_model_streamer/libstreamer/libstreamer.py", line 89, in runai_response
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258)     raise ValueError(f"Could not receive runai_response from libstreamer due to: {error_msg}")
(Worker pid=1258) (Worker_TP4_DCP4 pid=1258) ValueError: Could not receive runai_response from libstreamer due to: b'File access error'

Loading safetensors using Runai Model Streamer:   7% Completed | 14247/208550 [00:30<03:20, 970.14it/s]
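The last progress line above reports the same 14247/208550 count at 00:30 as at 00:15, i.e. loading has stalled. Because the affected servers hang instead of exiting, a stall check over these progress lines can flag a stuck replica. This is a hypothetical sketch: parse_progress, is_stalled and the 60-second threshold are not part of vLLM or the Run:ai streamer, and the regex assumes the tqdm line format shown in this log.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumes the tqdm-style format seen in the issue's log, e.g.
# "Loading safetensors using Runai Model Streamer:   7% Completed | 14247/208550 [00:15<03:20, 970.14it/s]"
PROGRESS_RE = re.compile(r"Runai Model Streamer:\s+\d+% Completed \| (\d+)/(\d+) \[([\d:]+)<")


def parse_progress(line: str) -> Optional[Tuple[int, int, int]]:
    """Return (done, total, elapsed_seconds) for a progress line, or None."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    elapsed = 0
    for part in m.group(3).split(":"):  # "MM:SS" or "H:MM:SS"
        elapsed = elapsed * 60 + int(part)
    return int(m.group(1)), int(m.group(2)), elapsed


def is_stalled(lines: Iterable[str], stall_after_s: int = 60) -> bool:
    """True if the completed count stops advancing for stall_after_s seconds
    of reported elapsed time (stall_after_s is a hypothetical threshold)."""
    last_done: Optional[int] = None
    last_change = 0
    for line in lines:
        parsed = parse_progress(line)
        if parsed is None:
            continue  # skip tracebacks and other non-progress lines
        done, total, elapsed = parsed
        if done >= total:
            return False  # loading finished
        if done != last_done:
            last_done, last_change = done, elapsed
        elif elapsed - last_change >= stall_after_s:
            return True
    return False
```

Feeding the three progress lines from this log with stall_after_s=10 returns True. The sketch assumes progress lines from a single rank; a real watchdog would also compare against wall-clock time, because a hung loader may stop printing altogether.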

We are planning to serve models more widely from S3, so we need loading to succeed even if load times increase when many servers load at the same time. Are there any changes we could apply?

Thanks


extent analysis

TL;DR

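Consider lowering --tensor-parallel-size, together with --decode-context-parallel-size (the DCP ranks share the same GPUs as the TP ranks, as worker names such as Worker_TP6_DCP6 show), to reduce the number of processes reading from S3 in parallel on each server when many servers launch at once. Each server still has to read the whole checkpoint, so this reduces concurrency rather than the total data transferred.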

Guidance

  • Check the S3 bucket and endpoint configuration for request-rate or bandwidth limits that are reached when many servers load at the same time.
  • Experiment with a smaller --tensor-parallel-size and a matching --decode-context-parallel-size to reduce loading parallelism, provided the model still fits on fewer GPUs.
  • Monitor S3 throughput during rollouts and adjust the parameters to balance load speed against reliability.
  • Stagger server startups, for example with a queue that lets only a few replicas load at a time, so the S3 bucket is not hit by every replica at once. A request load balancer does not help here, because the failures happen during model loading, before the servers accept traffic.
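A staggered startup can be sketched as a wave-based rollout: at most max_concurrent_loads replicas stream weights at once, and the next wave starts only when the current one is healthy. This is a hypothetical helper, not a vLLM feature. It assumes each replica runs vLLM's OpenAI-compatible server, which answers GET /health once it is ready; the start and base_url_of callbacks (for example, scaling up one Kubernetes pod and returning its URL) are placeholders you would supply.

```python
import time
import urllib.request
from typing import Callable, List


def plan_launch_waves(replicas: List[str], max_concurrent_loads: int) -> List[List[str]]:
    """Split replicas into waves of at most max_concurrent_loads each."""
    if max_concurrent_loads < 1:
        raise ValueError("max_concurrent_loads must be >= 1")
    return [replicas[i:i + max_concurrent_loads]
            for i in range(0, len(replicas), max_concurrent_loads)]


def wait_until_healthy(base_url: str, timeout_s: float, poll_s: float = 10.0) -> bool:
    """Poll <base_url>/health until it answers HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # server still loading weights, or not listening yet
        time.sleep(poll_s)
    return False


def roll_out(replicas: List[str], max_concurrent_loads: int,
             start: Callable[[str], None], base_url_of: Callable[[str], str],
             timeout_s: float = 1800.0) -> None:
    """Start replicas wave by wave; fail loudly instead of hanging forever."""
    for wave in plan_launch_waves(replicas, max_concurrent_loads):
        for name in wave:
            start(name)  # hypothetical: e.g. scale one Deployment to 1 replica
        for name in wave:
            if not wait_until_healthy(base_url_of(name), timeout_s):
                raise RuntimeError(f"{name} not healthy after {timeout_s}s; restart it")
```

With 10 replicas and max_concurrent_loads=4, plan_launch_waves yields waves of 4, 4 and 2. The wave size and the 1800 s timeout are assumptions to tune against observed S3 throughput.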

Example

The serving command from the issue can be modified to lower the parallelism, for example:

vllm serve s3://hf-cache/kimi-k2.5 \
    --port "8000" \
    --host 0.0.0.0 \
    --load-format runai_streamer \
    --model-loader-extra-config '{"distributed":true,"memory_limit":10000000000}' \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --decode-context-parallel-size 4
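Halving --tensor-parallel-size halves the number of GPUs the weights are sharded across, so it is worth checking that the model still fits before trying it. A rough sketch, assuming weights shard evenly across TP ranks (replicated layers are ignored) and using vLLM's default --gpu-memory-utilization of 0.9. The model_size_gb value and min_kv_headroom_gb reserve are hypothetical placeholders, since the issue does not give the checkpoint size.

```python
def weights_per_gpu_gb(model_size_gb: float, tp_size: int) -> float:
    """Approximate weight footprint per GPU, assuming even sharding across TP ranks."""
    return model_size_gb / tp_size


def fits(model_size_gb: float, tp_size: int, gpu_mem_gb: float,
         gpu_memory_utilization: float = 0.9,
         min_kv_headroom_gb: float = 10.0) -> bool:
    """True if sharded weights plus a minimum KV-cache reserve fit in the
    fraction of GPU memory vLLM is allowed to use.
    min_kv_headroom_gb is a hypothetical reserve, not a vLLM setting."""
    budget = gpu_mem_gb * gpu_memory_utilization
    return weights_per_gpu_gb(model_size_gb, tp_size) + min_kv_headroom_gb <= budget


if __name__ == "__main__":
    model_size_gb = 600.0  # hypothetical checkpoint size; replace with the real one
    h200_gb = 141.0        # H200 HBM capacity
    for tp in (8, 4):
        print(tp, weights_per_gpu_gb(model_size_gb, tp), fits(model_size_gb, tp, h200_gb))
```

With a hypothetical 600 GB checkpoint on 141 GB H200s, TP 8 leaves room (75 GB of weights per GPU), while TP 4 does not (150 GB per GPU exceeds the budget of about 127 GB).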

Notes

The failures seem to be related to the increased load on S3 when many servers launch simultaneously: per-GPU throughput drops from about 1 GB/s to about 400 MB/s, and some ranks then fail with b'File access error' while their server hangs instead of exiting. Lowering the parallelism and reviewing the S3 bucket configuration may help alleviate the issue.
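The throughput figures from the issue make the contention concrete. A back-of-the-envelope sketch, assuming each of the 8 GPUs streams an equal share of the checkpoint (an assumption about how "distributed": true splits the work); model_size_gb is a hypothetical placeholder because the issue does not state the checkpoint size.

```python
def aggregate_gbps(servers: int, gpus_per_server: int, per_gpu_mbps: float) -> float:
    """Total bandwidth demanded from the object store, in GB/s."""
    return servers * gpus_per_server * per_gpu_mbps / 1000


def load_time_s(model_size_gb: float, gpus_per_server: int, per_gpu_mbps: float) -> float:
    """Seconds to stream one full checkpoint if every GPU reads an equal share."""
    return model_size_gb * 1000 / (gpus_per_server * per_gpu_mbps)


if __name__ == "__main__":
    model_size_gb = 600.0  # hypothetical; use the real size of the safetensors files
    # Figures quoted in the issue: ~1 GB/s per GPU alone, ~400 MB/s per GPU with 10+ servers.
    for servers, per_gpu in ((1, 1000), (10, 400)):
        print(f"{servers:>2} server(s): {aggregate_gbps(servers, 8, per_gpu):.0f} GB/s total, "
              f"{load_time_s(model_size_gb, 8, per_gpu):.1f} s per load")
```

With these inputs, ten servers ask the storage backend for 32 GB/s in total versus 8 GB/s for one, and each load takes 2.5 times longer (1000 vs 400 MB/s per GPU). Longer loads by themselves are expected; the problem reported in the issue is that some ranks fail and hang rather than simply finishing later.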

Recommendation

As a workaround, lower --tensor-parallel-size and --decode-context-parallel-size to reduce parallel S3 reads, and monitor load behavior to find the best balance between parallelism and reliability.
