vllm - ✅(Solved) Fix [Bug]: DP>1 doesn't work with weight syncing [1 pull requests, 1 comments, 2 participants]

GitHub stats
vllm-project/vllm#36932 (fetched 2026-04-08 00:43:30)
Comments: 1
Participants: 2
Timeline: 4
Reactions: 0
Timeline (top): closed ×1, commented ×1, cross-referenced ×1, labeled ×1

Error Message

INFO 03-12 21:50:50 [serve.py:100] Defaulting api_server_count to data_parallel_size (2). INFO 03-12 21:50:50 [utils.py:302] INFO 03-12 21:50:50 [utils.py:302] █ █ █▄ ▄█ INFO 03-12 21:50:50 [utils.py:302] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.17.1 INFO 03-12 21:50:50 [utils.py:302] █▄█▀ █ █ █ █ model facebook/opt-125m INFO 03-12 21:50:50 [utils.py:302] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀ INFO 03-12 21:50:50 [utils.py:302] INFO 03-12 21:50:50 [utils.py:238] non-default args: {'model_tag': 'facebook/opt-125m', 'api_server_count': 2, 'model': 'facebook/opt-125m', 'enforce_eager': True, 'load_format': 'dummy', 'data_parallel_size': 2, 'weight_transfer_config': WeightTransferConfig(backend='nccl')} INFO 03-12 21:50:50 [model.py:531] Resolved architecture: OPTForCausalLM INFO 03-12 21:50:50 [model.py:1554] Using max model len 2048 INFO 03-12 21:50:51 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192. INFO 03-12 21:50:51 [vllm.py:747] Asynchronous scheduling is enabled. WARNING 03-12 21:50:51 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none WARNING 03-12 21:50:51 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. INFO 03-12 21:50:51 [vllm.py:957] Cudagraph is disabled under eager mode INFO 03-12 21:50:51 [utils.py:865] Started DP Coordinator process (PID: 770158) INFO 03-12 21:50:51 [utils.py:217] Started 2 API server processes (EngineCore_DP1 pid=770162) WARNING 03-12 21:51:46 [multiproc_executor.py:945] Reducing Torch parallelism from 88 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. 
(EngineCore_DP1 pid=770162) INFO 03-12 21:51:46 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=26.0.163.147 (local), world_size=1, local_world_size=1 (EngineCore_DP0 pid=770161) INFO 03-12 21:51:46 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=dummy, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=facebook/opt-125m, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 
'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []} (EngineCore_DP0 pid=770161) WARNING 03-12 21:51:46 [multiproc_executor.py:945] Reducing Torch parallelism from 88 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. (EngineCore_DP0 pid=770161) INFO 03-12 21:51:46 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=26.0.163.147 (local), world_size=1, local_world_size=1 (ApiServer_0 pid=770163) INFO 03-12 21:51:48 [model.py:531] Resolved architecture: OPTForCausalLM (ApiServer_0 pid=770163) INFO 03-12 21:51:48 [model.py:1554] Using max model len 2048 (ApiServer_1 pid=770164) INFO 03-12 21:51:48 [model.py:531] Resolved architecture: OPTForCausalLM (ApiServer_1 pid=770164) INFO 03-12 21:51:48 [model.py:1554] Using max model len 2048 (ApiServer_0 pid=770163) INFO 03-12 21:51:48 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192. (ApiServer_0 pid=770163) INFO 03-12 21:51:48 [vllm.py:747] Asynchronous scheduling is enabled. (ApiServer_0 pid=770163) WARNING 03-12 21:51:48 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. 
This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (ApiServer_0 pid=770163) WARNING 03-12 21:51:48 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. (ApiServer_0 pid=770163) INFO 03-12 21:51:48 [vllm.py:957] Cudagraph is disabled under eager mode (ApiServer_1 pid=770164) INFO 03-12 21:51:48 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192. (ApiServer_1 pid=770164) INFO 03-12 21:51:48 [vllm.py:747] Asynchronous scheduling is enabled. (ApiServer_1 pid=770164) WARNING 03-12 21:51:48 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (ApiServer_1 pid=770164) WARNING 03-12 21:51:48 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. (ApiServer_1 pid=770164) INFO 03-12 21:51:48 [vllm.py:957] Cudagraph is disabled under eager mode INFO 03-12 21:52:43 [factory.py:100] Creating weight transfer engine: NCCLWeightTransferEngine INFO 03-12 21:52:43 [factory.py:100] Creating weight transfer engine: NCCLWeightTransferEngine (Worker pid=771044) INFO 03-12 21:52:45 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:46093 backend=nccl (Worker pid=771043) INFO 03-12 21:52:45 [parallel_state.py:1393] world_size=1 rank=0 local_rank=1 distributed_init_method=tcp://127.0.0.1:40161 backend=nccl (Worker pid=771044) INFO 03-12 21:52:45 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A (Worker pid=771043) INFO 03-12 21:52:45 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A (Worker pid=771043) INFO 03-12 21:52:46 [base.py:106] Offloader 
set to NoopOffloader (Worker pid=771044) INFO 03-12 21:52:46 [base.py:106] Offloader set to NoopOffloader (Worker pid=771043) (Worker pid=771043) INFO 03-12 21:52:46 [gpu_model_runner.py:4281] Starting to load model facebook/opt-125m... (Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:46 [gpu_model_runner.py:4281] Starting to load model facebook/opt-125m... (Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:52 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']. (Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:52 [flash_attn.py:587] Using FlashAttention version 3 (Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:53 [gpu_model_runner.py:4364] Model loading took 0.24 GiB memory and 5.898658 seconds (EngineCore_DP1 pid=770162) INFO 03-12 21:52:55 [kv_cache_utils.py:1314] GPU KV cache size: 2,049,856 tokens (EngineCore_DP1 pid=770162) INFO 03-12 21:52:55 [kv_cache_utils.py:1319] Maximum concurrency for 2,048 tokens per request: 1000.91x (Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:55 [gpu_worker.py:424] Available KV cache memory: 70.38 GiB (EngineCore_DP0 pid=770161) INFO 03-12 21:52:55 [kv_cache_utils.py:1314] GPU KV cache size: 2,049,856 tokens (EngineCore_DP0 pid=770161) INFO 03-12 21:52:55 [kv_cache_utils.py:1319] Maximum concurrency for 2,048 tokens per request: 1000.91x (Worker pid=771043) (Worker pid=771043) 2026-03-12 21:52:55,284 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ... (Worker pid=771044) (Worker pid=771044) 2026-03-12 21:52:55,298 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ... 
(Worker pid=771044) (Worker pid=771044) 2026-03-12 21:52:55,308 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends (Worker pid=771043) (Worker pid=771043) 2026-03-12 21:52:55,308 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends (EngineCore_DP1 pid=770162) INFO 03-12 21:52:55 [core.py:282] init engine (profile, create kv cache, warmup model) took 2.08 seconds (EngineCore_DP0 pid=770161) INFO 03-12 21:52:55 [core.py:282] init engine (profile, create kv cache, warmup model) took 2.08 seconds INFO 03-12 21:52:56 [coordinator.py:203] All engine subscriptions received by DP coordinator (EngineCore_DP0 pid=770161) INFO 03-12 21:52:56 [vllm.py:747] Asynchronous scheduling is enabled. (ApiServer_0 pid=770163) WARNING 03-12 21:52:56 [loggers.py:1271] AsyncLLM created with api_server_count more than 1; disabling stats logging to avoid incomplete stats. (ApiServer_1 pid=770164) WARNING 03-12 21:52:56 [loggers.py:1271] AsyncLLM created with api_server_count more than 1; disabling stats logging to avoid incomplete stats. (EngineCore_DP0 pid=770161) WARNING 03-12 21:52:56 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (EngineCore_DP1 pid=770162) INFO 03-12 21:52:56 [vllm.py:747] Asynchronous scheduling is enabled. (EngineCore_DP0 pid=770161) WARNING 03-12 21:52:56 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. (EngineCore_DP1 pid=770162) WARNING 03-12 21:52:56 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (EngineCore_DP1 pid=770162) WARNING 03-12 21:52:56 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. 
INFO 03-12 21:52:56 [utils.py:248] Waiting for API servers to complete ... (EngineCore_DP0 pid=770161) INFO 03-12 21:52:56 [vllm.py:957] Cudagraph is disabled under eager mode (EngineCore_DP1 pid=770162) INFO 03-12 21:52:56 [vllm.py:957] Cudagraph is disabled under eager mode (ApiServer_0 pid=770163) INFO 03-12 21:52:56 [api_server.py:495] Supported tasks: ['generate'] (ApiServer_1 pid=770164) INFO 03-12 21:52:56 [api_server.py:495] Supported tasks: ['generate'] (ApiServer_0 pid=770163) WARNING 03-12 21:52:56 [init.py:14] SECURITY WARNING: Development endpoints are enabled! This should NOT be used in production! (ApiServer_1 pid=770164) WARNING 03-12 21:52:56 [init.py:14] SECURITY WARNING: Development endpoints are enabled! This should NOT be used in production! (ApiServer_0 pid=770163) INFO 03-12 21:52:56 [serving.py:185] Warming up chat template processing... (ApiServer_1 pid=770164) INFO 03-12 21:52:56 [serving.py:185] Warming up chat template processing... (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [hf.py:318] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this. 
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] Chat template warmup failed (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] Traceback (most recent call last): (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 201, in warmup (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] await self._preprocess_chat( (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/engine/serving.py", line 982, in _preprocess_chat (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] (conversation,), (engine_prompt,) = await renderer.render_chat_async( (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/base.py", line 755, in render_chat_async (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] for conv, prompt in await asyncio.gather(*rendered): (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 694, in render_messages_async (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] prompt_raw = safe_apply_chat_template( (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] ^^^^^^^^^^^^^^^^^^^^^^^^^ (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 459, in safe_apply_chat_template (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] raise 
ChatTemplateResolutionError( (ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] vllm.entrypoints.chat_utils.ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one. (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [hf.py:318] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this. (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] Chat template warmup failed (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] Traceback (most recent call last): (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 201, in warmup (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] await self._preprocess_chat( (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/engine/serving.py", line 982, in _preprocess_chat (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] (conversation,), (engine_prompt,) = await renderer.render_chat_async( (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/base.py", line 755, in render_chat_async (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] for conv, prompt in await asyncio.gather(*rendered): (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 694, in render_messages_async 
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] prompt_raw = safe_apply_chat_template( (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] ^^^^^^^^^^^^^^^^^^^^^^^^^ (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 459, in safe_apply_chat_template (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] raise ChatTemplateResolutionError( (ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] vllm.entrypoints.chat_utils.ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one. (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [api_server.py:500] Starting vLLM API server 1 on http://0.0.0.0:8000 (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:38] Available routes are: (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /docs, Methods: GET, HEAD (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /redoc, Methods: GET, HEAD (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /sleep, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /wake_up, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_sleeping, Methods: GET (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /collective_rpc, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_prefix_cache, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_mm_cache, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] 
Route: /reset_encoder_cache, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /tokenize, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /detokenize, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /load, Methods: GET (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /version, Methods: GET (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /health, Methods: GET (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /metrics, Methods: GET (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /server_info, Methods: GET (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/models, Methods: GET (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /ping, Methods: GET (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /ping, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /invocations, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/chat/completions, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/completions, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/completions/render, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/messages, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: 
/v1/messages/count_tokens, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /inference/v1/generate, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /pause, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /resume, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_paused, Methods: GET (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /init_weight_transfer_engine, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /update_weights, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /get_world_size, Methods: GET (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST (ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8000 (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:38] Available routes are: (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /docs, Methods: GET, HEAD (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /redoc, Methods: GET, HEAD (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /sleep, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /wake_up, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_sleeping, Methods: GET (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /collective_rpc, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: 
/reset_prefix_cache, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_mm_cache, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_encoder_cache, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /tokenize, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /detokenize, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /load, Methods: GET (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /version, Methods: GET (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /health, Methods: GET (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /metrics, Methods: GET (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /server_info, Methods: GET (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/models, Methods: GET (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /ping, Methods: GET (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /ping, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /invocations, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/chat/completions, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/completions, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/completions/render, 
Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/messages, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /inference/v1/generate, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /pause, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /resume, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_paused, Methods: GET (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /init_weight_transfer_engine, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /update_weights, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /get_world_size, Methods: GET (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST (ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST (ApiServer_0 pid=770163) INFO: Started server process [770163] (ApiServer_0 pid=770163) INFO: Waiting for application startup. (ApiServer_1 pid=770164) INFO: Started server process [770164] (ApiServer_1 pid=770164) INFO: Waiting for application startup. (ApiServer_0 pid=770163) INFO: Application startup complete. (ApiServer_1 pid=770164) INFO: Application startup complete. (ApiServer_1 pid=770164) INFO: 127.0.0.1:40240 - "POST /v1/completions HTTP/1.1" 200 OK (ApiServer_1 pid=770164) INFO: 127.0.0.1:40242 - "GET /get_world_size HTTP/1.1" 200 OK (EngineCore_DP1 pid=770162) INFO 03-12 21:55:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). 
(EngineCore_DP0 pid=770161) INFO 03-12 21:55:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). (EngineCore_DP1 pid=770162) INFO 03-12 21:56:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). (EngineCore_DP0 pid=770161) INFO 03-12 21:56:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). (EngineCore_DP1 pid=770162) INFO 03-12 21:57:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). (EngineCore_DP0 pid=770161) INFO 03-12 21:57:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
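
One detail in the log above is easy to miss in the wall of text: both worker processes report the same DP rank. The sketch below pulls that out of the two `parallel_state.py` lines copied from the log. The regular expression and the helper name `dp_ranks_by_worker` are illustrative assumptions, not part of vLLM.

```python
import re

# Two lines copied verbatim from the log above.
LOG = """\
(Worker pid=771044) INFO 03-12 21:52:45 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=771043) INFO 03-12 21:52:45 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
"""

# Hypothetical pattern: captures the worker pid and the DP rank it reports.
PATTERN = re.compile(
    r"\(Worker pid=(\d+)\).*?rank \d+ in world size \d+ is assigned as DP rank (\d+)"
)


def dp_ranks_by_worker(log_text: str) -> dict[int, int]:
    """Map each worker pid to the DP rank it logged."""
    return {int(m.group(1)): int(m.group(2)) for m in PATTERN.finditer(log_text)}


ranks = dp_ranks_by_worker(LOG)
duplicated = len(set(ranks.values())) < len(ranks)
print(ranks, "duplicate DP ranks:", duplicated)
```

With `data_parallel_size=2`, two distinct workers both reporting DP rank 0 is the symptom to look for before the `shm_broadcast.py` timeout messages begin.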

Fix Action

Fix / Workaround

============================== CPU Info

Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 48 bits physical, 48 bits virtual CPU(s): 96 On-line CPU(s) list: 0-95 Thread(s) per core: 1 Core(s) per socket: 48 Socket(s): 2 NUMA node(s): 2 Vendor ID: AuthenticAMD CPU family: 25 Model: 1 Model name: AMD EPYC 7R13 Processor Stepping: 1 CPU MHz: 2649.998 BogoMIPS: 5299.99 Hypervisor vendor: KVM Virtualization type: full L1d cache: 3 MiB L1i cache: 3 MiB L2 cache: 48 MiB L3 cache: 384 MiB NUMA node0 CPU(s): 0-47 NUMA node1 CPU(s): 48-95 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Mitigation; safe RET Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid

PR fix notes

PR #36940: [BUG] Fix rank calculation in NCCLWeightTransferEngine

Description (problem / solution / changelog)

Purpose

Closes https://github.com/vllm-project/vllm/issues/36932.

Rank calculations were incorrect: for non-MoE models, parallel_config.data_parallel_rank is always 0, so every DP replica computed the same rank. The engine now uses self.parallel_config.data_parallel_index instead.
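
To make the effect of the one-line change concrete, here is a minimal sketch of how a rank collision could arise. Only the two field names `data_parallel_rank` and `data_parallel_index` come from the PR description. The class `ParallelConfigSketch`, the function `transfer_rank`, the `rank_offset` of 1 for a trainer process, and the rank formula itself are assumptions for illustration and may not match the code in `nccl_engine.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfigSketch:
    """Hypothetical stand-in for the config fields named in the PR."""

    data_parallel_rank: int   # per the PR: always 0 for non-MoE models
    data_parallel_index: int  # distinct for each DP replica
    world_size: int           # workers inside one replica (1 in the issue's log)


def transfer_rank(dp_id: int, world_size: int, local_rank: int,
                  rank_offset: int = 1) -> int:
    """Assumed layout: trainer ranks come first, then workers replica by replica."""
    return rank_offset + dp_id * world_size + local_rank


# Two replicas of a non-MoE model, matching data_parallel_size=2 in the issue.
replicas = [
    ParallelConfigSketch(data_parallel_rank=0, data_parallel_index=0, world_size=1),
    ParallelConfigSketch(data_parallel_rank=0, data_parallel_index=1, world_size=1),
]

before_fix = [transfer_rank(c.data_parallel_rank, c.world_size, 0) for c in replicas]
after_fix = [transfer_rank(c.data_parallel_index, c.world_size, 0) for c in replicas]

print(before_fix)  # [1, 1]: both replicas claim the same rank
print(after_fix)   # [1, 2]: each replica gets its own rank
```

Under these assumptions, two processes claiming the same rank could never complete group initialization, which would be consistent with the hang and the repeated `shm_broadcast.py` messages in the log.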



Changed files

  • vllm/distributed/weight_transfer/nccl_engine.py (modified, +1/-1)

Code Example

==============================
        System Info
==============================
OS                           : Ubuntu 20.04.6 LTS (x86_64)
GCC version                  : (Ubuntu 10.5.0-1ubuntu1~20.04) 10.5.0
Clang version                : Could not collect
CMake version                : version 3.27.7
Libc version                 : glibc-2.31

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 (main, Sep  2 2025, 14:20:58) [Clang 20.1.4 ] (64-bit runtime)
Python platform              : Linux-5.15.0-1048-aws-x86_64-with-glibc2.31

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.1.105
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA H100 80GB HBM3
Nvidia driver version        : 575.57.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             96
On-line CPU(s) list:                0-95
Thread(s) per core:                 1
Core(s) per socket:                 48
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              1
Model name:                         AMD EPYC 7R13 Processor
Stepping:                           1
CPU MHz:                            2649.998
BogoMIPS:                           5299.99
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          3 MiB
L1i cache:                          3 MiB
L2 cache:                           48 MiB
L3 cache:                           384 MiB
NUMA node0 CPU(s):                  0-47
NUMA node1 CPU(s):                  48-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==5.2.0
[pip3] triton==3.6.0
[conda] flashinfer-python         0.5.3                    pypi_0    pypi
[conda] numpy                     2.2.6                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.8.4.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.8.90                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.8.93                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.8.90                  pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.10.2.21                pypi_0    pypi
[conda] nvidia-cudnn-frontend     1.18.0                   pypi_0    pypi
[conda] nvidia-cufft-cu12         11.3.3.83                pypi_0    pypi
[conda] nvidia-cufile-cu12        1.13.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.9.90                pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.7.3.90                pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.5.8.93                pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.7.1                    pypi_0    pypi
[conda] nvidia-cutlass-dsl        4.4.1                    pypi_0    pypi
[conda] nvidia-cutlass-dsl-libs-base 4.4.1                    pypi_0    pypi
[conda] nvidia-ml-py              13.590.48                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.27.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.8.93                  pypi_0    pypi
[conda] nvidia-nvshmem-cu12       3.4.5                    pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.8.90                  pypi_0    pypi
[conda] pyzmq                     27.1.0                   pypi_0    pypi
[conda] torch                     2.10.0                   pypi_0    pypi
[conda] torch-c-dlpack-ext        0.1.5                    pypi_0    pypi
[conda] torchaudio                2.10.0                   pypi_0    pypi
[conda] torchvision               0.25.0                   pypi_0    pypi
[conda] transformers              4.57.6                   pypi_0    pypi
[conda] triton                    3.6.0                    pypi_0    pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-10	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
CUDA_VISIBLE_DEVICES=0
LD_LIBRARY_PATH=/fsx/qgallouedec/miniconda3/lib/python3.13/site-packages/nvidia/nvjitlink/lib:/fsx/qgallouedec/miniconda3/lib/python3.13/site-packages/nvidia/nvjitlink/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.1/lib:/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1:/usr/local/cuda-12.1/targets/x86_64-linux/lib/:/usr/local/cuda-12.1/extras/CUPTI/lib64:/usr/local/lib:/usr/lib
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_quentin_gallouedec

---

VLLM_SERVER_DEV_MODE=1 vllm serve facebook/opt-125m \
        --enforce-eager \
        --data-parallel-size 2 \
        --weight-transfer-config '{"backend": "nccl"}' \
        --load-format dummy

---

# weight_sync_demo.py
import threading

import requests
import torch
from openai import OpenAI
from transformers import AutoModelForCausalLM

from vllm.distributed.weight_transfer.nccl_engine import NCCLTrainerSendWeightsArgs, NCCLWeightTransferEngine
from vllm.utils.network_utils import get_ip, get_open_port

BASE_URL = "http://localhost:8000"
MODEL = "facebook/opt-125m"


def generate(client, prompt):
    return client.completions.create(model=MODEL, prompt=prompt, max_tokens=32, temperature=0).choices[0].text


def main():
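    # Load the transformers model on the trainer GPU: physical GPU 2, which is
    # visible as cuda:0 because the script is run with CUDA_VISIBLE_DEVICES=2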
    model = AutoModelForCausalLM.from_pretrained(MODEL, dtype=torch.bfloat16, device_map="cuda:0")

    client = OpenAI(base_url=f"{BASE_URL}/v1", api_key="EMPTY")
    prompt = "The capital of France is"

    print("Before:", generate(client, prompt))

    master_address, master_port = get_ip(), get_open_port()
    server_world_size = requests.get(f"{BASE_URL}/get_world_size").json()["world_size"]
    world_size = server_world_size + 1  # trainer + all vLLM workers

    # Init vLLM side in background (blocks until NCCL handshake completes)
    threading.Thread(
        target=lambda: requests.post(
            f"{BASE_URL}/init_weight_transfer_engine",
            json={
                "init_info": {
                    "master_address": master_address,
                    "master_port": master_port,
                    "rank_offset": 1,
                    "world_size": world_size,
                }
            },
            timeout=300,
        )
    ).start()

    group = NCCLWeightTransferEngine.trainer_init(
        {"master_address": master_address, "master_port": master_port, "world_size": world_size}
    )
    requests.post(f"{BASE_URL}/pause")
    params = list(model.named_parameters())
    threading.Thread(
        target=lambda: requests.post(
            f"{BASE_URL}/update_weights",
            json={
                "update_info": {
                    "names": [n for n, _ in params],
                    "dtype_names": [str(p.dtype).split(".")[-1] for _, p in params],
                    "shapes": [list(p.shape) for _, p in params],
                    "packed": True,
                }
            },
            timeout=300,
        )
    ).start()

    NCCLWeightTransferEngine.trainer_send_weights(
        iterator=model.named_parameters(),
        trainer_args=NCCLTrainerSendWeightsArgs(group=group, packed=True),
    )

    requests.post(f"{BASE_URL}/resume")

    print("After: ", generate(client, prompt))


if __name__ == "__main__":
    main()

---

CUDA_VISIBLE_DEVICES=2 python weight_sync_demo.py

---

INFO 03-12 21:50:50 [serve.py:100] Defaulting api_server_count to data_parallel_size (2).
INFO 03-12 21:50:50 [utils.py:302] 
INFO 03-12 21:50:50 [utils.py:302]        █     █     █▄   ▄█
INFO 03-12 21:50:50 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1
INFO 03-12 21:50:50 [utils.py:302]   █▄█▀ █     █     █     █  model   facebook/opt-125m
INFO 03-12 21:50:50 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
INFO 03-12 21:50:50 [utils.py:302] 
INFO 03-12 21:50:50 [utils.py:238] non-default args: {'model_tag': 'facebook/opt-125m', 'api_server_count': 2, 'model': 'facebook/opt-125m', 'enforce_eager': True, 'load_format': 'dummy', 'data_parallel_size': 2, 'weight_transfer_config': WeightTransferConfig(backend='nccl')}
INFO 03-12 21:50:50 [model.py:531] Resolved architecture: OPTForCausalLM
INFO 03-12 21:50:50 [model.py:1554] Using max model len 2048
INFO 03-12 21:50:51 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 03-12 21:50:51 [vllm.py:747] Asynchronous scheduling is enabled.
WARNING 03-12 21:50:51 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
WARNING 03-12 21:50:51 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 03-12 21:50:51 [vllm.py:957] Cudagraph is disabled under eager mode
INFO 03-12 21:50:51 [utils.py:865] Started DP Coordinator process (PID: 770158)
INFO 03-12 21:50:51 [utils.py:217] Started 2 API server processes
(EngineCore_DP1 pid=770162) WARNING 03-12 21:51:46 [multiproc_executor.py:945] Reducing Torch parallelism from 88 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP1 pid=770162) INFO 03-12 21:51:46 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=26.0.163.147 (local), world_size=1, local_world_size=1
(EngineCore_DP0 pid=770161) INFO 03-12 21:51:46 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=dummy, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=facebook/opt-125m, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': 
{'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=770161) WARNING 03-12 21:51:46 [multiproc_executor.py:945] Reducing Torch parallelism from 88 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=770161) INFO 03-12 21:51:46 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=26.0.163.147 (local), world_size=1, local_world_size=1
(ApiServer_0 pid=770163) INFO 03-12 21:51:48 [model.py:531] Resolved architecture: OPTForCausalLM
(ApiServer_0 pid=770163) INFO 03-12 21:51:48 [model.py:1554] Using max model len 2048
(ApiServer_1 pid=770164) INFO 03-12 21:51:48 [model.py:531] Resolved architecture: OPTForCausalLM
(ApiServer_1 pid=770164) INFO 03-12 21:51:48 [model.py:1554] Using max model len 2048
(ApiServer_0 pid=770163) INFO 03-12 21:51:48 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(ApiServer_0 pid=770163) INFO 03-12 21:51:48 [vllm.py:747] Asynchronous scheduling is enabled.
(ApiServer_0 pid=770163) WARNING 03-12 21:51:48 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(ApiServer_0 pid=770163) WARNING 03-12 21:51:48 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(ApiServer_0 pid=770163) INFO 03-12 21:51:48 [vllm.py:957] Cudagraph is disabled under eager mode
(ApiServer_1 pid=770164) INFO 03-12 21:51:48 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(ApiServer_1 pid=770164) INFO 03-12 21:51:48 [vllm.py:747] Asynchronous scheduling is enabled.
(ApiServer_1 pid=770164) WARNING 03-12 21:51:48 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(ApiServer_1 pid=770164) WARNING 03-12 21:51:48 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(ApiServer_1 pid=770164) INFO 03-12 21:51:48 [vllm.py:957] Cudagraph is disabled under eager mode
INFO 03-12 21:52:43 [factory.py:100] Creating weight transfer engine: NCCLWeightTransferEngine
INFO 03-12 21:52:43 [factory.py:100] Creating weight transfer engine: NCCLWeightTransferEngine
(Worker pid=771044) INFO 03-12 21:52:45 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:46093 backend=nccl
(Worker pid=771043) INFO 03-12 21:52:45 [parallel_state.py:1393] world_size=1 rank=0 local_rank=1 distributed_init_method=tcp://127.0.0.1:40161 backend=nccl
(Worker pid=771044) INFO 03-12 21:52:45 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=771043) INFO 03-12 21:52:45 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=771043) INFO 03-12 21:52:46 [base.py:106] Offloader set to NoopOffloader
(Worker pid=771044) INFO 03-12 21:52:46 [base.py:106] Offloader set to NoopOffloader
(Worker pid=771043) (Worker pid=771043) INFO 03-12 21:52:46 [gpu_model_runner.py:4281] Starting to load model facebook/opt-125m...
(Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:46 [gpu_model_runner.py:4281] Starting to load model facebook/opt-125m...
(Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:52 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:52 [flash_attn.py:587] Using FlashAttention version 3
(Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:53 [gpu_model_runner.py:4364] Model loading took 0.24 GiB memory and 5.898658 seconds
(EngineCore_DP1 pid=770162) INFO 03-12 21:52:55 [kv_cache_utils.py:1314] GPU KV cache size: 2,049,856 tokens
(EngineCore_DP1 pid=770162) INFO 03-12 21:52:55 [kv_cache_utils.py:1319] Maximum concurrency for 2,048 tokens per request: 1000.91x
(Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:55 [gpu_worker.py:424] Available KV cache memory: 70.38 GiB
(EngineCore_DP0 pid=770161) INFO 03-12 21:52:55 [kv_cache_utils.py:1314] GPU KV cache size: 2,049,856 tokens
(EngineCore_DP0 pid=770161) INFO 03-12 21:52:55 [kv_cache_utils.py:1319] Maximum concurrency for 2,048 tokens per request: 1000.91x
(Worker pid=771043) (Worker pid=771043) 2026-03-12 21:52:55,284 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=771044) (Worker pid=771044) 2026-03-12 21:52:55,298 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=771044) (Worker pid=771044) 2026-03-12 21:52:55,308 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker pid=771043) (Worker pid=771043) 2026-03-12 21:52:55,308 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore_DP1 pid=770162) INFO 03-12 21:52:55 [core.py:282] init engine (profile, create kv cache, warmup model) took 2.08 seconds
(EngineCore_DP0 pid=770161) INFO 03-12 21:52:55 [core.py:282] init engine (profile, create kv cache, warmup model) took 2.08 seconds
INFO 03-12 21:52:56 [coordinator.py:203] All engine subscriptions received by DP coordinator
(EngineCore_DP0 pid=770161) INFO 03-12 21:52:56 [vllm.py:747] Asynchronous scheduling is enabled.
(ApiServer_0 pid=770163) WARNING 03-12 21:52:56 [loggers.py:1271] AsyncLLM created with api_server_count more than 1; disabling stats logging to avoid incomplete stats.
(ApiServer_1 pid=770164) WARNING 03-12 21:52:56 [loggers.py:1271] AsyncLLM created with api_server_count more than 1; disabling stats logging to avoid incomplete stats.
(EngineCore_DP0 pid=770161) WARNING 03-12 21:52:56 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore_DP1 pid=770162) INFO 03-12 21:52:56 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=770161) WARNING 03-12 21:52:56 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP1 pid=770162) WARNING 03-12 21:52:56 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore_DP1 pid=770162) WARNING 03-12 21:52:56 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 03-12 21:52:56 [utils.py:248] Waiting for API servers to complete ...
(EngineCore_DP0 pid=770161) INFO 03-12 21:52:56 [vllm.py:957] Cudagraph is disabled under eager mode
(EngineCore_DP1 pid=770162) INFO 03-12 21:52:56 [vllm.py:957] Cudagraph is disabled under eager mode
(ApiServer_0 pid=770163) INFO 03-12 21:52:56 [api_server.py:495] Supported tasks: ['generate']
(ApiServer_1 pid=770164) INFO 03-12 21:52:56 [api_server.py:495] Supported tasks: ['generate']
(ApiServer_0 pid=770163) WARNING 03-12 21:52:56 [__init__.py:14] SECURITY WARNING: Development endpoints are enabled! This should NOT be used in production!
(ApiServer_1 pid=770164) WARNING 03-12 21:52:56 [__init__.py:14] SECURITY WARNING: Development endpoints are enabled! This should NOT be used in production!
(ApiServer_0 pid=770163) INFO 03-12 21:52:56 [serving.py:185] Warming up chat template processing...
(ApiServer_1 pid=770164) INFO 03-12 21:52:56 [serving.py:185] Warming up chat template processing...
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] Chat template warmup failed
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] Traceback (most recent call last):
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 201, in warmup
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]     await self._preprocess_chat(
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/engine/serving.py", line 982, in _preprocess_chat
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]     (conversation,), (engine_prompt,) = await renderer.render_chat_async(
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/base.py", line 755, in render_chat_async
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]     for conv, prompt in await asyncio.gather(*rendered):
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 694, in render_messages_async
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]     prompt_raw = safe_apply_chat_template(
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]                  ^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 459, in safe_apply_chat_template
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]     raise ChatTemplateResolutionError(
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] vllm.entrypoints.chat_utils.ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] Chat template warmup failed
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] Traceback (most recent call last):
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 201, in warmup
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]     await self._preprocess_chat(
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/engine/serving.py", line 982, in _preprocess_chat
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]     (conversation,), (engine_prompt,) = await renderer.render_chat_async(
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/base.py", line 755, in render_chat_async
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]     for conv, prompt in await asyncio.gather(*rendered):
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 694, in render_messages_async
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]     prompt_raw = safe_apply_chat_template(
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]                  ^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 459, in safe_apply_chat_template
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]     raise ChatTemplateResolutionError(
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] vllm.entrypoints.chat_utils.ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [api_server.py:500] Starting vLLM API server 1 on http://0.0.0.0:8000
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:38] Available routes are:
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /sleep, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /wake_up, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_sleeping, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /collective_rpc, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_prefix_cache, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_mm_cache, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_encoder_cache, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /tokenize, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /detokenize, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /load, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /version, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /health, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /metrics, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /server_info, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/models, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /ping, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /ping, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /invocations, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/completions, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/messages, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /pause, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /resume, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_paused, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /init_weight_transfer_engine, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /update_weights, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /get_world_size, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8000
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:38] Available routes are:
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /sleep, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /wake_up, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_sleeping, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /collective_rpc, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_prefix_cache, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_mm_cache, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_encoder_cache, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /tokenize, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /detokenize, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /load, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /version, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /health, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /metrics, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /server_info, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/models, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /ping, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /ping, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /invocations, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/completions, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/messages, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /pause, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /resume, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_paused, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /init_weight_transfer_engine, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /update_weights, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /get_world_size, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(ApiServer_0 pid=770163) INFO:     Started server process [770163]
(ApiServer_0 pid=770163) INFO:     Waiting for application startup.
(ApiServer_1 pid=770164) INFO:     Started server process [770164]
(ApiServer_1 pid=770164) INFO:     Waiting for application startup.
(ApiServer_0 pid=770163) INFO:     Application startup complete.
(ApiServer_1 pid=770164) INFO:     Application startup complete.
(ApiServer_1 pid=770164) INFO:     127.0.0.1:40240 - "POST /v1/completions HTTP/1.1" 200 OK
(ApiServer_1 pid=770164) INFO:     127.0.0.1:40242 - "GET /get_world_size HTTP/1.1" 200 OK
(EngineCore_DP1 pid=770162) INFO 03-12 21:55:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=770161) INFO 03-12 21:55:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP1 pid=770162) INFO 03-12 21:56:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=770161) INFO 03-12 21:56:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP1 pid=770162) INFO 03-12 21:57:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=770161) INFO 03-12 21:57:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).

---

Loading weights: 100%|| 197/197 [00:00<00:00, 2126.36it/s, Materializing param=model.decoder.layer
The tied weights mapping and config for this model specifies to tie model.decoder.embed_tokens.weight to lm_head.weight, but both are present in the checkpoints, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning
Before: <s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
INFO 03-12 21:54:02 [pynccl.py:111] vLLM is using nccl==2.27.5
Traceback (most recent call last):
  File "/fsx/qgallouedec/ultra-scale-rl/../vllm/examples/weight_sync_demo.py", line 91, in <module>
    main()
  File "/fsx/qgallouedec/ultra-scale-rl/../vllm/examples/weight_sync_demo.py", line 60, in main
    group = NCCLWeightTransferEngine.trainer_init(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/distributed/weight_transfer/nccl_engine.py", line 312, in trainer_init
    return NCCLWeightTransferEngine._stateless_init_process_group(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/distributed/weight_transfer/nccl_engine.py", line 333, in _stateless_init_process_group
    pynccl = PyNcclCommunicator(pg, device=device)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 139, in __init__
    self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 407, in ncclCommInitRank
    self.NCCL_CHECK(
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 373, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: remote process exited or there was a network error
Exception in thread Thread-2 (<lambda>):
Traceback (most recent call last):
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/connection.py", line 571, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/admin/home/quentin_gallouedec/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/http/client.py", line 1430, in getresponse
    response.begin()
  File "/admin/home/quentin_gallouedec/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/admin/home/quentin_gallouedec/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/http/client.py", line 292, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/admin/home/quentin_gallouedec/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/socket.py", line 720, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/requests/adapters.py", line 644, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/util/retry.py", line 490, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/util/util.py", line 39, in reraise
    raise value
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 367, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=300)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/admin/home/quentin_gallouedec/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/admin/home/quentin_gallouedec/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/fsx/qgallouedec/ultra-scale-rl/../vllm/examples/weight_sync_demo.py", line 46, in <lambda>
    target=lambda: requests.post(
                   ^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/requests/adapters.py", line 690, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=300)
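
In the traceback above, the `/init_weight_transfer_engine` request runs in a bare `threading.Thread`, so its `ReadTimeout` only shows up as an "Exception in thread" message after the main thread has already failed. A minimal sketch, using only the standard library, of running such a blocking call through `concurrent.futures` so the caller can inspect the outcome. `run_in_background`, `outcome`, and `fake_blocking_call` are hypothetical names; `fake_blocking_call` merely stands in for the real HTTP request and is not part of the reported script.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_background(executor, fn, *args, **kwargs):
    """Submit a blocking call; the returned Future keeps its result or exception."""
    return executor.submit(fn, *args, **kwargs)


def outcome(future, timeout=None):
    """Return ("ok", value) or ("error", exception) for a finished or failing call."""
    try:
        return "ok", future.result(timeout=timeout)
    except Exception as exc:  # broad on purpose: the goal is only to report it
        return "error", exc


def fake_blocking_call(should_fail):
    # Hypothetical stand-in for requests.post(..., timeout=300)
    if should_fail:
        raise TimeoutError("timed out")
    return 200


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        print(outcome(run_in_background(pool, fake_blocking_call, False)))
        print(outcome(run_in_background(pool, fake_blocking_call, True)))
```

With this shape, the trainer side could check the server-side request's outcome right after its own NCCL init attempt instead of waiting for the background thread to print its own traceback.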

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Ubuntu 20.04.6 LTS (x86_64)
GCC version                  : (Ubuntu 10.5.0-1ubuntu1~20.04) 10.5.0
Clang version                : Could not collect
CMake version                : version 3.27.7
Libc version                 : glibc-2.31

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 (main, Sep  2 2025, 14:20:58) [Clang 20.1.4 ] (64-bit runtime)
Python platform              : Linux-5.15.0-1048-aws-x86_64-with-glibc2.31

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.1.105
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA H100 80GB HBM3
Nvidia driver version        : 575.57.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             96
On-line CPU(s) list:                0-95
Thread(s) per core:                 1
Core(s) per socket:                 48
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              1
Model name:                         AMD EPYC 7R13 Processor
Stepping:                           1
CPU MHz:                            2649.998
BogoMIPS:                           5299.99
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          3 MiB
L1i cache:                          3 MiB
L2 cache:                           48 MiB
L3 cache:                           384 MiB
NUMA node0 CPU(s):                  0-47
NUMA node1 CPU(s):                  48-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==5.2.0
[pip3] triton==3.6.0
[conda] flashinfer-python         0.5.3                    pypi_0    pypi
[conda] numpy                     2.2.6                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.8.4.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.8.90                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.8.93                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.8.90                  pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.10.2.21                pypi_0    pypi
[conda] nvidia-cudnn-frontend     1.18.0                   pypi_0    pypi
[conda] nvidia-cufft-cu12         11.3.3.83                pypi_0    pypi
[conda] nvidia-cufile-cu12        1.13.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.9.90                pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.7.3.90                pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.5.8.93                pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.7.1                    pypi_0    pypi
[conda] nvidia-cutlass-dsl        4.4.1                    pypi_0    pypi
[conda] nvidia-cutlass-dsl-libs-base 4.4.1                    pypi_0    pypi
[conda] nvidia-ml-py              13.590.48                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.27.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.8.93                  pypi_0    pypi
[conda] nvidia-nvshmem-cu12       3.4.5                    pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.8.90                  pypi_0    pypi
[conda] pyzmq                     27.1.0                   pypi_0    pypi
[conda] torch                     2.10.0                   pypi_0    pypi
[conda] torch-c-dlpack-ext        0.1.5                    pypi_0    pypi
[conda] torchaudio                2.10.0                   pypi_0    pypi
[conda] torchvision               0.25.0                   pypi_0    pypi
[conda] transformers              4.57.6                   pypi_0    pypi
[conda] triton                    3.6.0                    pypi_0    pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-10	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
LD_LIBRARY_PATH=/fsx/qgallouedec/miniconda3/lib/python3.13/site-packages/nvidia/nvjitlink/lib:/fsx/qgallouedec/miniconda3/lib/python3.13/site-packages/nvidia/nvjitlink/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.1/lib:/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1:/usr/local/cuda-12.1/targets/x86_64-linux/lib/:/usr/local/cuda-12.1/extras/CUPTI/lib64:/usr/local/lib:/usr/lib
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_quentin_gallouedec
</details>

🐛 Describe the bug

VLLM_SERVER_DEV_MODE=1 vllm serve facebook/opt-125m \
        --enforce-eager \
        --data-parallel-size 2 \
        --weight-transfer-config '{"backend": "nccl"}' \
        --load-format dummy

Note that with --data-parallel-size 1, the same setup works fine.

# weight_sync_demo.py
import threading

import requests
import torch
from openai import OpenAI
from transformers import AutoModelForCausalLM

from vllm.distributed.weight_transfer.nccl_engine import NCCLTrainerSendWeightsArgs, NCCLWeightTransferEngine
from vllm.utils.network_utils import get_ip, get_open_port

BASE_URL = "http://localhost:8000"
MODEL = "facebook/opt-125m"


def generate(client, prompt):
    return client.completions.create(model=MODEL, prompt=prompt, max_tokens=32, temperature=0).choices[0].text


def main():
    # Load the transformers model on the trainer GPU (physical GPU 2, exposed as cuda:0 via CUDA_VISIBLE_DEVICES=2)
    model = AutoModelForCausalLM.from_pretrained(MODEL, dtype=torch.bfloat16, device_map="cuda:0")

    client = OpenAI(base_url=f"{BASE_URL}/v1", api_key="EMPTY")
    prompt = "The capital of France is"

    print("Before:", generate(client, prompt))

    master_address, master_port = get_ip(), get_open_port()
    server_world_size = requests.get(f"{BASE_URL}/get_world_size").json()["world_size"]
    world_size = server_world_size + 1  # trainer + all vLLM workers

    # Init vLLM side in background (blocks until NCCL handshake completes)
    threading.Thread(
        target=lambda: requests.post(
            f"{BASE_URL}/init_weight_transfer_engine",
            json={
                "init_info": {
                    "master_address": master_address,
                    "master_port": master_port,
                    "rank_offset": 1,
                    "world_size": world_size,
                }
            },
            timeout=300,
        )
    ).start()

    group = NCCLWeightTransferEngine.trainer_init(
        {"master_address": master_address, "master_port": master_port, "world_size": world_size}
    )
    requests.post(f"{BASE_URL}/pause")
    params = list(model.named_parameters())
    threading.Thread(
        target=lambda: requests.post(
            f"{BASE_URL}/update_weights",
            json={
                "update_info": {
                    "names": [n for n, _ in params],
                    "dtype_names": [str(p.dtype).split(".")[-1] for _, p in params],
                    "shapes": [list(p.shape) for _, p in params],
                    "packed": True,
                }
            },
            timeout=300,
        )
    ).start()

    NCCLWeightTransferEngine.trainer_send_weights(
        iterator=model.named_parameters(),
        trainer_args=NCCLTrainerSendWeightsArgs(group=group, packed=True),
    )

    requests.post(f"{BASE_URL}/resume")

    print("After: ", generate(client, prompt))


if __name__ == "__main__":
    main()

CUDA_VISIBLE_DEVICES=2 python weight_sync_demo.py
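
The script above sets `rank_offset` to 1 and sizes the group as `server_world_size + 1`, so the trainer takes rank 0 and every vLLM worker needs its own rank from 1 upward. The following is a minimal sketch of that arithmetic, not vLLM code: `unique_worker_ranks`, `local_only_worker_ranks`, and `has_duplicate` are hypothetical helpers, and the idea that each DP engine might derive its rank from its local worker rank alone is only an assumption, prompted by both workers logging `rank=0` in `world_size=1`. Under that assumption, the scheme would be harmless at `--data-parallel-size 1` and would produce duplicate ranks at 2.

```python
def unique_worker_ranks(rank_offset, dp_size, workers_per_engine):
    """Ranks the client script expects: one distinct rank per worker across all DP engines."""
    return [
        rank_offset + dp_rank * workers_per_engine + local_rank
        for dp_rank in range(dp_size)
        for local_rank in range(workers_per_engine)
    ]


def local_only_worker_ranks(rank_offset, dp_size, workers_per_engine):
    """Hypothetical scheme: each engine ignores its DP rank and uses only the local worker rank."""
    return [
        rank_offset + local_rank
        for _ in range(dp_size)
        for local_rank in range(workers_per_engine)
    ]


def has_duplicate(ranks):
    """True if two workers would claim the same rank in the transfer group."""
    return len(set(ranks)) != len(ranks)


if __name__ == "__main__":
    for dp_size in (1, 2):
        expected = unique_worker_ranks(1, dp_size, 1)
        local = local_only_worker_ranks(1, dp_size, 1)
        print(dp_size, expected, local, has_duplicate(local))
```

Even if the real cause turns out to be different, the sketch records the rank layout the client script relies on: trainer at rank 0, workers at consecutive ranks starting from `rank_offset`.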

Server logs

INFO 03-12 21:50:50 [serve.py:100] Defaulting api_server_count to data_parallel_size (2).
INFO 03-12 21:50:50 [utils.py:302] 
INFO 03-12 21:50:50 [utils.py:302]        █     █     █▄   ▄█
INFO 03-12 21:50:50 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1
INFO 03-12 21:50:50 [utils.py:302]   █▄█▀ █     █     █     █  model   facebook/opt-125m
INFO 03-12 21:50:50 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
INFO 03-12 21:50:50 [utils.py:302] 
INFO 03-12 21:50:50 [utils.py:238] non-default args: {'model_tag': 'facebook/opt-125m', 'api_server_count': 2, 'model': 'facebook/opt-125m', 'enforce_eager': True, 'load_format': 'dummy', 'data_parallel_size': 2, 'weight_transfer_config': WeightTransferConfig(backend='nccl')}
INFO 03-12 21:50:50 [model.py:531] Resolved architecture: OPTForCausalLM
INFO 03-12 21:50:50 [model.py:1554] Using max model len 2048
INFO 03-12 21:50:51 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 03-12 21:50:51 [vllm.py:747] Asynchronous scheduling is enabled.
WARNING 03-12 21:50:51 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
WARNING 03-12 21:50:51 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 03-12 21:50:51 [vllm.py:957] Cudagraph is disabled under eager mode
INFO 03-12 21:50:51 [utils.py:865] Started DP Coordinator process (PID: 770158)
INFO 03-12 21:50:51 [utils.py:217] Started 2 API server processes
(EngineCore_DP1 pid=770162) WARNING 03-12 21:51:46 [multiproc_executor.py:945] Reducing Torch parallelism from 88 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP1 pid=770162) INFO 03-12 21:51:46 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=26.0.163.147 (local), world_size=1, local_world_size=1
(EngineCore_DP0 pid=770161) INFO 03-12 21:51:46 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=dummy, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=facebook/opt-125m, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': 
{'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=770161) WARNING 03-12 21:51:46 [multiproc_executor.py:945] Reducing Torch parallelism from 88 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=770161) INFO 03-12 21:51:46 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=26.0.163.147 (local), world_size=1, local_world_size=1
(ApiServer_0 pid=770163) INFO 03-12 21:51:48 [model.py:531] Resolved architecture: OPTForCausalLM
(ApiServer_0 pid=770163) INFO 03-12 21:51:48 [model.py:1554] Using max model len 2048
(ApiServer_1 pid=770164) INFO 03-12 21:51:48 [model.py:531] Resolved architecture: OPTForCausalLM
(ApiServer_1 pid=770164) INFO 03-12 21:51:48 [model.py:1554] Using max model len 2048
(ApiServer_0 pid=770163) INFO 03-12 21:51:48 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(ApiServer_0 pid=770163) INFO 03-12 21:51:48 [vllm.py:747] Asynchronous scheduling is enabled.
(ApiServer_0 pid=770163) WARNING 03-12 21:51:48 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(ApiServer_0 pid=770163) WARNING 03-12 21:51:48 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(ApiServer_0 pid=770163) INFO 03-12 21:51:48 [vllm.py:957] Cudagraph is disabled under eager mode
(ApiServer_1 pid=770164) INFO 03-12 21:51:48 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(ApiServer_1 pid=770164) INFO 03-12 21:51:48 [vllm.py:747] Asynchronous scheduling is enabled.
(ApiServer_1 pid=770164) WARNING 03-12 21:51:48 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(ApiServer_1 pid=770164) WARNING 03-12 21:51:48 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(ApiServer_1 pid=770164) INFO 03-12 21:51:48 [vllm.py:957] Cudagraph is disabled under eager mode
INFO 03-12 21:52:43 [factory.py:100] Creating weight transfer engine: NCCLWeightTransferEngine
INFO 03-12 21:52:43 [factory.py:100] Creating weight transfer engine: NCCLWeightTransferEngine
(Worker pid=771044) INFO 03-12 21:52:45 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:46093 backend=nccl
(Worker pid=771043) INFO 03-12 21:52:45 [parallel_state.py:1393] world_size=1 rank=0 local_rank=1 distributed_init_method=tcp://127.0.0.1:40161 backend=nccl
(Worker pid=771044) INFO 03-12 21:52:45 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=771043) INFO 03-12 21:52:45 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=771043) INFO 03-12 21:52:46 [base.py:106] Offloader set to NoopOffloader
(Worker pid=771044) INFO 03-12 21:52:46 [base.py:106] Offloader set to NoopOffloader
(Worker pid=771043) (Worker pid=771043) INFO 03-12 21:52:46 [gpu_model_runner.py:4281] Starting to load model facebook/opt-125m...
(Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:46 [gpu_model_runner.py:4281] Starting to load model facebook/opt-125m...
(Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:52 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:52 [flash_attn.py:587] Using FlashAttention version 3
(Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:53 [gpu_model_runner.py:4364] Model loading took 0.24 GiB memory and 5.898658 seconds
(EngineCore_DP1 pid=770162) INFO 03-12 21:52:55 [kv_cache_utils.py:1314] GPU KV cache size: 2,049,856 tokens
(EngineCore_DP1 pid=770162) INFO 03-12 21:52:55 [kv_cache_utils.py:1319] Maximum concurrency for 2,048 tokens per request: 1000.91x
(Worker pid=771044) (Worker pid=771044) INFO 03-12 21:52:55 [gpu_worker.py:424] Available KV cache memory: 70.38 GiB
(EngineCore_DP0 pid=770161) INFO 03-12 21:52:55 [kv_cache_utils.py:1314] GPU KV cache size: 2,049,856 tokens
(EngineCore_DP0 pid=770161) INFO 03-12 21:52:55 [kv_cache_utils.py:1319] Maximum concurrency for 2,048 tokens per request: 1000.91x
(Worker pid=771043) (Worker pid=771043) 2026-03-12 21:52:55,284 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=771044) (Worker pid=771044) 2026-03-12 21:52:55,298 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=771044) (Worker pid=771044) 2026-03-12 21:52:55,308 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker pid=771043) (Worker pid=771043) 2026-03-12 21:52:55,308 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore_DP1 pid=770162) INFO 03-12 21:52:55 [core.py:282] init engine (profile, create kv cache, warmup model) took 2.08 seconds
(EngineCore_DP0 pid=770161) INFO 03-12 21:52:55 [core.py:282] init engine (profile, create kv cache, warmup model) took 2.08 seconds
INFO 03-12 21:52:56 [coordinator.py:203] All engine subscriptions received by DP coordinator
(EngineCore_DP0 pid=770161) INFO 03-12 21:52:56 [vllm.py:747] Asynchronous scheduling is enabled.
(ApiServer_0 pid=770163) WARNING 03-12 21:52:56 [loggers.py:1271] AsyncLLM created with api_server_count more than 1; disabling stats logging to avoid incomplete stats.
(ApiServer_1 pid=770164) WARNING 03-12 21:52:56 [loggers.py:1271] AsyncLLM created with api_server_count more than 1; disabling stats logging to avoid incomplete stats.
(EngineCore_DP0 pid=770161) WARNING 03-12 21:52:56 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore_DP1 pid=770162) INFO 03-12 21:52:56 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=770161) WARNING 03-12 21:52:56 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP1 pid=770162) WARNING 03-12 21:52:56 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore_DP1 pid=770162) WARNING 03-12 21:52:56 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 03-12 21:52:56 [utils.py:248] Waiting for API servers to complete ...
(EngineCore_DP0 pid=770161) INFO 03-12 21:52:56 [vllm.py:957] Cudagraph is disabled under eager mode
(EngineCore_DP1 pid=770162) INFO 03-12 21:52:56 [vllm.py:957] Cudagraph is disabled under eager mode
(ApiServer_0 pid=770163) INFO 03-12 21:52:56 [api_server.py:495] Supported tasks: ['generate']
(ApiServer_1 pid=770164) INFO 03-12 21:52:56 [api_server.py:495] Supported tasks: ['generate']
(ApiServer_0 pid=770163) WARNING 03-12 21:52:56 [__init__.py:14] SECURITY WARNING: Development endpoints are enabled! This should NOT be used in production!
(ApiServer_1 pid=770164) WARNING 03-12 21:52:56 [__init__.py:14] SECURITY WARNING: Development endpoints are enabled! This should NOT be used in production!
(ApiServer_0 pid=770163) INFO 03-12 21:52:56 [serving.py:185] Warming up chat template processing...
(ApiServer_1 pid=770164) INFO 03-12 21:52:56 [serving.py:185] Warming up chat template processing...
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] Chat template warmup failed
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] Traceback (most recent call last):
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 201, in warmup
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]     await self._preprocess_chat(
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/engine/serving.py", line 982, in _preprocess_chat
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]     (conversation,), (engine_prompt,) = await renderer.render_chat_async(
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/base.py", line 755, in render_chat_async
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]     for conv, prompt in await asyncio.gather(*rendered):
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 694, in render_messages_async
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]     prompt_raw = safe_apply_chat_template(
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]                  ^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 459, in safe_apply_chat_template
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214]     raise ChatTemplateResolutionError(
(ApiServer_0 pid=770163) ERROR 03-12 21:52:57 [serving.py:214] vllm.entrypoints.chat_utils.ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] Chat template warmup failed
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] Traceback (most recent call last):
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 201, in warmup
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]     await self._preprocess_chat(
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/engine/serving.py", line 982, in _preprocess_chat
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]     (conversation,), (engine_prompt,) = await renderer.render_chat_async(
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/base.py", line 755, in render_chat_async
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]     for conv, prompt in await asyncio.gather(*rendered):
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 694, in render_messages_async
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]     prompt_raw = safe_apply_chat_template(
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]                  ^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]   File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/renderers/hf.py", line 459, in safe_apply_chat_template
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214]     raise ChatTemplateResolutionError(
(ApiServer_1 pid=770164) ERROR 03-12 21:52:57 [serving.py:214] vllm.entrypoints.chat_utils.ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [api_server.py:500] Starting vLLM API server 1 on http://0.0.0.0:8000
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:38] Available routes are:
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /sleep, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /wake_up, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_sleeping, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /collective_rpc, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_prefix_cache, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_mm_cache, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_encoder_cache, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /tokenize, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /detokenize, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /load, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /version, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /health, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /metrics, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /server_info, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/models, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /ping, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /ping, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /invocations, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/completions, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/messages, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /pause, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /resume, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_paused, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /init_weight_transfer_engine, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /update_weights, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /get_world_size, Methods: GET
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(ApiServer_1 pid=770164) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8000
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:38] Available routes are:
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /sleep, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /wake_up, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_sleeping, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /collective_rpc, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_prefix_cache, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_mm_cache, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /reset_encoder_cache, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /tokenize, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /detokenize, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /load, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /version, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /health, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /metrics, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /server_info, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/models, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /ping, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /ping, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /invocations, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/completions, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/messages, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /pause, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /resume, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_paused, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /init_weight_transfer_engine, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /update_weights, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /get_world_size, Methods: GET
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(ApiServer_0 pid=770163) INFO 03-12 21:52:57 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(ApiServer_0 pid=770163) INFO:     Started server process [770163]
(ApiServer_0 pid=770163) INFO:     Waiting for application startup.
(ApiServer_1 pid=770164) INFO:     Started server process [770164]
(ApiServer_1 pid=770164) INFO:     Waiting for application startup.
(ApiServer_0 pid=770163) INFO:     Application startup complete.
(ApiServer_1 pid=770164) INFO:     Application startup complete.
(ApiServer_1 pid=770164) INFO:     127.0.0.1:40240 - "POST /v1/completions HTTP/1.1" 200 OK
(ApiServer_1 pid=770164) INFO:     127.0.0.1:40242 - "GET /get_world_size HTTP/1.1" 200 OK
(EngineCore_DP1 pid=770162) INFO 03-12 21:55:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=770161) INFO 03-12 21:55:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP1 pid=770162) INFO 03-12 21:56:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=770161) INFO 03-12 21:56:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP1 pid=770162) INFO 03-12 21:57:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=770161) INFO 03-12 21:57:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).

Client logs

Loading weights: 100%|█| 197/197 [00:00<00:00, 2126.36it/s, Materializing param=model.decoder.layer
The tied weights mapping and config for this model specifies to tie model.decoder.embed_tokens.weight to lm_head.weight, but both are present in the checkpoints, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning
Before: <s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
INFO 03-12 21:54:02 [pynccl.py:111] vLLM is using nccl==2.27.5
Traceback (most recent call last):
  File "/fsx/qgallouedec/ultra-scale-rl/../vllm/examples/weight_sync_demo.py", line 91, in <module>
    main()
  File "/fsx/qgallouedec/ultra-scale-rl/../vllm/examples/weight_sync_demo.py", line 60, in main
    group = NCCLWeightTransferEngine.trainer_init(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/distributed/weight_transfer/nccl_engine.py", line 312, in trainer_init
    return NCCLWeightTransferEngine._stateless_init_process_group(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/distributed/weight_transfer/nccl_engine.py", line 333, in _stateless_init_process_group
    pynccl = PyNcclCommunicator(pg, device=device)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 139, in __init__
    self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 407, in ncclCommInitRank
    self.NCCL_CHECK(
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 373, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: remote process exited or there was a network error
Exception in thread Thread-2 (<lambda>):
Traceback (most recent call last):
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/connection.py", line 571, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/admin/home/quentin_gallouedec/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/http/client.py", line 1430, in getresponse
    response.begin()
  File "/admin/home/quentin_gallouedec/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/admin/home/quentin_gallouedec/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/http/client.py", line 292, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/admin/home/quentin_gallouedec/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/socket.py", line 720, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/requests/adapters.py", line 644, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/util/retry.py", line 490, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/util/util.py", line 39, in reraise
    raise value
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 367, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=300)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/admin/home/quentin_gallouedec/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/admin/home/quentin_gallouedec/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/fsx/qgallouedec/ultra-scale-rl/../vllm/examples/weight_sync_demo.py", line 46, in <lambda>
    target=lambda: requests.post(
                   ^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/ultra-scale-rl/.venv/lib/python3.12/site-packages/requests/adapters.py", line 690, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=300)


extent analysis

Fix Plan

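With data_parallel_size=2, the client fails inside NCCLWeightTransferEngine.trainer_init: ncclCommInitRank raises "remote process exited or there was a network error", and a background requests.post call to localhost:8000 times out after 300 seconds. Meanwhile both engine cores repeatedly log "No available shared memory broadcast block found in 60 seconds". This suggests that the NCCL rendezvous between the trainer and the vLLM workers never completes. The following steps may help narrow the problem down: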

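  • Check NCCL version: Ensure that the NCCL version is compatible with the CUDA version. In this case, the client log reports nccl==2.27.5, and the CUDA version is 12.1.105.
  • Verify environment variables: NCCL_BLOCKING_WAIT is a PyTorch (torch.distributed) setting rather than an NCCL one, and recent PyTorch releases rename it to TORCH_NCCL_BLOCKING_WAIT. Setting it to 1 makes PyTorch process-group collectives raise an error on timeout instead of hanging. It may have no effect here, because the traceback shows vLLM calling NCCL directly through PyNcclCommunicator.
  • Increase NCCL timeout: Setting NCCL_TIMEOUT to a higher value (e.g., 600 seconds) is sometimes suggested, but this variable does not appear in NCCL's documented environment variables, so treat it as unverified.
  • Check network connectivity: Ensure that the trainer can reach the master address and port used for the rendezvous, and that the connection between the trainer and the vLLM workers is stable.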
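
For the connectivity step above, a minimal sketch of a TCP reachability check in Python. The helper name can_connect and the example host and port are illustrative assumptions, not part of vLLM. A successful connection only shows that something is listening at that address; it does not prove that the NCCL rendezvous itself will succeed.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed values: the API server address that appears in the logs above.
    print(can_connect("localhost", 8000))
```

The same helper can be pointed at the master address and port passed to trainer_init once the server side has started listening on them.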

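If you still want to try the NCCL_BLOCKING_WAIT and NCCL_TIMEOUT environment variables, set them before any NCCL communicator is created, for example at the top of the Python script:

import os

# Must run before vLLM or torch.distributed creates an NCCL communicator.
# NCCL_BLOCKING_WAIT is read by PyTorch, not by NCCL itself.
os.environ["NCCL_BLOCKING_WAIT"] = "1"
# Unverified: NCCL_TIMEOUT is not a documented NCCL variable.
os.environ["NCCL_TIMEOUT"] = "600"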

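Additionally, you can try to modify the weight_sync_demo.py script to catch the NCCL initialization error and retry a bounded number of times. A retry can only succeed if the server side also re-enters its own initialization; if the server request is still hanging, as in the logs above, retrying on the client alone is unlikely to help:

import time

from vllm.distributed.weight_transfer.nccl_engine import NCCLWeightTransferEngine

MAX_ATTEMPTS = 5  # bound the loop so a persistent failure still surfaces

# master_address, master_port and world_size must already be defined.
for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        group = NCCLWeightTransferEngine.trainer_init(
            {"master_address": master_address, "master_port": master_port, "world_size": world_size}
        )
        break
    except RuntimeError as e:
        # Retry only on the NCCL init failure seen in the client logs.
        if "NCCL error: remote process exited or there was a network error" not in str(e):
            raise
        if attempt == MAX_ATTEMPTS:
            raise
        print(f"NCCL init failed (attempt {attempt}/{MAX_ATTEMPTS}), retrying...")
        time.sleep(10)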

Verification

To verify that the fix worked, rerun the weight_sync_demo.py script and confirm that the weight transfer completes, that the client no longer raises the NCCL RuntimeError, and that the server stops logging "No available shared memory broadcast block found in 60 seconds". You can also monitor the NCCL logs for error messages related to timeouts or network issues.
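
To make the NCCL logs mentioned above visible, one option is to enable NCCL's own debug output through the NCCL_DEBUG and NCCL_DEBUG_SUBSYS environment variables. The sketch below builds such an environment for a child process; the helper name nccl_debug_env and the relative script path in the usage comment are assumptions for illustration.

```python
import os
from typing import Dict, Mapping, Optional


def nccl_debug_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with NCCL debug logging enabled."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"             # print NCCL initialization and error details
    env["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # limit output to init and network subsystems
    return env


# Usage sketch (assumes weight_sync_demo.py is in the current directory):
#   import subprocess, sys
#   subprocess.run([sys.executable, "weight_sync_demo.py"], env=nccl_debug_env(), check=True)
```

The vLLM server would need to be started with the same variables for the worker-side NCCL messages to appear in its logs.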

Extra Tips

  • Make sure to check the vLLM documentation for any specific requirements or recommendations for using NCCL with vLLM.
  • If you're using a cloud-based environment, ensure that the network security group rules allow for communication between the trainer and the vLLM workers.
  • Consider using a more robust method for handling NCCL timeouts, such as using a retry mechanism with exponential backoff.
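
The exponential-backoff idea from the last tip could look roughly like the following generic helper. It is a sketch, not vLLM code: retry_with_backoff and its parameters are hypothetical names, and the callable passed in is assumed to raise RuntimeError on a failed attempt, as trainer_init does in the client traceback.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (RuntimeError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the wait after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay * 1, 2, 4, ... and are capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

In the demo script, the trainer_init call could be wrapped in a zero-argument lambda and passed as fn. Backoff only spaces out the attempts; it does not address a server that never joins the rendezvous.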
