vllm - ✅(Solved) Fix [Bug]: Error during inference, the model shut down. Deployed model: Qwen3.5-122B-A10B-FP8 [1 pull request, 16 comments, 4 participants]

vllm • 2026-03-18 06:43:33

GitHub stats

vllm-project/vllm#37392 • Fetched 2026-04-08 00:57:47

Author

watch-Ultra

Participants

lz19833006728-ops

uekaterinauelizabethar2175-crypto

watch-Ultra

ZJY0516

Timeline (top)

commented ×16 • mentioned ×7 • subscribed ×7 • labeled ×1

Error Message

(Worker pid=1057374) (Worker_TP0 pid=1057374) Exception in thread WorkerAsyncOutputCopy:
(Worker pid=1057374) (Worker_TP0 pid=1057374) Traceback (most recent call last):
(Worker pid=1057374) (Worker_TP0 pid=1057374) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
[rank0]:[E318 06:18:55.090076229 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f63fa2ddfdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
[rank2]:[E318 06:18:55.092308347 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f5917572fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
[rank3]:[E318 06:18:55.092349122 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fc2b0772fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
[rank1]:[E318 06:18:55.092355396 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f2730cddfdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
what(): [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f63fa2ddfdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f63fa2ddfdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
what(): [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fc2b0772fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fc2b0772fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
what(): [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f5917572fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f5917572fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)

Root Cause

(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [multiproc_executor.py:261] Worker proc VllmWorker-3 died unexpectedly, shutting down executor.
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.1) with config: model='/root/models/Qwen3.5-122B-A10B-FP8', speculative_config=SpeculativeConfig(method='mtp', model='/root/models/Qwen3.5-122B-A10B-FP8', num_spec_tokens=2), tokenizer='/root/models/Qwen3.5-122B-A10B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-122B-A10B-FP8, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8', '+quant_fp8', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [10240], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 56, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []},
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-9076389db5d12bd8-a49a4b99,prompt_token_ids_len=2723,prefill_token_ids_len=None,mm_features=[MultiModalFeatureSpec(data={'image_grid_thw': MultiModalFieldElem(data=tensor([ 1, 74, 110]), field=MultiModalBatchedField(keep_on_cpu=True)), 'pixel_values': MultiModalFieldElem(data=tensor([[-0.0197, -0.0039, 0.0354, ..., 0.0275, 0.1924, 0.3418],
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [dump_input.py:79] [ 0.2471, 0.2393, 0.2393, ..., -0.5312, -0.6094, -0.6562],
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [dump_input.py:79] [ 0.3809, 0.3496, 0.3262, ..., -0.6875, -0.6875, -0.6797],
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [dump_input.py:79] ...,
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [dump_input.py:79] [ 0.1377, 0.2158, 0.3418, ..., 0.0903, 0.2393, 0.2793],
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [dump_input.py:79] [-0.2080, -0.2637, -0.3105, ..., -0.5312, -0.4980, -0.4824],
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [dump_input.py:79] [-0.0510, -0.0118, 0.0510, ..., -0.1611, -0.1299, -0.1299]],
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [dump_input.py:79] dtype=torch.bfloat16), field=MultiModalFlatField(keep_on_cpu=False, slices=[[slice(0, 8140, None)]], dim=0))}, modality='image', identifier='cafc2d8226c177349a8442c766aec467ad05deb420ec1d0209d0944628128256', mm_position=PlaceholderRange(offset=41, length=2035, is_embed=None), mm_hash='cafc2d8226c177349a8442c766aec467ad05deb420ec1d0209d0944628128256')],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=259421, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([312, 313, 314], [315, 316, 317], [318, 319, 320], [321, 322, 323]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-96c0a31f23f356b1-bb6c038c'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[None],num_computed_tokens=[1115],num_output_tokens=[74]), num_scheduled_tokens={chatcmpl-96c0a31f23f356b1-bb6c038c: 3, chatcmpl-9076389db5d12bd8-a49a4b99: 2723}, total_num_scheduled_tokens=2726, scheduled_spec_decode_tokens={chatcmpl-96c0a31f23f356b1-bb6c038c: [-1, -1]}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=[321, 322, 323])
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=2, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.028714107365792718, encoder_cache_usage=0.18670654296875, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] engine_core.run_busy_loop()
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] self._process_engine_step()
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 497, in step_with_batch_queue
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] model_output = future.result()
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 83, in result
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] return super().result()
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] return self.__get_result()
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] raise self._exception
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 87, in wait_for_response
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] response = self.aggregate(get_response())
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 366, in get_response
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] status, result = mq.dequeue(
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] ^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 622, in dequeue
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] with self.acquire_read(timeout, cancel, indefinite) as buf:
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] return next(self.gen)
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 537, in acquire_read
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] raise RuntimeError("cancelled")
(EngineCore_DP0 pid=1057179) ERROR 03-18 06:18:56 [core.py:1102] RuntimeError: cancelled
(EngineCore_DP0 pid=1057179) Process EngineCore_DP0:
(APIServer pid=1055786) ERROR 03-18 06:18:56 [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1055786) ERROR 03-18 06:18:56 [async_llm.py:708] Traceback (most recent call last):
(APIServer pid=1055786) ERROR 03-18 06:18:56 [async_llm.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=1055786) ERROR 03-18 06:18:56 [async_llm.py:708] outputs = await engine_core.get_output_async()
(APIServer pid=1055786) ERROR 03-18 06:18:56 [async_llm.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1055786) ERROR 03-18 06:18:56 [async_llm.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=1055786) ERROR 03-18 06:18:56 [async_llm.py:708] raise self._format_exception(outputs) from None
(APIServer pid=1055786) ERROR 03-18 06:18:56 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore_DP0 pid=1057179) Traceback (most recent call last):
(EngineCore_DP0 pid=1057179) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=1057179) self.run()
(EngineCore_DP0 pid=1057179) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=1057179) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=1057179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=1057179) raise e
(EngineCore_DP0 pid=1057179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=1057179) engine_core.run_busy_loop()
(EngineCore_DP0 pid=1057179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=1057179) self._process_engine_step()
(EngineCore_DP0 pid=1057179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=1057179) outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=1057179) ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 497, in step_with_batch_queue
(EngineCore_DP0 pid=1057179) model_output = future.result()
(EngineCore_DP0 pid=1057179) ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 83, in result
(EngineCore_DP0 pid=1057179) return super().result()
(EngineCore_DP0 pid=1057179) ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=1057179) return self.__get_result()
(EngineCore_DP0 pid=1057179) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=1057179) raise self._exception
(EngineCore_DP0 pid=1057179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 87, in wait_for_response
(EngineCore_DP0 pid=1057179) response = self.aggregate(get_response())
(EngineCore_DP0 pid=1057179) ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 366, in get_response
(EngineCore_DP0 pid=1057179) status, result = mq.dequeue(
(EngineCore_DP0 pid=1057179) ^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 622, in dequeue
(EngineCore_DP0 pid=1057179) with self.acquire_read(timeout, cancel, indefinite) as buf:
(EngineCore_DP0 pid=1057179) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=1057179) return next(self.gen)
(EngineCore_DP0 pid=1057179) ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1057179) File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 537, in acquire_read
(EngineCore_DP0 pid=1057179) raise RuntimeError("cancelled")
(EngineCore_DP0 pid=1057179) RuntimeError: cancelled
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] Error in chat completion stream generator.
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] Traceback (most recent call last):
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 714, in chat_completion_stream_generator
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] async for res in result_generator:
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 583, in generate
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] out = q.get_nowait() or await q.get()
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] ^^^^^^^^^^^^^
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 85, in get
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] raise output
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] outputs = await engine_core.get_output_async()
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] raise self._format_exception(outputs) from None
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] Error in chat completion stream generator.
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] Traceback (most recent call last):
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 714, in chat_completion_stream_generator
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] async for res in result_generator:
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 583, in generate
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] out = q.get_nowait() or await q.get()
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] ^^^^^^^^^^^^^
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 85, in get
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] raise output
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 714, in chat_completion_stream_generator
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] async for res in result_generator:
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 583, in generate
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] out = q.get_nowait() or await q.get()
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] ^^^^^^^^^^^^^
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 85, in get
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] raise output
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] outputs = await engine_core.get_output_async()
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] raise self._format_exception(outputs) from None
(APIServer pid=1055786) ERROR 03-18 06:18:56 [serving.py:1390] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1055786) INFO: 192.168.102.242:53058 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1055786) INFO: 192.168.102.242:53066 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1055786) INFO: Shutting down
(APIServer pid=1055786) INFO: Waiting for application shutdown.
(APIServer pid=1055786) INFO: Application shutdown complete.
(APIServer pid=1055786) INFO: Finished server process [1055786]

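The scheduler dump in the log above can be sanity-checked with a little arithmetic. The sketch below copies the numbers from the dump and checks that they agree with each other. Two readings are assumptions and are not stated in the log: that the running request is scheduled for one token plus `num_spec_tokens` draft placeholders (the two `-1` entries), and that the vision encoder merges 2x2 patches into one placeholder token (`spatial_merge_size` is a hypothetical name here).

```python
# Values copied from the config and scheduler dump in the log above.
num_spec_tokens = 2                     # SpeculativeConfig(num_spec_tokens=2)
num_scheduled_tokens = {
    "chatcmpl-96c0a31f23f356b1-bb6c038c": 3,     # running request (decode + drafts)
    "chatcmpl-9076389db5d12bd8-a49a4b99": 2723,  # new request (full prompt prefill)
}
total_num_scheduled_tokens = 2726
prompt_token_ids_len = 2723
image_grid_thw = (1, 74, 110)           # image_grid_thw tensor
pixel_value_rows = 8140                 # slice(0, 8140, None)
placeholder_length = 2035               # PlaceholderRange(length=2035)

# ASSUMPTION (not in the log): 2x2 patches are merged into one image token.
spatial_merge_size = 2

# Decode step with speculative decoding: 1 token plus the draft placeholders.
decode_tokens = 1 + num_spec_tokens

# Number of vision patches and the image tokens they would collapse into.
t, h, w = image_grid_thw
patches = t * h * w
image_tokens = patches // (spatial_merge_size ** 2)

print("decode tokens:", decode_tokens)                       # 3
print("sum scheduled:", sum(num_scheduled_tokens.values()))  # 2726
print("patches:", patches)                                   # 8140
print("image tokens:", image_tokens)                         # 2035
```

Under these assumptions every figure lines up: the failing step mixed a 2723-token multimodal prefill with a 3-token speculative decode step in one batch. That is only a description of the batch, not a diagnosis of the crash.
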
Fix Action

Fixed

Fixed by PR: Fix Spec Decode + NCCL Illegal Memory Access (https://github.com/vllm-project/vllm/pull/37412)

PR fix notes

PR #37412: Fix Spec Decode + NCCL Illegal Memory Access

Repository: vllm-project/vllm
Author: xueliangyang-oeuler
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37412

Description (problem / solution / changelog)

Fix for issue #37392: CUDA illegal memory access error during speculative decoding

The error occurs in get_output() when synchronizing the async copy event. This can happen when tensors are prematurely deallocated in speculative decoding with async scheduling, causing the NCCL communication to fail with illegal memory access.

This fix adds error handling to provide more informative error messages when CUDA synchronization fails, helping to diagnose the root cause.

Note: A complete fix may require deeper investigation into the async scheduling and tensor lifecycle management in speculative decoding.

Changed files

vllm/v1/worker/gpu_model_runner.py (modified, +14/-1)

RAW_BUFFER

Your current environment

v0.17.1

🐛 Describe the bug

(APIServer pid=1055786) INFO 03-18 06:18:54 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.60, Accepted throughput: 0.30 tokens/s, Drafted throughput: 1.00 tokens/s, Accepted: 3 tokens, Drafted: 10 tokens, Per-position acceptance rate: 0.600, 0.000, Avg Draft acceptance rate: 30.0%
(APIServer pid=1055786) INFO: 192.168.102.242:53054 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(Worker pid=1057374) (Worker_TP0 pid=1057374) Exception in thread WorkerAsyncOutputCopy:
(Worker pid=1057374) (Worker_TP0 pid=1057374) Traceback (most recent call last):
(Worker pid=1057374) (Worker_TP0 pid=1057374) File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
5 [0] NCCL INFO [Proxy Service] Device 0 CPU core 37
927edc551c14:1057374:1079898 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
927edc551c14:1057374:1079898 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
927edc551c14:1057374:1079898 [0] NCCL INFO CC Off, workFifoBytes 1048576
927edc551c14:1057374:1079898 [0] NCCL INFO ncclCommInitRankConfig comm 0x5c442270 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 54000 commId 0xbcf0cd7746e6ef56 - Init COMPLETE
927edc551c14:1057374:1079898 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 4 total 0.07 (kernels 0.02, alloc 0.01, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.01, rest 0.01)
927edc551c14:1057374:1079912 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
927edc551c14:1057374:1079912 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
927edc551c14:1057374:1079912 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
927edc551c14:1057374:1079912 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
927edc551c14:1057374:1079912 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
(Worker pid=1057374) (Worker_TP0 pid=1057374) self.run()
(Worker pid=1057374) (Worker_TP0 pid=1057374) File "/usr/lib/python3.12/threading.py", line 1012, in run
(Worker pid=1057374) (Worker_TP0 pid=1057374) self._target(*self._args, **self._kwargs)
(Worker pid=1057374) (Worker_TP0 pid=1057374) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 860, in async_output_busy_loop
(Worker pid=1057374) (Worker_TP0 pid=1057374) self.enqueue_output(output)
(Worker pid=1057374) (Worker_TP0 pid=1057374) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in enqueue_output
(Worker pid=1057374) (Worker_TP0 pid=1057374) output = output.get_output()
(Worker pid=1057374) (Worker_TP0 pid=1057374) ^^^^^^^^^^^^^^^^^^^
(Worker pid=1057374) (Worker_TP0 pid=1057374) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 251, in get_output
(Worker pid=1057374) (Worker_TP0 pid=1057374) self.async_copy_ready_event.synchronize()
(Worker pid=1057374) (Worker_TP0 pid=1057374) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(Worker pid=1057374) (Worker_TP0 pid=1057374) Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=1057374) (Worker_TP0 pid=1057374) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=1057374) (Worker_TP0 pid=1057374) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=1057374) (Worker_TP0 pid=1057374) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker pid=1057374) (Worker_TP0 pid=1057374) [rank0]:[E318 06:18:55.090076229 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f63fa2ddfdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xc0e0 (0x7f63fa3770e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so) frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f62b5fed3a0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f62b5ffa518 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f62b5ffdfe9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f62b6000085 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #6: <unknown function> + 0xdc253 (0x7f63bb6b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #7: <unknown function> + 0x94ac3 (0x7f63fb014ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #8: clone + 0x44 (0x7f63fb0a5a84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM 927edc551c14:1057376:1079911 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM 927edc551c14:1057376:1079911 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM 927edc551c14:1057376:1079911 [2] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 0 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM 927edc551c14:1057377:1079910 [3] NCCL INFO Channel 02/0 : 3[3] -> 0[0] via P2P/CUMEM 927edc551c14:1057377:1079910 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/CUMEM 927edc551c14:1057377:1079910 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 L INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM 927edc551c14:1057375:1079913 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM [rank2]:[E318 06:18:55.092308347 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered Search for cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f5917572fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xc0e0 (0x7f59178e60e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so) frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f57d35ed3a0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) 927edc551c14:1057375:1079913 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f57d35fa518 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f57d35fdfe9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f57d3600085 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) 927edc551c14:1057375:1079913 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 frame #6: <unknown function> + 0xdc253 (0x7f58d8cb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #7: <unknown function> + 0x94ac3 (0x7f5918525ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #8: clone + 0x44 (0x7f59185b6a84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E318 06:18:55.092349122 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fc2b0772fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xc0e0 (0x7fc2b0b320e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so) frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fc16c7ed3a0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fc16c7fa518 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fc16c7fdfe9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fc16c800085 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #6: <unknown function> + 0xdc253 (0x7fc271eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #7: <unknown function> + 0x94ac3 (0x7fc2b1789ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #8: clone + 0x44 (0x7fc2b181aa84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E318 06:18:55.092355396 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f2730cddfdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xc0e0 (0x7f2730d770e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so) frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f25ec9ed3a0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f25ec9fa518 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f25ec9fdfe9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f25eca00085 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #6: <unknown function> + 0xdc253 (0x7f26f20b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #7: <unknown function> + 0x94ac3 (0x7f2731a01ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #8: clone + 0x44 (0x7f2731a92a84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError' terminate called after throwing an instance of 'terminate called after throwing an instance of 'c10::DistBackendErrorc10::DistBackendErrorterminate called after throwing an instance of '' c10::DistBackendError' ' what(): [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f63fa2ddfdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x9bd860 (0x7f62b5830860 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #2: <unknown function> + 0xdc253 (0x7f63bb6b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: <unknown function> + 0x94ac3 (0x7f63fb014ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #4: clone + 0x44 (0x7f63fb0a5a84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

what(): [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fc2b0772fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xc0e0 (0x7fc2b0b320e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so) frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fc16c7ed3a0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fc16c7fa518 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fc16c7fdfe9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fc16c800085 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #6: <unknown function> + 0xdc253 (0x7fc271eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #7: <unknown function> + 0x94ac3 (0x7fc2b1789ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #8: clone + 0x44 (0x7fc2b181aa84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fc2b0772fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x9bd860 (0x7fc16c030860 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #2: <unknown function> + 0xdc253 (0x7fc271eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: <unknown function> + 0x94ac3 (0x7fc2b1789ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #4: clone + 0x44 (0x7fc2b181aa84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

what(): [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f5917572fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xc0e0 (0x7f59178e60e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so) frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f57d35ed3a0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f57d35fa518 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f57d35fdfe9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f57d3600085 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #6: <unknown function> + 0xdc253 (0x7f58d8cb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #7: <unknown function> + 0x94ac3 (0x7f5918525ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #8: clone + 0x44 (0x7f59185b6a84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f5917572fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x9bd860 (0x7f57d2e30860 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #2: <unknown function> + 0xdc253 (0x7f58d8cb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: <unknown function> + 0x94ac3 (0x7f5918525ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #4: clone + 0x44 (0x7f59185b6a84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

what(): [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f2730cddfdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xc0e0 (0x7f2730d770e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so) frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f25ec9ed3a0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f25ec9fa518 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f25ec9fdfe9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f25eca00085 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #6: <unknown function> + 0xdc253 (0x7f26f20b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #7: <unknown function> + 0x94ac3 (0x7f2731a01ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #8: clone + 0x44 (0x7f2731a92a84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f2730cddfdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x9bd860 (0x7f25ec230860 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so) frame #2: <unknown function> + 0xdc253 (0x7f26f20b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: <unknown function> + 0x94ac3 (0x7f2731a01ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #4: clone + 0x44 (0x7f2731a92a84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

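The log reports "CUDA error: an illegal memory access was encountered" (cudaErrorIllegalAddress) on all four ranks. This error means a GPU kernel read or wrote an address outside a valid allocation; it is not an out-of-memory condition. CUDA reports such errors asynchronously, so the Python traceback (async_copy_ready_event.synchronize() in gpu_model_runner.py) and the NCCL watchdog frames show where the error was detected, not necessarily the kernel that caused it. The SpecDecoding metrics line logged just before the crash shows that speculative decoding was active, and the linked PR #37412 is titled "Fix Spec Decode + NCCL Illegal Memory Access".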
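As a sanity check on the claim that speculative decoding was active, the SpecDecoding metrics line from the log can be parsed and cross-checked. The sketch below is only illustrative: the helper name `parse_spec_metrics` is hypothetical, the number of drafts is inferred from the two per-position rates rather than logged, and the formula `1 + accepted / drafts` for mean acceptance length is an assumption that happens to reproduce the logged 1.60.

```python
import re

# Metrics line copied from the log above (APIServer prefix dropped).
LINE = (
    "SpecDecoding metrics: Mean acceptance length: 1.60, "
    "Accepted throughput: 0.30 tokens/s, Drafted throughput: 1.00 tokens/s, "
    "Accepted: 3 tokens, Drafted: 10 tokens, "
    "Per-position acceptance rate: 0.600, 0.000, "
    "Avg Draft acceptance rate: 30.0%"
)


def parse_spec_metrics(line: str) -> dict:
    """Extract token counts and per-position rates (hypothetical helper)."""
    accepted = int(re.search(r"Accepted: (\d+) tokens", line).group(1))
    drafted = int(re.search(r"Drafted: (\d+) tokens", line).group(1))
    per_pos = re.search(
        r"Per-position acceptance rate: ([\d., ]+?), Avg", line
    ).group(1)
    rates = [float(x) for x in per_pos.split(", ")]
    return {"accepted": accepted, "drafted": drafted, "per_position": rates}


m = parse_spec_metrics(LINE)
num_positions = len(m["per_position"])        # draft tokens proposed per step
# Assumption: every draft proposed a token at every position.
num_drafts = m["drafted"] // num_positions
acceptance_rate = m["accepted"] / m["drafted"]
# Assumption: mean acceptance length = 1 + accepted tokens per draft.
mean_accept_len = 1 + m["accepted"] / num_drafts

print(num_positions, num_drafts, acceptance_rate, round(mean_accept_len, 2))
```

Under these assumptions the numbers are self-consistent: two draft positions, five drafts, 3 of 10 drafted tokens accepted (30%), and a first-position rate of 3/5 = 0.600.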

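To work around the issue or narrow it down, try the following steps:

Update the software stack: Use mutually compatible, up-to-date versions of vLLM, PyTorch, CUDA, and the NVIDIA driver. If a vLLM build containing PR #37412 is available, try it first.
Check GPU Memory: Confirm that each GPU has headroom for the model and the input data. An illegal memory access is distinct from running out of memory, but ruling out memory pressure simplifies debugging.
Reduce Model Size: If possible, try a smaller model to check whether the crash is specific to Qwen3.5-122B-A10B-FP8.
Batch Size: Reduce the batch size to see whether the crash depends on load.
Disable CUDAGraphs: Set cudagraph_mode to NONE in the compilation config. This costs throughput but removes CUDA graph replay as a variable.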

Here's an example of how you can disable CUDAGraphs:

compilation_config = {
    # ... other config options ...
    'cudagraph_mode': 'NONE',
}

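Enable CUDA_LAUNCH_BLOCKING: Set the environment variable CUDA_LAUNCH_BLOCKING=1 before starting the server. This is a debugging aid rather than a fix: it makes kernel launches synchronous, so the traceback points at the kernel that actually failed, at a significant cost in speed.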
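The last two steps can be combined into one debugging launch. The following is a minimal sketch, not the reporter's actual command: the function name `build_debug_launch` is hypothetical, the model argument is a placeholder for whatever path or ID the deployment uses, and the tensor-parallel size of 4 is taken from the "nranks 4" lines in the log. Any other flags from the original launch (including the speculative decoding settings) would need to be appended.

```python
import json
import os


def build_debug_launch(model: str, tensor_parallel_size: int):
    """Return (argv, env) for a debugging run of `vllm serve` (hypothetical helper)."""
    # Same setting as the dict shown above, serialized for the command line.
    compilation_config = {"cudagraph_mode": "NONE"}
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--compilation-config", json.dumps(compilation_config),
    ]
    # Copy the current environment and force synchronous kernel launches.
    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return argv, env


# Placeholder model name; substitute the real path or ID.
argv, env = build_debug_launch("Qwen3.5-122B-A10B-FP8", 4)
print(" ".join(argv))
# To actually start the server:
#   import subprocess; subprocess.run(argv, env=env, check=True)
```

Setting the variable in the child's environment, rather than inside an already running Python process, ensures it is in place before CUDA is initialized.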

Verification

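To verify the fix, rerun the same workload (same model, tensor-parallel size, and speculative decoding settings) and check the worker logs. The issue is resolved if requests keep returning 200 OK and the log no longer contains "illegal memory access", "Process group watchdog thread terminated", or "c10::DistBackendError".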
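Checking a long server log by eye is error-prone, so the verification can be scripted. This is a small sketch assuming the log has been captured as text; the function name `scan_log` is hypothetical, and the signatures and the `[rankN]:[E...` pattern are taken from the log in this issue.

```python
import re

# Strings that appear in the crash log above.
ERROR_SIGNATURES = (
    "illegal memory access",
    "Process group watchdog thread terminated",
    "c10::DistBackendError",
)
# Matches the per-rank error prefix, e.g. "[rank0]:[E318 06:18:55...".
RANK_RE = re.compile(r"\[rank(\d+)\]:\[E")


def scan_log(text: str) -> dict:
    """Count known error signatures and list the ranks that logged errors."""
    hits = {sig: text.count(sig) for sig in ERROR_SIGNATURES}
    ranks = sorted({int(r) for r in RANK_RE.findall(text)})
    return {"hits": hits, "failed_ranks": ranks, "clean": not any(hits.values())}


sample = (
    "[rank0]:[E318 06:18:55.090076229 ProcessGroupNCCL.cpp:2093] "
    "[PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated "
    "with exception: CUDA error: an illegal memory access was encountered\n"
    '(APIServer pid=1055786) INFO: 192.168.102.242:53054 - '
    '"POST /v1/chat/completions HTTP/1.1" 200 OK\n'
)
report = scan_log(sample)
print(report)
```

A run counts as clean only when `report["clean"]` is true for the full log of a workload that previously triggered the crash.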

Extra Tips

Check the vLLM documentation for version requirements or recommended settings for your model and GPU.
The log shows the model already running with tensor parallelism across 4 GPUs (Worker_TP0, nranks 4). If memory is tight, a larger tensor-parallel size spreads the weights further; data parallelism replicates the model and does not reduce per-GPU memory.
If the issue persists, run the application under NVIDIA's compute-sanitizer (the successor to the deprecated cuda-memcheck) to locate the invalid access. nvprof is a profiler and does not detect memory errors.
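For the memory-checking tip, NVIDIA's compute-sanitizer wraps the application command. The sketch below only builds that command line; the helper name `sanitizer_command` and the `<model>` placeholder are hypothetical, and the wrapped `vllm serve` arguments are an assumption about how the server is launched. Running a multi-GPU server under the sanitizer is very slow, so a reduced reproduction is usually more practical.

```python
import shlex


def sanitizer_command(app_argv, tool="memcheck"):
    """Prefix an application command with compute-sanitizer (hypothetical helper)."""
    return [
        "compute-sanitizer",
        "--tool", tool,               # memcheck reports out-of-bounds accesses
        "--target-processes", "all",  # also follow worker subprocesses
        *app_argv,
    ]


# <model> is a placeholder; use the same arguments as the failing launch.
cmd = sanitizer_command(["vllm", "serve", "<model>", "--tensor-parallel-size", "4"])
print(shlex.join(cmd))
```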


