vllm - 💡(How to fix) Fix [Bug]: Qwen 3.5 stops working after upgrade to v0.18.0 [13 comments, 7 participants]

vllm2026-03-21 14:44:20

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#37749•Fetched 2026-04-08 01:13:06

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Timeline (top)

commented ×13subscribed ×12mentioned ×4closed ×1

Error Message

After upgrading from 0.17.1 to v0.18.0, all my qwen 3.5 model-container didnt start anymore. They just restarting without an error. When i use 0.17.1 everything works again. qwen3-embedding and qwen3-reranker works on version 18. it looks like an qwen 3.5 problem. When I use the latest nightly build, I see the same behavior, but at least I get the following error (APIServer pid=1) ERROR 03-21 14:18:13 [core_client.py:654] Engine core proc EngineCore died unexpectedly, shutting down client. (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] AsyncLLM output_handler failed. (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] Traceback (most recent call last): (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 663, in output_handler (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] outputs = await engine_core.get_output_async() (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 972, in get_output_async (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] raise self._format_exception(outputs) from None (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause. warnings.warn('resource_tracker: There appear to be %d `

Root Cause

(APIServer pid=1) INFO 03-21 14:13:50 [utils.py:297] (APIServer pid=1) INFO 03-21 14:13:50 [utils.py:297] █ █ █▄ ▄█ (APIServer pid=1) INFO 03-21 14:13:50 [utils.py:297] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.17.2rc1.dev201+g0d50fa1db (APIServer pid=1) INFO 03-21 14:13:50 [utils.py:297] █▄█▀ █ █ █ █ model /root/.cache/huggingface/cyankiwi_Qwen3.5-27B-AWQ-BF16-INT4 (APIServer pid=1) INFO 03-21 14:13:50 [utils.py:297] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀ (APIServer pid=1) INFO 03-21 14:13:50 [utils.py:297] (APIServer pid=1) INFO 03-21 14:13:50 [utils.py:233] non-default args: {'model_tag': '/root/.cache/huggingface/cyankiwi_Qwen3.5-27B-AWQ-BF16-INT4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': '/root/.cache/huggingface/cyankiwi_Qwen3.5-27B-AWQ-BF16-INT4', 'max_model_len': 65536, 'served_model_name': ['Qwen3.5-27B'], 'reasoning_parser': 'qwen3', 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 12288, 'max_num_seqs': 8, 'enable_chunked_prefill': True} (APIServer pid=1) INFO 03-21 14:13:57 [model.py:533] Resolved architecture: Qwen3_5ForConditionalGeneration (APIServer pid=1) INFO 03-21 14:13:57 [model.py:1582] Using max model len 65536 (APIServer pid=1) INFO 03-21 14:13:57 [cache.py:225] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor. (APIServer pid=1) INFO 03-21 14:13:57 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=12288. (APIServer pid=1) WARNING 03-21 14:13:57 [config.py:388] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled (APIServer pid=1) INFO 03-21 14:13:57 [config.py:408] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe. (APIServer pid=1) INFO 03-21 14:13:57 [config.py:228] Setting attention block size to 1568 tokens to ensure that attention page size is >= mamba page size. (APIServer pid=1) INFO 03-21 14:13:57 [config.py:259] Padding mamba page size by 0.13% to ensure that mamba page size and attention page size are exactly equal. (APIServer pid=1) INFO 03-21 14:13:57 [vllm.py:750] Asynchronous scheduling is enabled. (APIServer pid=1) INFO 03-21 14:13:58 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode. (EngineCore pid=82) INFO 03-21 14:14:04 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev201+g0d50fa1db) with config: model='/root/.cache/huggingface/cyankiwi_Qwen3.5-27B-AWQ-BF16-INT4', speculative_config=None, tokenizer='/root/.cache/huggingface/cyankiwi_Qwen3.5-27B-AWQ-BF16-INT4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-27B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [12288], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []} (EngineCore pid=82) INFO 03-21 14:14:05 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode. (EngineCore pid=82) INFO 03-21 14:14:05 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.0.22:33361 backend=nccl (EngineCore pid=82) INFO 03-21 14:14:05 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A (EngineCore pid=82) INFO 03-21 14:14:06 [gpu_model_runner.py:4493] Starting to load model /root/.cache/huggingface/cyankiwi_Qwen3.5-27B-AWQ-BF16-INT4... (EngineCore pid=82) INFO 03-21 14:14:06 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention (EngineCore pid=82) INFO 03-21 14:14:06 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. (EngineCore pid=82) INFO 03-21 14:14:06 [qwen3_next.py:202] Using Triton/FLA GDN prefill kernel (EngineCore pid=82) INFO 03-21 14:14:06 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16 (EngineCore pid=82) INFO 03-21 14:14:07 [cuda.py:334] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN']. (EngineCore pid=82) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. (EngineCore pid=82) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. Loading safetensors checkpoint shards: 0% Completed | 0/6 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 17% Completed | 1/6 [00:03<00:17, 3.58s/it] Loading safetensors checkpoint shards: 33% Completed | 2/6 [00:07<00:16, 4.02s/it] Loading safetensors checkpoint shards: 50% Completed | 3/6 [00:12<00:12, 4.13s/it] Loading safetensors checkpoint shards: 67% Completed | 4/6 [00:16<00:08, 4.13s/it] Loading safetensors checkpoint shards: 83% Completed | 5/6 [00:19<00:03, 3.81s/it] Loading safetensors checkpoint shards: 100% Completed | 6/6 [00:20<00:00, 2.83s/it] Loading safetensors checkpoint shards: 100% Completed | 6/6 [00:20<00:00, 3.41s/it] (EngineCore pid=82) (EngineCore pid=82) INFO 03-21 14:14:28 [default_loader.py:384] Loading weights took 20.49 seconds (EngineCore pid=82) INFO 03-21 14:14:29 [gpu_model_runner.py:4578] Model loading took 25.16 GiB memory and 22.098130 seconds (EngineCore pid=82) INFO 03-21 14:14:40 [backends.py:1046] Using cache directory: /root/.cache/vllm/torch_compile_cache/c3d3e00439/rank_0_0/backbone for vLLM's torch.compile (EngineCore pid=82) INFO 03-21 14:14:40 [backends.py:1106] Dynamo bytecode transform time: 11.48 s (EngineCore pid=82) INFO 03-21 14:14:43 [backends.py:371] Cache the graph of compile range (1, 12288) for later use (EngineCore pid=82) INFO 03-21 14:16:07 [backends.py:389] Compiling a graph for compile range (1, 12288) takes 86.35 s (EngineCore pid=82) INFO 03-21 14:16:10 [decorators.py:638] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/88ccd35a425a239116d4253886a6568d54ad28e2d4b5d9be1fe9d23faf68e779/rank_0_0/model (EngineCore pid=82) INFO 03-21 14:16:10 [monitor.py:48] torch.compile took 101.72 s in total (EngineCore pid=82) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (48). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...]. (EngineCore pid=82) return fn(*contiguous_args, **contiguous_kwargs) (EngineCore pid=82) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (32) < num_heads (48). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...]. (EngineCore pid=82) return fn(*contiguous_args, **contiguous_kwargs) (EngineCore pid=82) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (48). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...]. (EngineCore pid=82) return fn(*contiguous_args, **contiguous_kwargs) (EngineCore pid=82) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (32) < num_heads (48). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...]. (EngineCore pid=82) return fn(*contiguous_args, **contiguous_kwargs) (EngineCore pid=82) INFO 03-21 14:17:52 [monitor.py:76] Initial profiling/warmup run took 101.50 s (EngineCore pid=82) INFO 03-21 14:17:59 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=16 (EngineCore pid=82) INFO 03-21 14:17:59 [gpu_model_runner.py:5623] Profiling CUDA graph memory: PIECEWISE=5 (largest=16), FULL=4 (largest=8) (EngineCore pid=82) INFO 03-21 14:18:03 [gpu_model_runner.py:5702] Estimated CUDA graph memory: 0.56 GiB total (EngineCore pid=82) INFO 03-21 14:18:03 [gpu_worker.py:436] Available KV cache memory: 7.37 GiB (EngineCore pid=82) INFO 03-21 14:18:03 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9142 to maintain the same effective KV cache size. (EngineCore pid=82) INFO 03-21 14:18:03 [kv_cache_utils.py:1319] GPU KV cache size: 59,584 tokens (EngineCore pid=82) INFO 03-21 14:18:03 [kv_cache_utils.py:1324] Maximum concurrency for 65,536 tokens per request: 3.21x Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 5/5 [00:00<00:00, 9.19it/s] Capturing CUDA graphs (decode, FULL): 100%|██████████| 4/4 [00:00<00:00, 5.36it/s] (EngineCore pid=82) INFO 03-21 14:18:06 [gpu_model_runner.py:5762] Graph capturing finished in 2 secs, took 0.55 GiB (EngineCore pid=82) INFO 03-21 14:18:06 [gpu_worker.py:597] CUDA graph pool memory: 0.55 GiB (actual), 0.56 GiB (estimated), difference: 0.01 GiB (2.5%). (EngineCore pid=82) INFO 03-21 14:18:06 [core.py:281] init engine (profile, create kv cache, warmup model) took 217.31 seconds (EngineCore pid=82) INFO 03-21 14:18:06 [vllm.py:750] Asynchronous scheduling is enabled. (APIServer pid=1) INFO 03-21 14:18:06 [api_server.py:590] Supported tasks: ['generate'] (APIServer pid=1) INFO 03-21 14:18:07 [parser_manager.py:202] "auto" tool choice has been enabled. (APIServer pid=1) WARNING 03-21 14:18:07 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's generation_config.json: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}. If this is not intended, please relaunch vLLM instance with --generation-config vllm. (APIServer pid=1) INFO 03-21 14:18:08 [hf.py:320] Detected the chat template content format to be 'string'. You can set --chat-template-content-formatto override this. (APIServer pid=1) ERROR 03-21 14:18:13 [core_client.py:654] Engine core proc EngineCore died unexpectedly, shutting down client. (APIServer pid=1) INFO 03-21 14:18:14 [base.py:213] Multi-modal warmup completed in 6.266s (APIServer pid=1) INFO 03-21 14:18:14 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8000 (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:37] Available routes are: (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /docs, Methods: HEAD, GET (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /redoc, Methods: HEAD, GET (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /tokenize, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /detokenize, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /load, Methods: GET (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /version, Methods: GET (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /health, Methods: GET (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /metrics, Methods: GET (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /v1/models, Methods: GET (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /ping, Methods: GET (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /ping, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /invocations, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /v1/chat/completions, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /v1/responses, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /v1/completions, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /v1/messages, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /inference/v1/generate, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST (APIServer pid=1) INFO 03-21 14:18:14 [launcher.py:46] Route: /v1/completions/render, Methods: POST (APIServer pid=1) INFO: Started server process [1] (APIServer pid=1) INFO: Waiting for application startup. (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] AsyncLLM output_handler failed. (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] Traceback (most recent call last): (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 663, in output_handler (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] outputs = await engine_core.get_output_async() (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 972, in get_output_async (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] raise self._format_exception(outputs) from None (APIServer pid=1) ERROR 03-21 14:18:15 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause. (APIServer pid=1) INFO: Application startup complete. (APIServer pid=1) INFO: Shutting down (APIServer pid=1) INFO: Waiting for application shutdown. (APIServer pid=1) INFO: Application shutdown complete. (APIServer pid=1) INFO: Finished server process [1] /usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d

RAW_BUFFERClick to expand / collapse

Your current environment

Model: cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 + QuantTrio/Qwen3.5-27B-AWQ Inference Framework: vLLM 0.18.0 GPU Hardware: Multiple A100 40GB (one model / card) Deployment Mode: vLLM as Docker

Parameter: --gpu-memory-utilization 0.90 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder --language-model-only --max-model-len 65536 --max-num-batched-tokens 12288 --kv-cache-dtype fp8 --max-num-seqs 8 --enable-chunked-prefill --enable-prefix-caching

🐛 Describe the bug

When I use the latest nightly build, I see the same behavior, but at least I get the following error

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

The issue seems to be related to the vLLM engine core process dying unexpectedly. To fix this, we can try the following steps:

Increase GPU memory utilization: Try increasing the --gpu-memory-utilization parameter to a higher value, such as 0.95, to allocate more GPU memory to the model.
Disable prefix caching: Try disabling prefix caching by setting --enable-prefix-caching to False, as it may be causing issues with the model loading.
Reduce max model length: Try reducing the --max-model-len parameter to a smaller value, such as 4096, to reduce the memory requirements of the model.
Update vLLM: Make sure you are running the latest version of vLLM, as newer versions may have fixed bugs that could be causing this issue.

Example code to update the vLLM configuration:

import subprocess

# Update the vLLM configuration
subprocess.run([
    "vllm",
    "--gpu-memory-utilization", "0.95",
    "--enable-prefix-caching", "False",
    "--max-model-len", "4096",
    # Other parameters...
])

Alternatively, you can also try updating the vLLM configuration using a configuration file. Create a file named vllm_config.json with the following contents:

{
    "gpu_memory_utilization": 0.95,
    "enable_prefix_caching": false,
    "max_model_len": 4096
}

Then, run vLLM with the updated configuration file:

subprocess.run([
    "vllm",
    "--config", "vllm_config.json",
    # Other parameters...
])

Verification

To verify that the fix worked, you can check the vLLM logs for any error messages. If the model is loading successfully, you should see a message indicating that the model has been loaded and is ready for use.

You can also try running a test query to verify that the model is working correctly:

import requests

# Send a test query to the vLLM API
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"prompt": "Hello, world!", "max_tokens": 100}
)

# Check the response for any error messages
if response.status_code != 200:
    print("Error:", response.text)
else:
    print("Response:", response.json())

Extra Tips

Make sure you have the latest version of vLLM installed, as newer versions may have fixed bugs that could be causing this issue.
If you are still experiencing issues, try reducing the --max-num-batched-tokens parameter to a smaller value,

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #tensor shape #model loading #integration issue #index setup

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: Qwen 3.5 stops working after upgrade to v0.18.0 [13 comments, 7 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Your current environment

🐛 Describe the bug

Before submitting a new issue...

extent analysis

Fix Plan

Verification

Extra Tips

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: Qwen 3.5 stops working after upgrade to v0.18.0 [13 comments, 7 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Your current environment

🐛 Describe the bug

Before submitting a new issue...

extent analysis

Fix Plan

Verification

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING