[Bug]: vLLM docker container with Qwen3.5 - Connection error

GitHub stats
vllm-project/vllm#39319 · Fetched 2026-04-09 07:51:55
Comments: 0 · Participants: 1 · Timeline: 1 · Reactions: 0
Timeline (top): labeled ×1


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.13.9 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 19:16:10) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-5.15.0-140-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.6.85
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3

Nvidia driver version        : 535.183.06
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.5.3
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.17.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.3.4
[pip3] nvidia-ml-py==13.590.44
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] onnxruntime==1.24.4
[pip3] onnxruntime-gpu==1.23.2
[pip3] pyzmq==27.1.0
[pip3] sentence-transformers==5.2.2
[pip3] torch==2.9.0
[pip3] torchaudio==2.9.0
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.6
[pip3] triton==3.5.0
[conda] flashinfer-python                           0.5.3            pypi_0              pypi
[conda] numpy                                       2.2.6            pypi_0              pypi
[conda] nvidia-cublas-cu12                          12.8.4.1         pypi_0              pypi
[conda] nvidia-cuda-cupti-cu12                      12.8.90          pypi_0              pypi
[conda] nvidia-cuda-nvrtc-cu12                      12.8.93          pypi_0              pypi
[conda] nvidia-cuda-runtime-cu12                    12.8.90          pypi_0              pypi
[conda] nvidia-cudnn-cu12                           9.10.2.21        pypi_0              pypi
[conda] nvidia-cudnn-frontend                       1.17.0           pypi_0              pypi
[conda] nvidia-cufft-cu12                           11.3.3.83        pypi_0              pypi
[conda] nvidia-cufile-cu12                          1.13.1.3         pypi_0              pypi
[conda] nvidia-curand-cu12                          10.3.9.90        pypi_0              pypi
[conda] nvidia-cusolver-cu12                        11.7.3.90        pypi_0              pypi
[conda] nvidia-cusparse-cu12                        12.5.8.93        pypi_0              pypi
[conda] nvidia-cusparselt-cu12                      0.7.1            pypi_0              pypi
[conda] nvidia-cutlass-dsl                          4.3.4            pypi_0              pypi
[conda] nvidia-ml-py                                13.590.44        pypi_0              pypi
[conda] nvidia-nccl-cu12                            2.27.5           pypi_0              pypi
[conda] nvidia-nvjitlink-cu12                       12.8.93          pypi_0              pypi
[conda] nvidia-nvshmem-cu12                         3.3.20           pypi_0              pypi
[conda] nvidia-nvtx-cu12                            12.8.90          pypi_0              pypi
[conda] pyzmq                                       27.1.0           pypi_0              pypi
[conda] sentence-transformers                       5.2.2            pypi_0              pypi
[conda] torch                                       2.9.0            pypi_0              pypi
[conda] torchaudio                                  2.9.0            pypi_0              pypi
[conda] torchvision                                 0.24.0           pypi_0              pypi
[conda] transformers                                4.57.6           pypi_0              pypi
[conda] triton                                      3.5.0            pypi_0              pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.13.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
</details>

🐛 Describe the bug

Good morning everyone,

I am developing an agent with LangGraph. I started with Ollama and now want to switch to vLLM (I am new to it), using Qwen3.5-9B and Qwen3-14B. I have already downloaded both models locally and can find them in my Hugging Face hub directory. Following several tutorials, I wrote this YAML file to launch three Docker containers, one per LLM, in the background while the main agent runs. Here is the file:

services:

  vllm_main_agent:
    image: vllm/vllm-openai
    container_name: vllm_main_agent
    runtime: nvidia
    ipc: host
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface  
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    restart: unless-stopped
    command: >
      --model Qwen/Qwen3.5-9B
      --trust-remote-code 
      --tensor-parallel-size 1
      --max-model-len 40960
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --enable-prefix-caching
      --gdn-prefill-backend triton

  vllm_mem0_extractor:
    image: vllm/vllm-openai
    container_name: vllm_mem0_extractor
    runtime: nvidia
    ipc: host
    ports:
      - "8001:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - VLLM_WORKER_MULTIPROC_METHOD=spawn  
    restart: unless-stopped
    command: >
      --model Qwen/Qwen3-14B-AWQ
      --quantization awq
      --tensor-parallel-size 1
      --max-model-len 32768
      --reasoning-parser qwen3
      --enable-prefix-caching
      --enforce-eager  

  vllm_guardian:
    image: vllm/vllm-openai
    container_name: vllm_guardian
    runtime: nvidia
    ipc: host
    ports:
      - "8002:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - NVIDIA_VISIBLE_DEVICES=2
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    restart: unless-stopped
    command: >
      --model Qwen/Qwen3-14B-AWQ
      --quantization awq
      --tensor-parallel-size 1
      --max-model-len 2048
      --reasoning-parser qwen3
      --enable-prefix-caching

Then, if I run "sudo docker compose up -d vllm_guardian" and check the logs, I see:

(EngineCore pid=445) INFO 04-08 16:07:22 [gpu_model_runner.py:4735] Starting to load model Qwen/Qwen3-14B-AWQ...
(EngineCore pid=445) INFO 04-08 16:07:23 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=445) INFO 04-08 16:07:23 [flash_attn.py:596] Using FlashAttention version 3
(EngineCore pid=445) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=445) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.

and it stays there, even for hours, as if it were waiting for something. I call vLLM from the agent with these lines of code:

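# Assumed import; the issue does not show where ChatOpenAI comes from.
from langchain_openai import ChatOpenAI

llm_guardian = ChatOpenAI(
    model="Qwen/Qwen3-14B-AWQ",
    base_url="http://localhost:8002/v1",  # host port 8002 maps to the vllm_guardian container
    api_key="empty",  # placeholder value
    temperature=0.0,
    max_tokens=4096,
    streaming=True,
)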

and I get only an immediate, bare "Connection error", with nothing more from the agent itself. Moreover, if I open http://localhost:8002/v1 in a browser, it says "Impossible to reach the site" (I am using Streamlit, and I can reach http://localhost:8501/, for example). Finally, entering the vllm_guardian container and running top gives this result:

sudo docker exec -it vllm_guardian /bin/bash
root@123456789:/vllm-workspace# top
top - 16:32:54 up 111 days,  6:30,  0 users,  load average: 2.13, 2.94, 2.69
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  0.6 sy,  0.0 ni, 98.5 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem : 773707.1 total, 288383.8 free,  37586.5 used, 447736.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 730545.0 avail Mem

I searched everywhere for a solution but found nothing. That is why I am here, and I really hope to find help, because I cannot tell whether the problem lies with Docker, CUDA, vLLM, or something else. Or maybe I just have to wait for the models to load.

Please, any suggestion would be super appreciated. Thank you in advance,

Matteo


extent analysis

TL;DR

The connection error most likely occurs because model loading is taking excessively long or has stalled: the log stops shortly after "Starting to load model", so the vLLM service apparently never becomes ready to answer requests on port 8002.
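One way to tell "still loading" apart from "ready" is to poll the server before the agent creates its client. The sketch below is a minimal illustration, not part of the original report: it assumes the container exposes vLLM's /health route at the server root (http://localhost:8002, without the /v1 suffix), and the helper names wait_until_ready and http_ok are hypothetical.

```python
import http.client
import time
import urllib.request


def http_ok(url: str) -> bool:
    """Return True only if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Refused, reset, timed-out, or non-2xx responses all mean "not ready".
        return False


def wait_until_ready(base_url, timeout_s=600.0, interval_s=5.0, probe=http_ok):
    """Poll <base_url>/health until it succeeds or `timeout_s` elapses."""
    url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call (assumed server root for the vllm_guardian container):
#   ready = wait_until_ready("http://localhost:8002", timeout_s=900)
```

If the call returns False after a generous timeout, the server never finished starting, and the container logs are the next place to look.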

Guidance

  1. Verify model loading: Follow the vllm_guardian logs (for example with sudo docker logs -f vllm_guardian) and look for errors or warnings after "Starting to load model" that explain why loading does not finish.
  2. Check resource utilization: The top output shows the CPU almost entirely idle (98.5% id), so also monitor the container's memory and the GPU's memory and utilization to see whether the load is progressing or has stalled.
  3. Investigate CUDA versions: The FutureWarning messages only announce that the cuda.cudart and cuda.nvrtc modules are deprecated, so they may be unrelated to the hang; still, confirm that the NVIDIA driver (535.183.06) and CUDA versions meet the requirements of the vllm/vllm-openai image in use.
  4. Test with a smaller model: Try loading a smaller model to see whether the issue persists, which helps determine whether the problem is specific to Qwen3-14B-AWQ or more general.
  5. Check the Docker container configuration: Review the Docker Compose file to confirm that the GPU is visible to the container (NVIDIA_VISIBLE_DEVICES=2 for vllm_guardian), that the Hugging Face cache is mounted at /root/.cache/huggingface, and that host port 8002 maps to container port 8000.
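The first step can be partly automated. The following sketch reads the container log through the docker CLI and reports whether startup finished. It assumes the server prints uvicorn's "Application startup complete" line once it is serving; the function names and the simple "ERROR"/"Traceback" matching are illustrative choices, not vLLM features.

```python
import subprocess

# Line uvicorn prints when the HTTP app is ready (assumed to reach the container log).
STARTUP_MARKER = "Application startup complete"


def read_container_logs(container: str, tail: int = 200) -> str:
    """Return the last `tail` log lines of a container via `docker logs`."""
    result = subprocess.run(
        ["docker", "logs", "--tail", str(tail), container],
        capture_output=True,
        text=True,
        check=True,
    )
    # docker replays the container's stdout and stderr on separate streams.
    return result.stdout + result.stderr


def summarize_startup(log_text: str) -> dict:
    """Report whether startup completed, any error lines, and the last log line."""
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    return {
        "started": any(STARTUP_MARKER in ln for ln in lines),
        "problems": [ln for ln in lines if "ERROR" in ln or "Traceback" in ln],
        "last_line": lines[-1] if lines else "",
    }


# Example call (may need the same privileges as `sudo docker ...`):
#   print(summarize_startup(read_container_logs("vllm_guardian")))
```

A result with started set to False, no problems, and a last line that has not changed for a long time matches the stall described in the report.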

Example

The issue appears to stem from the configuration and environment rather than from a specific code snippet in the report.

Notes

The issue might be related to the specific model or environment configuration. Further investigation is needed to determine the root cause.
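One configuration detail worth checking, although it would not explain a refused connection, is that the client requests max_tokens=4096 while vllm_guardian is started with --max-model-len 2048. The sketch below is a plain arithmetic check under that reading of the two snippets. The function name is hypothetical, and whether the server rejects or clamps such a request may depend on the vLLM version.

```python
def check_token_budget(max_model_len: int, max_tokens: int, prompt_tokens: int = 0) -> list[str]:
    """Return warnings when a request cannot fit in the server's context window."""
    warnings = []
    if max_tokens > max_model_len:
        warnings.append(
            f"max_tokens={max_tokens} exceeds --max-model-len {max_model_len}"
        )
    elif prompt_tokens + max_tokens > max_model_len:
        warnings.append(
            f"prompt ({prompt_tokens}) + max_tokens ({max_tokens}) "
            f"exceeds --max-model-len {max_model_len}"
        )
    return warnings


# Values taken from the report: guardian server vs. ChatOpenAI client.
for message in check_token_budget(max_model_len=2048, max_tokens=4096):
    print(message)
```

Once the server is reachable, lowering max_tokens on the client or raising --max-model-len on the server would remove the mismatch.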

Recommendation

Apply a workaround by testing with a smaller model or adjusting the Docker container configuration to allocate more resources, as the issue might be related to resource constraints or model size.
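To judge whether resources are actually the constraint, GPU memory and utilization can be sampled on the host while the container starts. This sketch shells out to nvidia-smi with its CSV query output; the helper names are hypothetical, and it assumes every queried field comes back as a plain integer.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total, util = (field.strip() for field in line.split(","))
        rows.append(
            {
                "index": int(index),
                "used_mib": int(used),
                "total_mib": int(total),
                "util_pct": int(util),
            }
        )
    return rows


def query_gpus() -> list[dict]:
    """Run nvidia-smi on the host and return one dict per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_rows(out)


# Example: GPU 2 is the device assigned to vllm_guardian in the compose file.
#   gpu2 = next(row for row in query_gpus() if row["index"] == 2)
```

If used_mib on GPU 2 stays near zero while the log is stuck at "Starting to load model", the weights are probably not reaching the GPU at all, which points away from a lack of GPU memory.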
