vllm - 💡(How to fix) Fix [Bug]: http.client.RemoteDisconnected: Remote end closed connection without response [3 comments, 2 participants]

GitHub stats
vllm-project/vllm#37590, fetched 2026-04-08 01:04:33
Comments: 3 · Participants: 2 · Timeline events: 9 · Reactions: 0
Timeline (top): commented ×3, renamed ×2, closed ×1, labeled ×1

Error Message

Traceback (most recent call last):
  File "/opt/AI/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/opt/AI/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/opt/AI/.venv/lib/python3.12/site-packages/urllib3/connection.py", line 571, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/http/client.py", line 1430, in getresponse
    response.begin()
  File "/home/user/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

Your current environment

<details>

Host: CUDA 13.1
GPU: Blackwell RTX Pro 6000 x2
Docker image: vllm/vllm-openai:nightly
Arguments: RedHatAI/Qwen3.5-122B-A10B-NVFP4 --tensor-parallel-size 2 --served-model-name qwen3.5-122b-a10b --reasoning-parser qwen3 --max-model-len 204800

</details>

🐛 Describe the bug

Sometimes, when a prompt contains large tables, the server silently closes the connection to the client. Nothing is logged on the server side.

If the prompt in the script is truncated far enough, vLLM responds correctly.
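
The observation that a truncated prompt succeeds suggests narrowing down where the failure starts. The sketch below is one way to do that with a binary search over prefix lengths. It is not part of the original report: the helper name and the `succeeds` callback are hypothetical, and the search assumes that every prompt longer than a failing one also fails, which the issue does not establish. Because the failure is described as intermittent, `succeeds` may need to try each prefix more than once.

```python
from typing import Callable, Optional


def smallest_failing_length(text: str, succeeds: Callable[[str], bool]) -> Optional[int]:
    """Return the shortest prefix length of `text` for which `succeeds` is False.

    Assumes the empty prefix succeeds and that failures are monotonic in
    length. Returns None if the full text succeeds.
    """
    if succeeds(text):
        return None
    lo, hi = 0, len(text)  # invariant: prefix of length lo succeeds, hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if succeeds(text[:mid]):
            lo = mid
        else:
            hi = mid
    return hi
```

In practice `succeeds` would wrap the request from the reproduction script and return False when it raises a connection error.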

Exception on Client Side:

Traceback (most recent call last):
  File "/opt/AI/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/opt/AI/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/opt/AI/.venv/lib/python3.12/site-packages/urllib3/connection.py", line 571, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/http/client.py", line 1430, in getresponse
    response.begin()
  File "/home/user/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

Script to reproduce:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
client:
    http.client.RemoteDisconnected: Remote end closed connection without response

server vllm:
    no error log    
"""

import requests
import json

VLLM_API_BASE = 'https://....'
VLLM_AUTH_TOKEN = 'sk-xxxx'


def get_content_large_table():
    """text from https://qwen.ai/blog?id=qwen3.5 """

    return """
    We are delighted to announce the official release of Qwen3.5, introducing the open-weight of the first model in the Qwen3.5 series, namely Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. We have also expanded our language and dialect support from 119 to 201, providing broader accessibility and enhanced support to users around the world.

    Qwen3.5-Plus is the hosted model available via Alibaba Cloud Model Studio, featuring:
        a 1M context window by default
        official built-in tools and adaptive tool use

Performance

Below we present the comprehensive evaluation of our models against frontier models in a wide range of evaluation tasks, covering different tasks and modalities.
Language
	GPT5.2	Claude 4.5 Opus	Gemini-3 Pro	Qwen3-Max-Thinking	K2.5-1T-A32B	Qwen3.5-397B-A17B
Knowledge
MMLU-Pro	87.4	89.5	89.8	85.7	87.1	87.8
MMLU-Redux	95.0	95.6	95.9	92.8	94.5	94.9
SuperGPQA	67.9	70.6	74.0	67.3	69.2	70.4
C-Eval	90.5	92.2	93.4	93.7	94.0	93.0
Instruction Following
IFEval	94.8	90.9	93.5	93.4	93.9	92.6
IFBench	75.4	58.0	70.4	70.9	70.2	76.5
MultiChallenge	57.9	54.2	64.2	63.3	62.7	67.6
Long Context
AA-LCR	72.7	74.0	70.7	68.7	70.0	68.7
LongBench v2	54.5	64.4	68.2	60.6	61.0	63.2
STEM
GPQA	92.4	87.0	91.9	87.4	87.6	88.4
HLE	35.5	30.8	37.5	30.2	30.1	28.7
HLE-Verified¹	43.3	38.8	48	37.6	--	37.6
Reasoning
LiveCodeBench v6	87.7	84.8	90.7	85.9	85.0	83.6
HMMT Feb 25	99.4	92.9	97.3	98.0	95.4	94.8
HMMT Nov 25	100	93.3	93.3	94.7	91.1	92.7
IMOAnswerBench	86.3	84.0	83.3	83.9	81.8	80.9
AIME26	96.7	93.3	90.6	93.3	93.3	91.3
General Agent
BFCL-V4	63.1	77.5	72.5	67.7	68.3	72.9
TAU2-Bench	87.1	91.6	85.4	84.6	77.0	86.7
VITA-Bench	38.2	56.3	51.6	40.9	41.9	49.7
DeepPlanning	44.6	33.9	23.3	28.7	14.5	34.3
Tool Decathlon	43.8	43.5	36.4	18.8	27.8	38.3
MCP-Mark	57.5	42.3	53.9	33.5	29.5	46.1
Search Agent
HLE w/ tool	45.5	43.4	45.8	49.8	50.2	48.3
BrowseComp	65.8	67.8	59.2	53.9	--/74.9	69.0/78.6
BrowseComp-zh	76.1	62.4	66.8	60.9	--	70.3
WideSearch	76.8	76.4	68.0	57.9	72.7	74.0
Seal-0	45.0	47.7	45.5	46.9	57.4	46.9
Multilingualism
MMMLU	89.5	90.1	90.6	84.4	86.0	88.5
MMLU-ProX	83.7	85.7	87.7	78.5	82.3	84.7
NOVA-63	54.6	56.7	56.7	54.2	56.0	59.1
INCLUDE	87.5	86.2	90.5	82.3	83.3	85.6
Global PIQA	90.9	91.6	93.2	86.0	89.3	89.8
PolyMATH	62.5	79.0	81.6	64.7	43.1	73.3
WMT24++	78.8	79.7	80.7	77.6	77.6	78.9
MAXIFE	88.4	79.2	87.5	84.0	72.8	88.2
Coding Agent
SWE-bench Verified	80.0	80.9	76.2	75.3	76.8	76.4
SWE-bench Multilingual	72.0	77.5	65.0	66.7	73.0	69.3
SecCodeBench	68.7	68.6	62.4	57.5	61.3	68.3
Terminal Bench 2	54.0	59.3	54.2	22.5	50.8	52.5

* HLE-Verified: a verified and revised version of Humanity’s Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.
* TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.
* MCP-Mark: GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.
* Search Agent: most search agents built on our model adopt a simple context-folding strategy(256k): once the cumulative Tool Response length reaches a preset threshold, earlier Tool Responses are pruned from the history to keep the context within limits.
* BrowseComp: we tested two strategies, simple context-folding achieved a score of 69.0, while using the same discard-all strategy as DeepSeek-V3.2 and Kimi K2.5 achieved 78.6.
* WideSearch: we use a 256k context window without any context management.
* MMLU-ProX: we report the averaged accuracy on 29 languages.
* WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.
* MAXIFE: we report the accuracy on English + multilingual original prompts (totally 23 settings).
* Empty cells (--) indicate scores not yet available or not applicable.
"""


request_json = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Translate following content to Chinese\n\n" + get_content_large_table()
        }
    ],
    "model": "qwen3.5-122b-a10b",
    "max_tokens": 100000,
    "temperature": 0.7,
    "top_p": 0.8,
    "seed": 10000,
    "stream": False,
    # "stream_options": {
    #     "include_usage": True
    # },
    #"tools": tools,
    "parallel_tool_calls": False,
    "chat_template_kwargs": {
        "enable_thinking": False
    }
}

resp = requests.post(f"{VLLM_API_BASE}/v1/chat/completions",
                     headers={
                         "Authorization": f"Bearer {VLLM_AUTH_TOKEN}",
                         'Content-Type': 'application/json',
                         'Accept': 'application/json',
                     },
                     data=json.dumps(request_json, ensure_ascii=False))

from pprint import pprint
pprint(resp.json(), compact=True, width=300)


Extended analysis

Fix Plan

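The issue does not identify a root cause: the server closes the connection without a response and logs nothing when the prompt contains large tables, while a truncated prompt succeeds. The steps below are workarounds to try, not a confirmed fix:

  • Set an explicit client timeout: requests has no default timeout, so a client-side timeout is not what raises RemoteDisconnected. An explicit value only bounds how long the client waits on a non-streaming request with max_tokens set to 100000.
  • Chunk large requests: Split the large user message across several smaller requests, since the report says shorter prompts succeed.
  • Review server configuration: Check the vLLM launch arguments (for example --max-model-len, set to 204800 in this report) and anything between the client and the server for limits that a large or long-running request could hit.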
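
One further option, not listed above and not confirmed by the issue, is to switch the request to "stream": True so that the server sends data as tokens are generated instead of one response at the end. The sketch below parses the OpenAI-style server-sent event lines that such a request returns. The function name is hypothetical, and whether streaming avoids the disconnect in this setup is an assumption to test.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming chat completion lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        # A usage-only chunk has an empty "choices" list and yields nothing.
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

With requests, the input would typically be resp.iter_lines() on a response obtained with stream=True, for example "".join(iter_stream_text(resp.iter_lines())).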

Code Changes

The example below sets an explicit timeout and splits the last (user) message into several requests, each of which keeps the original model, sampling parameters and earlier messages:

import json
from pprint import pprint

import requests

# ... (VLLM_API_BASE, VLLM_AUTH_TOKEN and request_json as in the reproduction script)

# requests has no default timeout; this only bounds how long the client waits.
timeout = 300  # seconds (5 minutes)


def chunk_request(request_json, chunk_size=10000):
    """Split the last message's content on line boundaries into several requests.

    chunk_size is in characters, not tokens. A single line longer than
    chunk_size is sent as one oversized chunk.
    """
    *head, last = request_json["messages"]
    chunks, current = [], ""
    for line in last["content"].splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the limit.
        if current and len(current) + len(line) > chunk_size:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    # Keep model, sampling parameters and earlier messages in every request.
    # Note: an instruction at the start of the last message (here "Translate
    # following content to Chinese") ends up in the first chunk only.
    return [
        {**request_json, "messages": [*head, {"role": last["role"], "content": text}]}
        for text in chunks
    ]


for chunk in chunk_request(request_json):
    resp = requests.post(
        f"{VLLM_API_BASE}/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {VLLM_AUTH_TOKEN}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        # Encode explicitly so the body is sent as UTF-8 bytes.
        data=json.dumps(chunk, ensure_ascii=False).encode("utf-8"),
        timeout=timeout,
    )
    resp.raise_for_status()  # surface HTTP errors before parsing the body
    pprint(resp.json(), compact=True, width=300)

Verification

To verify, rerun the reproduction script with the full, untruncated prompt several times, since the failure is intermittent, and confirm that every request returns a JSON response instead of raising RemoteDisconnected.
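
Since the report describes the failure as happening "sometimes", a single successful run proves little. A minimal sketch of a repeat-and-count check follows. The helper name and the `send` callback are hypothetical: `send` stands for whatever function issues the request and raises on failure.

```python
from collections import Counter
from typing import Callable


def tally_outcomes(send: Callable[[], object], runs: int = 10) -> Counter:
    """Call `send` `runs` times and count successes and exception types."""
    outcomes: Counter = Counter()
    for _ in range(runs):
        try:
            send()
        except Exception as exc:
            # Record the failure by exception class name and keep going.
            outcomes[type(exc).__name__] += 1
        else:
            outcomes["ok"] += 1
    return outcomes
```

Comparing the tally before and after a change gives a rough signal of whether the change helped. The number of runs needed depends on how often the failure occurs, which the issue does not quantify.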

Extra Tips

  • Tune chunk_size to the largest prompt that reliably succeeds against your server. The value is measured in characters, not tokens.
  • Add a retry with backoff for requests that still fail with RemoteDisconnected. Each retry reruns the whole generation, so keep the attempt count low.
  • Watch the server's logs and resource usage while reproducing. The report notes that vLLM logged nothing, so also check the logs of anything that sits in front of it.
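
The retry tip can be sketched as a small wrapper with exponential backoff. The helper and its parameters are hypothetical, not something the issue prescribes. The default retry_on is Python's built-in ConnectionError, which http.client.RemoteDisconnected inherits from. When calling through requests, the error surfaces as requests.exceptions.ConnectionError instead, so that class would need to be passed in retry_on.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    send: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `send`, retrying on `retry_on` with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The sleep parameter exists so the delay can be replaced in tests. Errors outside retry_on propagate immediately.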
