vllm - [Bug]: Qwen3-VL-2B-Instruct Geo3K accuracy score lower than SGLang with deterministic sampling [1 pull request]

vllm, 2026-05-25 13:46:56

I observe a large accuracy gap between vLLM and SGLang when serving Qwen3-VL-2B-Instruct on a VLM benchmark geo3k, even when using deterministic decoding and the official vLLM OpenAI-compatible /v1/chat/completions endpoint.

The gap reproduces with:

- same model: Qwen3-VL-2B-Instruct
- same Geo3K test set: 601 samples
- same sampling:
  - temperature=0
  - top_p=1.0
  - top_k=-1
  - max_tokens=4096
  - seed=42
- one worker per GPU
- router policy: round_robin
- 8 GPUs

Results:

API	Correct / total	Score
vLLM official /v1/chat/completions	116 / 601	0.1930116472545757
vLLM /v1/chat/completions/render + /inference/v1/generate	109 / 601	0.18136439267886856
SGLang /generate	173 / 601	0.2878535773710483
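
As a quick sanity check on the table, the scores are simply correct / total. The sketch below recomputes them from the reported counts and prints the gap between SGLang and the official vLLM chat path. The dictionary keys reuse the backend labels from the result JSON files in this issue; the helper name `accuracy` is only illustrative.

```python
# Counts copied from the results table; 601 is the Geo3K test-set size.
TOTAL = 601
correct = {
    "vllm_chat": 116,
    "vllm_render_generate": 109,
    "sglang": 173,
}


def accuracy(n_correct: int, total: int = TOTAL) -> float:
    """Fraction of samples graded correct."""
    return n_correct / total


scores = {name: accuracy(n) for name, n in correct.items()}

# Gap between SGLang and the official vLLM chat endpoint.
gap_samples = correct["sglang"] - correct["vllm_chat"]
gap_points = 100 * (scores["sglang"] - scores["vllm_chat"])

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print(f"SGLang leads vLLM chat by {gap_samples} samples ({gap_points:.1f} points)")
```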

Root Cause

The issue itself does not identify a root cause. The title of the fixing PR (listed under Fix Action) points to how Qwen3-VL deepstack inputs are handled under torch.compile.

Fix Action

Fixed

Fixed by PR: Fix Qwen3-VL deepstack inputs under torch.compile (https://github.com/vllm-project/vllm/pull/43617)


Your current environment

The output of python collect_env.py:

OS: Ubuntu 22.04.5 LTS
GPU: 8 x NVIDIA H20-3e
NVIDIA driver: 580.105.08
CUDA runtime version: 12.9.86
PyTorch: 2.11.0+cu129
CUDA used to build PyTorch: 12.9
Python: 3.12.13
vLLM Version: 0.21.0
flashinfer-python: 0.6.8.post1
transformers: 5.8.1
triton: 3.6.0
flash-attn: 2.7.4.post1
transformer_engine_torch: 2.10.0

🐛 Describe the bug

Environment

OS: Linux
GPU: NVIDIA H20-3e, 143771 MiB
NVIDIA driver: 580.105.08
CUDA toolkit: 12.9, nvcc V12.9.86
Python: 3.12.13
vLLM: 0.21.0+cu129
SGLang: 0.5.10.post1
Torch: 2.11.0+cu129
torch.version.cuda: 12.9
Transformers: 5.8.1
Triton: 3.6.0
cuda-python: 12.9.0
flash-attn: 2.7.4.post1
Pillow: 12.2.0
pandas: 3.0.3
aiohttp: 3.13.5

vLLM official chat reproduction

Start 8 single-GPU vLLM workers and a round-robin vLLM router:

#!/usr/bin/env bash
set -euo pipefail

MODEL_PATH=/root/models/Qwen3-VL-2B-Instruct
HOST=127.0.0.1
GPU_IDS=(0 1 2 3 4 5 6 7)
WORKER_PORT_BASE=18100
ROUTER_PORT=18080
SEED=42

export CUDA_HOME=/usr/local/cuda-12.9
export CUDA_PATH="${CUDA_HOME}"
export PATH="/path/to/venv/bin:${CUDA_HOME}/bin:${PATH}"
export NO_PROXY="127.0.0.1,localhost,${HOST}"
export no_proxy="${NO_PROXY}"

WORKER_URLS=()
for i in "${!GPU_IDS[@]}"; do
  gpu="${GPU_IDS[$i]}"
  port=$((WORKER_PORT_BASE + i))
  WORKER_URLS+=("http://${HOST}:${port}")

  CUDA_VISIBLE_DEVICES="${gpu}" vllm serve "${MODEL_PATH}" \
    --host "${HOST}" \
    --port "${port}" \
    --trust-remote-code \
    --seed "${SEED}" \
    --gpu-memory-utilization 0.9 \
    --max-model-len 262144 \
    --generation-config vllm \
    > "vllm_worker_gpu${gpu}_port${port}.log" 2>&1 &
done

# Wait until all worker /health endpoints are ready here.

python -m vllm_router.launch_router \
  --host "${HOST}" \
  --port "${ROUTER_PORT}" \
  --worker-urls "${WORKER_URLS[@]}" \
  --policy round_robin \
  --request-timeout-secs 14400 \
  --log-level info \
  > vllm_router.log 2>&1 &
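
The launch script leaves the "wait until all worker /health endpoints are ready" step as a comment. One way to fill it in is sketched below in Python, assuming each worker answers GET /health with HTTP 200 once it is ready. The function names, the 30-minute deadline, and the 5-second poll interval are illustrative choices, not part of the original script.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_health_ok(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_workers(
    urls: Iterable[str],
    probe: Callable[[str], bool] = http_health_ok,
    deadline_s: float = 1800.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until every URL passes the probe, or raise TimeoutError."""
    pending = list(urls)
    start = time.monotonic()
    while pending:
        # Keep only the workers that are still failing the probe.
        pending = [u for u in pending if not probe(u)]
        if not pending:
            return
        if time.monotonic() - start > deadline_s:
            raise TimeoutError(f"workers not healthy: {pending}")
        sleep(poll_s)


if __name__ == "__main__":
    # Ports follow WORKER_PORT_BASE=18100 with 8 workers, as in the script.
    wait_for_workers([f"http://127.0.0.1:{18100 + i}" for i in range(8)])
```

The probe and sleep functions are injectable so the loop can be exercised without real servers.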

Then send official OpenAI-compatible chat completion requests:

import base64
import io
import json
import asyncio
from pathlib import Path

import aiohttp
import pandas as pd
from PIL import Image

# I use the same math verifier as my RL eval pipeline.
from slime.rollout.rm_hub.math_utils import grade_answer_verl


MODEL = "/root/models/Qwen3-VL-2B-Instruct"
DATASET = "/root/datasets/geo3k_imgurl/test.parquet"
BASE_URL = "http://127.0.0.1:18080"
OUTPUT = "vllm_chat_geo3k_eval_full_8gpu_router.json"

MAX_TOKENS = 4096
TEMPERATURE = 0.0
TOP_P = 1.0
TOP_K = -1
SEED = 42
CONCURRENCY = 64


def decode_data_url(data_url: str) -> Image.Image:
    if data_url.startswith("data:"):
        _, encoded = data_url.split(",", 1)
    else:
        encoded = data_url
    return Image.open(io.BytesIO(base64.b64decode(encoded))).convert("RGB")


def image_to_png_data_url(image: Image.Image) -> str:
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode("utf-8")


def as_list(value):
    if value is None:
        return []
    if hasattr(value, "tolist"):
        value = value.tolist()
    if isinstance(value, list):
        return value
    return [value]


def build_openai_messages(problem: str, image_urls: list[str]):
    parts = []
    chunks = problem.split("<image>")
    for i, chunk in enumerate(chunks):
        if chunk:
            parts.append({"type": "text", "text": chunk})
        if i < len(chunks) - 1:
            parts.append({
                "type": "image_url",
                "image_url": {"url": image_urls[i]},
            })
    return [{"role": "user", "content": parts}]


async def post_json(session, url, payload):
    async with session.post(url, json=payload, timeout=aiohttp.ClientTimeout(total=14400)) as resp:
        text = await resp.text()
        if resp.status >= 400:
            raise RuntimeError(f"{resp.status}: {text[:1000]}")
        return json.loads(text)


async def main():
    df = pd.read_parquet(DATASET)

    samples = []
    for i, row in df.iterrows():
        pil_images = [decode_data_url(x) for x in as_list(row["images"])]
        image_urls = [image_to_png_data_url(img) for img in pil_images]
        messages = build_openai_messages(str(row["problem"]), image_urls)
        samples.append({
            "index": int(i),
            "messages": messages,
            "label": str(row["answer"]),
        })

    sem = asyncio.Semaphore(CONCURRENCY)
    rows = []

    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=CONCURRENCY)) as session:
        async def one(sample):
            async with sem:
                payload = {
                    "model": MODEL,
                    "messages": sample["messages"],
                    "max_tokens": MAX_TOKENS,
                    "temperature": TEMPERATURE,
                    "top_p": TOP_P,
                    "top_k": TOP_K,
                    "seed": SEED,
                    "skip_special_tokens": False,
                    "spaces_between_special_tokens": False,
                }
                data = await post_json(session, f"{BASE_URL}/v1/chat/completions", payload)
                text = data["choices"][0]["message"]["content"] or ""
                reward = 1 if grade_answer_verl(text, sample["label"]) else 0
                return {
                    "index": sample["index"],
                    "reward": reward,
                    "label": sample["label"],
                    "response_preview": text[:1000],
                }

        tasks = [asyncio.create_task(one(s)) for s in samples]
        for t in asyncio.as_completed(tasks):
            rows.append(await t)

    rows.sort(key=lambda x: x["index"])
    score = sum(r["reward"] for r in rows) / len(rows)

    result = {
        "backend": "vllm_chat",
        "num_samples": len(rows),
        "score": score,
        "correct": sum(r["reward"] for r in rows),
        "sampling": {
            "temperature": TEMPERATURE,
            "top_p": TOP_P,
            "top_k": TOP_K,
            "max_tokens": MAX_TOKENS,
            "seed": SEED,
        },
        "rows": rows,
    }
    Path(OUTPUT).write_text(json.dumps(result, ensure_ascii=False, indent=2))
    print(json.dumps({k: result[k] for k in ["backend", "num_samples", "correct", "score", "sampling"]}, indent=2))


asyncio.run(main())
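
One fragile spot in the script above is that build_openai_messages indexes image_urls by placeholder position: a row with more `<image>` placeholders than images raises IndexError, and a row with fewer placeholders than images silently drops the extras. The variant below is a hedged sketch that fails loudly on a mismatch. The function name and the strict-equality rule are assumptions for illustration; the issue does not say whether Geo3K rows can legitimately differ in count.

```python
def build_openai_messages_checked(problem: str, image_urls: list[str]) -> list[dict]:
    """Same message layout as build_openai_messages, with a count check."""
    chunks = problem.split("<image>")
    n_placeholders = len(chunks) - 1
    if n_placeholders != len(image_urls):
        # Assumption: every image must be referenced by exactly one placeholder.
        raise ValueError(
            f"{n_placeholders} <image> placeholders but {len(image_urls)} images"
        )

    parts = []
    for i, chunk in enumerate(chunks):
        if chunk:
            parts.append({"type": "text", "text": chunk})
        if i < n_placeholders:
            parts.append({"type": "image_url", "image_url": {"url": image_urls[i]}})
    return [{"role": "user", "content": parts}]
```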

Observed vLLM official chat result:

{
  "backend": "vllm_chat",
  "num_samples": 601,
  "correct": 116,
  "score": 0.1930116472545757,
  "sampling": {
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,
    "max_tokens": 4096,
    "seed": 42
  }
}

SGLang comparison

SGLang workers were launched with the same model and deterministic sampling:

CUDA_VISIBLE_DEVICES="${gpu}" python -m sglang.launch_server \
  --model-path /root/models/Qwen3-VL-2B-Instruct \
  --host 127.0.0.1 \
  --port "${port}" \
  --trust-remote-code \
  --random-seed 42 \
  --mem-fraction-static 0.6 \
  --context-length 262144 \
  --sampling-defaults openai \
  --cuda-graph-bs 1 2 4 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152 160 168 176 184 192 200 208 216 224 232 240 248 256

SGLang requests used:

{
  "text": "<chat-templated prompt>",
  "image_data": ["data:image/png;base64,..."],
  "sampling_params": {
    "max_new_tokens": 4096,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,
    "sampling_seed": 42,
    "skip_special_tokens": false,
    "spaces_between_special_tokens": false
  },
  "return_logprob": false
}

Observed SGLang result:

{
  "backend": "sglang",
  "num_samples": 601,
  "correct": 173,
  "score": 0.2878535773710483
}
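
Aggregate scores alone do not show which samples diverge. A per-sample comparison could narrow that down, as sketched below. It assumes the SGLang run was saved with the same "rows" schema ("index", "reward") that the vLLM script writes; the issue only shows the SGLang summary, so that schema and any SGLang file name are assumptions.

```python
import json
from pathlib import Path


def load_result(path: str) -> dict:
    """Load a result file written in the format of the vLLM eval script."""
    return json.loads(Path(path).read_text())


def rewards_by_index(result: dict) -> dict[int, int]:
    """Map sample index to its 0/1 reward."""
    return {int(r["index"]): int(r["reward"]) for r in result["rows"]}


def compare_runs(a: dict, b: dict) -> dict[str, list[int]]:
    """Bucket the sample indices shared by both runs according to who got them right."""
    ra, rb = rewards_by_index(a), rewards_by_index(b)
    shared = sorted(ra.keys() & rb.keys())
    return {
        "only_a_correct": [i for i in shared if ra[i] == 1 and rb[i] == 0],
        "only_b_correct": [i for i in shared if ra[i] == 0 and rb[i] == 1],
        "both_correct": [i for i in shared if ra[i] == 1 and rb[i] == 1],
        "both_wrong": [i for i in shared if ra[i] == 0 and rb[i] == 0],
    }
```

For example, `compare_runs(load_result("vllm_chat_geo3k_eval_full_8gpu_router.json"), load_result(sglang_path))` would list the indices only SGLang answers correctly under "only_b_correct", where `sglang_path` is a placeholder for wherever that run was saved. Those rows are natural candidates for inspecting the stored response previews.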

Additional vLLM path tested

I also tested vLLM's disaggregated multimodal flow:

/v1/chat/completions/render
/inference/v1/generate

This path uses locally computed HF tokenizer/processor prompt ids and aligns the vLLM render output to the local prompt ids.

Result:

{
  "backend": "vllm_render_generate",
  "num_samples": 601,
  "correct": 109,
  "score": 0.18136439267886856
}

So the official /v1/chat/completions path is slightly better, but the gap remains large.

Expected behavior

With deterministic decoding and the same model, vLLM and SGLang do not have to produce bit-identical tokens, but I would expect the aggregate Geo3K accuracy to be much closer, especially when using the official vLLM OpenAI-compatible multimodal chat API.

Actual behavior

vLLM is significantly worse than SGLang on the same model and dataset:

vLLM official chat: 0.1930
SGLang: 0.2879

SGLang has 57 more correct answers out of 601.
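
To gauge whether a 57-sample gap could be run-to-run noise, a rough two-proportion z-test can be applied to the reported counts. This is only a sketch: it treats the two runs as independent samples, which is a simplification because both backends were scored on the same 601 problems. With per-sample results available, a paired test such as McNemar's would be the better fit.

```python
import math


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z statistic for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se


def two_sided_p(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Counts from the issue: vLLM official chat vs SGLang, 601 samples each.
z = two_proportion_z(116, 601, 173, 601)
p = two_sided_p(z)
print(f"z = {z:.2f}, two-sided p = {p:.2g}")
```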
