vllm: [Bug]: Qwen3-VL-2B-Instruct Geo3K accuracy score lower than SGLang with deterministic sampling [1 pull request]

I observe a large accuracy gap between vLLM and SGLang when serving Qwen3-VL-2B-Instruct on the Geo3K VLM benchmark, even with deterministic decoding and the official vLLM OpenAI-compatible /v1/chat/completions endpoint.

The gap reproduces with:

  • same model: Qwen3-VL-2B-Instruct
  • same Geo3K test set: 601 samples
  • same sampling:
    • temperature=0
    • top_p=1.0
    • top_k=-1
    • max_tokens=4096
    • seed=42
  • one worker per GPU
  • router policy: round_robin
  • 8 GPUs

Results:

| API | Number correct | Score |
| --- | --- | --- |
| vLLM official /v1/chat/completions | 116 / 601 | 0.1930116472545757 |
| vLLM /v1/chat/completions/render + /inference/v1/generate | 109 / 601 | 0.18136439267886856 |
| SGLang /generate | 173 / 601 | 0.2878535773710483 |

Fix Action

Fixed

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
OS: Ubuntu 22.04.5 LTS
GPU: 8 x NVIDIA H20-3e
NVIDIA driver: 580.105.08
CUDA runtime version: 12.9.86
PyTorch: 2.11.0+cu129
CUDA used to build PyTorch: 12.9
Python: 3.12.13
vLLM Version: 0.21.0
flashinfer-python: 0.6.8.post1
transformers: 5.8.1
triton: 3.6.0
flash-attn: 2.7.4.post1
transformer_engine_torch: 2.10.0
</details>

🐛 Describe the bug

Environment

OS: Linux
GPU: NVIDIA H20-3e, 143771 MiB
NVIDIA driver: 580.105.08
CUDA toolkit: 12.9, nvcc V12.9.86
Python: 3.12.13
vLLM: 0.21.0+cu129
SGLang: 0.5.10.post1
Torch: 2.11.0+cu129
torch.version.cuda: 12.9
Transformers: 5.8.1
Triton: 3.6.0
cuda-python: 12.9.0
flash-attn: 2.7.4.post1
Pillow: 12.2.0
pandas: 3.0.3
aiohttp: 3.13.5

vLLM official chat reproduction

Start 8 single-GPU vLLM workers and a round-robin vLLM router:

#!/usr/bin/env bash
set -euo pipefail

MODEL_PATH=/root/models/Qwen3-VL-2B-Instruct
HOST=127.0.0.1
GPU_IDS=(0 1 2 3 4 5 6 7)
WORKER_PORT_BASE=18100
ROUTER_PORT=18080
SEED=42

export CUDA_HOME=/usr/local/cuda-12.9
export CUDA_PATH="${CUDA_HOME}"
export PATH="/path/to/venv/bin:${CUDA_HOME}/bin:${PATH}"
export NO_PROXY="127.0.0.1,localhost,${HOST}"
export no_proxy="${NO_PROXY}"

WORKER_URLS=()
for i in "${!GPU_IDS[@]}"; do
  gpu="${GPU_IDS[$i]}"
  port=$((WORKER_PORT_BASE + i))
  WORKER_URLS+=("http://${HOST}:${port}")

  CUDA_VISIBLE_DEVICES="${gpu}" vllm serve "${MODEL_PATH}" \
    --host "${HOST}" \
    --port "${port}" \
    --trust-remote-code \
    --seed "${SEED}" \
    --gpu-memory-utilization 0.9 \
    --max-model-len 262144 \
    --generation-config vllm \
    > "vllm_worker_gpu${gpu}_port${port}.log" 2>&1 &
done

# Wait until all worker /health endpoints are ready here.

python -m vllm_router.launch_router \
  --host "${HOST}" \
  --port "${ROUTER_PORT}" \
  --worker-urls "${WORKER_URLS[@]}" \
  --policy round_robin \
  --request-timeout-secs 14400 \
  --log-level info \
  > vllm_router.log 2>&1 &
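
The launch script leaves the "wait until all worker /health endpoints are ready" step as a comment. A minimal sketch of that step, assuming each worker answers `GET /health` with HTTP 200 once it is ready; the function names, the 30-minute deadline, and the 5-second polling interval are illustrative and not part of the original script:

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_health_probe(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_healthy(
    worker_urls: Iterable[str],
    probe: Callable[[str], bool] = http_health_probe,
    deadline_s: float = 1800.0,  # assumed upper bound on startup time
    interval_s: float = 5.0,     # assumed polling interval
) -> None:
    """Block until every worker passes the probe, or raise TimeoutError."""
    pending = list(worker_urls)
    start = time.monotonic()
    while pending:
        # Keep only the workers that still fail the probe.
        pending = [url for url in pending if not probe(url)]
        if not pending:
            return
        if time.monotonic() - start > deadline_s:
            raise TimeoutError(f"workers not ready: {pending}")
        time.sleep(interval_s)
```

With the ports from the script above, the call would be `wait_until_healthy([f"http://127.0.0.1:{18100 + i}" for i in range(8)])`, run before the router is started.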

Then send official OpenAI-compatible chat completion requests:

import base64
import io
import json
import asyncio
from pathlib import Path

import aiohttp
import pandas as pd
from PIL import Image

# I use the same math verifier as my RL eval pipeline.
from slime.rollout.rm_hub.math_utils import grade_answer_verl


MODEL = "/root/models/Qwen3-VL-2B-Instruct"
DATASET = "/root/datasets/geo3k_imgurl/test.parquet"
BASE_URL = "http://127.0.0.1:18080"
OUTPUT = "vllm_chat_geo3k_eval_full_8gpu_router.json"

MAX_TOKENS = 4096
TEMPERATURE = 0.0
TOP_P = 1.0
TOP_K = -1
SEED = 42
CONCURRENCY = 64


def decode_data_url(data_url: str) -> Image.Image:
    if data_url.startswith("data:"):
        _, encoded = data_url.split(",", 1)
    else:
        encoded = data_url
    return Image.open(io.BytesIO(base64.b64decode(encoded))).convert("RGB")


def image_to_png_data_url(image: Image.Image) -> str:
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode("utf-8")


def as_list(value):
    if value is None:
        return []
    if hasattr(value, "tolist"):
        value = value.tolist()
    if isinstance(value, list):
        return value
    return [value]


def build_openai_messages(problem: str, image_urls: list[str]):
    parts = []
    chunks = problem.split("<image>")
    for i, chunk in enumerate(chunks):
        if chunk:
            parts.append({"type": "text", "text": chunk})
        if i < len(chunks) - 1:
            parts.append({
                "type": "image_url",
                "image_url": {"url": image_urls[i]},
            })
    return [{"role": "user", "content": parts}]


async def post_json(session, url, payload):
    async with session.post(url, json=payload, timeout=aiohttp.ClientTimeout(total=14400)) as resp:
        text = await resp.text()
        if resp.status >= 400:
            raise RuntimeError(f"{resp.status}: {text[:1000]}")
        return json.loads(text)


async def main():
    df = pd.read_parquet(DATASET)

    samples = []
    for i, row in df.iterrows():
        pil_images = [decode_data_url(x) for x in as_list(row["images"])]
        image_urls = [image_to_png_data_url(img) for img in pil_images]
        messages = build_openai_messages(str(row["problem"]), image_urls)
        samples.append({
            "index": int(i),
            "messages": messages,
            "label": str(row["answer"]),
        })

    sem = asyncio.Semaphore(CONCURRENCY)
    rows = []

    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=CONCURRENCY)) as session:
        async def one(sample):
            async with sem:
                payload = {
                    "model": MODEL,
                    "messages": sample["messages"],
                    "max_tokens": MAX_TOKENS,
                    "temperature": TEMPERATURE,
                    "top_p": TOP_P,
                    "top_k": TOP_K,
                    "seed": SEED,
                    "skip_special_tokens": False,
                    "spaces_between_special_tokens": False,
                }
                data = await post_json(session, f"{BASE_URL}/v1/chat/completions", payload)
                text = data["choices"][0]["message"]["content"] or ""
                reward = 1 if grade_answer_verl(text, sample["label"]) else 0
                return {
                    "index": sample["index"],
                    "reward": reward,
                    "label": sample["label"],
                    "response_preview": text[:1000],
                }

        tasks = [asyncio.create_task(one(s)) for s in samples]
        for t in asyncio.as_completed(tasks):
            rows.append(await t)

    rows.sort(key=lambda x: x["index"])
    score = sum(r["reward"] for r in rows) / len(rows)

    result = {
        "backend": "vllm_chat",
        "num_samples": len(rows),
        "score": score,
        "correct": sum(r["reward"] for r in rows),
        "sampling": {
            "temperature": TEMPERATURE,
            "top_p": TOP_P,
            "top_k": TOP_K,
            "max_tokens": MAX_TOKENS,
            "seed": SEED,
        },
        "rows": rows,
    }
    Path(OUTPUT).write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
    print(json.dumps({k: result[k] for k in ["backend", "num_samples", "correct", "score", "sampling"]}, indent=2))


asyncio.run(main())

Observed vLLM official chat result:

{
  "backend": "vllm_chat",
  "num_samples": 601,
  "correct": 116,
  "score": 0.1930116472545757,
  "sampling": {
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,
    "max_tokens": 4096,
    "seed": 42
  }
}

SGLang comparison

SGLang workers were launched with the same model and deterministic sampling:

CUDA_VISIBLE_DEVICES="${gpu}" python -m sglang.launch_server \
  --model-path /root/models/Qwen3-VL-2B-Instruct \
  --host 127.0.0.1 \
  --port "${port}" \
  --trust-remote-code \
  --random-seed 42 \
  --mem-fraction-static 0.6 \
  --context-length 262144 \
  --sampling-defaults openai \
  --cuda-graph-bs 1 2 4 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152 160 168 176 184 192 200 208 216 224 232 240 248 256

SGLang requests used:

{
  "text": "<chat-templated prompt>",
  "image_data": ["data:image/png;base64,..."],
  "sampling_params": {
    "max_new_tokens": 4096,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,
    "sampling_seed": 42,
    "skip_special_tokens": false,
    "spaces_between_special_tokens": false
  },
  "return_logprob": false
}
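
The two backends use different field names for the same settings (`max_new_tokens` vs. `max_tokens`, `sampling_seed` vs. `seed`), so it can help to check mechanically that the client-side sampling settings agree. A small sketch, using only the values shown in this issue; the helper name `sampling_mismatches` is hypothetical:

```python
# Sampling settings copied from the two request payloads shown in this issue.
SGLANG_SAMPLING = {
    "max_new_tokens": 4096,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,
    "sampling_seed": 42,
    "skip_special_tokens": False,
    "spaces_between_special_tokens": False,
}
VLLM_PAYLOAD = {
    "max_tokens": 4096,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,
    "seed": 42,
    "skip_special_tokens": False,
    "spaces_between_special_tokens": False,
}
# Field names that differ between the two request formats in this issue.
SGLANG_TO_VLLM_KEYS = {"max_new_tokens": "max_tokens", "sampling_seed": "seed"}


def sampling_mismatches(sglang_params: dict, vllm_payload: dict) -> dict:
    """Return {vllm_key: (sglang_value, vllm_value)} for every differing setting."""
    diffs = {}
    for key, value in sglang_params.items():
        vllm_key = SGLANG_TO_VLLM_KEYS.get(key, key)
        if vllm_payload.get(vllm_key) != value:
            diffs[vllm_key] = (value, vllm_payload.get(vllm_key))
    return diffs
```

An empty result only confirms that the requests carry the same values; it says nothing about server-side defaults or how each backend interprets them.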

Observed SGLang result:

{
  "backend": "sglang",
  "num_samples": 601,
  "correct": 173,
  "score": 0.2878535773710483
}

Additional vLLM path tested

I also tested vLLM's disaggregated multimodal flow:

  1. /v1/chat/completions/render
  2. /inference/v1/generate

This path uses locally computed HF tokenizer/processor prompt ids and aligns the vLLM render output to the local prompt ids.
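
The issue does not show how the alignment was done. As a rough illustration of the kind of check involved, the hypothetical helper below reports the first position where two prompt id lists disagree; `local_ids` and `rendered_ids` are assumed to be plain lists of integer token ids:

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_mismatch(local_ids: Sequence[int], rendered_ids: Sequence[int]) -> Optional[int]:
    """Return the first index where the two id lists differ, or None if identical."""
    # zip_longest pads the shorter list with None, so a length difference
    # shows up as a mismatch at the first missing position.
    for pos, (local, rendered) in enumerate(zip_longest(local_ids, rendered_ids)):
        if local != rendered:
            return pos
    return None
```

A return value of `None` means the locally computed prompt and the rendered prompt are token-for-token identical.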

Result:

{
  "backend": "vllm_render_generate",
  "num_samples": 601,
  "correct": 109,
  "score": 0.18136439267886856
}

So the official /v1/chat/completions path is slightly better than the render + generate path (116 vs. 109 correct), but the gap to SGLang remains large.

Expected behavior

With deterministic decoding and the same model, vLLM and SGLang do not have to produce bit-identical tokens, but I would expect the aggregate Geo3K accuracy to be much closer, especially when using the official vLLM OpenAI-compatible multimodal chat API.
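
One way to localize the gap is to compare rewards per sample rather than in aggregate. The sketch below assumes the SGLang run was saved with the same `rows` layout as the vLLM script's output (`index` and `reward` per row); that layout is an assumption for the SGLang side, since the issue only shows its summary:

```python
from typing import Dict, Iterable, List


def split_by_outcome(vllm_rows: Iterable[dict], sglang_rows: Iterable[dict]) -> Dict[str, List[int]]:
    """Group sample indices by which backend answered them correctly."""
    vllm = {row["index"]: row["reward"] for row in vllm_rows}
    sglang = {row["index"]: row["reward"] for row in sglang_rows}
    # Only compare samples present in both runs.
    shared = sorted(vllm.keys() & sglang.keys())
    return {
        "both_correct": [i for i in shared if vllm[i] == 1 and sglang[i] == 1],
        "only_vllm": [i for i in shared if vllm[i] == 1 and sglang[i] == 0],
        "only_sglang": [i for i in shared if vllm[i] == 0 and sglang[i] == 1],
        "both_wrong": [i for i in shared if vllm[i] == 0 and sglang[i] == 0],
    }
```

The `only_sglang` bucket is the interesting one: reading the `response_preview` of those samples from the vLLM run should show whether the failures are truncated generations, formatting differences that the grader rejects, or genuinely different reasoning.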

Actual behavior

vLLM is significantly worse than SGLang on the same model and dataset:

vLLM official chat: 0.1930
SGLang: 0.2879
SGLang has 57 more correct answers out of 601.
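
The figures above follow directly from the correct counts. A short check of the arithmetic, using only the numbers reported in this issue:

```python
TOTAL = 601  # Geo3K test samples

# Correct answers per backend, as reported in this issue.
correct = {"vllm_chat": 116, "vllm_render_generate": 109, "sglang": 173}

# Accuracy is simply correct / total.
scores = {name: n / TOTAL for name, n in correct.items()}

# Gap between SGLang and the official vLLM chat path.
gap_samples = correct["sglang"] - correct["vllm_chat"]
gap_points = 100 * (scores["sglang"] - scores["vllm_chat"])

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print(f"gap: {gap_samples} samples, {gap_points:.2f} percentage points")
```

The gap of 57 samples corresponds to roughly 9.5 percentage points of accuracy.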
