vllm - [Bug]: Qwen3.5-9B-AWQ on ROCm/vLLM 0.19.0 can get stuck generating endless "!" inside JSON schema output [4 comments, 3 participants]

GitHub stats
vllm-project/vllm#39348 (fetched 2026-04-09 07:51:43)
Comments: 4 · Participants: 3 · Timeline events: 16 · Reactions: 0
Timeline (top): commented ×4, mentioned ×4, subscribed ×4, labeled ×2


Your current environment

Environment details gathered locally

vLLM server version: 0.19.0

OS:
PRETTY_NAME="Ubuntu 24.04.4 LTS"
VERSION="24.04.4 LTS (Noble Numbat)"

Kernel:
Linux saturnix-AB350M-Pro4 6.17.0-20-generic #20~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Mar 19 01:28:37 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Python:
Python 3.12.3

GPU:
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX/7900 GRE/7900M] (rev cc)

ROCm note from host:
ROCk module is loaded
Unable to open /dev/kfd read-write: No such file or directory
saturnix is member of video group

Model:
QuantTrio/Qwen3.5-9B-AWQ

Served model name:
qwen3.5:9b

Docker image:
vllm/vllm-openai-rocm:latest

Server launch command:
sudo docker run --rm --network host \
  --device /dev/kfd --device /dev/dri \
  --group-add render \
  --ipc=host --shm-size 16G \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:latest \
  --model QuantTrio/Qwen3.5-9B-AWQ \
  --served-model-name qwen3.5:9b \
  --quantization awq \
  --host 127.0.0.1 \
  --port 11434 \
  --max-model-len 16384 \
  --enable-prefix-caching \
  --max-num-seqs 2 \
  --gpu-memory-utilization 0.95 \
  --language-model-only

🐛 Describe the bug

For some prompts, QuantTrio/Qwen3.5-9B-AWQ served by vLLM on ROCm starts generating valid JSON and then abruptly degenerates into an endless stream of ! characters inside a JSON string.

This is not just an application-layer parsing issue:

  • it happens with raw OpenAI-compatible requests, not only via Instructor
  • it happens with response_format={"type":"json_schema", ...}
  • streaming shows the degeneration live token-by-token
  • the prompt text itself looks clean: no suspicious control characters

One reproducible failing case is based on:

  • ASIN: 0544938097
  • market: .com
  • title: Little Blue Truck's Springtime: An Easter And Springtime Book For Kids – An Interactive Adventure with Baby Animals
  • author: Schertle, Alice
  • publisher: Clarion Books

The rendered prompt for this case is structurally normal:

  • system prompt length: 4931
  • user prompt length: 5534
  • non-ASCII chars in user prompt: 10
  • control chars in user prompt: []
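The figures above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the reporter's script: `prompt_stats` and `ALLOWED_CONTROLS` are hypothetical names, and treating newline, carriage return, and tab as harmless is an assumption made because the reported list is empty even though the prompt spans several lines.

```python
import unicodedata

# Assumption: ordinary whitespace controls are not counted as "suspicious",
# since the reported list is empty although the prompt contains newlines.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def prompt_stats(text: str) -> dict:
    """Summarize a prompt: length, non-ASCII count, unexpected control chars."""
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    control = sorted(
        {
            ch
            for ch in text
            if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS
        }
    )
    return {
        "length": len(text),
        "non_ascii": non_ascii,
        "control_chars": [repr(ch) for ch in control],
    }


# Illustrative text only; it contains one en dash and one em dash.
sample = "Little Blue Truck's Springtime \u2013 blooming\u2014it's spring!\n"
stats = prompt_stats(sample)
print(stats)
```

Running the same function over the full system and user prompts would show whether anything other than typographic punctuation is present.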

When streamed, the model behaves like this:

{
  "keywords": [
    ["little blue truck", "little blue truck"],
    ["spring", "spring"],
    ["easter", "easter"],
    ["baby animals", "baby animals"],
    ["interactive book", "interactive book"],
    ["flap book", "flap book"],
    ["children's book", "children's book"],
    ["farm animals", "farm animals"]
  ],
  "topics": [
    ["little blue truck", "little blue truck"],
    ["spring", "spring"],
    ["easter", "easter"],
    ["farm animals", "farm animals"],
    ["baby animals", "baby animals"],
    ["interactive reading", "interactive reading!!!!!!!!!!!!!!!!!!!!!!!!!

So generation is initially normal, then it gets stuck repeating ! forever inside a topics item.
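Because the degeneration is visible token by token, a client can stop reading as soon as one character repeats implausibly often. The sketch below is a client-side guard only and does not address the underlying bug; `guard_stream`, `RunawayRepetition`, and the threshold of 64 are hypothetical choices, not part of vLLM or the OpenAI client.

```python
from typing import Iterable, Iterator


class RunawayRepetition(RuntimeError):
    """Raised when the stream repeats one character too many times in a row."""


def guard_stream(pieces: Iterable[str], max_run: int = 64) -> Iterator[str]:
    """Yield text pieces, aborting once a single character repeats max_run times."""
    last_char = ""
    run = 0
    for piece in pieces:
        for ch in piece:
            if ch == last_char:
                run += 1
            else:
                last_char, run = ch, 1
            if run >= max_run:
                raise RunawayRepetition(
                    f"{ch!r} repeated {run} times in a row; aborting"
                )
        yield piece


# Simulated stream shaped like the failing output in this report.
degenerate = ['["interactive reading", "interactive reading', "!" * 40, "!" * 40]
collected = []
try:
    for piece in guard_stream(degenerate):
        collected.append(piece)
except RunawayRepetition as exc:
    print("stopped:", exc)
```

With a real stream, the pieces would be the `delta.content` strings from each chunk, and the caller would close the stream after catching the exception.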

Minimal reproduction

from openai import OpenAI

SYSTEM_PROMPT = """You are a book metadata classifier. Return a precise JSON object with:
- keywords: 4-8 keywords, each as [original_language, english]
- topics: 4-8 topics, each as [original_language, english]
- externally_promoted: boolean
- famous_author: boolean
- is_fiction: boolean
- low_content: boolean
- confidence: integer 1-10
- notes: short English explanation, max 200 chars
"""

USER_PROMPT = """<book>
Title: Little Blue Truck's Springtime: An Easter And Springtime Book For Kids – An Interactive Adventure with Baby Animals
Author: Schertle, Alice
Publisher: Clarion Books
Market: amazon.com
Description: Celebrate the beauty of springtime with the #1 New York Times best-selling Little Blue Truck!
Beep! Beep! Little Blue Truck is out for a ride with his good friend Toad. The sun is shining and the flowers are blooming—it's a beautiful spring day! Who will they see along the way?
Open the flaps to meet all of the sweet baby animals just born on the farm. Peep! Peep!
</book>
<web_search_results>
Use the following as evidence for famous_author and externally_promoted.
Author search results:
[en.wikipedia.org] Alice Schertle (born 1941) is an American poet, teacher, and author from Los Angeles. She is known as the author of numerous children's books, most notably the New York Times best-selling Little Blue Truck series.
[aliceschertle.com] Alice Schertle is an award-winning poet whose books for children include All You Need for a Snowman and the New York Times bestselling Good Night, Little Blue Truck.
[littlebluetruckbooks.com] Alice Schertle is a poet and author of many well-loved books for children, including the beloved, #1 New York Times best-selling Little Blue Truck series.
[encyclopedia.com] Schertle, Alice, born April 7, 1941, in Los Angeles, CA...
[poetryfoundation.org] Children's poet Alice Schertle was born in Los Angeles...
Publisher search results:
[harpercollins.com] Are you a Clarion Books fan? Sign up now for Clarion Books alerts...
[instagram.com] Clarion Books (@clarionbooks) on Instagram...
[facebook.com] Clarion Books. Learn more about the latest releases from Clarion Books...
[aalbc.com] Browse books published by Clarion Books, an imprint of HarperCollins.
</web_search_results>"""

SCHEMA = {
    "type": "object",
    "properties": {
        "keywords": {
            "type": "array",
            "items": {"type": "array", "items": {"type": "string"}},
        },
        "topics": {
            "type": "array",
            "items": {"type": "array", "items": {"type": "string"}},
        },
        "externally_promoted": {"type": "boolean"},
        "famous_author": {"type": "boolean"},
        "is_fiction": {"type": "boolean"},
        "low_content": {"type": "boolean"},
        "confidence": {"type": "integer"},
        "notes": {"type": "string", "maxLength": 200},
    },
    "required": ["keywords", "topics", "is_fiction", "confidence"],
}

client = OpenAI(base_url="http://127.0.0.1:11434/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "BookTagResult",
            "schema": SCHEMA,
        },
    },
    temperature=0.7,
    top_p=0.8,
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    text = delta.content or ""
    if text:
        print(text, end="", flush=True)

Expected behavior

The model should return a valid JSON object matching the schema.
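A quick way to tell a good response from a degenerate one is to parse it and check the keys the schema marks as required. This is a standard-library sketch under that assumption, not full JSON Schema validation; `check_result` is a hypothetical helper, and `REQUIRED_KEYS` is copied from the `required` list in the reproduction.

```python
import json

# Copied from the "required" list of the schema in the reproduction.
REQUIRED_KEYS = ("keywords", "topics", "is_fiction", "confidence")


def check_result(raw: str) -> tuple[bool, str]:
    """Return (ok, reason) for a response body expected to be a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top-level value is not an object"
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        return False, "missing required keys: " + ", ".join(missing)
    return True, "ok"


good = (
    '{"keywords": [["spring", "spring"]], "topics": [["easter", "easter"]], '
    '"is_fiction": true, "confidence": 8}'
)
truncated = '{"topics": [["interactive reading", "interactive reading!!!!!!!!'

print(check_result(good))
print(check_result(truncated))
```

A response cut off inside the run of ! characters fails at the parsing step, so it never reaches the key check.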

Actual behavior

The model starts generating valid JSON, then falls into an endless repetition loop of ! characters within a string field and never finishes unless the client times out.

Additional notes

  • This repro uses the raw OpenAI-compatible API directly.
  • The issue survives even when bypassing higher-level wrappers.
  • This does not happen for every prompt, only for some real payloads.
  • The failure appears to be in model inference / decoding rather than prompt formatting.

extent analysis

TL;DR

Adjusting decoding parameters such as temperature and top_p may reduce how often the model falls into the endless repetition loop. This is an untested workaround, and the root cause has not been identified.

Guidance

  • Verify that the issue persists across different prompts and input data to rule out any specific prompt-related issues.
  • Experiment with adjusting the temperature and top_p parameters in the chat.completions.create method to see if it affects the model's behavior and prevents the repetition loop.
  • Consider adding a timeout or a maximum response length to the client to prevent it from waiting indefinitely for a response.
  • Investigate if the issue is specific to the qwen3.5:9b model or if it occurs with other models as well.
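The suggestion to bound the response can be sketched as follows. `max_tokens` is a standard parameter of the OpenAI-compatible chat completions API; the helper names, the cap of 1024 tokens, and the 60-second deadline are assumptions chosen for illustration. A wall-clock deadline is checked inside the loop because a stream that keeps emitting ! never goes idle, so an idle-based timeout alone may not fire.

```python
import time


def bounded_request_kwargs(model, messages, schema, max_tokens=1024):
    """Build chat.completions.create() arguments with a hard output cap."""
    return {
        "model": model,
        "messages": messages,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "BookTagResult", "schema": schema},
        },
        "max_tokens": max_tokens,  # server stops generating after this many tokens
        "stream": True,
    }


def run_bounded(client, kwargs, deadline_s=60.0):
    """Collect streamed text, giving up once deadline_s seconds have elapsed."""
    start = time.monotonic()
    parts = []
    stream = client.chat.completions.create(**kwargs)
    for chunk in stream:
        if time.monotonic() - start > deadline_s:
            stream.close()  # stop reading so the connection is released
            raise TimeoutError(f"no complete response within {deadline_s} s")
        if chunk.choices:
            parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


# Placeholder message and schema; substitute the prompts from the reproduction.
kwargs = bounded_request_kwargs(
    model="qwen3.5:9b",
    messages=[{"role": "user", "content": "placeholder prompt"}],
    schema={"type": "object"},
)
print(kwargs["max_tokens"])
```

`run_bounded` expects an `OpenAI` client like the one in the reproduction, for example `run_bounded(client, kwargs)`. A response that hits the token cap will still be truncated JSON, so the caller needs to treat it as a failure and retry or skip the item.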

Example

stream = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "BookTagResult",
            "schema": SCHEMA,
        },
    },
    temperature=0.5,  # lowered from 0.7 in the original repro
    top_p=0.9,  # changed from 0.8 in the original repro
    stream=True,
)

Notes

The issue seems to be related to the model's decoding process, and adjusting the decoding parameters may help mitigate it. However, the root cause of the issue is still unclear, and further investigation is needed to determine the underlying problem.

Recommendation

Treat decoding-parameter changes as a workaround rather than a fix: try different temperature and top_p values and keep the settings that avoid the loop without degrading output quality.
