vllm - ✅(Solved) Fix [Bug]: Gemma 4 (31B / 26B-A4B) generates infinite repetition loops, especially with structured output (JSON schema) [2 pull requests, 2 comments, 2 participants]

vllm • 2026-04-17 01:21:51


vllm-project/vllm#40080 • Fetched 2026-04-17 08:27:20


Author

Foreist

Participants

Foreist

ianliuy

Timeline (top)

commented ×2, cross-referenced ×2, labeled ×1, mentioned ×1

Root Cause

The repetition pattern seems to be a model-level tendency that is amplified by grammar-constrained decoding. When xgrammar restricts the token space to valid JSON tokens, the model's slight repetition bias becomes a strong loop because the grammar prevents the model from generating an EOS or breaking out of the pattern.

Possible mitigations at the vLLM level:

repetition_penalty / frequency_penalty sampling parameters partially help but do not fully prevent the issue
The interaction between xgrammar bitmask and Gemma 4's attention pattern may deserve investigation

Fix Action

Fix / Workaround

Possible mitigations at the vLLM level:

repetition_penalty / frequency_penalty sampling parameters partially help but do not fully prevent the issue
The interaction between xgrammar bitmask and Gemma 4's attention pattern may deserve investigation

PR fix notes

PR #40097: fix: auto-enable repetition detection for structured output requests

Repository: vllm-project/vllm
Author: ssam18
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40097

Description (problem / solution / changelog)

When grammar-constrained decoding is active, the bitmask blocks tokens that would naturally break a loop, so models like Gemma 4 can spin indefinitely. This wires in repetition detection automatically whenever a structured output constraint is set, so generation terminates with a clear error instead of looping forever. Users who want different thresholds can still pass their own repetition_detection params and those won't be overridden. Fixes #40080.

Changed files

tests/v1/core/test_repetition_detection.py (modified, +58/-1)
vllm/lora/layers/base_linear.py (modified, +10/-7)
vllm/sampling_params.py (modified, +11/-0)

PR #40099: [Bugfix] Auto-enable repetition detection for grammar-constrained structured output

Repository: vllm-project/vllm
Author: ianliuy
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40099

Description (problem / solution / changelog)

Purpose

Mitigate infinite repetition loops in Gemma 4 (and other repetition-prone models) when grammar-constrained decoding (JSON schema) is enabled.

Fixes https://github.com/vllm-project/vllm/issues/40080

What's broken?

Gemma 4 IT models produce infinite repetition loops (e.g., "General/Systemic-related symptoms: General/Systemic-related symptoms: ...") when generating structured output via JSON schema. The model generates a valid prefix, then enters a degenerate loop repeating a phrase with minor variations until max_tokens is hit — consuming GPU resources for thousands of garbage tokens.

Who is affected?

Any user serving Gemma 4 (31B-it or 26B-A4B-it) with response_format=json_schema. Other models with repetition tendencies may also be affected. This is a model-level issue confirmed across platforms (google-deepmind/gemma#622, google-deepmind/gemma#610).

Root cause analysis

The grammar bitmask is applied before sampling penalties in the pipeline (model_runner.py:841-848). Inside a JSON string value, xgrammar allows thousands of tokens, but the model's logit distribution heavily favors the repeating phrase. With default repetition_penalty=1.0 (no penalty), nothing counteracts this bias. The existing RepetitionDetectionParams (added in #35451) can detect and stop such loops, but is opt-in and disabled by default — most users don't know it exists.

Note: This is distinct from #39842 (BOS token fix for PT models by @lucianommartins), which addressed a different Gemma 4 repetition mode.
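The root cause analysis above can be illustrated with a toy example. This is a minimal sketch, not vLLM or xgrammar code: the five-token vocabulary, the logit values, and the helper names `apply_grammar_mask` and `argmax` are all hypothetical. It only shows that masking out grammar-disallowed tokens does nothing to a token that is both allowed and strongly favored by the model.

```python
import math


def apply_grammar_mask(logits, allowed):
    """Set logits of grammar-disallowed tokens to -inf; leave the rest unchanged."""
    return [x if ok else -math.inf for x, ok in zip(logits, allowed)]


def argmax(values):
    """Index of the largest value (greedy decoding)."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical vocabulary: index 0 is EOS, index 1 is the closing quote,
# indices 2-4 are ordinary text tokens. Index 3 stands for the looping phrase.
logits = [1.0, 2.0, 3.0, 9.0, 2.5]

# Inside a JSON string value the grammar forbids EOS but permits the closing
# quote and every text token.
allowed_in_string = [False, True, True, True, True]

masked = apply_grammar_mask(logits, allowed_in_string)
next_token = argmax(masked)
print(next_token)  # 3
```

The mask removes EOS as an exit, but the looping token wins regardless because its logit was never touched; with repetition_penalty=1.0 nothing else in the pipeline lowers it.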

How we mitigate it

Auto-enable RepetitionDetectionParams with generous thresholds when grammar-constrained structured output (json, json_object, grammar) is active and the user has not set explicit repetition_detection:

# In SamplingParams.__post_init__:
if (self.structured_outputs is not None
        and self.repetition_detection is None
        and self.structured_outputs._uses_grammar_constraint()):
    self.repetition_detection = RepetitionDetectionParams(
        max_pattern_size=20, min_pattern_size=3, min_count=4)

Thresholds: 3-to-20 token N-gram repeated 4+ times → at minimum 12 tokens of pure repetition before triggering. This is extremely conservative and will not false-positive on legitimate JSON.

Scope: Only grammar-based modes (json, json_object, grammar). Non-grammar modes (choice, regex) are unaffected.

Override: Users can disable via repetition_detection=RepetitionDetectionParams() (all-zero = disabled).

This follows the defensive-abort precedent from #38663 (abort stuck FSM requests).
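To make the thresholds concrete, here is a rough sketch of an exact-match N-gram check over the tail of a token sequence. It is an illustration under assumptions, not the RepetitionDetectionParams implementation: the function name `has_repeated_ngram` is hypothetical, and the choice to look only at back-to-back repeats at the end of the sequence is an assumption. The parameter names mirror the ones quoted in the PR.

```python
def has_repeated_ngram(token_ids, min_pattern_size=3, max_pattern_size=20, min_count=4):
    """Return True if the sequence ends with a pattern of length
    min_pattern_size..max_pattern_size repeated min_count times back to back."""
    n = len(token_ids)
    for size in range(min_pattern_size, max_pattern_size + 1):
        span = size * min_count
        if span > n:
            # Larger patterns need even more tokens, so stop here.
            break
        tail = token_ids[n - span:]
        pattern = tail[:size]
        # Every size-length chunk of the tail must equal the first chunk.
        if all(tail[i:i + size] == pattern for i in range(0, span, size)):
            return True
    return False


prefix = [1, 2]
print(has_repeated_ngram(prefix + [7, 8, 9] * 4))  # True
print(has_repeated_ngram(prefix + [7, 8, 9] * 3))  # False
```

With the default thresholds the smallest trigger is a 3-token pattern seen 4 times, which is the 12-token minimum stated above. An exact-match check like this one would not catch the "minor variations" seen in the reported output unless the varying tokens fall outside the repeated window.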

Why this approach?

Alternative	Why not
Reorder grammar mask / penalties	Doesn't help — repeated tokens are grammar-allowed
Auto-apply `repetition_penalty`	Changes sampling distribution for all requests; doesn't guarantee loop prevention
Warning log only	Doesn't solve the problem; users still burn GPU on garbage
Model-specific fix	Too narrow; other models also affected
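The "doesn't guarantee loop prevention" row can be illustrated with a small sketch. It assumes the commonly used CTRL-style repetition penalty (divide positive logits of already-seen tokens by the penalty, multiply negative ones by it); whether vLLM applies exactly this formula is an assumption here, and the logit values and the helper name `apply_repetition_penalty` are hypothetical.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """CTRL-style penalty: shrink positive logits and push negative logits
    further down, but only for tokens that have already been generated."""
    out = list(logits)
    for tok in set(seen_token_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


# Hypothetical logits; index 3 is the token the model keeps repeating.
logits = [1.0, 2.0, 3.0, 9.0, 2.5]
seen = [3, 3, 3]

penalized = apply_repetition_penalty(logits, seen, penalty=1.5)
print(penalized[3])  # 6.0
print(max(range(len(penalized)), key=penalized.__getitem__))  # 3
```

Even with a fairly aggressive penalty of 1.5, the looping token still has the highest logit in this example, which is consistent with the report that penalties only partially help.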

Test Plan

python -m pytest tests/v1/core/test_repetition_detection.py -v -k TestStructuredOutput

12 new tests covering:

Auto-enable for json/json_object ✅
NOT auto-enabled for choice/regex ✅
User-explicit params preserved ✅
User-explicit disable preserved ✅
No auto-enable without structured output ✅
Loop detection stops degenerate output ✅
Non-repeating output continues normally ✅
Similar-but-not-identical elements (false-positive test) ✅
Large N-gram pattern detection ✅

Test Result

PASS: auto-enable json
PASS: max_pattern=20
PASS: auto-enable json_object
PASS: no auto for choice
PASS: no auto for regex
PASS: explicit preserved
PASS: explicit disabled
PASS: no struct = None
PASS: loop stopped
PASS: status FINISHED_REPETITION
PASS: non-repeat continues
PASS: no false positive
PASS: large ngram
Results: 13 passed, 0 failed — ALL TESTS PASSED

cc @russellb @mgoin — this touches structured output + sampling interaction.


Changed files

tests/v1/core/test_repetition_detection.py (modified, +182/-1)
vllm/sampling_params.py (modified, +30/-0)

Code Example

==============================
        System Info
==============================
OS                           : Ubuntu 24.04.2 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : 18.1.3 (1ubuntu1)
CMake version                : Could not collect
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, May 26 2025, 18:50:19) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.17.0-19-generic-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA RTX PRO 6000 Blackwell Server Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Server Edition
GPU 2: NVIDIA RTX PRO 6000 Blackwell Server Edition
GPU 3: NVIDIA RTX PRO 6000 Blackwell Server Edition

Nvidia driver version        : 580.126.09
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.10.2
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  96
On-line CPU(s) list:                     0-95
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) 6527P
CPU family:                              6
Model:                                   173
Thread(s) per core:                      2
Core(s) per socket:                      24
Socket(s):                               2
Stepping:                                1
CPU(s) scaling MHz:                      21%
CPU max MHz:                             4200.0000
CPU min MHz:                             800.0000
BogoMIPS:                                6000.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                          VT-x
L1d cache:                               2.3 MiB (48 instances)
L1i cache:                               3 MiB (48 instances)
L2 cache:                                96 MiB (48 instances)
L3 cache:                                288 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-23,48-71
NUMA node1 CPU(s):                       24-47,72-95
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Old microcode:             Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS Not affected; BHI BHI_DIS_S
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.7
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==12.575.51
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] pytorch-triton==3.1.0+cf34004b8.internal
[pip3] pyzmq==27.0.0
[pip3] torch==2.11.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0
[pip3] torchvision==0.26.0
[pip3] transformers==5.5.4
[pip3] triton==3.6.0
[pip3] tritonfrontend==2.59.0
[pip3] tritonserver==0.0.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1rc1.dev328+g18013df6a (git sha: 18013df6a)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NODE	SYS	SYS	0-23,48-71	0		N/A
GPU1	NODE	 X 	SYS	SYS	0-23,48-71	0		N/A
GPU2	SYS	SYS	 X 	NODE	24-47,72-95	1		N/A
GPU3	SYS	SYS	NODE	 X 	24-47,72-95	1		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
CUBLAS_VERSION=12.9.1.4
NVIDIA_REQUIRE_CUDA=cuda>=9.0
NCCL_VERSION=2.27.3
VLLM_CACHE_ROOT=/usr/src/models
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
CUDA_ARCH_LIST=7.5 8.0 8.6 9.0 10.0 12.0
NVIDIA_PRODUCT_NAME=Triton Server
CUDA_VERSION=12.9.1.010
CUBLASMP_VERSION=0.4.0.789
CUDNN_FRONTEND_VERSION=1.12.0
CUDNN_VERSION=9.10.2.21
NVIDIA_TRITON_SERVER_VERSION=25.06
LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_BUILD_ID=179868735
CUDA_DRIVER_VERSION=575.57.08
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

---

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "Summarize the patient's symptoms in a structured format."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "diagnosis",
            "schema": {
                "type": "object",
                "properties": {
                    "patient_information": {"type": "string"},
                    "diagnosis_and_complaints": {"type": "string"},
                },
                "required": ["patient_information", "diagnosis_and_complaints"]
            }
        }
    },
    max_tokens=2000,
)
print(response.choices[0].message.content)

🐛 Describe the bug

Gemma 4 models (both google/gemma-4-31B-it and google/gemma-4-26B-A4B-it) fall into infinite repetition loops during generation. The issue occurs significantly more frequently when structured output (JSON schema / grammar constraints) is enabled, but has also been observed in unconstrained generation.

Example output (with JSON schema): {"patient_information": "Patient reports a 2-week duration of a...", "diagnosis_and_complaints": "General/Systemic-related symptoms: General/Systemic-related symptoms: General/Systemic-related symptoms: General/System-related symptoms: General/Systemic-related symptoms: General/System/related-related symptoms: General/Systemic-related symptoms: General/Systemic-related symptoms: General/System-related symptoms: General/Systemic-related symptoms: ...

The model generates a valid prefix, then enters a degenerate loop repeating a phrase with minor variations indefinitely until max_tokens is hit.
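Until a server-side fix lands, a client can at least flag responses that show this failure pattern. The following is a heuristic sketch, not something proposed in the issue: the function name `looks_like_runaway` is hypothetical, and it relies only on the standard OpenAI-compatible finish_reason value "length" plus a JSON parse check.

```python
import json


def looks_like_runaway(content, finish_reason):
    """Heuristic: the response was cut off by the token limit and the
    truncated text is not valid JSON."""
    if finish_reason != "length":
        return False
    try:
        json.loads(content or "")
    except json.JSONDecodeError:
        return True
    return False


truncated = (
    '{"patient_information": "x", "diagnosis_and_complaints": '
    '"General/Systemic-related symptoms: General/Systemic-related symptoms: '
)
print(looks_like_runaway(truncated, "length"))  # True
print(looks_like_runaway('{"a": "b"}', "stop"))  # False
```

With the OpenAI client used in the reproduction, the inputs would come from `response.choices[0].message.content` and `response.choices[0].finish_reason`. The check detects the symptom only after the tokens have already been generated, so it does not save GPU time.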

Observations:

Occurs with both BF16 and quantized (FP8, NVFP4) weights
Much higher frequency when grammar/JSON schema constraints are applied via xgrammar
Not specific to tensor parallelism configuration
Occurs on both vLLM v0.19.x and latest main
Not a vLLM-specific issue — the same behavior has been reported across multiple deployment platforms (see related issues below)

Related issues (cross-platform)

This appears to be a model-level issue affecting all serving platforms, not specific to vLLM:

google-deepmind/gemma#622 — Gemma 4 repetition issue (general)
google-deepmind/gemma#610 — Cloudflare Gemma-4-26B-A4B repetition
google/gemma-4-31B-it HF discussion#63 — Repetition reported on Google Vertex AI official deployment
vllm#39130 — Related: --reasoning-parser gemma4 silently disables structured output when enable_thinking=false

Analysis

Possible mitigations at the vLLM level:

repetition_penalty / frequency_penalty sampling parameters partially help but do not fully prevent the issue
The interaction between xgrammar bitmask and Gemma 4's attention pattern may deserve investigation

Reproduction

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "Summarize the patient's symptoms in a structured format."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "diagnosis",
            "schema": {
                "type": "object",
                "properties": {
                    "patient_information": {"type": "string"},
                    "diagnosis_and_complaints": {"type": "string"},
                },
                "required": ["patient_information", "diagnosis_and_complaints"]
            }
        }
    },
    max_tokens=2000,
)
print(response.choices[0].message.content)
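
When reproducing, it can help to flag affected responses automatically instead of reading each one. The sketch below is a heuristic, not part of the original report: it treats a response as a likely runaway generation when it stopped because of the token limit (`finish_reason == "length"`) and its content does not parse as JSON. The helper name `looks_like_runaway` is hypothetical.

```python
import json


def looks_like_runaway(finish_reason, content):
    """Heuristic: True when generation hit the token limit and left invalid JSON."""
    if finish_reason != "length":
        return False
    try:
        json.loads(content)
    except (json.JSONDecodeError, TypeError):
        # TypeError covers content being None.
        return True
    return False
```

With the reproduction above, it could be called as `looks_like_runaway(response.choices[0].finish_reason, response.choices[0].message.content)`.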


extent analysis

TL;DR

Setting the repetition_penalty and frequency_penalty sampling parameters on each request partially mitigates the issue but does not fully prevent it. These are request-level sampling parameters, not model configuration.

Guidance

Investigate the interaction between xgrammar bitmask and Gemma 4's attention pattern to understand how it contributes to the repetition loop.
Experiment with different values for repetition_penalty and frequency_penalty to find the optimal balance between preventing repetition and maintaining model performance.
Consider modifying the JSON schema to allow for more flexibility in the generated output, potentially reducing the likelihood of the model entering a repetition loop.
Review related issues across different platforms to identify any common patterns or solutions that may be applicable.
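
As a client-side safeguard while the interaction is investigated, one option is to watch the generated text for a repeating tail and stop consuming the stream once a loop is detected. The sketch below is a simple character-level check with hypothetical names and thresholds. It is not the detection logic used by vLLM or by the linked PRs.

```python
def has_repeating_tail(text, min_period=4, max_period=200, min_repeats=4):
    """Return True if `text` ends with some substring repeated `min_repeats` times."""
    for period in range(min_period, max_period + 1):
        needed = period * min_repeats
        if len(text) < needed:
            # Longer periods need even more text, so stop here.
            break
        unit = text[-period:]
        if text[-needed:] == unit * min_repeats:
            return True
    return False


looping = "Patient reports " + "and pain " * 5
normal = "The patient reports a mild headache and nausea since Monday."
print(has_repeating_tail(looping), has_repeating_tail(normal))
```

With a streaming request, the check could be run on the accumulated content after each chunk, closing the stream when it returns True. The thresholds need tuning, since legitimate JSON can contain short repeated runs.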

Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "Summarize the patient's symptoms in a structured format."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "diagnosis",
            "schema": {
                "type": "object",
                "properties": {
                    "patient_information": {"type": "string"},
                    "diagnosis_and_complaints": {"type": "string"},
                },
                "required": ["patient_information", "diagnosis_and_complaints"]
            }
        }
    },
    max_tokens=2000,
    # frequency_penalty is a standard OpenAI parameter; tune it experimentally.
    frequency_penalty=0.5,
    # repetition_penalty is a vLLM extension, so the OpenAI client must send it
    # through extra_body rather than as a named argument.
    extra_body={"repetition_penalty": 1.5},
)
print(response.choices[0].message.content)

Notes

The example is the reproduction request with penalty parameters added. It is a workaround, not a fix. The values 1.5 and 0.5 are starting points only: tune repetition_penalty and frequency_penalty experimentally, since values that are too high can degrade output quality and values that are too low leave the loop in place.

Recommendation

As a workaround, set the repetition_penalty and frequency_penalty sampling parameters on structured-output requests. The issue reports that this helps only partially. On the vLLM side, PR #40097 and PR #40099 address the problem by auto-enabling repetition detection for structured output requests. A complete fix may still require investigating how the xgrammar bitmask interacts with Gemma 4's attention pattern.

