ollama: gemma4:31b repetition loop during constrained JSON generation with free-text string fields [4 comments, 2 participants]

ollama/ollama#15502, fetched 2026-04-12 13:24:16

Summary

gemma4:31b enters word-repetition loops when generating long free-text strings inside a JSON schema constraint (format=). A word doubles, then collapses into a single repeated token that fills the remaining num_predict budget, leaving the JSON unterminated. Bug rate is 60-100% depending on the prompt, across 39 trials.

This is NOT the <unused> token / GEMV buffer overlap bug (llama.cpp#21321 / #21566). The repeated tokens are normal English words, and zero <unused> tokens were observed in any trial.

Observed behavior

Actual output from the minimal repro below (seed=0):

{
  "description": "The horizon is a bleeding, vibrant, orange-gold single own line where the sky meets
the ocean, as the sun dips low, casting a long, shimmering path of amber own own own own own own own
own own own own own own own own own own own own own own own own own own own own own own own own own

(300 chars, unterminated JSON — the "own" token repeats until num_predict is exhausted.)

The cascade pattern:

  1. Normal generation starts fine inside a JSON string value
  2. A token intrudes: "amber own"
  3. Collapses into a single repeated token: "own own own own own..."
  4. Fills remaining num_predict budget (8192 tokens)
  5. JSON left unterminated -> parse error
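
The end state of this cascade can be detected mechanically. The following is a minimal sketch, assuming the loop always shows up as a run of identical whitespace-separated words at the tail of the output, as in the sample above. The function name `trailing_repeat` and the threshold of 8 are illustrative choices, not part of the original report.

```python
def trailing_repeat(text: str, min_repeats: int = 8):
    """Return (word, count) if text ends in one word repeated
    at least min_repeats times, otherwise None."""
    words = text.split()
    if not words:
        return None
    last = words[-1]
    count = 0
    # Walk backwards until the run of identical words is broken.
    for word in reversed(words):
        if word != last:
            break
        count += 1
    return (last, count) if count >= min_repeats else None
```

A check like this only recognises loops over whole words separated by spaces; a loop over sub-word fragments with no whitespace would need a different test.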

Root cause isolation

We ran 39 trials across 13 test configurations, varying one condition at a time. Three conditions are all required to trigger the bug:

| #  | Test                                     | Rep Bugs | JSON Fail | What it proves                                  |
|----|------------------------------------------|----------|-----------|-------------------------------------------------|
| 1  | 4 different prompts + schema + free-text | 8/12     | 10/12     | Not prompt-specific (60-100% rate)              |
| 2  | no format= (free generation)             | 0/3      | 0/3       | format= IS required                             |
| 3  | schema + no free-text fields             | 0/3      | 0/3       | Free-text strings in JSON trigger it            |
| 4  | schema + free-text + think=False         | 0/3      | 3/3       | No repetition, but JSON broken (#15260)         |
| 5a | gemma4:26b (MoE) + schema + free-text    | 0/3      | 3/3       | Dense (31b) only, MoE has different JSON issues |
| 5b | gemma3:27b + schema + free-text          | 0/3      | 0/3       | gemma4-specific regression                      |
| 6a | repeat_penalty=1.0                       | 2/3      | 2/3       | Penalty has no effect                           |
| 6b | repeat_penalty=1.15                      | 2/3      | 2/3       | Same seeds fail regardless                      |
| 6c | repeat_penalty=1.5                       | 2/3      | 2/3       | Cannot suppress the cascade                     |

The three necessary conditions

  1. gemma4:31b (Dense) — gemma4:26b (MoE) and gemma3:27b do not exhibit this bug
  2. format= with a JSON schema — removing the grammar constraint eliminates the bug entirely
  3. Free-text string fields in the schema (e.g., "description": {"type": "string"} requesting multi-sentence output) — a simple schema with only arrays and enums is clean

Vision input is not required — text-only prompts reproduce at the same rate.

Note: The test matrix was collected using a longer prompt with vision input variants. The same bug reproduces with the simplified text-only prompt shown below. The minimal repro was verified independently.
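
The "Rep Bugs" and "JSON Fail" columns can be tallied with a small harness. This is a sketch only, not the script used for the trials above: `tally` and its `generate` callback are hypothetical names, and the repetition check assumes the loop appears as a run of identical trailing words.

```python
import json
from typing import Callable, Iterable


def tally(generate: Callable[[int], str], seeds: Iterable[int],
          min_repeats: int = 8) -> dict:
    """Call generate(seed) once per seed and count repetition loops
    and JSON parse failures separately."""
    counts = {"trials": 0, "rep_bugs": 0, "json_fail": 0}
    for seed in seeds:
        text = generate(seed)
        counts["trials"] += 1
        # Length of the run of identical words at the end of the output.
        words = text.split()
        run = 0
        for word in reversed(words):
            if word != words[-1]:
                break
            run += 1
        if run >= min_repeats:
            counts["rep_bugs"] += 1
        # Unterminated or otherwise malformed JSON is counted on its own.
        try:
            json.loads(text)
        except json.JSONDecodeError:
            counts["json_fail"] += 1
    return counts
```

In a real run, `generate` would wrap the `ollama.chat(...)` call from the minimal reproduction and pass the seed through `options`, with one `tally` call per row of the matrix.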

Minimal reproduction

No images, no dependencies beyond the ollama Python package. Seeds 0 and 84 hit the repetition loop; seed 42 produces malformed JSON of a different kind. All 3/3 seeds produce broken output.

import ollama

SCHEMA = {
    "type": "object",
    "required": ["description", "analysis", "tags"],
    "properties": {
        "description": {
            "type": "string",
            "description": "At least 3 detailed sentences.",
        },
        "analysis": {
            "type": "string",
            "description": "Several paragraphs of analysis.",
        },
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

PROMPT = (
    "Describe a beach scene at sunset in detail. "
    "Write at least 3 full sentences for description "
    "and several paragraphs for analysis."
)

response = ollama.chat(
    model="gemma4:31b",
    messages=[{"role": "user", "content": PROMPT}],
    format=SCHEMA,
    options={
        "num_ctx": 32768,
        "num_predict": 8192,
        "repeat_penalty": 1.15,
        "repeat_last_n": 256,
        "seed": 0,
    },
)
content = response.message.content
print(f"Length: {len(content)} chars")
print(f"Tail: ...{content[-200:]}")

Expected output

Valid JSON (~500-2000 chars) with description, analysis, and tags fields, properly terminated.
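
A check for this expected shape might look like the sketch below. It assumes only the three required fields and the types declared in SCHEMA; `is_expected_output` is a hypothetical helper name, and it does not check the length or quality of the text.

```python
import json

REQUIRED = ("description", "analysis", "tags")


def is_expected_output(content: str) -> bool:
    """True if content is a terminated JSON object with the three
    required fields and the types declared in the schema."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or any(key not in data for key in REQUIRED):
        return False
    return (
        isinstance(data["description"], str)
        and isinstance(data["analysis"], str)
        and isinstance(data["tags"], list)
        and all(isinstance(tag, str) for tag in data["tags"])
    )
```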

Actual output

Unterminated JSON (300 chars with seed=0, up to ~33,000 chars with other seeds). The description or analysis field enters a repetition loop partway through and fills the remaining token budget.

Expected vs actual behavior

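|                    | Expected                                                    | Actual                                                                          |
|--------------------|-------------------------------------------------------------|---------------------------------------------------------------------------------|
| Output             | Valid, terminated JSON with multi-sentence free-text fields | Word repetition loop fills num_predict budget, JSON unterminated                |
| repeat_penalty     | Higher values should suppress repetition                    | No effect at any tested value (1.0, 1.15, 1.5); same seeds fail identically     |
| Grammar constraint | Should enforce valid JSON structure                         | Grammar allows the repeated word because it's a valid string character sequence |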

System info

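| Component                      | Value                                 |
|--------------------------------|---------------------------------------|
| GPU                            | NVIDIA GeForce RTX 5090 (32 GB VRAM)  |
| Driver (running kernel module) | 580.126.16                            |
| CUDA Version                   | 13.0                                  |
| Ollama                         | 0.20.5                                |
| OS                             | Ubuntu 24.04.3 LTS                    |
| Kernel                         | 6.17.0-14-generic x86_64              |
| CPU                            | AMD Ryzen 7 9800X3D                   |
| Model                          | gemma4:31b (SHA 6316f0629137, 19 GB)  |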

Additional context

Why repeat_penalty has no effect

We tested repeat_penalty at 1.0, 1.15, and 1.5 — identical seeds fail identically at all values. Our hypothesis: the grammar constraint limits token choices at each step, and inside a JSON string value any valid string content (including word repetition) is allowed. If the model's logit distribution degenerates to strongly favor a single token, the grammar has no mechanism to reject it regardless of penalty strength.
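
The hypothesis can be illustrated with a toy calculation. The sketch below uses the common CTRL-style repeat penalty (divide positive logits, multiply negative ones); whether Ollama's sampler applies exactly this rule is an assumption, and the logit values are invented for illustration rather than measured from gemma4:31b.

```python
import math


def apply_repeat_penalty(logits: dict, recent: set, penalty: float) -> dict:
    """CTRL-style penalty: recently seen tokens have positive logits
    divided by the penalty and negative logits multiplied by it."""
    out = {}
    for token, logit in logits.items():
        if token in recent:
            logit = logit / penalty if logit > 0 else logit * penalty
        out[token] = logit
    return out


def top_probability(logits: dict):
    """Softmax over the logits; return the most likely token and its probability."""
    peak = max(logits.values())
    weights = {token: math.exp(value - peak) for token, value in logits.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total


# Invented logits for tokens a JSON-string grammar would still allow.
logits = {"own": 30.0, "sand": 4.0, "waves": 3.0, '"': 1.0}
for penalty in (1.0, 1.15, 1.5):
    token, prob = top_probability(apply_repeat_penalty(logits, {"own"}, penalty))
    print(penalty, token, round(prob, 4))
```

With a gap this large, dividing the dominant logit by 1.5 still leaves it far ahead of every alternative, so the penalised distribution picks the same token. This shows the mechanism is plausible; it does not show that it is what happens inside the model.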

Interaction with ollama#15260

Test 4 (think=False + format=) produced 0/3 repetition bugs but 3/3 JSON failures (output was plain text, not JSON). This confirms #15260: when thinking is disabled, the format constraint is never applied because the end-of-thinking token never fires. This accidentally "fixes" the repetition bug by removing the grammar constraint entirely, but breaks structured output.

gemma4:26b (MoE) behavior

The MoE variant produced 0/3 repetition bugs but 3/3 JSON failures of a different kind (malformed JSON, not repetition loops). The MoE model has separate structured output issues that may be related to #15428.

Not the GEMV buffer overlap bug

Zero <unused> tokens were observed across all 39 trials. The repeated tokens are normal English words ("beach", "own", "same", "companion", "fatigue,en"). This is a distinct bug from the CUDA GEMV fusion buffer overlap fixed in llama.cpp#21566 / b8702.

Related issues

  • ollama/ollama#15260 — think=false breaks format= (format constraint silently ignored)
  • ollama/ollama#15386 — Structured output contradicts model's own thinking (constrained decoding vs thinking tension)
  • ollama/ollama#15350 — Flash Attention hangs on gemma4:31b Dense (different bug, same model)
  • ollama/ollama#15428 — gemma4:26b empty response with long system prompts
  • ggml-org/llama.cpp#21321 — Gemma 4 <unused24> tokens (GEMV buffer overlap — different root cause)
  • ggml-org/llama.cpp#21566 — Fix for GEMV buffer overlap (does NOT fix this bug)

Tested on 2026-04-11. 39 trials across 13 test configurations.

The test matrix and isolation methodology were designed with assistance from Claude Code (Anthropic). All tests were run locally on the hardware described above. Results are deterministic and independently reproducible.

extent analysis

TL;DR

The most likely fix for the word repetition loop in gemma4:31b is to modify the format= constraint or the model's decoding strategy to prevent degeneration into repeated tokens.

Guidance

  • Investigate modifying the format= constraint to reject repeated tokens within a JSON string value.
  • Consider adjusting the model's decoding strategy to penalize repeated tokens more effectively, potentially by introducing a custom penalty function.
  • Review related issues, such as ollama#15260 and ollama#15386, to understand the interactions between the format= constraint, thinking, and structured output.
  • Test additional prompts and seeds to map the failure rate more precisely. The issue's matrix already shows the loop is not prompt-specific (8/12 repetition bugs across 4 prompts).

Example

No specific code change can be recommended without further investigation. The ollama.chat call does not take a custom penalty function, so client-side changes are limited to the options dictionary and the schema passed to format=. Anything beyond that would have to happen in Ollama's sampler or grammar handling.

Notes

The root cause appears to lie in the interaction between the format= constraint and decoding on gemma4:31b specifically: in the issue's tests, gemma3:27b and gemma4:26b did not show the repetition loop under the same schema, and gemma4:31b did not show it without format=. Further investigation is needed to determine the best course of action.

Recommendation

Apply a workaround by modifying the format= constraint or the model's decoding strategy to prevent repeated tokens, as upgrading to a fixed version is not currently an option. This approach may require significant changes to the model or the ollama package, and may not completely eliminate the issue.
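
One client-side stopgap is to retry with a different seed and accept the first output that parses. This is a hypothetical sketch, not a fix proposed in the issue: `first_valid_json` and its `generate` callback are illustrative names, and with reported failure rates of 60-100% every attempt may still fail.

```python
import json
from typing import Callable, Iterable, Optional


def first_valid_json(generate: Callable[[int], str],
                     seeds: Iterable[int]) -> Optional[dict]:
    """Try one seed at a time and return the first output that parses
    as a JSON object, or None if every attempt fails."""
    for seed in seeds:
        try:
            data = json.loads(generate(seed))
        except json.JSONDecodeError:
            continue  # Unterminated or malformed output: try the next seed.
        if isinstance(data, dict):
            return data
    return None
```

Because a looping run consumes the whole num_predict budget, a lower num_predict would cap the cost of each failed attempt.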
