ollama - 💡(How to fix) Fix gemma4:31b repetition loop during constrained JSON generation with free-text string fields [4 comments, 2 participants]

ollama • 2026-04-11 14:23:11


GitHub stats

ollama/ollama#15502 • Fetched 2026-04-12 13:24:16


Author

rnh0

Participants

rick-github

rnh0

Timeline (top)

commented ×4, subscribed ×2, cross-referenced ×1

gemma4:31b enters word-repetition loops when generating long free-text strings inside a JSON schema constraint (format=). A word doubles, then collapses into a single repeated token that fills the remaining num_predict budget, leaving the JSON unterminated. Bug rate is 60-100% depending on the prompt, across 39 trials.

This is NOT the <unused> token / GEMV buffer overlap bug (llama.cpp#21321 / #21566). The repeated tokens are normal English words, and zero <unused> tokens were observed in any trial.


Observed behavior

Actual output from the minimal repro below (seed=0):

{
  "description": "The horizon is a bleeding, vibrant, orange-gold single own line where the sky meets
the ocean, as the sun dips low, casting a long, shimmering path of amber own own own own own own own
own own own own own own own own own own own own own own own own own own own own own own own own own

(300 chars, unterminated JSON — the "own" token repeats until num_predict is exhausted.)

The cascade pattern:

1. Normal generation starts fine inside a JSON string value.
2. A token intrudes: "amber own".
3. Output collapses into a single repeated token: "own own own own own...".
4. The repetition fills the remaining num_predict budget (8192 tokens).
5. The JSON is left unterminated -> parse error.
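The cascade above can be detected after the fact by measuring the run of identical words at the end of the output. This is a minimal sketch, not part of the original report: the function names and the `min_run` threshold of 20 are hypothetical, and it only catches loops of a single whitespace-separated word.

```python
def trailing_repeat_run(text: str) -> tuple[str, int]:
    """Return (word, count) for the run of identical words at the end of text."""
    words = text.split()
    if not words:
        return ("", 0)
    last = words[-1]
    count = 0
    # Walk backwards until a different word appears.
    for word in reversed(words):
        if word != last:
            break
        count += 1
    return (last, count)


def looks_like_repetition_loop(text: str, min_run: int = 20) -> bool:
    """Heuristic: treat a long trailing run of one word as a repetition loop."""
    _, count = trailing_repeat_run(text)
    return count >= min_run


# Synthetic tail modeled on the seed=0 output shown above.
sample = "casting a long, shimmering path of amber " + "own " * 30
print(trailing_repeat_run(sample))
print(looks_like_repetition_loop(sample))
```

A check like this could run on `response.message.content` before attempting `json.loads`, so that a repetition loop is reported separately from other kinds of malformed JSON.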

Root cause isolation

We ran 39 trials across 13 test configurations, varying one condition at a time. Three conditions are all required to trigger the bug:

#	Test	Rep Bugs	JSON Fail	What it proves
1	4 different prompts + schema + free-text	8/12	10/12	Not prompt-specific (60-100% rate)
2	no format= (free generation)	0/3	0/3	format= IS required
3	schema + no free-text fields	0/3	0/3	Free-text strings in JSON trigger it
4	schema + free-text + think=False	0/3	3/3	No repetition, but JSON broken (#15260)
5a	gemma4:26b (MoE) + schema + free-text	0/3	3/3	Dense (31b) only, MoE has different JSON issues
5b	gemma3:27b + schema + free-text	0/3	0/3	gemma4-specific regression
6a	repeat_penalty=1.0	2/3	2/3	Penalty has no effect
6b	repeat_penalty=1.15	2/3	2/3	Same seeds fail regardless
6c	repeat_penalty=1.5	2/3	2/3	Cannot suppress the cascade

The three necessary conditions

1. gemma4:31b (Dense): gemma4:26b (MoE) and gemma3:27b do not exhibit this bug.
2. format= with a JSON schema: removing the grammar constraint eliminates the bug entirely.
3. Free-text string fields in the schema (e.g., "description": {"type": "string"} requesting multi-sentence output): a simple schema with only arrays and enums is clean.

Vision input is not required — text-only prompts reproduce at the same rate.

Note: The test matrix was collected using a longer prompt with vision input variants. The same bug reproduces with the simplified text-only prompt shown below. The minimal repro was verified independently.
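To check whether a given schema meets the third condition, a small helper can list the properties that are plain strings. This is a sketch under an assumption of mine: "free-text" is taken to mean an object property of type string with no enum or const restriction. Array items are ignored, following the report's observation that array-and-enum schemas were clean. The helper name is hypothetical.

```python
def free_text_fields(schema: dict, prefix: str = "") -> list[str]:
    """List object properties typed as plain strings with no enum/const restriction."""
    found = []
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if spec.get("type") == "string" and "enum" not in spec and "const" not in spec:
            found.append(path)
        elif spec.get("type") == "object":
            # Recurse into nested objects; array items are deliberately skipped.
            found.extend(free_text_fields(spec, prefix=path + "."))
    return found


# Schema from the report's minimal reproduction.
SCHEMA = {
    "type": "object",
    "required": ["description", "analysis", "tags"],
    "properties": {
        "description": {"type": "string", "description": "At least 3 detailed sentences."},
        "analysis": {"type": "string", "description": "Several paragraphs of analysis."},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

print(free_text_fields(SCHEMA))  # ['description', 'analysis']
```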

Minimal reproduction

No images, no dependencies beyond the ollama Python package. Seeds 0 and 84 hit the repetition loop; seed 42 produces malformed JSON of a different kind. All three seeds produce broken output (3/3).

import ollama

SCHEMA = {
    "type": "object",
    "required": ["description", "analysis", "tags"],
    "properties": {
        "description": {
            "type": "string",
            "description": "At least 3 detailed sentences.",
        },
        "analysis": {
            "type": "string",
            "description": "Several paragraphs of analysis.",
        },
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

PROMPT = (
    "Describe a beach scene at sunset in detail. "
    "Write at least 3 full sentences for description "
    "and several paragraphs for analysis."
)

response = ollama.chat(
    model="gemma4:31b",
    messages=[{"role": "user", "content": PROMPT}],
    format=SCHEMA,
    options={
        "num_ctx": 32768,
        "num_predict": 8192,
        "repeat_penalty": 1.15,
        "repeat_last_n": 256,
        "seed": 0,
    },
)
content = response.message.content
print(f"Length: {len(content)} chars")
print(f"Tail: ...{content[-200:]}")

Expected output

Valid JSON (~500-2000 chars) with description, analysis, and tags fields, properly terminated.

Actual output

Unterminated JSON (300 chars with seed=0, up to ~33,000 chars with other seeds). The description or analysis field enters a repetition loop partway through and fills the remaining token budget.
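Because the loop otherwise runs until the token budget is spent, a client could stop reading a streamed response once the same word repeats too many times. The sketch below is a client-side guard, not a fix and not something tested in the report. `stop_on_repetition` and the `max_run` threshold are hypothetical names. It works on any iterable of text chunks, so it can be tried without a model.

```python
import re
from typing import Iterable, Iterator


def stop_on_repetition(chunks: Iterable[str], max_run: int = 20) -> Iterator[str]:
    """Yield chunks until one whitespace-separated word repeats max_run times in a row."""
    last_word, run, pending = None, 0, ""
    for chunk in chunks:
        yield chunk
        # The final piece may be an incomplete word; carry it into the next chunk.
        *complete, pending = re.split(r"\s+", pending + chunk)
        for word in complete:
            if not word:
                continue
            if word == last_word:
                run += 1
            else:
                last_word, run = word, 1
            if run >= max_run:
                return  # stop consuming the stream


# Simulated stream: normal text followed by a degenerate tail.
simulated = ["The sun ", "dips low, amber"] + [" own"] * 100
received = list(stop_on_repetition(simulated, max_run=20))
print(len(received), "of", len(simulated), "chunks consumed")
```

With the ollama Python package, the chunks would presumably come from `ollama.chat(..., stream=True)`, taking the content of each streamed part. Whether abandoning the stream also stops generation on the server promptly is not covered by the report.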

Expected vs actual behavior

	Expected	Actual
Output	Valid, terminated JSON with multi-sentence free-text fields	Word repetition loop fills num_predict budget, JSON unterminated
repeat_penalty	Higher values should suppress repetition	No effect at any tested value (1.0, 1.15, 1.5) — same seeds fail identically
Grammar constraint	Should enforce valid JSON structure	Grammar allows the repeated word because it's a valid string character sequence

System info

Component	Value
GPU	NVIDIA GeForce RTX 5090 (32 GB VRAM)
Driver (running kernel module)	580.126.16
CUDA Version	13.0
Ollama	0.20.5
OS	Ubuntu 24.04.3 LTS
Kernel	6.17.0-14-generic x86_64
CPU	AMD Ryzen 7 9800X3D
Model	gemma4:31b (SHA `6316f0629137`, 19 GB)

Additional context

Why repeat_penalty has no effect

We tested repeat_penalty at 1.0, 1.15, and 1.5 — identical seeds fail identically at all values. Our hypothesis: the grammar constraint limits token choices at each step, and inside a JSON string value any valid string content (including word repetition) is allowed. If the model's logit distribution degenerates to strongly favor a single token, the grammar has no mechanism to reject it regardless of penalty strength.
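The hypothesis can be illustrated with a toy calculation. The sketch assumes the common penalty formulation in which positive logits of recently seen tokens are divided by the penalty and negative ones are multiplied by it. It is not Ollama's sampler, and the logit values are made up to represent a degenerate distribution.

```python
import math


def apply_repeat_penalty(logits: dict, recent_tokens: list, penalty: float) -> dict:
    """Penalize recently seen tokens: shrink positive logits, push negative ones lower."""
    out = dict(logits)
    for tok in set(recent_tokens):
        if tok in out:
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def softmax(logits: dict) -> dict:
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


# Hypothetical logits at a step where one token dominates inside a JSON string.
logits = {" own": 30.0, " light": 8.0, '"': 5.0}

for penalty in (1.0, 1.15, 1.5):
    probs = softmax(apply_repeat_penalty(logits, [" own"], penalty))
    top = max(probs, key=probs.get)
    print(f"penalty={penalty}: top token={top!r}, p={probs[top]:.6f}")
```

Under these assumed numbers the dominant token stays on top at every tested penalty, and the grammar accepts it because it is valid string content. Real logits may differ, so this shows only that the hypothesis is arithmetically plausible.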

Interaction with ollama#15260

Test 4 (think=False + format=) produced 0/3 repetition bugs but 3/3 JSON failures (output was plain text, not JSON). This is consistent with #15260: when thinking is disabled, the format constraint is never applied because the end-of-thinking token never fires. This accidentally "fixes" the repetition bug by removing the grammar constraint entirely, but it breaks structured output.

gemma4:26b (MoE) behavior

The MoE variant produced 0/3 repetition bugs but 3/3 JSON failures of a different kind (malformed JSON, not repetition loops). The MoE model has separate structured output issues that may be related to #15428.

Not the GEMV buffer overlap bug

Zero <unused> tokens were observed across all 39 trials. The repeated tokens are normal English words ("beach", "own", "same", "companion", "fatigue,en"). This is a distinct bug from the CUDA GEMV fusion buffer overlap fixed in llama.cpp#21566 / b8702.

Related issues

ollama/ollama#15260 — think=false breaks format= (format constraint silently ignored)
ollama/ollama#15386 — Structured output contradicts model's own thinking (constrained decoding vs thinking tension)
ollama/ollama#15350 — Flash Attention hangs on gemma4:31b Dense (different bug, same model)
ollama/ollama#15428 — gemma4:26b empty response with long system prompts
ggml-org/llama.cpp#21321 — Gemma 4 <unused24> tokens (GEMV buffer overlap — different root cause)
ggml-org/llama.cpp#21566 — Fix for GEMV buffer overlap (does NOT fix this bug)

Tested on 2026-04-11. 39 trials across 13 test configurations.

The test matrix and isolation methodology were designed with assistance from Claude Code (Anthropic). All tests were run locally on the hardware described above. Results are deterministic and independently reproducible.

extent analysis

TL;DR

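No fix is confirmed. The report isolates the word repetition loop to the combination of gemma4:31b, format= with a JSON schema, and free-text string fields, so candidate mitigations target the grammar constraint or the decoding step rather than repeat_penalty, which had no effect at 1.0, 1.15, or 1.5.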

Guidance

Investigate whether the grammar generated from format= can bound free-text string values, if the schema-to-grammar conversion supports such limits. A JSON string grammar on its own accepts any repeated word.
Do not rely on repeat_penalty: the report saw the same seeds fail at 1.0, 1.15, and 1.5. A decoding-side fix would need a different mechanism, which the report does not identify.
Review related issues, such as ollama#15260 and ollama#15386, to understand the interactions between the format= constraint, thinking, and structured output.
After any change, re-run the minimal reproduction with seeds 0, 42, and 84. The report already shows the failure is not prompt-specific (4 prompts, 60-100% failure rate).

Example

No specific code change is confirmed to resolve the issue. Adjusting sampling options in the ollama.chat call is unlikely to be enough on its own, since repeat_penalty had no effect in the report. Any mitigation would have to come from the schema, the request flow, or client-side handling of the output.
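One request-flow option follows from test 2 in the report, where generation without format= produced 0/3 repetition bugs and 0/3 JSON failures: drop the grammar constraint, ask for JSON in the prompt, and validate the result on the client. The sketch below covers only the validation step. `parse_and_check` and `REQUIRED` are hypothetical names, and this trades the grammar's structural guarantee for a check after the fact.

```python
import json


def parse_and_check(text: str, required: dict) -> dict:
    """Parse model output as a JSON object and verify required keys and their types."""
    # Tolerate prose or code fences around the object by slicing brace to brace.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    data = json.loads(text[start:end + 1])  # JSONDecodeError is a ValueError
    for key, expected in required.items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"field {key!r} missing or not {expected.__name__}")
    return data


# Field names and types taken from the schema in the minimal reproduction.
REQUIRED = {"description": str, "analysis": str, "tags": list}

ok = 'Sure: {"description": "d", "analysis": "a", "tags": ["x"]} Done.'
print(parse_and_check(ok, REQUIRED)["tags"])

try:
    parse_and_check('{"description": "amber own own own', REQUIRED)
except ValueError as err:
    print("rejected:", err)
```

On a ValueError the caller can retry or report the failure instead of passing broken output downstream.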

Notes

The root cause is not established. The report's isolation points to an interaction between grammar-constrained decoding (format=) and gemma4:31b specifically: gemma3:27b with the same schema was clean (0/3), and gemma4:26b failed in a different way. Further investigation is needed to determine the best course of action.

Recommendation

The report does not identify a fixed version, so until one is available treat this as needing a workaround: change how the format= constraint is used, or handle the loop on the client side. Neither approach is verified by the report, and neither may completely eliminate the issue.
