ollama - ✅(Solved) Fix `think=false` breaks `format` (structured output) for `gemma4` — format constraint silently ignored [1 pull requests, 6 comments, 6 participants]

GitHub stats
ollama/ollama#15260, fetched 2026-04-08 02:33:33
Comments: 6
Participants: 6
Timeline: 20
Reactions: 5
Timeline (top): commented ×6, subscribed ×6, cross-referenced ×4, assigned ×1

When using gemma4:26b-a4b-it-q4_K_M with the format parameter (JSON schema structured output), setting think=false causes the format constraint to be completely ignored. The model outputs plain text instead of the requested JSON structure.

If think is omitted (not sent at all), the format works correctly — but the model then defaults to thinking mode, adding unwanted latency.

This is the same class of bug as #14645 (qwen3.5 series), but confirmed to also affect gemma4. gemma4 uses <|think|> tokens in its chat template for thinking control, similar to how qwen3.5 models handle thinking.

Root Cause

Same as described in #14645: Ollama appears to defer format probability masking until it sees the end-of-thinking token. When think=false is set, the thinking tags are closed in the template and the model never outputs the end-of-thinking token, so the masking is never applied.
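
The deferral described above can be illustrated with a toy model. This is a sketch, not Ollama's code: the token strings, the END_THINK marker, and the function first_constrained_index are hypothetical names chosen for illustration. It shows why a generated stream that never contains the end-of-thinking token never reaches the point where the constraint would switch on.

```python
from typing import Optional, Sequence

# Hypothetical marker; real models use their own end-of-thinking token.
END_THINK = "</think>"


def first_constrained_index(tokens: Sequence[str]) -> Optional[int]:
    """Return the index of the first token that would be format-constrained.

    Toy model of deferred masking: the constraint switches on only after
    the end-of-thinking marker has been seen in the generated stream.
    Returns None when no token is ever constrained.
    """
    for i, tok in enumerate(tokens):
        if tok == END_THINK:
            # The constraint applies from the next token on, if there is one.
            return i + 1 if i + 1 < len(tokens) else None
    return None


# Thinking enabled: the model emits the marker, then the answer is constrained.
thinking_stream = ["Let", " me", " think", END_THINK, "{", '"emotion"', "}"]

# think=false: the template has already closed the thinking block, so the
# generated stream contains no marker and masking never starts.
no_think_stream = ["Hello", "!", " How", " are", " you", "?"]

print(first_constrained_index(thinking_stream))
print(first_constrained_index(no_think_stream))
```

In this model the second stream is generated entirely unconstrained, which matches the plain-text output reported in the issue.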

PR fix notes

PR #14660: server: apply format constraint when thinking is disabled

Description (problem / solution / changelog)

Summary

  • Fix format/structured outputs being silently ignored when think=false on thinking-capable models (e.g. qwen3.5)

Problem

When sending think=false + format=json, the structured outputs logic still defers format masking (sets currentFormat = nil) because the condition only checks whether the model has a builtin parser or thinking capability, not whether thinking is actually enabled for the request. Since no thinking content is produced, the restart signal never fires, and format masking is never applied.

Fix

Add a thinkEnabled check so that format masking is only deferred when thinking will actually produce content.
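
The before/after condition can be sketched as follows. This is a simplified Python model of the decision described in the PR text, not the Go code in server/routes.go: the Request fields and both function names are hypothetical, and the exact shape of the real condition is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    has_format: bool          # a format / JSON schema was supplied
    has_builtin_parser: bool  # the model has a builtin parser
    can_think: bool           # the model has thinking capability
    think_enabled: bool       # thinking is actually on for this request


def defers_masking_before_fix(req: Request) -> bool:
    # Looks only at what the model can do, not at the request's think flag.
    return req.has_format and (req.has_builtin_parser or req.can_think)


def defers_masking_after_fix(req: Request) -> bool:
    # Assumed shape of the fix: also require thinking to be enabled.
    return defers_masking_before_fix(req) and req.think_enabled


# The failing case from the issue: thinking-capable model, think=false, format set.
req = Request(has_format=True, has_builtin_parser=False,
              can_think=True, think_enabled=False)
print(defers_masking_before_fix(req), defers_masking_after_fix(req))
```

Under this model, a deferral with think=false waits for a restart signal that never arrives, while the added check lets the constraint apply from the first generated token.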

Test plan

  • Added a "format applied when think disabled" test to TestChatWithPromptEndingInThinkTag
  • Verifies that with think=false + format, the completion receives the format constraint directly (single call, format not nil)
  • All existing structured outputs tests pass (think=true behavior unchanged)

Fixes #14645

Changed files

  • server/routes.go (modified, +2/-1)
  • server/routes_generate_test.go (modified, +63/-0)

Code Example

# ❌ FAIL: think=false + format → format is silently IGNORED, outputs plain text
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b-a4b-it-q4_K_M",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false,
  "think": false,
  "format": {
    "type": "object",
    "properties": {
      "emotion": {"type": "string", "enum": ["happy","sad","neutral"]},
      "response_text": {"type": "string"}
    },
    "required": ["emotion", "response_text"]
  }
}' | python -m json.tool
# → message.content = plain text (NOT JSON), format completely ignored

# ✅ OK: think omitted + format → format works, but model defaults to thinking (extra latency)
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b-a4b-it-q4_K_M",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "emotion": {"type": "string", "enum": ["happy","sad","neutral"]},
      "response_text": {"type": "string"}
    },
    "required": ["emotion", "response_text"]
  }
}' | python -m json.tool
# → message.content = valid JSON: {"emotion": "happy", "response_text": "..."}
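
For readers scripting the comparison, the two curl requests above can be approximated in Python using only the standard library. This is a minimal sketch: build_chat_payload and post_chat are hypothetical helper names, and the default base URL simply mirrors the one used in the curl commands.

```python
import json
import urllib.request
from typing import Optional

# Same schema as in the curl reproduction.
SCHEMA = {
    "type": "object",
    "properties": {
        "emotion": {"type": "string", "enum": ["happy", "sad", "neutral"]},
        "response_text": {"type": "string"},
    },
    "required": ["emotion", "response_text"],
}


def build_chat_payload(model: str, user_text: str,
                       think: Optional[bool] = None) -> dict:
    """Build the /api/chat body; think=None means the key is omitted entirely."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        "stream": False,
        "format": SCHEMA,
    }
    if think is not None:
        payload["think"] = think
    return payload


def post_chat(payload: dict, base_url: str = "http://localhost:11434") -> dict:
    """POST the payload to /api/chat and return the decoded JSON response."""
    req = urllib.request.Request(
        base_url + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a local server running, post_chat(build_chat_payload("gemma4:26b-a4b-it-q4_K_M", "Hello!", think=False))["message"]["content"] would correspond to the failing request, and leaving think at its default of None to the working one.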

---

Server log shows no errors — all 4 requests returned HTTP 200:

[GIN] 2026/04/03 - 14:27:27 | 200 | 787.3221ms | 127.0.0.1 | POST "/api/chat"  ← Test 1 (think=false, ~0.8s, format IGNORED)
[GIN] 2026/04/03 - 14:27:33 | 200 | 3.8770242s | 127.0.0.1 | POST "/api/chat"  ← Test 2 (think omitted -> default: true, ~3.9s, format OK)
[GIN] 2026/04/03 - 14:27:36 | 200 | 645.5903ms | 127.0.0.1 | POST "/api/chat"  ← Test 3 (think=false, ~0.6s, format IGNORED)
[GIN] 2026/04/03 - 14:27:42 | 200 | 4.0513075s | 127.0.0.1 | POST "/api/chat"  ← Test 4 (think omitted -> default: true, ~4.1s, format OK)

Note: think=false requests complete much faster (~0.7s vs ~4s) because the model skips
thinking — but format constraint is silently ignored, producing plain text instead of JSON.
No warnings or errors are logged server-side when format is ignored.
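
The "~0.7s vs ~4s" figure can be rechecked from the four log lines. The sketch below parses the duration column and averages it per mode; the regular expression and the function mean_latency_seconds are illustrative helpers written for this page, and they only handle the ms and s units that appear in this log.

```python
import re
from statistics import mean
from typing import Dict, List

GIN_LOG = """\
[GIN] 2026/04/03 - 14:27:27 | 200 | 787.3221ms | 127.0.0.1 | POST "/api/chat"  ← Test 1 (think=false)
[GIN] 2026/04/03 - 14:27:33 | 200 | 3.8770242s | 127.0.0.1 | POST "/api/chat"  ← Test 2 (think omitted)
[GIN] 2026/04/03 - 14:27:36 | 200 | 645.5903ms | 127.0.0.1 | POST "/api/chat"  ← Test 3 (think=false)
[GIN] 2026/04/03 - 14:27:42 | 200 | 4.0513075s | 127.0.0.1 | POST "/api/chat"  ← Test 4 (think omitted)
"""

# Matches the duration column, e.g. "| 787.3221ms |" or "| 3.8770242s |".
DURATION = re.compile(r"\|\s*([\d.]+)(ms|s)\s*\|")


def mean_latency_seconds(log_text: str) -> Dict[str, float]:
    """Average request duration in seconds, grouped by think mode."""
    buckets: Dict[str, List[float]] = {"think=false": [], "think omitted": []}
    for line in log_text.splitlines():
        match = DURATION.search(line)
        if not match:
            continue
        value, unit = float(match.group(1)), match.group(2)
        seconds = value / 1000.0 if unit == "ms" else value
        key = "think=false" if "think=false" in line else "think omitted"
        buckets[key].append(seconds)
    return {key: mean(vals) for key, vals in buckets.items() if vals}


for mode, secs in mean_latency_seconds(GIN_LOG).items():
    print(f"{mode}: {secs:.2f}s")
```

The averages come out at roughly 0.72 s for think=false and 3.96 s with think omitted, consistent with the note above.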
What is the issue?

Environment

  • Ollama version: 0.20.0
  • Model: gemma4:26b-a4b-it-q4_K_M (SHA: 7121486771cb)
  • OS: Windows 11 (10.0.26200)
  • GPU: NVIDIA GeForce RTX 4090 (Driver 582.32)
  • CPU: Intel Core i9-14900
  • Tested via: Direct HTTP API calls (curl / requests.post) — not SDK-specific

Test Results (4 scenarios, all via HTTP)

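| # | Mode       | think     | format         | Result                        |
|---|------------|-----------|----------------|-------------------------------|
| 1 | non-stream | false     | ✅ JSON schema | Plain text, format ignored    |
| 2 | non-stream | (omitted) | ✅ JSON schema | ✅ Valid JSON (emotion=happy) |
| 3 | stream     | false     | ✅ JSON schema | Plain text, format ignored    |
| 4 | stream     | (omitted) | ✅ JSON schema | ✅ Valid JSON (emotion=happy) |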

Expected Behavior

think=false + format should produce valid JSON matching the schema (same as when think is omitted, but without the thinking overhead).

Actual Behavior

When think=false is sent, the format constraint is silently dropped. The model generates unconstrained plain text as if format was never specified.
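
To tell the expected and actual behaviors apart programmatically, a response can be checked against the schema from the reproduction. The sketch below is a hand-rolled check for this one schema, not a general JSON Schema validator; matches_schema is a hypothetical helper name.

```python
import json

ALLOWED_EMOTIONS = {"happy", "sad", "neutral"}


def matches_schema(content: str) -> bool:
    """True if content is a JSON object with a valid emotion and a string response_text."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        # Plain text, as produced when the format constraint is dropped.
        return False
    if not isinstance(data, dict):
        return False
    emotion = data.get("emotion")
    text = data.get("response_text")
    return (
        isinstance(emotion, str)
        and emotion in ALLOWED_EMOTIONS
        and isinstance(text, str)
    )


print(matches_schema('{"emotion": "happy", "response_text": "Hello!"}'))
print(matches_schema("Hello! How can I help you today?"))
```

Applied to message.content, this returns True for the constrained output and False for the unconstrained plain text described above.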

Notes

  • Not SDK-specific: Tested with both ollama Python SDK and raw HTTP POST to /api/chat — identical behavior.
  • Not model-specific to qwen3.5: This affects gemma4 as well. Other models without thinking templates (e.g., gpt-oss:20b) work correctly with think=false + format.
  • Related: #14645 (qwen3.5), #14850 (qwen3.5:27b, closed as dup), #10929 (invalid JSON with think=true), #10538 (feature request for thinking + structured output)

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.20.0

extent analysis

TL;DR

The issue can be worked around by omitting the think parameter to ensure the format constraint is applied, although this will introduce additional latency due to the model defaulting to thinking mode.

Guidance

  • Root cause: Ollama appears to defer format probability masking until it sees the end-of-thinking token. With think=false that token is never generated, so the masking is never applied.
  • To verify the issue, compare the response when think is omitted versus when think=false is explicitly set, focusing on whether the format constraint is respected.
  • Consider testing with other models that do not use thinking templates to see if the issue is specific to models like gemma4.
  • Review related issues (#14645, #14850, #10929, #10538) for potential insights or workarounds.

Example

The reproduction consists of the two /api/chat requests in the Code Example section above: identical payloads that differ only in whether "think": false is sent.

Notes

Omitting think is a workaround rather than a fix, because it reintroduces thinking latency. The underlying cause is addressed server-side by PR #14660, which defers format masking only when thinking is actually enabled for the request.

Recommendation

Apply the workaround by omitting the think parameter in requests where the format constraint is crucial, accepting the additional latency as a trade-off for correct functionality.
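
One way to apply this recommendation without always paying the thinking latency is a client-side fallback: try think=false first and retry with think omitted only if the output is not JSON. This is a sketch of that idea, not something proposed in the issue. chat_with_fallback and the send callback are hypothetical; send(think) is assumed to perform the request and return message.content, with None meaning the think key is omitted.

```python
import json
from typing import Callable, Optional, Tuple


def _is_json_object(text: str) -> bool:
    """True if text parses as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def chat_with_fallback(
    send: Callable[[Optional[bool]], str],
) -> Tuple[str, bool]:
    """Return (content, used_fallback).

    Tries the fast path (think=False). If the format constraint was ignored
    and plain text came back, retries with think omitted.
    """
    content = send(False)
    if _is_json_object(content):
        return content, False
    return send(None), True


def fake_send(think: Optional[bool]) -> str:
    """Stand-in for a real request, mimicking the behavior reported in the issue."""
    if think is False:
        return "Hello! How can I help you today?"
    return '{"emotion": "happy", "response_text": "Hello!"}'


content, used_fallback = chat_with_fallback(fake_send)
print(used_fallback, content)
```

On a server where the constraint is honored with think=false, the first request succeeds and no retry is made. On an affected server the cost is one extra fast request before the slower one.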
