ollama - ✅(Solved) Fix `think=false` breaks `format` (structured output) for `gemma4` — format constraint silently ignored [1 pull requests, 6 comments, 6 participants]

ollama · 2026-04-03 06:41:22

ollama/ollama#15260 • Fetched 2026-04-08 02:33:33

When using gemma4:26b-a4b-it-q4_K_M with the format parameter (JSON schema structured output), setting think=false causes the format constraint to be completely ignored. The model outputs plain text instead of the requested JSON structure.

If think is omitted (not sent at all), the format works correctly — but the model then defaults to thinking mode, adding unwanted latency.

This is the same class of bug as #14645 (qwen3.5 series), but confirmed to also affect gemma4. gemma4 uses <|think|> tokens in its chat template for thinking control, similar to how qwen3.5 models handle thinking.

Root Cause

Same as described in #14645: Ollama appears to defer format probability masking until it sees the end-of-thinking token. When think=false is set, the thinking tags are closed in the template and the model never outputs the end-of-thinking token, so the masking is never applied.
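The deferral described above can be illustrated with a toy model. This is a hypothetical Python sketch, not Ollama's code (which is written in Go): the `</think>` marker, the token lists, and the function name are all assumptions made for illustration.

```python
END_OF_THINKING = "</think>"  # placeholder marker; the real token depends on the model


def masking_flags(tokens, has_format, defer_until_thinking_ends):
    """For each generated token, report whether format masking was active
    at the moment that token was sampled."""
    active = has_format and not defer_until_thinking_ends
    flags = []
    for tok in tokens:
        flags.append(active)
        # Deferred masking only switches on once the end-of-thinking token appears.
        if has_format and defer_until_thinking_ends and tok == END_OF_THINKING:
            active = True
    return flags


# Thinking on: the model emits the end-of-thinking token, so masking starts after it.
thinking_run = masking_flags(["hmm", END_OF_THINKING, "{", "}"], True, True)

# think=false: the template already closed the thinking block, so the model never
# emits the token and the deferred masking never starts.
no_thinking_run = masking_flags(["Hello", "!"], True, True)
```

In the second run every flag stays `False`, which corresponds to the unconstrained plain-text output reported in this issue.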

PR fix notes

PR #14660: server: apply format constraint when thinking is disabled

Repository: ollama/ollama
Author: majiayu000
State: open | merged: False
Link: https://github.com/ollama/ollama/pull/14660

Description (problem / solution / changelog)

Summary

Fix format/structured outputs being silently ignored when think=false on thinking-capable models (e.g. qwen3.5)

Problem

When sending think=false + format=json, the structured outputs logic still defers format masking (sets currentFormat = nil) because the condition only checks whether the model has a builtin parser or thinking capability, not whether thinking is actually enabled for the request. Since no thinking content is produced, the restart signal never fires, and format masking is never applied.

Fix

Add a thinkEnabled check so that format masking is only deferred when thinking will actually produce content.
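The change in condition can be paraphrased as follows. This is a Python sketch of the logic as the PR description words it, not the actual Go diff in server/routes.go; the function and parameter names are hypothetical, and treating an omitted `think` as enabled is an assumption based on the "default: true" annotations in the issue's logs.

```python
from typing import Optional


def think_enabled(think: Optional[bool]) -> bool:
    # Assumption: an omitted `think` (None) means thinking is on for
    # thinking-capable models.
    return think is None or think


def defer_format_before(has_format, has_builtin_parser, can_think, think):
    # Pre-fix behaviour as described: the request's `think` value is not consulted.
    return has_format and (has_builtin_parser or can_think)


def defer_format_after(has_format, has_builtin_parser, can_think, think):
    # Post-fix behaviour as described: defer only if thinking will actually run.
    deferred = defer_format_before(has_format, has_builtin_parser, can_think, think)
    return deferred and think_enabled(think)
```

With `think=False` on a thinking-capable model, `defer_format_before` still returns `True` (masking is deferred and then never applied), while `defer_format_after` returns `False`, so the constraint would be applied from the first token.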

Test plan

Added a "format applied when think disabled" test to TestChatWithPromptEndingInThinkTag.
Verifies that with think=false + format, the completion receives the format constraint directly (single call, format not nil).
All existing structured outputs tests pass (think=true behavior unchanged).

Fixes #14645

Changed files

server/routes.go (modified, +2/-1)
server/routes_generate_test.go (modified, +63/-0)

What is the issue?

Environment

Ollama version: 0.20.0
Model: gemma4:26b-a4b-it-q4_K_M (SHA: 7121486771cb)
OS: Windows 11 (10.0.26200)
GPU: NVIDIA GeForce RTX 4090 (Driver 582.32)
CPU: Intel Core i9-14900
Tested via: Direct HTTP API calls (curl / requests.post) — not SDK-specific

Minimal Reproduction (via HTTP)

# ❌ FAIL: think=false + format → format is silently IGNORED, outputs plain text
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b-a4b-it-q4_K_M",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false,
  "think": false,
  "format": {
    "type": "object",
    "properties": {
      "emotion": {"type": "string", "enum": ["happy","sad","neutral"]},
      "response_text": {"type": "string"}
    },
    "required": ["emotion", "response_text"]
  }
}' | python -m json.tool
# → message.content = plain text (NOT JSON), format completely ignored

# ✅ OK: think omitted + format → format works, but model defaults to thinking (extra latency)
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b-a4b-it-q4_K_M",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "emotion": {"type": "string", "enum": ["happy","sad","neutral"]},
      "response_text": {"type": "string"}
    },
    "required": ["emotion", "response_text"]
  }
}' | python -m json.tool
# → message.content = valid JSON: {"emotion": "happy", "response_text": "..."}
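
The same two requests can be driven from Python, which also makes the pass/fail check explicit. This is a minimal sketch using only the standard library; the helper names (`build_payload`, `matches_schema`, `chat`) are made up for illustration, and `chat` assumes a non-streaming response whose text sits under `message.content`, as in the curl output above.

```python
import json
import urllib.request

SCHEMA = {
    "type": "object",
    "properties": {
        "emotion": {"type": "string", "enum": ["happy", "sad", "neutral"]},
        "response_text": {"type": "string"},
    },
    "required": ["emotion", "response_text"],
}


def build_payload(think=None, stream=False, model="gemma4:26b-a4b-it-q4_K_M"):
    """Build the /api/chat body; `think` is left out entirely when it is None."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ],
        "stream": stream,
        "format": SCHEMA,
    }
    if think is not None:
        payload["think"] = think
    return payload


def matches_schema(content):
    """Hand-rolled check for this one schema (not a general JSON Schema validator)."""
    try:
        data = json.loads(content)
    except (TypeError, ValueError):
        return False
    return (
        isinstance(data, dict)
        and data.get("emotion") in SCHEMA["properties"]["emotion"]["enum"]
        and isinstance(data.get("response_text"), str)
    )


def chat(payload, url="http://localhost:11434/api/chat"):
    """Send a non-streaming request and return message.content."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Against a local server, `matches_schema(chat(build_payload(think=False)))` versus `matches_schema(chat(build_payload()))` reproduces the comparison; per this report, the first is expected to fail on affected versions.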

Test Results (4 scenarios, all via HTTP)

| # | Mode | `think` | `format` | Result |
|---|------|---------|----------|--------|
| 1 | non-stream | `false` | ✅ JSON schema | ❌ Plain text, format ignored |
| 2 | non-stream | (omitted) | ✅ JSON schema | ✅ Valid JSON (`emotion=happy`) |
| 3 | stream | `false` | ✅ JSON schema | ❌ Plain text, format ignored |
| 4 | stream | (omitted) | ✅ JSON schema | ✅ Valid JSON (`emotion=happy`) |

Expected Behavior

think=false + format should produce valid JSON matching the schema (same as when think is omitted, but without the thinking overhead).

Actual Behavior

When think=false is sent, the format constraint is silently dropped. The model generates unconstrained plain text as if format was never specified.

Root Cause Analysis

Notes

Not SDK-specific: Tested with both ollama Python SDK and raw HTTP POST to /api/chat — identical behavior.
Not model-specific to qwen3.5: This affects gemma4 as well. Other models without thinking templates (e.g., gpt-oss:20b) work correctly with think=false + format.
Related: #14645 (qwen3.5), #14850 (qwen3.5:27b, closed as dup), #10929 (invalid JSON with think=true), #10538 (feature request for thinking + structured output)

Relevant log output

Server log shows no errors — all 4 requests returned HTTP 200:

[GIN] 2026/04/03 - 14:27:27 | 200 | 787.3221ms | 127.0.0.1 | POST "/api/chat"  ← Test 1 (think=false, ~0.8s, format IGNORED)
[GIN] 2026/04/03 - 14:27:33 | 200 | 3.8770242s | 127.0.0.1 | POST "/api/chat"  ← Test 2 (think omitted -> default: true, ~3.9s, format OK)
[GIN] 2026/04/03 - 14:27:36 | 200 | 645.5903ms | 127.0.0.1 | POST "/api/chat"  ← Test 3 (think=false, ~0.6s, format IGNORED)
[GIN] 2026/04/03 - 14:27:42 | 200 | 4.0513075s | 127.0.0.1 | POST "/api/chat"  ← Test 4 (think omitted -> default: true, ~4.1s, format OK)

Note: think=false requests complete much faster (~0.7s vs ~4s) because the model skips
thinking — but format constraint is silently ignored, producing plain text instead of JSON.
No warnings or errors are logged server-side when format is ignored.
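
The approximate latencies quoted in the note can be recomputed from the four log lines. The sketch below is illustrative only: the regular expression is an assumption about the GIN line layout shown above, and the `think` labels are copied from the log annotations.

```python
import re

# (think setting, log line) pairs, taken from the log output above.
LOG_LINES = [
    ("false", '[GIN] 2026/04/03 - 14:27:27 | 200 | 787.3221ms | 127.0.0.1 | POST "/api/chat"'),
    ("omitted", '[GIN] 2026/04/03 - 14:27:33 | 200 | 3.8770242s | 127.0.0.1 | POST "/api/chat"'),
    ("false", '[GIN] 2026/04/03 - 14:27:36 | 200 | 645.5903ms | 127.0.0.1 | POST "/api/chat"'),
    ("omitted", '[GIN] 2026/04/03 - 14:27:42 | 200 | 4.0513075s | 127.0.0.1 | POST "/api/chat"'),
]

UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "µs": 1e-6}
# Matches "| <status> | <number><unit> |" and captures the number and unit.
LATENCY = re.compile(r"\|\s*\d{3}\s*\|\s*([\d.]+)(ms|µs|s)\s*\|")


def latency_seconds(line):
    """Extract the request latency from one GIN log line, in seconds."""
    match = LATENCY.search(line)
    if match is None:
        raise ValueError(f"no latency field found in: {line!r}")
    return float(match.group(1)) * UNIT_SECONDS[match.group(2)]


def mean_latency(think):
    values = [latency_seconds(line) for label, line in LOG_LINES if label == think]
    return sum(values) / len(values)


avg_think_false = mean_latency("false")
avg_think_omitted = mean_latency("omitted")
```

The two averages come out near 0.72 s and 3.96 s, consistent with the "~0.7s vs ~4s" figures in the note.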

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.20.0

extent analysis

TL;DR

The issue can be worked around by omitting the think parameter to ensure the format constraint is applied, although this will introduce additional latency due to the model defaulting to thinking mode.

Guidance

The root cause appears to be that Ollama defers format probability masking until it sees the end-of-thinking token. With think=false the template closes the thinking tags up front, the model never emits that token, and the masking is never applied.
To verify the issue, compare the response when think is omitted versus when think=false is explicitly set, focusing on whether the format constraint is respected.
Consider testing with other models that do not use thinking templates to see if the issue is specific to models like gemma4.
Review related issues (#14645, #14850, #10929, #10538) for potential insights or workarounds.

Example

See the two curl requests under Minimal Reproduction (via HTTP): they are identical except that the failing one sends think=false and the working one omits think.

Notes

Omitting think is a workaround rather than a fix, since it brings back the thinking latency. A server-side fix is proposed in PR #14660 (listed on this page as open and not merged), which applies the format constraint directly when thinking is disabled.

Recommendation

Apply the workaround by omitting the think parameter in requests where the format constraint is crucial, accepting the additional latency as a trade-off for correct functionality.
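
On the client side, the workaround can be wrapped in a small helper. This is a sketch under stated assumptions: `apply_workaround` is a made-up name, and it only rewrites the request dictionary before it is sent to /api/chat.

```python
def apply_workaround(payload):
    """Return a copy of the request with `think: false` dropped when `format` is set.

    Dropping the key lets the server fall back to its default thinking behaviour,
    which this issue reports as the only mode where `format` is honoured.
    """
    patched = dict(payload)  # shallow copy; the caller's dict is left untouched
    if patched.get("format") and patched.get("think") is False:
        del patched["think"]
    return patched


request = {
    "model": "gemma4:26b-a4b-it-q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}],
    "think": False,
    "format": {"type": "object"},
}
patched_request = apply_workaround(request)
```

Requests without a `format` are passed through unchanged, so plain chats keep the faster think=false path. The helper is meant to be removed once a server-side fix is available.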
