ollama: gemma4:26b (MoE) returns completely empty response (no content, no reasoning) and stops early on long system prompts [1 comment, 2 participants]

GitHub stats
ollama/ollama#15428 · Fetched 2026-04-09 07:51:18
Comments: 1 · Participants: 2 · Timeline events: 3 (subscribed ×2, commented ×1) · Reactions: 0


gemma4:26b (MoE) returns completely empty response (no content, no reasoning) and stops early on long system prompts

What is the issue?

When using the Gemma 4 MoE model (gemma4:26b) via the native /api/chat endpoint, the model returns a completely empty response (empty content, missing reasoning, and done_reason: "stop") if the system prompt exceeds roughly 500 characters.

This is not the known issue with the OpenAI /v1/chat/completions endpoint where the output is moved to the reasoning field (as reported in #15288). In this case, the model evaluates ~49 tokens and stops immediately, producing absolutely no output.

If the system prompt is short (e.g., < 200 chars), the model works correctly and produces output. The same long prompt works perfectly on Dense models like gemma4:31b and gemma-d-cc:latest.
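The gap between the working case (under 200 characters) and the failing case (roughly 500 and up) could be narrowed by sweeping the system prompt length. The following is only a sketch under assumptions and is not part of the original report: `make_filler` and `sweep_lengths` are hypothetical helper names, the length steps are arbitrary, and it assumes `bash`, `curl`, and `jq` are installed and that Ollama is listening on `http://localhost:11434`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a filler system prompt of exactly N characters
# by repeating the filler phrase used in the report.
make_filler() {
  local n=$1 unit="System instruction. " out=""
  while (( ${#out} < n )); do
    out+=$unit
  done
  printf '%s' "${out:0:n}"
}

# Hypothetical sweep: send one non-streaming /api/chat request per length and
# print "length<TAB>eval_count<TAB>content length" for each response.
sweep_lengths() {
  local model=${1:-gemma4:26b}
  local base_url=${2:-http://localhost:11434}
  local n payload
  for n in 100 200 300 400 500 750 1000 2000; do
    # jq -n --arg builds the JSON body with proper string escaping.
    payload=$(jq -n --arg model "$model" --arg sys "$(make_filler "$n")" '{
      model: $model,
      stream: false,
      messages: [
        {role: "system", content: $sys},
        {role: "user", content: "Who won the 2025 NCAA mens basketball championship?"}
      ]
    }')
    curl -s "$base_url/api/chat" -d "$payload" |
      jq -r --argjson n "$n" '[$n, .eval_count, (.message.content | length)] | @tsv'
  done
}
```

Running `sweep_lengths gemma4:26b` and then `sweep_lengths gemma4:31b` would give two comparable tables; a row whose last column is 0 marks a length at which the response came back empty.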

Steps to reproduce

Create a long system prompt (e.g., 2000+ characters). You can use any filler text or a complex system prompt.

# Create a 2000-character dummy system prompt
SYS_PROMPT=$(printf "System instruction. %.0s" {1..100})

# Send request to native API
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b",
  "messages": [
    {"role": "system", "content": "'"$SYS_PROMPT"'"},
    {"role": "user", "content": "Who won the 2025 NCAA mens basketball championship?"}
  ],
  "stream": false
}' | jq .

Result:

{
  "model": "gemma4:26b",
  "created_at": "2026-04-08T19:59:20.718104Z",
  "message": {
    "role": "assistant",
    "content": ""
  },
  "done": true,
  "done_reason": "stop",
  "total_duration": 1265782584,
  "load_duration": 137602292,
  "prompt_eval_count": 1423,
  "prompt_eval_duration": 192706834,
  "eval_count": 49,
  "eval_duration": 911844667
}

Note that eval_count is 49 while content is "": the model generated tokens, but none of them reached the response.
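The failure signature in the result above (tokens generated, nothing returned) can be checked mechanically when running many requests. This is a minimal sketch, not something from the report: `classify_response` is a hypothetical name, it requires `jq`, and it assumes that reasoning text, when present, is returned in `message.thinking`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read one /api/chat JSON response on stdin and print
# an EMPTY line when tokens were generated but neither content nor reasoning
# text came back; print OK otherwise. Requires jq.
classify_response() {
  jq -r '
    # Assumption: reasoning text, if any, is in .message.thinking.
    ((.message.content // "") + (.message.thinking // "")) as $text
    | if ($text | length) == 0 and (.eval_count // 0) > 0
      then "EMPTY eval_count=\(.eval_count) done_reason=\(.done_reason)"
      else "OK"
      end
  '
}
```

The output of the reproduction request can be piped straight into it, for example `curl -s http://localhost:11434/api/chat -d "$payload" | classify_response`, where `$payload` is the JSON request body.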

What I have ruled out

  • Modelfile/Parameters: I tested identical blobs with varied Modelfiles (removing RENDERER, PARSER, tweaking num_ctx, temperature, top_k, etc.). The bug persists across all variants.
  • Ollama Version: Reproduced on 0.20.0 and 0.20.3.
  • Prompt Content: The bug triggers even with pure ASCII filler (Lorem Ipsum) once the prompt reaches the length threshold.
  • Dense Models: gemma4:31b and gemma4:e4b handle the exact same prompt correctly. This bug is isolated to the MoE architecture or its specific quantization in the library.

Environment

  • Ollama version: 0.20.3
  • Model: gemma4:26b (SHA: 7121486771cbfe218851513210c40b35dbdee93ab1ef43fe36283c883980f0df)
  • OS: macOS 15 (Apple Silicon Mac Studio)
  • Configuration: flash_attention: false, num_parallel: 1
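For anyone trying to reproduce this on another machine, the same environment details could be gathered with a short script. This is a hedged sketch: `show_setting` and `collect_environment` are hypothetical names, and mapping the reported `flash_attention` and `num_parallel` settings to the `OLLAMA_FLASH_ATTENTION` and `OLLAMA_NUM_PARALLEL` environment variables is an assumption. Values set only for the server process (for example through the macOS app) will not show up in the calling shell.

```bash
#!/usr/bin/env bash
# Print NAME=value for an exported environment variable, or NAME=(unset).
show_setting() {
  local name=$1 value
  if value=$(printenv "$name"); then
    printf '%s=%s\n' "$name" "$value"
  else
    printf '%s=(unset)\n' "$name"
  fi
}

# Hypothetical collector: print the client version, the local model entry,
# the two assumed server settings, and the OS and CPU architecture.
collect_environment() {
  local model=${1:-gemma4:26b}
  ollama --version
  ollama list | grep -F "$model"
  show_setting OLLAMA_FLASH_ATTENTION
  show_setting OLLAMA_NUM_PARALLEL
  uname -sm
}
```

Calling `collect_environment gemma4:26b` prints one block of text that can be pasted into a follow-up comment on the issue.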

extent analysis

TL;DR

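The only workarounds identified so far are to keep the system prompt short (the report shows correct output below about 200 characters and failures above roughly 500) or to use a dense model such as gemma4:31b. Modelfile and parameter changes did not help the reporter.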

Guidance

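  • The reporter has already confirmed that dense models (gemma4:31b, gemma4:e4b) handle the same prompt. Testing another MoE model, if one is available, would show whether the problem affects the architecture in general or gemma4:26b specifically.
  • Adjusting num_ctx is unlikely to help: the reporter varied num_ctx, temperature, top_k, and other Modelfile parameters and the bug persisted, and the failing request evaluated only 1423 prompt tokens (prompt_eval_count).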
  • Consider preprocessing the system prompt to reduce its length while preserving essential information.
  • Investigate if there are any model-specific limitations or optimizations that can be applied to handle longer prompts.
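If the system prompt is being shortened as a workaround, a small guard can flag prompts that exceed the length at which the report saw failures. This is a sketch only: `check_prompt_length` is a hypothetical name, and the default limit of 500 characters is taken from the reporter's rough estimate, not from a measured threshold.

```bash
#!/usr/bin/env bash
# Hypothetical guard: return 0 when the prompt is within the limit, or print
# a warning to stderr and return 1 when it is longer.
check_prompt_length() {
  local prompt=$1
  local limit=${2:-500}   # rough failure threshold from the report
  local len=${#prompt}
  if (( len > limit )); then
    printf 'WARN: system prompt is %d characters (limit %d)\n' "$len" "$limit" >&2
    return 1
  fi
  return 0
}
```

A caller might use it as `check_prompt_length "$SYS_PROMPT" || echo "consider a dense model or a shorter prompt"`, keeping the decision about what to do with an over-long prompt outside the helper.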

Example

The relevant example is the issue's own reproduction under "Steps to reproduce"; the problem relates to the model and the prompt length rather than to client code.

Notes

The issue appears to be isolated to gemma4:26b. Because dense models such as gemma4:31b handle the same long prompt correctly and Modelfile changes made no difference, the problem is more likely tied to the MoE architecture or to the quantization of the published model than to user configuration.

Recommendation

Apply workaround: Shorten the system prompt or switch to a dense model until the root cause is identified. The cause appears to lie in how the MoE model handles longer prompts, and no configuration change tested in the report resolved it.
