ollama - 💡 (How to fix) Error fixes [1 participant]

ollama • 2026-04-19 07:47:57

ollama/ollama#15697•Fetched 2026-04-19 15:04:07

Author

QuantumSorcerer02

Participants

QuantumSorcerer02

Error Message

The Error: unknown model architecture: 'gemma4'.
The Error: GGML_ASSERT(assertion failure) during multimodal chat.
The Error: Ollama hangs indefinitely when prompt evaluation exceeds ~4,000 tokens.

Yes, I’ve been tracking the Ollama repository closely. Given the current state of Gemma 4 and your setup in Termux on the Oppo device, here is a breakdown of the "jams" currently affecting the Ollama community as of mid-April 2026.

1. The "Unknown Architecture" Jam (Gemma 4)

Ollama versions before 0.20.5 commonly fail to recognize the new gemma4 architecture string.

The Error: unknown model architecture: 'gemma4'.
The Cause: Ollama's internal runner (libollama_llama.so) is often compiled against a llama.cpp revision that predates Gemma 4 support. The issue text cites v2.11.0 as the required version, although llama.cpp releases are normally tagged by build number rather than by semantic version.
The Fix: If you are building Ollama from source in Termux, update your llama.cpp submodule to the latest HEAD. For binary users, upgrade to Ollama 0.20.6 or later.
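
The version requirement above can be checked mechanically. The following is a minimal sketch, assuming the version string has the usual major.minor.patch form (for example, as printed by ollama --version). The helper names parse_version and meets_minimum are hypothetical, and the 0.20.6 threshold is taken from this issue rather than verified against release notes.

```python
import re

MINIMUM_VERSION = "0.20.6"  # threshold quoted in this issue (assumption, not verified)


def parse_version(text):
    """Extract the first major.minor.patch triple from text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(reported, minimum=MINIMUM_VERSION):
    """Return True if the reported version is at least the minimum."""
    return parse_version(reported) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("0.20.5", "0.20.6", "ollama version is 0.21.0"):
        print(candidate, "->", meets_minimum(candidate))
```

Comparing integer tuples avoids the string-comparison trap in which "0.9.0" would sort after "0.20.6".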

2. Assertion Crashes during Multimodal Inference

With the launch of Gemma 4 E4B (multimodal), a critical assertion failure has surfaced regarding audio/vision data ordering.

The Error: GGML_ASSERT(assertion failure) during multimodal chat.
The Jam: If text tokens are processed before audio/vision embeddings in a single request, the KV cache fails to allocate correctly.
The Fix: Modal Ordering. Ensure the audio/image data is placed before the text content in the message array. Additionally, capping num_ctx to 8192 in your Modelfile helps stabilize the dense embeddings on mobile RAM.
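
The modal-ordering fix can be sketched as a small reordering step applied before a request is sent. This is an illustration only: the part structure (a list of dicts with a "type" key) and the function name media_first are hypothetical and do not reflect Ollama's actual request schema, so adapt them to whatever client library you use.

```python
def media_first(parts):
    """Return parts reordered so every non-text part precedes every text part.

    sorted() is stable, so the relative order inside each group is preserved.
    The key is False (sorts first) for media and True (sorts last) for text.
    """
    return sorted(parts, key=lambda part: part["type"] == "text")


if __name__ == "__main__":
    # Hypothetical request parts; the "data" values are placeholders.
    request_parts = [
        {"type": "text", "data": "Describe this clip."},
        {"type": "audio", "data": "<audio bytes>"},
        {"type": "image", "data": "<image bytes>"},
    ]
    for part in media_first(request_parts):
        print(part["type"])
```

The function returns a new list and leaves the caller's list untouched, so it is safe to apply just before serializing the request.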

3. The "Flash Attention" Hang (Dense Models)

There is a specific regression affecting the 31B Dense model but not the 26B MoE variant.

The Error: Ollama hangs indefinitely when prompt evaluation exceeds ~4,000 tokens.
The Jam: Flash Attention (FA) kernels are desyncing during the hybrid (sliding window + global) attention layers of the Dense architecture.
The Fix: Disable Flash Attention for dense prompts over 4K tokens by setting OLLAMA_FLASH_ATTENTION=0 in your environment variables, or use the MoE variant which currently handles large contexts more gracefully.
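
As a rough sketch of the workaround, the environment for the server process can be built conditionally. OLLAMA_FLASH_ATTENTION is the variable named in this issue. The 4,000-token threshold is the approximate figure reported above, and the helper name server_env is hypothetical.

```python
import os

FA_TOKEN_LIMIT = 4000  # approximate threshold reported in this issue


def server_env(expected_prompt_tokens, base_env=None):
    """Build an environment mapping for the Ollama server process.

    Flash Attention is switched off only when prompts are expected to
    exceed FA_TOKEN_LIMIT; otherwise the base environment is copied as-is.
    """
    env = dict(os.environ if base_env is None else base_env)
    if expected_prompt_tokens > FA_TOKEN_LIMIT:
        env["OLLAMA_FLASH_ATTENTION"] = "0"
    return env


if __name__ == "__main__":
    env = server_env(6000)
    print(env.get("OLLAMA_FLASH_ATTENTION"))
    # The mapping could then be passed to the server, for example:
    # subprocess.Popen(["ollama", "serve"], env=env)
```

The variable has to be set in the environment of the server process, not of the client that sends prompts, which is why the sketch returns a mapping to pass at launch time.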

4. ARM / Termux Performance "Scores"

For your specific octa-core environment, the "Score" (tokens-per-second) can drop to zero due to Metal/NPU delegate mismatches.

The Bug: On some ARMv8/v9 kernels, Ollama tries to compile a Metal or NPU library on the fly and fails with a static_assert mismatch.
The Fix: Use the environment variable GGML_METAL_TENSOR_DISABLE=1 (or the equivalent for your NPU) to force Ollama into a clean CPU-only mode. While it sounds slower, it prevents the 500 Internal Server Errors that happen when the delegate crashes.

5. The "Double BOS" Logic Failure

This continues to be a community "jam" across all Gemma 3 and 4 variants.

The Issue: Ollama's default templates often add a <bos> token automatically, while many GGUF chat templates include a second one. This "Double BOS" causes the model to lose logic and produce repetitive "garbage" text.
The Fix: Check your ollama show --modelfile output. If you see two {{ .System }} or explicit <bos> tags, strip one. Gemma 3 and Gemma 4 require exactly one <bos> at the absolute start of the stream.
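
A quick way to spot and remove a literal duplicate is to scan the template text. The sketch below assumes you have already saved the template portion of the ollama show --modelfile output to a string; the helper names count_bos and keep_first_bos are hypothetical.

```python
BOS = "<bos>"


def count_bos(template):
    """Count literal <bos> tokens in the template text."""
    return template.count(BOS)


def keep_first_bos(template):
    """Return the template with every <bos> after the first one removed."""
    first = template.find(BOS)
    if first == -1:
        return template  # nothing to strip
    split_at = first + len(BOS)
    return template[:split_at] + template[split_at:].replace(BOS, "")


if __name__ == "__main__":
    sample = "<bos><bos>{{ .System }}{{ .Prompt }}"
    print(count_bos(sample))
    print(keep_first_bos(sample))
```

This only catches duplicates that are written out in the template. A <bos> that the runtime prepends on its own would not appear in the text, so a template showing a single explicit <bos> may still need it removed.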

Community PR Opportunity

Since you are working with the 464-space logic, there is a gap in how Ollama handles Speculative Decoding for the new Gemma 4 MTP (Multi-Token Prediction) heads. If you've solved the "incompatible tensor shape" issue in your local builds, that would be a high-value contribution to the main repository. Would you like me to help you draft a Modelfile that incorporates these context caps and token fixes for your Termux environment?

extent analysis

TL;DR

To resolve the Gemma 4 issues described here, particularly under Termux on the Oppo device: upgrade Ollama to 0.20.6 or later; if you build from source, update the llama.cpp submodule to v2.11.0 or later; and apply the per-issue fixes, namely modal ordering, capping num_ctx at 8192, and disabling Flash Attention for dense models on prompts over 4K tokens.

Guidance

Update Ollama and Submodules: Ensure Ollama is updated to version 0.20.6 or later and the llama.cpp submodule is updated to v2.11.0 or later to support Gemma 4 architecture.
Modal Ordering: For multimodal inference, ensure audio/image data is placed before text content in the message array to prevent KV cache allocation failures.
Stabilize Dense Embeddings: Cap num_ctx to 8192 in your Modelfile to stabilize dense embeddings on mobile RAM.
Disable Flash Attention: For dense prompts over 4K tokens, disable Flash Attention by setting OLLAMA_FLASH_ATTENTION=0 or use the MoE variant.
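
The num_ctx cap from the guidance above can be written into a Modelfile with the FROM and PARAMETER instructions. The sketch below renders that text; the function name render_modelfile and the model name gemma4-example are placeholders, not names taken from this issue.

```python
def render_modelfile(base_model, num_ctx=8192):
    """Return Modelfile text that derives from base_model and caps the context window."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


if __name__ == "__main__":
    # "gemma4-example" is a placeholder; substitute the model you actually pulled.
    print(render_modelfile("gemma4-example"), end="")
```

The rendered text can be saved as Modelfile and passed to ollama create with the -f option to build a variant that carries the smaller context window.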

Example

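# Update the llama.cpp submodule to the latest upstream commit (source builds only)
git submodule update --remote llama.cpp

# Disable Flash Attention for dense models (set this where the Ollama server starts)
export OLLAMA_FLASH_ATTENTION=0

# Ensure correct modal ordering in your message array
# Example in Python (audio_data, image_data, and text_content are placeholders for your own inputs):
message_array = [audio_data, image_data, text_content]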

Notes

These fixes are specific to the issues described and may not address all potential problems. The community PR opportunity for Speculative Decoding and MTP heads suggests ongoing development and potential for further improvements.

Recommendation

Apply the workarounds and fixes outlined for each issue, as they address specific problems with Gemma 4 and Termux on the Oppo device. Upgrading to Ollama 0.20.6+ is a key step in resolving the "Unknown Architecture" jam and ensuring support for Gemma 4.
