ollama - 💡(How to fix) Fix gemma4:26b thinking mode causes passive/incomplete tool-calling behavior on CUDA via /v1/chat/completions

ollama2026-04-09 20:05:18

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

When using gemma4:26b with thinking enabled (chat_template_kwargs: {"thinking": true}) via the OpenAI-compatible /v1/chat/completions endpoint on a Linux CUDA backend, the model reads files correctly but fails to follow through with subsequent tool calls (e.g. writing code). Instead it produces a brief conversational summary and asks "How can I help?" — effectively becoming passive mid-task.

Root Cause

Fix Action

Workaround

Disabling thinking ("thinking": false) restores correct agentic behavior on CUDA.

RAW_BUFFERClick to expand / collapse

Description

Environment

OS: Linux (WSL2)
GPU: RTX 5090
Ollama version: 0.20.4
Model: gemma4:26b (Q4_K_M)
Endpoint: /v1/chat/completions
Thinking: enabled, reasoning_effort: low

Steps to Reproduce

Pull gemma4:26b
Enable thinking mode via chat_template_kwargs: {"thinking": true}
Use the /v1/chat/completions endpoint in an agentic loop with tools defined
Ask the model to perform a multi-step task (e.g. clone a repo, then add a new API endpoint to a file)

Expected Behavior

Model completes agentic tasks end-to-end — reads files, writes code, uses tools — as observed on macOS Metal with identical Ollama version, model, and thinking settings.

Actual Behavior

Model reads a file via tool call, summarizes the contents, then responds conversationally asking for further instructions instead of proceeding with the requested task. Subsequent tool calls (e.g. write_file) are never made.

Workaround

Disabling thinking ("thinking": false) restores correct agentic behavior on CUDA.

Notes

The same setup (identical Ollama version 0.20.4, same model tag, same thinking settings, same harness and prompts) works correctly on macOS Metal (Apple M-series, Mac Mini 64GB RAM). This points to a platform-specific difference in how thinking output is handled on CUDA vs Metal.

Related to #15288 (thinking output routed to reasoning field on /v1/chat/completions).

extent analysis

TL;DR

Disabling thinking mode by setting "thinking": false in chat_template_kwargs may temporarily resolve the issue of the model not following through with subsequent tool calls on a Linux CUDA backend.

Guidance

The issue seems to be platform-specific, related to how thinking output is handled on CUDA vs Metal, so comparing the behavior on different platforms may provide insights.
Verify that the model is correctly configured and that the /v1/chat/completions endpoint is being used as intended, especially with the chat_template_kwargs settings.
Try adjusting the reasoning_effort parameter to see if it affects the model's behavior, as the current setting is low.
Consider investigating related issues, such as #15288, which mentions thinking output being routed to the reasoning field, to see if there are any known fixes or workarounds.

Example

No specific code snippet can be provided without more context, but ensuring the correct usage of the /v1/chat/completions endpoint and chat_template_kwargs is crucial.

Notes

The provided workaround of disabling thinking mode may not be ideal, as it alters the model's behavior. Further investigation into the platform-specific differences and how thinking output is handled on CUDA vs Metal is necessary for a more permanent solution.

Recommendation

Apply the workaround by setting "thinking": false until a more permanent fix can be found, as it restores the correct agentic behavior on the CUDA backend.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #cache error #pipeline error #runtime error #dependency conflict

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

ollama - 💡(How to fix) Fix gemma4:26b thinking mode causes passive/incomplete tool-calling behavior on CUDA via /v1/chat/completions

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Workaround

Description

Environment

Steps to Reproduce

Expected Behavior

Actual Behavior

Workaround

Notes

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

ollama - 💡(How to fix) Fix gemma4:26b thinking mode causes passive/incomplete tool-calling behavior on CUDA via /v1/chat/completions

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Workaround

Description

Environment

Steps to Reproduce

Expected Behavior

Actual Behavior

Workaround

Notes

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING