ollama - 💡(How to fix) Fix gemma4:26b thinking mode causes passive/incomplete tool-calling behavior on CUDA via /v1/chat/completions

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

When using gemma4:26b with thinking enabled (chat_template_kwargs: {"thinking": true}) via the OpenAI-compatible /v1/chat/completions endpoint on a Linux CUDA backend, the model reads files correctly but fails to follow through with subsequent tool calls (e.g. writing code). Instead it produces a brief conversational summary and asks "How can I help?" — effectively becoming passive mid-task.

Root Cause

When using gemma4:26b with thinking enabled (chat_template_kwargs: {"thinking": true}) via the OpenAI-compatible /v1/chat/completions endpoint on a Linux CUDA backend, the model reads files correctly but fails to follow through with subsequent tool calls (e.g. writing code). Instead it produces a brief conversational summary and asks "How can I help?" — effectively becoming passive mid-task.

Fix Action

Workaround

Disabling thinking ("thinking": false) restores correct agentic behavior on CUDA.

RAW_BUFFERClick to expand / collapse

Description

When using gemma4:26b with thinking enabled (chat_template_kwargs: {"thinking": true}) via the OpenAI-compatible /v1/chat/completions endpoint on a Linux CUDA backend, the model reads files correctly but fails to follow through with subsequent tool calls (e.g. writing code). Instead it produces a brief conversational summary and asks "How can I help?" — effectively becoming passive mid-task.

Environment

  • OS: Linux (WSL2)
  • GPU: RTX 5090
  • Ollama version: 0.20.4
  • Model: gemma4:26b (Q4_K_M)
  • Endpoint: /v1/chat/completions
  • Thinking: enabled, reasoning_effort: low

Steps to Reproduce

  1. Pull gemma4:26b
  2. Enable thinking mode via chat_template_kwargs: {"thinking": true}
  3. Use the /v1/chat/completions endpoint in an agentic loop with tools defined
  4. Ask the model to perform a multi-step task (e.g. clone a repo, then add a new API endpoint to a file)

Expected Behavior

Model completes agentic tasks end-to-end — reads files, writes code, uses tools — as observed on macOS Metal with identical Ollama version, model, and thinking settings.

Actual Behavior

Model reads a file via tool call, summarizes the contents, then responds conversationally asking for further instructions instead of proceeding with the requested task. Subsequent tool calls (e.g. write_file) are never made.

Workaround

Disabling thinking ("thinking": false) restores correct agentic behavior on CUDA.

Notes

The same setup (identical Ollama version 0.20.4, same model tag, same thinking settings, same harness and prompts) works correctly on macOS Metal (Apple M-series, Mac Mini 64GB RAM). This points to a platform-specific difference in how thinking output is handled on CUDA vs Metal.

Related to #15288 (thinking output routed to reasoning field on /v1/chat/completions).

extent analysis

TL;DR

Disabling thinking mode by setting "thinking": false in chat_template_kwargs may temporarily resolve the issue of the model not following through with subsequent tool calls on a Linux CUDA backend.

Guidance

  • The issue seems to be platform-specific, related to how thinking output is handled on CUDA vs Metal, so comparing the behavior on different platforms may provide insights.
  • Verify that the model is correctly configured and that the /v1/chat/completions endpoint is being used as intended, especially with the chat_template_kwargs settings.
  • Try adjusting the reasoning_effort parameter to see if it affects the model's behavior, as the current setting is low.
  • Consider investigating related issues, such as #15288, which mentions thinking output being routed to the reasoning field, to see if there are any known fixes or workarounds.

Example

No specific code snippet can be provided without more context, but ensuring the correct usage of the /v1/chat/completions endpoint and chat_template_kwargs is crucial.

Notes

The provided workaround of disabling thinking mode may not be ideal, as it alters the model's behavior. Further investigation into the platform-specific differences and how thinking output is handled on CUDA vs Metal is necessary for a more permanent solution.

Recommendation

Apply the workaround by setting "thinking": false until a more permanent fix can be found, as it restores the correct agentic behavior on the CUDA backend.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING