ollama - 💡(How to fix) Fix Claude Code & Ollama Integration - Invalid tool parameters & CPU Fallback [8 comments, 3 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
ollama/ollama#15390Fetched 2026-04-08 03:01:15
View on GitHub
Comments
8
Participants
3
Timeline
9
Reactions
0
Timeline (top)
commented ×8labeled ×1

When using Claude Code (CLI) with a local Ollama instance, the agent consistently fails during tool execution (e.g., entering "Plan Mode" or reading files). The model generates invalid JSON for the tool calls, leading to a loop of "Invalid tool parameters" errors.

Additionally, specific configurations cause extreme CPU spikes (100%+) and slow response times (50s+), which seems to be related to an unintended vision-processing overhead and a Flash Attention fallback.

Root Cause

When using Claude Code (CLI) with a local Ollama instance, the agent consistently fails during tool execution (e.g., entering "Plan Mode" or reading files). The model generates invalid JSON for the tool calls, leading to a loop of "Invalid tool parameters" errors.

Additionally, specific configurations cause extreme CPU spikes (100%+) and slow response times (50s+), which seems to be related to an unintended vision-processing overhead and a Flash Attention fallback.

Code Example

environment:
      - OLLAMA_SCHED_SPREAD=true
      - OLLAMA_NUM_CTX=32768
      - OLLAMA_FLASH_ATTENTION=0  # Setting to 1 causes 100% CPU load instead of GPU boost
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

---

time=2026-04-07T12:23:41.133Z level=INFO source=runner.go:1290 msg=load request="{Operation:alloc ... FlashAttention:Disabled KvSize:32768 ...}"
time=2026-04-07T12:23:41.392Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=147.336691ms shape="[2560 256]"
time=2026-04-07T12:23:41.579Z level=INFO source=ggml.go:494 msg="offloaded 43/43 layers to GPU"
[GIN] 2026/04/07 - 12:24:32 | 200 | 53.364869091s | 192.168.66.36 | POST "/v1/messages?beta=true"
RAW_BUFFERClick to expand / collapse

What is the issue?

Description

When using Claude Code (CLI) with a local Ollama instance, the agent consistently fails during tool execution (e.g., entering "Plan Mode" or reading files). The model generates invalid JSON for the tool calls, leading to a loop of "Invalid tool parameters" errors.

Additionally, specific configurations cause extreme CPU spikes (100%+) and slow response times (50s+), which seems to be related to an unintended vision-processing overhead and a Flash Attention fallback.

System Environment

  • OS: Linux (Docker Deployment)
  • GPU: 2x NVIDIA GeForce RTX 3060 (12GB VRAM each)
  • Ollama Version: 0.20.3
  • Model: gemma4 (Local blob: sha256-4c27e0f5...)
  • Claude Code Command: ollama launch cloude

Docker Configuration (docker-compose.yml)

    environment:
      - OLLAMA_SCHED_SPREAD=true
      - OLLAMA_NUM_CTX=32768
      - OLLAMA_FLASH_ATTENTION=0  # Setting to 1 causes 100% CPU load instead of GPU boost
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Steps to Reproduce

Connect Claude Code to the local Ollama instance.

Provide a complex coding task (e.g., "Fix my project architecture and MQTT connection").

The agent attempts to initialize its internal "Plan Mode" tool.

The CLI returns: ⎿ Invalid tool parameters.

The model enters a loop: It apologizes for the wrong parameters and retries with the same (or similarly broken) JSON schema until the process is aborted.

Suspected Causes

Tool Parameter Formatting: The gemma4 model (likely due to its template or architecture) does not produce the exact JSON schema required by Claude Code's tool definitions.

Vision Encoder Overhead: The runner executes vision-related code (vision: encoded) for code-only prompts, which increases latency significantly and might interfere with the attention mechanism.

Flash Attention Regression: OLLAMA_FLASH_ATTENTION=1 results in a massive CPU spike. This suggests that the presence of the vision projector forces a fallback to a CPU-based attention implementation that is not optimized for large contexts.

Context Management: There is a discrepancy between the requested NUM_CTX=32768 and the actual prompt processing speed/stability wenn tools are involved.

Additional Context

The CPU load remains normal only wenn OLLAMA_FLASH_ATTENTION is disabled. However, the tool-calling issue persists regardless of this setting, preventing the agent from completing multi-step tasks.

Relevant log output

time=2026-04-07T12:23:41.133Z level=INFO source=runner.go:1290 msg=load request="{Operation:alloc ... FlashAttention:Disabled KvSize:32768 ...}"
time=2026-04-07T12:23:41.392Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=147.336691ms shape="[2560 256]"
time=2026-04-07T12:23:41.579Z level=INFO source=ggml.go:494 msg="offloaded 43/43 layers to GPU"
[GIN] 2026/04/07 - 12:24:32 | 200 | 53.364869091s | 192.168.66.36 | POST "/v1/messages?beta=true"

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

extent analysis

TL;DR

Disable OLLAMA_FLASH_ATTENTION and adjust OLLAMA_NUM_CTX to a lower value to mitigate CPU spikes and tool parameter issues.

Guidance

  • Verify that the gemma4 model template or architecture is compatible with Claude Code's tool definitions to resolve the JSON schema mismatch.
  • Investigate the vision encoder overhead by checking if the vision: encoded log message is related to the tool execution, and consider optimizing or disabling vision-related code for code-only prompts.
  • Test with a lower OLLAMA_NUM_CTX value (e.g., 16384 or 8192) to reduce context management discrepancies and potential performance issues.
  • Monitor CPU load and response times after applying these changes to ensure the fixes are effective.

Example

No code snippet is provided as the issue is related to configuration and model compatibility.

Notes

The provided log output suggests that the OLLAMA_FLASH_ATTENTION setting has a significant impact on CPU load, and disabling it may help mitigate the issue. However, the root cause of the tool parameter formatting issue remains unclear and may require further investigation into the gemma4 model or Claude Code's tool definitions.

Recommendation

Apply workaround: Disable OLLAMA_FLASH_ATTENTION and adjust OLLAMA_NUM_CTX to a lower value, as this may help mitigate the CPU spikes and tool parameter issues, allowing for further debugging and potential resolution of the underlying compatibility problem.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING