openclaw - ✅(Solved) Fix [Bug]: Ollama provider reports hardcoded 32768 input tokens, triggering premature compaction [1 pull requests, 1 comments, 2 participants]

GitHub stats
openclaw/openclaw#64326 · Fetched 2026-04-11 06:15:22
Comments: 1 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, labeled ×1, referenced ×1

When using the Ollama provider with local models (e.g., glm-flash:latest), OpenClaw reports input: 32768 tokens for every single message regardless of actual content. This causes the compaction safeguard to trigger after every response, breaking multi-turn conversations.

Error Message

Silent failure: no error is shown to the user, only empty responses or “heartbeat_ok”. The gateway log records:

[diagnostic] lane task error: lane=main error="FailoverError: LLM request timed out."

Root Cause

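resolveOpenAICompletionsCompatDefaults in src/agents/openai-completions-compat.ts did not recognize Ollama as supporting streaming usage, so requests to Ollama's OpenAI-compat endpoint (/v1/chat/completions) were sent without stream_options.include_usage=true and the streaming responses carried no usage data. A separate code path then used contextWindow (32768) as the token count, which made the compaction safeguard fire after every response.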

Fix Action

Fix / Workaround

Ollama Provider Config:

{
  "baseUrl": "http://localhost:11434",
  "api": "ollama",
  "models": [
    { "id": "glm-flash:latest", "name": "GLM-4.7-Flash (Local 202K)", "contextWindow": 202752 },
    { "id": "qwen3-coder:latest", "name": "Qwen3-Coder MoE (Local Fast)", "contextWindow": 32768 }
  ]
}

Compaction Config (default before workaround):

{
  "mode": "safeguard",
  "reserveTokens": 24000,
  "keepRecentTokens": 16000,
  "memoryFlush": { "enabled": true, "softThresholdTokens": 8000 }
}

Gateway Setup:

  • Caddy reverse proxy: :8443 → localhost:18789
  • Auth: token-based
  • trustedProxies: ["127.0.0.1", "::1"]

Severity: HIGH

  • Breaks multi-turn conversations: users cannot have back-and-forth dialogue
  • Silent failure: no error shown to user, just empty responses or “heartbeat_ok”
  • Affects all Ollama models: not specific to GLM, likely affects all local models
  • Data loss: compaction wipes conversation context prematurely
  • Workaround required: must set artificially high thresholds to use Ollama at all

PR fix notes

PR #64568: fix(ollama): enable streaming usage for OpenAI-compat endpoint (closes #64326)

Description (problem / solution / changelog)

Summary

  • Problem: Ollama provider reports hardcoded 32768 input tokens for every message, triggering premature compaction and breaking multi-turn conversations.
  • Why it matters: All Ollama local model users experience broken conversations — compaction fires after every response, wiping context.
  • What changed: Added Ollama to the supportsUsageInStreaming list in resolveOpenAICompletionsCompatDefaults, so stream_options.include_usage=true is sent to Ollama's OpenAI-compat endpoint. This makes Ollama return real usage.prompt_tokens and usage.completion_tokens in the final streaming chunk.
  • What did NOT change: Native Ollama API path (already correct), other providers' streaming usage behavior.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #64326
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: resolveOpenAICompletionsCompatDefaults in src/agents/openai-completions-compat.ts did not recognize Ollama as supporting streaming usage. Without stream_options.include_usage=true in the request, Ollama's OpenAI-compat endpoint omits usage data from streaming responses. The OpenAI transport stream then reports 0 input tokens (or falls back to incorrect values), while a separate code path uses contextWindow (32768) as the token count, triggering compaction.
  • Missing detection / guardrail: No test for Ollama OpenAI-compat streaming usage support.
  • Contributing context: Ollama's native API path correctly returns prompt_eval_count, but when accessed via /v1/chat/completions (OpenAI-compat), usage requires explicit opt-in via stream_options.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
  • Target test or file: src/agents/openai-completions-compat.test.ts
  • Scenario the test should lock in: resolveOpenAICompletionsCompatDefaults({ provider: "ollama" }).supportsUsageInStreaming === true
  • If no new test is added, why not: Keeping PR minimal — happy to add if requested
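
The scenario above can be sketched as a small standalone check. The resolver below is a hypothetical stand-in: the PR description does not show the body of resolveOpenAICompletionsCompatDefaults, so its input shape, the provider set, and the name resolveCompatDefaultsSketch are assumptions made for illustration. Only the "ollama" entry is confirmed by the PR text.

```typescript
// Hypothetical stand-in for resolveOpenAICompletionsCompatDefaults; the real
// function's inputs and remaining fields are not shown in the PR description.
interface CompatDefaults {
  supportsUsageInStreaming: boolean;
}

// Assumption: providers are matched by id. Only "ollama" is confirmed by the
// PR; "openai" is listed purely for illustration.
const STREAMING_USAGE_PROVIDERS: ReadonlySet<string> = new Set(["openai", "ollama"]);

function resolveCompatDefaultsSketch(input: { provider: string }): CompatDefaults {
  const provider = input.provider.trim().toLowerCase();
  return { supportsUsageInStreaming: STREAMING_USAGE_PROVIDERS.has(provider) };
}

// The regression scenario: Ollama must opt in to streaming usage.
const ollamaDefaults = resolveCompatDefaultsSketch({ provider: "ollama" });
if (!ollamaDefaults.supportsUsageInStreaming) {
  throw new Error("expected ollama to support usage in streaming");
}
```

A real test in src/agents/openai-completions-compat.test.ts would call the actual resolver with the project's test runner rather than this stand-in.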

User-visible / Behavior Changes

  • Ollama models via OpenAI-compat endpoint now report accurate token counts
  • Multi-turn conversations no longer break due to premature compaction
  • Compaction only triggers when context actually approaches the limit

Diagram (if applicable)

Before:
request to Ollama /v1/chat/completions (no stream_options.include_usage)
-> Ollama omits usage in stream -> reported as 0 or contextWindow -> compaction triggers

After:
request to Ollama /v1/chat/completions (stream_options.include_usage=true)
-> Ollama returns real usage in final chunk -> accurate token count -> no premature compaction
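
The "After" flow can be illustrated with a minimal sketch of the request body and of how usage might be picked up from the stream. The field names stream, stream_options.include_usage, usage.prompt_tokens and usage.completion_tokens follow the OpenAI chat-completions streaming format named in the PR. The helper names buildStreamingRequest and collectUsage, and the reduced chunk shape, are hypothetical and not taken from the OpenClaw code.

```typescript
// Only the fields used here are modeled.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface StreamingRequest {
  model: string;
  messages: ChatMessage[];
  stream: true;
  stream_options?: { include_usage: true };
}

interface StreamChunk {
  choices: { delta: { content?: string } }[];
  usage?: { prompt_tokens: number; completion_tokens: number } | null;
}

// Hypothetical builder: attach stream_options only when the provider supports it.
function buildStreamingRequest(
  model: string,
  messages: ChatMessage[],
  supportsUsageInStreaming: boolean,
): StreamingRequest {
  const request: StreamingRequest = { model, messages, stream: true };
  if (supportsUsageInStreaming) {
    request.stream_options = { include_usage: true };
  }
  return request;
}

// Walk the chunks and keep the usage from whichever chunk carries it
// (with include_usage, that is the final chunk). Returns undefined if none does.
function collectUsage(chunks: StreamChunk[]): { input: number; output: number } | undefined {
  let usage: { input: number; output: number } | undefined;
  for (const chunk of chunks) {
    if (chunk.usage) {
      usage = { input: chunk.usage.prompt_tokens, output: chunk.usage.completion_tokens };
    }
  }
  return usage;
}
```

Returning undefined when no chunk carries usage keeps the "Before" case visible to the caller instead of hiding it behind a default number.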

Security Impact (required)

  • New permissions/capabilities? No
  • Auth boundary changes? No
  • Secrets/token exposure risk? No
  • New external calls? No
  • Sandbox/isolation changes? No

Evidence

  • Code trace confirms Ollama was missing from streaming usage support list
  • pnpm check passes
  • AI-assisted: fix authored by Claude Code, reviewed and verified manually

Human Verification (required)

  • Verified scenarios: Traced the full path from request construction through streaming usage parsing
  • Edge cases checked: Native Ollama API (unaffected), non-Ollama providers (unaffected)
  • What you did not verify: Live Ollama integration test

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: Older Ollama versions might not support stream_options.include_usage
    • Mitigation: Ollama silently ignores unknown request fields, so this is safe for older versions (they just won't return usage, same as before)
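
For the case where an older Ollama version returns no usage, a caller could treat the count as unknown rather than substituting a constant. The sketch below is not part of PR #64568: resolveInputTokens, shouldCompact and the threshold parameter are hypothetical names, and OpenClaw's actual compaction check is not shown in the source.

```typescript
// Hypothetical guard, not part of PR #64568: absent usage is reported as
// unknown instead of being replaced by a value such as the model's contextWindow.
interface ReportedUsage {
  prompt_tokens?: number | null;
  completion_tokens?: number | null;
}

type InputTokenCount = { known: true; tokens: number } | { known: false };

function resolveInputTokens(usage: ReportedUsage | null | undefined): InputTokenCount {
  const tokens = usage?.prompt_tokens;
  if (typeof tokens === "number" && Number.isFinite(tokens) && tokens >= 0) {
    return { known: true, tokens };
  }
  return { known: false };
}

// Hypothetical compaction check: compact only on a known count at or above
// the threshold, never on a missing one.
function shouldCompact(count: InputTokenCount, thresholdTokens: number): boolean {
  return count.known && count.tokens >= thresholdTokens;
}
```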

Changed files

  • src/agents/openai-completions-compat.ts (modified, +4/-1)

Code Example

{"type":"message","message":{"role":"assistant","content":[{"type":"text","text":"Hi pook.\n\nWhat's on your mind today?"}],"usage":{"input":32768,"output":12}}}
{"type":"compaction","tokensBefore":32780,"fromHook":true}

The 32768 value is suspiciously exact: it is 2^15, a common default context size.

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Steps to reproduce

1. Configure the Ollama provider in openclaw.json with a local model
2. Set agents.defaults.model.primary to ollama/glm-flash:latest
3. Start a chat session via the gateway web UI
4. Send “hi” and get a response
5. Send a follow-up message

Expected behavior

  • Token usage reflects actual tokens used
  • Conversation continues normally across multiple turns
  • Compaction only triggers when context approaches the limit

Actual behavior

Every response logs exactly input: 32768 regardless of actual content:

"usage":{"input":32768,"output":12,"cacheRead":0,"cacheWrite":0,"totalTokens":32774}

Compaction triggers immediately after each response:

[compaction-safeguard] Compaction safeguard: no real conversation messages to summarize; writing compaction boundary to suppress re-trigger loop.

After compaction, subsequent responses return "content": [] (empty).

OpenClaw version

2026.4.9 (0512059)

Operating system

Devuan GNU/Linux 6 (excalibur)

Install method

npm install -g openclaw

Model

Z.AI/GLM-4.7-FLASH/local

Provider / routing chain

Primary Model:

{
  "primary": "ollama/glm-flash:latest",
  "fallbacks": ["ollama/qwen3-coder:latest"]
}

Routing Flow:

1. User message → OpenClaw gateway (port 18789)
2. Gateway → Ollama provider (localhost:11434)
3. Ollama → glm-flash:latest (GLM-4.7-Flash, 17.5GB)
4. On timeout → fallback to qwen3-coder:latest

Additional information

Session Log Evidence:

{"type":"message","message":{"role":"assistant","content":[{"type":"text","text":"Hi pook.\n\nWhat's on your mind today?"}],"usage":{"input":32768,"output":12}}}
{"type":"compaction","tokensBefore":32780,"fromHook":true}

Analysis:

  • The 32768 value is exactly 2^15, a common hardcoded default
  • Ollama API does return actual token counts in responses (prompt_eval_count, eval_count)
  • OpenClaw appears to ignore Ollama’s reported values and use a fallback constant
  • The bug is likely in the Ollama provider’s response parsing code

Gateway Logs Show Timeout Cascade:

[model-fallback] model fallback decision: decision=candidate_failed requested=ollama/glm-flash:latest reason=timeout
[diagnostic] lane task error: lane=main error="FailoverError: LLM request timed out."

Related: Initial cold-start timeouts may also be related. Ollama needs time to load models into VRAM, but OpenClaw’s default timeout may be too short for large models (17.5GB GLM).

extent analysis

TL;DR

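The issue report suspected that the Ollama provider's response parsing ignores the token counts returned by the Ollama API and uses a hardcoded default instead. PR #64568 traced the fault to the OpenAI-compat request path: without stream_options.include_usage=true, Ollama omits usage from streaming responses, and a fallback value is reported in its place.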

Guidance

  • Review the Ollama provider's code to find where it parses the response from the Ollama API and identify the hardcoded default value (32768) being used for input tokens.
  • Update the code to use the actual token counts returned by the Ollama API, such as prompt_eval_count and eval_count.
  • Verify that the updated code correctly handles cases where the Ollama API returns null or empty values for token counts.
  • Consider increasing the default timeout value in OpenClaw's configuration to accommodate large models like the 17.5GB GLM, which may take longer to load into VRAM.

Example

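// Example of a native Ollama API response (only the token-count fields shown)
const ollamaResponse = {
  prompt_eval_count: 100, // tokens in the prompt (input)
  eval_count: 50,         // tokens generated (output)
  // ...
};

// Input tokens come from prompt_eval_count alone; eval_count is the output count.
// Fall back to 0 if a field is missing or null.
const inputTokens = ollamaResponse.prompt_eval_count ?? 0;
const outputTokens = ollamaResponse.eval_count ?? 0;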

Notes

The issue seems to be specific to the Ollama provider and its interaction with the OpenClaw gateway. The fix will likely require modifying the Ollama provider's code to correctly handle the token counts returned by the Ollama API.

Recommendation

Apply a workaround by updating the Ollama provider's code to use the actual token counts returned by the Ollama API, as this will likely resolve the issue and allow multi-turn conversations to work correctly.
