openclaw - ✅(Solved) Fix [Bug]: Ollama provider reports hardcoded 32768 input tokens, triggering premature compaction [1 pull requests, 1 comments, 2 participants]

pookNast · 2026-04-10T13:42:54Z

When using the Ollama provider with local models (e.g., glm-flash:latest), OpenClaw reports input: 32768 tokens for every single message regardless of actual content. This causes the compaction safeguard to trigger after every response, breaking multi-turn conversations.



openclaw/openclaw#64326•Fetched 2026-04-11 06:15:22

Author: pookNast
Participants: martingarramon, pookNast
Timeline (top): commented ×1, cross-referenced ×1, labeled ×1, referenced ×1


Error Message

Silent failure: no error is shown to the user, only empty responses or “heartbeat_ok”. The gateway log records: lane=main error="FailoverError: LLM request timed out."

Root Cause

Fix Action

Fix / Workaround

Ollama Provider Config:

{
  "baseUrl": "http://localhost:11434",
  "api": "ollama",
  "models": [
    { "id": "glm-flash:latest", "name": "GLM-4.7-Flash (Local 202K)", "contextWindow": 202752 },
    { "id": "qwen3-coder:latest", "name": "Qwen3-Coder MoE (Local Fast)", "contextWindow": 32768 }
  ]
}

Compaction Config (default before workaround):

{
  "mode": "safeguard",
  "reserveTokens": 24000,
  "keepRecentTokens": 16000,
  "memoryFlush": { "enabled": true, "softThresholdTokens": 8000 }
}

Gateway Setup:
- Caddy reverse proxy: :8443 → localhost:18789
- Auth: token-based
- trustedProxies: ["127.0.0.1", "::1"]
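To illustrate why a reading pinned at 32768 would trip the safeguard on every turn, here is a minimal sketch. It assumes the safeguard compacts once reported usage exceeds contextWindow minus reserveTokens, and it uses the 32768 window from the config above. That rule and the names `CompactionBudget` and `shouldCompact` are assumptions for illustration, not OpenClaw's actual implementation.

```typescript
// Hypothetical budget check; the real safeguard logic in OpenClaw may differ.
interface CompactionBudget {
  contextWindow: number; // model context size in tokens
  reserveTokens: number; // headroom kept free for the next response
}

// Assumed rule: compact when reported usage exceeds the usable budget.
function shouldCompact(usedTokens: number, budget: CompactionBudget): boolean {
  return usedTokens > budget.contextWindow - budget.reserveTokens;
}

const budget: CompactionBudget = { contextWindow: 32768, reserveTokens: 24000 };

// A short greeting costs a few dozen tokens, far below the 8768-token budget.
console.log(shouldCompact(40, budget)); // false
// A reading pinned at the context window size always exceeds the budget.
console.log(shouldCompact(32768, budget)); // true
```

Under this assumed rule, no threshold setting short of a very small reserve would stop the trigger, which matches the reported need for artificially high thresholds.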

Severity: HIGH
- Breaks multi-turn conversations: users cannot have back-and-forth dialogue
- Silent failure: no error shown to user, just empty responses or “heartbeat_ok”
- Affects all Ollama models: not specific to GLM, likely affects all local models
- Data loss: compaction wipes conversation context prematurely
- Workaround required: must set artificially high thresholds to use Ollama at all

PR fix notes

PR #64568: fix(ollama): enable streaming usage for OpenAI-compat endpoint (closes #64326)

Repository: openclaw/openclaw
Author: xchunzhao
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/64568

Description (problem / solution / changelog)

Summary

Problem: Ollama provider reports hardcoded 32768 input tokens for every message, triggering premature compaction and breaking multi-turn conversations.
Why it matters: All Ollama local model users experience broken conversations — compaction fires after every response, wiping context.
What changed: Added Ollama to the supportsUsageInStreaming list in resolveOpenAICompletionsCompatDefaults, so stream_options.include_usage=true is sent to Ollama's OpenAI-compat endpoint. This makes Ollama return real usage.prompt_tokens and usage.completion_tokens in the final streaming chunk.
What did NOT change: Native Ollama API path (already correct), other providers' streaming usage behavior.

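Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor required for the fix
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

Scope (select all touched areas)

- [x] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra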

Linked Issue/PR

Closes #64326
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: resolveOpenAICompletionsCompatDefaults in src/agents/openai-completions-compat.ts did not recognize Ollama as supporting streaming usage. Without stream_options.include_usage=true in the request, Ollama's OpenAI-compat endpoint omits usage data from streaming responses. The OpenAI transport stream then reports 0 input tokens (or falls back to incorrect values), while a separate code path uses contextWindow (32768) as the token count, triggering compaction.
Missing detection / guardrail: No test for Ollama OpenAI-compat streaming usage support.
Contributing context: Ollama's native API path correctly returns prompt_eval_count, but when accessed via /v1/chat/completions (OpenAI-compat), usage requires explicit opt-in via stream_options.
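The change described above can be sketched roughly as follows. This is an illustrative stand-in, not the code from src/agents/openai-completions-compat.ts: the names `resolveCompatDefaultsSketch`, `buildStreamingRequest`, and `STREAMING_USAGE_PROVIDERS` are hypothetical, and the provider list is an assumption apart from "ollama", which the PR description confirms. The `stream_options.include_usage` field is the one named in the PR.

```typescript
// Illustrative stand-in; the real resolveOpenAICompletionsCompatDefaults is not reproduced here.
interface CompatDefaults {
  supportsUsageInStreaming: boolean;
}

// Hypothetical provider list; only "ollama" is confirmed by the PR description.
const STREAMING_USAGE_PROVIDERS = new Set<string>(["openai", "ollama"]);

function resolveCompatDefaultsSketch(provider: string): CompatDefaults {
  return { supportsUsageInStreaming: STREAMING_USAGE_PROVIDERS.has(provider) };
}

interface ChatRequestBody {
  model: string;
  stream: boolean;
  messages: { role: string; content: string }[];
  stream_options?: { include_usage: boolean };
}

// Build a streaming request body, opting in to usage only when the provider supports it.
function buildStreamingRequest(provider: string, model: string, prompt: string): ChatRequestBody {
  const body: ChatRequestBody = {
    model,
    stream: true,
    messages: [{ role: "user", content: prompt }],
  };
  if (resolveCompatDefaultsSketch(provider).supportsUsageInStreaming) {
    body.stream_options = { include_usage: true };
  }
  return body;
}

const request = buildStreamingRequest("ollama", "glm-flash:latest", "hi");
console.log(JSON.stringify(request.stream_options));
```

Keeping the opt-in behind a per-provider flag leaves requests to other providers unchanged, which is consistent with the "What did NOT change" note.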

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
Target test or file: src/agents/openai-completions-compat.test.ts
Scenario the test should lock in: resolveOpenAICompletionsCompatDefaults({ provider: "ollama" }).supportsUsageInStreaming === true
If no new test is added, why not: Keeping PR minimal — happy to add if requested

User-visible / Behavior Changes

Ollama models via OpenAI-compat endpoint now report accurate token counts
Multi-turn conversations no longer break due to premature compaction
Compaction only triggers when context actually approaches the limit

Diagram (if applicable)

Before:
request to Ollama /v1/chat/completions (no stream_options.include_usage)
-> Ollama omits usage in stream -> reported as 0 or contextWindow -> compaction triggers

After:
request to Ollama /v1/chat/completions (stream_options.include_usage=true)
-> Ollama returns real usage in final chunk -> accurate token count -> no premature compaction
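The "After" path can be illustrated with a small sketch of reading usage from the final streaming chunk. The chunk shape follows the OpenAI-compatible streaming format (usage.prompt_tokens, usage.completion_tokens); the helper name `extractUsage`, the `Usage` type, and the sample token values are assumptions made for this example.

```typescript
// Minimal shape of an OpenAI-compatible streaming chunk (other fields omitted).
interface StreamChunk {
  choices: { delta: { content?: string } }[];
  usage?: { prompt_tokens: number; completion_tokens: number } | null;
}

interface Usage {
  input: number;
  output: number;
}

// Return the usage from the last chunk that carries one, or undefined if none did.
// Returning undefined (rather than a placeholder such as contextWindow) lets the
// caller distinguish "unknown" from a real count.
function extractUsage(chunks: StreamChunk[]): Usage | undefined {
  let usage: Usage | undefined;
  for (const chunk of chunks) {
    if (chunk.usage) {
      usage = { input: chunk.usage.prompt_tokens, output: chunk.usage.completion_tokens };
    }
  }
  return usage;
}

// Sample data (made-up token values).
const withUsage: StreamChunk[] = [
  { choices: [{ delta: { content: "Hi" } }], usage: null },
  { choices: [], usage: { prompt_tokens: 9, completion_tokens: 12 } },
];
const withoutUsage: StreamChunk[] = [{ choices: [{ delta: { content: "Hi" } }] }];

console.log(extractUsage(withUsage)); // { input: 9, output: 12 }
console.log(extractUsage(withoutUsage)); // undefined
```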

Security Impact (required)

New permissions/capabilities? No
Auth boundary changes? No
Secrets/token exposure risk? No
New external calls? No
Sandbox/isolation changes? No

Evidence

Code trace confirms Ollama was missing from streaming usage support list
pnpm check passes
AI-assisted: fix authored by Claude Code, reviewed and verified manually

Human Verification (required)

Verified scenarios: Traced the full path from request construction through streaming usage parsing
Edge cases checked: Native Ollama API (unaffected), non-Ollama providers (unaffected)
What you did not verify: Live Ollama integration test

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Risks and Mitigations

- Risk: Older Ollama versions might not support stream_options.include_usage
- Mitigation: Ollama silently ignores unknown request fields, so this is safe for older versions (they just won't return usage, same as before)

Changed files

src/agents/openai-completions-compat.ts (modified, +4/-1)

Code Example

{"type":"message","message":{"role":"assistant","content":[{"type":"text","text":"Hi pook.\n\nWhat's on your mind today?"}],"usage":{"input":32768,"output":12}}}
{"type":"compaction","tokensBefore":32780,"fromHook":true}

The 32768 value is suspiciously exact: it is 2^15, a common default context size.


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Steps to reproduce

1. Configure Ollama provider in openclaw.json with a local model
2. Set agents.defaults.model.primary to ollama/glm-flash:latest
3. Start a chat session via the gateway web UI
4. Send “hi” and get a response
5. Send a follow-up message

Expected behavior

- Token usage reflects actual tokens used
- Conversation continues normally across multiple turns
- Compaction only triggers when context approaches limit

Actual behavior

Every response logs exactly input: 32768 regardless of actual content:

"usage":{"input":32768,"output":12,"cacheRead":0,"cacheWrite":0,"totalTokens":32774}

Compaction triggers immediately after each response:

[compaction-safeguard] Compaction safeguard: no real conversation messages to summarize; writing compaction boundary to suppress re-trigger loop.

After compaction, subsequent responses return "content": [] (empty).

OpenClaw version

2026.4.9 (0512059)

Operating system

Devuan GNU/Linux 6 (excalibur)

Install method

npm install -g openclaw

Model

Z.AI/GLM-4.7-FLASH/local

Provider / routing chain

Primary Model:

{ "primary": "ollama/glm-flash:latest", "fallbacks": ["ollama/qwen3-coder:latest"] }

Routing Flow:
1. User message → OpenClaw gateway (port 18789)
2. Gateway → Ollama provider (localhost:11434)
3. Ollama → glm-flash:latest (GLM-4.7-Flash, 17.5GB)
4. On timeout → fallback to qwen3-coder:latest

Additional provider/model setup details

Logs, screenshots, and evidence

{"type":"message","message":{"role":"assistant","content":[{"type":"text","text":"Hi pook.\n\nWhat's on your mind today?"}],"usage":{"input":32768,"output":12}}}
{"type":"compaction","tokensBefore":32780,"fromHook":true}

The 32768 value is suspiciously exact: it is 2^15, a common default context size.

Impact and severity

Additional information

Session Log Evidence:

{"type":"message","message":{"role":"assistant","content":[{"type":"text","text":"Hi pook.\n\nWhat's on your mind today?"}],"usage":{"input":32768,"output":12}}}
{"type":"compaction","tokensBefore":32780,"fromHook":true}

Analysis:
- The 32768 value is exactly 2^15, a common hardcoded default
- Ollama API does return actual token counts in responses (prompt_eval_count, eval_count)
- OpenClaw appears to ignore Ollama’s reported values and use a fallback constant
- The bug is likely in the Ollama provider’s response parsing code

Gateway Logs Show Timeout Cascade:

[model-fallback] model fallback decision: decision=candidate_failed requested=ollama/glm-flash:latest reason=timeout
[diagnostic] lane task error: lane=main error="FailoverError: LLM request timed out."

Related: Initial cold-start timeouts may also be related. Ollama needs time to load models into VRAM, and OpenClaw’s default timeout may be too short for large models (17.5GB GLM).

extent analysis

TL;DR

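The issue can likely be fixed by having OpenClaw use the actual token counts that Ollama returns instead of a fallback value. Per PR #64568, on the OpenAI-compat endpoint this means sending stream_options.include_usage=true so that Ollama includes usage in the final streaming chunk.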

Guidance

Review the Ollama provider's code to find where it parses the response from the Ollama API and identify the hardcoded default value (32768) being used for input tokens.
Update the code to use the actual token counts returned by the Ollama API, such as prompt_eval_count and eval_count.
Verify that the updated code correctly handles cases where the Ollama API returns null or empty values for token counts.
Consider increasing the default timeout value in OpenClaw's configuration to accommodate large models like the 17.5GB GLM, which may take longer to load into VRAM.

Example

// Example of how the Ollama API response might be parsed
const ollamaResponse = {
  prompt_eval_count: 100, // tokens in the prompt (input)
  eval_count: 50, // tokens generated by the model (output)
  // ...
};

// Use the actual token counts; input and output come from separate fields
const inputTokens = ollamaResponse.prompt_eval_count;
const outputTokens = ollamaResponse.eval_count;
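The guidance about null or empty token counts can be sketched as a small defensive helper. The field names prompt_eval_count and eval_count come from the issue text; the helper name `usageFromOllama` and the choice to return undefined when counts are missing are assumptions for illustration, not OpenClaw's code.

```typescript
// Token-count fields on a native Ollama response (other fields omitted).
interface OllamaCounts {
  prompt_eval_count?: number | null;
  eval_count?: number | null;
}

interface Usage {
  input: number;
  output: number;
}

// Hypothetical helper: returns undefined rather than a placeholder when counts are missing,
// so callers never mistake a fallback constant for a measured value.
function usageFromOllama(res: OllamaCounts): Usage | undefined {
  const { prompt_eval_count: input, eval_count: output } = res;
  if (typeof input !== "number" || typeof output !== "number") return undefined;
  if (!Number.isFinite(input) || !Number.isFinite(output)) return undefined;
  return { input, output };
}

console.log(usageFromOllama({ prompt_eval_count: 100, eval_count: 50 })); // { input: 100, output: 50 }
console.log(usageFromOllama({ prompt_eval_count: null, eval_count: 50 })); // undefined
```

A caller that receives undefined could skip the compaction check for that turn instead of treating the context as full.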

Notes

The issue seems to be specific to the Ollama provider and its interaction with the OpenClaw gateway. The fix will likely require modifying the Ollama provider's code to correctly handle the token counts returned by the Ollama API.

Recommendation

Apply a workaround by updating the Ollama provider's code to use the actual token counts returned by the Ollama API, as this will likely resolve the issue and allow multi-turn conversations to work correctly.


#api #LLM response #prompt template #agent execution #response parsing
