openclaw - ✅ (Solved) fix(agents): context overflow not detected for llama.cpp server provider [1 pull request, 1 comment, 2 participants]

openclaw · 2026-04-10 07:50:12


GitHub stats

openclaw/openclaw#64180•Fetched 2026-04-11 06:16:00


Author

alexander-applyinnovations

Participants

alexander-applyinnovations

Jah-yee

Timeline (top)

referenced ×5, closed ×1, commented ×1, cross-referenced ×1

Error Message

When llama.cpp returns its native overflow error, openclaw does not classify it as a context overflow. Auto-compaction therefore never triggers, and the run fails and surfaces the raw upstream error to the user:

error=400 request (66202 tokens) exceeds the available context size (65536 tokens), try increasing it

The message falls through to matchesProviderContextOverflow() in src/agents/pi-embedded-helpers/provider-error-patterns.ts. Its candidate pre-check (PROVIDER_CONTEXT_OVERFLOW_SIGNAL_RE + PROVIDER_CONTEXT_OVERFLOW_ACTION_RE) passes, because the message contains "request", "tokens", "context" and "exceeds". However, none of the concrete PROVIDER_CONTEXT_OVERFLOW_PATTERNS regexes match. For example, \bollama error:\s*context length exceeded(?:,\s*too many tokens)?\b does not match. The run aborts with an opaque error instead of compacting and retrying.

Proposed fix: add a llama.cpp-shaped pattern to PROVIDER_CONTEXT_OVERFLOW_PATTERNS (llama.cpp's wording is stable across versions). Add a matching test case in src/agents/pi-embedded-helpers/provider-error-patterns.test.ts.

Affected code:
src/agents/pi-embedded-helpers/provider-error-patterns.ts (PROVIDER_CONTEXT_OVERFLOW_PATTERNS)
src/agents/pi-embedded-helpers/provider-error-patterns.test.ts (coverage)

Model: Qwen3.5-35B-A3B (any model; the error comes from llama.cpp's slot manager, not the model)

Fix Action

Fixed

Fixed by PR: fix(agents): detect llama.cpp slot overflow as context overflow (https://github.com/openclaw/openclaw/pull/64196)
Closed with commit: 57e6aeca840b2556c8cc74684e7cf6665d25f5eb

PR fix notes

PR #64196: fix(agents): detect llama.cpp slot overflow as context overflow

Repository: openclaw/openclaw
Author: alexander-applyinnovations
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/64196

Description (problem / solution / changelog)

Summary

Adds a llama.cpp-shaped regex to PROVIDER_CONTEXT_OVERFLOW_PATTERNS so isContextOverflowError() recognises the native overflow message from llama.cpp's slot manager (used directly or behind any api: "openai-completions" proxy).
Adds 4 unit test cases covering the new pattern (3 parameterised messages + 1 end-to-end isContextOverflowError() assertion).

Closes #64180

The Bug

Self-hosted llama.cpp HTTP servers are very common (ghcr.io/ggml-org/llama.cpp:server-cuda and similar). When a prompt overshoots a slot's --ctx-size, llama.cpp returns:

400 request (66202 tokens) exceeds the available context size (65536 tokens), try increasing it

This message slips past every existing detector:

Check	Why it misses
`request_too_large`	wrong wording
`context length exceeded` / `maximum context length`	llama.cpp says "context size", not "context length"
`prompt is too long` / `prompt too long`	llama.cpp says "request (N tokens) exceeds…"
`exceeds model context window` / `context_window_exceeded`	wrong wording
`413` + `too large`	llama.cpp returns 400
`\binput token count exceeds the maximum number of input tokens\b` (Bedrock)	wrong wording
`\binput is too long for this model\b` (Bedrock/Mistral generic)	wrong wording
`\binput exceeds the maximum number of tokens\b` (Vertex)	wrong wording
`\bollama error:\s*context length exceeded\b` (Ollama)	no `ollama error:` prefix
`\btotal tokens?.*exceeds?` (Cohere)	message has "(N tokens)", not "total tokens"

The generic candidate pre-check (PROVIDER_CONTEXT_OVERFLOW_SIGNAL_RE + PROVIDER_CONTEXT_OVERFLOW_ACTION_RE) does pass — the message contains "request"/"tokens"/"context"/"exceeds" — but no concrete pattern matches, so matchesProviderContextOverflow() returns false. The agent runner sees isContextOverflowError() === false, never enters the compaction branch, and the user gets the raw upstream 400 instead of an automatic compact + retry.
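The two-stage flow described above (a cheap candidate pre-check, then a list of concrete patterns) can be sketched as follows. This is a simplified model, not openclaw's source. The issue does not quote the real PROVIDER_CONTEXT_OVERFLOW_SIGNAL_RE and PROVIDER_CONTEXT_OVERFLOW_ACTION_RE, so SIGNAL_RE and ACTION_RE below are hypothetical stand-ins that only need to accept the llama.cpp message. The pattern list is also trimmed to two entries quoted in the issue.

```typescript
// Hypothetical stand-ins for PROVIDER_CONTEXT_OVERFLOW_SIGNAL_RE / _ACTION_RE;
// openclaw's real regexes are not quoted in the issue.
const SIGNAL_RE = /\b(?:context|tokens?|request|prompt|input)\b/i;
const ACTION_RE = /\b(?:exceeds?|exceeded|too long|too large)\b/i;

// Two existing provider patterns quoted in the issue (Ollama, Vertex).
const EXISTING_PATTERNS: readonly RegExp[] = [
  /\bollama error:\s*context length exceeded(?:,\s*too many tokens)?\b/i,
  /\binput exceeds the maximum number of tokens\b/i,
];

// The pattern added by the PR.
const LLAMA_CPP_PATTERN =
  /\b(?:request|prompt) \(\d[\d,]*\s*tokens?\) exceeds (?:the )?available context size\b/i;

function matchesOverflow(message: string, patterns: readonly RegExp[]): boolean {
  // Stage 1: cheap candidate pre-check; most unrelated errors stop here.
  if (!SIGNAL_RE.test(message) || !ACTION_RE.test(message)) return false;
  // Stage 2: passing the pre-check is not enough, a concrete pattern must match.
  return patterns.some((re) => re.test(message));
}

const llamaMsg =
  "400 request (66202 tokens) exceeds the available context size (65536 tokens), try increasing it";

const before = matchesOverflow(llamaMsg, EXISTING_PATTERNS);
const after = matchesOverflow(llamaMsg, [...EXISTING_PATTERNS, LLAMA_CPP_PATTERN]);
console.log({ before, after }); // { before: false, after: true }
```

The patterns deliberately omit the `g` flag. With `g`, `RegExp.prototype.test` would carry `lastIndex` state between calls on the shared regex objects.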

Same class of bug as #58839 (Bedrock/Ollama/Cohere), just for a different provider.

The Fix

One regex added next to the existing provider patterns in src/agents/pi-embedded-helpers/provider-error-patterns.ts:

// llama.cpp HTTP server (often used directly or behind an OpenAI-compatible
// shim) returns "request (N tokens) exceeds the available context size
// (M tokens), try increasing it" when the prompt overshoots a slot's
// ctx-size. Wording is from the upstream slot manager and is stable.
// Example: "400 request (66202 tokens) exceeds the available context size (65536 tokens), try increasing it"
/\b(?:request|prompt) \(\d[\d,]*\s*tokens?\) exceeds (?:the )?available context size\b/i,

The pattern is anchored on the stable slot-manager wording ((?:request|prompt) (N tokens) exceeds (the )?available context size) so it can't accidentally swallow unrelated provider errors. The existing candidate pre-check still gates the regex evaluation, keeping cost negligible.
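The claim that the anchoring keeps the pattern from swallowing unrelated errors can be checked directly. The near-miss strings and the comma-separated variant below were written for this sketch and are not captured provider output. The Ollama string follows the wording of the Ollama pattern.

```typescript
const LLAMA_CPP_OVERFLOW_RE =
  /\b(?:request|prompt) \(\d[\d,]*\s*tokens?\) exceeds (?:the )?available context size\b/i;

// Illustrative near misses, not captured provider output.
const nearMisses: readonly string[] = [
  // Ollama's wording is handled by the Ollama pattern, not this one.
  "ollama error: context length exceeded, too many tokens",
  // Same keywords, but no "(N tokens)" group right after "request".
  "request exceeds the available context size",
  // A different limit: "maximum context size", not "available context size".
  "request (500 tokens) exceeds the maximum context size",
  // Unrelated error that also mentions tokens and exceeding.
  "rate limit exceeded: too many tokens per minute",
];

// `[\d,]*` tolerates a thousands separator inside the token count.
const withSeparator =
  "request (66,202 tokens) exceeds the available context size (65,536 tokens)";

const falsePositives = nearMisses.filter((m) => LLAMA_CPP_OVERFLOW_RE.test(m));
console.log(falsePositives.length, LLAMA_CPP_OVERFLOW_RE.test(withSeparator)); // 0 true
```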

Tests

provider-error-patterns.test.ts parameterised cases:
- "400 request (66202 tokens) exceeds the available context size (65536 tokens), try increasing it" (verbatim production payload)
- "request (130000 tokens) exceeds available context size (131072 tokens)" (no "the", no status prefix)
- "prompt (8500 tokens) exceeds the available context size (8192 tokens), try increasing it" (the prompt alternation)
New end-to-end case under describe("isContextOverflowError with provider patterns") asserting that isContextOverflowError(<verbatim message>) === true, which is what the agent runner actually calls.
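The three parameterised cases can be mirrored as a dependency-free, table-driven check with Node's `node:assert/strict`. The PR's actual test uses vitest, so this loop is only a sketch of the same assertions and not the committed test file.

```typescript
import assert from "node:assert/strict";

const LLAMA_CPP_OVERFLOW_RE =
  /\b(?:request|prompt) \(\d[\d,]*\s*tokens?\) exceeds (?:the )?available context size\b/i;

// The three messages from the PR's parameterised cases.
const PR_CASES: readonly string[] = [
  // Verbatim production payload, including the "400 " status prefix.
  "400 request (66202 tokens) exceeds the available context size (65536 tokens), try increasing it",
  // No "the", no status prefix.
  "request (130000 tokens) exceeds available context size (131072 tokens)",
  // Exercises the `prompt` alternation.
  "prompt (8500 tokens) exceeds the available context size (8192 tokens), try increasing it",
];

for (const message of PR_CASES) {
  assert.ok(LLAMA_CPP_OVERFLOW_RE.test(message), `expected a match: ${message}`);
}
console.log(`${PR_CASES.length} cases matched`); // 3 cases matched
```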

Targeted run:

$ pnpm vitest run src/agents/pi-embedded-helpers/provider-error-patterns.test.ts
 Test Files  1 passed (1)
      Tests  26 passed (26)

Broader vitest run shows 2 pre-existing failures in src/channels/plugins/contracts/group-policy.fallback.contract.test.ts — those are not touched by this PR (the change is scoped to two files in src/agents/pi-embedded-helpers/).

Test plan

Targeted unit tests pass (26/26 in provider-error-patterns.test.ts)
Pattern only matches llama.cpp's specific wording (verified by the existing "does not match unrelated errors" cases)
Manual: verify on a llama.cpp deployment that compaction now triggers when a session overshoots --ctx-size

AI-assisted

Drafted with Claude Code (Claude Opus 4.6, 1M context)
Lightly tested (targeted unit tests pass; manual verification still pending in our prod env)
I understand what the code does — added one regex to an existing fallback list and added matching test coverage

🤖 Generated with Claude Code

Changed files

CHANGELOG.md (modified, +1/-0)
src/agents/pi-embedded-helpers/provider-error-patterns.test.ts (modified, +14/-0)
src/agents/pi-embedded-helpers/provider-error-patterns.ts (modified, +7/-0)


Original issue report

Bug Description

isContextOverflowError() and matchesProviderContextOverflow() in src/agents/pi-embedded-helpers/ detect context window overflow for major providers (Anthropic, OpenAI, Bedrock, Ollama, Cohere, Vertex), but the llama.cpp HTTP server (used directly or via vLLM-style local deployments behind api: "openai-completions") is not covered.

Reproduction

Run any llama.cpp server (e.g. ghcr.io/ggml-org/llama.cpp:server-cuda) with a fixed context size, configure it as an openai-completions provider in openclaw, and let an agent build up a context that overshoots the per-slot context.

Llama.cpp returns:

400 request (66202 tokens) exceeds the available context size (65536 tokens), try increasing it

Openclaw surfaces it directly to the user instead of triggering compaction:

[agent] embedded run agent end: ... isError=true
  model=qwen3.5-35b-a3b
  provider=llamacpp-deep
  error=400 request (66202 tokens) exceeds the available context size (65536 tokens), try increasing it
  rawError=400 request (66202 tokens) exceeds the available context size (65536 tokens), try increasing it

Why It Isn't Detected

Walking through isContextOverflowError() in src/agents/pi-embedded-helpers/errors.ts against the message above, none of the existing string/regex checks match:

Check	Matches?
`request_too_large`	No
`invalid_argument` + `maximum number of tokens`	No
`request exceeds the maximum size`	No
`context length exceeded`	No (llama.cpp says "context size", not "context length")
`maximum context length`	No
`prompt is too long` / `prompt too long`	No
`exceeds model context window`	No
`model token limit`	No
`input exceeds` + `maximum number of tokens`	No
`request size exceeds` + `context window/length`	No
`context overflow:`	No
`exceed context limit`	No
`exceeds the model's maximum context`	No
`max_tokens` + `exceed` + `context`	No
`input length` + `exceed` + `context`	No
`413` + `too large`	No (llama.cpp returns 400)
`context_window_exceeded`	No
Chinese proxy patterns	No
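The walkthrough in the table can be modelled as one substring predicate per row, applied to the lowercased message. This is a simplified sketch. How errors.ts actually normalises and combines these checks is an assumption. The Chinese proxy patterns are omitted because the issue does not quote them.

```typescript
// Simplified model of the errors.ts string checks (one entry per table row).
// Normalisation and combination logic are assumptions, not openclaw's source.
type Check = { label: string; test: (m: string) => boolean };

// True when every needle occurs in the (already lowercased) message.
const has = (m: string, ...needles: string[]): boolean =>
  needles.every((n) => m.includes(n));

const CHECKS: readonly Check[] = [
  { label: "request_too_large", test: (m) => has(m, "request_too_large") },
  {
    label: "invalid_argument + maximum number of tokens",
    test: (m) => has(m, "invalid_argument", "maximum number of tokens"),
  },
  { label: "request exceeds the maximum size", test: (m) => has(m, "request exceeds the maximum size") },
  { label: "context length exceeded", test: (m) => has(m, "context length exceeded") },
  { label: "maximum context length", test: (m) => has(m, "maximum context length") },
  {
    label: "prompt is too long / prompt too long",
    test: (m) => has(m, "prompt is too long") || has(m, "prompt too long"),
  },
  { label: "exceeds model context window", test: (m) => has(m, "exceeds model context window") },
  { label: "model token limit", test: (m) => has(m, "model token limit") },
  {
    label: "input exceeds + maximum number of tokens",
    test: (m) => has(m, "input exceeds", "maximum number of tokens"),
  },
  {
    label: "request size exceeds + context window/length",
    test: (m) =>
      has(m, "request size exceeds") && (has(m, "context window") || has(m, "context length")),
  },
  { label: "context overflow:", test: (m) => has(m, "context overflow:") },
  { label: "exceed context limit", test: (m) => has(m, "exceed context limit") },
  { label: "exceeds the model's maximum context", test: (m) => has(m, "exceeds the model's maximum context") },
  { label: "max_tokens + exceed + context", test: (m) => has(m, "max_tokens", "exceed", "context") },
  { label: "input length + exceed + context", test: (m) => has(m, "input length", "exceed", "context") },
  { label: "413 + too large", test: (m) => has(m, "413", "too large") },
  { label: "context_window_exceeded", test: (m) => has(m, "context_window_exceeded") },
];

// Returns the label of the first check that fires, or undefined if none does.
function firstMatchingCheck(message: string): string | undefined {
  const m = message.toLowerCase();
  return CHECKS.find((c) => c.test(m))?.label;
}

const llamaMsg =
  "400 request (66202 tokens) exceeds the available context size (65536 tokens), try increasing it";
console.log(firstMatchingCheck(llamaMsg)); // undefined
```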

It then falls through to matchesProviderContextOverflow() in src/agents/pi-embedded-helpers/provider-error-patterns.ts. The candidate pre-check (PROVIDER_CONTEXT_OVERFLOW_SIGNAL_RE + PROVIDER_CONTEXT_OVERFLOW_ACTION_RE) passes — the message contains "request", "tokens", "context" and "exceeds" — but none of the concrete PROVIDER_CONTEXT_OVERFLOW_PATTERNS regexes match either:

Pattern	Matches?
`\binput token count exceeds the maximum number of input tokens\b`	No
`\binput is too long for this model\b`	No
`\binput exceeds the maximum number of tokens\b`	No
`\bollama error:\s*context length exceeded(?:,\s*too many tokens)?\b`	No
`\btotal tokens?.*exceeds? (?:the )?(?:model(?:'s)? )?(?:max|maximum|limit)`	No
`\binput (?:is )?too long for (?:the )?model\b`	No
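Running the six patterns from the table against the llama.cpp message confirms that none of them fire. The issue does not show the flags these patterns are compiled with, so the `i` flag is an assumption. The Ollama and Cohere rows are written with `\s*` quantifiers and a plain `|` alternation.

```typescript
// Existing PROVIDER_CONTEXT_OVERFLOW_PATTERNS entries from the table.
// The `i` flag is an assumption; the issue does not show flags.
const EXISTING_PROVIDER_PATTERNS: readonly RegExp[] = [
  /\binput token count exceeds the maximum number of input tokens\b/i, // Bedrock
  /\binput is too long for this model\b/i, // Bedrock/Mistral generic
  /\binput exceeds the maximum number of tokens\b/i, // Vertex
  /\bollama error:\s*context length exceeded(?:,\s*too many tokens)?\b/i, // Ollama
  /\btotal tokens?.*exceeds? (?:the )?(?:model(?:'s)? )?(?:max|maximum|limit)/i, // Cohere
  /\binput (?:is )?too long for (?:the )?model\b/i,
];

const llamaMsg =
  "400 request (66202 tokens) exceeds the available context size (65536 tokens), try increasing it";

// Collect every existing pattern that would have recognised the message.
const hits = EXISTING_PROVIDER_PATTERNS.filter((re) => re.test(llamaMsg));
console.log(hits.length); // 0
```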

So isContextOverflowError() returns false and the agent runner never calls into the compaction path.

Impact

Any user pointing openclaw at a llama.cpp HTTP server (a very common self-hosted setup) hits this the first time their session grows past the per-slot context.
The run aborts with an opaque error instead of compacting and retrying.
Compaction's whole job — rescuing oversized sessions — silently doesn't apply to llama.cpp.

This is the same class of bug as #58839, but for a different provider.

Expected Behavior

isContextOverflowError() (via matchesProviderContextOverflow()) should detect llama.cpp's native overflow wording so the existing compaction pipeline kicks in automatically, just like it does for OpenAI / Anthropic / Bedrock / Ollama / Cohere.
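The expected behaviour can be sketched as a retry loop: a recognised overflow triggers compaction and another attempt, while any other error surfaces unchanged. Everything here is hypothetical. runWithCompaction, its callbacks and maxCompactions are illustrative names, not openclaw's runner API. The loop is synchronous for brevity. The "compaction" simply drops old messages, whereas a real compaction step would preserve more of the session.

```typescript
type Message = { role: "user" | "assistant"; content: string };

const LLAMA_CPP_OVERFLOW_RE =
  /\b(?:request|prompt) \(\d[\d,]*\s*tokens?\) exceeds (?:the )?available context size\b/i;

// Hypothetical stand-in for isContextOverflowError(); only llama.cpp is modelled.
function isContextOverflow(err: unknown): boolean {
  return err instanceof Error && LLAMA_CPP_OVERFLOW_RE.test(err.message);
}

// Hypothetical runner: on a recognised overflow, compact the history and retry.
function runWithCompaction(
  history: Message[],
  callModel: (h: Message[]) => string,
  compact: (h: Message[]) => Message[],
  maxCompactions = 2,
): { reply: string; compactions: number } {
  let current = history;
  for (let compactions = 0; ; compactions++) {
    try {
      return { reply: callModel(current), compactions };
    } catch (err) {
      // Unrecognised errors, or too many retries, still surface unchanged.
      if (!isContextOverflow(err) || compactions >= maxCompactions) throw err;
      current = compact(current);
    }
  }
}

// Usage: a fake llama.cpp slot that only "fits" four messages.
const fakeServer = (h: Message[]): string => {
  if (h.length > 4) {
    throw new Error(
      `400 request (${h.length * 1000} tokens) exceeds the available context size (4000 tokens), try increasing it`,
    );
  }
  return `ok with ${h.length} messages`;
};

// Naive demo compaction: keep only the last two messages.
const keepLastTwo = (h: Message[]): Message[] => h.slice(-2);

const history: Message[] = Array.from({ length: 6 }, (_, i): Message => ({
  role: i % 2 === 0 ? "user" : "assistant",
  content: `turn ${i}`,
}));
console.log(runWithCompaction(history, fakeServer, keepLastTwo));
// { reply: 'ok with 2 messages', compactions: 1 }
```

Without the llama.cpp pattern, isContextOverflow() would return false for the same error and the loop would rethrow it on the first attempt. That is the failure the issue describes.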

Proposed Fix

Add a llama.cpp-shaped pattern to PROVIDER_CONTEXT_OVERFLOW_PATTERNS in src/agents/pi-embedded-helpers/provider-error-patterns.ts. Llama.cpp's wording is stable across versions:

/\b(?:request|prompt) \(\d[\d,]*\s*tokens?\) exceeds (?:the )?available context size\b/i

Plus a matching test case in src/agents/pi-embedded-helpers/provider-error-patterns.test.ts. PR to follow.

Affected Code

src/agents/pi-embedded-helpers/provider-error-patterns.ts — PROVIDER_CONTEXT_OVERFLOW_PATTERNS
src/agents/pi-embedded-helpers/provider-error-patterns.test.ts — coverage

Environment

OpenClaw v2026.4.9
Provider: llama.cpp HTTP server (ghcr.io/ggml-org/llama.cpp:server-cuda) configured via api: "openai-completions"
Model: Qwen3.5-35B-A3B (any model — the error is from llama.cpp's slot manager, not the model)

Extent analysis

TL;DR

Add a new regex pattern to PROVIDER_CONTEXT_OVERFLOW_PATTERNS in src/agents/pi-embedded-helpers/provider-error-patterns.ts to match llama.cpp's native overflow error message.

Guidance

Update PROVIDER_CONTEXT_OVERFLOW_PATTERNS with the proposed regex pattern: /\b(?:request|prompt) \(\d[\d,]*\s*tokens?\) exceeds (?:the )?available context size\b/i
Add a test case in src/agents/pi-embedded-helpers/provider-error-patterns.test.ts to cover the new pattern
Verify that the updated isContextOverflowError() function correctly detects llama.cpp's overflow error and triggers the compaction pipeline
Test the fix against a llama.cpp server with a session whose prompt grows past the per-slot context size

Example

// src/agents/pi-embedded-helpers/provider-error-patterns.ts
const PROVIDER_CONTEXT_OVERFLOW_PATTERNS = [
  // ... existing patterns ...
  /\b(?:request|prompt) \(\d[\d,]*\s*tokens?\) exceeds (?:the )?available context size\b/i,
];

Notes

The proposed fix assumes that the regex pattern accurately matches llama.cpp's native overflow error message. If the error message changes in future versions of llama.cpp, the pattern may need to be updated.

Recommendation

Apply the fix by adding the new regex pattern to PROVIDER_CONTEXT_OVERFLOW_PATTERNS. This lets the compaction pipeline trigger for llama.cpp servers instead of letting runs abort with opaque errors.

