openclaw - 💡(How to fix) Fix [Bug]: Prompt cache boundary fix (#43148) doesn't apply to OpenAI-compat providers — full reprocess every turn on llama-server / LM Studio / vLLM [2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#78053Fetched 2026-05-06 06:17:27
View on GitHub
Comments
2
Participants
2
Timeline
9
Reactions
2
Author
Timeline (top)
commented ×2labeled ×2mentioned ×2subscribed ×2

The fix shipped for #43148 (SYSTEM_PROMPT_CACHE_BOUNDARY + Anthropic cache_control) is Anthropic-specific and does not help OpenAI-compatible providers (llama-server, LM Studio, vLLM, Ollama, MLX servers). For local-LLM users, volatile per-turn metadata is still emitted at the end of the system prompt and reaches the model unmodified, causing full prompt re-processing on every turn of a single chat session.

Root Cause

The fix shipped for #43148 (SYSTEM_PROMPT_CACHE_BOUNDARY + Anthropic cache_control) is Anthropic-specific and does not help OpenAI-compatible providers (llama-server, LM Studio, vLLM, Ollama, MLX servers). For local-LLM users, volatile per-turn metadata is still emitted at the end of the system prompt and reaches the model unmodified, causing full prompt re-processing on every turn of a single chat session.

Fix Action

Fix / Workaround

Related:

  • #43148 — closed as fixed; fix is Anthropic-specific.
  • #21785 — message_id injection in buildInboundMetaSystemPrompt() patched in source but not in three loaded bundles; same incomplete-fix pattern may apply here.
  • #19892 — local model provider regression cluster (2026.2.15+).
  • #19534 — root-cause analysis of dynamic system prompt content.

Code Example

I see these diffs at the end of the system prompt, metadata related to whatsapp chat messages with changing timestamps and message ids (characters ~58000) forcing prompt reprocessing at every turn.

Sanitized diff:


A:
  "chat_id": "<redacted-jid>",
  "message_id": "<id-A>",                       ← differs
  "sender_id": "+XXX0000000",
  "timestamp": "Tue 2026-05-05 21:45 GMT+2",    ← differs
  "group_subject": "<redacted>",
  "group_members": "...",
  "is_group_chat": true

B:
  "chat_id": "<redacted-jid>",
  "message_id": "<id-B>",                       ← differs
  "sender_id": "+XXX0000000",
  "timestamp": "Tue 2026-05-05 21:52 GMT+2",    ← differs
  "group_subject": "<redacted>",
  "group_members": "...",
  "is_group_chat": true


llama-server log shows total cache miss despite same-session-size prompts already cached:


srv  load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv  update: - cache state: 4 prompts, 26787.829 MiB
    - prompt 0x...: 103469 tokens
    - prompt 0x...: 104352 tokens
    - prompt 0x...: 103910 tokens
    - prompt 0x...:  48716 tokens
slot get_availabl: id 10 | task -1 | selected slot by LRU, t_last = -1
slot update_slots: id 10 | task 2545 | new prompt, n_tokens = 103756
slot update_slots: id 10 | task 2545 | n_tokens = 0, memory_seq_rm [0, end)


`sim = 0.000` (below default 0.10 threshold) despite three cached prompts of similar size that should match a same-session continuation by ~99% if the system prompt were stable.
RAW_BUFFERClick to expand / collapse

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

The fix shipped for #43148 (SYSTEM_PROMPT_CACHE_BOUNDARY + Anthropic cache_control) is Anthropic-specific and does not help OpenAI-compatible providers (llama-server, LM Studio, vLLM, Ollama, MLX servers). For local-LLM users, volatile per-turn metadata is still emitted at the end of the system prompt and reaches the model unmodified, causing full prompt re-processing on every turn of a single chat session.

Steps to reproduce

  1. Run OpenClaw 2026.5.3 against a local llama-server via OpenAI-compat /v1/chat/completions.
  2. Channel: WhatsApp (direct/group) chat (similar pattern likely on Telegram/Discord groups).
  3. Send two consecutive user messages in the same chat, capturing the outbound request bodies via mitmproxy / reverse-proxy.
  4. Diff messages[0].content between the two captures.

Expected behavior

System prompt is byte-identical across consecutive user turns in the same session, so prefix-matching caches (llama.cpp slot LCP, vLLM prefix cache, MLX cache_wrapper) reuse the cached prefix and only the new tokens are prefilled.

Actual behavior

System prompts diverge at character ~57,400 of messages[0].content. The divergent region is the trailing Conversation info (untrusted metadata) JSON block, specifically two fields:

  • message_id — unique per inbound message, changes every turn
  • timestamp — sent time, changes every turn

Other fields in the same block (chat_id, sender_id, conversation_label, sender, group_subject, group_members, is_group_chat) are stable across turns.

OpenClaw version

2026.5.3

Operating system

Linux (Ubuntu in WSL2)

Install method

npm

Model

Qwen 3.6 27B (Q5_K_M GGUF) via local llama.cpp

Provider / routing chain

openclaw → local llama-server (OpenAI-compat)

Additional provider/model setup details

Logs, screenshots, and evidence

I see these diffs at the end of the system prompt, metadata related to whatsapp chat messages with changing timestamps and message ids (characters ~58000) forcing prompt reprocessing at every turn.

Sanitized diff:


A:
  "chat_id": "<redacted-jid>",
  "message_id": "<id-A>",                       ← differs
  "sender_id": "+XXX0000000",
  "timestamp": "Tue 2026-05-05 21:45 GMT+2",    ← differs
  "group_subject": "<redacted>",
  "group_members": "...",
  "is_group_chat": true

B:
  "chat_id": "<redacted-jid>",
  "message_id": "<id-B>",                       ← differs
  "sender_id": "+XXX0000000",
  "timestamp": "Tue 2026-05-05 21:52 GMT+2",    ← differs
  "group_subject": "<redacted>",
  "group_members": "...",
  "is_group_chat": true


llama-server log shows total cache miss despite same-session-size prompts already cached:


srv  load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv  update: - cache state: 4 prompts, 26787.829 MiB
    - prompt 0x...: 103469 tokens
    - prompt 0x...: 104352 tokens
    - prompt 0x...: 103910 tokens
    - prompt 0x...:  48716 tokens
slot get_availabl: id 10 | task -1 | selected slot by LRU, t_last = -1
slot update_slots: id 10 | task 2545 | new prompt, n_tokens = 103756
slot update_slots: id 10 | task 2545 | n_tokens = 0, memory_seq_rm [0, end)


`sim = 0.000` (below default 0.10 threshold) despite three cached prompts of similar size that should match a same-session continuation by ~99% if the system prompt were stable.

Impact and severity

  • Affected: every OpenClaw user with an OpenAI-compat LLM backend that relies on byte-identical prefix matching (llama.cpp, LM Studio, vLLM, MLX, Ollama).
  • Severity: high — every turn triggers a full prefill of 100K+ tokens.
  • Frequency: deterministic, every turn.
  • Cost: at typical local prefill speeds (~1–2K tok/s), this is 50–100 seconds per turn that should be sub-second.

Additional information

The fix landed for #43148 is correct for Anthropic-family transports: the volatile suffix is still on the wire, but Anthropic respects cache_control and only caches the marked stable prefix, so the suffix doesn't bust the cache.

For OpenAI-compat transports there is no cache_control equivalent. Prefix matching is byte-level on the entire rendered prompt — the volatile suffix breaks the prefix match at the boundary regardless of any internal markers.

The architectural fix that works across all transports: instead of placing volatile per-turn metadata below an internal cache boundary in the system prompt, prepend it to the latest user-role message. The system prompt then stays byte-identical, the trailing user message naturally has new content anyway (it's the new turn), and provider-side prefix matching works without any provider-specific markers.

Related:

  • #43148 — closed as fixed; fix is Anthropic-specific.
  • #21785 — message_id injection in buildInboundMetaSystemPrompt() patched in source but not in three loaded bundles; same incomplete-fix pattern may apply here.
  • #19892 — local model provider regression cluster (2026.2.15+).
  • #19534 — root-cause analysis of dynamic system prompt content.

extent analysis

TL;DR

Prepend volatile per-turn metadata to the latest user-role message instead of placing it below an internal cache boundary in the system prompt.

Guidance

  • Identify the code responsible for generating the system prompt and modify it to prepend the volatile metadata to the user message.
  • Verify that the system prompt remains byte-identical across consecutive turns by capturing and comparing the outbound request bodies.
  • Test the modified system prompt with the OpenAI-compat LLM backend to ensure that prefix matching works correctly.
  • Review related issues (#43148, #21785, #19892, #19534) to ensure that similar fixes are applied consistently.

Example

No code snippet is provided as the issue does not contain sufficient information about the codebase.

Notes

The fix may require modifications to the buildInboundMetaSystemPrompt() function or similar code responsible for generating the system prompt. The prepend approach should work across all transports, but testing is necessary to confirm.

Recommendation

Apply the workaround by prepending volatile per-turn metadata to the latest user-role message, as this approach is transport-agnostic and should resolve the issue for OpenAI-compat LLM backends.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

System prompt is byte-identical across consecutive user turns in the same session, so prefix-matching caches (llama.cpp slot LCP, vLLM prefix cache, MLX cache_wrapper) reuse the cached prefix and only the new tokens are prefilled.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING