openclaw: Prompt-assembly overhead scales linearly with tool-schema size; trivial turns pay full cost

Summary

Every agent turn in openclaw appears to prepend the full system prompt and tool schema to the model input, regardless of whether the turn actually needs tools. The reporter of NVIDIA/NemoClaw#2598 measured prompt_eval_count=18,355 tokens for a bare "say hello" prompt on Ollama. Because per-turn latency scales linearly with the number of prompt tokens evaluated, this fixed prefix dominates total latency on any backend; it is simply most visible on fast local inference, where a no-op turn on DGX Spark with nemotron-3-nano:30b takes P50=10s and max=17s.

This is a hardware-agnostic structural issue in prompt assembly, not a backend or deployment bug.

Reproduction

Controlled prompt-length sweep on macOS with Ollama + llama3.2:1b, identical user prompt ("say hello"), identical hardware, varying only the size of injected fake tool-schema tokens:

| Configuration | Prompt tokens | Total latency | Slowdown vs. bare |
|---|---|---|---|
| Bare prompt | 27 | 212 ms | baseline |
| + ~2k schema tokens | 2,433 | 2,980 ms | 14× |
| + ~10k schema tokens | 12,033 | 22,189 ms | 104× |
| + ~18k schema tokens (matches observed openclaw) | 21,633 | 36,581 ms | 172× |

Latency grows roughly linearly with prompt size. The 18k-token row is in the same range as the prompt_eval_count the NemoClaw reporter measured against a real openclaw turn on Brev (21,633 vs. 18,355 tokens), which is consistent with the schema being the dominant contributor.
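As a quick check on the linearity claim, the four rows of the table can be fitted with an ordinary least-squares line. This is a minimal sketch that uses only the numbers reported above; the function names `fit_line` and `slowdowns` are illustrative and are not part of any sweep script.

```python
# Rows from the sweep table: (prompt_tokens, total_latency_ms).
ROWS = [(27, 212.0), (2433, 2980.0), (12033, 22189.0), (21633, 36581.0)]


def fit_line(points):
    """Least-squares fit of latency_ms = intercept + slope * prompt_tokens."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def slowdowns(points):
    """Latency of each row divided by the latency of the first (bare) row."""
    base = points[0][1]
    return [latency / base for _, latency in points]


if __name__ == "__main__":
    slope, intercept = fit_line(ROWS)
    print(f"slope: {slope:.2f} ms per prompt token, intercept: {intercept:.0f} ms")
    for (tokens, _), factor in zip(ROWS, slowdowns(ROWS)):
        print(f"{tokens:>6} tokens: {factor:.1f}x the bare latency")
```

The slope is the per-token prefill cost on the test machine. With only four points the fit is a sanity check rather than a benchmark, but it shows how the slowdown column follows from the latency column.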

Mechanism

  • prompt_eval cost is O(prompt_tokens) on every backend tested.
  • openclaw's per-turn input is dominated by a static tool schema + system message that is sent regardless of whether the user's turn could invoke tools.
  • Trivial turns ("say hello", short clarifications, acknowledgments) therefore pay full schema cost for zero functional benefit.
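To separate prefill cost from generation cost on a given turn, the timing fields in an Ollama response can be split as below. This is a sketch: it assumes the response is the final JSON object from Ollama's generate or chat endpoint, which reports durations in nanoseconds, and the sample values are made-up placeholders, not measurements.

```python
NS_PER_MS = 1_000_000  # Ollama reports durations in nanoseconds


def split_latency(response: dict) -> dict:
    """Split one response's timings into prefill and decode parts."""
    prefill_ms = response.get("prompt_eval_duration", 0) / NS_PER_MS
    decode_ms = response.get("eval_duration", 0) / NS_PER_MS
    total_ms = response.get("total_duration", 0) / NS_PER_MS
    prompt_tokens = response.get("prompt_eval_count", 0)
    return {
        "prompt_tokens": prompt_tokens,
        "prefill_ms": prefill_ms,
        "decode_ms": decode_ms,
        "total_ms": total_ms,
        # Fraction of wall time spent evaluating the prompt.
        "prefill_share": prefill_ms / total_ms if total_ms else 0.0,
        # Per-token prefill cost; this is the slope of the linear model.
        "ms_per_prompt_token": prefill_ms / prompt_tokens if prompt_tokens else 0.0,
    }


# Placeholder response with illustrative round numbers (NOT measured data).
SAMPLE = {
    "prompt_eval_count": 18_000,
    "prompt_eval_duration": 9_000_000_000,
    "eval_duration": 500_000_000,
    "total_duration": 10_000_000_000,
}

if __name__ == "__main__":
    for key, value in split_latency(SAMPLE).items():
        print(f"{key}: {value}")
```

A prefill share close to 1.0 on a trivial turn would indicate that the static prefix, not generation, is what the user is waiting on.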

What I checked

The prompt-assembly path lives in openclaw's shipped dist/. Relevant artifacts that contribute to the per-turn prefix:

  • bash-tools.descriptions
  • openclaw-tools.runtime
  • system-message
  • prompts

Together these produce the ~18k-token prefix observed in the wild. NemoClaw pins openclaw 2026.4.24 and ships the upstream image unmodified; it does not own this code path, which is why I'm filing here rather than there.
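One rough way to see which artifact contributes most is to sum file sizes per artifact name. The sketch below assumes the artifacts are files under a directory whose filenames contain the names listed above, and it uses a four-characters-per-token rule of thumb instead of a real tokenizer. Both are assumptions, and because whole files are counted (including any surrounding code), the result is an upper bound on each artifact's share.

```python
from pathlib import Path

# Artifact names taken from the list above; matching is by filename substring.
ARTIFACT_NAMES = (
    "bash-tools.descriptions",
    "openclaw-tools.runtime",
    "system-message",
    "prompts",
)
CHARS_PER_TOKEN = 4  # rough rule of thumb, not a tokenizer


def estimate_tokens_by_artifact(root, names=ARTIFACT_NAMES):
    """Return {artifact_name: estimated_tokens} for files found under root."""
    totals = {name: 0 for name in names}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        for name in names:
            if name in path.name:
                text = path.read_text(encoding="utf-8", errors="ignore")
                totals[name] += len(text) // CHARS_PER_TOKEN
                break  # count each file under the first matching name only
    return totals


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "dist"
    for name, tokens in estimate_tokens_by_artifact(root).items():
        print(f"{name}: ~{tokens} tokens (upper bound)")
```

Comparing these estimates against the measured prompt_eval_count would show whether schema compression alone could plausibly recover a meaningful share of the prefix.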

Suggested directions (pick whichever fits the architecture)

  1. Schema-skip for trivial turns — heuristic or classifier that omits the tool schema when the turn is clearly conversational. Highest impact, requires a fallback path if the model decides mid-turn it needs a tool.
  2. Lazy / on-demand tool registration — send a minimal tool index up front, expand to full schema only when the model signals tool use.
  3. Lite-mode flag — explicit --no-tools or lite: true session mode for known-conversational contexts (greetings, status checks, agent-to-agent handoffs). Lowest-risk; opts in rather than out.
  4. Schema compression — tighten descriptions / dedupe across bash-tools and openclaw-tools registries. Bounded gain but no behavior change.
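The first three directions can be sketched together as a choice between three prefix modes. Everything here is hypothetical: the `Tool` type, the phrase list, the `NEED_TOOL:` marker, and the function names are illustrative and do not reflect openclaw's actual code. The sketch deliberately sends a short tool index for trivial turns rather than nothing, so the model can still ask for a tool and trigger a retry with the full schema.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tool:
    name: str
    summary: str  # one-line description used in the index
    schema: str   # full schema text; the expensive part of the prefix


# Hypothetical heuristic: exact matches on short conversational phrases.
TRIVIAL_PHRASES = {"hi", "hello", "say hello", "thanks", "thank you", "ok", "okay"}
TOOL_REQUEST_PREFIX = "NEED_TOOL:"  # hypothetical marker the model would emit


def is_trivial(turn: str) -> bool:
    return turn.strip().lower().rstrip("!.?") in TRIVIAL_PHRASES


def choose_mode(turn: str, lite: bool = False) -> str:
    """Return 'none' (lite mode), 'index' (trivial turn), or 'full'."""
    if lite:
        return "none"
    return "index" if is_trivial(turn) else "full"


def assemble_prefix(system_message: str, tools: Sequence[Tool], mode: str) -> str:
    if mode == "none":
        return system_message
    if mode == "index":
        lines = [f"- {tool.name}: {tool.summary}" for tool in tools]
        hint = f"To use a tool, reply with '{TOOL_REQUEST_PREFIX} <name>'."
        return "\n".join([system_message, hint, *lines])
    if mode == "full":
        return "\n".join([system_message, *(tool.schema for tool in tools)])
    raise ValueError(f"unknown mode: {mode}")


def requested_tool(model_reply: str, tools: Sequence[Tool]) -> Optional[Tool]:
    """Fallback check: did the model ask for a tool after an index-only prefix?"""
    for line in model_reply.splitlines():
        if line.startswith(TOOL_REQUEST_PREFIX):
            wanted = line[len(TOOL_REQUEST_PREFIX):].strip()
            for tool in tools:
                if tool.name == wanted:
                    return tool
    return None
```

If `requested_tool` returns a tool, the caller would rerun the turn with `mode="full"` (or with only that tool's schema). The trade-off is one extra round trip on a misclassified turn in exchange for a much shorter prefix on conversational ones.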

Context

  • Downstream report with end-to-end timings on DGX Spark: NVIDIA/NemoClaw#2598
  • Cross-platform mirror on Brev/NIM: NVIDIA/NemoClaw#2600 (P50 9s, P99 128s — same root cause, different backend tail behavior)
  • Feature request asking for the same fix: NVIDIA/NemoClaw#3261 ("Lightweight message interceptor hooks for system prompt bypass" — calls out +10k tokens of overhead)
  • Token count 18,355 for "say hello" measured by the NemoClaw reporter on Brev with nemotron-3-nano:30b
  • My sweep above was on different hardware and a different model, deliberately, to isolate the prompt-length mechanism from anything Spark- or Nemotron-specific.

Happy to share the sweep script or rerun against a specific openclaw version if useful.
