openclaw - 💡(How to fix) Fix Agent runtime per-turn startup overhead is ~17s on CPU-only systems, dominating end-to-end latency for fast cloud models [1 participants]

GitHub stats
openclaw/openclaw#63357 (fetched 2026-04-09 07:54:51)
Comments: 0
Participants: 1
Timeline: 4 (cross-referenced ×4)
Reactions: 0


Bug: Agent runtime per-turn startup overhead is ~17s on CPU-only systems, dominating end-to-end latency for fast cloud models

Summary

On a CPU-only host, every agent turn pays a ~17 second per-turn initialization cost between the channel adapter receiving the inbound message and the user message being written to the session JSONL. This happens BEFORE the LLM is called. With a fast cloud model like Claude Haiku 4.5 (whose own gateway-side LLM call measures ~1-2s), this overhead dominates end-to-end latency to the point where Teams users see ~12s minimum response times for trivial messages.

Environment

  • OpenClaw 2026.4.5 (3e72c03)
  • Ubuntu 24.04.4 LTS, Node v24.14.1 via nvm
  • Hardware: 8 vCPU Intel Xeon Gold 6130 @ 2.10 GHz (2017 Skylake, AVX-512), no GPU, 16 GB RAM, no swap
  • Gateway runs as a system-level systemd unit (User=pleresadmin, LoadCredentialEncrypted= for secrets)
  • LiteLLM proxy on 127.0.0.1:4000 fronting Anthropic + other providers
  • Cloudflare Tunnel inbound for msteams

The router agent (where this manifests)

A router agent configured to triage inbound messages and delegate via sessions_spawn:

{
  "id": "router",
  "name": "router",
  "model": "litellm/claude-haiku-4-5",
  "memorySearch": { "enabled": false },
  "skills": [],
  "tools": {
    "profile": "minimal",
    "alsoAllow": ["sessions_spawn"],
    "deny": ["read","edit","write","exec","process","canvas","nodes","cron","message","tts","gateway","agents_list","sessions_list","sessions_history","sessions_send","sessions_yield","subagents","session_status","web_search","web_fetch","image","pdf","browser"]
  },
  "subagents": {
    "allowAgents": ["pleres","family","code","research"],
    "requireAgentId": true
  }
}
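As a quick sanity check on the config above, the following sketch counts the entries in the deny list and confirms that nothing in alsoAllow is also denied. It only inspects the JSON as printed in this issue; it makes no assumption about how OpenClaw itself resolves alsoAllow against deny.

```python
import json

# The router agent config, copied verbatim from the issue.
ROUTER_CONFIG = json.loads("""
{
  "id": "router",
  "name": "router",
  "model": "litellm/claude-haiku-4-5",
  "memorySearch": { "enabled": false },
  "skills": [],
  "tools": {
    "profile": "minimal",
    "alsoAllow": ["sessions_spawn"],
    "deny": ["read","edit","write","exec","process","canvas","nodes","cron","message","tts","gateway","agents_list","sessions_list","sessions_history","sessions_send","sessions_yield","subagents","session_status","web_search","web_fetch","image","pdf","browser"]
  },
  "subagents": {
    "allowAgents": ["pleres","family","code","research"],
    "requireAgentId": true
  }
}
""")

tools = ROUTER_CONFIG["tools"]
deny = tools["deny"]
allow = tools["alsoAllow"]

# Tools that appear in both lists would make the intent ambiguous.
denied_and_allowed = sorted(set(deny) & set(allow))
# Repeated entries would inflate the apparent size of the deny list.
duplicates = sorted({t for t in deny if deny.count(t) > 1})

print(f"deny entries: {len(deny)}")
print(f"in both alsoAllow and deny: {denied_and_allowed}")
print(f"duplicate deny entries: {duplicates}")
```

The count matches the "explicit deny of 23 tools" figure cited in the ruled-out list.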

Steps to reproduce

  1. Configure an agent with the minimal config above (or any agent on a CPU-only host).
  2. Send a Teams "Hi" message via the configured msteams binding.
  3. Observe in /tmp/openclaw/openclaw-<date>.log:
    2026-04-08T19:49:04  [msteams]  received message
    2026-04-08T19:49:04  [msteams]  dispatching to agent
    ...                  (~17 seconds of silence)
  4. Observe in ~/.openclaw/agents/router/sessions/<session>.jsonl:
    "timestamp": "2026-04-08T19:49:21.522Z"   # user message written
    "timestamp": "2026-04-08T19:49:23.215Z"   # assistant reply written (LLM took 1.7s)
  5. Observe [msteams] dispatch complete at 19:49:24.

Total dispatch time: ~20 seconds. LLM call: ~2 seconds. Per-turn startup overhead: ~17 seconds.
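The figures above follow directly from the timestamps in steps 3 to 5. A minimal sketch of the arithmetic, assuming the gateway log and the session JSONL share a clock and that log timestamps without an offset are UTC (the issue does not state the log's time zone):

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp; a trailing 'Z' is treated as UTC."""
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"
    dt = datetime.fromisoformat(ts)
    # Assumption: timestamps without an offset (the gateway log) are UTC too.
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


dispatch_start = parse_ts("2026-04-08T19:49:04")       # [msteams] dispatching to agent
user_written = parse_ts("2026-04-08T19:49:21.522Z")    # user message written
reply_written = parse_ts("2026-04-08T19:49:23.215Z")   # assistant reply written
dispatch_done = parse_ts("2026-04-08T19:49:24")        # [msteams] dispatch complete

pre_llm = (user_written - dispatch_start).total_seconds()
llm = (reply_written - user_written).total_seconds()
total = (dispatch_done - dispatch_start).total_seconds()

print(f"pre-LLM overhead: {pre_llm:.3f}s")
print(f"LLM call:         {llm:.3f}s")
print(f"total dispatch:   {total:.3f}s")
```

Because the gateway log has one-second resolution, the pre-LLM and total figures are accurate to roughly ±1s.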

What we ruled out

We isolated this carefully across many iterations:

  • Not memorySearch: disabling memorySearch.enabled on the agent saved ~10s (22s → 12s) but a ~10s residual remained. Even with memorySearch fully off, the per-turn overhead is still ~17s before LLM call.
  • Not MCP servers: removed all MCP servers (microsoft-azure, microsoft-mssql, microsoft-office365) — they were each adding ~30s of bundle-mcp timeout per call. After removing them, the residual ~17s overhead remained.
  • Not tool schemas: tools.profile=minimal + explicit deny of 23 tools (only sessions_spawn allowed) — verified via tcpdump that the structured tools list passed to the LLM is small. Per-turn overhead unchanged.
  • Not bootstrap files: trimmed the agent's workspace bootstrap from 5668 chars (~1417 tokens) to 1447 chars (~362 tokens). Per-turn overhead unchanged.
  • Not humanDelay / blockStreamingDefault / typingMode: set humanDelay.mode=off, blockStreamingDefault=off, typingMode=instant. Per-turn overhead unchanged.
  • Not the LLM call itself: Direct LiteLLM curl to claude-haiku-4-5 returns in 0.6-2s. Direct curl to ollama for the local Qwen wrapper returns in ~2s warm. The LLM is fast.
  • Not BFS/cloudflared inbound transit: tcpdump on loopback :3978 shows the request reaching the gateway within ~1s of the user hitting send in Teams. The gap is INSIDE the gateway, after the "received message" and "dispatching to agent" log lines and before any LLM call.
  • Not CLI cold start: tested via the gateway path (no --local, no CLI invocation per call). The per-turn overhead is observed via real Teams traffic going through cloudflared → msteams provider → gateway.

What seems to be happening in the 17s

We don't have visibility past the dispatching to agent log line: at the default log level, nothing is logged between dispatch start and dispatch completion. Suspected sources (in priority order):

  1. System prompt / agent context build per turn — re-reading workspace bootstrap files, computing tool schemas, building the structured system message
  2. Subagent capability registration for sessions_spawn
  3. Provider client warmup / token refresh
  4. Some other synchronous initialization step

A debug-level log of the agent run pipeline (or a [agent.run] initialized in <ms>ms line bracketing the per-turn setup vs the LLM call) would make this trivial to diagnose for users.
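The kind of bracketing described above could look roughly like the sketch below. It is written in Python purely for illustration (the gateway runs on Node), and run_turn with its three callables is a hypothetical stand-in, not OpenClaw's actual pipeline or API.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("agent.run")


@contextmanager
def timed(stage: str):
    """Log how long the wrapped block took, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        log.info("[agent.run] %s in %.0fms", stage, elapsed_ms)


def run_turn(build_context, build_tool_schemas, call_llm):
    """Hypothetical per-turn pipeline; the three callables are stand-ins."""
    with timed("context built"):
        context = build_context()
    with timed("tool schemas"):
        schemas = build_tool_schemas()
    log.info("[agent.run] llm call started")
    with timed("llm call completed"):
        return call_llm(context, schemas)
```

With lines like these at info level, the silent gap would split into named stages, and the slow one would be visible without a profiler.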

Severity

Medium-high for CPU-only deployments. The 17s per-turn overhead caps the achievable end-to-end latency at ~20s for any message, regardless of how fast the LLM is. This makes OpenClaw effectively unusable for interactive personal assistant use cases on CPU-only hardware (even with cloud models). It also makes the value of fast/cheap cloud models like Haiku 4.5 invisible to operators on CPU hosts.

Workarounds attempted (none worked)

  • All the tunables listed in "What we ruled out" above
  • Restarting the gateway between calls — the overhead is per-turn, not first-call only
  • Using a streaming-disabled model entry (streaming: false) — no effect on the pre-LLM overhead

Suggested next actions

  1. Add structured timing logs at info level for the per-turn agent run pipeline:
    • [agent.run] context built in <ms>ms
    • [agent.run] tool schemas in <ms>ms
    • [agent.run] llm call started
    • [agent.run] llm call completed in <ms>ms
    This would let users diagnose the gap themselves without filing a vague issue like this one.
  2. Profile a single agent turn on a CPU-only host to find the actual hot spot
  3. Cache per-turn invariants (system prompt template, tool schemas) so they're built once at agent registration time, not on every turn
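The caching idea in item 3 can be sketched as follows. This is an illustration only: AgentSpec, build_invariants, and handle_turn are hypothetical names, and the issue does not confirm that prompt or schema construction is where the 17 seconds go.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Tuple


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical hashable summary of the config that shapes the prompt."""
    agent_id: str
    tool_profile: str
    allowed_tools: Tuple[str, ...]


@lru_cache(maxsize=None)
def build_invariants(spec: AgentSpec) -> dict:
    """Stand-in for the expensive work: runs once per distinct spec."""
    return {
        "system_prompt": f"You are agent {spec.agent_id}.",
        "tool_schemas": [{"name": tool} for tool in spec.allowed_tools],
    }


def handle_turn(spec: AgentSpec, user_message: str) -> dict:
    """Per-turn work is reduced to a cache lookup plus the message itself."""
    invariants = build_invariants(spec)
    return {
        "system": invariants["system_prompt"],
        "tools": invariants["tool_schemas"],
        "user": user_message,
    }


spec = AgentSpec("router", "minimal", ("sessions_spawn",))
first = handle_turn(spec, "Hi")
second = handle_turn(spec, "Hello again")
print(build_invariants.cache_info())
```

The cached dict is shared across turns, so callers should treat it as read-only. A real implementation would also need to invalidate the cache when the agent config or workspace bootstrap files change.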

Hardware note

This is a 2017-era Skylake CPU with no GPU. Faster CPUs (Sapphire Rapids, Genoa) would likely amortize this cost differently, but the issue is still that per-turn work scales with CPU clock instead of being effectively constant. For CPU-bound operators, 17s/turn is a ceiling that no model selection can fix.

Where we landed

We pivoted our router from local Qwen 2.5 7B (which was even slower on this CPU) to cloud Claude Haiku 4.5 expecting sub-3s end-to-end latency. We achieved sub-2s LLM time but the OpenClaw runtime adds 17s on top, giving us ~12s end-to-end perceived latency. This is acceptable but well below what's possible if the per-turn overhead were addressed.

extent analysis

TL;DR

No fix is confirmed. The issue proposes adding structured timing logs and profiling a single agent turn to locate the source of the ~17-second per-turn overhead on CPU-only systems, then caching per-turn invariants to reduce it.

Guidance

  • Add structured timing logs at info level for the per-turn agent run pipeline, so operators can see where the time goes before filing an issue.
  • Profile a single agent turn on a CPU-only host to find the actual hot spot behind the ~17-second overhead.
  • Cache per-turn invariants, such as the system prompt template and tool schemas, so they are built once at agent registration time rather than on every turn.
  • A faster CPU (Sapphire Rapids, Genoa) would likely shrink the cost, but the issue notes that the underlying problem is per-turn work that scales with CPU clock, so an upgrade does not address the cause.

Example

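The issue provides no code. The block below only illustrates the log lines it asks for; the logLevel and agentRunLogs keys do not appear in the issue and should not be read as real OpenClaw settings: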

{
  "logLevel": "info",
  "agentRunLogs": [
    "[agent.run] context built in <ms>ms",
    "[agent.run] tool schemas in <ms>ms",
    "[agent.run] llm call started",
    "[agent.run] llm call completed in <ms>ms"
  ]
}

Notes

The issue was reported on a single CPU-only host, where the ~17-second overhead makes interactive personal assistant use cases impractical. The suggested next actions are meant to diagnose the overhead; none is confirmed to eliminate it.

Recommendation

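Treat the suggestions as proposals for OpenClaw itself rather than operator-side workarounds: the issue reports that no configuration change removed the overhead. Structured timing logs and profiling come first; caching per-turn invariants is worth pursuing only if profiling shows that context building is the hot spot.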
