openclaw - ✅(Solved) Fix HTTP /v1/chat/completions: 10-15s TTFB due to full agent context assembly — needs lightContext/voice mode [1 pull requests, 1 comments, 2 participants]

GitHub stats
openclaw/openclaw#68920 · Fetched 2026-04-20 12:04:26
Comments: 1 · Participants: 2 · Timeline: 3 · Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, referenced ×1

Root Cause

Every HTTP completions request goes through the full agentCommandFromIngress() pipeline, which:

  1. Assembles the complete system prompt (~15-20K tokens) including:
    • Full tool definitions and descriptions
    • Skills catalog
    • Memory recall instructions
    • Workspace bootstrap files (SOUL.md, IDENTITY.md, AGENTS.md, etc.)
    • Runtime metadata
  2. Triggers memorySearch.sync.onSessionStart (QMD re-index) on every call since each gets a unique session key (openai:<uuid>)
  3. Runs bootstrap hooks, heartbeat resolution, etc.

Fix Action

Workaround

Setting memorySearch.sync.onSessionStart: false helps with the QMD re-index on new sessions, but the core issue (15-20K tokens of framework context) remains.
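As a rough illustration of the workaround, the setting can be pictured as a nested config fragment. Only the key path memorySearch.sync.onSessionStart comes from the issue; the surrounding object, the name configPatch, and the file it would live in are assumptions, so check the actual OpenClaw config format before copying this.

```typescript
// Hypothetical config fragment. Only the key path
// memorySearch.sync.onSessionStart is taken from the issue text.
const configPatch = {
  memorySearch: {
    sync: {
      // Skip the QMD re-index that otherwise runs when a new session starts.
      onSessionStart: false,
    },
  },
};
```

This removes only the per-session sync cost; the framework system prompt is still assembled on every request.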

PR fix notes

PR #69060: fix: add x-openclaw-session header for stable agent-scoped HTTP sessions

Description (problem / solution / changelog)

Summary

Partially addresses #68920 (the QMD re-index overhead component).

Every /v1/chat/completions call that omits the OpenAI user field receives a fresh random-UUID session key. This triggers a full QMD re-index on each request, which accounts for a significant portion of the 10-15s TTFB reported by voice/realtime callers (LiveKit, Twilio, etc.).

Voice stacks cannot always inject the user field per-request. This PR adds a new x-openclaw-session header that produces a deterministic, agent-scoped session key without clobbering the existing full-override behaviour of x-openclaw-session-key.

Priority order (unchanged except for the new entry)

Priority | Source | Scope
1 | x-openclaw-session-key header | Explicit full override (unchanged)
2 | user request field | Agent-scoped (unchanged)
3 | x-openclaw-session header (new) | Agent-scoped stable hint
4 | Random UUID | Per-call fallback (unchanged)
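The priority order above can be sketched as a small resolver. This is not the code from src/gateway/http-utils.ts: the function name resolveSessionKey, the IngressRequest shape, and the key formats for priorities 2 and 3 are illustrative assumptions. Only the header names, the user field, and the openai:<uuid> fallback shape come from the PR and issue text.

```typescript
// Hypothetical request shape; header names are expected in lower case.
interface IngressRequest {
  headers: Record<string, string | undefined>;
  body: { user?: string };
}

// Stand-in for a UUID generator so the sketch has no runtime dependencies.
const randomId = (): string => Math.random().toString(36).slice(2);

// Resolves a session key following the priority table.
// The "<agent>:user:" and "<agent>:session:" formats are assumptions.
function resolveSessionKey(
  agentId: string,
  req: IngressRequest,
  fallbackId: () => string = randomId,
): string {
  // 1. Explicit full override, used verbatim and not agent-scoped.
  const override = req.headers["x-openclaw-session-key"]?.trim();
  if (override) return override;

  // 2. OpenAI-style user field, scoped to the agent.
  const user = req.body.user?.trim();
  if (user) return `${agentId}:user:${user}`;

  // 3. New stable hint header, also scoped to the agent.
  const hint = req.headers["x-openclaw-session"]?.trim();
  if (hint) return `${agentId}:session:${hint}`;

  // 4. Per-call fallback: a fresh key on every request.
  return `openai:${fallbackId()}`;
}
```

Because priorities 2 and 3 embed the agent identifier, two agents that receive the same x-openclaw-session value end up with different keys, which is the scoping behaviour the test plan checks.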

Usage

Voice/realtime callers can now pass a stable call identifier:

POST /v1/chat/completions
x-openclaw-session: livekit-room-abc123

The gateway resolves this to {prefix}-session:livekit-room-abc123 scoped to the agent, so warm-session cache hits apply and QMD sync is skipped on subsequent requests in the same call.
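On the caller side, a voice stack could attach the same call identifier to every turn. The helper below is a minimal sketch: buildVoiceTurnRequest and ChatRequestInit are hypothetical names, the body follows the usual OpenAI chat-completions shape, and authentication headers are omitted because the source does not describe them.

```typescript
// Fetch-style options for one chat-completions turn (hypothetical type).
interface ChatRequestInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the request for one turn of a call. Passing the same callId for
// every turn lets the gateway reuse one session for the whole call.
function buildVoiceTurnRequest(
  callId: string,
  userText: string,
  model: string,
): ChatRequestInit {
  return {
    method: "POST",
    headers: {
      "content-type": "application/json",
      // Stable hint header added by PR #69060.
      "x-openclaw-session": callId,
    },
    body: JSON.stringify({
      model,
      stream: true,
      messages: [{ role: "user", content: userText }],
    }),
  };
}
```

The result can be passed as the second argument to fetch against the gateway's /v1/chat/completions URL, using a room or call identifier such as livekit-room-abc123 as callId.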

What this does NOT fix

The core 15-20K-token framework system prompt assembly still runs on every request. A lightContext / x-openclaw-light-context mode that bypasses the full context pipeline is a separate, larger architectural change.

Test plan

  • Without x-openclaw-session: behaviour unchanged — new UUID session per call
  • With x-openclaw-session: my-id: same stable session key across calls (verify no QMD re-index on second call)
  • user field still takes priority over x-openclaw-session
  • x-openclaw-session-key still takes highest priority, overriding both
  • Two different agents with the same x-openclaw-session value get different session keys (agent scoping confirmed)

Changed files

  • src/gateway/http-utils.ts (modified, +23/-2)
RAW_BUFFER

Problem

The HTTP /v1/chat/completions endpoint takes 10-15 seconds TTFB for a simple "say hi" prompt, making it unusable for real-time voice agents (LiveKit, Twilio, etc.).

Direct OpenAI API calls return in ~400ms for the same prompt.

Root Cause

Every HTTP completions request goes through the full agentCommandFromIngress() pipeline, which:

  1. Assembles the complete system prompt (~15-20K tokens) including:
    • Full tool definitions and descriptions
    • Skills catalog
    • Memory recall instructions
    • Workspace bootstrap files (SOUL.md, IDENTITY.md, AGENTS.md, etc.)
    • Runtime metadata
  2. Triggers memorySearch.sync.onSessionStart (QMD re-index) on every call since each gets a unique session key (openai:<uuid>)
  3. Runs bootstrap hooks, heartbeat resolution, etc.

Evidence

Config | Workspace Size | Prompt Tokens | TTFB
Full agent (Cassius) | 42KB | 18,718 | 10.8s
Mini agent (MiniSevro) | 20KB | 15,387 | 6.7s
Bare minimum (2-line SOUL, tools.profile: minimal, skipBootstrap: true, all sync disabled) | ~200 bytes | 19,713 | 9.5s
Direct OpenAI (same model) | N/A | ~20 | 0.4s

The bare minimum agent still has 19,713 prompt tokens — the OpenClaw framework system prompt dominates regardless of workspace size.

Impact

This makes the HTTP API unusable for:

  • Real-time voice agents (LiveKit, Twilio Media Streams)
  • Low-latency chatbots
  • Any use case requiring sub-2s TTFB

Proposed Solutions

1. lightContext mode for HTTP API (highest impact)

Add a header like x-openclaw-light-context: true that:

  • Skips tool definitions injection
  • Skips skills catalog
  • Skips memory recall instructions
  • Only includes SOUL.md/IDENTITY.md (personality) or a custom slim system prompt
  • Skips QMD sync entirely
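One way to picture the proposal is as a planning step that decides which context sections to assemble before any expensive work runs. The sketch below is hypothetical throughout: lightContext is not implemented, and planContext, ContextPlan, and the section names are invented for illustration, loosely mirroring the items listed under Root Cause.

```typescript
// Section names loosely mirror the Root Cause list; all are illustrative.
type ContextSection =
  | "tools"
  | "skills"
  | "memoryRecall"
  | "bootstrapFiles" // full set, which includes SOUL.md and IDENTITY.md
  | "runtimeMetadata"
  | "personality"; // SOUL.md / IDENTITY.md only

interface ContextPlan {
  sections: ContextSection[];
  runQmdSync: boolean;
}

// Decides what to assemble based on the proposed opt-in header.
function planContext(headers: Record<string, string | undefined>): ContextPlan {
  const flag = headers["x-openclaw-light-context"]?.trim().toLowerCase();
  if (flag === "true") {
    // Light mode: personality only, and no QMD sync at all.
    return { sections: ["personality"], runQmdSync: false };
  }
  // Default: the full pipeline, as today.
  return {
    sections: ["tools", "skills", "memoryRecall", "bootstrapFiles", "runtimeMetadata"],
    runQmdSync: true,
  };
}
```

Making the decision up front keeps the default path unchanged for existing callers, while light requests never enter the tool, skills, or memory assembly steps.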

2. Stable session keys for HTTP API

Instead of generating openai:<random-uuid> per call, allow an x-openclaw-session header to reuse a session key, so that the sessionWarm cache hits and the QMD re-index no longer runs on every call.

3. Per-agent system prompt override for HTTP

Allow agents to define a http.systemPrompt that replaces the full framework prompt when calls come via HTTP API.
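A minimal sketch of how such an override might be selected is shown below. The http.systemPrompt key is taken from the proposal; AgentConfig, Channel, selectSystemPrompt, and the buildFrameworkPrompt callback are hypothetical names, and nothing here reflects an existing OpenClaw API.

```typescript
// Hypothetical agent config carrying the proposed http.systemPrompt key.
interface AgentConfig {
  http?: { systemPrompt?: string };
}

type Channel = "http" | "other";

// Returns the slim override for HTTP calls when one is configured.
// The framework prompt is built lazily, so the expensive assembly is
// skipped entirely whenever the override applies.
function selectSystemPrompt(
  agent: AgentConfig,
  channel: Channel,
  buildFrameworkPrompt: () => string,
): string {
  const override = agent.http?.systemPrompt?.trim();
  if (channel === "http" && override) return override;
  return buildFrameworkPrompt();
}
```

Passing the full prompt as a callback rather than a precomputed string is the key design choice: the saving comes from never running the assembly, not from discarding its result.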

Workaround

Setting memorySearch.sync.onSessionStart: false helps with the QMD re-index on new sessions, but the core issue (15-20K tokens of framework context) remains.

Environment

  • OpenClaw version: latest (npm global)
  • macOS arm64
  • LLM: openai/gpt-5.4-mini via gateway
  • Voice stack: LiveKit + faster-whisper STT + ElevenLabs TTS

extent analysis

TL;DR

Implementing a lightContext mode for the HTTP API by adding a header like x-openclaw-light-context: true could significantly reduce the Time To First Byte (TTFB) by skipping unnecessary data injections and operations.

Guidance

  • Verify the impact of the lightContext mode: once such a mode exists, send requests with the x-openclaw-light-context: true header and measure the effect on TTFB.
  • Use stable session keys: PR #69060 adds the x-openclaw-session header, which reuses an agent-scoped session key across requests so that the QMD re-index is skipped after the first call.
  • Assess per-agent system prompt overrides: Explore allowing agents to define a custom http.systemPrompt to replace the full framework prompt for HTTP API calls, which might further optimize performance.
  • Review the effectiveness of the memorySearch.sync.onSessionStart: false workaround: While this setting helps with QMD re-indexing, it does not address the core issue of large framework context; thus, its long-term utility is limited.

Example

No specific code example is provided due to the lack of explicit code snippets in the issue description. However, implementing the lightContext mode could involve conditional logic based on the presence of the x-openclaw-light-context header to skip certain operations.

Notes

The proposed solutions aim to address the high TTFB issue by reducing the amount of data processed and the number of operations performed for each HTTP request. The effectiveness of these solutions may vary depending on the specific use case and the underlying infrastructure.

Recommendation

Prioritize the proposed lightContext mode (for example, an x-openclaw-light-context: true header). It directly addresses the core issue of excessive framework context and has the potential to significantly reduce TTFB. It is not implemented by PR #69060, which covers only stable session keys.
