crewai - 💡(How to fix) Fix RFC: Session-start prompt-cache preload for crew kickoff (parallel / sequential / shared-prefix strategies) [1 pull requests]

Official PRs (…)
ON THIS PAGE

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

I'd like to gauge interest in a session-start prompt-cache preload for Crew kickoff — a Crew-level hook that fires lightweight cache-warming probes against each Agent's system prompt prefix at crew.kickoff() rather than waiting for the first real agent.execute() call.

Two related observations that motivate this:

  1. Anthropic's prompt cache TTL is 5 minutes. A Crew with N agents typically takes some seconds to construct, validate, route LiteLLM, etc. By the time agent #N actually fires its first task, agent #1's system-prompt cache may already be warming a second time (or not warm at all if the first call hit a different account/region).
  2. CrewAI agents share heavy boilerplate. Role + goal + backstory + tool descriptions often contain large overlapping chunks (project conventions, formatting rules, safety policies). Each agent's system prompt is treated as a separate cache key, paying the warming cost N times.

The proposal: at crew.kickoff(), fire a cheap probe call per agent (1-token completion against the canonical system prompt) so the provider's cache is warm by the time the first real task runs. Optionally, also detect overlapping system-prompt prefixes across agents and emit a single shared prefix to the provider, letting the provider's cache layer dedupe.

Root Cause

Reading CrewAI's Agent.system_prompt construction: role + goal + backstory + tool descriptions concatenate into a prompt that's often kilobytes long, and the boilerplate fraction grows as crews adopt more conventions. Three or four agents in the same crew typically share most of that boilerplate. Without preload, the first agent step in a crew run pays the full cache-write latency; subsequent agents in the same kickoff pay their own warming cost again because their system prefix is a distinct cache key.

Fix Action

Fixed

Code Example

# Opt-in at Crew construction
crew = Crew(
    agents=[a1, a2, a3, a4],
    tasks=[...],
    cache_preload=True,           # default False; opt-in initially
    cache_preload_strategy="parallel",   # or "sequential" / "shared_prefix"
)

crew.kickoff()
# Internally on kickoff:
#
#  if cache_preload:
#      for agent in agents:
#          await llm.preload_probe(
#              system=agent.canonical_system_prompt,
#              tools=agent.tools,
#          )      # 1-token completion, cache-write path
#
#  (with shared_prefix strategy: detect common prefix across agents,
#   emit Anthropic cache_control breakpoint at the prefix boundary)
RAW_BUFFERClick to expand / collapse

Update 2026-05-25: I've edited this issue to clarify the framing. The original wording made stronger empirical claims ("I've been running this as a wrapper", "P50 latency 3.8s→1.1s", "kickoff cost $0.18→$0.13") than I can actually support — the design is from reading CrewAI's Crew / Agent construction code, not from a measured production deployment. The design discussion stands; the specific numbers and personal use-case framing have been removed.


Summary

I'd like to gauge interest in a session-start prompt-cache preload for Crew kickoff — a Crew-level hook that fires lightweight cache-warming probes against each Agent's system prompt prefix at crew.kickoff() rather than waiting for the first real agent.execute() call.

Two related observations that motivate this:

  1. Anthropic's prompt cache TTL is 5 minutes. A Crew with N agents typically takes some seconds to construct, validate, route LiteLLM, etc. By the time agent #N actually fires its first task, agent #1's system-prompt cache may already be warming a second time (or not warm at all if the first call hit a different account/region).
  2. CrewAI agents share heavy boilerplate. Role + goal + backstory + tool descriptions often contain large overlapping chunks (project conventions, formatting rules, safety policies). Each agent's system prompt is treated as a separate cache key, paying the warming cost N times.

The proposal: at crew.kickoff(), fire a cheap probe call per agent (1-token completion against the canonical system prompt) so the provider's cache is warm by the time the first real task runs. Optionally, also detect overlapping system-prompt prefixes across agents and emit a single shared prefix to the provider, letting the provider's cache layer dedupe.

Why kickoff-time, not lazy

Three reasons lazy first-call caching is wrong for Crew specifically:

  1. The first agent step is always the most latency-sensitive — it's the user's "did the crew start?" UX moment. Spending the warming budget before the user is waiting is strictly better than after.
  2. TTL races kickoff time — if Crew assembly takes meaningful wall time and the first agent step fires after that, the warming probe at t=0 is still inside the 5-min TTL when the first real call happens. Without preload, the first call writes the cache (slow path); subsequent calls in the same crew (different agents, same provider, similar prefix) might still miss.
  3. Cross-agent prefix overlap — if N agents in a crew share a large identical prefix (CrewAI's standard role/backstory/format scaffolding adds up), the provider sees N distinct cache keys for the same prefix. Optional: emit a shared "Crew identity" prefix once, then per-agent suffix, so the cache key tree has a shared root.

Design sketch

# Opt-in at Crew construction
crew = Crew(
    agents=[a1, a2, a3, a4],
    tasks=[...],
    cache_preload=True,           # default False; opt-in initially
    cache_preload_strategy="parallel",   # or "sequential" / "shared_prefix"
)

crew.kickoff()
# Internally on kickoff:
#
#  if cache_preload:
#      for agent in agents:
#          await llm.preload_probe(
#              system=agent.canonical_system_prompt,
#              tools=agent.tools,
#          )      # 1-token completion, cache-write path
#
#  (with shared_prefix strategy: detect common prefix across agents,
#   emit Anthropic cache_control breakpoint at the prefix boundary)

preload_probe is a 1-token completion (max_tokens=1, stop at first token) — just enough to make the provider commit the prefix to cache. On Anthropic, this writes the cache; subsequent calls within the 5-min TTL get cache-read pricing.

shared_prefix strategy — when 2+ agents in the crew have system prompts sharing a >= 1024-token prefix (provider cache breakpoint), the cache_control marker is inserted at the prefix boundary so all agents share the same cache key for the common chunk. Per-agent suffix (role/goal/specifics) remains distinct.

Cost model — for Crew sizes ≥ 3 agents with shared boilerplate, the preload cost (N × 1-token completion) is intended to be dwarfed by the saved warming cost on first real call (each first call writes the full system prompt at the cache-write premium). Net positive at N=3+ for any LLM with cache support, in the design's intent — happy to share a more precise cost model if maintainers want to dig in.

What this is NOT

  • Not a generic LLM-cache abstraction — uses the provider's native cache (Anthropic prompt_caching, OpenAI prefix caching, etc.) via existing LiteLLM cache_control plumbing
  • Not changing default behavior — opt-in via cache_preload=True; off by default
  • Not useful for single-agent crews — would no-op (or skip silently); only material at N ≥ 2
  • Not a cross-Crew cache (cache lives in the provider's TTL window; doesn't try to persist across Crew runs)

Why this matters in practice

Reading CrewAI's Agent.system_prompt construction: role + goal + backstory + tool descriptions concatenate into a prompt that's often kilobytes long, and the boilerplate fraction grows as crews adopt more conventions. Three or four agents in the same crew typically share most of that boilerplate. Without preload, the first agent step in a crew run pays the full cache-write latency; subsequent agents in the same kickoff pay their own warming cost again because their system prefix is a distinct cache key.

A kickoff-time preload + optional shared-prefix marker would amortize that warming cost across the crew. This is a code-reading argument about prompt structure; happy to refine based on maintainer experience with real measured latencies.

Questions before I open anything

  1. Roadmap conflict? Is there an existing Crew-startup-perf or prompt-cache initiative I should align with?
  2. Scope preference:
    • (a) minimal demo PR — just cache_preload=True with parallel-probe strategy, no shared-prefix detection
    • (b) full version with all three strategies (parallel / sequential / shared_prefix) + cost/latency telemetry
    • (c) ship as third-party crewai-cache-warmer plugin; just expose the kickoff hook
  3. Provider coverage — should v1 ship Anthropic-only (most mature cache surface) or include OpenAI prefix caching + Gemini context caching from day 1?
  4. canonical_system_prompt — how stable is Agent.system_prompt construction across CrewAI versions? A probe would need the exact-bytes prompt that will be sent at first execute(). Worth a Agent.get_system_prompt_for_caching() accessor?

Not opening a PR yet — gauging interest first.

Thanks!

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING