llamaIndex - ✅(Solved) Fix [Bug]: cache_idx stamps cache_control on every block, exceeding Anthropic's 4-block limit [1 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
run-llama/llama_index#20854Fetched 2026-04-08 00:30:37
View on GitHub
Comments
0
Participants
1
Timeline
5
Reactions
0
Author
Participants
Timeline (top)
referenced ×3closed ×1cross-referenced ×1

Error Message

invalid_request_error: A maximum of 4 blocks with cache_control may be provided. Found 29.

Root Cause

Two issues in llama_index/llms/anthropic/utils.py:

1. blocks_to_anthropic_blocks stamps cache_control on every block with no cap

When a message has cache_control in its additional_kwargs (injected by cache_idx), blocks_to_anthropic_blocks creates a global_cache_control and applies it to every TextBlock, ImageBlock, ToolUseBlock, etc. in that message:

# utils.py, blocks_to_anthropic_blocks()
if kwargs.get("cache_control"):
    global_cache_control = CacheControlEphemeralParam(**kwargs["cache_control"])

for block in blocks:
    if isinstance(block, TextBlock):
        if block.text:
            anthropic_blocks.append(_to_anthropic_text_block(block))
            if global_cache_control:
                anthropic_blocks[-1]["cache_control"] = global_cache_control  # every block gets it

This is fine for typical messages with 1-2 blocks, but AgentWorkflow.generate_structured_response() flattens the entire conversation history into many TextBlocks in a single ChatMessage. In my case this produces ~29 blocks in one message, all stamped with cache_control, exceeding Anthropic's limit of 4.

2. System prompt cache_control is silently discarded

In messages_to_anthropic_messages, system messages are extracted as plain strings, discarding any cache_control markers that were set:

# utils.py, messages_to_anthropic_messages()
if message.role == MessageRole.SYSTEM:
    system_prompt.extend(
        [block.text for block in message.blocks if isinstance(block, TextBlock)]
    )
# ...
return ..., "\n".join(system_prompt)  # plain string, cache_control lost

So even when cache_idx covers the system message, the cache_control is set but then thrown away when the system prompt is extracted as a joined string.

Fix Action

Fixed

PR fix notes

PR #20875: fix: apply cache_control only to last block to respect Anthropic's 4-block limit

Description (problem / solution / changelog)

Description

Fixes #20854

When using AgentWorkflow with output_cls (structured output) and cache_idx set on the Anthropic LLM, the conversation history gets flattened into many TextBlocks in a single ChatMessage. Previously, blocks_to_anthropic_blocks and blocks_to_anthropic_beta_blocks applied cache_control to every block, which exceeded Anthropic's limit of 4 blocks with cache_control per request, causing a 400 error.

Changes

llama_index/llms/anthropic/utils.py:

  • blocks_to_anthropic_blocks: Apply cache_control only to the last block of each message instead of every block. This follows Anthropic's recommended cache breakpoint pattern and stays within the API limit regardless of block count.
  • blocks_to_anthropic_beta_blocks: Same fix, plus:
    • Remove duplicate unreachable CitableBlock branch
    • Move legacy tool_calls compat code outside the per-block loop (was incorrectly running on every iteration)
    • Add CitationBlock handling consistent with the non-beta function

tests/test_anthropic_utils.py:

  • Add TestCacheControlOnlyLastBlock test class with 7 tests covering:
    • cache_control only on last block (both regular and beta)
    • No cache_control when not set in kwargs
    • Single block with cache_control
    • Integration test via messages_to_anthropic_messages with cache_idx
    • 29-block scenario matching the exact issue report

Test Plan

  • All 26 tests pass (pytest tests/test_anthropic_utils.py)
  • ruff check passes
  • ruff format passes

New tests added: TestCacheControlOnlyLastBlock (7 tests)

Changed files

  • llama-index-integrations/llms/llama-index-llms-anthropic/llama_index/llms/anthropic/utils.py (modified, +32/-41)
  • llama-index-integrations/llms/llama-index-llms-anthropic/pyproject.toml (modified, +1/-1)
  • llama-index-integrations/llms/llama-index-llms-anthropic/tests/test_anthropic_utils.py (modified, +132/-0)

Code Example

invalid_request_error: A maximum of 4 blocks with cache_control may be provided. Found 29.

---

# utils.py, blocks_to_anthropic_blocks()
if kwargs.get("cache_control"):
    global_cache_control = CacheControlEphemeralParam(**kwargs["cache_control"])

for block in blocks:
    if isinstance(block, TextBlock):
        if block.text:
            anthropic_blocks.append(_to_anthropic_text_block(block))
            if global_cache_control:
                anthropic_blocks[-1]["cache_control"] = global_cache_control  # every block gets it

---

# utils.py, messages_to_anthropic_messages()
if message.role == MessageRole.SYSTEM:
    system_prompt.extend(
        [block.text for block in message.blocks if isinstance(block, TextBlock)]
    )
# ...
return ..., "\n".join(system_prompt)  # plain string, cache_control lost

---

from llama_index.llms.anthropic import Anthropic
from llama_index.core.agent.workflow import AgentWorkflow
from llama_index.core.tools import FunctionTool
from pydantic import BaseModel

class MyOutput(BaseModel):
    result: str

def my_tool(query: str) -> str:
    """Look something up."""
    return f"answer to {query}"

llm = Anthropic(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    cache_idx=1,  # enable prompt caching
)

agent = AgentWorkflow.from_tools_or_functions(
    tools_or_functions=[FunctionTool.from_defaults(fn=my_tool)],
    llm=llm,
    system_prompt="You are a helpful assistant.",
    output_cls=MyOutput,  # triggers generate_structured_response
)

import asyncio

async def run():
    result = await agent.run(user_msg="Look up foo, then bar, then baz")
    print(result)

asyncio.run(run())

---

anthropic.BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'A maximum of 4 blocks with cache_control may be provided. Found 29.'}}

---

# After building all anthropic_blocks:
if global_cache_control and anthropic_blocks:
    anthropic_blocks[-1]["cache_control"] = global_cache_control
RAW_BUFFERClick to expand / collapse

Bug Description

When using AgentWorkflow with output_cls (structured output) and cache_idx set on the Anthropic LLM, the API returns:

invalid_request_error: A maximum of 4 blocks with cache_control may be provided. Found 29.

Root Cause

Two issues in llama_index/llms/anthropic/utils.py:

1. blocks_to_anthropic_blocks stamps cache_control on every block with no cap

When a message has cache_control in its additional_kwargs (injected by cache_idx), blocks_to_anthropic_blocks creates a global_cache_control and applies it to every TextBlock, ImageBlock, ToolUseBlock, etc. in that message:

# utils.py, blocks_to_anthropic_blocks()
if kwargs.get("cache_control"):
    global_cache_control = CacheControlEphemeralParam(**kwargs["cache_control"])

for block in blocks:
    if isinstance(block, TextBlock):
        if block.text:
            anthropic_blocks.append(_to_anthropic_text_block(block))
            if global_cache_control:
                anthropic_blocks[-1]["cache_control"] = global_cache_control  # every block gets it

This is fine for typical messages with 1-2 blocks, but AgentWorkflow.generate_structured_response() flattens the entire conversation history into many TextBlocks in a single ChatMessage. In my case this produces ~29 blocks in one message, all stamped with cache_control, exceeding Anthropic's limit of 4.

2. System prompt cache_control is silently discarded

In messages_to_anthropic_messages, system messages are extracted as plain strings, discarding any cache_control markers that were set:

# utils.py, messages_to_anthropic_messages()
if message.role == MessageRole.SYSTEM:
    system_prompt.extend(
        [block.text for block in message.blocks if isinstance(block, TextBlock)]
    )
# ...
return ..., "\n".join(system_prompt)  # plain string, cache_control lost

So even when cache_idx covers the system message, the cache_control is set but then thrown away when the system prompt is extracted as a joined string.

Steps to Reproduce

from llama_index.llms.anthropic import Anthropic
from llama_index.core.agent.workflow import AgentWorkflow
from llama_index.core.tools import FunctionTool
from pydantic import BaseModel

class MyOutput(BaseModel):
    result: str

def my_tool(query: str) -> str:
    """Look something up."""
    return f"answer to {query}"

llm = Anthropic(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    cache_idx=1,  # enable prompt caching
)

agent = AgentWorkflow.from_tools_or_functions(
    tools_or_functions=[FunctionTool.from_defaults(fn=my_tool)],
    llm=llm,
    system_prompt="You are a helpful assistant.",
    output_cls=MyOutput,  # triggers generate_structured_response
)

import asyncio

async def run():
    result = await agent.run(user_msg="Look up foo, then bar, then baz")
    print(result)

asyncio.run(run())

After a few tool call rounds, generate_structured_response() flattens the conversation into many TextBlocks in one message. With cache_idx=1, all blocks get cache_control, and the Anthropic API rejects the request.

Relevant Logs/Tracbacks

anthropic.BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'A maximum of 4 blocks with cache_control may be provided. Found 29.'}}

Suggested Fix

In blocks_to_anthropic_blocks, only apply cache_control to the last block in the message (matching Anthropic's recommended pattern for cache breakpoints), rather than every block:

# After building all anthropic_blocks:
if global_cache_control and anthropic_blocks:
    anthropic_blocks[-1]["cache_control"] = global_cache_control

For the system prompt issue, messages_to_anthropic_messages could return the system prompt as a list of content blocks (preserving cache_control) instead of a joined plain string, when cache markers are present.

Environment

  • llama-index-llms-anthropic version: 0.10.10
  • Python 3.12
  • Anthropic API

extent analysis

Fix Plan

1. Modify blocks_to_anthropic_blocks to apply cache_control only to the last block

# utils.py, blocks_to_anthropic_blocks()
if kwargs.get("cache_control"):
    global_cache_control = CacheControlEphemeralParam(**kwargs["cache_control"])

anthropic_blocks = []
for block in blocks:
    if isinstance(block, TextBlock):
        if block.text:
            anthropic_blocks.append(_to_anthropic_text_block(block))

if global_cache_control and anthropic_blocks:
    anthropic_blocks[-1]["cache_control"] = global_cache_control

2. Modify messages_to_anthropic_messages to preserve cache_control in system prompts

# utils.py, messages_to_anthropic_messages()
if message.role == MessageRole.SYSTEM:
    system_prompt = [
        {"text": block.text, "cache_control": block.cache_control}  # preserve cache_control
        for block in message.blocks if isinstance(block, TextBlock)
    ]
# ...
return ..., system_prompt  # return as list of content blocks

3. Update AgentWorkflow to use the modified messages_to_anthropic_messages function

# llama_index/core/agent/workflow.py
from llama_index.llms.anthropic.utils import messages_to_anthropic_messages

class AgentWorkflow:
    # ...

    def generate_structured_response(self, *args, **kwargs):
        # ...
        messages = messages_to_anthropic_messages(self.system_prompt, *args, **kwargs)
        # ...

4. Update the example code to use the modified AgentWorkflow class

# example code
agent = AgentWorkflow.from_tools_or_functions(
    tools_or_functions=[FunctionTool.from_defaults(fn=my_tool)],
    llm=llm,
    system_prompt="You are a helpful assistant.",

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING