llamaIndex - ✅(Solved) Fix [Bug]: cache_idx stamps cache_control on every block, exceeding Anthropic's 4-block limit [1 pull requests, 1 participants]

llamaIndex2026-03-03 02:45:45

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

run-llama/llama_index#20854•Fetched 2026-04-08 00:30:37

View on GitHub

Comments

Participants

Timeline

Reactions

Author

lestephen

Participants

lestephen

Timeline (top)

referenced ×3closed ×1cross-referenced ×1

Error Message

invalid_request_error: A maximum of 4 blocks with cache_control may be provided. Found 29.

Root Cause

Two issues in llama_index/llms/anthropic/utils.py:

1. blocks_to_anthropic_blocks stamps cache_control on every block with no cap

When a message has cache_control in its additional_kwargs (injected by cache_idx), blocks_to_anthropic_blocks creates a global_cache_control and applies it to every TextBlock, ImageBlock, ToolUseBlock, etc. in that message:

# utils.py, blocks_to_anthropic_blocks()
if kwargs.get("cache_control"):
    global_cache_control = CacheControlEphemeralParam(**kwargs["cache_control"])

for block in blocks:
    if isinstance(block, TextBlock):
        if block.text:
            anthropic_blocks.append(_to_anthropic_text_block(block))
            if global_cache_control:
                anthropic_blocks[-1]["cache_control"] = global_cache_control  # every block gets it

This is fine for typical messages with 1-2 blocks, but AgentWorkflow.generate_structured_response() flattens the entire conversation history into many TextBlocks in a single ChatMessage. In my case this produces ~29 blocks in one message, all stamped with cache_control, exceeding Anthropic's limit of 4.

2. System prompt cache_control is silently discarded

In messages_to_anthropic_messages, system messages are extracted as plain strings, discarding any cache_control markers that were set:

# utils.py, messages_to_anthropic_messages()
if message.role == MessageRole.SYSTEM:
    system_prompt.extend(
        [block.text for block in message.blocks if isinstance(block, TextBlock)]
    )
# ...
return ..., "\n".join(system_prompt)  # plain string, cache_control lost

So even when cache_idx covers the system message, the cache_control is set but then thrown away when the system prompt is extracted as a joined string.

Fix Action

Fixed

Fixed by PR: fix: apply cache_control only to last block to respect Anthropic's 4-block limit (https://github.com/run-llama/llama_index/pull/20875)

PR fix notes

PR #20875: fix: apply cache_control only to last block to respect Anthropic's 4-block limit

Repository: run-llama/llama_index
Author: weiguangli-io
State: closed | merged: True
Link: https://github.com/run-llama/llama_index/pull/20875

Description (problem / solution / changelog)

Description

Fixes #20854

When using AgentWorkflow with output_cls (structured output) and cache_idx set on the Anthropic LLM, the conversation history gets flattened into many TextBlocks in a single ChatMessage. Previously, blocks_to_anthropic_blocks and blocks_to_anthropic_beta_blocks applied cache_control to every block, which exceeded Anthropic's limit of 4 blocks with cache_control per request, causing a 400 error.

Changes

llama_index/llms/anthropic/utils.py:

blocks_to_anthropic_blocks: Apply cache_control only to the last block of each message instead of every block. This follows Anthropic's recommended cache breakpoint pattern and stays within the API limit regardless of block count.
blocks_to_anthropic_beta_blocks: Same fix, plus:
- Remove duplicate unreachable CitableBlock branch
- Move legacy tool_calls compat code outside the per-block loop (was incorrectly running on every iteration)
- Add CitationBlock handling consistent with the non-beta function

tests/test_anthropic_utils.py:

Add TestCacheControlOnlyLastBlock test class with 7 tests covering:
- cache_control only on last block (both regular and beta)
- No cache_control when not set in kwargs
- Single block with cache_control
- Integration test via messages_to_anthropic_messages with cache_idx
- 29-block scenario matching the exact issue report

Test Plan

All 26 tests pass (pytest tests/test_anthropic_utils.py)
ruff check passes
ruff format passes

New tests added: TestCacheControlOnlyLastBlock (7 tests)

Changed files

llama-index-integrations/llms/llama-index-llms-anthropic/llama_index/llms/anthropic/utils.py (modified, +32/-41)
llama-index-integrations/llms/llama-index-llms-anthropic/pyproject.toml (modified, +1/-1)
llama-index-integrations/llms/llama-index-llms-anthropic/tests/test_anthropic_utils.py (modified, +132/-0)

Code Example

invalid_request_error: A maximum of 4 blocks with cache_control may be provided. Found 29.

---

# utils.py, blocks_to_anthropic_blocks()
if kwargs.get("cache_control"):
    global_cache_control = CacheControlEphemeralParam(**kwargs["cache_control"])

for block in blocks:
    if isinstance(block, TextBlock):
        if block.text:
            anthropic_blocks.append(_to_anthropic_text_block(block))
            if global_cache_control:
                anthropic_blocks[-1]["cache_control"] = global_cache_control  # every block gets it

---

# utils.py, messages_to_anthropic_messages()
if message.role == MessageRole.SYSTEM:
    system_prompt.extend(
        [block.text for block in message.blocks if isinstance(block, TextBlock)]
    )
# ...
return ..., "\n".join(system_prompt)  # plain string, cache_control lost

---

from llama_index.llms.anthropic import Anthropic
from llama_index.core.agent.workflow import AgentWorkflow
from llama_index.core.tools import FunctionTool
from pydantic import BaseModel

class MyOutput(BaseModel):
    result: str

def my_tool(query: str) -> str:
    """Look something up."""
    return f"answer to {query}"

llm = Anthropic(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    cache_idx=1,  # enable prompt caching
)

agent = AgentWorkflow.from_tools_or_functions(
    tools_or_functions=[FunctionTool.from_defaults(fn=my_tool)],
    llm=llm,
    system_prompt="You are a helpful assistant.",
    output_cls=MyOutput,  # triggers generate_structured_response
)

import asyncio

async def run():
    result = await agent.run(user_msg="Look up foo, then bar, then baz")
    print(result)

asyncio.run(run())

---

anthropic.BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'A maximum of 4 blocks with cache_control may be provided. Found 29.'}}

---

# After building all anthropic_blocks:
if global_cache_control and anthropic_blocks:
    anthropic_blocks[-1]["cache_control"] = global_cache_control

RAW_BUFFERClick to expand / collapse

Bug Description

When using AgentWorkflow with output_cls (structured output) and cache_idx set on the Anthropic LLM, the API returns:

invalid_request_error: A maximum of 4 blocks with cache_control may be provided. Found 29.

Root Cause

Two issues in llama_index/llms/anthropic/utils.py:

1. blocks_to_anthropic_blocks stamps cache_control on every block with no cap

# utils.py, blocks_to_anthropic_blocks()
if kwargs.get("cache_control"):
    global_cache_control = CacheControlEphemeralParam(**kwargs["cache_control"])

for block in blocks:
    if isinstance(block, TextBlock):
        if block.text:
            anthropic_blocks.append(_to_anthropic_text_block(block))
            if global_cache_control:
                anthropic_blocks[-1]["cache_control"] = global_cache_control  # every block gets it

2. System prompt cache_control is silently discarded

In messages_to_anthropic_messages, system messages are extracted as plain strings, discarding any cache_control markers that were set:

# utils.py, messages_to_anthropic_messages()
if message.role == MessageRole.SYSTEM:
    system_prompt.extend(
        [block.text for block in message.blocks if isinstance(block, TextBlock)]
    )
# ...
return ..., "\n".join(system_prompt)  # plain string, cache_control lost

So even when cache_idx covers the system message, the cache_control is set but then thrown away when the system prompt is extracted as a joined string.

Steps to Reproduce

from llama_index.llms.anthropic import Anthropic
from llama_index.core.agent.workflow import AgentWorkflow
from llama_index.core.tools import FunctionTool
from pydantic import BaseModel

class MyOutput(BaseModel):
    result: str

def my_tool(query: str) -> str:
    """Look something up."""
    return f"answer to {query}"

llm = Anthropic(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    cache_idx=1,  # enable prompt caching
)

agent = AgentWorkflow.from_tools_or_functions(
    tools_or_functions=[FunctionTool.from_defaults(fn=my_tool)],
    llm=llm,
    system_prompt="You are a helpful assistant.",
    output_cls=MyOutput,  # triggers generate_structured_response
)

import asyncio

async def run():
    result = await agent.run(user_msg="Look up foo, then bar, then baz")
    print(result)

asyncio.run(run())

After a few tool call rounds, generate_structured_response() flattens the conversation into many TextBlocks in one message. With cache_idx=1, all blocks get cache_control, and the Anthropic API rejects the request.

Relevant Logs/Tracbacks

anthropic.BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'A maximum of 4 blocks with cache_control may be provided. Found 29.'}}

Suggested Fix

In blocks_to_anthropic_blocks, only apply cache_control to the last block in the message (matching Anthropic's recommended pattern for cache breakpoints), rather than every block:

# After building all anthropic_blocks:
if global_cache_control and anthropic_blocks:
    anthropic_blocks[-1]["cache_control"] = global_cache_control

For the system prompt issue, messages_to_anthropic_messages could return the system prompt as a list of content blocks (preserving cache_control) instead of a joined plain string, when cache markers are present.

Environment

llama-index-llms-anthropic version: 0.10.10
Python 3.12
Anthropic API

extent analysis

Fix Plan

1. Modify `blocks_to_anthropic_blocks` to apply `cache_control` only to the last block

# utils.py, blocks_to_anthropic_blocks()
if kwargs.get("cache_control"):
    global_cache_control = CacheControlEphemeralParam(**kwargs["cache_control"])

anthropic_blocks = []
for block in blocks:
    if isinstance(block, TextBlock):
        if block.text:
            anthropic_blocks.append(_to_anthropic_text_block(block))

if global_cache_control and anthropic_blocks:
    anthropic_blocks[-1]["cache_control"] = global_cache_control

2. Modify `messages_to_anthropic_messages` to preserve `cache_control` in system prompts

# utils.py, messages_to_anthropic_messages()
if message.role == MessageRole.SYSTEM:
    system_prompt = [
        {"text": block.text, "cache_control": block.cache_control}  # preserve cache_control
        for block in message.blocks if isinstance(block, TextBlock)
    ]
# ...
return ..., system_prompt  # return as list of content blocks

3. Update `AgentWorkflow` to use the modified `messages_to_anthropic_messages` function

# llama_index/core/agent/workflow.py
from llama_index.llms.anthropic.utils import messages_to_anthropic_messages

class AgentWorkflow:
    # ...

    def generate_structured_response(self, *args, **kwargs):
        # ...
        messages = messages_to_anthropic_messages(self.system_prompt, *args, **kwargs)
        # ...

4. Update the example code to use the modified `AgentWorkflow` class

# example code
agent = AgentWorkflow.from_tools_or_functions(
    tools_or_functions=[FunctionTool.from_defaults(fn=my_tool)],
    llm=llm,
    system_prompt="You are a helpful assistant.",

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #installation #conversation history #embedding generation #cache error #prompt issue

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

llamaIndex - ✅(Solved) Fix [Bug]: cache_idx stamps cache_control on every block, exceeding Anthropic's 4-block limit [1 pull requests, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fixed

PR fix notes

PR #20875: fix: apply cache_control only to last block to respect Anthropic's 4-block limit

Description (problem / solution / changelog)

Description

Changes

Test Plan

Changed files

Code Example

Bug Description

Root Cause

Steps to Reproduce

Relevant Logs/Tracbacks

Suggested Fix

Environment

extent analysis

Fix Plan

1. Modify `blocks_to_anthropic_blocks` to apply `cache_control` only to the last block

2. Modify `messages_to_anthropic_messages` to preserve `cache_control` in system prompts

3. Update `AgentWorkflow` to use the modified `messages_to_anthropic_messages` function

4. Update the example code to use the modified `AgentWorkflow` class

Still need to ship something?

TRENDING

llamaIndex - ✅(Solved) Fix [Bug]: cache_idx stamps cache_control on every block, exceeding Anthropic's 4-block limit [1 pull requests, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fixed

PR fix notes

PR #20875: fix: apply cache_control only to last block to respect Anthropic's 4-block limit

Description (problem / solution / changelog)

Description

Changes

Test Plan

Changed files

Code Example

Bug Description

Root Cause

Steps to Reproduce

Relevant Logs/Tracbacks

Suggested Fix

Environment

extent analysis

Fix Plan

1. Modify blocks_to_anthropic_blocks to apply cache_control only to the last block

2. Modify messages_to_anthropic_messages to preserve cache_control in system prompts

3. Update AgentWorkflow to use the modified messages_to_anthropic_messages function

4. Update the example code to use the modified AgentWorkflow class

Still need to ship something?

RELATED_DISCOVERY

TRENDING

1. Modify `blocks_to_anthropic_blocks` to apply `cache_control` only to the last block

2. Modify `messages_to_anthropic_messages` to preserve `cache_control` in system prompts

3. Update `AgentWorkflow` to use the modified `messages_to_anthropic_messages` function

4. Update the example code to use the modified `AgentWorkflow` class