langchain - ✅(Solved) Fix ChatAnthropic: `usage_metadata.input_token_details.cache_creation` is 0 when tokens were written to cache [1 pull requests, 1 comments, 2 participants]

GitHub stats
langchain-ai/langchain#36991 (fetched 2026-04-25 06:03:13)
Comments: 1
Participants: 2
Timeline events: 5
Reactions: 0
Timeline (top): closed ×1, commented ×1, cross-referenced ×1, labeled ×1

Fix Action

Fixed

PR fix notes

PR #2907: feat(sdk): add fork mode to subagents for prompt-cache reuse

Description (problem / solution / changelog)

Summary

Adds an opt-in fork: bool field to SubAgent. When true, the subagent inherits the parent's composed system prompt and full message history as its prefix, seeds the task description as an additional HumanMessage, and defaults its model to the parent's (mismatch raises at build time).

The prefix alignment is what unlocks prompt-cache reuse on every supported provider — Anthropic via cache_control markers, OpenAI via the automatic Responses-API cache, Gemini 2.5 via implicit caching, and OpenRouter via pass-through. Isolation semantics are unchanged: only the fork's final message is surfaced back to the parent as a ToolMessage, and fork intermediate state is not written back.

To keep the main agent from wasting tokens re-stating context the fork already has, the task tool description annotates forked subagents with [forked — inherits full conversation context] and appends a usage-guidance block instructing the caller to pass only the task delta. Fork invocations are tagged ls_agent_type="fork-subagent" so LangSmith can filter them separately (needed to measure cache-read savings in production).

CompiledSubAgent + fork=True is rejected at build time — compiled subagents own their own system prompt and graph, so splicing in the parent prefix would be ambiguous.
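
The seeding rule described in the summary can be sketched in a few lines. This is only an illustration under stated assumptions: `Msg` and `seed_subagent_messages` are hypothetical names, not part of deepagents or LangChain, and the real implementation also composes the system prompt and validates the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Msg:
    """Hypothetical stand-in for a chat message (not a LangChain class)."""
    role: str
    content: str


def seed_subagent_messages(parent_messages, description, fork):
    """Return the message list a subagent would start with (illustrative only)."""
    task = Msg("human", description)
    if fork:
        # Forked: parent history comes first, so the subagent's prefix
        # lines up with the parent's and can hit the same prompt cache.
        return [*parent_messages, task]
    # Non-forked: the subagent sees only the task description.
    return [task]


parent = [
    Msg("human", "Summarize the design doc."),
    Msg("ai", "Here is a summary..."),
]
forked = seed_subagent_messages(parent, "Review section 2.", fork=True)
plain = seed_subagent_messages(parent, "Review section 2.", fork=False)
print(len(forked), len(plain))
```

The forked list keeps the parent's messages as an unchanged prefix, which is the property the provider-side caches rely on.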

Provider coverage

Fork itself is provider-neutral; each provider's existing cache path realizes the savings once the prefix aligns:

Provider | How cache kicks in | Change required?
Anthropic | AnthropicPromptCachingMiddleware already marks the system prompt as cacheable; shared prefix → hit | None
OpenAI | Automatic server-side (≥1024 tokens, Responses API); _openai.py already forces use_responses_api=True | None
Gemini 2.5 | Implicit caching, 75% automatic discount on shared prefix | None
OpenRouter | Sticky routing + pass-through from the underlying provider | None

Explicit Gemini caching is skipped (incompatible with LangChain tool binding per langchain-google#1528). Anthropic-via-OpenRouter cache_control passthrough is a pre-existing deepagents limitation (AnthropicPromptCachingMiddleware no-ops on ChatOpenRouter) and is out of scope here.

Usage

from deepagents import create_deep_agent

agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-5-20250929",
    system_prompt="<large shared instructions>",
    tools=[...],
    subagents=[
        {
            "name": "reviewer",
            "description": "Review a single artifact.",
            "system_prompt": "You review one artifact at a time.",
            "tools": [...],
            "fork": True,  # inherits parent prefix for cache reuse
        },
    ],
)

Test results

Unit tests: libs/deepagents/tests/unit_tests/test_fork_subagents.py (9 new, all pass):

  • test_fork_prepends_parent_messages_when_seeding_state — fork's state["messages"] is parent history + HumanMessage(description); non-fork is just the description.
  • test_fork_sets_ls_agent_type_to_fork_subagent — telemetry tag is "fork-subagent" for forks, "subagent" otherwise.
  • test_forked_subagent_rendered_with_marker_and_guidance — task tool description shows [forked — inherits full conversation context] and appends the usage-guidance block.
  • test_no_fork_no_marker_or_guidance — zero forks → no marker, no guidance.
  • test_fork_composes_parent_prefix_and_inherits_message_history — captures the actual SystemMessage the fork's model receives; asserts it contains parent prefix + fork suffix and that parent HumanMessage is in the fork's message list.
  • test_fork_without_model_inherits_parent_model — no model → defaults to parent.
  • test_fork_with_mismatched_model_raises — different model class → ValueError.
  • test_compiled_subagent_with_fork_raises — guards compiled + fork combo.
  • test_subagent_intermediate_messages_do_not_leak_to_parent — isolation guarantee.

Full suite: 1186 passed, 84 skipped, 4 xfailed (uv run --group test pytest tests/unit_tests/ --no-cov --disable-socket --allow-unix-socket).

Live integration test: libs/deepagents/tests/integration_tests/test_fork_caching_anthropic.py (2 tests against Claude Haiku 4.5, claude-haiku-4-5-20251001):

Test | Classification | cache_creation | cache_read
fork=True, invocation 1 | fork-subagent | 469 | 9683
fork=True, invocation 2 | fork-subagent | 469 | 9683
fork=False, invocation 1 | subagent | 0 | 0
fork=False, invocation 2 | subagent | 0 | 0

Numbers captured via on_chat_model_start / on_llm_end callbacks, classifying each LLM call by system-message content (ls_agent_type propagates through the LangSmith tracer, not the callback metadata channel). The fork positive case reads ~80% of its input from cache on both runs; the non-fork negative control shows zero cache activity — isolating the fork flag as the cause of the savings.

The test uses a shared system prompt sized above Haiku's 2048-token minimum cacheable-block floor. It reads the raw Anthropic cache_creation_input_tokens / cache_read_input_tokens fields directly because LangChain's normalized input_token_details.cache_creation is zeroed when the ephemeral TTL breakdown is present (see langchain-ai/langchain#36991).
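
A cache-read share like the "~80%" figure above can be computed from the raw Anthropic usage fields. The sketch below assumes each call's raw usage is available as a plain dict; `cache_read_share` is a hypothetical helper and the numbers are made up for illustration, not taken from the test run.

```python
def cache_read_share(raw_usages):
    """Fraction of all input tokens that were served from cache."""
    read = sum(u.get("cache_read_input_tokens") or 0 for u in raw_usages)
    written = sum(u.get("cache_creation_input_tokens") or 0 for u in raw_usages)
    # In the raw Anthropic payload, input_tokens counts only uncached input.
    uncached = sum(u.get("input_tokens") or 0 for u in raw_usages)
    total = read + written + uncached
    return read / total if total else 0.0


# Hypothetical usage dicts for two LLM calls (illustrative values only).
calls = [
    {"input_tokens": 100, "cache_creation_input_tokens": 100, "cache_read_input_tokens": 800},
    {"input_tokens": 200, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 800},
]
print(cache_read_share(calls))
```

The denominator adds all three counters because the raw `input_tokens` field excludes tokens that were read from or written to the cache.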

Test Plan

  • uv run --group test pytest tests/unit_tests/ — 1186 passed
  • uv run --group test pytest tests/integration_tests/test_fork_caching_anthropic.py — 2 passed (live Haiku 4.5)
  • uv run --all-groups ruff check deepagents tests/
  • uv run --all-groups ruff format --diff
  • uv run --all-groups ty check deepagents

Changed files

  • libs/cli/deepagents_cli/_env_vars.py (modified, +3/-0)
  • libs/cli/deepagents_cli/agent.py (modified, +12/-3)
  • libs/cli/tests/unit_tests/test_agent.py (modified, +56/-0)
  • libs/deepagents/deepagents/graph.py (modified, +196/-51)
  • libs/deepagents/deepagents/middleware/async_subagents.py (modified, +12/-5)
  • libs/deepagents/deepagents/middleware/subagents.py (modified, +283/-28)
  • libs/deepagents/tests/integration_tests/fork_cache_utils.py (added, +191/-0)
  • libs/deepagents/tests/integration_tests/test_fork_caching_anthropic.py (added, +38/-0)
  • libs/deepagents/tests/integration_tests/test_fork_caching_openai.py (added, +35/-0)
  • libs/deepagents/tests/unit_tests/test_fork_subagents.py (added, +761/-0)


ChatAnthropic: usage_metadata.input_token_details.cache_creation reports 0 when tokens were written to cache

Checked other resources

  • This is a bug, not a usage question.
  • I searched existing issues — closest is #32818, which is closed and focuses on input_tokens accuracy. The specific symptom below (normalized cache_creation is 0 while ephemeral_5m_input_tokens and raw cache_creation_input_tokens are both correct) does not appear to be filed.
  • The bug reproduces on the latest stable langchain-anthropic.

Example code

import json
import uuid

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage

LARGE_SYSTEM_PROMPT_TEXT = (
    f"nonce-{uuid.uuid4()}\n"  # force a fresh cache write on every run
    + ("You are a helpful research assistant specialized in long-form technical "
       "documentation synthesis. ") * 800
)

system = SystemMessage(
    content=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT_TEXT,
            "cache_control": {"type": "ephemeral", "ttl": "5m"},
        }
    ]
)

llm = ChatAnthropic(model="claude-haiku-4-5-20251001", temperature=0)
ai = llm.invoke([system, HumanMessage("hi")])

itd = ai.usage_metadata["input_token_details"]
raw = ai.response_metadata["usage"]["cache_creation_input_tokens"]

print("normalized input_token_details.cache_creation     =", itd["cache_creation"])
print("raw    response_metadata.cache_creation_input_tokens =", raw)
print("sum    ephemeral_5m + ephemeral_1h                   =",
      itd.get("ephemeral_5m_input_tokens", 0) + itd.get("ephemeral_1h_input_tokens", 0))

Observed

normalized input_token_details.cache_creation     = 0
raw    response_metadata.cache_creation_input_tokens = 12030
sum    ephemeral_5m + ephemeral_1h                   = 12030

Full usage_metadata:

{
  "input_tokens": 12037,
  "output_tokens": 92,
  "total_tokens": 12129,
  "input_token_details": {
    "cache_read": 0,
    "cache_creation": 0,
    "ephemeral_5m_input_tokens": 12030,
    "ephemeral_1h_input_tokens": 0
  }
}

Full response_metadata["usage"]:

{
  "cache_creation": {
    "ephemeral_1h_input_tokens": 0,
    "ephemeral_5m_input_tokens": 12030
  },
  "cache_creation_input_tokens": 12030,
  "cache_read_input_tokens": 0,
  "input_tokens": 7,
  "output_tokens": 92
}

Expected

input_token_details["cache_creation"] should equal the number of tokens written to cache on this request — i.e., cache_creation_input_tokens (12030) or equivalently the sum of the ephemeral TTL breakdown (12030).

The cache_read field in the same dict does match its raw counterpart (cache_read_input_tokens) — the bug is specifically in the cache_creation derivation.
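
The two payloads above can be cross-checked with plain arithmetic. The sketch below copies the reported values into dicts and confirms that the cache write is counted in the normalized input_tokens total but missing from the cache_creation detail. It only restates the observed numbers and makes no claim about how langchain-anthropic computes them.

```python
# Values copied from the "Observed" payloads in this report.
raw_usage = {
    "cache_creation": {"ephemeral_1h_input_tokens": 0, "ephemeral_5m_input_tokens": 12030},
    "cache_creation_input_tokens": 12030,
    "cache_read_input_tokens": 0,
    "input_tokens": 7,
    "output_tokens": 92,
}
usage_metadata = {
    "input_tokens": 12037,
    "output_tokens": 92,
    "total_tokens": 12129,
    "input_token_details": {
        "cache_read": 0,
        "cache_creation": 0,
        "ephemeral_5m_input_tokens": 12030,
        "ephemeral_1h_input_tokens": 0,
    },
}

# Uncached input + cache writes + cache reads from the raw payload.
raw_total_input = (
    raw_usage["input_tokens"]
    + raw_usage["cache_creation_input_tokens"]
    + raw_usage["cache_read_input_tokens"]
)
ttl_sum = sum(raw_usage["cache_creation"].values())
details = usage_metadata["input_token_details"]

print("total matches:", raw_total_input == usage_metadata["input_tokens"])
print("TTL breakdown matches raw:", ttl_sum == raw_usage["cache_creation_input_tokens"])
print("normalized cache_creation matches raw:",
      details["cache_creation"] == raw_usage["cache_creation_input_tokens"])
```

Only the last comparison fails, which narrows the discrepancy to the single cache_creation field.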

Impact

Callers that rely on the standardized usage_metadata["input_token_details"]["cache_creation"] field (tests, cost dashboards, observability) see 0 while the cache write actually occurred. The only reliable source today is response_metadata["usage"]["cache_creation_input_tokens"], which defeats the purpose of the normalized surface.

Concrete example: an end-to-end test verifying that fork-mode subagent plumbing preserves prompt caching has to read the raw field — the normalized one reports no cache activity even when 12k+ tokens were written.

System info

Repro run on:

  • langchain-anthropic latest (check pip show langchain-anthropic)
  • langchain-core latest
  • Python 3.11
  • Model: claude-haiku-4-5-20251001 (also reproduces on claude-sonnet-4-5)

Related

  • #32818 (closed) — related issue about input_tokens accuracy under cache usage. Fixes there appear to have left cache_creation derivation broken for the ephemeral-TTL response shape.
  • langchain-ai/langchainjs#10249 — streaming double-counts cache tokens (different symptom, JS).
  • langchain-ai/langsmith-sdk#2150 — confusing LangSmith dashboard breakdown when both cache_creation and ephemeral_5m_input_tokens appear.

extent analysis

TL;DR

The usage_metadata field input_token_details.cache_creation reports 0 even when tokens were written to the cache. As a workaround, read response_metadata["usage"]["cache_creation_input_tokens"] instead.

Guidance

  • The discrepancy lies in how input_token_details.cache_creation is derived: when the response carries the ephemeral TTL breakdown, the normalized field is 0 even though tokens were written to the cache.
  • To confirm, compare usage_metadata["input_token_details"]["cache_creation"] with response_metadata["usage"]["cache_creation_input_tokens"] on the same response; in the report they are 0 and 12030.
  • As a temporary workaround, read response_metadata["usage"]["cache_creation_input_tokens"], or sum ephemeral_5m_input_tokens and ephemeral_1h_input_tokens.
  • To find the root cause, review the langchain-anthropic code that builds input_token_details from the raw usage payload.
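
The comparison in the second bullet can be wrapped in a small check. This is a sketch that works on plain dicts shaped like the payloads in the report; `cache_creation_mismatch` is a hypothetical helper, not a LangChain API.

```python
def cache_creation_mismatch(usage_metadata, raw_usage):
    """Return (normalized, raw) when the two counters disagree, else None."""
    details = usage_metadata.get("input_token_details") or {}
    normalized = details.get("cache_creation") or 0
    raw = raw_usage.get("cache_creation_input_tokens") or 0
    return (normalized, raw) if normalized != raw else None


# Values taken from the report.
reported = cache_creation_mismatch(
    {"input_token_details": {"cache_creation": 0, "ephemeral_5m_input_tokens": 12030}},
    {"cache_creation_input_tokens": 12030},
)
print(reported)
```

With a real response, the two arguments would be `ai.usage_metadata` and `ai.response_metadata["usage"]`, as in the reproduction script.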

Example

The problem lies in how cache_creation is derived inside the library rather than in caller code, so the reproduction script in the issue body is the relevant example.

Notes

The report reproduces the issue with langchain-anthropic on claude-haiku-4-5-20251001 and states that it also reproduces on claude-sonnet-4-5, so it does not appear to be model-specific. The workaround depends on response_metadata["usage"] being available; a proper fix requires correcting the derivation in the library.

Recommendation

Apply workaround: use response_metadata.usage.cache_creation_input_tokens instead of input_token_details.cache_creation to get the correct number of tokens written to the cache, until a proper fix is available.
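
One way to apply the workaround is a reader that prefers the raw counter and falls back to the normalized fields. The sketch below operates on plain dicts shaped like `ai.usage_metadata` and `ai.response_metadata`; `cache_creation_tokens` is a hypothetical helper, and the fallback order is an assumption rather than documented library behavior.

```python
def cache_creation_tokens(usage_metadata, response_metadata):
    """Best-effort count of tokens written to cache on one response."""
    raw_usage = response_metadata.get("usage") or {}
    raw = raw_usage.get("cache_creation_input_tokens")
    if raw is not None:
        # Raw Anthropic counter: the reliable source per the report.
        return raw
    details = usage_metadata.get("input_token_details") or {}
    # Sum the per-TTL breakdown (ephemeral_5m_input_tokens, ephemeral_1h_input_tokens).
    ttl_sum = sum(
        value or 0
        for key, value in details.items()
        if key.startswith("ephemeral_") and key.endswith("_input_tokens")
    )
    # max() covers both shapes: a zeroed cache_creation next to a breakdown,
    # or a correct cache_creation with no breakdown present.
    return max(ttl_sum, details.get("cache_creation") or 0)


normalized = {
    "input_token_details": {
        "cache_read": 0,
        "cache_creation": 0,
        "ephemeral_5m_input_tokens": 12030,
        "ephemeral_1h_input_tokens": 0,
    }
}
print(cache_creation_tokens(normalized, {"usage": {"cache_creation_input_tokens": 12030}}))
print(cache_creation_tokens(normalized, {}))
```

Keeping the fallback means the helper still returns a sensible number if the raw usage block is absent, for example when only usage_metadata has been persisted.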
