langchain - ✅ (Solved) Fix ChatAnthropic: `usage_metadata.input_token_details.cache_creation` is 0 when tokens were written to cache [1 pull request, 1 comment, 2 participants]

langchain · 2026-04-24 16:34:45

GitHub stats

langchain-ai/langchain#36991 • Fetched 2026-04-25 06:03:13

Author

ramon-langchain

Participants

langchain-automated-triage[bot]

ramon-langchain

Timeline (top)

closed ×1, commented ×1, cross-referenced ×1, labeled ×1

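Fix Action

Cross-referenced PR (listed below as open and unmerged; it changes deepagents only and reads the raw Anthropic usage fields rather than fixing the normalization): feat(sdk): add fork mode to subagents for prompt-cache reuse (https://github.com/langchain-ai/deepagents/pull/2907)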

PR fix notes

PR #2907: feat(sdk): add fork mode to subagents for prompt-cache reuse

Repository: langchain-ai/deepagents
Author: ramon-langchain
State: open | merged: False
Link: https://github.com/langchain-ai/deepagents/pull/2907

Description (problem / solution / changelog)

Summary

Adds an opt-in fork: bool field to SubAgent. When true, the subagent inherits the parent's composed system prompt and full message history as its prefix, seeds the task description as an additional HumanMessage, and defaults its model to the parent's (mismatch raises at build time).

The prefix alignment is what unlocks prompt-cache reuse on every supported provider — Anthropic via cache_control markers, OpenAI via the automatic Responses-API cache, Gemini 2.5 via implicit caching, and OpenRouter via pass-through. Isolation semantics are unchanged: only the fork's final message is surfaced back to the parent as a ToolMessage, and fork intermediate state is not written back.

To keep the main agent from wasting tokens re-stating context the fork already has, the task tool description annotates forked subagents with [forked — inherits full conversation context] and appends a usage-guidance block instructing the caller to pass only the task delta. Fork invocations are tagged ls_agent_type="fork-subagent" so LangSmith can filter them separately (needed to measure cache-read savings in production).

CompiledSubAgent + fork=True is rejected at build time — compiled subagents own their own system prompt and graph, so splicing in the parent prefix would be ambiguous.
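
The seeding and isolation behavior described in this summary can be sketched in a few lines. This is a minimal illustration only, not the deepagents implementation: `Message`, `seed_subagent_messages`, and `surface_result` are hypothetical names, and messages are modeled as a plain dataclass rather than LangChain message objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical stand-in for a chat message (role plus text)."""

    role: str  # "system", "human", "ai", or "tool"
    content: str


def seed_subagent_messages(parent_messages, description, fork):
    """Build the initial message list for one subagent invocation.

    fork=True: parent history first, then the task description, so the
    request prefix matches what the parent already sent to the provider.
    fork=False: only the task description.
    """
    task = Message("human", description)
    if fork:
        return [*parent_messages, task]
    return [task]


def surface_result(subagent_messages):
    """Return only the subagent's final message to the parent, as a tool message."""
    final = subagent_messages[-1]
    return Message("tool", final.content)


parent_history = [
    Message("human", "Summarize the design doc."),
    Message("ai", "I will delegate the review."),
]
forked = seed_subagent_messages(parent_history, "Review section 2.", fork=True)
plain = seed_subagent_messages(parent_history, "Review section 2.", fork=False)

print(len(forked), len(plain))
```

Because the forked list starts with the parent's exact history, any provider-side prefix cache keyed on that history can be reused; the non-fork list shares nothing with the parent request.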

Provider coverage

Fork itself is provider-neutral; each provider's existing cache path realizes the savings once the prefix aligns:

Provider	How cache kicks in	Change required?
Anthropic	`AnthropicPromptCachingMiddleware` already marks the system prompt as cacheable; shared prefix → hit	None
OpenAI	Automatic server-side (≥1024 tokens, Responses API); `_openai.py` already forces `use_responses_api=True`	None
Gemini 2.5	Implicit caching, 75% automatic discount on shared prefix	None
OpenRouter	Sticky routing + pass-through from the underlying provider	None

Explicit Gemini caching is skipped (incompatible with LangChain tool binding per langchain-google#1528). Anthropic-via-OpenRouter cache_control passthrough is a pre-existing deepagents limitation (AnthropicPromptCachingMiddleware no-ops on ChatOpenRouter) and is out of scope here.

Usage

from deepagents import create_deep_agent

agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-5-20250929",
    system_prompt="<large shared instructions>",
    tools=[...],
    subagents=[
        {
            "name": "reviewer",
            "description": "Review a single artifact.",
            "system_prompt": "You review one artifact at a time.",
            "tools": [...],
            "fork": True,  # inherits parent prefix for cache reuse
        },
    ],
)

Test results

Unit tests — libs/deepagents/tests/unit_tests/test_fork_subagents.py (9 new, all pass):

test_fork_prepends_parent_messages_when_seeding_state — fork's state["messages"] is parent history + HumanMessage(description); non-fork is just the description.
test_fork_sets_ls_agent_type_to_fork_subagent — telemetry tag is "fork-subagent" for forks, "subagent" otherwise.
test_forked_subagent_rendered_with_marker_and_guidance — task tool description shows [forked — inherits full conversation context] and appends the usage-guidance block.
test_no_fork_no_marker_or_guidance — zero forks → no marker, no guidance.
test_fork_composes_parent_prefix_and_inherits_message_history — captures the actual SystemMessage the fork's model receives; asserts it contains parent prefix + fork suffix and that parent HumanMessage is in the fork's message list.
test_fork_without_model_inherits_parent_model — no model → defaults to parent.
test_fork_with_mismatched_model_raises — different model class → ValueError.
test_compiled_subagent_with_fork_raises — guards compiled + fork combo.
test_subagent_intermediate_messages_do_not_leak_to_parent — isolation guarantee.
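
The build-time guards exercised by the model and compiled-subagent tests above could look roughly like the following. This is a sketch under stated assumptions: `resolve_fork_model` is a hypothetical helper, the subagent spec is a plain dict, and models are compared by identifier string, whereas the test description says the real check compares model classes.

```python
def resolve_fork_model(parent_model, spec, is_compiled=False):
    """Pick the model for a subagent and enforce the fork constraints.

    Hypothetical helper. `spec` is a dict with optional "fork" and "model"
    keys; `is_compiled` marks a subagent that owns its own graph.
    """
    fork = spec.get("fork", False)
    if fork and is_compiled:
        # A compiled subagent has its own system prompt and graph, so
        # splicing in the parent prefix would be ambiguous.
        raise ValueError("fork=True cannot be combined with a compiled subagent")

    requested = spec.get("model")
    if not fork:
        # Non-fork model resolution is outside the scope of this sketch.
        return requested
    if requested is None:
        # A fork without an explicit model inherits the parent's model.
        return parent_model
    if requested != parent_model:
        raise ValueError(
            f"forked subagent model {requested!r} does not match "
            f"parent model {parent_model!r}"
        )
    return requested


print(resolve_fork_model("anthropic:claude-sonnet-4-5-20250929", {"fork": True}))
```

Failing at build time rather than at invocation time keeps a misconfigured fork from silently running with a prefix that can never hit the parent's cache.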

Full suite: 1186 passed, 84 skipped, 4 xfailed (uv run --group test pytest tests/unit_tests/ --no-cov --disable-socket --allow-unix-socket).

Live integration test — libs/deepagents/tests/integration_tests/test_fork_caching_anthropic.py (2 tests against Claude Haiku 4.5 claude-haiku-4-5-20251001):

Test	Classification	cache_creation	cache_read
`fork=True`, invocation 1	fork-subagent	469	9683
`fork=True`, invocation 2	fork-subagent	469	9683
`fork=False`, invocation 1	subagent	0	0
`fork=False`, invocation 2	subagent	0	0

Numbers captured via on_chat_model_start / on_llm_end callbacks, classifying each LLM call by system-message content (ls_agent_type propagates through the LangSmith tracer, not the callback metadata channel). The fork positive case reads ~80% of its input from cache on both runs; the non-fork negative control shows zero cache activity — isolating the fork flag as the cause of the savings.
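
As a rough illustration of the measurement approach described above, the sketch below classifies each recorded LLM call by its system-message text and totals the raw Anthropic cache counters per class. The marker string, the `(system_text, raw_usage)` record shape, and both function names are assumptions made for this example; the PR's actual utilities live in fork_cache_utils.py and are not reproduced here.

```python
from collections import defaultdict

# Hypothetical marker: any substring that appears only in a fork's system prompt.
FORK_MARKER = "FORK-SUFFIX"


def classify_call(system_text):
    """Label a call by inspecting the system message it was sent with."""
    return "fork-subagent" if FORK_MARKER in system_text else "subagent"


def summarize_cache_usage(calls):
    """Total raw cache counters per classification.

    `calls` is an iterable of (system_text, raw_usage) pairs, where
    raw_usage is the provider's usage dict for that call.
    """
    totals = defaultdict(lambda: {"cache_creation": 0, "cache_read": 0})
    for system_text, raw_usage in calls:
        bucket = totals[classify_call(system_text)]
        # Missing or None counters are treated as zero.
        bucket["cache_creation"] += raw_usage.get("cache_creation_input_tokens") or 0
        bucket["cache_read"] += raw_usage.get("cache_read_input_tokens") or 0
    return dict(totals)


# Per-invocation numbers taken from the table above, two invocations each.
fork_usage = {"cache_creation_input_tokens": 469, "cache_read_input_tokens": 9683}
plain_usage = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 0}
calls = [
    ("shared prefix ... FORK-SUFFIX", fork_usage),
    ("shared prefix ... FORK-SUFFIX", fork_usage),
    ("You review one artifact at a time.", plain_usage),
    ("You review one artifact at a time.", plain_usage),
]
summary = summarize_cache_usage(calls)
print(summary)
```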

The test uses a shared system prompt sized above Haiku's 2048-token minimum cacheable-block floor. It reads the raw Anthropic cache_creation_input_tokens / cache_read_input_tokens fields directly, because LangChain's normalized input_token_details.cache_creation is zeroed when the ephemeral TTL breakdown is present (see langchain-ai/langchain#36991).

Test Plan

uv run --group test pytest tests/unit_tests/ — 1186 passed
uv run --group test pytest tests/integration_tests/test_fork_caching_anthropic.py — 2 passed (live Haiku 4.5)
uv run --all-groups ruff check deepagents tests/
uv run --all-groups ruff format --diff
uv run --all-groups ty check deepagents

Changed files

libs/cli/deepagents_cli/_env_vars.py (modified, +3/-0)
libs/cli/deepagents_cli/agent.py (modified, +12/-3)
libs/cli/tests/unit_tests/test_agent.py (modified, +56/-0)
libs/deepagents/deepagents/graph.py (modified, +196/-51)
libs/deepagents/deepagents/middleware/async_subagents.py (modified, +12/-5)
libs/deepagents/deepagents/middleware/subagents.py (modified, +283/-28)
libs/deepagents/tests/integration_tests/fork_cache_utils.py (added, +191/-0)
libs/deepagents/tests/integration_tests/test_fork_caching_anthropic.py (added, +38/-0)
libs/deepagents/tests/integration_tests/test_fork_caching_openai.py (added, +35/-0)
libs/deepagents/tests/unit_tests/test_fork_subagents.py (added, +761/-0)

`ChatAnthropic`: `usage_metadata.input_token_details.cache_creation` reports 0 when tokens were written to cache

Checked other resources

This is a bug, not a usage question.
I searched existing issues — closest is #32818, which is closed and focuses on input_tokens accuracy. The specific symptom below (normalized cache_creation is 0 while ephemeral_5m_input_tokens and raw cache_creation_input_tokens are both correct) does not appear to be filed.
The bug reproduces on the latest stable langchain-anthropic.

Example code

import json
import uuid

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage

LARGE_SYSTEM_PROMPT_TEXT = (
    f"nonce-{uuid.uuid4()}\n"  # force a fresh cache write on every run
    + ("You are a helpful research assistant specialized in long-form technical "
       "documentation synthesis. ") * 800
)

system = SystemMessage(
    content=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT_TEXT,
            "cache_control": {"type": "ephemeral", "ttl": "5m"},
        }
    ]
)

llm = ChatAnthropic(model="claude-haiku-4-5-20251001", temperature=0)
ai = llm.invoke([system, HumanMessage("hi")])

itd = ai.usage_metadata["input_token_details"]
raw = ai.response_metadata["usage"]["cache_creation_input_tokens"]

print("normalized input_token_details.cache_creation     =", itd["cache_creation"])
print("raw    response_metadata.cache_creation_input_tokens =", raw)
print("sum    ephemeral_5m + ephemeral_1h                   =",
      itd.get("ephemeral_5m_input_tokens", 0) + itd.get("ephemeral_1h_input_tokens", 0))

Observed

normalized input_token_details.cache_creation     = 0
raw    response_metadata.cache_creation_input_tokens = 12030
sum    ephemeral_5m + ephemeral_1h                   = 12030

Full usage_metadata:

{
  "input_tokens": 12037,
  "output_tokens": 92,
  "total_tokens": 12129,
  "input_token_details": {
    "cache_read": 0,
    "cache_creation": 0,
    "ephemeral_5m_input_tokens": 12030,
    "ephemeral_1h_input_tokens": 0
  }
}

Full response_metadata["usage"]:

{
  "cache_creation": {
    "ephemeral_1h_input_tokens": 0,
    "ephemeral_5m_input_tokens": 12030
  },
  "cache_creation_input_tokens": 12030,
  "cache_read_input_tokens": 0,
  "input_tokens": 7,
  "output_tokens": 92
}
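
The two payloads above can be reconciled arithmetically. The check below assumes the normalized input_tokens is the raw input_tokens plus cache reads plus cache writes; that relationship is inferred from the numbers in this report, not taken from the library's source. Under that assumption every total lines up, and only the cache_creation detail disagrees.

```python
# Values copied from the two payloads in this report.
usage_metadata = {
    "input_tokens": 12037,
    "output_tokens": 92,
    "total_tokens": 12129,
    "input_token_details": {
        "cache_read": 0,
        "cache_creation": 0,
        "ephemeral_5m_input_tokens": 12030,
        "ephemeral_1h_input_tokens": 0,
    },
}
raw_usage = {
    "cache_creation": {
        "ephemeral_1h_input_tokens": 0,
        "ephemeral_5m_input_tokens": 12030,
    },
    "cache_creation_input_tokens": 12030,
    "cache_read_input_tokens": 0,
    "input_tokens": 7,
    "output_tokens": 92,
}

details = usage_metadata["input_token_details"]

# 7 uncached + 0 read from cache + 12030 written to cache = 12037
raw_total_input = (
    raw_usage["input_tokens"]
    + raw_usage["cache_read_input_tokens"]
    + raw_usage["cache_creation_input_tokens"]
)
ephemeral_sum = sum(raw_usage["cache_creation"].values())

print("total input matches:  ", raw_total_input == usage_metadata["input_tokens"])
print("cache_read matches:   ", details["cache_read"] == raw_usage["cache_read_input_tokens"])
print("TTL breakdown matches:", ephemeral_sum == raw_usage["cache_creation_input_tokens"])
print("cache_creation matches:", details["cache_creation"] == raw_usage["cache_creation_input_tokens"])
```

The first three comparisons print True and the last prints False, which narrows the discrepancy to the single normalized cache_creation field.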

Expected

input_token_details["cache_creation"] should equal the number of tokens written to cache on this request — i.e., cache_creation_input_tokens (12030) or equivalently the sum of the ephemeral TTL breakdown (12030).

The cache_read field in the same dict does match its raw counterpart (cache_read_input_tokens) — the bug is specifically in the cache_creation derivation.

Impact

Callers that rely on the standardized usage_metadata["input_token_details"]["cache_creation"] field (tests, cost dashboards, observability) see 0 while the cache write actually occurred. The only reliable source today is response_metadata["usage"]["cache_creation_input_tokens"], which defeats the purpose of the normalized surface.

Concrete example: an end-to-end test verifying that fork-mode subagent plumbing preserves prompt caching has to read the raw field — the normalized one reports no cache activity even when 12k+ tokens were written.
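
Until the normalized field can be relied on, callers could wrap the lookup in a small helper. The function below is a hypothetical sketch, not part of LangChain: it prefers the raw provider counter when it is available and otherwise falls back to the larger of the generic field and the TTL breakdown sum, so it does not double count if a future release populates both.

```python
def cache_write_tokens(usage_metadata, raw_usage=None):
    """Best-effort count of tokens written to the prompt cache.

    Hypothetical helper. `usage_metadata` is the normalized dict and
    `raw_usage` is the provider's usage dict, if the caller has it.
    """
    if raw_usage:
        raw = raw_usage.get("cache_creation_input_tokens")
        if raw is not None:
            return raw

    details = (usage_metadata or {}).get("input_token_details") or {}
    generic = details.get("cache_creation") or 0
    ephemeral = (details.get("ephemeral_5m_input_tokens") or 0) + (
        details.get("ephemeral_1h_input_tokens") or 0
    )
    # max() rather than a sum: the two fields describe the same tokens.
    return max(generic, ephemeral)


# Normalized payload from this report, without the raw usage dict.
observed = {
    "input_tokens": 12037,
    "input_token_details": {
        "cache_read": 0,
        "cache_creation": 0,
        "ephemeral_5m_input_tokens": 12030,
        "ephemeral_1h_input_tokens": 0,
    },
}
print(cache_write_tokens(observed))
```

With a ChatAnthropic response named `ai`, as in the reproduction, the call would be `cache_write_tokens(ai.usage_metadata, ai.response_metadata.get("usage"))`.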

System info

Repro run on:

langchain-anthropic latest (check pip show langchain-anthropic)
langchain-core latest
Python 3.11
Model: claude-haiku-4-5-20251001 (also reproduces on claude-sonnet-4-5)

Related issues

#32818 (closed): related issue about input_tokens accuracy under cache usage. Fixes there appear to have left the cache_creation derivation broken for the ephemeral-TTL response shape.
langchain-ai/langchainjs#10249: streaming double-counts cache tokens (different symptom, JS).
langchain-ai/langsmith-sdk#2150: confusing LangSmith dashboard breakdown when both cache_creation and ephemeral_5m_input_tokens appear.

extent analysis

TL;DR

The input_token_details.cache_creation field in usage_metadata reports 0 even though tokens were written to the cache (12030 in the reproduction). Until that changes, read response_metadata["usage"]["cache_creation_input_tokens"] instead.

Guidance

The discrepancy is confined to how cache_creation is derived in input_token_details: cache_read matches its raw counterpart, and the ephemeral TTL fields carry the correct count.
To verify, compare usage_metadata["input_token_details"]["cache_creation"] with response_metadata["usage"]["cache_creation_input_tokens"] on a request that writes to the cache.
As a temporary workaround, read response_metadata["usage"]["cache_creation_input_tokens"], or sum ephemeral_5m_input_tokens and ephemeral_1h_input_tokens from input_token_details.
To find the root cause, review the langchain-anthropic code that computes cache_creation when the response includes the ephemeral TTL breakdown.

Example

The reproduction script under "Example code" in the issue body shows the discrepancy directly, so no separate snippet is given here.

Notes

The issue lies in the langchain-anthropic library rather than in a single model: it was reproduced on claude-haiku-4-5-20251001 and also on claude-sonnet-4-5. The workaround may not suit every caller (for example, code that only sees the normalized usage_metadata), and a proper fix requires identifying and correcting the derivation itself.

Recommendation

Apply workaround: use response_metadata.usage.cache_creation_input_tokens instead of input_token_details.cache_creation to get the correct number of tokens written to the cache, until a proper fix is available.
