langchain - AnomalyDetectionCallbackHandler: real-time statistical anomaly detection for LLM monitoring




Submission checklist

  • This is a feature request, not a bug report or usage question.
  • I added a clear and descriptive title that summarizes the feature request.
  • I used the GitHub search to find a similar feature request and didn't find it.
  • I checked the LangChain documentation and API reference to see if this feature already exists.
  • This is not related to the langchain-community package.

Package (Required)

  • langchain
  • langchain-openai
  • langchain-anthropic
  • langchain-classic
  • langchain-core
  • langchain-model-profiles
  • langchain-tests
  • langchain-text-splitters
  • langchain-chroma
  • langchain-deepseek
  • langchain-exa
  • langchain-fireworks
  • langchain-groq
  • langchain-huggingface
  • langchain-mistralai
  • langchain-nomic
  • langchain-ollama
  • langchain-openrouter
  • langchain-perplexity
  • langchain-qdrant
  • langchain-xai
  • Other / not sure / general

Feature Description

Production LLM applications need monitoring for anomalous behavior - cost spikes, latency outliers, token usage anomalies, or unexpected tool-calling patterns. Today, users either build this from scratch or rely on static threshold alerts that don't adapt to different chain/agent profiles.

I'm proposing an AnomalyDetectionCallbackHandler that uses statistical anomaly detection (z-score, IQR, optional Isolation Forest) to learn the baseline distribution of an application's LLM calls and flag deviations automatically.

# Proposed API: AnomalyDetectionCallbackHandler is not part of langchain_core today.
from langchain_core.callbacks import AnomalyDetectionCallbackHandler

handler = AnomalyDetectionCallbackHandler(
    metrics=["latency", "token_count", "cost"],
    detector="zscore",
    warmup_calls=50,
    threshold=3.0,
    on_anomaly="log",  # or "warn", "raise", or callable
)
chain.invoke(input, config={"callbacks": [handler]})
handler.get_anomalies()  # list of AnomalyEvent objects

What it tracks per LLM call: latency (e2e and TTFT), token usage (prompt + completion), estimated cost, error rate (rolling window), tool call frequency and duration.

Detection methods: Z-score (default, lightweight), IQR-based (robust to skewed distributions), Isolation Forest (optional via scikit-learn for more complex patterns). Each metric is tracked independently with its own learned baseline.
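As a rough illustration of the default method, the sketch below learns a baseline during warm-up and then flags values by z-score. The class name `ZScoreDetector` and its `observe` method are hypothetical and not part of LangChain; `warmup_calls` and `threshold` mirror the parameters proposed above.

```python
import statistics


class ZScoreDetector:
    """Hypothetical per-metric detector: learns a baseline, then flags outliers."""

    def __init__(self, warmup_calls: int = 50, threshold: float = 3.0) -> None:
        self.warmup_calls = warmup_calls
        self.threshold = threshold
        self.baseline: list[float] = []

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the learned baseline."""
        if len(self.baseline) < self.warmup_calls:
            self.baseline.append(value)  # still learning, never flag
            return False
        mean = statistics.fmean(self.baseline)
        std = statistics.pstdev(self.baseline)
        if std == 0:
            return value != mean  # constant baseline: any change is a deviation
        return abs(value - mean) / std > self.threshold


detector = ZScoreDetector(warmup_calls=5, threshold=3.0)
for latency in [1.0, 1.1, 0.9, 1.0, 1.2]:
    detector.observe(latency)

normal_flag = detector.observe(1.05)  # close to the baseline mean
spike_flag = detector.observe(9.0)    # far outside the baseline
```

In this sketch the baseline is frozen once warm-up ends; an actual implementation would probably keep a rolling window so the baseline can follow gradual drift.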

On detection: Configurable action - log, warn, raise, or user-provided callable. Each AnomalyEvent (Pydantic model) includes timestamp, metric name, observed value, expected range, severity score, and run context.
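To make the event shape concrete, here is a minimal sketch of what an `AnomalyEvent` could carry. A dataclass stands in for the proposed Pydantic model so the example has no dependencies; the field names and the `build_event` helper are assumptions, not a settled API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class AnomalyEvent:
    """Stand-in for the proposed Pydantic model; field names are assumptions."""

    metric: str
    observed: float
    expected_range: tuple[float, float]
    severity: float
    run_context: dict[str, Any] = field(default_factory=dict)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def build_event(metric: str, observed: float, mean: float, std: float,
                threshold: float = 3.0) -> AnomalyEvent:
    """Derive the expected range and a z-score severity from baseline statistics."""
    low, high = mean - threshold * std, mean + threshold * std
    severity = abs(observed - mean) / std if std > 0 else float("inf")
    return AnomalyEvent(metric, observed, (low, high), severity)


event = build_event("latency", observed=9.0, mean=1.0, std=0.5)
```

Here the severity score is simply the z-score of the observation; other detectors would need their own definition of severity.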

Design:

  • Inherits BaseCallbackHandler, uses on_llm_start/end/error and on_tool_start/end
  • Zero required external deps beyond numpy (already a dependency); scikit-learn optional
  • Thread-safe metrics buffer for concurrent executions
  • Complements LangSmith tracing - LangSmith records everything, this flags what's abnormal in real-time without requiring a subscription

I'm happy to implement this with full tests and an example notebook if this would be welcome. Open to feedback on API design and scoping.

Use Case

Production LLM applications need real-time monitoring for anomalous behavior - cost spikes, latency outliers, token usage anomalies, or unexpected tool-calling patterns. Currently, users either build custom monitoring from scratch or pipe everything to an external observability platform and set static threshold alerts.

Static thresholds are brittle. A 5-second response is anomalous for a simple Q&A chain but normal for a multi-step agent with tool calls. What's needed is statistical anomaly detection that learns the baseline distribution of an application's behavior and flags deviations automatically.
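The point can be shown with a small worked example: the same 5-second response is scored against two different learned baselines. The latency samples are invented for illustration, and `z_score` is a hypothetical helper.

```python
import statistics


def z_score(value: float, baseline: list[float]) -> float:
    """Number of standard deviations between `value` and the baseline mean."""
    return abs(value - statistics.fmean(baseline)) / statistics.pstdev(baseline)


# Illustrative latencies in seconds; these numbers are made up for the example.
qa_chain_baseline = [0.8, 1.0, 1.2, 0.9, 1.1]
agent_baseline = [4.0, 6.5, 5.0, 7.0, 4.5]

qa_is_anomalous = z_score(5.0, qa_chain_baseline) > 3.0
agent_is_anomalous = z_score(5.0, agent_baseline) > 3.0
```

With these samples the 5-second call is flagged for the Q&A chain but not for the agent, without any per-application threshold tuning.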

LangChain's callback system already receives all the required signals (token counts, latency, costs, tool calls, errors) but has no built-in handler that performs anomaly detection on them. Existing handlers such as OpenAICallbackHandler (used through get_openai_callback) track cumulative token usage and cost - this feature extends that concept to full statistical monitoring across multiple metrics.

This would serve as a lightweight local-first detection layer that flags potential issues in real-time during development and testing. For production monitoring at scale, this naturally pairs with LangSmith - the handler could optionally forward detected anomaly events to LangSmith for dashboard visualization and historical analysis, acting as an intelligent filter that reduces alert noise.

Proposed Solution

An AnomalyDetectionCallbackHandler that inherits BaseCallbackHandler and monitors LLM interactions using statistical anomaly detection:

# Proposed API: AnomalyDetectionCallbackHandler is not part of langchain_core today.
from langchain_core.callbacks import AnomalyDetectionCallbackHandler

handler = AnomalyDetectionCallbackHandler(
    metrics=["latency", "token_count", "cost"],
    detector="zscore",        # or "iqr", "isolation_forest"
    warmup_calls=50,          # learn baseline from first N calls
    threshold=3.0,            # z-score threshold for flagging
    on_anomaly="log",         # "log", "warn", "raise", or callable
)

chain.invoke(input, config={"callbacks": [handler]})
handler.get_anomalies()  # list of AnomalyEvent objects

Metrics tracked per LLM call: latency (e2e and TTFT), token usage (prompt + completion), estimated cost, error rate (rolling window), tool call frequency and duration.
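Of these metrics, the rolling-window error rate is the least obvious to compute, so a minimal sketch follows. `RollingErrorRate` is a hypothetical name, and the window size is an assumed parameter.

```python
from collections import deque


class RollingErrorRate:
    """Hypothetical tracker: fraction of failed calls among the last `window` calls."""

    def __init__(self, window: int = 100) -> None:
        self._outcomes: deque[bool] = deque(maxlen=window)

    def record(self, failed: bool) -> float:
        """Store one call outcome and return the current error rate."""
        self._outcomes.append(failed)
        return sum(self._outcomes) / len(self._outcomes)


tracker = RollingErrorRate(window=4)
rates = [tracker.record(failed) for failed in [False, False, True, True, True]]
```

Because the deque is bounded, the oldest outcome drops out once the window is full, so the rate reflects recent behavior only.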

Detection methods:

  • Z-score (default - lightweight, only needs numpy)
  • IQR-based (robust to skewed distributions)
  • Isolation Forest (optional via scikit-learn, for complex patterns)

Each metric is tracked independently with its own learned baseline.
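The IQR variant with independent per-metric baselines might look roughly like this. It uses Tukey fences with the conventional multiplier of 1.5, which is an assumption since the proposal does not fix one; `iqr_bounds` and `is_anomalous` are hypothetical helpers and the sample values are illustrative.

```python
import statistics


def iqr_bounds(baseline: list[float], k: float = 1.5) -> tuple[float, float]:
    """Tukey fences: values outside [Q1 - k*IQR, Q3 + k*IQR] count as outliers."""
    q1, _, q3 = statistics.quantiles(baseline, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# One independent baseline per metric; the sample values are illustrative only.
baselines = {
    "latency": [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1],
    "token_count": [200.0, 220.0, 210.0, 190.0, 205.0, 215.0, 195.0, 1500.0],
}
bounds = {metric: iqr_bounds(values) for metric, values in baselines.items()}


def is_anomalous(metric: str, value: float) -> bool:
    """Check a value against the fences learned for that metric alone."""
    low, high = bounds[metric]
    return value < low or value > high
```

The single 1500-token sample in the `token_count` baseline barely moves the fences, which is the kind of robustness to skew that motivates offering IQR alongside the z-score.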

On anomaly detection: configurable action - log, warn, raise, or user-provided callable. Each AnomalyEvent is a Pydantic model with timestamp, metric name, observed value, expected range, severity score, and run context.
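The configurable action could be routed along these lines. This is only a sketch: `dispatch`, `AnomalyDetectedError`, and the plain-dict event are assumptions made to keep the example self-contained, while the four options match the `on_anomaly` values listed above.

```python
import logging
import warnings
from typing import Any, Callable, Union

logger = logging.getLogger("anomaly_detection")

OnAnomaly = Union[str, Callable[[dict[str, Any]], None]]


class AnomalyDetectedError(RuntimeError):
    """Hypothetical exception type used for the "raise" action."""


def dispatch(event: dict[str, Any], on_anomaly: OnAnomaly = "log") -> None:
    """Route a detected anomaly to the configured action."""
    message = f"anomaly in {event['metric']}: observed {event['observed']}"
    if callable(on_anomaly):
        on_anomaly(event)
    elif on_anomaly == "log":
        logger.warning(message)
    elif on_anomaly == "warn":
        warnings.warn(message, RuntimeWarning, stacklevel=2)
    elif on_anomaly == "raise":
        raise AnomalyDetectedError(message)
    else:
        raise ValueError(f"unknown on_anomaly action: {on_anomaly!r}")


# A user-provided callable: here the events are simply collected in a list.
collected: list[dict[str, Any]] = []
dispatch({"metric": "cost", "observed": 4.2}, on_anomaly=collected.append)
```

Rejecting unknown action strings at dispatch time is one option; validating them in the handler's constructor would surface configuration mistakes earlier.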

Design choices:

  • Uses on_llm_start, on_llm_end, on_llm_error, on_tool_start, on_tool_end hooks
  • Zero required external deps beyond numpy; scikit-learn optional for Isolation Forest
  • Thread-safe metrics buffer for concurrent chain executions
  • Complements LangSmith - acts as a local pre-filter that can forward anomaly events to LangSmith for visualization, reducing noise in production tracing by highlighting only statistically significant deviations
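A minimal sketch of how the hooks could feed the metrics follows, limited to end-to-end latency and error counting. `LatencyRecorder` is a hypothetical class; the import falls back to `object` so the sketch runs without LangChain installed, and the calls at the bottom simulate the sequence the callback manager would drive.

```python
import time
from typing import Any
from uuid import UUID, uuid4

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # lets the sketch run without LangChain installed
    BaseCallbackHandler = object


class LatencyRecorder(BaseCallbackHandler):
    """Hypothetical handler that records end-to-end latency per LLM run."""

    def __init__(self) -> None:
        super().__init__()
        self._started: dict[UUID, float] = {}
        self.latencies: list[float] = []
        self.errors = 0

    def on_llm_start(self, serialized: Any, prompts: list[str], *,
                     run_id: UUID, **kwargs: Any) -> None:
        self._started[run_id] = time.perf_counter()

    def on_llm_end(self, response: Any, *, run_id: UUID, **kwargs: Any) -> None:
        started = self._started.pop(run_id, None)
        if started is not None:
            self.latencies.append(time.perf_counter() - started)

    def on_llm_error(self, error: BaseException, *, run_id: UUID, **kwargs: Any) -> None:
        self._started.pop(run_id, None)  # drop the timer, count the failure
        self.errors += 1


# Simulate one successful run and one failed run.
recorder = LatencyRecorder()
ok_run, bad_run = uuid4(), uuid4()
recorder.on_llm_start({}, ["hello"], run_id=ok_run)
recorder.on_llm_end(None, run_id=ok_run)
recorder.on_llm_start({}, ["hello"], run_id=bad_run)
recorder.on_llm_error(RuntimeError("boom"), run_id=bad_run)
```

Keying the start times by `run_id` keeps concurrent runs from overwriting each other's timers; a full implementation would also guard the dictionary with a lock.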

I'm happy to implement this with full tests and an example notebook. Open to feedback on API design and scoping.

Alternatives Considered

  1. Static threshold alerts: Simple but brittle; thresholds need manual tuning per chain/agent and break when usage patterns shift.
  2. External APM tools (Datadog, Prometheus + Grafana): Powerful but require infrastructure setup, custom metric emission, and don't understand LangChain-specific semantics (e.g., distinguishing agent loops from normal multi-step chains).
  3. LangSmith: The ideal production monitoring solution. This handler is designed to complement LangSmith, not replace it - providing lightweight local detection during development/testing, with optional forwarding of anomaly events to LangSmith for production dashboards.
  4. Custom callback handlers: What most teams do today, but everyone rebuilds the same wheel. A standardized implementation would benefit the entire ecosystem.

Additional Context

No response
