claude-code: [FEATURE] Environment variable flag to export LLM response content and thinking steps in OTEL telemetry

GitHub stats
anthropics/claude-code#46118, fetched 2026-04-11 06:28:36
Comments: 0 · Participants: 1 · Timeline events: 2 (labeled ×2) · Reactions: 0

Root Cause

LLM observability platforms cannot run evaluators on Claude Code sessions: their evaluation frameworks require spans with complete LLM interaction data, including both inputs and outputs, and Claude Code's OTEL telemetry does not export model responses or thinking content.


Preflight Checklist

  • I have searched existing requests and this feature hasn't been requested yet
  • This is a single feature request (not multiple features)

Problem Statement

Claude Code's current OTEL telemetry exports metadata about LLM interactions (token counts, model name, latency, cost) by default. User prompts and tool content can be added through opt-in flags, but model responses, thinking steps, and system prompts cannot be exported at all. This creates a critical gap for enterprise observability and AI evaluation platforms.

What's currently available via environment variables:

{
  "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
  "OTEL_LOG_USER_PROMPTS": "1",        // ✅ User input captured
  "OTEL_LOG_TOOL_DETAILS": "1",        // ✅ Tool parameters captured
  "OTEL_LOG_TOOL_CONTENT": "1"         // ✅ Tool output captured
}

What's missing:

  • Model response text — what the LLM actually said to the user
  • Thinking content — the model's chain-of-thought reasoning
  • System prompts — the instructions sent to the model (acknowledged as potentially sensitive)

This means LLM observability platforms cannot run evaluators on Claude Code sessions because their evaluation frameworks require spans with complete LLM interaction data including both inputs and outputs.

Proposed Solution

Add opt-in environment variables that export LLM response and thinking content in OTEL telemetry, similar to how OTEL_LOG_USER_PROMPTS=1 works for user input:

{
  "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
  "OTEL_LOG_USER_PROMPTS": "1",
  "OTEL_LOG_LLM_RESPONSES": "1",        // ← NEW: exports model response text
  "OTEL_LOG_THINKING": "1"              // ← NEW: exports thinking/reasoning content
}

What should be exported:

  1. Model response text (high priority)

    • Attribute: llm.output or assistant_response
    • Where: On existing llm_request trace spans
    • Truncation: 60KB limit (consistent with OTEL_LOG_TOOL_CONTENT)
  2. Thinking content (high priority)

    • Attribute: llm.thinking or thinking_output
    • Where: On tool execution spans or llm_request spans
    • Use case: Understanding model reasoning for debugging
  3. System prompts (medium priority, acknowledged as sensitive)

    • Attribute: llm.system_prompt
    • Privacy consideration: May contain Anthropic IP, could be opt-in separately
    • Alternative: Redacted/templated version showing structure without specifics
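
The 60KB truncation mentioned above could be applied before a response is attached to a span. The following is a minimal sketch, not Claude Code's actual implementation: the helper name truncate_utf8 is hypothetical, and reading "60KB" as 60 * 1024 bytes of UTF-8 is an assumption, since the request does not say how the limit is measured.

```python
# Assumption: "60KB" is interpreted as 60 * 1024 bytes of UTF-8 text.
MAX_ATTRIBUTE_BYTES = 60 * 1024


def truncate_utf8(text: str, limit: int = MAX_ATTRIBUTE_BYTES) -> str:
    """Cut text so its UTF-8 encoding fits in `limit` bytes.

    Hypothetical helper: the cut never leaves a half-encoded character.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return text
    # errors="ignore" drops the partial multi-byte sequence, if any,
    # left at the end of the slice.
    return encoded[:limit].decode("utf-8", errors="ignore")
```

Truncating on bytes rather than characters keeps the attribute size predictable for exporters that enforce payload limits, at the cost of losing up to three bytes when the cut lands inside a multi-byte character.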

Implementation approach:

Add the response content as span attributes on existing llm_request spans:

{
  "span_name": "llm_request",
  "attributes": {
    "input_tokens": 1500,
    "output_tokens": 500,
    "llm.user_input": "Search for files",      // from OTEL_LOG_USER_PROMPTS
    "llm.output": "Here's what I found...",    // from OTEL_LOG_LLM_RESPONSES
    "llm.thinking": "I should use Grep..."     // from OTEL_LOG_THINKING
  }
}

This keeps all LLM interaction data in a single span, which is what standard observability platforms expect for LLM evaluation workflows.
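
As a rough illustration of the approach, the attribute set for one llm_request span could be assembled as shown below. This is a sketch under stated assumptions: the function build_llm_request_attributes and its boolean parameters are hypothetical, and the attribute names are the ones proposed in this request, not an existing Claude Code schema.

```python
from typing import Dict, Optional, Union

AttributeValue = Union[str, int]


def build_llm_request_attributes(
    input_tokens: int,
    output_tokens: int,
    user_input: Optional[str] = None,
    output: Optional[str] = None,
    thinking: Optional[str] = None,
    *,
    log_user_prompts: bool = False,
    log_responses: bool = False,
    log_thinking: bool = False,
) -> Dict[str, AttributeValue]:
    """Return span attributes, adding content only when its flag is on."""
    # Metadata is always exported, matching the current behavior.
    attributes: Dict[str, AttributeValue] = {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    # Each content field is gated by its own opt-in flag.
    if log_user_prompts and user_input is not None:
        attributes["llm.user_input"] = user_input
    if log_responses and output is not None:
        attributes["llm.output"] = output
    if log_thinking and thinking is not None:
        attributes["llm.thinking"] = thinking
    return attributes
```

With every flag left at its default, the result contains only the token counts, so the existing privacy posture is unchanged unless a user opts in.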

Alternative Solutions

Current workarounds (all inadequate):

  1. Hook-based transcript parsing — Reading .jsonl transcript files via Stop hooks works but:

    • Requires building custom tooling
    • Creates separate spans that don't integrate with standard observability platforms
    • Adds operational complexity (proxy servers, state tracking)
    • Still can't access system prompts
  2. Manual transcript review — Teams currently read raw .jsonl files for debugging:

    • Not scalable for production monitoring
    • Can't be integrated into automated evaluation pipelines
    • No correlation with structured telemetry data
  3. BeforeModel/AfterModel hooks (issue #21531) — Would solve this but:

    • Not yet implemented
    • May not expose system prompts depending on design

Priority

Critical - Blocking production deployment of AI observability

Feature Category

Monitoring and observability

Use Case Example

Scenario 1: LLM quality evaluators

AI observability platforms run automated evaluators on LLM spans to detect:

  • Hallucinations (faithfulness to sources)
  • Toxicity and bias
  • PII leakage
  • Prompt injection attempts
  • Answer quality and relevance

These evaluators require access to both user input and model output (and sometimes the system prompt) in the same span. Currently:

  • ✅ User input available via OTEL_LOG_USER_PROMPTS=1
  • ❌ System prompt not available
  • ❌ Model output not available
  • Result: Cannot run automated quality checks on Claude Code sessions
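
The gap can be made concrete with a small pre-check that an evaluation pipeline might run on each span. This is a sketch only: missing_for_evaluation is a hypothetical helper, the span is modeled as a plain dictionary, and the attribute names llm.user_input and llm.output are the ones proposed in this request.

```python
from typing import Dict, List

# Attributes an input/output evaluator needs on the same span
# (names taken from the proposal, not from an existing schema).
REQUIRED_ATTRIBUTES = ("llm.user_input", "llm.output")


def missing_for_evaluation(span: Dict) -> List[str]:
    """List the required content attributes that are absent or empty."""
    attributes = span.get("attributes", {})
    return [name for name in REQUIRED_ATTRIBUTES if not attributes.get(name)]


# A span as exported today with OTEL_LOG_USER_PROMPTS=1: no model output.
current_span = {
    "span_name": "llm_request",
    "attributes": {
        "input_tokens": 1500,
        "output_tokens": 500,
        "llm.user_input": "Search for files",
    },
}
print(missing_for_evaluation(current_span))  # ['llm.output']
```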

Scenario 2: Compliance and audit trails

Regulated industries (healthcare, finance, legal) need complete records of AI interactions for:

  • GDPR Article 22 (automated individual decision-making, often cited for a right to explanation)
  • SOC 2 audit requirements
  • Internal compliance reviews

Current telemetry captures what the user asked but not what the AI responded, creating an incomplete audit trail.

Scenario 3: Debugging model behavior

When users report incorrect or unexpected responses, teams need to see:

  • What the model actually said (not just that a response was generated)
  • The thinking/reasoning that led to that response
  • How the response relates to tool calls

Without this data, debugging requires manually reading .jsonl transcripts instead of querying structured observability platforms.

Scenario 4: Cost optimization via content analysis

Teams want to correlate response verbosity with token costs:

  • Which types of prompts generate unnecessarily long responses?
  • Are responses including redundant explanations?
  • Can we optimize prompts to reduce output tokens?

This requires joining response text with token usage in analytics queries.
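
Once response text sits on the same span as the token counts, the join becomes a simple pass over exported spans. The sketch below assumes spans have already been loaded as dictionaries carrying the proposed llm.user_input and llm.output attributes; verbosity_report is a hypothetical helper, not part of any observability platform.

```python
from typing import Dict, List, Tuple


def verbosity_report(spans: List[Dict]) -> List[Tuple[str, int, int]]:
    """Return (prompt, output_tokens, output_chars) rows, most tokens first."""
    rows = []
    for span in spans:
        attributes = span.get("attributes", {})
        output = attributes.get("llm.output")
        if output is None:
            continue  # response content was not exported for this span
        rows.append(
            (
                attributes.get("llm.user_input", ""),
                attributes.get("output_tokens", 0),
                len(output),
            )
        )
    # Sort so the most expensive responses appear at the top.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Spans without llm.output are skipped rather than counted as empty, so sessions recorded with the flag off do not skew the ranking.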

Scenario 5: Performance analysis

Understanding the relationship between response quality and latency:

  • Do faster responses sacrifice quality?
  • Which types of queries produce the most concise answers?
  • How does thinking time correlate with response accuracy?

Without access to response content, these analyses are impossible.

Additional Context

Related issues:

  • #42281 — Native OTLP trace/span export (addresses trace structure but not content visibility)
  • #21531 — BeforeModel/AfterModel hooks (alternative approach to expose LLM request/response)
  • #17212 — Privacy concerns about tool_parameters leaking prompt data (this request is the inverse — asking for opt-in content logging)

Security/privacy considerations:

We understand this is sensitive data. Our proposal is:

  • Opt-in only (off by default, requires explicit env var)
  • Same privacy model as existing flags (OTEL_LOG_USER_PROMPTS, OTEL_LOG_TOOL_CONTENT)
  • Same truncation limits (60KB like tool content)
  • System prompts can be separate flag if they contain proprietary Anthropic content

The precedent already exists: OTEL_LOG_USER_PROMPTS=1 exports user input (which can contain secrets, PII, etc.), and it's opt-in. We're asking for the same treatment for model output.

Industry standard:

LLM frameworks that export response content via OTEL:

  • LangChain: langchain.llm spans include an output attribute
  • LlamaIndex: llm spans include a response attribute
  • OpenAI SDK (via third-party instrumentations): response content in traces

Claude Code should offer comparable observability to these widely-adopted LLM frameworks.

Proposed Flag Behavior

# Default (current behavior)
OTEL_LOG_LLM_RESPONSES=0  # No response content exported

# Opt-in (proposed)
OTEL_LOG_LLM_RESPONSES=1  # Exports model response text as span attributes
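
One way to read the proposed flag is shown below. This is a sketch, not the real Claude Code logic: flag_enabled is a hypothetical helper, and treating only the exact value "1" as enabled is an assumption, since the request shows just the values 0 and 1.

```python
import os
from typing import Mapping


def flag_enabled(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the variable is set to "1".

    Unset, "0", or any other value keeps content export off, so the
    default matches the current behavior.
    """
    return env.get(name, "0").strip() == "1"


# Example with an explicit mapping instead of the process environment.
settings = {"OTEL_LOG_LLM_RESPONSES": "1"}
print(flag_enabled("OTEL_LOG_LLM_RESPONSES", settings))  # True
print(flag_enabled("OTEL_LOG_THINKING", settings))       # False
```

Failing closed on unexpected values keeps a typo in the configuration from exporting sensitive content by accident.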

Expected Benefits

  1. Complete observability — Organizations can monitor both inputs and outputs through their existing OTLP infrastructure
  2. Automated quality assurance — Enable evaluation frameworks to run automated checks on production traffic
  3. Better debugging — Teams can query structured telemetry instead of reading raw transcript files
  4. Cost optimization — Correlate response characteristics with token usage for optimization opportunities
  5. Compliance readiness — Provide complete audit trails for regulated industries
  6. Parity with SDK usage — Claude Code observability matches what's available when using the Anthropic SDK directly

extent analysis

TL;DR

To address the critical gap in LLM interaction data, introduce an opt-in environment variable OTEL_LOG_LLM_RESPONSES to export model response content in OTEL telemetry.

Guidance

  1. Implement the proposed environment variable: Add OTEL_LOG_LLM_RESPONSES to control the export of model response text, similar to OTEL_LOG_USER_PROMPTS.
  2. Define the export structure: Use span attributes on existing llm_request spans to include llm.output for model response text and llm.thinking for thinking content.
  3. Handle sensitive data: Consider separate opt-in flags for system prompts due to potential sensitivity and ensure truncation limits are applied consistently.
  4. Review industry standards: Align Claude Code's observability with other LLM frameworks like LangChain, LlamaIndex, and OpenAI SDK for comprehensive telemetry.

Example

{
  "span_name": "llm_request",
  "attributes": {
    "input_tokens": 1500,
    "output_tokens": 500,
    "llm.user_input": "Search for files",
    "llm.output": "Here's what I found...",
    "llm.thinking": "I should use Grep..."
  }
}

Notes

The implementation should prioritize model response text and thinking content, with system prompts considered separately due to potential sensitivity. Ensuring the new telemetry aligns with existing privacy models and truncation limits is crucial.

Recommendation

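Implement the proposed OTEL_LOG_LLM_RESPONSES environment variable to enable opt-in export of model response content, enhancing observability and compliance capabilities. The flag does not exist yet, so until it ships the transcript-based workarounds listed above remain the only option.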
