claude-code - 💡(How to fix) Fix [FEATURE] Environment variable flag to export LLM response content and thinking steps in OTEL telemetry [1 participants]

claude-code 2026-04-10 07:35:49


GitHub stats

anthropics/claude-code#46118•Fetched 2026-04-11 06:28:36


Author

robin-fiddler

Participants

robin-fiddler

Timeline (top)

labeled ×2

Root Cause

Claude Code's OTEL telemetry does not export model response content, so LLM observability platforms cannot run evaluators on Claude Code sessions: their evaluation frameworks require spans with complete LLM interaction data, including both inputs and outputs.



Preflight Checklist

I have searched existing requests and this feature hasn't been requested yet
This is a single feature request (not multiple features)

Problem Statement

By default, Claude Code's OTEL telemetry exports metadata about LLM interactions (token counts, model name, latency, cost) but not their content. User prompts and tool content can be enabled with opt-in flags, but there is no way to export model responses or thinking steps. This creates a critical gap for enterprise observability and AI evaluation platforms.

What's currently available via environment variables:

{
  "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
  "OTEL_LOG_USER_PROMPTS": "1",        // ✅ User input captured
  "OTEL_LOG_TOOL_DETAILS": "1",        // ✅ Tool parameters captured
  "OTEL_LOG_TOOL_CONTENT": "1"         // ✅ Tool output captured
}

What's missing:

❌ Model response text — what the LLM actually said to the user
❌ Thinking content — the model's chain-of-thought reasoning
❌ System prompts — the instructions sent to the model (acknowledged as potentially sensitive)

Proposed Solution

Add an opt-in environment variable that exports LLM response content in OTEL telemetry, similar to how OTEL_LOG_USER_PROMPTS=1 works for user input:

{
  "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
  "OTEL_LOG_USER_PROMPTS": "1",
  "OTEL_LOG_LLM_RESPONSES": "1",        // ← NEW: exports model response text
  "OTEL_LOG_THINKING": "1"              // ← NEW: exports thinking/reasoning content
}

What should be exported:

1. Model response text (high priority)
   - Attribute: llm.output or assistant_response
   - Where: On existing llm_request trace spans
   - Truncation: 60KB limit (consistent with OTEL_LOG_TOOL_CONTENT)
2. Thinking content (high priority)
   - Attribute: llm.thinking or thinking_output
   - Where: On tool execution spans or llm_request spans
   - Use case: Understanding model reasoning for debugging
3. System prompts (medium priority, acknowledged as sensitive)
   - Attribute: llm.system_prompt
   - Privacy consideration: May contain Anthropic IP, could be opt-in separately
   - Alternative: Redacted/templated version showing structure without specifics
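
The 60KB truncation mentioned above could look roughly like the following. This is a minimal sketch, not Claude Code's implementation: it assumes "60KB" means 60 * 1024 bytes of UTF-8 (the issue does not say whether the limit counts bytes or characters), and the helper name truncate_utf8 is hypothetical.

```python
# Assumption: "60KB" is read as 60 * 1024 bytes of UTF-8 text.
TRUNCATION_LIMIT_BYTES = 60 * 1024


def truncate_utf8(text: str, limit: int = TRUNCATION_LIMIT_BYTES) -> str:
    """Return text cut so that its UTF-8 encoding fits within limit bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return text
    # errors="ignore" drops a multi-byte character that the cut split in half,
    # so the result is always valid text and never longer than the limit.
    return encoded[:limit].decode("utf-8", errors="ignore")
```

Cutting on bytes rather than characters keeps the attribute size predictable for exporters, at the cost of occasionally dropping one trailing character.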

Implementation approach:

Add the response content as span attributes on existing llm_request spans:

{
  "span_name": "llm_request",
  "attributes": {
    "input_tokens": 1500,
    "output_tokens": 500,
    "llm.user_input": "Search for files",      // from OTEL_LOG_USER_PROMPTS
    "llm.output": "Here's what I found...",    // from OTEL_LOG_LLM_RESPONSES
    "llm.thinking": "I should use Grep..."     // from OTEL_LOG_THINKING
  }
}

This keeps all LLM interaction data in a single span, which is what standard observability platforms expect for LLM evaluation workflows.
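
As an illustration of the gating described above, here is a small sketch that assembles the attribute set for one llm_request span. It uses a plain dictionary rather than the OpenTelemetry SDK so it stays self-contained. The function name and parameters are hypothetical, and OTEL_LOG_LLM_RESPONSES and OTEL_LOG_THINKING are the flags proposed in this issue, not existing settings.

```python
from typing import Dict, Mapping, Optional, Union

AttrValue = Union[str, int]


def build_llm_request_attributes(
    env: Mapping[str, str],
    input_tokens: int,
    output_tokens: int,
    user_input: str,
    output: str,
    thinking: Optional[str] = None,
) -> Dict[str, AttrValue]:
    """Build span attributes, adding content only for flags set to "1"."""
    # Token counts are metadata and are always present.
    attrs: Dict[str, AttrValue] = {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    if env.get("OTEL_LOG_USER_PROMPTS") == "1":
        attrs["llm.user_input"] = user_input
    # Proposed flag: model response text.
    if env.get("OTEL_LOG_LLM_RESPONSES") == "1":
        attrs["llm.output"] = output
    # Proposed flag: thinking content, when the response has any.
    if thinking is not None and env.get("OTEL_LOG_THINKING") == "1":
        attrs["llm.thinking"] = thinking
    return attrs
```

With no flags set, the result contains only the token counts, which matches the current metadata-only behavior described in the problem statement.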

Alternative Solutions

Current workarounds (all inadequate):

1. Hook-based transcript parsing: reading .jsonl transcript files via Stop hooks works, but:
   - Requires building custom tooling
   - Creates separate spans that don't integrate with standard observability platforms
   - Adds operational complexity (proxy servers, state tracking)
   - Still can't access system prompts
2. Manual transcript review: teams currently read raw .jsonl files for debugging:
   - Not scalable for production monitoring
   - Can't be integrated into automated evaluation pipelines
   - No correlation with structured telemetry data
3. BeforeModel/AfterModel hooks (issue #21531): would solve this, but:
   - Not yet implemented
   - May not expose system prompts depending on design
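
To show the kind of custom tooling the transcript-parsing workaround requires, here is a rough sketch of pulling assistant text and thinking out of a .jsonl transcript. The record layout is an assumption made for illustration: one JSON object per line, with "type": "assistant" and a message.content list of blocks typed "text" or "thinking". The issue does not document the schema, so check it against real transcripts before relying on this.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_assistant_content(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (kind, text) pairs, where kind is "text" or "thinking"."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed or partially written lines
        # Assumed schema: assistant turns are marked with type == "assistant".
        if not isinstance(record, dict) or record.get("type") != "assistant":
            continue
        message = record.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "text":
                yield "text", block.get("text", "")
            elif block.get("type") == "thinking":
                yield "thinking", block.get("thinking", "")
```

Even with a parser like this, the extracted content still has to be correlated with telemetry spans by hand, which is the integration gap the list above describes.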

Priority

Critical - Blocking production deployment of AI observability

Feature Category

Monitoring and observability

Use Case Example

Scenario 1: LLM quality evaluators

AI observability platforms run automated evaluators on LLM spans to detect:

Hallucinations (faithfulness to sources)
Toxicity and bias
PII leakage
Prompt injection attempts
Answer quality and relevance

These evaluators require access to both user input and model output (and sometimes the system prompt) in the same span. Currently:

✅ User input available via OTEL_LOG_USER_PROMPTS=1
❌ System prompt not available
❌ Model output not available
Result: Cannot run automated quality checks on Claude Code sessions
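
A toy evaluator makes the dependency concrete. The sketch below is hypothetical and not taken from any observability platform: it refuses to run unless a span carries both llm.user_input and llm.output (the attribute names proposed in this issue), then applies a deliberately naive email regex as a stand-in for a real PII check.

```python
import re
from typing import Any, Dict, Mapping

# Naive pattern used only for illustration; real PII detection is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

REQUIRED_ATTRIBUTES = ("llm.user_input", "llm.output")


def evaluate_span(span: Mapping[str, Any]) -> Dict[str, Any]:
    """Report whether a span can be evaluated and any emails in its output."""
    attrs = span.get("attributes", {})
    missing = [key for key in REQUIRED_ATTRIBUTES if key not in attrs]
    if missing:
        # Without both sides of the interaction there is nothing to evaluate.
        return {"evaluable": False, "missing": missing, "pii_emails": []}
    return {
        "evaluable": True,
        "missing": [],
        "pii_emails": EMAIL_RE.findall(attrs["llm.output"]),
    }
```

A span produced by today's telemetry, which has no llm.output attribute, comes back as not evaluable, which is the failure mode described above.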

Scenario 2: Compliance and audit trails

Regulated industries (healthcare, finance, legal) need complete records of AI interactions for:

GDPR Article 22 (automated individual decision-making, often cited as the basis for a right to explanation)
SOC 2 audit requirements
Internal compliance reviews

Current telemetry captures what the user asked but not what the AI responded, creating an incomplete audit trail.

Scenario 3: Debugging model behavior

When users report incorrect or unexpected responses, teams need to see:

What the model actually said (not just that a response was generated)
The thinking/reasoning that led to that response
How the response relates to tool calls

Without this data, debugging requires manually reading .jsonl transcripts instead of querying structured observability platforms.

Scenario 4: Cost optimization via content analysis

Teams want to correlate response verbosity with token costs:

Which types of prompts generate unnecessarily long responses?
Are responses including redundant explanations?
Can we optimize prompts to reduce output tokens?

This requires joining response text with token usage in analytics queries.
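
A sketch of that join, assuming exported spans are available as dictionaries shaped like the llm_request example in this issue. The function name and the row fields response_chars and chars_per_token are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Mapping


def verbosity_rows(spans: Iterable[Mapping[str, Any]]) -> List[Dict[str, Any]]:
    """Pair each response's length with its token usage, largest first."""
    rows: List[Dict[str, Any]] = []
    for span in spans:
        attrs = span.get("attributes", {})
        output = attrs.get("llm.output")
        tokens = attrs.get("output_tokens")
        # Skip spans without response text or with a missing/zero token count.
        if output is None or not tokens:
            continue
        rows.append({
            "user_input": attrs.get("llm.user_input", ""),
            "output_tokens": tokens,
            "response_chars": len(output),
            "chars_per_token": len(output) / tokens,
        })
    rows.sort(key=lambda row: row["output_tokens"], reverse=True)
    return rows
```

Spans without llm.output are skipped, so with current telemetry the result is always empty.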

Scenario 5: Performance analysis

Understanding the relationship between response quality and latency:

Do faster responses sacrifice quality?
Which types of queries produce the most concise answers?
How does thinking time correlate with response accuracy?

Without access to response content, these analyses are impossible.

Additional Context

Related issues:

#42281 — Native OTLP trace/span export (addresses trace structure but not content visibility)
#21531 — BeforeModel/AfterModel hooks (alternative approach to expose LLM request/response)
#17212 — Privacy concerns about tool_parameters leaking prompt data (this request is the inverse — asking for opt-in content logging)

Security/privacy considerations:

We understand this is sensitive data. Our proposal is:

Opt-in only (off by default, requires explicit env var)
Same privacy model as existing flags (OTEL_LOG_USER_PROMPTS, OTEL_LOG_TOOL_CONTENT)
Same truncation limits (60KB like tool content)
System prompts can be separate flag if they contain proprietary Anthropic content

The precedent already exists: OTEL_LOG_USER_PROMPTS=1 exports user input (which can contain secrets, PII, etc.), and it's opt-in. We're asking for the same treatment for model output.

Industry standard:

LLM frameworks that export response content via OTEL:

LangChain — langchain.llm spans include output attribute
LlamaIndex — llm spans include response attribute
OpenAI SDK (via third-party instrumentations) — Response content in traces

Claude Code should offer comparable observability to these widely-adopted LLM frameworks.

Proposed Flag Behavior

# Default (current behavior)
OTEL_LOG_LLM_RESPONSES=0  # No response content exported

# Opt-in (proposed)
OTEL_LOG_LLM_RESPONSES=1  # Exports model response text as span attributes
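
The opt-in semantics can be sketched as a small flag reader. Two assumptions go beyond what the issue states: only the literal value "1" counts as enabled, and content flags have no effect unless CLAUDE_CODE_ENABLE_TELEMETRY=1 is also set. ContentFlags and read_content_flags are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ContentFlags:
    user_prompts: bool
    llm_responses: bool  # proposed flag, not an existing setting
    thinking: bool       # proposed flag, not an existing setting


def read_content_flags(env: Mapping[str, str] = os.environ) -> ContentFlags:
    """Resolve content-export flags; everything is off unless opted in."""
    # Assumption: content flags are ignored while telemetry itself is off.
    telemetry_on = env.get("CLAUDE_CODE_ENABLE_TELEMETRY") == "1"

    def enabled(name: str) -> bool:
        # Only the literal "1" opts in; unset, "0", or anything else stays off.
        return telemetry_on and env.get(name) == "1"

    return ContentFlags(
        user_prompts=enabled("OTEL_LOG_USER_PROMPTS"),
        llm_responses=enabled("OTEL_LOG_LLM_RESPONSES"),
        thinking=enabled("OTEL_LOG_THINKING"),
    )
```

Treating every value other than "1" as off keeps the default conservative, in line with the opt-in privacy model requested above.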

Expected Benefits

Complete observability — Organizations can monitor both inputs and outputs through their existing OTLP infrastructure
Automated quality assurance — Enable evaluation frameworks to run automated checks on production traffic
Better debugging — Teams can query structured telemetry instead of reading raw transcript files
Cost optimization — Correlate response characteristics with token usage for optimization opportunities
Compliance readiness — Provide complete audit trails for regulated industries
Parity with SDK usage — Claude Code observability matches what's available when using the Anthropic SDK directly

extent analysis

TL;DR

To address the critical gap in LLM interaction data, introduce an opt-in environment variable OTEL_LOG_LLM_RESPONSES to export model response content in OTEL telemetry.

Guidance

Implement the proposed environment variable: Add OTEL_LOG_LLM_RESPONSES to control the export of model response text, similar to OTEL_LOG_USER_PROMPTS.
Define the export structure: Use span attributes on existing llm_request spans to include llm.output for model response text and llm.thinking for thinking content.
Handle sensitive data: Consider separate opt-in flags for system prompts due to potential sensitivity and ensure truncation limits are applied consistently.
Review industry standards: Align Claude Code's observability with other LLM frameworks like LangChain, LlamaIndex, and OpenAI SDK for comprehensive telemetry.

Example

{
  "span_name": "llm_request",
  "attributes": {
    "input_tokens": 1500,
    "output_tokens": 500,
    "llm.user_input": "Search for files",
    "llm.output": "Here's what I found...",
    "llm.thinking": "I should use Grep..."
  }
}

Notes

The implementation should prioritize model response text and thinking content, with system prompts considered separately due to potential sensitivity. Ensuring the new telemetry aligns with existing privacy models and truncation limits is crucial.

Recommendation

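OTEL_LOG_LLM_RESPONSES is a proposed flag, not an existing setting, so there is no workaround to apply yet. The recommendation is for Claude Code to add it as an opt-in environment variable that exports model response content, which would improve observability and compliance capabilities.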
