claude-code: Claude Code conducts symptom-driven exploration instead of root-cause analysis on complex multi-component systems (concrete side-by-side with Codex showing the gap) [1 comment, 2 participants]

GitHub stats
anthropics/claude-code#56069
Fetched 2026-05-05 05:59:00
Comments: 1
Participants: 2
Timeline: 7
Reactions: 0
Timeline (top): labeled ×5, commented ×1, unlabeled ×1


Preflight Checklist

  • I have searched existing issues and this hasn't been reported yet
  • This is a single bug report (please file separate reports for different bugs)
  • I am using the latest version of Claude Code

What's Wrong?

Environment

  • Claude Code CLI, model: Opus 4.7 (1M context), effort=max
  • Skill invoked: custom /jellyfin-audit (32-layer audit checklist that explicitly names the organizer as "Schicht 6 — KRITISCHSTE SCHICHT" and states "no symptom-fixing, root cause first")
  • Project: Proxmox homeserver, multi-component Jellyfin stack (LXC 101 Jellyfin, 102 Caddy, 103 Failsafe-Dashboard,
    104 VS Code) plus a custom Python organizer (jellyfin-organizer.py, 4099 lines / 177 KB) that ingests uploads,
    converts BDMV/DVD/ISO to MKV, normalizes audio tracks, organizes series, maintains a film index, and runs as a systemd-supervised watcher.
  • User profile: ADHD. Explicitly relies on the agent to hold the structural overview that the user cannot hold themselves.

What the user asked for (verbatim, both sessions)

▎ "Hello, I have problems with Jellyfin. Either series like Terminator SCC are incomplete, audio tracks are totally
▎ unusable, or e.g. with Jason Bourne the film scenes are jumbled. Jellyfin is unusable in this state. Do a complete,
▎ semantic, forensic, comprehensive deep analysis — every single dependency, every single line of code, every
▎ configuration must be fully captured and understood, and only then, when you are 100% sure, fix it."

The same prompt, two agents, completely different opening moves

Claude Code's first ~25 tool calls

  1. pct list (containers)
  2. systemctl status jellyfin
  3. /Devices API (which client does the user use)
  4. /Search/Hints?searchTerm=Jason%20Bourne — reproducing the example the user gave
  5. /Items count vs. user-filter count
  6. Branding/Custom-CSS dump
  7. Plugin list
  8. Webroot diff (index.html, jellyfin-enhancements.js)
  9. jellyfin-collections.json generation chain
  10. encoding.xml, system.xml, network.xml
  11. ScheduledTasks status
  12. JF activity log
  13. Storage health
  14. … etc.

After ~25 calls Claude produced an 8-finding report covering: a CSS focus-outline color, a MutationObserver scope, a
missing APT hook, two scheduled-task triggers, an inventory discrepancy, and the search-AND-match algorithm. The
organizer was never opened. Not once. None of these findings can produce "scenes scrambled" or "audio unusable" or
"series incomplete" — the actual symptom classes. When the user pushed back ("the films are only examples", "must be dynamic", "concerns Jellyfin generally"), Claude treated each correction as an additional constraint to layer on top, never as a signal that the entry point was wrong.

Codex's first ~25 tool calls (same prompt, same machine, same data)

  1. Wrote a 5-step plan ("inventory, config, affected media, root cause, validate") and committed to a non-destructive bring-up.
  2. docker ps, ps aux | grep jellyfin, find / -iname 'jellyfin', dpkg -l, ss -ltnp — establishing what runs where
    before touching anything.
  3. Discovered Jellyfin lives in an LXC namespace and pivoted to /proc/<pid>/root/... and nsenter to see exactly what Jellyfin sees — solving the namespace-mismatch before it caused wrong conclusions.
  4. Read the actual config files Jellyfin uses (not the host-side ones).
  5. Within the first 10 calls noticed the running watcher process (/opt/jellyfin-organizer/jellyfin-watcher.sh) and the bdmv-to-mkv.sh and add-aac-track.sh helpers, i.e. the agent immediately identified the organizer as the suspected
    foundation, the component every reported symptom depends on.
  6. ffprobed the example files and found the smoking gun: every Bourne MKV and several SCC episodes carry
    TAG:ENCODER=Lavf61.7.100, meaning the organizer rewrote them locally (not original rips). Some SCC episodes still
    carry libebml/libmatroska (untouched mkvmerge originals) — i.e. the organizer touched some files and not others, and the touched ones are the broken ones.
  7. Cross-referenced with /mnt/media/.duplikate_2026-05-03/scc_loudnorm/ (organizer-generated backup directory of pre-modification files) and /mnt/media/.scc-rips/ (clean disc rips) to prove the organizer is the source.
  8. When the user asked "could the organizer be defective?", Codex answered "yes, very plausible" and immediately stopped the watcher to prevent further mutation during analysis — kill -TERM, then noticed Restart=always, then
    disabled the systemd unit via the LXC management path, then verified is-enabled=disabled, is-active=inactive. Stopped the bleeding before continuing diagnosis.
  9. Built a full per-episode runtime/size/encoder matrix for SCC and the Bourne films, cross-referenced against the rip sources and NFO metadata.
  10. Read the relevant organizer functions (add_aac_track, check_episode_truncation_via_tmdb, BDMV→MKV path, loudnorm path) at line-number precision to identify where in the code the destruction happens.
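
The encoder-tag check from step 6 can be sketched in Python. This is a minimal illustration, not what Codex actually ran: the function names are hypothetical, it assumes ffprobe is on the PATH, and it relies on the heuristic described above (a Lavf encoder tag means the file was re-muxed by FFmpeg, a libebml/libmatroska tag means an untouched mkvmerge original).

```python
import json
import subprocess


def classify_encoder(encoder_tag: str) -> str:
    """Classify a container-level ENCODER tag (hypothetical helper)."""
    tag = encoder_tag.strip().lower()
    if tag.startswith("lavf"):
        # Written by FFmpeg's libavformat, i.e. rewritten after ripping.
        return "rewritten-by-ffmpeg"
    if "libebml" in tag or "libmatroska" in tag:
        # Written by mkvmerge, i.e. most likely the original rip.
        return "mkvmerge-original"
    return "unknown"


def probe_encoder(path: str) -> str:
    """Read the container ENCODER tag with ffprobe; returns '' if absent."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_format", "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    tags = json.loads(out).get("format", {}).get("tags", {})
    # Tag key case differs between containers, so compare case-insensitively.
    for key, value in tags.items():
        if key.lower() == "encoder":
            return value
    return ""
```

Running classify_encoder(probe_encoder(path)) over every episode gives one column of the per-episode matrix from step 9. The tag only shows which tool last wrote the file; it does not prove the file is damaged.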

The structural difference (this is the actual issue)

┌───────────────────────────┬────────────────────────────┬───────────────────────────────────────────────────────┐
│                           │ Claude Code                │ Codex                                                 │
├───────────────────────────┼────────────────────────────┼───────────────────────────────────────────────────────┤
│ First action              │ Query the most recently    │ Write a plan, then enumerate what runs where          │
│                           │ touched API surface        │                                                       │
├───────────────────────────┼────────────────────────────┼───────────────────────────────────────────────────────┤
│ Treated user's example    │ Test cases to reproduce    │ Hints toward a class of corruption to characterize    │
│ titles as                 │ against                    │                                                       │
├───────────────────────────┼────────────────────────────┼───────────────────────────────────────────────────────┤
│ When user said "those are │ Added it as a constraint,  │ Would have re-derived from symptom class (didn't need │
│ only examples"            │ kept the same              │ to; Codex never went down the example-specific        │
│                           │ investigation tree         │ path)                                                 │
├───────────────────────────┼────────────────────────────┼───────────────────────────────────────────────────────┤
│ Identified the organizer  │ Never (in ~25 calls)       │ Within the first 10 calls                             │
├───────────────────────────┼────────────────────────────┼───────────────────────────────────────────────────────┤
│ Stopped the live mutation │ Never                      │ Immediately, with rollback-safe disable               │
│ source                    │                            │                                                       │
├───────────────────────────┼────────────────────────────┼───────────────────────────────────────────────────────┤
│ Used namespace-aware      │ No (would have been wrong  │ Yes (/proc/<pid>/root/, nsenter) before reading any   │
│ paths                     │ if it had read host        │ config                                                │
│                           │ config)                    │                                                       │
├───────────────────────────┼────────────────────────────┼───────────────────────────────────────────────────────┤
│ Symptom-to-component      │                            │ Implicit but acted upon: scenes-scrambled → BDMV      │
│ mapping                   │ Never written down         │ concat, audio-unusable → audio-rewrite chain,         │
│                           │                            │ series-incomplete → series organize logic             │
└───────────────────────────┴────────────────────────────┴───────────────────────────────────────────────────────┘

Why this matters specifically for ADHD users

The user explicitly said: "For me with ADHD, Codex is unfortunately the better help." This is not a tone preference; it is a structural one. The value proposition of an agentic coding tool for a user with executive-function challenges is precisely that the tool holds the structural overview the user cannot hold. The user offloads "where do I start,
what's the foundation, what's a downstream symptom", which is exactly the part they have trouble with.

When Claude Code instead does breadth-first symptom collection across whichever layer it touches first, it duplicates the user's own difficulty. The user ends up steering the tool turn by turn ("no, the organizer", "no, the fundament", "no, dynamic, not film-specific"), which is the exact cognitive load they tried to delegate. They eventually concluded — accurately — that the tool was costing them time and tokens without converging.

Codex's opening "I will do non-destructive inventory before changing anything, here is the 5-step plan" is what the
ADHD user perceived as help. Claude Code's opening "let me query the device list and search API" is what they perceived as another thing to manage.

Concrete behaviors that should change

  1. Before the first tool call, emit a written symptom-to-component mapping. Given symptoms A, B, C, write: "A is
    plausibly produced by components X, Y; B by Y, Z; C by Y. Intersection = Y. Entry point = Y." This single artifact would have caught this case — all three of the user's symptom classes (scrambled scenes, broken audio, incomplete
    series) trivially intersect at the organizer.
  2. Use the loaded skill's layer ranking as a traversal bias, not background information. When a custom skill is loaded and explicitly ranks layers by criticality (this skill literally writes "KRITISCHSTE SCHICHT" next to layer 6), the
    model's first three tool calls should touch that layer. In this session they did not.
  3. Treat "those are only examples" as a premise reset, not an additional constraint. When the user says the example
    items are not the target, abandon item-specific tests and re-derive from the symptom class. The current behavior
    layers the constraint on top of the wrong investigation tree, so steering keeps producing the same misalignment.
  4. For systems with active mutators (watchers, daemons, cron jobs), stop the mutation before deep analysis. Codex did this within minutes; Claude never recognized the live watcher as a confounder for diagnosis. A general principle: "if a process can change the artifacts you are about to analyze, neutralize it first or note that your conclusions have a freshness window."
  5. Namespace/container awareness as a first-class step. Before reading config files, verify the running process's view (/proc/<pid>/root, nsenter, pct exec) matches the agent's shell view. Codex pivoted to this within the first 10
    calls; Claude never noticed the discrepancy was even possible (in this case it didn't bite, but in many setups it would).
  6. For users who flag ADHD or similar workflows, prefer the Codex-style opening: re-inventory components, draw the dependency graph, name the suspected foundational component, then drill. The current default of "start where the most recent tool
    result took us" actively hurts these users.
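
The symptom-to-component mapping from item 1 can be sketched as a small set intersection. This is a minimal Python sketch under stated assumptions: entry_points is a hypothetical helper, and the candidate components other than the organizer are illustrative placeholders, not findings from the session.

```python
from __future__ import annotations

from collections import Counter


def entry_points(symptom_map: dict[str, set[str]]) -> list[str]:
    """Return the components that can plausibly produce every symptom.

    If no component explains all symptoms, fall back to the components
    that explain the largest number of them.
    """
    if not symptom_map:
        return []
    common = set.intersection(*symptom_map.values())
    if common:
        return sorted(common)
    counts = Counter(c for comps in symptom_map.values() for c in comps)
    if not counts:
        return []
    best = max(counts.values())
    return sorted(c for c, n in counts.items() if n == best)


# Illustrative mapping; only "organizer" is taken from the issue text.
mapping = {
    "scrambled scenes": {"organizer", "transcoding"},
    "unusable audio": {"organizer", "client playback"},
    "incomplete series": {"organizer", "library scan"},
}
print(entry_points(mapping))  # ['organizer']
```

Writing the mapping down before the first tool call is the point of the exercise. Item 3 then amounts to rebuilding the mapping and calling entry_points again whenever the user corrects the premise.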

Reproducer

Hand the model:

  • A multi-component system with one component that is the obvious common cause of multiple user-visible symptoms.
  • A list of three or four user-visible symptoms that all stem from that one component.
  • A skill checklist that names that component as the most critical.

Observe whether the first three tool calls touch that component, or whether they touch the layer with the easiest API surface. In the session documented here, the first ~25 tool calls did not touch the organizer at all, despite the
skill checklist literally calling it the most critical layer.
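
The observation step can be made mechanical. The following Python sketch assumes the tool calls are available as plain strings and that the critical component can be recognized by a few substrings; both the helper names and the example call strings (paraphrased from the list of Claude Code's first calls) are illustrative assumptions.

```python
from __future__ import annotations

from typing import Optional


def first_touch_index(tool_calls: list[str], markers: list[str]) -> Optional[int]:
    """1-based index of the first call that mentions the critical component."""
    for i, call in enumerate(tool_calls, start=1):
        lowered = call.lower()
        if any(marker.lower() in lowered for marker in markers):
            return i
    return None


def touches_within(tool_calls: list[str], markers: list[str], budget: int = 3) -> bool:
    """True if the critical component is touched within the first `budget` calls."""
    idx = first_touch_index(tool_calls, markers)
    return idx is not None and idx <= budget


# Paraphrased from the documented session; the markers are assumptions.
calls = [
    "pct list",
    "systemctl status jellyfin",
    "GET /Devices",
    "GET /Search/Hints?searchTerm=Jason%20Bourne",
]
markers = ["jellyfin-organizer", "jellyfin-watcher"]
print(touches_within(calls, markers))  # False
```

A substring match is a crude proxy for "touches the component", but it is enough to tell the two opening strategies apart in a reproducer.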

Suggested fix direction

A pre-tool-call planning step that, when a custom skill with criticality-ranked layers is loaded, produces a written
component-criticality ranking and a written symptom-to-component mapping before the first tool call. Re-emit (and update) that mapping every time the user pushes back on the investigation premise. Optionally bias subagent dispatch
to the highest-criticality layer first.

User impact

After the user spent months building this Jellyfin stack and asked for a forensic root-cause investigation, Claude
Code spent ~25 tool calls and one long report on the wrong layers, never opened the organizer, and never stopped the live watcher that was mutating media files during the analysis. Codex, given the same prompt, identified the organizer as the suspect within the first 10 calls, stopped the watcher to preserve evidence, and produced a per-episode integrity matrix using the organizer's own backup directories as ground truth. The user ended the Claude session and switched to Codex — not because of tone or model quality, but because of investigation strategy.


Filed by a paying Claude Code user with ADHD who genuinely prefers Claude's tone and reasoning when it works, but who currently has to switch to Codex for this class of investigation. The full Codex transcript demonstrating the
alternative behavior is included above.

What Should Happen?

When the user invokes a custom Skill that ranks investigation layers by criticality (e.g. "Schicht 6 — KRITISCHSTE SCHICHT") and provides a list of multiple user-visible symptoms, Claude Code should:

  1. Before the first tool call, emit a written symptom-to-component mapping: for each symptom, name the components that can plausibly produce it; identify the intersection; declare the intersection as the entry point for investigation.
  2. Bias the first tool calls toward the highest-criticality layer named in the loaded Skill, not the layer with the
    easiest API surface.
  3. Treat user corrections like "those are only examples" or "the solution must be dynamic" as a premise reset (re-derive from symptom class), not as an additional constraint to layer on top of the existing investigation tree.
  4. When a live mutator process is detected (watcher, daemon, cron job) that can modify the artifacts being analyzed, neutralize it (or warn explicitly that conclusions have a freshness window) before deep analysis.
  5. Verify namespace/container boundaries (LXC, Docker, chroot) before reading config files, so that the agent reads the same files the running service reads.
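
For item 4, the decision about how to neutralize a systemd-supervised mutator can be sketched as below. This is a hedged Python illustration, not Claude Code behavior: it parses the KEY=VALUE output of `systemctl show <unit> -p ActiveState -p UnitFileState -p Restart`, the unit name is unknown (the issue does not state it), and the step names are hypothetical labels.

```python
from __future__ import annotations


def parse_show(output: str) -> dict[str, str]:
    """Parse the KEY=VALUE lines printed by `systemctl show`."""
    props: dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def neutralization_steps(props: dict[str, str]) -> list[str]:
    """Return the steps needed before the artifacts can be analyzed safely."""
    steps: list[str] = []
    if props.get("ActiveState") == "active":
        steps.append("stop")
    if props.get("UnitFileState") == "enabled":
        steps.append("disable")
    # With a restart policy, a bare kill is undone by systemd, so the
    # stopped state has to be verified afterwards.
    if props.get("Restart", "no") != "no":
        steps.append("verify-not-restarted")
    return steps


sample = "ActiveState=active\nUnitFileState=enabled\nRestart=always\n"
print(neutralization_steps(parse_show(sample)))
```

This mirrors the sequence described for Codex: kill -TERM alone was not enough because of Restart=always, so the unit was disabled and the is-enabled / is-active state verified. Inside an LXC container the systemctl call would have to go through the container's management path, for example pct exec.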

In the documented session, all three reported symptom classes (scrambled scenes, broken audio, incomplete series)
trivially intersect at one component (a custom Python organizer with a live systemd watcher). Claude Code should have identified that intersection and opened that component first. Codex, given the identical prompt, did exactly that
within the first ~10 tool calls.
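
The namespace check from item 5 can be sketched in Python as well. This is a minimal sketch that assumes Linux and enough privilege to traverse /proc/<pid>/root; the helper names and the example config path are illustrative assumptions.

```python
import os


def process_view(pid: int, path: str) -> str:
    """Path to `path` as seen from the root directory of process `pid`."""
    return f"/proc/{pid}/root/{path.lstrip('/')}"


def same_file(pid: int, path: str) -> bool:
    """True if the shell and the process resolve `path` to the same inode.

    Raises OSError if either path is missing or not accessible.
    """
    host = os.stat(path)
    seen = os.stat(process_view(pid, path))
    return (host.st_dev, host.st_ino) == (seen.st_dev, seen.st_ino)


# Hypothetical PID and path, for illustration only.
print(process_view(1234, "/etc/jellyfin/system.xml"))
# /proc/1234/root/etc/jellyfin/system.xml
```

If same_file returns False, or the file exists only in the process's view, the agent's shell is reading a different config than the running service and should switch to the /proc/<pid>/root path, nsenter, or pct exec.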

Error Messages/Logs

No error message — this is a behavioral defect, not a crash.    
                                                                                                                        
  Symptom: ~25 tool calls were spent on layers that cannot produce the reported symptom classes                         
  (web UI search algorithm, custom CSS focus outline, MutationObserver scope, APT hooks,                                
  scheduled-task triggers, branding configuration). The component that can produce all three                            
  symptom classes (the custom organizer + its live watcher) was never opened in the entire session.                     
                                                                                                                        
  The user ended the session and switched to Codex, which on the identical prompt:                                      
  - identified the organizer as the suspect within the first ~10 calls                                                  
  - stopped the live watcher to preserve evidence                                                                       
  - produced a per-episode integrity matrix using the organizer's own backup directories as ground truth

Steps to Reproduce

Setup needed:

  • A multi-component system with one component that is the obvious common cause of multiple user-visible symptoms (e.g. a media-management stack where one Python script ingests, converts, and reorganizes files via a live systemd
    watcher).
  • A custom Skill loaded into Claude Code that explicitly ranks investigation layers by criticality (e.g. lists 32
    layers and labels one of them "MOST CRITICAL LAYER").
  • The component named as "most critical" in the Skill must be the one that actually produces the symptoms.

Reproduction:

  1. Load the custom Skill via the Skill tool.
  2. Send a prompt that lists 3+ user-visible symptom classes that all stem from the most-critical component, and ask for a "complete forensic deep analysis, root cause first, no symptom fixes."
  3. Observe the first ~10 tool calls.

Expected: First 3 tool calls touch the most-critical component named by the Skill.
Actual: First 3 tool calls touch the layer with the easiest API surface (in this case, Jellyfin REST API endpoints — devices, search, branding). The most-critical component is never opened.

  4. Push back with "those titles are only examples, must be dynamic, generic across all items."
    Expected: Premise reset (re-derive the investigation from the symptom class).
    Actual: The correction is added as a constraint on top of the existing (wrong) investigation tree.

  5. Ask "could the [most-critical component] be defective?"
    Expected: Agent immediately neutralizes any live mutator on that component (stop watcher, disable systemd unit) before deeper analysis.
    Actual: In the documented session, Claude listed the component's functions only after the user announced they were switching to a different tool.

The full Codex transcript on the identical prompt — showing the contrasting behavior call by call — is included in the issue body for direct comparison.

Claude Model

Opus

Is this a regression?

Yes, this worked in a previous version

Last Working Version

No response

Claude Code Version

2.1.126

Platform

Anthropic API

Operating System

macOS

Terminal/Shell

Terminal.app (macOS)

Additional Information

Full side-by-side comparison transcript (Claude Code vs. Codex on identical prompt) is in the issue body above.

Key context for the priority of this issue:

  • The user has ADHD and explicitly relies on the agent to hold the structural overview they cannot hold themselves.
  • The Skill mechanism is supposed to provide exactly that overview (it ranks 32 layers by criticality).
  • When the agent loads the Skill but doesn't use the criticality ranking as a traversal bias, the Skill becomes
    background information instead of an actionable plan.
  • The user paid for ~25 tool calls of analysis that produced zero progress on the actual symptoms, then had to switch tools.

Suggested fix direction:
A pre-tool-call planning step that, when a custom Skill with criticality-ranked layers is loaded, produces a written component-criticality ranking and a written symptom-to-component mapping before the first tool call. Re-emit and
update that mapping every time the user pushes back on the investigation premise.

extent analysis

TL;DR

The issue can be fixed by implementing a pre-tool-call planning step that produces a written symptom-to-component mapping and biases the first tool calls toward the highest-criticality layer named in the loaded Skill.

Guidance

  • Identify the most critical component named in the loaded Skill and prioritize it in the investigation.
  • Implement a pre-tool-call planning step to produce a written symptom-to-component mapping and update it when the user pushes back on the investigation premise.
  • Treat user corrections as a premise reset, rather than adding them as constraints to the existing investigation tree.
  • Neutralize live mutator processes before deep analysis to prevent modification of artifacts.
  • Verify namespace/container boundaries before reading config files to ensure the agent reads the same files as the running service.

Example

No specific code example is provided, as the issue is related to the investigation strategy and tool call prioritization.

Notes

The fix direction involves modifying the investigation strategy to prioritize the most critical component and update the symptom-to-component mapping based on user input. This requires changes to the tool call prioritization and investigation logic.

Recommendation

Apply the suggested fix direction by implementing a pre-tool-call planning step and updating the investigation strategy to prioritize the most critical component. This will improve the effectiveness of the investigation and provide better results for users with ADHD who rely on the agent to hold the structural overview.
