claude-code - 💡(How to fix) Fix [FEATURE] Raise default thinking budget / auto-escalate on diagnostic prompts [3 comments, 2 participants]

claude-code2026-04-14 14:31:23

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

anthropics/claude-code#47948•Fetched 2026-04-15 06:37:45

View on GitHub

Comments

Participants

Timeline

Reactions

Author

pelegs202-design

Participants

github-actions[bot]

pelegs202-design

Timeline (top)

commented ×3labeled ×3

On claude-opus-4-6[1m] in the Claude Code CLI, the default thinking/effort budget is too shallow for diagnostic and debugging prompts. The model gives surface-level first answers to investigative questions ("why did X crash?", "find the bug", "why is this happening?") and only produces real root-cause analysis after the user adds a magic phrase like "use deep thinking", "think deeper", or "be thorough". Users shouldn't need to know the incantation to get an actual investigation.

Root Cause

Users shouldn't need to know the magic phrase. The gap between "default" and "actually useful on a hard question" is currently bridged by folk knowledge ("say 'ultrathink'", "say 'think deeper'", install a deep-think skill).
Anthropic's own skill ecosystem ships a workaround for this gap. A distributed skill called deep-think activates on phrases like "think deeper", "be thorough", "stop being lazy", "try harder". Its own description reads: "Activate when quality feels low, model is being shallow, skipping research, bailing out..." The existence of this skill is an implicit admission that the default is shallow for a non-trivial class of prompts.
Adjacent issues reinforce this:
- #42796 (closed) quantified that reduced thinking tokens correlated with a measurable shift from research-first to edit-first behavior on complex engineering work.
- #37441 (closed) made the parallel request for the web app.
- #41028 (open) is a bug where even manually passing --effort is silently dropped — so the workaround isn't even reliable.
Diagnostic tasks are exactly the ones where shallow answers waste the most user time. A wrong hypothesis sends the user down a fake debugging path. The cost of extra thinking tokens is trivial compared to the cost of a wrong root cause.

Fix Action

Fix / Workaround

Users shouldn't need to know the magic phrase. The gap between "default" and "actually useful on a hard question" is currently bridged by folk knowledge ("say 'ultrathink'", "say 'think deeper'", install a deep-think skill).
Anthropic's own skill ecosystem ships a workaround for this gap. A distributed skill called deep-think activates on phrases like "think deeper", "be thorough", "stop being lazy", "try harder". Its own description reads: "Activate when quality feels low, model is being shallow, skipping research, bailing out..." The existence of this skill is an implicit admission that the default is shallow for a non-trivial class of prompts.
Adjacent issues reinforce this:
- #42796 (closed) quantified that reduced thinking tokens correlated with a measurable shift from research-first to edit-first behavior on complex engineering work.
- #37441 (closed) made the parallel request for the web app.
- #41028 (open) is a bug where even manually passing --effort is silently dropped — so the workaround isn't even reliable.
Diagnostic tasks are exactly the ones where shallow answers waste the most user time. A wrong hypothesis sends the user down a fake debugging path. The cost of extra thinking tokens is trivial compared to the cost of a wrong root cause.

Raise the default thinking budget for Claude Code CLI, at least on Opus 4.6 1M where users have explicitly opted into the deep-thinking-capable model. The current default appears calibrated for short conversational turns; it is too low for the kind of multi-file, log-reading, evidence-weighing work the CLI is actually used for.
Add automatic effort auto-escalation on diagnostic/investigative prompts. Detect keywords/intents like "why did X crash", "why is X failing", "find the bug", "root cause", "why is this happening", "diagnose", "what's wrong with" — and silently bump the effort level for that turn. This mirrors what the deep-think skill already does heuristically; native handling removes the need for folk-knowledge workarounds.

RAW_BUFFERClick to expand / collapse

Preflight Checklist

I have searched existing requests and this feature hasn't been requested yet
This is a single feature request (not multiple features)

Summary

Symptom

The first-pass response to a diagnostic question is frequently a plausible-sounding but unverified guess based on a cursory look at the evidence (e.g., reading only the last line of a log file, reasoning from priors about "usual causes"). The evidence needed for a correct answer is available in the session — the model just doesn't spend the thinking tokens to read it carefully. After the user escalates verbally, the same model produces a substantially different and correct answer from the same inputs.

Repro

Concrete example from a session today (2026-04-14):

User asked: "Why did Claude Code crash three times in a row?"
First-pass answer (default thinking): speculated about API timeouts and cache limits, based on last-line inspection of the session JSONL files. This was wrong. It also misidentified two of the three sessions as "dead" when they had actually closed cleanly.
User replied: "use deep thinking".
Second-pass answer: actually read the full tails of each session file, identified that only one of the three was a real crash, and pinpointed the failure mode — a silent drop after a turn that combined parallel MCP tool calls with Bash tool_result blocks. This became issue #47931.

Same session, same files on disk, same model. The evidence was there from the start. The shallow default pass missed it; the escalated pass found it in one additional tool call.

This is not a one-off. It is reproducible on any "why did X fail" / "find the root cause" / "why is this slow" prompt: the first answer tends to be a guess; the escalated answer tends to be grounded.

Why this matters

Users shouldn't need to know the magic phrase. The gap between "default" and "actually useful on a hard question" is currently bridged by folk knowledge ("say 'ultrathink'", "say 'think deeper'", install a deep-think skill).
Anthropic's own skill ecosystem ships a workaround for this gap. A distributed skill called deep-think activates on phrases like "think deeper", "be thorough", "stop being lazy", "try harder". Its own description reads: "Activate when quality feels low, model is being shallow, skipping research, bailing out..." The existence of this skill is an implicit admission that the default is shallow for a non-trivial class of prompts.
Adjacent issues reinforce this:
- #42796 (closed) quantified that reduced thinking tokens correlated with a measurable shift from research-first to edit-first behavior on complex engineering work.
- #37441 (closed) made the parallel request for the web app.
- #41028 (open) is a bug where even manually passing --effort is silently dropped — so the workaround isn't even reliable.
Diagnostic tasks are exactly the ones where shallow answers waste the most user time. A wrong hypothesis sends the user down a fake debugging path. The cost of extra thinking tokens is trivial compared to the cost of a wrong root cause.

Request

Either (or both) of:

Raise the default thinking budget for Claude Code CLI, at least on Opus 4.6 1M where users have explicitly opted into the deep-thinking-capable model. The current default appears calibrated for short conversational turns; it is too low for the kind of multi-file, log-reading, evidence-weighing work the CLI is actually used for.
Add automatic effort auto-escalation on diagnostic/investigative prompts. Detect keywords/intents like "why did X crash", "why is X failing", "find the bug", "root cause", "why is this happening", "diagnose", "what's wrong with" — and silently bump the effort level for that turn. This mirrors what the deep-think skill already does heuristically; native handling removes the need for folk-knowledge workarounds.

If token cost is the concern, make auto-escalation opt-out via settings.json rather than opt-in. Users who want token frugality on trivial turns can disable it; the default should favor correctness on hard questions.

Environment

Claude Code CLI: v2.1.107
OS: Windows 11 Home Single Language 10.0.26100
Shell: Git Bash 5.2.37
Model: claude-opus-4-6[1m] (Opus 4.6, 1M context)
settings.json: "alwaysThinkingEnabled": false (default — no manual override)

#47931 — the Claude Code crash bug that the shallow first-pass diagnostic missed and the escalated pass found. This issue exists because #47931 was nearly missed.
#42796 (closed) — "Claude Code is unusable for complex engineering tasks" — evidence that thinking depth is load-bearing for engineering work.
#37441 (closed) — parallel request for the web app.
#41028 (open) — --effort flag silently dropped, making manual escalation unreliable.

extent analysis

TL;DR

Increase the default thinking budget for Claude Code CLI or implement automatic effort auto-escalation on diagnostic/investigative prompts to improve the accuracy of root cause analysis.

Guidance

Review the settings.json file to understand the current configuration and consider updating the "alwaysThinkingEnabled" setting to true to enable deeper thinking by default.
Analyze the model's behavior on diagnostic prompts and identify opportunities to implement automatic effort auto-escalation based on keywords or intents.
Evaluate the trade-off between token cost and accuracy in root cause analysis, and consider making auto-escalation opt-out via settings.json for users who prioritize token frugality.
Investigate the deep-think skill and its activation phrases to understand how it can be leveraged to improve the model's thinking depth.

Example

No code snippet is provided as the issue is related to model configuration and behavior rather than code implementation.

Notes

The solution may require balancing the trade-off between token cost and accuracy, and may involve updating the model's configuration or implementing custom logic to detect diagnostic prompts and auto-escalate effort.

Recommendation

Apply a workaround by increasing the default thinking budget or implementing automatic effort auto-escalation, as the current default thinking budget is too shallow for diagnostic and debugging prompts, leading to inaccurate root cause analysis.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #tensor shape #autograd error #model save/load #optimization

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

claude-code - 💡(How to fix) Fix [FEATURE] Raise default thinking budget / auto-escalate on diagnostic prompts [3 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Preflight Checklist

Summary

Symptom

Repro

Why this matters

Request

Environment

Related

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

claude-code - 💡(How to fix) Fix [FEATURE] Raise default thinking budget / auto-escalate on diagnostic prompts [3 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Preflight Checklist

Summary

Symptom

Repro

Why this matters

Request

Environment

Related

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING