claude-code: Make autonomous Claude Code actually viable: tiered Opus brains + Sonnet workers + persistent state

The pitch

The most interesting thing happening in Claude Code right now is people trying to run it as a brain — not a pair-programming buddy, but the actual orchestrating intelligence behind a long-running system. Pipelines, ML training, build automation, monitoring, content workflows. The operator goes to bed and Claude Code keeps the lights on. That use case is real and growing, and the current primitives don't quite get there.

What works: tool use, sub-agents, hooks, crons. What's missing: a coherent way for multiple Opus-tier brains to coordinate as peers — a project manager Opus, plus several domain-specialist Opus agents (one per critical lane), able to talk to each other asynchronously, request peer sign-off on architecturally-significant decisions, and dispatch Sonnet-tier worker agents for the grunt work. Today every team building this stitches it together with handshake files, brittle cron loops, and full-context reloads on every metadata operation. It works, barely, and burns absurd amounts of token budget on infrastructure overhead instead of thinking.

Architecture: three tiers, with intent

Tier 1 — Opus PM (orchestration)

One Opus agent owning sequencing, dispatch, blocker triage, and the never-ending question of "what should happen next." It reads handshake state, advances the active plan, files directives to specialists, and answers the operator. It does not write production code. Token budget is for thinking, not typing.

Tier 2 — Opus domain specialists (peer expertise)

A small set of Opus-tier agents, each owning one lane: data integrity, security, model architecture, exit logic, scoring, whatever the system needs. Opus, not Sonnet — the lane work requires real reasoning, pattern-matching across long context, and the judgment to push back on the PM when the PM is wrong.

Two capabilities are load-bearing here:

  1. Peer-to-peer async dialogue. When a specialist gets stuck, it can ping a sibling specialist directly without round-tripping through the PM. "Hey security, this exit-logic change touches the order-submission path — does it widen the attack surface?" Today that conversation has to be marshalled through filesystem handshakes the PM polls, which adds latency and consumes PM context for routing it has no opinion on.
  2. Two-Opus sign-off for architecturally-significant decisions. Some changes shouldn't ship on one brain's say-so — schema migrations, model promotions, anything touching live money or live users. A specialist proposes; a sibling specialist (or the PM) signs off. Today this is enforced by ad-hoc convention and operator vigilance. It should be a first-class primitive: RequirePeerReview(decision_id, reviewer_pool=[...]).
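
A minimal sketch of how a RequirePeerReview-style gate might behave, assuming an in-memory registry. The class name, method names, and the rule that a proposer cannot approve its own decision are illustrative assumptions, not an existing Claude Code API.

```python
from dataclasses import dataclass, field


class PeerReviewError(Exception):
    """Raised when a decision lacks a valid peer sign-off."""


@dataclass
class PeerReviewGate:
    # Hypothetical in-memory registry; a real runtime would persist this.
    _pools: dict = field(default_factory=dict)      # decision_id -> (proposer, reviewer pool)
    _approvals: dict = field(default_factory=dict)  # decision_id -> set of approving reviewers

    def require_peer_review(self, decision_id, proposer, reviewer_pool):
        """Register a decision that needs sign-off from someone in reviewer_pool."""
        self._pools[decision_id] = (proposer, frozenset(reviewer_pool))
        self._approvals[decision_id] = set()

    def sign_off(self, decision_id, reviewer):
        """Record an approval, rejecting self-approval and out-of-pool reviewers."""
        proposer, pool = self._pools[decision_id]
        if reviewer == proposer:
            raise PeerReviewError("proposer cannot approve its own decision")
        if reviewer not in pool:
            raise PeerReviewError(f"{reviewer!r} is not in the reviewer pool")
        self._approvals[decision_id].add(reviewer)

    def assert_approved(self, decision_id):
        """Hard-fail unless at least one valid peer has signed off."""
        if not self._approvals.get(decision_id):
            raise PeerReviewError(f"decision {decision_id!r} has no peer sign-off")
```

A specialist would call require_peer_review when proposing, and the runtime would call assert_approved before the change ships. An unregistered decision fails the same way as an unapproved one.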

Tier 3 — Sonnet workers and auditors

Sonnet is for worker-bee tasks only: execute a script, edit a file, run a test, fetch data, lint a config, parse a log. Bounded scope, clear input, clear output, no architectural judgment expected. Auditors are a flavor of worker: narrow scanners that produce a stamp file and exit.

The mistake the current ecosystem keeps making is letting Sonnet drift up into Tier 2 work because Opus is too expensive or too slow to invoke. That's a tooling problem, not a model-tier problem; if Opus invocation were cheaper at the metadata layer (see next section) the temptation would shrink.

Infrastructure overhead — metadata operations are too expensive

This is the single biggest tax on autonomous deployments today. It also worsens the tendency of the Opus PM to drift into writing code, because infrastructure noise eats the token budget the PM should be spending on thinking.

The pattern: when a Tier 1 or Tier 2 Opus agent performs what should be a lightweight metadata operation — registering a cron, listing crons, deleting a cron, reading a heartbeat, writing an audit stamp — the runtime currently reloads the entire conversation context to service the operation. That's a full Opus context reload to do what is functionally a database write.

Concrete examples:

  • CronCreate / CronList / CronDelete — full PM context reload per call. A PM that schedules five recurring polls during planning has paid for five full context reloads. Crons should be side-channel metadata operations against a small scheduler service, not full-agent invocations.
  • Heartbeat writes — same pattern. Long-running specialists writing periodic "I'm alive" markers shouldn't reload context to do it.
  • Stamp-file reads/writes for audit gates — when a specialist needs to confirm a sibling's audit stamp is fresh, it shouldn't need to reload that sibling's full context. A stamp-existence query is metadata.
  • Handshake-file polling — the PM polling for new specialist messages every cycle currently re-grounds full context every poll. A side-channel "any new messages?" check would be cheaper by orders of magnitude.

Ask: treat scheduler operations, heartbeats, stamps, and handshake polling as side-channel metadata that does not require full agent context to execute. Expose them as lightweight tool calls that return without re-grounding. Related: #55033, #56293.
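
To make the "database write" framing concrete, here is a rough sketch of a side-channel metadata store backed by SQLite. The class, its method names, and the table layout are hypothetical. The point is only that cron registration and heartbeats can be plain reads and writes that never touch agent context.

```python
import sqlite3
import time


class MetadataStore:
    """Hypothetical side-channel store: plain database writes, no agent context."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS crons "
            "(name TEXT PRIMARY KEY, schedule TEXT, command TEXT)"
        )
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS heartbeats (agent TEXT PRIMARY KEY, ts REAL)"
        )

    def cron_create(self, name, schedule, command):
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO crons VALUES (?, ?, ?)",
                (name, schedule, command),
            )

    def cron_list(self):
        query = "SELECT name, schedule, command FROM crons ORDER BY name"
        return self.db.execute(query).fetchall()

    def cron_delete(self, name):
        """Return True if a cron with this name existed and was removed."""
        with self.db:
            cur = self.db.execute("DELETE FROM crons WHERE name = ?", (name,))
        return cur.rowcount > 0

    def heartbeat(self, agent, now=None):
        ts = time.time() if now is None else now
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO heartbeats VALUES (?, ?)", (agent, ts)
            )

    def is_alive(self, agent, max_age_s, now=None):
        """True if the agent wrote a heartbeat within the last max_age_s seconds."""
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT ts FROM heartbeats WHERE agent = ?", (agent,)
        ).fetchone()
        return row is not None and (now - row[0]) <= max_age_s
```

Stamp-existence queries and "any new messages?" checks would fit the same shape: one keyed lookup, answered without re-grounding.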

Persistent state across agent invocations

Sub-agents today are memoryless across invocations. Every dispatch reconstructs context from filesystem handshakes the operator's framework had to design from scratch. This is the largest single source of user-side complexity in autonomous deployments.

Ask: first-class persistent agent state — keyed by agent role, scoped per project, readable by the PM and writable by the agent itself. Plus a standard schema for "what am I currently working on, what's blocking me, when did I last make progress." See also #55424.
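
One possible shape for that schema, sketched as a dataclass persisted to one JSON file per role. The field names, the agent_state directory, and the StateStore class are assumptions made for illustration. Access control (PM-readable, agent-writable) is left out.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class AgentState:
    role: str                       # key: one record per agent role
    current_task: str = ""          # what am I currently working on
    blockers: list = field(default_factory=list)  # what's blocking me
    last_progress_ts: float = 0.0   # when did I last make progress (epoch seconds)


class StateStore:
    """Hypothetical per-project store, one JSON file per agent role."""

    def __init__(self, project_dir):
        self.root = Path(project_dir) / "agent_state"
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, state):
        path = self.root / f"{state.role}.json"
        tmp = self.root / f"{state.role}.json.tmp"
        tmp.write_text(json.dumps(asdict(state)))
        tmp.replace(path)  # rename into place so readers never see a partial file

    def load(self, role):
        """Return the saved state, or an empty state if the role has none yet."""
        path = self.root / f"{role}.json"
        if not path.exists():
            return AgentState(role=role)
        return AgentState(**json.loads(path.read_text()))
```

A freshly dispatched specialist would call load with its own role instead of reconstructing context from handshake files.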

Model-tier integrity and dispatch verification

Today there is no enforced way for the PM to confirm that a "Tier 2 Opus specialist" dispatch actually ran on Opus rather than being silently downgraded. Same for Tier 3 — operators have caught dispatches running on the wrong tier and producing degraded output that passed all syntactic checks.

Ask: every Agent tool invocation returns the model ID it actually ran on, and the runtime exposes a RequireModel(min_tier="opus") assertion that hard-fails if the dispatch was downgraded. Related: #53610.
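
A sketch of what the assertion could look like on the caller's side, assuming the dispatch result is a dict carrying a model_id field. That field name, the tier ranking, and the substring-based tier detection are all assumptions. The model IDs in the usage note are placeholders, not real identifiers.

```python
TIER_RANK = {"sonnet": 1, "opus": 2}  # assumed ordering of the tiers discussed here


class ModelTierError(Exception):
    """Raised when a dispatch ran below the required tier or on an unknown model."""


def tier_of(model_id):
    """Infer the tier by substring match (an assumption about the ID format)."""
    lowered = model_id.lower()
    for tier in TIER_RANK:
        if tier in lowered:
            return tier
    raise ModelTierError(f"cannot determine tier of model {model_id!r}")


def require_model(result, min_tier="opus"):
    """Hard-fail if the dispatch result reports a model below min_tier."""
    actual = tier_of(result["model_id"])
    if TIER_RANK[actual] < TIER_RANK[min_tier]:
        raise ModelTierError(
            f"dispatch ran on {actual}, required at least {min_tier}"
        )
    return result
```

With this in place, require_model({"model_id": "example-sonnet-model"}) raises, while the same call with min_tier="sonnet" passes. An unrecognized model ID also fails rather than passing silently.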

Anti-pattern guards (runtime, not convention)

Recurring failure modes we'd like the runtime to refuse:

  • Threshold downgrades to "unblock" a pipeline — lowering a verification threshold from FAIL to WARN so a downstream step proceeds. Should require an explicit operator override, not be reachable from agent-side edits.
  • Marker-stub circumvention — adding empty checkpoint/resume/atomic-write stubs to satisfy a static scanner without implementing the underlying behavior. Scanners should hash function bodies, not check for marker presence.
  • --no-X flag escalation — disabling a safety cap (row limit, timeout, RAM ceiling) to make a job complete. Should require a co-signed peer-review token from a sibling specialist.
  • Stamp flipping — manually touching a stamp file to satisfy a freshness gate without re-running the audit. Stamps should be cryptographically tied to the audit script's output hash.

These are convention-enforced today and operators routinely catch agents violating them. They should be runtime-enforced.
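
As one illustration of runtime enforcement, the stamp-flipping guard could bind each stamp to the audit script and its output with an HMAC. This is a sketch under the assumption that the signing key is held by the runtime and not readable by agents. The function names and stamp layout are hypothetical.

```python
import hashlib
import hmac
import json


def _mac(payload, key):
    """HMAC-SHA256 over a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def make_stamp(audit_output: bytes, script_source: bytes, key: bytes) -> str:
    """Build a stamp that binds the audit script and its output under the key."""
    payload = {
        "script_sha256": hashlib.sha256(script_source).hexdigest(),
        "output_sha256": hashlib.sha256(audit_output).hexdigest(),
    }
    return json.dumps({"payload": payload, "mac": _mac(payload, key)}, sort_keys=True)


def verify_stamp(stamp: str, audit_output: bytes, script_source: bytes, key: bytes) -> bool:
    """Accept the stamp only if it was signed with key and matches both hashes."""
    try:
        data = json.loads(stamp)
        payload, mac = data["payload"], data["mac"]
    except (ValueError, KeyError, TypeError):
        return False  # empty or hand-edited stamp files land here
    if not hmac.compare_digest(_mac(payload, key).encode(), str(mac).encode()):
        return False
    return (
        payload.get("script_sha256") == hashlib.sha256(script_source).hexdigest()
        and payload.get("output_sha256") == hashlib.sha256(audit_output).hexdigest()
    )
```

A touched or empty stamp file fails verification because it carries no valid MAC. A freshness gate would additionally need a signed timestamp in the payload, which this sketch omits.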

RAM and resource auditing as a first-class primitive

Heavy script dispatch (training, large data builds, anything memory-bound) currently requires a custom pre-flight audit hook the operator wrote themselves. Every project doing this has reinvented the same wheel.

Ask: built-in pre-flight resource estimator for tool invocations — peak RAM, expected wall-clock, disk I/O — with configurable hard blocks. Bonus: a standard "kill the largest non-protected process" watchdog, since every autonomous deployment ends up writing one.
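
A rough sketch of the hard-block half of that ask, assuming the estimates are supplied by the caller (producing them is the harder problem and is not shown). All type and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceEstimate:
    peak_ram_gb: float
    wall_clock_s: float
    disk_io_gb: float


@dataclass(frozen=True)
class ResourceLimits:
    max_ram_gb: float
    max_wall_clock_s: float
    max_disk_io_gb: float


class PreflightBlocked(Exception):
    """Raised when an estimate exceeds a configured hard limit."""


def preflight(estimate, limits):
    """Return None if the job fits; otherwise raise, listing every breach."""
    breaches = []
    if estimate.peak_ram_gb > limits.max_ram_gb:
        breaches.append(
            f"peak RAM {estimate.peak_ram_gb} GB exceeds {limits.max_ram_gb} GB"
        )
    if estimate.wall_clock_s > limits.max_wall_clock_s:
        breaches.append(
            f"wall-clock {estimate.wall_clock_s} s exceeds {limits.max_wall_clock_s} s"
        )
    if estimate.disk_io_gb > limits.max_disk_io_gb:
        breaches.append(
            f"disk I/O {estimate.disk_io_gb} GB exceeds {limits.max_disk_io_gb} GB"
        )
    if breaches:
        raise PreflightBlocked("; ".join(breaches))
```

Reporting every breach at once, rather than stopping at the first, gives the PM enough to decide whether to resize the job or escalate to the operator.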

Verification primitives

Autonomous pipelines need a standard way to assert "the thing I just produced is consistent with what downstream expects" — schema match, row count parity, key alignment, null-rate ceiling. Today every team writes these from scratch and they drift.

Ask: a small standard library of verification primitives that pipelines can declare against, with consistent failure semantics (HARD_FAIL halts, WARN logs, INFO records). Tied to the anti-pattern-guard system above so downgrades are policed.
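
A small sketch of how such primitives and their failure semantics could fit together. The function names, default severities, and return shape are assumptions. Only the HARD_FAIL / WARN / INFO behavior comes from the ask above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = "INFO"
    WARN = "WARN"
    HARD_FAIL = "HARD_FAIL"


class PipelineHalted(Exception):
    """Raised when a HARD_FAIL check does not pass."""


@dataclass(frozen=True)
class Check:
    name: str
    passed: bool
    severity: Severity


def row_count_parity(upstream_rows, downstream_rows, severity=Severity.HARD_FAIL):
    return Check("row_count_parity", upstream_rows == downstream_rows, severity)


def null_rate_ceiling(values, ceiling, severity=Severity.WARN):
    nulls = sum(1 for v in values if v is None)
    rate = nulls / len(values) if values else 0.0
    return Check("null_rate_ceiling", rate <= ceiling, severity)


def enforce(checks):
    """HARD_FAIL halts, WARN is collected as a warning, every result is recorded."""
    warnings, records = [], []
    for check in checks:
        records.append((check.name, check.passed))
        if check.passed:
            continue
        if check.severity is Severity.HARD_FAIL:
            raise PipelineHalted(f"{check.name} failed")
        if check.severity is Severity.WARN:
            warnings.append(check.name)
    return warnings, records
```

Because severity is an argument to each primitive, a downgrade from HARD_FAIL to WARN is a visible, diffable change that a guard could refuse without operator override.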

Concretely, what would ship first

If only a subset of this lands, the highest-leverage items are:

  1. Lightweight metadata operations: CronCreate/List/Delete and heartbeat/stamp/handshake-poll as side-channel calls that don't reload agent context. Because every one of these calls currently costs a full context reload, this is likely the largest single reduction in autonomous-deployment token burn.
  2. Persistent agent state: first-class, keyed by role, with a standard "current task / blockers / last progress" schema.
  3. Peer-to-peer async dialogue between Tier 2 specialists: direct sibling-to-sibling messaging without PM round-trip.
  4. RequireModel and RequirePeerReview assertions: runtime-enforced, not convention.
  5. Anti-pattern guards: threshold downgrades, marker stubs, safety-flag escalation, stamp flipping all refused at the runtime layer.

Everything else is downstream of these.

Cited prior issues

  • #55033 — scheduler / cron infrastructure overhead
  • #55424 — persistent agent state
  • #56293 — metadata operation cost
  • #53610 — broader cluster of agent coordination failures
