openclaw: [QA-lab] Live token-efficiency gate flags Codex savings and exposes fs.read overhead

TLDR

A local live run with a real OPENAI_API_KEY made codex-native-live pass functionally (11/11 scenarios), but the live token-efficiency report failed. Most flagged rows are actually Codex savings versus Pi, so the report currently conflates "large token delta" with "token regression." Two rows are genuine Codex-more-expensive candidates and need triage: streaming-final-integrity and especially runtime-tool-fs-read.

This is not a confirmed correctness bug in the Codex runner. It is a release-confidence blocker for the live token-efficiency gate and a possible Codex efficiency regression in one native read fixture.

Impact if OpenClaw moved fully to Codex today

  • Product impact: P2 for token/cost risk if the runtime-tool-fs-read row reflects real default behavior, because Codex used 119,489 tokens and 40 tool calls versus Pi 72,381 tokens and 2 tool calls.
  • Product correctness impact: P4 from this evidence; all codex-native-live functional rows passed.
  • QA impact: P1 because strict-global confidence cannot pass live token efficiency until the report semantics are clarified and the Codex-more-expensive rows are classified.

Reproduction

  • Local checkout: /Volumes/LEXAR/repos/openclaw-runtime-parity-rebase
  • PR head: 210f900ce81e7cf18f9af921b0f1a31cc7f95c0b
  • Env source: /Users/lume/.openclaw/secrets/openai.env (OPENAI_API_KEY, value not logged)

set -a
source /Users/lume/.openclaw/secrets/openai.env
set +a

OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite \
  --provider-mode live-frontier \
  --runtime-pair pi,codex \
  --runtime-suite codex-native-live \
  --codex-tool-loading direct \
  --concurrency 1 \
  --model openai/gpt-5.5 \
  --alt-model openai/gpt-5.5 \
  --output-dir .artifacts/local-live-key-smoke-210f900/instruction-followthrough

OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa parity-report \
  --repo-root . \
  --runtime-axis \
  --token-efficiency \
  --summary .artifacts/local-live-key-smoke-210f900/instruction-followthrough/qa-suite-summary.json \
  --output-dir .artifacts/local-live-key-smoke-210f900/codex-native-live-token-report

Artifacts

  • Suite summary: .artifacts/local-live-key-smoke-210f900/instruction-followthrough/qa-suite-summary.json
  • Token summary: .artifacts/local-live-key-smoke-210f900/codex-native-live-token-report/qa-runtime-token-efficiency-summary.json

Functional result

codex-native-live passed:

{ "total": 11, "passed": 11, "skipped": 0, "failed": 0 }

The API key fixed the earlier Pi missing-api-key blocker. Both runtimes produced live assistant-message usage.

Token report result

The token report had status=evaluated, usageSource=live-usage rows, and failed because threshold rows were flagged.

Rows where Codex was cheaper but still flagged:

  • instruction-followthrough-repo-contract: Codex 23,142 vs Pi 54,627 (-57.6%)
  • approval-turn-tool-followthrough: Codex 44,413 vs Pi 54,015 (-17.8%)
  • compaction-retry-mutating-tool: Codex 25,195 vs Pi 94,676 (-73.4%)
  • runtime-tool-apply-patch: Codex 44,584 vs Pi 72,465 (-38.5%)
  • runtime-tool-bash: Codex 44,955 vs Pi 92,394 (-51.3%)
  • runtime-tool-exec: Codex 49,639 vs Pi 111,369 (-55.4%)
  • runtime-tool-fs-write: Codex 46,318 vs Pi 72,374 (-36.0%)
  • runtime-tool-grep: Codex 50,828 vs Pi 92,200 (-44.9%)

Rows where Codex was more expensive:

  • streaming-final-integrity: Codex 21,946 vs Pi 17,887 (+22.7%), no tools.
  • runtime-tool-fs-read: Codex 119,489 vs Pi 72,381 (+65.1%), Codex made 40 tool calls vs Pi 2.
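The percentages in both lists are consistent with a relative delta of Codex tokens against the Pi baseline. The sketch below recomputes the two more-expensive rows under that assumption; `relative_delta_pct` is a hypothetical helper name, not part of the parity-report code, and the formula is inferred from the quoted figures rather than confirmed from the report's source.

```python
def relative_delta_pct(codex_tokens: int, pi_tokens: int) -> float:
    """Percent change of Codex token usage relative to the Pi baseline.

    Assumption: the report computes (codex - pi) / pi. This matches the
    figures quoted in the issue but is not confirmed from the report code.
    """
    if pi_tokens <= 0:
        raise ValueError("Pi baseline must be positive")
    return (codex_tokens - pi_tokens) / pi_tokens * 100.0


# (codex_tokens, pi_tokens) copied from the rows in the issue.
rows = {
    "streaming-final-integrity": (21_946, 17_887),
    "runtime-tool-fs-read": (119_489, 72_381),
}

for name, (codex, pi) in rows.items():
    print(f"{name}: {relative_delta_pct(codex, pi):+.1f}%")

# runtime-tool-fs-read only: extra tokens spread over the extra tool calls
# (Codex made 40 calls, Pi made 2).
extra_tokens = 119_489 - 72_381
extra_calls = 40 - 2
print(f"extra tokens per extra tool call: {extra_tokens / extra_calls:.0f}")
```

On these numbers the fs-read overhead averages roughly 1,240 extra tokens per additional tool call. That is an arithmetic observation, not a diagnosis of where the tokens went.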

Expected behavior

The live token-efficiency gate should distinguish at least these cases:

  • Codex savings versus Pi should probably be reported as savings, not a failing regression, unless the intended policy is symmetric drift rather than Codex-regression detection.
  • Codex-more-expensive rows should fail or warn according to a clear threshold policy.
  • runtime-tool-fs-read should be investigated as a possible native-tool loop or inefficient live behavior because the tool-call and token delta are both large.
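The three cases above could be expressed as a direction-aware classifier along these lines. This is only a sketch: the `Row` type, the `classify` function, and all three threshold constants are hypothetical and chosen for illustration, since the issue does not state the gate's actual policy values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    scenario: str
    codex_tokens: int
    pi_tokens: int
    codex_tool_calls: int = 0
    pi_tool_calls: int = 0


# Hypothetical policy values; the real thresholds are not given in the issue.
WARN_PCT = 10.0        # Codex more expensive by at least this much: warn
FAIL_PCT = 50.0        # Codex more expensive by at least this much: fail
TOOL_CALL_RATIO = 5.0  # Codex/Pi tool-call ratio that suggests a possible loop


def classify(row: Row) -> tuple[str, bool]:
    """Return (verdict, suspect_tool_loop) for one scenario row."""
    delta_pct = (row.codex_tokens - row.pi_tokens) / row.pi_tokens * 100.0
    if delta_pct < 0:
        verdict = "savings"   # Codex cheaper: report it, do not fail the gate
    elif delta_pct >= FAIL_PCT:
        verdict = "fail"
    elif delta_pct >= WARN_PCT:
        verdict = "warn"
    else:
        verdict = "ok"
    # A large tool-call multiple is flagged for triage regardless of verdict.
    suspect_loop = (
        row.pi_tool_calls > 0
        and row.codex_tool_calls / row.pi_tool_calls >= TOOL_CALL_RATIO
    )
    return verdict, suspect_loop


examples = [
    Row("compaction-retry-mutating-tool", 25_195, 94_676),
    Row("streaming-final-integrity", 21_946, 17_887),
    Row("runtime-tool-fs-read", 119_489, 72_381, codex_tool_calls=40, pi_tool_calls=2),
]
for row in examples:
    print(row.scenario, classify(row))
```

If the intended policy is symmetric drift rather than Codex-regression detection, the `savings` branch would instead compare the magnitude of the delta against the same thresholds.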

Actual behavior

The token report fails the whole lane and flags every large absolute delta, including rows where Codex is substantially cheaper.
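One reading of this behavior is that the gate compares the magnitude of each delta against its threshold without regard to sign. That is an inference from the flagged rows, not something confirmed from the gate's code. The sketch below applies a hypothetical 15% threshold to the ten rows listed in the token report result and counts how many would be flagged under a symmetric check versus a directional one.

```python
# (scenario, codex_tokens, pi_tokens) copied from the token report rows.
ROWS = [
    ("instruction-followthrough-repo-contract", 23_142, 54_627),
    ("approval-turn-tool-followthrough", 44_413, 54_015),
    ("compaction-retry-mutating-tool", 25_195, 94_676),
    ("runtime-tool-apply-patch", 44_584, 72_465),
    ("runtime-tool-bash", 44_955, 92_394),
    ("runtime-tool-exec", 49_639, 111_369),
    ("runtime-tool-fs-write", 46_318, 72_374),
    ("runtime-tool-grep", 50_828, 92_200),
    ("streaming-final-integrity", 21_946, 17_887),
    ("runtime-tool-fs-read", 119_489, 72_381),
]

THRESHOLD_PCT = 15.0  # hypothetical; the real gate threshold is not stated


def delta_pct(codex: int, pi: int) -> float:
    """Percent change of Codex tokens relative to the Pi baseline."""
    return (codex - pi) / pi * 100.0


# Symmetric: any large delta is flagged, savings included.
symmetric = [n for n, c, p in ROWS if abs(delta_pct(c, p)) >= THRESHOLD_PCT]
# Directional: only rows where Codex costs more than Pi are flagged.
directional = [n for n, c, p in ROWS if delta_pct(c, p) >= THRESHOLD_PCT]

print(f"symmetric flags:   {len(symmetric)}")
print(f"directional flags: {len(directional)} -> {directional}")
```

Under these assumptions the symmetric check flags all ten rows, while the directional check flags only streaming-final-integrity and runtime-tool-fs-read.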

Classification

  • Verdict: live token-efficiency finding.
  • Confirmed product correctness bug: no.
  • Possible product efficiency bug: yes, runtime-tool-fs-read needs triage.
  • QA/reporting bug: yes, if savings are not supposed to fail the gate.

Links

  • Umbrella beta.5 confidence issue: #80936
  • Live proof tracker: #80397
  • Previous live zero-usage guard issue: #80411
