openclaw: [QA-lab] Live token-efficiency gate flags Codex savings and exposes fs.read overhead

openclaw · 2026-05-12 16:48:57

TLDR

A local live run with a real OPENAI_API_KEY made codex-native-live pass functionally (11/11 scenarios), but the live token-efficiency report failed. Most flagged rows are actually Codex savings versus Pi, so the report currently conflates "large token delta" with "token regression." Two rows are genuine Codex-more-expensive candidates and need triage: streaming-final-integrity and especially runtime-tool-fs-read.

This is not a confirmed correctness bug in the Codex runner. It is a release-confidence blocker for the live token-efficiency gate and a possible Codex efficiency regression in one native read fixture.

Impact if OpenClaw moved fully to Codex today

Product impact: P2 for token/cost risk if the runtime-tool-fs-read row reflects real default behavior, because Codex used 119,489 tokens and 40 tool calls versus Pi 72,381 tokens and 2 tool calls.
Product correctness impact: P4 from this evidence; all codex-native-live functional rows passed.
QA impact: P1 because strict-global confidence cannot pass live token efficiency until the report semantics are clarified and the Codex-more-expensive rows are classified.

Reproduction

Local checkout: /Volumes/LEXAR/repos/openclaw-runtime-parity-rebase
PR head: 210f900ce81e7cf18f9af921b0f1a31cc7f95c0b
Env source: /Users/lume/.openclaw/secrets/openai.env (OPENAI_API_KEY, value not logged)

set -a
source /Users/lume/.openclaw/secrets/openai.env
set +a

OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite \
  --provider-mode live-frontier \
  --runtime-pair pi,codex \
  --runtime-suite codex-native-live \
  --codex-tool-loading direct \
  --concurrency 1 \
  --model openai/gpt-5.5 \
  --alt-model openai/gpt-5.5 \
  --output-dir .artifacts/local-live-key-smoke-210f900/instruction-followthrough

OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa parity-report \
  --repo-root . \
  --runtime-axis \
  --token-efficiency \
  --summary .artifacts/local-live-key-smoke-210f900/instruction-followthrough/qa-suite-summary.json \
  --output-dir .artifacts/local-live-key-smoke-210f900/codex-native-live-token-report

Artifacts

Suite summary: .artifacts/local-live-key-smoke-210f900/instruction-followthrough/qa-suite-summary.json
Token summary: .artifacts/local-live-key-smoke-210f900/codex-native-live-token-report/qa-runtime-token-efficiency-summary.json

Functional result

codex-native-live passed:

{ "total": 11, "passed": 11, "skipped": 0, "failed": 0 }

The API key fixed the earlier Pi missing-api-key blocker. Both runtimes produced live assistant-message usage.

Token report result

The token report had status=evaluated with usageSource=live-usage rows, and it failed because rows exceeding the token-delta threshold were flagged.

Rows where Codex was cheaper but still flagged:

instruction-followthrough-repo-contract: Codex 23,142 vs Pi 54,627 (-57.6%)
approval-turn-tool-followthrough: Codex 44,413 vs Pi 54,015 (-17.8%)
compaction-retry-mutating-tool: Codex 25,195 vs Pi 94,676 (-73.4%)
runtime-tool-apply-patch: Codex 44,584 vs Pi 72,465 (-38.5%)
runtime-tool-bash: Codex 44,955 vs Pi 92,394 (-51.3%)
runtime-tool-exec: Codex 49,639 vs Pi 111,369 (-55.4%)
runtime-tool-fs-write: Codex 46,318 vs Pi 72,374 (-36.0%)
runtime-tool-grep: Codex 50,828 vs Pi 92,200 (-44.9%)

Rows where Codex was more expensive:

streaming-final-integrity: Codex 21,946 vs Pi 17,887 (+22.7%), no tools.
runtime-tool-fs-read: Codex 119,489 vs Pi 72,381 (+65.1%), Codex made 40 tool calls vs Pi 2.
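
The percentages in both lists are consistent with a signed relative delta, (Codex - Pi) / Pi. The following is a minimal Python sketch that recomputes them from the token totals quoted above. The formula is inferred from the reported numbers, not taken from the report code, and the function name `percent_delta` is hypothetical.

```python
# Token totals copied from the report rows above: (scenario, codex_tokens, pi_tokens).
ROWS = [
    ("instruction-followthrough-repo-contract", 23_142, 54_627),
    ("approval-turn-tool-followthrough", 44_413, 54_015),
    ("compaction-retry-mutating-tool", 25_195, 94_676),
    ("runtime-tool-apply-patch", 44_584, 72_465),
    ("runtime-tool-bash", 44_955, 92_394),
    ("runtime-tool-exec", 49_639, 111_369),
    ("runtime-tool-fs-write", 46_318, 72_374),
    ("runtime-tool-grep", 50_828, 92_200),
    ("streaming-final-integrity", 21_946, 17_887),
    ("runtime-tool-fs-read", 119_489, 72_381),
]


def percent_delta(codex: int, pi: int) -> float:
    """Signed change of Codex relative to Pi, in percent (negative = Codex cheaper)."""
    return (codex - pi) / pi * 100.0


if __name__ == "__main__":
    for name, codex, pi in ROWS:
        print(f"{name}: {percent_delta(codex, pi):+.1f}%")
```

Keeping the sign is what separates the eight savings rows from the two Codex-more-expensive rows; taking the absolute value discards that distinction.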

Expected behavior

The live token-efficiency gate should distinguish at least these cases:

Codex savings versus Pi should probably be reported as savings, not a failing regression, unless the intended policy is symmetric drift rather than Codex-regression detection.
Codex-more-expensive rows should fail or warn according to a clear threshold policy.
runtime-tool-fs-read should be investigated as a possible native-tool loop or inefficient live behavior because the tool-call and token delta are both large.
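
One way to express the first two cases is a classifier that takes the policy as an explicit switch. This is only a sketch: the thresholds `WARN_PCT` and `FAIL_PCT`, the `Verdict` names, and the `classify` function are all hypothetical, since the issue does not state the gate's real thresholds or implementation.

```python
from enum import Enum


class Verdict(Enum):
    SAVINGS = "savings"
    OK = "ok"
    WARN = "warn"
    FAIL = "fail"


# Hypothetical thresholds in percent; the real gate's values are not given in the issue.
WARN_PCT = 10.0
FAIL_PCT = 25.0


def classify(codex_tokens: int, pi_tokens: int, symmetric: bool = False) -> Verdict:
    """Classify one row under a one-sided (default) or symmetric-drift policy."""
    delta_pct = (codex_tokens - pi_tokens) / pi_tokens * 100.0
    if delta_pct < 0 and not symmetric:
        # One-sided policy: Codex being cheaper never fails the gate.
        return Verdict.SAVINGS if delta_pct <= -WARN_PCT else Verdict.OK
    # Codex more expensive, or symmetric policy: judge by the size of the delta.
    magnitude = abs(delta_pct)
    if magnitude >= FAIL_PCT:
        return Verdict.FAIL
    if magnitude >= WARN_PCT:
        return Verdict.WARN
    return Verdict.OK


if __name__ == "__main__":
    print(classify(23_142, 54_627))                  # one-sided policy
    print(classify(23_142, 54_627, symmetric=True))  # symmetric-drift policy
    print(classify(119_489, 72_381))
```

Under these assumed thresholds, the same savings row is reported as savings by the one-sided policy and as a failure by the symmetric one, so the choice of policy needs to be explicit in the report.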

Actual behavior

The token report fails the whole lane because every row with a large absolute delta is flagged, including rows where Codex is substantially cheaper.

Classification

Verdict: live token-efficiency finding.
Confirmed product correctness bug: no.
Possible product efficiency bug: yes, runtime-tool-fs-read needs triage.
QA/reporting bug: yes, if savings are not supposed to fail the gate.
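
For the runtime-tool-fs-read triage, a coarse first-pass check could flag rows where Codex is both more expensive and makes far more tool calls than Pi. The sketch below is an assumption-laden heuristic, not part of the existing report: the function name `looks_like_tool_loop` and the ratio limit of 5 are hypothetical, and a positive result only marks a row for manual inspection.

```python
def looks_like_tool_loop(
    codex_calls: int,
    pi_calls: int,
    codex_tokens: int,
    pi_tokens: int,
    call_ratio_limit: float = 5.0,  # hypothetical cutoff, not taken from the report
) -> bool:
    """Flag rows where Codex costs more tokens AND makes far more tool calls than Pi."""
    if codex_tokens <= pi_tokens:
        return False  # Codex is not more expensive, so there is nothing to triage
    baseline = max(pi_calls, 1)  # avoid division by zero for no-tool scenarios
    return codex_calls / baseline >= call_ratio_limit


if __name__ == "__main__":
    # runtime-tool-fs-read: 40 Codex tool calls vs 2 for Pi.
    print(looks_like_tool_loop(40, 2, 119_489, 72_381))
    # streaming-final-integrity: no tools on either side.
    print(looks_like_tool_loop(0, 0, 21_946, 17_887))
```

The check separates the two Codex-more-expensive rows: runtime-tool-fs-read is flagged because of its 20x tool-call ratio, while streaming-final-integrity is not, since its overhead cannot come from tool calls.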
