openclaw - 💡(How to fix) Fix [QA-lab] Complete live-frontier token-efficiency and Testbox parity proof [2 comments, 2 participants]

GitHub stats
openclaw/openclaw#80397 (fetched 2026-05-11 03:15:09)
Comments: 2
Participants: 2
Timeline events: 6 (cross-referenced ×4, commented ×2)
Reactions: 2


Parent: #80171
Related PR: #80323
Related plugin wrapper issue: #80365
Related local mock drift issues: #80364, #80395

Why this issue exists

The runtime/prompt/tool parity harness now has local mock proof across the implemented suites, including plugin-backed runs for first-hour, first-hour-20, tool-defaults, tool-coverage, harness-parity, jsonl-replay, and soak-100. That does not replace the live/Testbox proof requested by the expansion plan.

This issue tracks the remaining validation gap so the project does not accidentally treat mock-estimate token efficiency as live token truth.

Completed local proof

Plugin-backed local mock runs completed for:

  • first-hour: 17 runtime-pair scenarios; report failed on known/observed runtime drift; token-efficiency summary was estimated.
  • maintainer-gate / first-hour-20: 18 runtime-pair scenarios; report failed on known/observed runtime drift; token-efficiency summary was estimated.
  • tool-defaults: 20 fixtures; report found 13 tool-call-shape drift rows; token-efficiency summary was estimated.
  • tool-coverage: 20 tools, 14 required default, 6 optional/plugin-dependent; passed because drift rows have tracking.
  • compare-harnesses: pi vs pi on approval-turn-tool-followthrough; passed.
  • jsonl-replay: 3 curated transcripts, 7 user turns; 0 drifted transcripts.
  • soak-100: completed under mock mode; found structural drift tracked in #80395.
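
The tool-coverage result above ("passed because drift rows have tracking") suggests a gate that tolerates drift as long as each drifted row points at a tracking issue. A minimal sketch of that rule follows. The `DriftRow` shape and both function names are hypothetical and are not taken from the harness itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriftRow:
    # Hypothetical shape: one row per observed tool-call-shape drift.
    tool: str
    tracking_issue: Optional[str] = None  # issue reference; None means untracked


def coverage_gate_passes(rows: list[DriftRow]) -> bool:
    """Pass when every drift row carries a tracking issue (assumed rule)."""
    return all(row.tracking_issue for row in rows)


def untracked(rows: list[DriftRow]) -> list[str]:
    """Names of tools whose drift has no tracking issue yet."""
    return [row.tool for row in rows if not row.tracking_issue]
```

Under this assumption a report with drift can still pass, which is why a passing tool-coverage run does not by itself mean zero drift.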

Remaining proof needed

  • Run one live first-hour parity lane with real assistant-message usage captured from live provider responses.
  • Run selected live first-hour-20 rows for token-efficiency comparison if maintainers want the release report to include them.
  • Run/schedule the optional soak-100 lane in Testbox or scheduled infrastructure, not as a required maintainer gate.
  • Attach the live token-efficiency artifacts to #80323 or the follow-up PR/issue once available.
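
As a rough illustration of "real assistant-message usage captured from live provider responses", the sketch below totals provider-reported token counts across assistant messages and counts the turns that carry no usage data. The message layout and the `usage`, `input_tokens`, and `output_tokens` keys are assumptions for illustration only. Real providers and the actual harness may name these fields differently.

```python
from typing import Any, Iterable, Mapping


def sum_live_usage(messages: Iterable[Mapping[str, Any]]) -> dict[str, int]:
    """Total provider-reported tokens across assistant messages.

    Assumes each message is a mapping with a "role" and, for live runs,
    a "usage" mapping holding "input_tokens" and "output_tokens".
    These key names are hypothetical.
    """
    totals = {"input_tokens": 0, "output_tokens": 0, "messages_without_usage": 0}
    for msg in messages:
        if msg.get("role") != "assistant":
            continue  # only assistant turns carry provider usage
        usage = msg.get("usage")
        if not usage:
            # A turn without provider usage cannot count toward live proof.
            totals["messages_without_usage"] += 1
            continue
        totals["input_tokens"] += int(usage.get("input_tokens", 0))
        totals["output_tokens"] += int(usage.get("output_tokens", 0))
    return totals
```

A non-zero `messages_without_usage` count would be a signal that the lane did not capture usage for every assistant turn and that the resulting totals are incomplete.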

Guardrail

Mock-mode token efficiency must remain clearly labeled as an estimate. Do not use mock-mode token rows as live-token proof.
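
One way to enforce this guardrail in report code is to tag every token row with its source and refuse to treat anything other than a live row as proof. The following is a sketch under assumptions: `TokenRow`, the `"live"` and `"mock-estimate"` source values, and both helper names are hypothetical and do not come from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRow:
    scenario: str
    total_tokens: int
    source: str  # assumed values: "live" or "mock-estimate"


def label(row: TokenRow) -> str:
    """Render a row, marking every non-live row as an estimate."""
    suffix = "" if row.source == "live" else " (estimated)"
    return f"{row.scenario}: {row.total_tokens} tokens{suffix}"


def require_live(rows: list[TokenRow]) -> list[TokenRow]:
    """Return rows unchanged, or raise if any row is not live."""
    bad = [r.scenario for r in rows if r.source != "live"]
    if bad:
        raise ValueError(f"not live-token proof: {', '.join(bad)}")
    return rows
```

Failing loudly in `require_live` keeps a mock row from slipping into a live-proof artifact, while `label` keeps estimates visible in any mixed report.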
