openclaw - ✅(Solved) Fix [Codex×Pi parity Phase 2] Per-tool fixture set [2 pull requests, 2 comments, 2 participants]

100yenadmin · 2026-05-10T08:20:28Z

GitHub stats

openclaw/openclaw#80173•Fetched 2026-05-11 03:18:06

Author

100yenadmin

Participants

100yenadmin

clawsweeper[bot]

Timeline (top)

cross-referenced ×7, commented ×2

PR fix notes

PR #80179: docs(qa-lab): runtime-parity gate design (Pi vs Codex harness)

Repository: openclaw/openclaw
Author: 100yenadmin
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/80179

Description (problem / solution / changelog)

Summary

Adds extensions/qa-lab/transport-parity-gate.md, a design-only doc covering the Codex-vs-Pi runtime parity QA harness scoped in #80171.

The doc lifts forward the transport-parity-gate.md sketch from closed PR #78512 (which was originally tracking #78457) and expands it to include the surfaces the maintainer thread asked for:

Runtime-parity (pi vs codex for the same model+provider) — the higher-value gate now that Codex is the default for OpenAI turns
Per-tool fixture set so "tool X breaks under codex" surfaces at tool granularity, not session-level
Codex-plugin lifecycle stress (cold install, version pinning, install racing first turn, doctor migration safety)
Auth-shape coverage (oauth-only, apikey-only, mixed-profiles) for the #78499 class
Token-efficiency report — the side-by-side per-runtime cost table pash explicitly asked for
JSONL session-replay harness for Eva's "loop 3 agents on real jsonl" ask

The doc is the shared artifact the implementing agent (and @Eva-⚡🐑 / @pash for review) work against; sub-issues #80172, #80173, #80174, #80175, #80176 are the actual implementation work.

Why this is design-only

The original #78512 was closed because its it.fails reproduction test no longer encodes the right invariant against post-#79238 main. The design doc itself, however, is still load-bearing — it's the only place the matrix shape, drift classifier, capture format, and CI wiring intent are written down. Splitting it out as a design-only PR avoids re-litigating the closure on every implementation PR and gives reviewers something to react to before code lands.

Verification

pnpm exec oxfmt --check --threads=1 extensions/qa-lab/transport-parity-gate.md — clean
No code, runtime, workflow, or test changes — pure docs
Markdown-only diff; refs all link to issues that exist (#80171–#80176, #74290, #79347, #78457, #78055, #78060, #78407, #78499, #79238, #74622)

Test plan

- [x] Format check passes (oxfmt)
- [x] All referenced issues exist
- [x] Design intent matches the maintainer thread (pash + Eva + ai-hpc, Yesterday)
- [ ] Maintainer review on matrix shape and per-cell capture format before Phase 1 (#80172) starts implementation

References

RFC + tracking: #80171
Sub-issues:
- Phase 1 (Runtime axis): #80172
- Phase 2 (Per-tool fixtures): #80173
- Phase 3 (Codex-plugin lifecycle): #80174
- Phase 4 (Token-efficiency report): #80175
- Phase 5 (JSONL replay): #80176
Sibling model-axis parity: #74290 (closed) → #79347 (in flight)
Original transport-parity proposal: #78457
Closed PR with the original draft of this doc: #78512

Changed files

extensions/qa-lab/transport-parity-gate.md (added, +148/-0)

PR #80323: [qa-lab] Complete Codex vs Pi runtime parity harness phases 2-5

Repository: openclaw/openclaw
Author: 100yenadmin
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/80323

Description (problem / solution / changelog)

Summary

Adds the Codex-vs-Pi runtime parity QA harness across extensions/qa-lab, including runtime-pair execution, first-hour/depth suite selectors, harness-prompt parity, token-efficiency reporting, tool-default fixtures, JSONL replay scaffolding, and release-check wiring.

This update also corrects the tool-defaults mock lane so the harness matches Codex app-server architecture:

Codex-native workspace tools (read, write, edit, apply_patch, exec, process, update_plan) are no longer expected to appear as duplicate OpenClaw dynamic tools.
OpenClaw integration tools (image_generate, sessions, web, etc.) remain dynamic-tool parity rows and are tracked separately from Codex-native behavior rows.
Optional/profile/plugin-dependent tools stay report-only unless explicitly enabled.
Mock provider planned tool calls are captured as provider-plan diagnostics, not as runtime transcript tool evidence.
Tool coverage reports now show bucket, expected layer, required/report-only status, product impact, QA impact, and action.
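
The bucket split described above can be pictured as a small lookup. This is only a sketch: the ToolBucket labels, the classifyTool helper, and the OPENCLAW_DYNAMIC sample set are assumptions made for illustration, not the contents of runtime-tool-metadata.ts. Only the Codex-native tool names come from the list above, and treating both non-optional buckets as required is likewise an assumption.

```typescript
// Hypothetical bucket labels; the real harness may name these differently.
type ToolBucket = "codex-native" | "openclaw-dynamic" | "optional";

interface CoverageExpectation {
  tool: string;
  bucket: ToolBucket;
  // Required rows gate the verdict; report-only rows are listed but never fail it.
  required: boolean;
}

// Codex-native workspace tools, as named in the PR summary.
const CODEX_NATIVE = new Set<string>([
  "read", "write", "edit", "apply_patch", "exec", "process", "update_plan",
]);

// Assumed sample of OpenClaw integration tools, for illustration only.
const OPENCLAW_DYNAMIC = new Set<string>(["image_generate", "sessions_spawn", "web_search"]);

function classifyTool(tool: string, explicitlyEnabled = false): CoverageExpectation {
  if (CODEX_NATIVE.has(tool)) {
    // Owned by Codex itself, so no duplicate OpenClaw dynamic tool is expected.
    return { tool, bucket: "codex-native", required: true };
  }
  if (OPENCLAW_DYNAMIC.has(tool)) {
    return { tool, bucket: "openclaw-dynamic", required: true };
  }
  // Optional/profile/plugin-dependent tools stay report-only unless enabled.
  return { tool, bucket: "optional", required: explicitlyEnabled };
}
```

Keeping the bucket on every row is what lets a report separate intentional Codex-native ownership from genuine dynamic-tool drift.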

Why

OpenClaw needs a maintainer-runnable gate that compares the same scenario/model under Pi and Codex before Codex becomes the default runtime. The gate must surface real runtime drift without turning mock-provider limitations or intentional Codex-native tool ownership into production bug reports.

Verification

Passing targeted/current-scope checks:

pnpm test extensions/qa-lab/src/runtime-tool-fixture.test.ts extensions/qa-lab/src/runtime-parity.test.ts extensions/qa-lab/src/tool-coverage-report.test.ts extensions/qa-lab/src/runtime-suite.test.ts extensions/qa-lab/src/suite.test.ts extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/cli.test.ts
pnpm tsgo:extensions:test
pnpm check:test-types
git diff --check

Real Behavior Proof

Behavior or issue addressed: Corrects the runtime parity tool-defaults harness so Codex-native workspace tools are no longer falsely required as duplicate OpenClaw dynamic tools, while OpenClaw dynamic integration rows remain visible and tracked.
Real environment tested: Local OpenClaw checkout at /Volumes/LEXAR/repos/openclaw-1 on branch codex-vs-pi-runtime-parity-tools, running the real pnpm openclaw qa CLI against the embedded gateway and mock OpenAI provider after this patch.
Exact steps or command run after this patch:

OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite tool-defaults --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/runtime-tools-correction
pnpm openclaw qa tool-coverage --repo-root . --summary .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json --runtime-pair pi,codex --output .artifacts/qa-e2e/runtime-tools-correction/qa-tool-coverage-report.md
OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite openclaw-dynamic-tools --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/openclaw-dynamic-tools-correction
pnpm openclaw qa parity-report --repo-root . --runtime-axis --summary .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json --output-dir .artifacts/qa-e2e/runtime-tools-correction/parity --token-efficiency

Evidence after fix: Terminal output produced these real local artifacts: .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json, .artifacts/qa-e2e/runtime-tools-correction/qa-suite-report.md, .artifacts/qa-e2e/runtime-tools-correction/qa-tool-coverage-report.md, .artifacts/qa-e2e/openclaw-dynamic-tools-correction/qa-suite-summary.json, and .artifacts/qa-e2e/runtime-tools-correction/parity/qa-runtime-token-efficiency-report.md.
Observed result after fix: tool-defaults completed with 20 scenarios, 15 pass, 5 report-only skip, 0 fail. Tool coverage verdict was pass with 13 required tools, 8 Codex-native workspace tools, 5 OpenClaw dynamic integration tools, 7 optional/profile/plugin tools, and 0 failing tools. The focused openclaw-dynamic-tools suite completed with 5 report-only rows tracked under #80319. Token efficiency report verdict was pass with usage source mock-estimate.
What was not tested: Live frontier token-efficiency proof was not completed because local direct OpenAI auth is missing; optional scheduled/Testbox soak-100 proof was not completed; broad first-hour-20 remains red and is tracked in #80434.

Known Broad/Latest Blockers

The first first-hour-20 attempt hit a pre-suite tsdown SIGSEGV; the retry reached QA.
The retry (OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite first-hour-20 --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/first-hour-20-correction-retry) is not green: 18 total, 6 pass, 12 fail; tracked in #80434.
pnpm check fails on an unrelated Discord lint error: #80428.
pnpm test fails in unrelated agents-core / ACPx / Mattermost shards: #80429, #80430, #80431, #67784.
Live token-efficiency proof path renders artifacts, but local direct OpenAI auth is missing so the attempted live run is not valid proof; tracked in #80175.
Optional soak-100 exists but is not scheduled/Testbox-wired; tracked in #80433.

Linked Issues

Umbrella/spec: #80171

Phase issues: #80172, #80173, #80174, #80175, #80176

Harness correction issues: #80236, #80312, #80319, #80320; #80321 is closed as fixed by this PR branch.

Fresh broad-rerun follow-ups: #80428, #80429, #80430, #80431, #80433, #80434, #67784

Changed files

.github/workflows/openclaw-release-checks.yml (modified, +115/-0)
.github/workflows/qa-live-transports-convex.yml (modified, +77/-0)
apps/shared/OpenClawKit/Sources/OpenClawProtocol/GatewayModels.swift (modified, +4/-0)
extensions/codex/src/app-server/schema-normalization-runtime-contract.test.ts (modified, +9/-4)
extensions/lmstudio/src/models.test.ts (modified, +1/-1)
extensions/qa-lab/src/agentic-parity-report.test.ts (modified, +120/-0)
extensions/qa-lab/src/agentic-parity-report.ts (modified, +218/-0)
extensions/qa-lab/src/auth-profile-fixture.ts (added, +177/-0)
extensions/qa-lab/src/cli.runtime.test.ts (modified, +282/-0)
extensions/qa-lab/src/cli.runtime.ts (modified, +416/-3)
extensions/qa-lab/src/cli.ts (modified, +175/-7)
extensions/qa-lab/src/codex-plugin-fixture.ts (added, +282/-0)
extensions/qa-lab/src/codex-plugin-lifecycle.test.ts (added, +190/-0)
extensions/qa-lab/src/gateway-child.ts (modified, +7/-0)
extensions/qa-lab/src/harness-parity.test.ts (added, +144/-0)
extensions/qa-lab/src/harness-parity.ts (added, +415/-0)
extensions/qa-lab/src/jsonl-replay.test.ts (added, +169/-0)
extensions/qa-lab/src/jsonl-replay.ts (added, +270/-0)
extensions/qa-lab/src/multipass.runtime.test.ts (modified, +11/-0)
extensions/qa-lab/src/multipass.runtime.ts (modified, +6/-0)
extensions/qa-lab/src/providers/mock-openai/server.ts (modified, +74/-3)
extensions/qa-lab/src/runtime-parity.test.ts (added, +427/-0)
extensions/qa-lab/src/runtime-parity.ts (added, +1119/-0)
extensions/qa-lab/src/runtime-suite.test.ts (added, +75/-0)
extensions/qa-lab/src/runtime-suite.ts (added, +147/-0)
extensions/qa-lab/src/runtime-tool-fixture.test.ts (added, +156/-0)
extensions/qa-lab/src/runtime-tool-fixture.ts (added, +291/-0)
extensions/qa-lab/src/runtime-tool-metadata.ts (added, +142/-0)
extensions/qa-lab/src/scenario-catalog.test.ts (modified, +10/-0)
extensions/qa-lab/src/scenario-catalog.ts (modified, +4/-0)
extensions/qa-lab/src/scenario-flow-runner.ts (modified, +1/-1)
extensions/qa-lab/src/scenario-runtime-api.test.ts (modified, +1/-0)
extensions/qa-lab/src/scenario-runtime-api.ts (modified, +3/-0)
extensions/qa-lab/src/suite-runtime-flow.ts (modified, +13/-1)
extensions/qa-lab/src/suite-summary.ts (modified, +4/-1)
extensions/qa-lab/src/suite.summary-json.test.ts (modified, +53/-0)
extensions/qa-lab/src/suite.test.ts (modified, +100/-0)
extensions/qa-lab/src/suite.ts (modified, +449/-2)
extensions/qa-lab/src/token-efficiency-report.test.ts (added, +218/-0)
extensions/qa-lab/src/token-efficiency-report.ts (added, +379/-0)
extensions/qa-lab/src/tool-coverage-report.test.ts (added, +288/-0)
extensions/qa-lab/src/tool-coverage-report.ts (added, +285/-0)
extensions/qa-lab/transport-parity-gate.md (added, +66/-0)
extensions/qqbot/src/bridge/tools/remind.test.ts (modified, +1/-1)
extensions/qqbot/src/engine/gateway/outbound-dispatch.test.ts (modified, +1/-1)
extensions/slack/src/monitor/media.test.ts (modified, +3/-3)
extensions/tavily/src/tavily-tools.test.ts (modified, +3/-1)
qa/scenarios/agents/instruction-followthrough-repo-contract.md (modified, +1/-0)
qa/scenarios/agents/subagent-fanout-synthesis.md (modified, +1/-0)
qa/scenarios/agents/subagent-handoff.md (modified, +1/-0)
qa/scenarios/agents/subagent-stale-child-links.md (modified, +1/-0)
qa/scenarios/channels/channel-chat-baseline.md (modified, +1/-0)
qa/scenarios/config/config-restart-capability-flip.md (modified, +1/-0)
qa/scenarios/jsonl-replay/plan-mode-boundaries.jsonl (added, +8/-0)
qa/scenarios/jsonl-replay/recovery-partial-session.jsonl (added, +4/-0)
qa/scenarios/jsonl-replay/repo-triage-tool-loop.jsonl (added, +7/-0)
qa/scenarios/memory/memory-recall.md (modified, +1/-0)
qa/scenarios/memory/thread-memory-isolation.md (modified, +1/-0)
qa/scenarios/models/model-switch-tool-continuity.md (modified, +1/-0)
qa/scenarios/runtime/approval-turn-tool-followthrough.md (modified, +1/-0)
qa/scenarios/runtime/auth-profile-codex-mixed-profiles.md (added, +39/-0)
qa/scenarios/runtime/auth-profile-doctor-migration-safety.md (added, +44/-0)
qa/scenarios/runtime/codex-plugin-cold-install.md (added, +42/-0)
qa/scenarios/runtime/codex-plugin-install-race.md (added, +38/-0)
qa/scenarios/runtime/codex-plugin-pinned-new.md (added, +39/-0)
qa/scenarios/runtime/codex-plugin-pinned-old.md (added, +39/-0)
qa/scenarios/runtime/compaction-retry-mutating-tool.md (modified, +1/-0)
qa/scenarios/runtime/first-hour-20-turn.md (added, +68/-0)
qa/scenarios/runtime/soak-100-turn.md (added, +68/-0)
qa/scenarios/runtime/tools/apply-patch.md (added, +54/-0)
qa/scenarios/runtime/tools/bash.md (added, +55/-0)
qa/scenarios/runtime/tools/edit.md (added, +54/-0)
qa/scenarios/runtime/tools/exec.md (added, +54/-0)
qa/scenarios/runtime/tools/fs-list.md (added, +54/-0)
qa/scenarios/runtime/tools/fs-read.md (added, +54/-0)
qa/scenarios/runtime/tools/fs-write.md (added, +54/-0)
qa/scenarios/runtime/tools/grep.md (added, +54/-0)
qa/scenarios/runtime/tools/image-generate.md (added, +55/-0)
qa/scenarios/runtime/tools/memory-add.md (added, +54/-0)
qa/scenarios/runtime/tools/memory-recall.md (added, +54/-0)
qa/scenarios/runtime/tools/message-tool.md (added, +52/-0)
qa/scenarios/runtime/tools/session-status.md (added, +54/-0)
qa/scenarios/runtime/tools/sessions-spawn.md (added, +54/-0)
qa/scenarios/runtime/tools/skill-invocation.md (added, +54/-0)
qa/scenarios/runtime/tools/tavily-extract.md (added, +53/-0)
qa/scenarios/runtime/tools/tavily-search.md (added, +53/-0)
qa/scenarios/runtime/tools/tts.md (added, +54/-0)
qa/scenarios/runtime/tools/web-fetch.md (added, +54/-0)
qa/scenarios/runtime/tools/web-search.md (added, +54/-0)
qa/scenarios/workspace/source-docs-discovery-report.md (modified, +1/-0)
scripts/deadcode-unused-files.allowlist.mjs (modified, +2/-0)
src/agents/model-runtime-policy.test.ts (added, +91/-0)
src/agents/model-runtime-policy.ts (modified, +16/-0)

Tracking parent: #80171
Depends on: Phase 1 #80172

Goal

Build a deterministic per-tool fixture set so the runtime-parity harness can surface "tool X breaks under codex" at tool granularity, not just at session level. This is the deliverable Eva called out: "test all tools and long runs in harness to get to 100% parity and use to debug all the edge cases."

Scope

One fixture per tool family. Each fixture is deterministic: the prompt forces exactly one tool call with predictable arguments. The harness asserts the tool was invoked, completed, and result shape matches between runtimes.
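
One way to read "result shape matches between runtimes" is to compare the structure of the two tool results while ignoring their concrete values. A minimal sketch, assuming tool results are plain JSON values; shapeOf and sameShape are hypothetical helpers, not functions from the qa-lab harness:

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Reduce a JSON value to a string that describes its structure, not its values.
function shapeOf(value: Json): string {
  if (value === null) return "null";
  if (Array.isArray(value)) {
    // Deduplicate and sort element shapes so length and ordering do not matter.
    const shapes = Array.from(new Set(value.map((item) => shapeOf(item)))).sort();
    return `[${shapes.join("|")}]`;
  }
  if (typeof value === "object") {
    const fields = Object.entries(value)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([key, child]) => `${key}:${shapeOf(child)}`);
    return `{${fields.join(",")}}`;
  }
  return typeof value; // "string", "number", or "boolean"
}

// True when both runtimes returned structurally identical results.
function sameShape(piResult: Json, codexResult: Json): boolean {
  return shapeOf(piResult) === shapeOf(codexResult);
}
```

Under this sketch, { stdout: "hello\n", exitCode: 0 } and { stdout: "hi", exitCode: 1 } have the same shape, while { stdout: "x" } and { output: "x" } do not, which is the kind of difference a tool-result-shape drift label would describe.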

Tool families to cover

(Source: src/agents/pi-tools.create-openclaw-coding-tools.ts and Codex harness contract — finalise the list in the PR by reading both surfaces.)

bash — bash echo hello
exec — approval-required exec "ls -la /tmp" flow
fs.read, fs.write, fs.list — read/write/list a temp file
grep — grep for a literal in a fixture file
edit / apply-patch — apply a small unified diff
web_search — search for a fixed query (mock provider returns fixed results)
web_fetch — fetch a fixed URL (mock provider returns fixed body)
tavily_search, tavily_extract
image_generate — generate against the qa-lab mock image provider
tts — synth a fixed phrase against the mock TTS provider
message-tool — message-tool send to a mock channel; media variant
session_status, sessions_spawn
memory.recall, memory.add (if pi-only, mark as expected drift with a known-broken marker)
skill_* invocations

For each tool family, also one fixture for the failure mode (denied input, oversized payload, etc.) so error-path drift is captured.

Concrete deliverables

Fixtures

qa/scenarios/runtime/tools/<tool>.md — one file per family. Reuse the existing scenario format already used by approval-turn-tool-followthrough.md.
Each fixture exports both a happy-path and a failure-path scenario.
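
The "both a happy-path and a failure-path scenario" rule lends itself to a simple completeness check. The sketch below assumes each scenario can be summarised as a tool name plus a kind; ScenarioRef and missingVariants are hypothetical names and do not reflect the actual scenario file format.

```typescript
type ScenarioKind = "happy" | "failure";

interface ScenarioRef {
  tool: string;       // tool family, e.g. "bash"
  kind: ScenarioKind; // which path the scenario exercises
}

// Return "tool:kind" entries for every variant a tool family is missing.
function missingVariants(families: string[], scenarios: ScenarioRef[]): string[] {
  const seen = new Set(scenarios.map((s) => `${s.tool}:${s.kind}`));
  const kinds: ScenarioKind[] = ["happy", "failure"];
  const missing: string[] = [];
  for (const tool of families) {
    for (const kind of kinds) {
      if (!seen.has(`${tool}:${kind}`)) missing.push(`${tool}:${kind}`);
    }
  }
  return missing;
}
```

An empty result means every listed family has both variants; anything else names exactly which fixture still needs writing.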

Code

Extend extensions/qa-lab/src/runtime-parity.ts (from Phase 1) — add toolBreakdown field to the report so per-tool drift surfaces alongside per-scenario drift.

New extensions/qa-lab/src/tool-coverage-report.ts — generates a Markdown coverage table:

| tool | pi | codex | drift | tracking |
|------|----|-------|-------|----------|
| bash | ✅  | ✅     | none  |          |
| exec | ✅  | ❌     | tool-result-shape | #issue |

Extend extensions/qa-lab/src/cli.ts — new qa tool-coverage --runtime-pair pi,codex command.
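
The report pieces above could fit together roughly as follows: a per-tool breakdown keyed by tool name, rendered into the Markdown table shown earlier. The ToolBreakdownEntry shape and the renderCoverageTable name are assumptions made for illustration; the actual toolBreakdown field and tool-coverage-report.ts may look different.

```typescript
type CellStatus = "pass" | "fail";

// Hypothetical per-tool entry in the parity report's toolBreakdown field.
interface ToolBreakdownEntry {
  pi: CellStatus;
  codex: CellStatus;
  drift: string;     // e.g. "none" or "tool-result-shape"
  tracking?: string; // tracking issue reference, if one exists
}

type ToolBreakdown = Record<string, ToolBreakdownEntry>;

// Render the breakdown as a Markdown table, one row per tool, sorted by name.
function renderCoverageTable(breakdown: ToolBreakdown): string {
  const mark = (status: CellStatus): string => (status === "pass" ? "✅" : "❌");
  const rows = Object.entries(breakdown)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(
      ([tool, entry]) =>
        `| ${tool} | ${mark(entry.pi)} | ${mark(entry.codex)} | ${entry.drift} | ${entry.tracking ?? ""} |`,
    );
  return [
    "| tool | pi | codex | drift | tracking |",
    "|------|----|-------|-------|----------|",
    ...rows,
  ].join("\n");
}
```

Sorting by tool name keeps the output stable between runs, which makes the rendering test a plain string comparison.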

Tests

Each fixture has a self-test running it through the mock provider on both runtimes (no qa-lab harness dependency for the self-test — keeps fixtures portable).
Coverage report rendering test.

Acceptance criteria

Each tool family in the list above has a qa/scenarios/runtime/tools/<tool>.md fixture.
Each fixture passes both cells under --runtime-pair pi,codex against current main, OR is annotated with a known-broken marker pointing at a tracking issue (file the tracking issue as part of this PR if discovered).
The runtime-parity report enumerates per-tool drift, not just per-scenario drift.
pnpm openclaw qa tool-coverage --runtime-pair pi,codex produces a Markdown table suitable for the README of the harness.
pnpm check:test-types and pnpm exec oxlint clean.
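
The second criterion amounts to a small decision rule per fixture. A sketch under stated assumptions: the FixtureResult shape, the knownBroken field, and the verdict labels are hypothetical and are not the harness's actual types.

```typescript
interface FixtureResult {
  tool: string;
  pi: boolean;    // did the pi cell pass?
  codex: boolean; // did the codex cell pass?
  // Tracking issue reference when the fixture carries a known-broken marker.
  knownBroken?: string;
}

type Verdict = "pass" | "known-broken" | "fail";

function fixtureVerdict(result: FixtureResult): Verdict {
  if (result.pi && result.codex) return "pass";
  // A failing cell is acceptable only when a tracking issue is recorded.
  if (result.knownBroken !== undefined && result.knownBroken.trim() !== "") {
    return "known-broken";
  }
  return "fail";
}

// The fixture set is acceptable when no fixture ends up as an untracked failure.
function gatePasses(results: FixtureResult[]): boolean {
  return results.every((result) => fixtureVerdict(result) !== "fail");
}
```

Requiring a non-empty tracking reference keeps the known-broken marker from becoming a silent skip.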

Out of scope

Plugin-lifecycle stress (Phase 3).
Token efficiency (Phase 4).
Live-mode runs — fixtures must be hermetic in this PR.

References

Tracking parent: #80171
Phase 1: #80172
Existing scenario format: qa/scenarios/runtime/approval-turn-tool-followthrough.md
Tool surface (Pi): src/agents/pi-tools.create-openclaw-coding-tools.ts
Tool surface (Codex): codex harness contract — see extensions/codex/src/
