openclaw - ✅(Solved) Fix Codex-vs-Pi runtime parity QA harness (RFC + tracking) [3 pull requests, 7 comments, 2 participants]

GitHub stats
openclaw/openclaw#80171 · Fetched 2026-05-11 03:18:10
Comments: 7 · Participants: 2 · Timeline: 32 · Reactions: 2
Timeline (top): cross-referenced ×25, commented ×7

Per yesterday's maintainer thread between @pash, @Eva-⚡🐑, and @ai-hpc, OpenClaw is moving to Codex as the default runtime for OpenAI agent turns. The Pi-built tool surface, doctor migrations, plugin install/version flows, and auth-profile selection share a known class of regressions when the runtime axis is flipped; recent issues #78055, #78060, #78407, and #78499 all cluster around this surface.

The maintainer ask:

  • @pash — stability + low-hanging optimisations ahead of the announcement; codex-plugin install/update ergonomics; version-pinning regression coverage; "fails clearly, remediation steps clear, ux is good"; token efficiency report.
  • @Eva-⚡🐑 — full parity QA pi-vs-codex; loop 3 agents on difficult scenarios from real jsonl session history; all tools and long runs to 100% parity; debug logging on each cell.
  • @ai-hpc — already manually verified the 4-cell doctor-migration matrix on current main; needs to be codified as a harness that can't regress.

The existing model-axis parity gate (introduced in #74290, folded into release validation by #74622, baseline bump in flight at #79347) compares gpt-5.5 vs claude-opus-4-7 — same runtime, two different models. The new harness is orthogonal: same model, two different runtimes.

This RFC sketches the architecture for the runtime-parity harness so that implementation can be split into reviewable sub-issues. It builds on (and replaces) the proposal sketched in extensions/qa-lab/transport-parity-gate.md from closed PR #78512.

Error Message

  • wall-clock-ms, transport-error-class?, runtime-error-class?.
  • Failure modes have asserted error messages (string match) so any wording regression is caught.

PR fix notes

PR #80179: docs(qa-lab): runtime-parity gate design (Pi vs Codex harness)

Description (problem / solution / changelog)

Summary

Adds extensions/qa-lab/transport-parity-gate.md, a design-only doc covering the Codex-vs-Pi runtime parity QA harness scoped in #80171.

The doc lifts forward the transport-parity-gate.md sketch from closed PR #78512 (which was originally tracking #78457) and expands it to include the surfaces the maintainer thread asked for:

  • Runtime-parity (pi vs codex for the same model+provider) — the higher-value gate now that Codex is the default for OpenAI turns
  • Per-tool fixture set so "tool X breaks under codex" surfaces at tool granularity, not session-level
  • Codex-plugin lifecycle stress (cold install, version pinning, install racing first turn, doctor migration safety)
  • Auth-shape coverage (oauth-only, apikey-only, mixed-profiles) for the #78499 class
  • Token-efficiency report — the side-by-side per-runtime cost table pash explicitly asked for
  • JSONL session-replay harness for Eva's "loop 3 agents on real jsonl" ask

The doc is the shared artifact the implementing agent (and @Eva-⚡🐑 / @pash for review) work against; sub-issues #80172, #80173, #80174, #80175, #80176 are the actual implementation work.

Why this is design-only

The original #78512 was closed because its it.fails reproduction test no longer encodes the right invariant against post-#79238 main. The design doc itself, however, is still load-bearing — it's the only place the matrix shape, drift classifier, capture format, and CI wiring intent are written down. Splitting it out as a design-only PR avoids re-litigating the closure on every implementation PR and gives reviewers something to react to before code lands.

Verification

  • pnpm exec oxfmt --check --threads=1 extensions/qa-lab/transport-parity-gate.md — clean
  • No code, runtime, workflow, or test changes — pure docs
  • Markdown-only diff; refs all link to issues that exist (#80171–#80176, #74290, #79347, #78457, #78055, #78060, #78407, #78499, #79238, #74622)

Test plan

  • Format check passes (oxfmt)
  • All referenced issues exist
  • Design intent matches the maintainer thread (pash + Eva + ai-hpc, Yesterday)
  • Maintainer review on matrix shape and per-cell capture format before Phase 1 (#80172) starts implementation

References

  • RFC + tracking: #80171
  • Sub-issues:
    • Phase 1 (Runtime axis): #80172
    • Phase 2 (Per-tool fixtures): #80173
    • Phase 3 (Codex-plugin lifecycle): #80174
    • Phase 4 (Token-efficiency report): #80175
    • Phase 5 (JSONL replay): #80176
  • Sibling model-axis parity: #74290 (closed) → #79347 (in flight)
  • Original transport-parity proposal: #78457
  • Closed PR with the original draft of this doc: #78512

Changed files

  • extensions/qa-lab/transport-parity-gate.md (added, +148/-0)

PR #80238: test(qa-lab): add Codex vs Pi runtime parity harness

Description (problem / solution / changelog)

Why

Codex is moving toward the default OpenAI runtime, but the existing release parity checks compare model behavior, not runtime behavior. That leaves a known blind spot: the same scenario and same model can pass under Pi while drifting under Codex at the tool layer.

This adds the Phase 1 runtime axis from #80172 so qa-lab can run each scenario once as pi and once as codex, capture per-runtime cells, and classify drift at the tool/result/structure/failure level instead of only reporting a session-level pass/fail.

Part of #80171. Closes #80172. Detected follow-up drift: #80236.

What Changed

  • Adds the private-QA-only OPENCLAW_QA_FORCE_RUNTIME=pi|codex override in resolveModelRuntimePolicy, gated by OPENCLAW_BUILD_PRIVATE_QA=1.
  • Adds extensions/qa-lab/src/runtime-parity.ts with the runtime cell shape, assistant-message usage capture, provider-side mock /debug/requests tool capture, and six-bucket drift classifier.
  • Adds the qa suite --runtime-pair pi,codex flag and a runtime-axis report mode: qa parity-report --runtime-axis --summary <path>.
  • Extends suite summaries and runtime parity Markdown reporting with per-runtime cells and aggregate drift counts.
  • Wires qa_lab_runtime_parity_release_checks into openclaw-release-checks.yml next to the existing model-axis parity lane.
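
The gating described in the first bullet can be sketched as a small pure function. This is a minimal illustration, not the code in resolveModelRuntimePolicy: the helper name readForcedRuntime and the choice to pass the environment in as a plain object are assumptions made so the example stays self-contained.

```typescript
type RuntimeId = "pi" | "codex";
type Env = Record<string, string | undefined>;

// Hypothetical helper: returns the forced runtime only when the private-QA
// build flag is set AND the override holds one of the two known values.
function readForcedRuntime(env: Env): RuntimeId | undefined {
  // Outside private-QA builds the override is ignored entirely.
  if (env["OPENCLAW_BUILD_PRIVATE_QA"] !== "1") {
    return undefined;
  }
  const raw = env["OPENCLAW_QA_FORCE_RUNTIME"];
  // Anything other than an exact "pi" or "codex" is treated as "no override".
  return raw === "pi" || raw === "codex" ? raw : undefined;
}

const forced = readForcedRuntime({
  OPENCLAW_BUILD_PRIVATE_QA: "1",
  OPENCLAW_QA_FORCE_RUNTIME: "codex",
});
console.log(forced);
```

Keeping the check in one place means a user config is never mutated and a stray OPENCLAW_QA_FORCE_RUNTIME in a normal build has no effect.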

Real-Behavior Proof

The harness caught a real drift in approval-turn-tool-followthrough:

OPENCLAW_BUILD_PRIVATE_QA=1 \
OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 \
OPENCLAW_QA_SUITE_PROGRESS=1 \
pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --scenario approval-turn-tool-followthrough \
  --concurrency 1 \
  --runtime-pair pi,codex \
  --output-dir .artifacts/qa-e2e/runtime-parity-proof-approval-remap5

The suite exits nonzero because drift is present, but it writes the runtime summary. The follow-up report command:

OPENCLAW_BUILD_PRIVATE_QA=1 \
OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 \
pnpm openclaw qa parity-report \
  --repo-root . \
  --runtime-axis \
  --summary .artifacts/qa-e2e/runtime-parity-proof-approval-remap5/qa-suite-summary.json \
  --output-dir .artifacts/qa-e2e/runtime-parity-proof-approval-remap5-report

Observed report excerpt:

| Tool-result-shape drift | 1 |

- Approval turn tool followthrough drift=tool-result-shape (tool result 1 differs (read)).

pi: pass (1 tool calls, 256 tokens)
codex: fail (1 tool calls, 176 tokens)

The captured cells show both runtimes planned read with the same args hash, while Codex returned unsupported call: read. I filed that runtime bug as #80236 instead of hiding it in the harness PR.

Verification

pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts \
  extensions/qa-lab/src/runtime-parity.test.ts \
  extensions/qa-lab/src/suite.summary-json.test.ts \
  extensions/qa-lab/src/agentic-parity-report.test.ts \
  extensions/qa-lab/src/cli.runtime.test.ts \
  extensions/qa-lab/src/multipass.runtime.test.ts \
  extensions/qa-lab/src/suite.test.ts

pnpm exec vitest run --config test/vitest/vitest.agents.config.ts \
  src/agents/model-runtime-policy.test.ts

pnpm tsgo:core
pnpm tsgo:core:test
pnpm tsgo:extensions:test
pnpm check:test-types
pnpm exec oxlint --type-aware --tsconfig config/tsconfig/oxlint.json --allow eslint/no-underscore-dangle \
  extensions/qa-lab/src/runtime-parity.ts \
  extensions/qa-lab/src/runtime-parity.test.ts \
  extensions/qa-lab/src/suite.ts \
  extensions/qa-lab/src/suite.test.ts \
  extensions/qa-lab/src/suite-summary.ts \
  extensions/qa-lab/src/suite.summary-json.test.ts \
  extensions/qa-lab/src/agentic-parity-report.ts \
  extensions/qa-lab/src/agentic-parity-report.test.ts \
  extensions/qa-lab/src/cli.ts \
  extensions/qa-lab/src/cli.runtime.ts \
  extensions/qa-lab/src/cli.runtime.test.ts \
  extensions/qa-lab/src/multipass.runtime.ts \
  extensions/qa-lab/src/multipass.runtime.test.ts \
  extensions/qa-lab/src/gateway-child.ts \
  src/agents/model-runtime-policy.ts \
  src/agents/model-runtime-policy.test.ts

Non-Goals

  • Does not add the Phase 2 per-tool fixture set yet.
  • Does not add Phase 3 Codex plugin lifecycle cells yet.
  • Does not add Phase 4 token-efficiency reporting beyond capturing per-cell assistant-message usage.
  • Does not add Phase 5 JSONL replay yet.

Changed files

  • .github/workflows/openclaw-release-checks.yml (modified, +73/-0)
  • extensions/qa-lab/src/agentic-parity-report.test.ts (modified, +108/-0)
  • extensions/qa-lab/src/agentic-parity-report.ts (modified, +205/-0)
  • extensions/qa-lab/src/cli.runtime.test.ts (modified, +109/-0)
  • extensions/qa-lab/src/cli.runtime.ts (modified, +62/-2)
  • extensions/qa-lab/src/cli.ts (modified, +17/-7)
  • extensions/qa-lab/src/gateway-child.ts (modified, +7/-0)
  • extensions/qa-lab/src/multipass.runtime.test.ts (modified, +11/-0)
  • extensions/qa-lab/src/multipass.runtime.ts (modified, +6/-0)
  • extensions/qa-lab/src/runtime-parity.test.ts (added, +313/-0)
  • extensions/qa-lab/src/runtime-parity.ts (added, +899/-0)
  • extensions/qa-lab/src/suite-summary.ts (modified, +3/-0)
  • extensions/qa-lab/src/suite.summary-json.test.ts (modified, +53/-0)
  • extensions/qa-lab/src/suite.test.ts (modified, +47/-0)
  • extensions/qa-lab/src/suite.ts (modified, +372/-0)
  • src/agents/model-runtime-policy.test.ts (added, +91/-0)
  • src/agents/model-runtime-policy.ts (modified, +16/-0)

PR #80323: [qa-lab] Complete Codex vs Pi runtime parity harness phases 2-5

Description (problem / solution / changelog)

Summary

Adds the Codex-vs-Pi runtime parity QA harness across extensions/qa-lab, including runtime-pair execution, first-hour/depth suite selectors, harness-prompt parity, token-efficiency reporting, tool-default fixtures, JSONL replay scaffolding, and release-check wiring.

This update also corrects the tool-defaults mock lane so the harness matches Codex app-server architecture:

  • Codex-native workspace tools (read, write, edit, apply_patch, exec, process, update_plan) are no longer expected to appear as duplicate OpenClaw dynamic tools.
  • OpenClaw integration tools (image_generate, sessions, web, etc.) remain dynamic-tool parity rows and are tracked separately from Codex-native behavior rows.
  • Optional/profile/plugin-dependent tools stay report-only unless explicitly enabled.
  • Mock provider planned tool calls are captured as provider-plan diagnostics, not as runtime transcript tool evidence.
  • Tool coverage reports now show bucket, expected layer, required/report-only status, product impact, QA impact, and action.

Why

OpenClaw needs a maintainer-runnable gate that compares the same scenario/model under Pi and Codex before Codex becomes the default runtime. The gate must surface real runtime drift without turning mock-provider limitations or intentional Codex-native tool ownership into production bug reports.

Verification

Passing targeted/current-scope checks:

  • pnpm test extensions/qa-lab/src/runtime-tool-fixture.test.ts extensions/qa-lab/src/runtime-parity.test.ts extensions/qa-lab/src/tool-coverage-report.test.ts extensions/qa-lab/src/runtime-suite.test.ts extensions/qa-lab/src/suite.test.ts extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/cli.test.ts
  • pnpm tsgo:extensions:test
  • pnpm check:test-types
  • git diff --check

Real Behavior Proof

  • Behavior or issue addressed: Corrects the runtime parity tool-defaults harness so Codex-native workspace tools are no longer falsely required as duplicate OpenClaw dynamic tools, while OpenClaw dynamic integration rows remain visible and tracked.
  • Real environment tested: Local OpenClaw checkout at /Volumes/LEXAR/repos/openclaw-1 on branch codex-vs-pi-runtime-parity-tools, running the real pnpm openclaw qa CLI against the embedded gateway and mock OpenAI provider after this patch.
  • Exact steps or command run after this patch:
OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite tool-defaults --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/runtime-tools-correction
pnpm openclaw qa tool-coverage --repo-root . --summary .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json --runtime-pair pi,codex --output .artifacts/qa-e2e/runtime-tools-correction/qa-tool-coverage-report.md
OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite openclaw-dynamic-tools --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/openclaw-dynamic-tools-correction
pnpm openclaw qa parity-report --repo-root . --runtime-axis --summary .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json --output-dir .artifacts/qa-e2e/runtime-tools-correction/parity --token-efficiency
  • Evidence after fix: Terminal output produced these real local artifacts: .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json, .artifacts/qa-e2e/runtime-tools-correction/qa-suite-report.md, .artifacts/qa-e2e/runtime-tools-correction/qa-tool-coverage-report.md, .artifacts/qa-e2e/openclaw-dynamic-tools-correction/qa-suite-summary.json, and .artifacts/qa-e2e/runtime-tools-correction/parity/qa-runtime-token-efficiency-report.md.
  • Observed result after fix: tool-defaults completed with 20 scenarios, 15 pass, 5 report-only skip, 0 fail. Tool coverage verdict was pass with 13 required tools, 8 Codex-native workspace tools, 5 OpenClaw dynamic integration tools, 7 optional/profile/plugin tools, and 0 failing tools. The focused openclaw-dynamic-tools suite completed with 5 report-only rows tracked under #80319. Token efficiency report verdict was pass with usage source mock-estimate.
  • What was not tested: Live frontier token-efficiency proof was not completed because local direct OpenAI auth is missing; optional scheduled/Testbox soak-100 proof was not completed; broad first-hour-20 remains red and is tracked in #80434.

Known Broad/Latest Blockers

  • The first attempt at first-hour-20 hit a pre-suite tsdown SIGSEGV; the retry reached QA.
  • OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite first-hour-20 --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/first-hour-20-correction-retry is not green: 18 total, 6 pass, 12 fail; tracked in #80434.
  • pnpm check fails unrelated Discord lint: #80428.
  • pnpm test fails unrelated agents-core / ACPx / Mattermost shards: #80429, #80430, #80431, #67784.
  • Live token-efficiency proof path renders artifacts, but local direct OpenAI auth is missing so the attempted live run is not valid proof; tracked in #80175.
  • Optional soak-100 exists but is not scheduled/Testbox-wired; tracked in #80433.

Linked Issues

Umbrella/spec: #80171

Phase issues: #80172, #80173, #80174, #80175, #80176

Harness correction issues: #80236, #80312, #80319, #80320; #80321 is closed as fixed by this PR branch.

Fresh broad-rerun follow-ups: #80428, #80429, #80430, #80431, #80433, #80434, #67784

Changed files

  • .github/workflows/openclaw-release-checks.yml (modified, +115/-0)
  • .github/workflows/qa-live-transports-convex.yml (modified, +77/-0)
  • apps/shared/OpenClawKit/Sources/OpenClawProtocol/GatewayModels.swift (modified, +4/-0)
  • extensions/codex/src/app-server/schema-normalization-runtime-contract.test.ts (modified, +9/-4)
  • extensions/lmstudio/src/models.test.ts (modified, +1/-1)
  • extensions/qa-lab/src/agentic-parity-report.test.ts (modified, +120/-0)
  • extensions/qa-lab/src/agentic-parity-report.ts (modified, +218/-0)
  • extensions/qa-lab/src/auth-profile-fixture.ts (added, +177/-0)
  • extensions/qa-lab/src/cli.runtime.test.ts (modified, +282/-0)
  • extensions/qa-lab/src/cli.runtime.ts (modified, +416/-3)
  • extensions/qa-lab/src/cli.ts (modified, +175/-7)
  • extensions/qa-lab/src/codex-plugin-fixture.ts (added, +282/-0)
  • extensions/qa-lab/src/codex-plugin-lifecycle.test.ts (added, +190/-0)
  • extensions/qa-lab/src/gateway-child.ts (modified, +7/-0)
  • extensions/qa-lab/src/harness-parity.test.ts (added, +144/-0)
  • extensions/qa-lab/src/harness-parity.ts (added, +415/-0)
  • extensions/qa-lab/src/jsonl-replay.test.ts (added, +169/-0)
  • extensions/qa-lab/src/jsonl-replay.ts (added, +270/-0)
  • extensions/qa-lab/src/multipass.runtime.test.ts (modified, +11/-0)
  • extensions/qa-lab/src/multipass.runtime.ts (modified, +6/-0)
  • extensions/qa-lab/src/providers/mock-openai/server.ts (modified, +74/-3)
  • extensions/qa-lab/src/runtime-parity.test.ts (added, +427/-0)
  • extensions/qa-lab/src/runtime-parity.ts (added, +1119/-0)
  • extensions/qa-lab/src/runtime-suite.test.ts (added, +75/-0)
  • extensions/qa-lab/src/runtime-suite.ts (added, +147/-0)
  • extensions/qa-lab/src/runtime-tool-fixture.test.ts (added, +156/-0)
  • extensions/qa-lab/src/runtime-tool-fixture.ts (added, +291/-0)
  • extensions/qa-lab/src/runtime-tool-metadata.ts (added, +142/-0)
  • extensions/qa-lab/src/scenario-catalog.test.ts (modified, +10/-0)
  • extensions/qa-lab/src/scenario-catalog.ts (modified, +4/-0)
  • extensions/qa-lab/src/scenario-flow-runner.ts (modified, +1/-1)
  • extensions/qa-lab/src/scenario-runtime-api.test.ts (modified, +1/-0)
  • extensions/qa-lab/src/scenario-runtime-api.ts (modified, +3/-0)
  • extensions/qa-lab/src/suite-runtime-flow.ts (modified, +13/-1)
  • extensions/qa-lab/src/suite-summary.ts (modified, +4/-1)
  • extensions/qa-lab/src/suite.summary-json.test.ts (modified, +53/-0)
  • extensions/qa-lab/src/suite.test.ts (modified, +100/-0)
  • extensions/qa-lab/src/suite.ts (modified, +449/-2)
  • extensions/qa-lab/src/token-efficiency-report.test.ts (added, +218/-0)
  • extensions/qa-lab/src/token-efficiency-report.ts (added, +379/-0)
  • extensions/qa-lab/src/tool-coverage-report.test.ts (added, +288/-0)
  • extensions/qa-lab/src/tool-coverage-report.ts (added, +285/-0)
  • extensions/qa-lab/transport-parity-gate.md (added, +66/-0)
  • extensions/qqbot/src/bridge/tools/remind.test.ts (modified, +1/-1)
  • extensions/qqbot/src/engine/gateway/outbound-dispatch.test.ts (modified, +1/-1)
  • extensions/slack/src/monitor/media.test.ts (modified, +3/-3)
  • extensions/tavily/src/tavily-tools.test.ts (modified, +3/-1)
  • qa/scenarios/agents/instruction-followthrough-repo-contract.md (modified, +1/-0)
  • qa/scenarios/agents/subagent-fanout-synthesis.md (modified, +1/-0)
  • qa/scenarios/agents/subagent-handoff.md (modified, +1/-0)
  • qa/scenarios/agents/subagent-stale-child-links.md (modified, +1/-0)
  • qa/scenarios/channels/channel-chat-baseline.md (modified, +1/-0)
  • qa/scenarios/config/config-restart-capability-flip.md (modified, +1/-0)
  • qa/scenarios/jsonl-replay/plan-mode-boundaries.jsonl (added, +8/-0)
  • qa/scenarios/jsonl-replay/recovery-partial-session.jsonl (added, +4/-0)
  • qa/scenarios/jsonl-replay/repo-triage-tool-loop.jsonl (added, +7/-0)
  • qa/scenarios/memory/memory-recall.md (modified, +1/-0)
  • qa/scenarios/memory/thread-memory-isolation.md (modified, +1/-0)
  • qa/scenarios/models/model-switch-tool-continuity.md (modified, +1/-0)
  • qa/scenarios/runtime/approval-turn-tool-followthrough.md (modified, +1/-0)
  • qa/scenarios/runtime/auth-profile-codex-mixed-profiles.md (added, +39/-0)
  • qa/scenarios/runtime/auth-profile-doctor-migration-safety.md (added, +44/-0)
  • qa/scenarios/runtime/codex-plugin-cold-install.md (added, +42/-0)
  • qa/scenarios/runtime/codex-plugin-install-race.md (added, +38/-0)
  • qa/scenarios/runtime/codex-plugin-pinned-new.md (added, +39/-0)
  • qa/scenarios/runtime/codex-plugin-pinned-old.md (added, +39/-0)
  • qa/scenarios/runtime/compaction-retry-mutating-tool.md (modified, +1/-0)
  • qa/scenarios/runtime/first-hour-20-turn.md (added, +68/-0)
  • qa/scenarios/runtime/soak-100-turn.md (added, +68/-0)
  • qa/scenarios/runtime/tools/apply-patch.md (added, +54/-0)
  • qa/scenarios/runtime/tools/bash.md (added, +55/-0)
  • qa/scenarios/runtime/tools/edit.md (added, +54/-0)
  • qa/scenarios/runtime/tools/exec.md (added, +54/-0)
  • qa/scenarios/runtime/tools/fs-list.md (added, +54/-0)
  • qa/scenarios/runtime/tools/fs-read.md (added, +54/-0)
  • qa/scenarios/runtime/tools/fs-write.md (added, +54/-0)
  • qa/scenarios/runtime/tools/grep.md (added, +54/-0)
  • qa/scenarios/runtime/tools/image-generate.md (added, +55/-0)
  • qa/scenarios/runtime/tools/memory-add.md (added, +54/-0)
  • qa/scenarios/runtime/tools/memory-recall.md (added, +54/-0)
  • qa/scenarios/runtime/tools/message-tool.md (added, +52/-0)
  • qa/scenarios/runtime/tools/session-status.md (added, +54/-0)
  • qa/scenarios/runtime/tools/sessions-spawn.md (added, +54/-0)
  • qa/scenarios/runtime/tools/skill-invocation.md (added, +54/-0)
  • qa/scenarios/runtime/tools/tavily-extract.md (added, +53/-0)
  • qa/scenarios/runtime/tools/tavily-search.md (added, +53/-0)
  • qa/scenarios/runtime/tools/tts.md (added, +54/-0)
  • qa/scenarios/runtime/tools/web-fetch.md (added, +54/-0)
  • qa/scenarios/runtime/tools/web-search.md (added, +54/-0)
  • qa/scenarios/workspace/source-docs-discovery-report.md (modified, +1/-0)
  • scripts/deadcode-unused-files.allowlist.mjs (modified, +2/-0)
  • src/agents/model-runtime-policy.test.ts (added, +91/-0)
  • src/agents/model-runtime-policy.ts (modified, +16/-0)

Codex-vs-Pi runtime parity QA harness (RFC + tracking)

Architecture

The matrix

scenarios × runtimes × plugin-states × auth-shapes × provider-mode

| Axis | Values | Purpose |
|------|--------|---------|
| scenarios | per-tool fixtures + jsonl-replay scenarios + existing agentic-parity scenarios | What the agent is asked to do |
| runtimes | pi, codex | The "primary subject" of the comparison — same model, forced runtime |
| plugin-states | codex-missing, codex-pinned-old, codex-current, codex-head | Stress codex-as-plugin lifecycle |
| auth-shapes | oauth-only, apikey-only, mixed-profiles | Catches auth-selection bugs (#78499 class) |
| provider-mode | mock-openai (hermetic, default), live-frontier (real, gated) | Cost/speed vs realism trade-off |

The full Cartesian product is huge, so we run a small hermetic subset on every PR (mock-openai × codex-current × oauth-only across the per-tool fixtures) and the full live matrix on a schedule (release-checks workflow, gated behind OPENCLAW_BUILD_PRIVATE_QA=1).
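
To make the size difference concrete, the matrix can be enumerated and then filtered down to the per-PR subset. The sketch below is illustrative only: the axis values are copied from the table, while the type and function names (MatrixCell, expandMatrix, isHermeticPrCell) are hypothetical and not taken from qa-lab.

```typescript
// Axis values as listed in the matrix table.
const RUNTIMES = ["pi", "codex"] as const;
const PLUGIN_STATES = ["codex-missing", "codex-pinned-old", "codex-current", "codex-head"] as const;
const AUTH_SHAPES = ["oauth-only", "apikey-only", "mixed-profiles"] as const;
const PROVIDER_MODES = ["mock-openai", "live-frontier"] as const;

interface MatrixCell {
  scenario: string;
  runtime: (typeof RUNTIMES)[number];
  pluginState: (typeof PLUGIN_STATES)[number];
  authShape: (typeof AUTH_SHAPES)[number];
  providerMode: (typeof PROVIDER_MODES)[number];
}

// Full Cartesian product: every scenario under every axis combination.
function expandMatrix(scenarios: readonly string[]): MatrixCell[] {
  const cells: MatrixCell[] = [];
  for (const scenario of scenarios) {
    for (const runtime of RUNTIMES) {
      for (const pluginState of PLUGIN_STATES) {
        for (const authShape of AUTH_SHAPES) {
          for (const providerMode of PROVIDER_MODES) {
            cells.push({ scenario, runtime, pluginState, authShape, providerMode });
          }
        }
      }
    }
  }
  return cells;
}

// Per-PR subset: hermetic provider, current plugin, oauth-only auth.
function isHermeticPrCell(cell: MatrixCell): boolean {
  return (
    cell.providerMode === "mock-openai" &&
    cell.pluginState === "codex-current" &&
    cell.authShape === "oauth-only"
  );
}

const allCells = expandMatrix(["bash", "grep"]);
const prCells = allCells.filter(isHermeticPrCell);
console.log(`${allCells.length} cells in the full matrix, ${prCells.length} on every PR`);
```

Each scenario expands to 48 cells in the full matrix but only 2 (one per runtime) in the per-PR subset, which is why the live matrix is reserved for scheduled runs.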

Per-cell capture

For every cell of the matrix, emit:

  • transcript-bytes — full JSONL of the turn chain (already produced by qa-lab; just needs runtime tagging).
  • tool-calls[] — ordered list of { tool_name, args_hash, result_hash, error_class? }.
  • final-text — assistant final answer text, normalized for whitespace.
  • usage — { input_tokens, output_tokens, total_tokens, cache_read?, cache_write? }. Aggregate per-turn and per-scenario.
  • wall-clock-ms, transport-error-class?, runtime-error-class?.
  • boot-state — gateway.err.log lines containing FailoverError, No API key found, Codex app-server, etc.
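
One possible shape for the capture record is sketched below, together with two helpers the list implies: whitespace normalization for final-text and a key-order-independent hash for args_hash. The field names are camelCased versions of the bullets above, and the hash function (32-bit FNV-1a) is a stand-in chosen for the example; the RFC does not say which hash the harness uses.

```typescript
interface ToolCallRecord {
  tool_name: string;
  args_hash: string;
  result_hash: string;
  error_class?: string;
}

interface Usage {
  input_tokens: number;
  output_tokens: number;
  total_tokens: number;
  cache_read?: number;
  cache_write?: number;
}

// Hypothetical per-cell record mirroring the capture list.
interface RuntimeCell {
  runtime: "pi" | "codex";
  transcriptBytes: string; // full JSONL of the turn chain
  toolCalls: ToolCallRecord[];
  finalText: string;
  usage: Usage;
  wallClockMs: number;
  transportErrorClass?: string;
  runtimeErrorClass?: string;
  bootState: string[]; // matching gateway.err.log lines
}

// Collapse whitespace so cosmetic differences do not register as drift.
function normalizeFinalText(text: string): string {
  return text.trim().replace(/\s+/g, " ");
}

// Serialize with sorted object keys so {a,b} and {b,a} hash identically.
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`).join(",")}}`;
}

// 32-bit FNV-1a over UTF-16 code units; an illustrative, non-cryptographic hash.
function hashArgs(args: unknown): string {
  const text = stableStringify(args);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

console.log(hashArgs({ path: "README.md", limit: 10 }) === hashArgs({ limit: 10, path: "README.md" }));
```

Hashing a canonical serialization rather than the raw argument object keeps the comparison insensitive to key order, which can differ between runtimes without being a real difference.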

Drift classifier

When transcripts differ between the pi and codex cells of the same scenario, classify:

  • text-only — final answers differ in wording but mean the same thing (allowed within model-eval tolerance, same rubric the existing agentic-parity-report.test.ts uses).
  • tool-call-shape — different tools called, different arg shapes, different ordering.
  • tool-result-shape — same tool called but result is interpreted differently.
  • structural — different turn count, different phase structure, missing/extra final answer.
  • failure-mode — one cell errors, the other doesn't.

The harness reports drift category per scenario, not just pass/fail. This is what makes it actionable for "lots of tools break under codex" — you see exactly which tool family drifts.
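
A classifier over two captured cells might look like the sketch below. It is an assumption-laden illustration: the Cell shape is simplified, the order in which categories are checked is a guess, and text-only is reduced to a plain string comparison rather than the rubric-based tolerance described above.

```typescript
type DriftCategory =
  | "none"
  | "text-only"
  | "tool-call-shape"
  | "tool-result-shape"
  | "structural"
  | "failure-mode";

interface ToolCall {
  tool_name: string;
  args_hash: string;
  result_hash: string;
  error_class?: string;
}

// Simplified, hypothetical cell: only the fields the classifier reads.
interface Cell {
  turnCount: number;
  finalText: string | null; // null when no final answer was produced
  toolCalls: ToolCall[];
  errorClass?: string; // set when the whole cell errored
}

function classifyDrift(pi: Cell, codex: Cell): DriftCategory {
  // failure-mode: exactly one of the two cells errored.
  if ((pi.errorClass !== undefined) !== (codex.errorClass !== undefined)) {
    return "failure-mode";
  }
  // structural: turn count differs, or a final answer is missing on one side.
  if (pi.turnCount !== codex.turnCount || (pi.finalText === null) !== (codex.finalText === null)) {
    return "structural";
  }
  // tool-call-shape: different tools, arguments, or ordering.
  const callShapeDiffers =
    pi.toolCalls.length !== codex.toolCalls.length ||
    pi.toolCalls.some((a, i) => {
      const b = codex.toolCalls[i];
      return b === undefined || a.tool_name !== b.tool_name || a.args_hash !== b.args_hash;
    });
  if (callShapeDiffers) return "tool-call-shape";
  // tool-result-shape: same calls, but a result or tool-level error differs.
  const resultDiffers = pi.toolCalls.some((a, i) => {
    const b = codex.toolCalls[i];
    return b === undefined || a.result_hash !== b.result_hash || a.error_class !== b.error_class;
  });
  if (resultDiffers) return "tool-result-shape";
  // text-only: everything structural matches, only the wording differs.
  return pi.finalText !== codex.finalText ? "text-only" : "none";
}

const piCell: Cell = {
  turnCount: 2,
  finalText: "done",
  toolCalls: [{ tool_name: "read", args_hash: "aa", result_hash: "r1" }],
};
const codexCell: Cell = {
  turnCount: 2,
  finalText: "done",
  toolCalls: [{ tool_name: "read", args_hash: "aa", result_hash: "r2", error_class: "unsupported-call" }],
};
console.log(classifyDrift(piCell, codexCell));
```

The sample cells model the same planned tool call with the same arguments but a different result on one side, which this sketch places in the tool-result-shape bucket.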

Token-efficiency report

For live-mode runs: per-scenario, side-by-side table:

scenario              | pi tokens | codex tokens | Δ      | tools used
----------------------|-----------|--------------|--------|----------
bash-list-files       |   1,240   |    1,180     | -4.8%  | bash
exec-approval-loop    |   3,840   |    4,210     | +9.6%  | exec, message-tool
web-search-then-fetch |   2,100   |    1,950     | -7.1%  | web_search, web_fetch
                       ...
TOTAL                 |  N        |   M          |  ±x%   | -

Plus per-runtime aggregates (total, p50, p90 per turn) and a flag when delta >15% so model-cost regressions surface as PR-blockers.
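
The arithmetic behind the table and the 15% flag can be sketched as follows. The row values are the illustrative ones from the table above; the function names are hypothetical, the flag is assumed to fire only when codex uses more tokens than pi, and the percentile uses the nearest-rank method, which the RFC does not specify.

```typescript
interface TokenRow {
  scenario: string;
  piTokens: number;
  codexTokens: number;
}

// Relative change of codex against pi, in percent.
function deltaPercent(piTokens: number, codexTokens: number): number {
  if (piTokens === 0) return codexTokens === 0 ? 0 : Infinity;
  return ((codexTokens - piTokens) / piTokens) * 100;
}

function formatDelta(delta: number): string {
  return `${delta > 0 ? "+" : ""}${delta.toFixed(1)}%`;
}

// Assumption: only increases above the threshold count as cost regressions.
function flaggedScenarios(rows: readonly TokenRow[], thresholdPercent = 15): string[] {
  return rows
    .filter((row) => deltaPercent(row.piTokens, row.codexTokens) > thresholdPercent)
    .map((row) => row.scenario);
}

// Nearest-rank percentile over a list of token counts.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? NaN;
}

const rows: TokenRow[] = [
  { scenario: "bash-list-files", piTokens: 1240, codexTokens: 1180 },
  { scenario: "exec-approval-loop", piTokens: 3840, codexTokens: 4210 },
  { scenario: "web-search-then-fetch", piTokens: 2100, codexTokens: 1950 },
];

for (const row of rows) {
  console.log(`${row.scenario}: ${formatDelta(deltaPercent(row.piTokens, row.codexTokens))}`);
}
const piTotal = rows.reduce((sum, row) => sum + row.piTokens, 0);
const codexTotal = rows.reduce((sum, row) => sum + row.codexTokens, 0);
console.log(`TOTAL: ${formatDelta(deltaPercent(piTotal, codexTotal))}`);
console.log(`flagged: ${flaggedScenarios(rows).length}`);
```

With these sample rows none of the scenarios crosses the threshold, so the flagged list is empty.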

Components — file-level layout

| New / extended file | Purpose | Phase |
|---------------------|---------|-------|
| extensions/qa-lab/src/runtime-parity.ts (new) | Orchestrator: takes a scenario, runs it twice with pi and codex forced, returns per-cell capture | 1 |
| extensions/qa-lab/src/runtime-parity.test.ts (new) | Unit tests for orchestrator + drift classifier | 1 |
| src/agents/model-runtime-policy.ts (extend) | Add an OPENCLAW_QA_FORCE_RUNTIME env-var seam (test-only) so the harness can override agentRuntime.id resolution without mutating user config. Document as test-only in the export's JSDoc. | 1 |
| extensions/qa-lab/src/agentic-parity-report.ts (extend) | Add runtime field to per-cell summary, runtimeDrift rollup section | 1 |
| extensions/qa-lab/src/cli.ts (extend) | New qa suite --runtime-pair pi,codex flag, propagates to suite runner | 1 |
| qa/scenarios/runtime/tools/<tool>.md (new) | One scenario per tool family (see Phase 2 list below) | 2 |
| extensions/qa-lab/src/codex-plugin-fixture.ts (new) | Helpers to seed ~/.openclaw/npm/node_modules/@openclaw/codex to a known version (or absent) before a cell | 3 |
| extensions/qa-lab/src/codex-plugin-lifecycle.test.ts (new) | Asserts doctor + first-turn flow under each plugin-state | 3 |
| extensions/qa-lab/src/token-efficiency-report.ts (new) | Side-by-side token report; integrates into qa parity-report | 4 |
| extensions/qa-lab/src/jsonl-replay.ts (new) | Replays real captured session transcripts through both runtimes | 5 |
| .github/workflows/openclaw-release-checks.yml (extend) | Wire the runtime-pair lane into the same matrix that already runs the model-pair lane | 1 |

Phasing — five PRs, staged

Sub-issues filed:

  • Phase 1 — Runtime axis: #80172
  • Phase 2 — Per-tool fixture set: #80173
  • Phase 3 — Codex-plugin lifecycle: #80174
  • Phase 4 — Token-efficiency report: #80175
  • Phase 5 — JSONL replay: #80176

Phase 1 — Runtime axis (smallest, lands first)

Scope: add the runtime dimension to the existing parity machinery. Reuse current scenarios; do not add new fixtures yet.

Files: runtime-parity.ts, model-runtime-policy.ts extension, agentic-parity-report.ts extension, cli.ts flag, workflow wiring, tests.

Acceptance:

  • pnpm openclaw qa suite --provider-mode mock-openai --parity-pack agentic --runtime-pair pi,codex runs each existing agentic scenario twice (once per runtime) and produces a summary with a runtime field per cell.
  • The drift classifier is implemented and emits one of {none, text-only, tool-call-shape, tool-result-shape, structural, failure-mode} per scenario.
  • New qa parity-report mode --runtime-axis produces a side-by-side table.
  • The OPENCLAW_QA_FORCE_RUNTIME=pi|codex env var is read at policy-resolution time, documented as test-only, and honored only when OPENCLAW_BUILD_PRIVATE_QA=1.
  • CI wiring: a new step in .github/workflows/openclaw-release-checks.yml (folded into the same matrix as #74622's parity lane) running the runtime-pair on the existing scenarios.
  • All existing parity tests still green; no behavior change for non-QA users.
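
A minimal sketch of how the gated override could look, assuming the policy resolver can be handed the environment as a plain object. Only the two env-var names come from this RFC; `resolveRuntimeId`, `RuntimeId`, and `Env` are hypothetical names for illustration, not the actual exports of model-runtime-policy.ts.

```typescript
// Hypothetical sketch: only the two env-var names come from the RFC.
type RuntimeId = "pi" | "codex";
type Env = Record<string, string | undefined>;

function isRuntimeId(value: string): value is RuntimeId {
  return value === "pi" || value === "codex";
}

/**
 * Test-only seam (assumed shape): returns the forced runtime when the
 * private-QA build flag is set, otherwise the normally resolved runtime.
 */
function resolveRuntimeId(configured: RuntimeId, env: Env): RuntimeId {
  // Outside private QA builds the override is ignored entirely.
  if (env["OPENCLAW_BUILD_PRIVATE_QA"] !== "1") return configured;

  const forced = env["OPENCLAW_QA_FORCE_RUNTIME"];
  if (forced === undefined || forced === "") return configured;

  // Fail loudly on typos instead of silently falling back.
  if (!isRuntimeId(forced)) {
    throw new Error(
      `OPENCLAW_QA_FORCE_RUNTIME must be "pi" or "codex", got "${forced}"`,
    );
  }
  return forced;
}
```

Passing the environment in as an argument, rather than reading a global, keeps the seam easy to unit-test and makes it obvious that user config is never mutated.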

Phase 2 — Per-tool fixture set

Scope: one fixture per tool family. Each fixture is deterministic: prompts the agent in a way that forces exactly one tool call with predictable arguments. Asserts the tool was invoked, completed, and result shape matches between runtimes.

Tool families (from src/agents/pi-tools.create-openclaw-coding-tools.ts and codex harness contract — exact list to be confirmed in the PR):

  • bash
  • exec (approval flow)
  • fs.read, fs.write, fs.list
  • grep
  • edit / apply-patch
  • web_search, web_fetch
  • tavily_search, tavily_extract
  • image_generate
  • tts
  • message-tool (send + media variants)
  • session_status, sessions_spawn
  • memory.recall, memory.add (if pi-only, mark as expected drift)
  • skill_* invocations

Acceptance:

  • Each tool has a qa/scenarios/runtime/tools/<tool>.md fixture.
  • Each fixture passes both cells when run under --runtime-pair pi,codex against current main, OR is annotated with a known-broken marker that points at a tracking issue.
  • The runtime-parity report enumerates per-tool drift, not just per-scenario.
  • A pnpm openclaw qa tool-coverage --runtime-pair pi,codex command produces a Markdown table of "tool X: pi=✅ codex=❌ #issue" for the README of the harness.
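
The coverage table from the last acceptance item could be rendered along these lines. This is a sketch under assumptions: the row shape (`ToolCoverageRow`) and the function name are hypothetical, and the real command may emit different columns.

```typescript
// Hypothetical row shape; the real report fields may differ.
type CellStatus = "pass" | "fail";

interface ToolCoverageRow {
  tool: string;
  pi: CellStatus;
  codex: CellStatus;
  issue?: string; // tracking issue reference for a known-broken cell
}

function renderToolCoverage(rows: ToolCoverageRow[]): string {
  const mark = (status: CellStatus): string => (status === "pass" ? "✅" : "❌");
  const lines = [
    "| tool | pi | codex | tracking |",
    "| --- | --- | --- | --- |",
  ];
  for (const row of rows) {
    lines.push(
      `| ${row.tool} | ${mark(row.pi)} | ${mark(row.codex)} | ${row.issue ?? ""} |`,
    );
  }
  return lines.join("\n");
}
```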

Phase 3 — Codex-plugin lifecycle harness

Scope: stress the codex-plugin install / update / version-pinning flows that pash flagged.

Cells (from the bug clusters and ai-hpc's manual matrix):

  1. Cold install — clean home, no codex plugin → openclaw doctor --fix from a config that needs codex → assert remediation message clear, install completes, retry succeeds, no $ leakage to api-key path.
  2. OAuth-only with mixed-profiles — both openai-codex:* and openai:* profiles in auth-profiles.json → assert codex auth picked, not the api-key (#78499 case).
  3. Pinned-old codex plugin + new openclaw — codex plugin pinned to release N-1, openclaw on N → assert version mismatch detected and reported with a clear remediation hint.
  4. Pinned-new codex plugin + old openclaw — same axis flipped.
  5. Codex plugin install racing first agent turn — concurrent install + agent run → assert ordering doesn't lose tokens or produce a duplicate response.
  6. Doctor migration safety — codify @ai-hpc's four manual cells as automated checks: oauth-only, mixed-profile, mixed + defaults pin, mixed + per-agent pin → assert doctor --fix strips pins and codex auto-routes.

Acceptance:

  • Each cell is automated, runs in mock-openai mode, and completes in under 60 s.
  • Failure modes have asserted error messages (string match) so any wording regression is caught.
  • Live-mode variant gated to scheduled runs.
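
For cell 5 (install racing the first agent turn), one way to avoid timing-based assertions is an explicit gate that the test itself releases. The sketch below is illustrative only: `createGate` and `runInstallRaceCell` are hypothetical names, and the two async functions stand in for the real install and agent-turn code paths.

```typescript
// Hypothetical helper: a gate the test opens manually, so ordering is
// decided by the test rather than by sleeps or wall-clock timing.
function createGate(): { wait: Promise<void>; release: () => void } {
  let release!: () => void;
  const wait = new Promise<void>((resolve) => {
    release = resolve;
  });
  return { wait, release };
}

async function runInstallRaceCell(): Promise<string[]> {
  const events: string[] = [];
  const installFinished = createGate();

  // Stand-in for the plugin install; it completes only once the gate opens.
  const install = (async () => {
    events.push("install:start");
    await installFinished.wait;
    events.push("install:done");
  })();

  // Stand-in for the first agent turn; it is queued concurrently but must
  // not produce a response before the install has finished.
  const turn = (async () => {
    events.push("turn:queued");
    await installFinished.wait;
    events.push("turn:response");
  })();

  installFinished.release();
  await Promise.all([install, turn]);
  return events;
}
```

A cell built this way can assert that exactly one `turn:response` event exists and that it follows `install:done`, without depending on how long either step takes.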

Phase 4 — Token-efficiency report

Scope: capture and surface per-runtime token usage. Live mode only (mock-openai returns fixed counts so deltas there are meaningless).

Acceptance:

  • qa parity-report --runtime-axis --token-efficiency produces the side-by-side table described above.
  • Per-runtime aggregates: total, p50, p90, per-turn.
  • Flag any scenario whose token delta between the two runtimes exceeds 15%.
  • Stored as a release artifact for week-over-week tracking.
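
The aggregates and the 15% flag above can be sketched as follows. The nearest-rank percentile and the choice of Pi as the baseline for the relative delta are assumptions for illustration; the RFC does not specify either.

```typescript
interface TokenAggregate {
  total: number;
  p50: number;
  p90: number;
  perTurn: number; // mean tokens per turn
}

// Nearest-rank percentile over a sorted copy (assumed method).
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1]!;
}

// turnTokens holds one total-token count per assistant turn.
function aggregate(turnTokens: number[]): TokenAggregate {
  const total = turnTokens.reduce((sum, n) => sum + n, 0);
  return {
    total,
    p50: percentile(turnTokens, 50),
    p90: percentile(turnTokens, 90),
    perTurn: turnTokens.length === 0 ? 0 : total / turnTokens.length,
  };
}

// Relative delta measured against the Pi total (assumed baseline).
function exceedsDelta(piTotal: number, codexTotal: number, threshold = 0.15): boolean {
  if (piTotal === 0) return codexTotal !== 0;
  return Math.abs(codexTotal - piTotal) / piTotal > threshold;
}
```

For example, `aggregate([100, 200, 300, 400])` gives a total of 1000, p50 of 200, p90 of 400, and 250 tokens per turn; a Codex total of 1200 against a Pi total of 1000 is a 20% delta and would be flagged.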

Phase 5 — JSONL replay (lower priority, separate track)

Scope: Eva's "loop 3 agents on difficult scenarios from real jsonl session history."

Approach: take captured session transcripts (from a maintainer-supplied jsonl set, stripped of PII), extract user turns, replay through fresh sessions on each runtime. Diff trajectories.

Acceptance: harness accepts a directory of jsonl, runs each through --runtime-pair, produces a drift report with the same drift classifier from Phase 1. PR is gated behind a curated fixture set so it can land without a real-customer transcript dump.
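
A rough sketch of the "extract user turns" step, assuming each JSONL line is a JSON object with string `role` and `content` fields. That line shape is an assumption; the actual session transcript schema is not given in this RFC and may nest content differently.

```typescript
// Assumed line shape; the real transcript schema may differ.
interface TranscriptEntry {
  role: string;
  content: string;
}

function isTranscriptEntry(value: unknown): value is TranscriptEntry {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record["role"] === "string" && typeof record["content"] === "string";
}

// Returns the user prompts in order, ready to replay on each runtime.
// Malformed JSON lines make JSON.parse throw, which surfaces bad fixtures early.
function extractUserTurns(jsonl: string): string[] {
  const turns: string[] = [];
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    const parsed: unknown = JSON.parse(trimmed);
    if (isTranscriptEntry(parsed) && parsed.role === "user") {
      turns.push(parsed.content);
    }
  }
  return turns;
}
```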

Performance / cost budget

  • Hermetic on-PR runs (mock-openai, single auth-shape, codex-current only): target <5 min total for all scenarios across both runtimes. Parallelizable per scenario.
  • Full live release-checks lane: target <30 min with parallelism, gated behind OPENCLAW_BUILD_PRIVATE_QA=1.
  • Token-efficiency live runs: separate scheduled cron, not on every release; nightly is fine.

Out of scope

  • Cross-vendor model parity stays in the existing model-axis gate (#74290 / #79347).
  • CLI surface / message-clarity work like #77221.
  • Mobile/iOS replay — separate harness if needed.
  • Real-customer transcript ingestion — Phase 5 uses curated fixtures only.

Failure-mode taxonomy (for triage)

When the harness reports drift, the triage flow is:

  1. failure-mode drift = one runtime errors, the other doesn't → blocking. Open a P1 bug.
  2. structural drift = turn count or phase structure differs → likely blocking. Investigate before merging anything that touches that code path.
  3. tool-call-shape drift = wrong/missing tool → P1-P2 depending on the tool family.
  4. tool-result-shape drift = same tool, different parsing → P2 unless it changes outcomes.
  5. text-only drift within tolerance = expected; no action.
  6. text-only drift outside tolerance = model-eval rubric escalation.
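
The taxonomy implies a precedence order, which a classifier could apply from most to least severe. The sketch below assumes a simplified per-cell summary (`CellSummary` is hypothetical) and uses exact string equality for text; the real classifier is expected to apply the existing tolerance rubric to text-only drift instead.

```typescript
type Drift =
  | "none"
  | "text-only"
  | "tool-call-shape"
  | "tool-result-shape"
  | "structural"
  | "failure-mode";

// Hypothetical, simplified per-cell summary.
interface ToolCall {
  name: string;
  resultShape: string; // e.g. a stable key-signature of the parsed result
}

interface CellSummary {
  errored: boolean;
  turnCount: number;
  toolCalls: ToolCall[];
  finalText: string;
}

function classifyDrift(pi: CellSummary, codex: CellSummary): Drift {
  // One runtime errors and the other does not.
  if (pi.errored !== codex.errored) return "failure-mode";
  // Turn count differs.
  if (pi.turnCount !== codex.turnCount) return "structural";

  // Wrong or missing tool: compare the ordered list of tool names.
  const names = (cell: CellSummary) => JSON.stringify(cell.toolCalls.map((c) => c.name));
  if (names(pi) !== names(codex)) return "tool-call-shape";

  // Same tools, different result parsing.
  const shapes = (cell: CellSummary) => JSON.stringify(cell.toolCalls.map((c) => c.resultShape));
  if (shapes(pi) !== shapes(codex)) return "tool-result-shape";

  // Exact match stands in for the tolerance rubric here.
  if (pi.finalText !== codex.finalText) return "text-only";
  return "none";
}
```

Checking the most severe class first means a cell that both errors and differs in text is reported once, as failure-mode, which matches the triage order above.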

References

  • #78457 — original transport-parity gate proposal (this RFC supersedes its scope).
  • #78055, #78060, #78407 — bug cluster that motivates the harness.
  • #78499 — Codex auth profile selection (residual of #78407).
  • #79238 — most recent runtime-policy fix on main (changed how openai/* routes).
  • #74290 (closed) → #79347 (slim follow-up in flight) — sibling model-axis parity.
  • Closed #78512 — original transport-parity-gate.md design doc + reproduction test (test was reframed-out by #79238; doc lifts forward into this RFC).
  • extensions/qa-lab/transport-parity-gate.md — design-doc-only PR will be filed extracting the doc content from #78512 and updating it for current main.
  • @ai-hpc's four-cell manual matrix verification (Yesterday).

Handoff notes for the implementing agent

  • Read extensions/qa-lab/AGENTS.md and the scoped extensions/qa-lab/src/CLAUDE.md (if present) before touching code.
  • The OPENCLAW_QA_FORCE_RUNTIME seam is the only runtime-mutation surface added by Phase 1 — keep it gated and test-only. Do not let it leak into production code paths.
  • Phase 1 is the unblocker. Phases 2–4 can be parallelised once Phase 1 lands. Phase 5 is independent and lowest priority.
  • The drift classifier in Phase 1 must use the same rubric as the existing agentic-parity-report.test.ts for text-only drift to keep tolerance consistent across the two parity gates.
  • For Phase 3 cell 5 (install race), avoid timing-based assertions — use deterministic ordering primitives.
  • For Phase 4, capture usage at the assistant-message level (AssistantMessage.usage) rather than at the transport level — the transport-level shapes differ between Pi and Codex but the assistant-message shape is normalized.
  • Sub-issues are filed for each phase (#80172 through #80176, listed under Phasing). This issue is the tracking parent.

cc @pash @Eva-⚡🐑 @ai-hpc
