openclaw - ✅(Solved) Fix [QA harness] Mock approval followthrough emits undeclared read for Codex app-server lane [2 pull requests, 5 comments, 2 participants]

GitHub stats
openclaw/openclaw#80236 · Fetched 2026-05-11 03:17:11
Comments: 5 · Participants: 2 · Timeline: 12 · Reactions: 2
Timeline (top): cross-referenced ×6, commented ×5, renamed ×1

Error Message

unsupported call: read

In the Codex cell this string comes back as the read result and is folded into a successful-looking final answer. Per the original report, Codex should instead execute and surface the same read result shape as Pi for this scenario, or the runtime should fail with a clear structured tool error before final-answer synthesis.

Root Cause

The Codex default-runtime flip needs tool-level parity against the existing Pi path. During the Phase 1 harness proof run, the new runtime-pair lane caught a deterministic drift in an existing agentic scenario: both runtimes planned the same read call, but Codex produced an unsupported-call result instead of the file contents Pi received.

The original report read this as a Codex runtime regression. The later re-audit traced it to the mock provider, which emits a provider-level read function call from prompt text even when the Codex app-server lane does not declare read as an OpenClaw dynamic tool.

Fix / Workaround

The original issue overclaimed this as a P1 Codex runtime problem. A higher-confidence code-path audit shows the mock provider emits a provider-level read function call from prompt text even when the Codex app-server lane does not declare read as an OpenClaw dynamic tool. Codex intentionally owns workspace tools such as read/write/edit/exec/apply_patch natively rather than exposing them through the OpenClaw dynamic-tool bridge.

  • Codex dynamic tools intentionally exclude read, write, edit, apply_patch, exec, process, and update_plan.
  • The mock provider can still emit read based on prompt text.
  • Runtime parity previously preferred /debug/requests provider-plan snapshots over transcript-derived tool events.

PR fix notes

PR #80238: test(qa-lab): add Codex vs Pi runtime parity harness

Description (problem / solution / changelog)

Why

Codex is moving toward the default OpenAI runtime, but the existing release parity checks compare model behavior, not runtime behavior. That leaves a known blind spot: the same scenario and same model can pass under Pi while drifting under Codex at the tool layer.

This adds the Phase 1 runtime axis from #80172 so qa-lab can run each scenario once as pi and once as codex, capture per-runtime cells, and classify drift at the tool/result/structure/failure level instead of only reporting a session-level pass/fail.
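
The per-runtime comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code in runtime-parity.ts: the RuntimeCell fields, the classifyDrift function, and every bucket name other than tool-result-shape (which appears in the report excerpt further down) are assumptions made for the example.

```typescript
// Hypothetical cell shape: one scenario run under one runtime.
type ToolCall = { name: string; argsHash: string; resultHash: string };
type RuntimeCell = {
  runtime: "pi" | "codex";
  status: "pass" | "fail";
  toolCalls: ToolCall[];
};

// Illustrative buckets; only "tool-result-shape" is taken from the report excerpt.
type Drift = "none" | "tool-call-shape" | "tool-result-shape" | "failure-mode";

function classifyDrift(a: RuntimeCell, b: RuntimeCell): Drift {
  // A different number of calls means the runtimes diverged at the call level.
  if (a.toolCalls.length !== b.toolCalls.length) return "tool-call-shape";
  for (let i = 0; i < a.toolCalls.length; i++) {
    const x = a.toolCalls[i];
    const y = b.toolCalls[i];
    // Different tool or different arguments at the same position.
    if (x.name !== y.name || x.argsHash !== y.argsHash) return "tool-call-shape";
    // Same planned call, but the result differs.
    if (x.resultHash !== y.resultHash) return "tool-result-shape";
  }
  // Identical tool behavior, different session outcome.
  if (a.status !== b.status) return "failure-mode";
  return "none";
}

const pi: RuntimeCell = {
  runtime: "pi",
  status: "pass",
  toolCalls: [{ name: "read", argsHash: "same-args", resultHash: "file-contents" }],
};
const codex: RuntimeCell = {
  runtime: "codex",
  status: "fail",
  toolCalls: [{ name: "read", argsHash: "same-args", resultHash: "unsupported-call" }],
};

console.log(classifyDrift(pi, codex)); // prints "tool-result-shape"
```

Checking tool calls before the session status is a deliberate choice in this sketch: it reports the earliest point of divergence rather than the downstream pass/fail difference.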

Part of #80171. Closes #80172. Detected follow-up drift: #80236.

What Changed

  • Adds the private-QA-only OPENCLAW_QA_FORCE_RUNTIME=pi|codex override in resolveModelRuntimePolicy, gated by OPENCLAW_BUILD_PRIVATE_QA=1.
  • Adds extensions/qa-lab/src/runtime-parity.ts with the runtime cell shape, assistant-message usage capture, provider-side mock /debug/requests tool capture, and six-bucket drift classifier.
  • Adds qa suite --runtime-pair pi,codex and the runtime-axis report command qa parity-report --runtime-axis --summary <path>.
  • Extends suite summaries and runtime parity Markdown reporting with per-runtime cells and aggregate drift counts.
  • Wires qa_lab_runtime_parity_release_checks into openclaw-release-checks.yml next to the existing model-axis parity lane.
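
As a rough sketch of the gated override in the first bullet: the two environment variable names come from the PR description, while the function name resolveForcedRuntime, the Env type, and the exact-match validation are assumptions. The real logic lives in resolveModelRuntimePolicy and may differ.

```typescript
type Runtime = "pi" | "codex";
type Env = Record<string, string | undefined>;

// Returns the forced runtime, or undefined when the override is absent,
// invalid, or the build is not a private-QA build.
function resolveForcedRuntime(env: Env): Runtime | undefined {
  // The override is only honored when the private-QA gate is set.
  if (env["OPENCLAW_BUILD_PRIVATE_QA"] !== "1") return undefined;
  const raw = env["OPENCLAW_QA_FORCE_RUNTIME"];
  // Assumption: only the exact values "pi" and "codex" are accepted.
  return raw === "pi" || raw === "codex" ? raw : undefined;
}

console.log(
  resolveForcedRuntime({ OPENCLAW_BUILD_PRIVATE_QA: "1", OPENCLAW_QA_FORCE_RUNTIME: "codex" }),
); // prints "codex"
```

Returning undefined instead of throwing lets the caller fall back to its normal runtime policy when the override does not apply.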

Real-Behavior Proof

The harness caught a real drift in approval-turn-tool-followthrough:

OPENCLAW_BUILD_PRIVATE_QA=1 \
OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 \
OPENCLAW_QA_SUITE_PROGRESS=1 \
pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --scenario approval-turn-tool-followthrough \
  --concurrency 1 \
  --runtime-pair pi,codex \
  --output-dir .artifacts/qa-e2e/runtime-parity-proof-approval-remap5

The suite exits nonzero because drift is present, but it writes the runtime summary. The follow-up report command:

OPENCLAW_BUILD_PRIVATE_QA=1 \
OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 \
pnpm openclaw qa parity-report \
  --repo-root . \
  --runtime-axis \
  --summary .artifacts/qa-e2e/runtime-parity-proof-approval-remap5/qa-suite-summary.json \
  --output-dir .artifacts/qa-e2e/runtime-parity-proof-approval-remap5-report

Observed report excerpt:

| Tool-result-shape drift | 1 |

- Approval turn tool followthrough drift=tool-result-shape (tool result 1 differs (read)).

pi: pass (1 tool calls, 256 tokens)
codex: fail (1 tool calls, 176 tokens)

The captured cells show both runtimes planned read with the same args hash, while Codex returned unsupported call: read. I filed that runtime bug as #80236 instead of hiding it in the harness PR.

Verification

pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts \
  extensions/qa-lab/src/runtime-parity.test.ts \
  extensions/qa-lab/src/suite.summary-json.test.ts \
  extensions/qa-lab/src/agentic-parity-report.test.ts \
  extensions/qa-lab/src/cli.runtime.test.ts \
  extensions/qa-lab/src/multipass.runtime.test.ts \
  extensions/qa-lab/src/suite.test.ts

pnpm exec vitest run --config test/vitest/vitest.agents.config.ts \
  src/agents/model-runtime-policy.test.ts

pnpm tsgo:core
pnpm tsgo:core:test
pnpm tsgo:extensions:test
pnpm check:test-types
pnpm exec oxlint --type-aware --tsconfig config/tsconfig/oxlint.json --allow eslint/no-underscore-dangle \
  extensions/qa-lab/src/runtime-parity.ts \
  extensions/qa-lab/src/runtime-parity.test.ts \
  extensions/qa-lab/src/suite.ts \
  extensions/qa-lab/src/suite.test.ts \
  extensions/qa-lab/src/suite-summary.ts \
  extensions/qa-lab/src/suite.summary-json.test.ts \
  extensions/qa-lab/src/agentic-parity-report.ts \
  extensions/qa-lab/src/agentic-parity-report.test.ts \
  extensions/qa-lab/src/cli.ts \
  extensions/qa-lab/src/cli.runtime.ts \
  extensions/qa-lab/src/cli.runtime.test.ts \
  extensions/qa-lab/src/multipass.runtime.ts \
  extensions/qa-lab/src/multipass.runtime.test.ts \
  extensions/qa-lab/src/gateway-child.ts \
  src/agents/model-runtime-policy.ts \
  src/agents/model-runtime-policy.test.ts

Non-Goals

  • Does not add the Phase 2 per-tool fixture set yet.
  • Does not add Phase 3 Codex plugin lifecycle cells yet.
  • Does not add Phase 4 token-efficiency reporting beyond capturing per-cell assistant-message usage.
  • Does not add Phase 5 JSONL replay yet.

Changed files

  • .github/workflows/openclaw-release-checks.yml (modified, +73/-0)
  • extensions/qa-lab/src/agentic-parity-report.test.ts (modified, +108/-0)
  • extensions/qa-lab/src/agentic-parity-report.ts (modified, +205/-0)
  • extensions/qa-lab/src/cli.runtime.test.ts (modified, +109/-0)
  • extensions/qa-lab/src/cli.runtime.ts (modified, +62/-2)
  • extensions/qa-lab/src/cli.ts (modified, +17/-7)
  • extensions/qa-lab/src/gateway-child.ts (modified, +7/-0)
  • extensions/qa-lab/src/multipass.runtime.test.ts (modified, +11/-0)
  • extensions/qa-lab/src/multipass.runtime.ts (modified, +6/-0)
  • extensions/qa-lab/src/runtime-parity.test.ts (added, +313/-0)
  • extensions/qa-lab/src/runtime-parity.ts (added, +899/-0)
  • extensions/qa-lab/src/suite-summary.ts (modified, +3/-0)
  • extensions/qa-lab/src/suite.summary-json.test.ts (modified, +53/-0)
  • extensions/qa-lab/src/suite.test.ts (modified, +47/-0)
  • extensions/qa-lab/src/suite.ts (modified, +372/-0)
  • src/agents/model-runtime-policy.test.ts (added, +91/-0)
  • src/agents/model-runtime-policy.ts (modified, +16/-0)

PR #80323: [qa-lab] Complete Codex vs Pi runtime parity harness phases 2-5

Description (problem / solution / changelog)

Summary

Adds the Codex-vs-Pi runtime parity QA harness across extensions/qa-lab, including runtime-pair execution, first-hour/depth suite selectors, harness-prompt parity, token-efficiency reporting, tool-default fixtures, JSONL replay scaffolding, and release-check wiring.

This update also corrects the tool-defaults mock lane so the harness matches Codex app-server architecture:

  • Codex-native workspace tools (read, write, edit, apply_patch, exec, process, update_plan) are no longer expected to appear as duplicate OpenClaw dynamic tools.
  • OpenClaw integration tools (image_generate, sessions, web, etc.) remain dynamic-tool parity rows and are tracked separately from Codex-native behavior rows.
  • Optional/profile/plugin-dependent tools stay report-only unless explicitly enabled.
  • Mock provider planned tool calls are captured as provider-plan diagnostics, not as runtime transcript tool evidence.
  • Tool coverage reports now show bucket, expected layer, required/report-only status, product impact, QA impact, and action.
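
The layering in the bullets above could be modeled along these lines. This is a sketch under assumptions: the Codex-native list is copied from the first bullet, but the Layer type, the function names, treating image_generate, sessions, and web as literal tool names, and defaulting unknown tools to the optional layer are all illustrative choices rather than the contents of runtime-tool-metadata.ts.

```typescript
type Layer = "codex-native" | "openclaw-dynamic" | "optional";

function expectedLayer(tool: string): Layer {
  switch (tool) {
    // Workspace tools that Codex owns natively (list taken from the PR text).
    case "read":
    case "write":
    case "edit":
    case "apply_patch":
    case "exec":
    case "process":
    case "update_plan":
      return "codex-native";
    // Assumed names for OpenClaw integration tools exposed as dynamic tools.
    case "image_generate":
    case "sessions":
    case "web":
      return "openclaw-dynamic";
    // Assumption: anything else is optional/profile/plugin and report-only.
    default:
      return "optional";
  }
}

// Only OpenClaw dynamic tools are required to appear as dynamic tools;
// Codex-native tools must not be demanded a second time through that bridge.
function requiredDynamicTools(tools: string[]): string[] {
  return tools.filter((tool) => expectedLayer(tool) === "openclaw-dynamic");
}

console.log(requiredDynamicTools(["read", "image_generate", "some_plugin_tool"]));
// prints [ 'image_generate' ] under Node.js
```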

Why

OpenClaw needs a maintainer-runnable gate that compares the same scenario/model under Pi and Codex before Codex becomes the default runtime. The gate must surface real runtime drift without turning mock-provider limitations or intentional Codex-native tool ownership into production bug reports.

Verification

Passing targeted/current-scope checks:

  • pnpm test extensions/qa-lab/src/runtime-tool-fixture.test.ts extensions/qa-lab/src/runtime-parity.test.ts extensions/qa-lab/src/tool-coverage-report.test.ts extensions/qa-lab/src/runtime-suite.test.ts extensions/qa-lab/src/suite.test.ts extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/cli.test.ts
  • pnpm tsgo:extensions:test
  • pnpm check:test-types
  • git diff --check

Real Behavior Proof

  • Behavior or issue addressed: Corrects the runtime parity tool-defaults harness so Codex-native workspace tools are no longer falsely required as duplicate OpenClaw dynamic tools, while OpenClaw dynamic integration rows remain visible and tracked.
  • Real environment tested: Local OpenClaw checkout at /Volumes/LEXAR/repos/openclaw-1 on branch codex-vs-pi-runtime-parity-tools, running the real pnpm openclaw qa CLI against the embedded gateway and mock OpenAI provider after this patch.
  • Exact steps or command run after this patch:
OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite \
  --repo-root . \
  --provider-mode mock-openai \
  --runtime-suite tool-defaults \
  --runtime-pair pi,codex \
  --output-dir .artifacts/qa-e2e/runtime-tools-correction

pnpm openclaw qa tool-coverage \
  --repo-root . \
  --summary .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json \
  --runtime-pair pi,codex \
  --output .artifacts/qa-e2e/runtime-tools-correction/qa-tool-coverage-report.md

OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite \
  --repo-root . \
  --provider-mode mock-openai \
  --runtime-suite openclaw-dynamic-tools \
  --runtime-pair pi,codex \
  --output-dir .artifacts/qa-e2e/openclaw-dynamic-tools-correction

pnpm openclaw qa parity-report \
  --repo-root . \
  --runtime-axis \
  --summary .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json \
  --output-dir .artifacts/qa-e2e/runtime-tools-correction/parity \
  --token-efficiency
  • Evidence after fix: Terminal output produced these real local artifacts: .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json, .artifacts/qa-e2e/runtime-tools-correction/qa-suite-report.md, .artifacts/qa-e2e/runtime-tools-correction/qa-tool-coverage-report.md, .artifacts/qa-e2e/openclaw-dynamic-tools-correction/qa-suite-summary.json, and .artifacts/qa-e2e/runtime-tools-correction/parity/qa-runtime-token-efficiency-report.md.
  • Observed result after fix: tool-defaults completed with 20 scenarios, 15 pass, 5 report-only skip, 0 fail. Tool coverage verdict was pass with 13 required tools, 8 Codex-native workspace tools, 5 OpenClaw dynamic integration tools, 7 optional/profile/plugin tools, and 0 failing tools. The focused openclaw-dynamic-tools suite completed with 5 report-only rows tracked under #80319. Token efficiency report verdict was pass with usage source mock-estimate.
  • What was not tested: Live frontier token-efficiency proof was not completed because local direct OpenAI auth is missing; optional scheduled/Testbox soak-100 proof was not completed; broad first-hour-20 remains red and is tracked in #80434.

Known Broad/Latest Blockers

  • The first first-hour-20 attempt hit a pre-suite tsdown SIGSEGV; the retry reached QA.
  • The first-hour-20 run is not green: 18 total, 6 pass, 12 fail; tracked in #80434. Command: OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite first-hour-20 --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/first-hour-20-correction-retry
  • pnpm check fails on an unrelated Discord lint error: #80428.
  • pnpm test fails in unrelated agents-core / ACPx / Mattermost shards: #80429, #80430, #80431, #67784.
  • Live token-efficiency proof path renders artifacts, but local direct OpenAI auth is missing so the attempted live run is not valid proof; tracked in #80175.
  • Optional soak-100 exists but is not scheduled/Testbox-wired; tracked in #80433.

Linked Issues

Umbrella/spec: #80171

Phase issues: #80172, #80173, #80174, #80175, #80176

Harness correction issues: #80236, #80312, #80319, #80320; #80321 is closed as fixed by this PR branch.

Fresh broad-rerun follow-ups: #80428, #80429, #80430, #80431, #80433, #80434, #67784

Changed files

  • .github/workflows/openclaw-release-checks.yml (modified, +115/-0)
  • .github/workflows/qa-live-transports-convex.yml (modified, +77/-0)
  • apps/shared/OpenClawKit/Sources/OpenClawProtocol/GatewayModels.swift (modified, +4/-0)
  • extensions/codex/src/app-server/schema-normalization-runtime-contract.test.ts (modified, +9/-4)
  • extensions/lmstudio/src/models.test.ts (modified, +1/-1)
  • extensions/qa-lab/src/agentic-parity-report.test.ts (modified, +120/-0)
  • extensions/qa-lab/src/agentic-parity-report.ts (modified, +218/-0)
  • extensions/qa-lab/src/auth-profile-fixture.ts (added, +177/-0)
  • extensions/qa-lab/src/cli.runtime.test.ts (modified, +282/-0)
  • extensions/qa-lab/src/cli.runtime.ts (modified, +416/-3)
  • extensions/qa-lab/src/cli.ts (modified, +175/-7)
  • extensions/qa-lab/src/codex-plugin-fixture.ts (added, +282/-0)
  • extensions/qa-lab/src/codex-plugin-lifecycle.test.ts (added, +190/-0)
  • extensions/qa-lab/src/gateway-child.ts (modified, +7/-0)
  • extensions/qa-lab/src/harness-parity.test.ts (added, +144/-0)
  • extensions/qa-lab/src/harness-parity.ts (added, +415/-0)
  • extensions/qa-lab/src/jsonl-replay.test.ts (added, +169/-0)
  • extensions/qa-lab/src/jsonl-replay.ts (added, +270/-0)
  • extensions/qa-lab/src/multipass.runtime.test.ts (modified, +11/-0)
  • extensions/qa-lab/src/multipass.runtime.ts (modified, +6/-0)
  • extensions/qa-lab/src/providers/mock-openai/server.ts (modified, +74/-3)
  • extensions/qa-lab/src/runtime-parity.test.ts (added, +427/-0)
  • extensions/qa-lab/src/runtime-parity.ts (added, +1119/-0)
  • extensions/qa-lab/src/runtime-suite.test.ts (added, +75/-0)
  • extensions/qa-lab/src/runtime-suite.ts (added, +147/-0)
  • extensions/qa-lab/src/runtime-tool-fixture.test.ts (added, +156/-0)
  • extensions/qa-lab/src/runtime-tool-fixture.ts (added, +291/-0)
  • extensions/qa-lab/src/runtime-tool-metadata.ts (added, +142/-0)
  • extensions/qa-lab/src/scenario-catalog.test.ts (modified, +10/-0)
  • extensions/qa-lab/src/scenario-catalog.ts (modified, +4/-0)
  • extensions/qa-lab/src/scenario-flow-runner.ts (modified, +1/-1)
  • extensions/qa-lab/src/scenario-runtime-api.test.ts (modified, +1/-0)
  • extensions/qa-lab/src/scenario-runtime-api.ts (modified, +3/-0)
  • extensions/qa-lab/src/suite-runtime-flow.ts (modified, +13/-1)
  • extensions/qa-lab/src/suite-summary.ts (modified, +4/-1)
  • extensions/qa-lab/src/suite.summary-json.test.ts (modified, +53/-0)
  • extensions/qa-lab/src/suite.test.ts (modified, +100/-0)
  • extensions/qa-lab/src/suite.ts (modified, +449/-2)
  • extensions/qa-lab/src/token-efficiency-report.test.ts (added, +218/-0)
  • extensions/qa-lab/src/token-efficiency-report.ts (added, +379/-0)
  • extensions/qa-lab/src/tool-coverage-report.test.ts (added, +288/-0)
  • extensions/qa-lab/src/tool-coverage-report.ts (added, +285/-0)
  • extensions/qa-lab/transport-parity-gate.md (added, +66/-0)
  • extensions/qqbot/src/bridge/tools/remind.test.ts (modified, +1/-1)
  • extensions/qqbot/src/engine/gateway/outbound-dispatch.test.ts (modified, +1/-1)
  • extensions/slack/src/monitor/media.test.ts (modified, +3/-3)
  • extensions/tavily/src/tavily-tools.test.ts (modified, +3/-1)
  • qa/scenarios/agents/instruction-followthrough-repo-contract.md (modified, +1/-0)
  • qa/scenarios/agents/subagent-fanout-synthesis.md (modified, +1/-0)
  • qa/scenarios/agents/subagent-handoff.md (modified, +1/-0)
  • qa/scenarios/agents/subagent-stale-child-links.md (modified, +1/-0)
  • qa/scenarios/channels/channel-chat-baseline.md (modified, +1/-0)
  • qa/scenarios/config/config-restart-capability-flip.md (modified, +1/-0)
  • qa/scenarios/jsonl-replay/plan-mode-boundaries.jsonl (added, +8/-0)
  • qa/scenarios/jsonl-replay/recovery-partial-session.jsonl (added, +4/-0)
  • qa/scenarios/jsonl-replay/repo-triage-tool-loop.jsonl (added, +7/-0)
  • qa/scenarios/memory/memory-recall.md (modified, +1/-0)
  • qa/scenarios/memory/thread-memory-isolation.md (modified, +1/-0)
  • qa/scenarios/models/model-switch-tool-continuity.md (modified, +1/-0)
  • qa/scenarios/runtime/approval-turn-tool-followthrough.md (modified, +1/-0)
  • qa/scenarios/runtime/auth-profile-codex-mixed-profiles.md (added, +39/-0)
  • qa/scenarios/runtime/auth-profile-doctor-migration-safety.md (added, +44/-0)
  • qa/scenarios/runtime/codex-plugin-cold-install.md (added, +42/-0)
  • qa/scenarios/runtime/codex-plugin-install-race.md (added, +38/-0)
  • qa/scenarios/runtime/codex-plugin-pinned-new.md (added, +39/-0)
  • qa/scenarios/runtime/codex-plugin-pinned-old.md (added, +39/-0)
  • qa/scenarios/runtime/compaction-retry-mutating-tool.md (modified, +1/-0)
  • qa/scenarios/runtime/first-hour-20-turn.md (added, +68/-0)
  • qa/scenarios/runtime/soak-100-turn.md (added, +68/-0)
  • qa/scenarios/runtime/tools/apply-patch.md (added, +54/-0)
  • qa/scenarios/runtime/tools/bash.md (added, +55/-0)
  • qa/scenarios/runtime/tools/edit.md (added, +54/-0)
  • qa/scenarios/runtime/tools/exec.md (added, +54/-0)
  • qa/scenarios/runtime/tools/fs-list.md (added, +54/-0)
  • qa/scenarios/runtime/tools/fs-read.md (added, +54/-0)
  • qa/scenarios/runtime/tools/fs-write.md (added, +54/-0)
  • qa/scenarios/runtime/tools/grep.md (added, +54/-0)
  • qa/scenarios/runtime/tools/image-generate.md (added, +55/-0)
  • qa/scenarios/runtime/tools/memory-add.md (added, +54/-0)
  • qa/scenarios/runtime/tools/memory-recall.md (added, +54/-0)
  • qa/scenarios/runtime/tools/message-tool.md (added, +52/-0)
  • qa/scenarios/runtime/tools/session-status.md (added, +54/-0)
  • qa/scenarios/runtime/tools/sessions-spawn.md (added, +54/-0)
  • qa/scenarios/runtime/tools/skill-invocation.md (added, +54/-0)
  • qa/scenarios/runtime/tools/tavily-extract.md (added, +53/-0)
  • qa/scenarios/runtime/tools/tavily-search.md (added, +53/-0)
  • qa/scenarios/runtime/tools/tts.md (added, +54/-0)
  • qa/scenarios/runtime/tools/web-fetch.md (added, +54/-0)
  • qa/scenarios/runtime/tools/web-search.md (added, +54/-0)
  • qa/scenarios/workspace/source-docs-discovery-report.md (modified, +1/-0)
  • scripts/deadcode-unused-files.allowlist.mjs (modified, +2/-0)
  • src/agents/model-runtime-policy.test.ts (added, +91/-0)
  • src/agents/model-runtime-policy.ts (modified, +16/-0)


Correction TLDR

Status: harness/mock-provider artifact, not a proven user-facing Codex app-server bug.

The original issue overclaimed this as a P1 Codex runtime problem. A higher-confidence code-path audit shows the mock provider emits a provider-level read function call from prompt text even when the Codex app-server lane does not declare read as an OpenClaw dynamic tool. Codex intentionally owns workspace tools such as read/write/edit/exec/apply_patch natively rather than exposing them through the OpenClaw dynamic-tool bridge.

What actually breaks: the QA parity harness is comparing a malformed mock-provider plan against the Codex app-server lane. This is not enough evidence that real users lose approval-followthrough reads.

Impact if OpenClaw moved fully to Codex today: P4 until live/native proof says otherwise. The remaining risk is harness fidelity and malformed mock-provider robustness, not a demonstrated production approval-read regression.

Correct Fix

  • Gate mock read planning on declared/available tools, or model Codex-native read through the real Codex app-server native tool protocol.
  • Keep mock provider-plan diagnostics separate from runtime transcript/tool-call evidence.
  • Reopen/escalate as a product bug only if a live/native Codex run shows approved reads fail outside this mock contract.
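
The first option in the list above, gating mock read planning on declared tools, might look something like this. It is a hedged sketch: planMockToolCall, PlannedCall, and the prompt-matching rule are hypothetical and do not reflect the actual mock-openai server code. The file name QA_KICKOFF_TASK.md is taken from the observed Pi cell.

```typescript
type PlannedCall = { name: string; args: Record<string, string> };

// Hypothetical mock planner: decides from prompt text whether to plan a read,
// but only if the lane actually declared a read tool.
function planMockToolCall(prompt: string, declaredTools: string[]): PlannedCall | undefined {
  const wantsRead = /\bread\b/i.test(prompt);
  if (!wantsRead) return undefined;
  // Gate: never emit a function call for a tool the lane did not declare.
  const readDeclared = declaredTools.some((tool) => tool === "read");
  if (!readDeclared) return undefined;
  return { name: "read", args: { path: "QA_KICKOFF_TASK.md" } };
}

const prompt = "Please read QA_KICKOFF_TASK.md and continue.";
console.log(planMockToolCall(prompt, ["read", "write"])?.name); // prints "read"
console.log(planMockToolCall(prompt, ["image_generate"])); // prints undefined
```

With this gate, a lane that omits read from its declared tools would no longer be handed a read call it cannot service, which is the mismatch the re-audit describes.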

Evidence From Re-audit

  • Codex dynamic tools intentionally exclude read, write, edit, apply_patch, exec, process, and update_plan.
  • The mock provider can still emit read based on prompt text.
  • Runtime parity previously preferred /debug/requests provider-plan snapshots over transcript-derived tool events.
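
To illustrate the last point, one way to keep the two evidence sources apart is to tag each captured tool event with its origin and compare only transcript-derived events. The ToolEvent shape and the splitToolEvidence function are assumptions for this sketch, not the harness's real data model.

```typescript
type ToolEvent = { name: string; source: "transcript" | "provider-plan" };

// Transcript events count as runtime evidence; provider-plan snapshots
// (for example from the mock /debug/requests capture) are diagnostics only.
function splitToolEvidence(events: ToolEvent[]): {
  evidence: ToolEvent[];
  diagnostics: ToolEvent[];
} {
  return {
    evidence: events.filter((event) => event.source === "transcript"),
    diagnostics: events.filter((event) => event.source === "provider-plan"),
  };
}

const captured: ToolEvent[] = [
  { name: "read", source: "provider-plan" },
  { name: "image_generate", source: "transcript" },
];
const split = splitToolEvidence(captured);
console.log(split.evidence.length, split.diagnostics.length); // prints 1 1
```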

Superseded Original Report

Parent: #80171
Detected by: #80172 (Phase 1 runtime-parity harness work)

Why this matters

The Codex default-runtime flip needs tool-level parity against the existing Pi path. During the Phase 1 harness proof run, the new runtime-pair lane caught a deterministic drift in an existing agentic scenario: both runtimes planned the same read call, but Codex produced an unsupported-call result instead of the file contents Pi received.

This is exactly the class of regression the runtime-parity harness is meant to make visible before Codex becomes the default OpenAI runtime.

Reproduction

From the Phase 1 branch:

OPENCLAW_BUILD_PRIVATE_QA=1 \
OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 \
OPENCLAW_QA_SUITE_PROGRESS=1 \
pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --scenario approval-turn-tool-followthrough \
  --concurrency 1 \
  --runtime-pair pi,codex \
  --output-dir .artifacts/qa-e2e/runtime-parity-proof-approval-remap5

OPENCLAW_BUILD_PRIVATE_QA=1 \
OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 \
pnpm openclaw qa parity-report \
  --repo-root . \
  --runtime-axis \
  --summary .artifacts/qa-e2e/runtime-parity-proof-approval-remap5/qa-suite-summary.json \
  --output-dir .artifacts/qa-e2e/runtime-parity-proof-approval-remap5-report

Observed

  • Scenario: approval-turn-tool-followthrough
  • Drift class: tool-result-shape
  • Pi cell: plans read, args hash 462521a229a053d20c4c8121cecce65e885c7d2b0f94347c1d4922445a701263, receives the QA_KICKOFF_TASK.md mission text, and passes.
  • Codex cell: plans the same read with the same args hash, but the provider-side result hash differs and the final assistant text is "Protocol note: I reviewed the requested material. Evidence snippet: unsupported call: read."
  • Codex then times out downstream, but the actionable difference is the tool result shape.
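
Comparing args hashes across runtimes only works if equal arguments always serialize the same way. The report does not say how the hash is computed (the 64-hex-character value is consistent with a SHA-256 digest, but that is a guess), so the sketch below shows only a canonicalization step with sorted keys. The Json type and the canonicalize function are hypothetical.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Serialize a JSON value with object keys in sorted order, so two argument
// objects with the same content always produce the same string to hash.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  const keys = Object.keys(value).sort();
  const fields = keys.map((key) => JSON.stringify(key) + ":" + canonicalize(value[key]));
  return "{" + fields.join(",") + "}";
}

const piArgs: Json = { path: "QA_KICKOFF_TASK.md", limit: 1 };
const codexArgs: Json = { limit: 1, path: "QA_KICKOFF_TASK.md" };
console.log(canonicalize(piArgs) === canonicalize(codexArgs)); // prints true
```

A digest computed over this string would match for both runtimes regardless of the key order each runtime happened to emit.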

Expected

Codex should execute and surface the same read result shape as Pi for this scenario, or the runtime should fail with a clear structured tool error before final-answer synthesis. It should not synthesize a successful-looking answer from unsupported call: read.

Links

  • Umbrella RFC/tracker: #80171
  • Phase 1 runtime-axis implementation issue: #80172
  • Related earlier bug cluster: #78055, #78060, #78407, #78499
