openclaw - ✅(Solved) Fix QA tool-defaults suite conflates Codex-native tools with OpenClaw dynamic tool parity [1 pull request, 6 comments, 2 participants]

GitHub stats
openclaw/openclaw#80319 (fetched 2026-05-11 03:16:17)
Comments: 6
Participants: 2
Timeline: 16
Reactions: 2
Timeline (top): cross-referenced ×8, commented ×6, renamed ×2

Root Cause

As originally reported: the runtime-parity harness is meant to catch exactly this class of default-runtime regression before Codex becomes the default OpenAI agent runtime. In the Phase 2 per-tool fixture suite, Pi sends the expected mock provider tool calls for many tool families, while Codex returns an acknowledgement without sending the planned tool call, causing the fixture to fail before tool-result comparison. The later correction on this issue attributes these failures to a harness/mock-provider incompatibility rather than a proven broad Codex runtime tool dropout.

Fix Action

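Fix / Workaround

Reproduction command from the original Phase 2 fixture run (the fix itself is described in the PR fix notes):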

OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --runtime-pair pi,codex \
  --output-dir .artifacts/qa-e2e/runtime-tools-phase2-full \
  --allow-failures \
  --concurrency 2 \
  --scenario runtime-tool-apply-patch \
  --scenario runtime-tool-bash \
  --scenario runtime-tool-edit \
  --scenario runtime-tool-exec \
  --scenario runtime-tool-fs-list \
  --scenario runtime-tool-fs-read \
  --scenario runtime-tool-fs-write \
  --scenario runtime-tool-grep \
  --scenario runtime-tool-image-generate \
  --scenario runtime-tool-memory-add \
  --scenario runtime-tool-memory-recall \
  --scenario runtime-tool-message-tool \
  --scenario runtime-tool-session-status \
  --scenario runtime-tool-sessions-spawn \
  --scenario runtime-tool-skill-invocation \
  --scenario runtime-tool-tavily-extract \
  --scenario runtime-tool-tavily-search \
  --scenario runtime-tool-tts \
  --scenario runtime-tool-web-fetch \
  --scenario runtime-tool-web-search
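
The long flag list above is easy to mistype, so here is a minimal sketch of assembling it programmatically. It uses only the flags and scenario names that appear in the command; the helper name `buildSuiteArgs` and the `toolScenarios` array are hypothetical and not part of the OpenClaw CLI or repository.

```typescript
// Hypothetical helper: rebuilds the argument list of the command shown above.
// Scenario suffixes are copied from the original --scenario flags.
const toolScenarios: string[] = [
  "apply-patch", "bash", "edit", "exec", "fs-list", "fs-read", "fs-write", "grep",
  "image-generate", "memory-add", "memory-recall", "message-tool", "session-status",
  "sessions-spawn", "skill-invocation", "tavily-extract", "tavily-search", "tts",
  "web-fetch", "web-search",
];

function buildSuiteArgs(outputDir: string, scenarios: string[]): string[] {
  const args: string[] = [
    "openclaw", "qa", "suite",
    "--provider-mode", "mock-openai",
    "--runtime-pair", "pi,codex",
    "--output-dir", outputDir,
    "--allow-failures",
    "--concurrency", "2",
  ];
  // One --scenario flag per tool fixture, prefixed as in the original command.
  for (const scenario of scenarios) {
    args.push("--scenario", `runtime-tool-${scenario}`);
  }
  return args;
}

const suiteArgs = buildSuiteArgs(".artifacts/qa-e2e/runtime-tools-phase2-full", toolScenarios);
console.log(["pnpm"].concat(suiteArgs).join(" "));
```

The environment variables from the original command (OPENCLAW_BUILD_PRIVATE_QA, OPENCLAW_ENABLE_PRIVATE_QA_CLI) would still need to be set by whatever launches the process.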

PR fix notes

PR #80323: [qa-lab] Complete Codex vs Pi runtime parity harness phases 2-5

Description (problem / solution / changelog)

Summary

Adds the Codex-vs-Pi runtime parity QA harness across extensions/qa-lab, including runtime-pair execution, first-hour/depth suite selectors, harness-prompt parity, token-efficiency reporting, tool-default fixtures, JSONL replay scaffolding, and release-check wiring.

This update also corrects the tool-defaults mock lane so the harness matches Codex app-server architecture:

  • Codex-native workspace tools (read, write, edit, apply_patch, exec, process, update_plan) are no longer expected to appear as duplicate OpenClaw dynamic tools.
  • OpenClaw integration tools (image_generate, sessions, web, etc.) remain dynamic-tool parity rows and are tracked separately from Codex-native behavior rows.
  • Optional/profile/plugin-dependent tools stay report-only unless explicitly enabled.
  • Mock provider planned tool calls are captured as provider-plan diagnostics, not as runtime transcript tool evidence.
  • Tool coverage reports now show bucket, expected layer, required/report-only status, product impact, QA impact, and action.
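
The bucket split described in the list above can be sketched roughly as follows. This is not the code in runtime-tool-metadata.ts or tool-coverage-report.ts; the type names, the `classifyTool` function, and the choice of which dynamic tools to list are assumptions made for illustration. The Codex-native names come from the summary, the dynamic names are an illustrative reading of "image_generate, sessions, web, etc.", and treating `tts` as optional is also an assumption.

```typescript
type ToolBucket =
  | "codex-native-workspace"
  | "openclaw-dynamic-integration"
  | "optional-profile-plugin";

interface ToolCoverageRow {
  tool: string;
  bucket: ToolBucket;
  status: "required" | "report-only";
}

// Lookup table: tool name -> bucket. Native names are taken from the PR summary;
// the dynamic entries are an assumed, illustrative subset.
const KNOWN_BUCKETS: Record<string, ToolBucket> = {
  read: "codex-native-workspace",
  write: "codex-native-workspace",
  edit: "codex-native-workspace",
  apply_patch: "codex-native-workspace",
  exec: "codex-native-workspace",
  process: "codex-native-workspace",
  update_plan: "codex-native-workspace",
  image_generate: "openclaw-dynamic-integration",
  session_status: "openclaw-dynamic-integration",
  sessions_spawn: "openclaw-dynamic-integration",
  web_fetch: "openclaw-dynamic-integration",
  web_search: "openclaw-dynamic-integration",
};

function classifyTool(tool: string, explicitlyEnabled: string[]): ToolCoverageRow {
  const known = Object.prototype.hasOwnProperty.call(KNOWN_BUCKETS, tool);
  if (known) {
    return { tool, bucket: KNOWN_BUCKETS[tool], status: "required" };
  }
  // Anything else is treated as optional/profile/plugin-dependent:
  // report-only unless it was explicitly enabled.
  const enabled = explicitlyEnabled.indexOf(tool) !== -1;
  return { tool, bucket: "optional-profile-plugin", status: enabled ? "required" : "report-only" };
}

const nativeRow = classifyTool("apply_patch", []);
const dynamicRow = classifyTool("web_search", []);
const optionalRow = classifyTool("tts", []);
const enabledRow = classifyTool("tts", ["tts"]);
console.log(nativeRow, dynamicRow, optionalRow, enabledRow);
```

The point of the sketch is that a Codex-native tool is never looked for among OpenClaw dynamic tools, so its absence there cannot register as a parity failure.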

Why

OpenClaw needs a maintainer-runnable gate that compares the same scenario/model under Pi and Codex before Codex becomes the default runtime. The gate must surface real runtime drift without turning mock-provider limitations or intentional Codex-native tool ownership into production bug reports.

Verification

Passing targeted/current-scope checks:

  • pnpm test extensions/qa-lab/src/runtime-tool-fixture.test.ts extensions/qa-lab/src/runtime-parity.test.ts extensions/qa-lab/src/tool-coverage-report.test.ts extensions/qa-lab/src/runtime-suite.test.ts extensions/qa-lab/src/suite.test.ts extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/cli.test.ts
  • pnpm tsgo:extensions:test
  • pnpm check:test-types
  • git diff --check

Real Behavior Proof

  • Behavior or issue addressed: Corrects the runtime parity tool-defaults harness so Codex-native workspace tools are no longer falsely required as duplicate OpenClaw dynamic tools, while OpenClaw dynamic integration rows remain visible and tracked.
  • Real environment tested: Local OpenClaw checkout at /Volumes/LEXAR/repos/openclaw-1 on branch codex-vs-pi-runtime-parity-tools, running the real pnpm openclaw qa CLI against the embedded gateway and mock OpenAI provider after this patch.
  • Exact steps or command run after this patch:
OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite tool-defaults --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/runtime-tools-correction
pnpm openclaw qa tool-coverage --repo-root . --summary .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json --runtime-pair pi,codex --output .artifacts/qa-e2e/runtime-tools-correction/qa-tool-coverage-report.md
OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite openclaw-dynamic-tools --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/openclaw-dynamic-tools-correction
pnpm openclaw qa parity-report --repo-root . --runtime-axis --summary .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json --output-dir .artifacts/qa-e2e/runtime-tools-correction/parity --token-efficiency
  • Evidence after fix: Terminal output produced these real local artifacts: .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json, .artifacts/qa-e2e/runtime-tools-correction/qa-suite-report.md, .artifacts/qa-e2e/runtime-tools-correction/qa-tool-coverage-report.md, .artifacts/qa-e2e/openclaw-dynamic-tools-correction/qa-suite-summary.json, and .artifacts/qa-e2e/runtime-tools-correction/parity/qa-runtime-token-efficiency-report.md.
  • Observed result after fix: tool-defaults completed with 20 scenarios, 15 pass, 5 report-only skip, 0 fail. Tool coverage verdict was pass with 13 required tools, 8 Codex-native workspace tools, 5 OpenClaw dynamic integration tools, 7 optional/profile/plugin tools, and 0 failing tools. The focused openclaw-dynamic-tools suite completed with 5 report-only rows tracked under #80319. Token efficiency report verdict was pass with usage source mock-estimate.
  • What was not tested: Live frontier token-efficiency proof was not completed because local direct OpenAI auth is missing; optional scheduled/Testbox soak-100 proof was not completed; broad first-hour-20 remains red and is tracked in #80434.
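
As a quick sanity check on the observed counts reported above, the arithmetic can be written out. The interface and function names are hypothetical and do not reflect the actual schema of qa-suite-summary.json; only the numbers are taken from the report.

```typescript
interface SuiteTally {
  total: number;
  pass: number;
  reportOnlySkip: number;
  fail: number;
}

interface CoverageTally {
  required: number;
  codexNative: number;
  openclawDynamic: number;
  optional: number;
  failing: number;
}

// Numbers copied from the "Observed result after fix" bullet.
const suiteTally: SuiteTally = { total: 20, pass: 15, reportOnlySkip: 5, fail: 0 };
const coverageTally: CoverageTally = {
  required: 13,
  codexNative: 8,
  openclawDynamic: 5,
  optional: 7,
  failing: 0,
};

// Every scenario should land in exactly one outcome.
function suiteAddsUp(t: SuiteTally): boolean {
  return t.pass + t.reportOnlySkip + t.fail === t.total;
}

// Required tools are reported as the native and dynamic buckets combined.
function coverageAddsUp(c: CoverageTally): boolean {
  return c.codexNative + c.openclawDynamic === c.required;
}

const suiteOk = suiteAddsUp(suiteTally);
const coverageOk = coverageAddsUp(coverageTally);
const toolsCovered = coverageTally.required + coverageTally.optional;
console.log(suiteOk, coverageOk, toolsCovered);
```

Here 15 + 5 + 0 = 20 and 8 + 5 = 13; the required and optional tools together also come to 20, which matches the scenario count, although the report does not state that the two figures are meant to be identical.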

Known Broad/Latest Blockers

  • First first-hour-20 attempt hit a pre-suite tsdown SIGSEGV; retry reached QA.
  • OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite first-hour-20 --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/first-hour-20-correction-retry is not green: 18 total, 6 pass, 12 fail; tracked in #80434.
  • pnpm check fails unrelated Discord lint: #80428.
  • pnpm test fails unrelated agents-core / ACPx / Mattermost shards: #80429, #80430, #80431, #67784.
  • Live token-efficiency proof path renders artifacts, but local direct OpenAI auth is missing so the attempted live run is not valid proof; tracked in #80175.
  • Optional soak-100 exists but is not scheduled/Testbox-wired; tracked in #80433.

Linked Issues

Umbrella/spec: #80171

Phase issues: #80172, #80173, #80174, #80175, #80176

Harness correction issues: #80236, #80312, #80319, #80320; #80321 is closed as fixed by this PR branch.

Fresh broad-rerun follow-ups: #80428, #80429, #80430, #80431, #80433, #80434, #67784

Changed files

  • .github/workflows/openclaw-release-checks.yml (modified, +115/-0)
  • .github/workflows/qa-live-transports-convex.yml (modified, +77/-0)
  • apps/shared/OpenClawKit/Sources/OpenClawProtocol/GatewayModels.swift (modified, +4/-0)
  • extensions/codex/src/app-server/schema-normalization-runtime-contract.test.ts (modified, +9/-4)
  • extensions/lmstudio/src/models.test.ts (modified, +1/-1)
  • extensions/qa-lab/src/agentic-parity-report.test.ts (modified, +120/-0)
  • extensions/qa-lab/src/agentic-parity-report.ts (modified, +218/-0)
  • extensions/qa-lab/src/auth-profile-fixture.ts (added, +177/-0)
  • extensions/qa-lab/src/cli.runtime.test.ts (modified, +282/-0)
  • extensions/qa-lab/src/cli.runtime.ts (modified, +416/-3)
  • extensions/qa-lab/src/cli.ts (modified, +175/-7)
  • extensions/qa-lab/src/codex-plugin-fixture.ts (added, +282/-0)
  • extensions/qa-lab/src/codex-plugin-lifecycle.test.ts (added, +190/-0)
  • extensions/qa-lab/src/gateway-child.ts (modified, +7/-0)
  • extensions/qa-lab/src/harness-parity.test.ts (added, +144/-0)
  • extensions/qa-lab/src/harness-parity.ts (added, +415/-0)
  • extensions/qa-lab/src/jsonl-replay.test.ts (added, +169/-0)
  • extensions/qa-lab/src/jsonl-replay.ts (added, +270/-0)
  • extensions/qa-lab/src/multipass.runtime.test.ts (modified, +11/-0)
  • extensions/qa-lab/src/multipass.runtime.ts (modified, +6/-0)
  • extensions/qa-lab/src/providers/mock-openai/server.ts (modified, +74/-3)
  • extensions/qa-lab/src/runtime-parity.test.ts (added, +427/-0)
  • extensions/qa-lab/src/runtime-parity.ts (added, +1119/-0)
  • extensions/qa-lab/src/runtime-suite.test.ts (added, +75/-0)
  • extensions/qa-lab/src/runtime-suite.ts (added, +147/-0)
  • extensions/qa-lab/src/runtime-tool-fixture.test.ts (added, +156/-0)
  • extensions/qa-lab/src/runtime-tool-fixture.ts (added, +291/-0)
  • extensions/qa-lab/src/runtime-tool-metadata.ts (added, +142/-0)
  • extensions/qa-lab/src/scenario-catalog.test.ts (modified, +10/-0)
  • extensions/qa-lab/src/scenario-catalog.ts (modified, +4/-0)
  • extensions/qa-lab/src/scenario-flow-runner.ts (modified, +1/-1)
  • extensions/qa-lab/src/scenario-runtime-api.test.ts (modified, +1/-0)
  • extensions/qa-lab/src/scenario-runtime-api.ts (modified, +3/-0)
  • extensions/qa-lab/src/suite-runtime-flow.ts (modified, +13/-1)
  • extensions/qa-lab/src/suite-summary.ts (modified, +4/-1)
  • extensions/qa-lab/src/suite.summary-json.test.ts (modified, +53/-0)
  • extensions/qa-lab/src/suite.test.ts (modified, +100/-0)
  • extensions/qa-lab/src/suite.ts (modified, +449/-2)
  • extensions/qa-lab/src/token-efficiency-report.test.ts (added, +218/-0)
  • extensions/qa-lab/src/token-efficiency-report.ts (added, +379/-0)
  • extensions/qa-lab/src/tool-coverage-report.test.ts (added, +288/-0)
  • extensions/qa-lab/src/tool-coverage-report.ts (added, +285/-0)
  • extensions/qa-lab/transport-parity-gate.md (added, +66/-0)
  • extensions/qqbot/src/bridge/tools/remind.test.ts (modified, +1/-1)
  • extensions/qqbot/src/engine/gateway/outbound-dispatch.test.ts (modified, +1/-1)
  • extensions/slack/src/monitor/media.test.ts (modified, +3/-3)
  • extensions/tavily/src/tavily-tools.test.ts (modified, +3/-1)
  • qa/scenarios/agents/instruction-followthrough-repo-contract.md (modified, +1/-0)
  • qa/scenarios/agents/subagent-fanout-synthesis.md (modified, +1/-0)
  • qa/scenarios/agents/subagent-handoff.md (modified, +1/-0)
  • qa/scenarios/agents/subagent-stale-child-links.md (modified, +1/-0)
  • qa/scenarios/channels/channel-chat-baseline.md (modified, +1/-0)
  • qa/scenarios/config/config-restart-capability-flip.md (modified, +1/-0)
  • qa/scenarios/jsonl-replay/plan-mode-boundaries.jsonl (added, +8/-0)
  • qa/scenarios/jsonl-replay/recovery-partial-session.jsonl (added, +4/-0)
  • qa/scenarios/jsonl-replay/repo-triage-tool-loop.jsonl (added, +7/-0)
  • qa/scenarios/memory/memory-recall.md (modified, +1/-0)
  • qa/scenarios/memory/thread-memory-isolation.md (modified, +1/-0)
  • qa/scenarios/models/model-switch-tool-continuity.md (modified, +1/-0)
  • qa/scenarios/runtime/approval-turn-tool-followthrough.md (modified, +1/-0)
  • qa/scenarios/runtime/auth-profile-codex-mixed-profiles.md (added, +39/-0)
  • qa/scenarios/runtime/auth-profile-doctor-migration-safety.md (added, +44/-0)
  • qa/scenarios/runtime/codex-plugin-cold-install.md (added, +42/-0)
  • qa/scenarios/runtime/codex-plugin-install-race.md (added, +38/-0)
  • qa/scenarios/runtime/codex-plugin-pinned-new.md (added, +39/-0)
  • qa/scenarios/runtime/codex-plugin-pinned-old.md (added, +39/-0)
  • qa/scenarios/runtime/compaction-retry-mutating-tool.md (modified, +1/-0)
  • qa/scenarios/runtime/first-hour-20-turn.md (added, +68/-0)
  • qa/scenarios/runtime/soak-100-turn.md (added, +68/-0)
  • qa/scenarios/runtime/tools/apply-patch.md (added, +54/-0)
  • qa/scenarios/runtime/tools/bash.md (added, +55/-0)
  • qa/scenarios/runtime/tools/edit.md (added, +54/-0)
  • qa/scenarios/runtime/tools/exec.md (added, +54/-0)
  • qa/scenarios/runtime/tools/fs-list.md (added, +54/-0)
  • qa/scenarios/runtime/tools/fs-read.md (added, +54/-0)
  • qa/scenarios/runtime/tools/fs-write.md (added, +54/-0)
  • qa/scenarios/runtime/tools/grep.md (added, +54/-0)
  • qa/scenarios/runtime/tools/image-generate.md (added, +55/-0)
  • qa/scenarios/runtime/tools/memory-add.md (added, +54/-0)
  • qa/scenarios/runtime/tools/memory-recall.md (added, +54/-0)
  • qa/scenarios/runtime/tools/message-tool.md (added, +52/-0)
  • qa/scenarios/runtime/tools/session-status.md (added, +54/-0)
  • qa/scenarios/runtime/tools/sessions-spawn.md (added, +54/-0)
  • qa/scenarios/runtime/tools/skill-invocation.md (added, +54/-0)
  • qa/scenarios/runtime/tools/tavily-extract.md (added, +53/-0)
  • qa/scenarios/runtime/tools/tavily-search.md (added, +53/-0)
  • qa/scenarios/runtime/tools/tts.md (added, +54/-0)
  • qa/scenarios/runtime/tools/web-fetch.md (added, +54/-0)
  • qa/scenarios/runtime/tools/web-search.md (added, +54/-0)
  • qa/scenarios/workspace/source-docs-discovery-report.md (modified, +1/-0)
  • scripts/deadcode-unused-files.allowlist.mjs (modified, +2/-0)
  • src/agents/model-runtime-policy.test.ts (added, +91/-0)
  • src/agents/model-runtime-policy.ts (modified, +16/-0)


Correction TLDR

Status: harness/mock-provider incompatibility, not a proven broad Codex runtime tool dropout.

The original issue overclaimed that Codex drops planned tool calls for most fixtures. The stronger audit shows the mock provider only plans direct tool calls when the target tool appears in body.tools. Codex app-server intentionally excludes native workspace tools from OpenClaw dynamic tools and defaults remaining dynamic tools to searchable/deferred loading, which this mock planner does not fully model.

What actually breaks: the QA tool-defaults mock lane cannot yet faithfully drive Codex native/searchable tool surfaces. The report was treating mock-planner absence as runtime tool loss.
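
To make the mechanism concrete, here is a minimal sketch of a planner that only plans a direct tool call when the target tool appears in body.tools. It is not the code in providers/mock-openai/server.ts; the simplified request shape, the type names, and `planMockResponse` are assumptions, and the empty Codex tool list stands in for "native tools excluded, dynamic tools deferred".

```typescript
// Simplified, assumed request shape: only the tool names matter here.
interface MockToolDef {
  name: string;
}

interface MockRequestBody {
  tools: MockToolDef[];
}

type MockPlan =
  | { kind: "tool_call"; tool: string }
  | { kind: "acknowledgement"; text: string };

const ACK_TEXT =
  "Protocol note: acknowledged. Continue with the QA scenario plan and report worked, failed, and blocked items.";

function planMockResponse(body: MockRequestBody, targetTool: string): MockPlan {
  // The planner can only target tools that were advertised directly in the request.
  const advertised = body.tools.some((t) => t.name === targetTool);
  return advertised
    ? { kind: "tool_call", tool: targetTool }
    : { kind: "acknowledgement", text: ACK_TEXT };
}

// Pi-style request: the target tool is listed directly in body.tools.
const piBody: MockRequestBody = { tools: [{ name: "exec" }, { name: "web_search" }] };
// Codex-style request (assumed): neither native nor deferred dynamic tools are listed.
const codexBody: MockRequestBody = { tools: [] };

const piPlan = planMockResponse(piBody, "exec");
const codexPlan = planMockResponse(codexBody, "exec");
console.log(piPlan.kind, codexPlan.kind);
```

Under these assumptions the Codex lane falls through to the generic acknowledgement even though the runtime could still execute the tool natively, which is why a missing planned call is not evidence of runtime tool loss.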

Impact if OpenClaw moved fully to Codex today: P4 as filed, P2 for harness readiness. The product risk may still exist for individual tools, but this issue does not prove it. Each tool needs live/native proof or a Codex-aware mock planner before it can be filed as a product bug.

Correct Fix

  • Split Codex-native workspace tools from OpenClaw dynamic tools in the fixture suite.
  • Either force direct dynamic tool loading in mock fixture mode or teach the mock provider to emulate Codex searchable/deferred tools.
  • Keep provider-plan diagnostics separate from transcript/app-server execution evidence.
  • Move remaining rows to report-only/harness-gap until the mock can drive them honestly.
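
One way to read the last two bullets is sketched below: provider-plan counts are carried only as diagnostics, the verdict depends on transcript evidence, and rows the mock cannot drive honestly are downgraded to report-only. All names here (`ToolRowEvidence`, `judgeRow`, `mockCanDriveHonestly`) are hypothetical and are not taken from the qa-lab sources.

```typescript
type RowVerdict = "pass" | "fail" | "report-only";

interface ToolRowEvidence {
  tool: string;
  mockCanDriveHonestly: boolean; // false means the row is a known harness gap
  providerPlanCalls: number; // what the mock provider planned (diagnostic only)
  transcriptCalls: number; // what the runtime transcript actually shows
}

interface RowResult {
  tool: string;
  verdict: RowVerdict;
  diagnostics: string[];
}

function judgeRow(row: ToolRowEvidence): RowResult {
  // Provider-plan data is recorded but never decides the verdict.
  const diagnostics: string[] = [`provider-plan calls: ${row.providerPlanCalls}`];
  if (!row.mockCanDriveHonestly) {
    diagnostics.push("harness-gap: mock cannot drive this tool surface");
    return { tool: row.tool, verdict: "report-only", diagnostics };
  }
  const verdict: RowVerdict = row.transcriptCalls > 0 ? "pass" : "fail";
  return { tool: row.tool, verdict, diagnostics };
}

// Planned 0 calls, but the transcript shows the tool ran: still a pass.
const nativeResult = judgeRow({
  tool: "exec",
  mockCanDriveHonestly: true,
  providerPlanCalls: 0,
  transcriptCalls: 1,
});
// The mock cannot model this surface yet: tracked, not failed.
const gapResult = judgeRow({
  tool: "web_search",
  mockCanDriveHonestly: false,
  providerPlanCalls: 0,
  transcriptCalls: 0,
});
console.log(nativeResult.verdict, gapResult.verdict);
```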

Superseded Original Report

Tracking parent: #80171
Phase: #80173
Found by: Phase 2 full runtime tool fixture run

Why this matters

The runtime-parity harness is meant to catch exactly this class of default-runtime regression before Codex becomes the default OpenAI agent runtime. In the Phase 2 per-tool fixture suite, Pi sends the expected mock provider tool calls for many tool families, while Codex returns an acknowledgement without sending the planned tool call, causing the fixture to fail before tool-result comparison.

Evidence

Command:

OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --runtime-pair pi,codex \
  --output-dir .artifacts/qa-e2e/runtime-tools-phase2-full \
  --allow-failures \
  --concurrency 2 \
  --scenario runtime-tool-apply-patch \
  --scenario runtime-tool-bash \
  --scenario runtime-tool-edit \
  --scenario runtime-tool-exec \
  --scenario runtime-tool-fs-list \
  --scenario runtime-tool-fs-read \
  --scenario runtime-tool-fs-write \
  --scenario runtime-tool-grep \
  --scenario runtime-tool-image-generate \
  --scenario runtime-tool-memory-add \
  --scenario runtime-tool-memory-recall \
  --scenario runtime-tool-message-tool \
  --scenario runtime-tool-session-status \
  --scenario runtime-tool-sessions-spawn \
  --scenario runtime-tool-skill-invocation \
  --scenario runtime-tool-tavily-extract \
  --scenario runtime-tool-tavily-search \
  --scenario runtime-tool-tts \
  --scenario runtime-tool-web-fetch \
  --scenario runtime-tool-web-search

Artifact paths from that run:

  • .artifacts/qa-e2e/runtime-tools-phase2-full/qa-suite-summary.json
  • .artifacts/qa-e2e/runtime-tools-phase2-full/qa-suite-report.md
  • .artifacts/qa-e2e/tool-coverage-phase2-full-runtime.md

Affected rows from the coverage report:

  • bash: Pi planned 2 exec calls; Codex planned 0 and failed with expected mock happy-path request for exec.
  • exec: Pi planned 2 exec calls; Codex planned 0 and failed with expected mock happy-path request for exec.
  • edit: Pi planned 2 edit calls; Codex planned 0 and failed with expected mock happy-path request for edit.
  • fs.write: Pi planned 2 write calls; Codex planned 0 and failed with expected mock happy-path request for write.
  • grep: Pi planned 2 exec calls; Codex planned 0 and failed with expected mock happy-path request for exec.
  • image_generate: Pi planned 2 image_generate calls; Codex planned 0 and failed with expected mock happy-path request for image_generate.
  • session_status: Pi planned 2 session_status calls; Codex planned 0 and failed with expected mock happy-path request for session_status.
  • sessions_spawn: Pi planned 2 sessions_spawn calls; Codex planned 0 and failed with expected mock happy-path request for sessions_spawn.
  • web_fetch: Pi planned 2 web_fetch calls; Codex planned 0 and failed with expected mock happy-path request for web_fetch.
  • web_search: Pi planned 2 web_search calls; Codex planned 0 and failed with expected mock happy-path request for web_search.

Representative Codex final text in these cells was the generic acknowledgement:

Protocol note: acknowledged. Continue with the QA scenario plan and report worked, failed, and blocked items.

Expected

For each fixture, Codex should route the same prompt/model through the same mock provider plan and emit the expected happy-path and denied-input failure-path tool calls, or the harness should classify a deliberate runtime-specific unsupported-tool case with an explicit known-broken marker.

Actual

Codex does not emit the planned tool request for these fixtures, so Phase 2 currently records tool-call-shape drift against the runtime axis.
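
The drift the original report recorded can be illustrated by comparing planned call counts per runtime. The rows below are transcribed from the affected-rows list in this report; the `PlannedCallRow` shape and `findPlanDrift` are hypothetical, not the harness's own comparison logic. Per the correction at the top of this issue, a mismatch of this kind reflects what the mock planner emitted and does not by itself show runtime tool loss.

```typescript
interface PlannedCallRow {
  fixture: string;
  tool: string;
  pi: number;
  codex: number;
}

// Transcribed from the affected rows of the coverage report.
const plannedRows: PlannedCallRow[] = [
  { fixture: "bash", tool: "exec", pi: 2, codex: 0 },
  { fixture: "exec", tool: "exec", pi: 2, codex: 0 },
  { fixture: "edit", tool: "edit", pi: 2, codex: 0 },
  { fixture: "fs.write", tool: "write", pi: 2, codex: 0 },
  { fixture: "grep", tool: "exec", pi: 2, codex: 0 },
  { fixture: "image_generate", tool: "image_generate", pi: 2, codex: 0 },
  { fixture: "session_status", tool: "session_status", pi: 2, codex: 0 },
  { fixture: "sessions_spawn", tool: "sessions_spawn", pi: 2, codex: 0 },
  { fixture: "web_fetch", tool: "web_fetch", pi: 2, codex: 0 },
  { fixture: "web_search", tool: "web_search", pi: 2, codex: 0 },
];

// A row drifts when the two runtimes saw a different number of planned calls.
function findPlanDrift(rows: PlannedCallRow[]): PlannedCallRow[] {
  return rows.filter((row) => row.pi !== row.codex);
}

// Count drifting fixtures per underlying tool (bash, exec and grep all map to exec).
function driftByTool(rows: PlannedCallRow[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const row of findPlanDrift(rows)) {
    counts[row.tool] = (counts[row.tool] || 0) + 1;
  }
  return counts;
}

const driftRows = findPlanDrift(plannedRows);
const driftCounts = driftByTool(plannedRows);
console.log(driftRows.length, driftCounts);
```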
