openclaw - ✅(Solved) Fix benchmarks: add GPT-5.4 vs Opus 4.6 agentic parity harness and release gate [3 pull requests, 1 participant]

GitHub stats
openclaw/openclaw#64233 · Fetched 2026-04-11 06:15:44
Comments: 0 · Participants: 1 · Timeline: 4 · Reactions: 0
Timeline (top): cross-referenced ×4

Add the shared benchmark harness and release gate for GPT-5.4 vs Opus 4.6 so parity claims are evidence-backed.

Fix Action

Fixed

PR fix notes

PR #64286: openai-codex: fix auth scope handling and classify provider/runtime failures

Description (problem / solution / changelog)

Summary

This is PR 2 of the GPT-5.4 / Codex agentic runtime parity program tracked in #64227 and scoped by #64229.

It fixes the maintained-source OpenAI Codex OAuth scope gap in OpenClaw's login wrapper and adds a separate provider/runtime failure taxonomy that makes auth-scope, refresh, HTML 403, proxy, DNS, timeout, schema, sandbox-blocked, and replay-invalid failures observable in logs and easier to explain to users.

What changed

  • normalize OpenAI Codex authorize URLs so the required scopes are always present:
    • openid
    • profile
    • email
    • offline_access
    • model.request
    • api.responses.write
  • add classifyProviderRuntimeFailureKind(...) as a typed provider/runtime failure classifier
  • keep the older failover-reason contract intact instead of widening it in this slice
  • thread providerRuntimeFailureKind through embedded-run observation fields and lifecycle logging
  • surface more truthful user-facing copy for:
    • OAuth refresh failures
    • missing OpenAI Codex scopes
    • HTML 403 auth failures
    • proxy/tunnel misroutes
    • replay-invalid failures
  • add focused regressions for scope failures, refresh failures, HTML 403, proxy, DNS, timeout, schema, sandbox-blocked, and replay-invalid paths
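
The scope normalization described in the list above could look roughly like the following. This is a minimal sketch, not the PR's code: the scope list is taken from the description, but the helper name `ensureCodexScopes` and the example URL are hypothetical, and the real login wrapper may merge scopes differently.

```typescript
// Scopes listed in the PR description as required for OpenAI Codex.
const REQUIRED_CODEX_SCOPES: readonly string[] = [
  "openid",
  "profile",
  "email",
  "offline_access",
  "model.request",
  "api.responses.write",
];

// Hypothetical helper (name and behavior are assumptions): returns the
// authorize URL with every required scope present, keeping any extra
// scopes the caller already supplied and dropping duplicates.
function ensureCodexScopes(authorizeUrl: string): string {
  const url = new URL(authorizeUrl);
  const existing = (url.searchParams.get("scope") ?? "")
    .split(/\s+/)
    .filter((scope) => scope.length > 0);
  // A Set keeps insertion order, so caller-supplied scopes stay first.
  const merged = new Set<string>(existing.concat(REQUIRED_CODEX_SCOPES));
  url.searchParams.set("scope", Array.from(merged).join(" "));
  return url.toString();
}
```

Merging rather than overwriting means a caller that already requests an additional scope does not lose it; whether the actual wrapper preserves extra scopes is not stated in the PR description.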

Why

GPT-5.4 / Codex failures in OpenClaw are still too easy to misdiagnose as generic model stops. This slice makes the auth/runtime layer tell the truth before we move on to tool-contract and parity-harness work.
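
To illustrate what a typed failure taxonomy buys over a generic "model stopped" message, here is a small sketch of a classifier. It is not the signature or logic of `classifyProviderRuntimeFailureKind(...)`, which the PR description does not show; the kind names, the `FailureSignal` shape, and every matching rule below are assumptions chosen for illustration. The error codes `ENOTFOUND`, `EAI_AGAIN`, and `ETIMEDOUT` are standard Node.js system error codes, and 407 is the standard HTTP status for proxy authentication.

```typescript
// Kind names mirror the failure categories listed in the PR summary.
type FailureKind =
  | "auth_scope"
  | "auth_refresh"
  | "auth_html_403"
  | "proxy"
  | "dns"
  | "timeout"
  | "schema"
  | "sandbox_blocked"
  | "replay_invalid"
  | "unknown";

// Hypothetical input shape: whatever the runtime knows about the failure.
interface FailureSignal {
  status?: number;
  contentType?: string;
  code?: string;
  message: string;
}

const mentions = (text: string, needle: string): boolean => text.indexOf(needle) !== -1;

// Sketch only: ordered checks, most specific transport signals first.
function sketchClassifyFailure(signal: FailureSignal): FailureKind {
  const msg = signal.message.toLowerCase();
  const contentType = (signal.contentType ?? "").toLowerCase();
  if (signal.code === "ENOTFOUND" || signal.code === "EAI_AGAIN") return "dns";
  if (signal.code === "ETIMEDOUT" || mentions(msg, "timed out")) return "timeout";
  if (signal.status === 407 || mentions(msg, "proxy")) return "proxy";
  // An HTML body on a 403 usually comes from a gateway, not the JSON API.
  if (signal.status === 403 && mentions(contentType, "text/html")) return "auth_html_403";
  if (mentions(msg, "missing scope") || mentions(msg, "insufficient scope")) return "auth_scope";
  if (mentions(msg, "refresh") && mentions(msg, "token")) return "auth_refresh";
  if (mentions(msg, "invalid schema")) return "schema";
  if (mentions(msg, "sandbox")) return "sandbox_blocked";
  if (mentions(msg, "replay")) return "replay_invalid";
  return "unknown";
}
```

Returning `"unknown"` instead of guessing keeps the classifier from mislabeling a failure it does not recognize, which is the property the parity gate later checks for.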

Non-goals

  • does not implement tool compatibility work from #64230
  • does not implement permission truthfulness work from #64231
  • does not implement replay/liveness hardening from #64232
  • does not implement the benchmark harness from #64233
  • does not widen the generic failover-reason enum for every caller in this slice

Builds on prior groundwork

  • #45176
  • #48592
  • #53702
  • #55206
  • #44019

Validation

Focused checks run:

  • CI=1 pnpm exec vitest run src/commands/openai-codex-oauth.test.ts src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts src/agents/failover-error.test.ts src/agents/pi-embedded-error-observation.test.ts src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts
  • repo hook gate during commit:
    • pnpm check:no-conflict-markers
    • pnpm tool-display:check
    • pnpm check:host-env-policy:swift
    • pnpm tsgo
    • node scripts/prepare-extension-package-boundary-artifacts.mjs
    • pnpm lint
    • pnpm lint:webhook:no-low-level-body-read
    • pnpm lint:auth:no-pairing-store-group
    • pnpm lint:auth:pairing-account-scope

Linked issues

  • Closes #64229
  • Refs #64227
  • Refs #64133
  • Refs #64174
  • Refs #64092
  • Refs #57399
  • Refs #62672

Changed files

  • src/agents/failover-error.test.ts (modified, +10/-0)
  • src/agents/pi-embedded-error-observation.test.ts (modified, +14/-0)
  • src/agents/pi-embedded-error-observation.ts (modified, +23/-4)
  • src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts (modified, +67/-0)
  • src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts (modified, +79/-0)
  • src/agents/pi-embedded-helpers.ts (modified, +2/-0)
  • src/agents/pi-embedded-helpers/errors.ts (modified, +219/-4)
  • src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts (modified, +22/-0)
  • src/agents/pi-embedded-subscribe.handlers.lifecycle.ts (modified, +16/-3)
  • src/commands/openai-codex-oauth.test.ts (modified, +28/-3)
  • src/plugins/provider-openai-codex-oauth.ts (modified, +40/-1)

PR #64300: agents: add OpenAI/Codex tool compatibility and replay/liveness state

Description (problem / solution / changelog)

Summary

  • keep the provider-owned OpenAI/Codex tool-compat layer via the existing provider hook surface
  • add replay/liveness state surfacing so long-running embedded runs stop disappearing silently
  • compact the original Contracts 2 and 5 into one execution-correctness PR in the GPT-5.4 / Codex parity program tracked by #64227
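
As a rough picture of what "replay/liveness state surfacing" means, the sketch below maps a run observation onto the four states this PR names (working, paused, blocked, abandoned). Only those state names come from the PR; the `RunObservation` fields, the staleness threshold, and the precedence of the checks are assumptions made for the example.

```typescript
// State names as listed in the PR description.
type LivenessState = "working" | "paused" | "blocked" | "abandoned";

// Hypothetical observation of an embedded run; field names are assumptions.
interface RunObservation {
  msSinceLastEvent: number;
  awaitingApproval: boolean;
  blockedReason?: string;
}

// Sketch: every run resolves to exactly one named state, so a long-running
// run can never "disappear" without an explanation.
function sketchLivenessState(obs: RunObservation, staleAfterMs: number = 60000): LivenessState {
  if (obs.blockedReason !== undefined) return "blocked";
  if (obs.awaitingApproval) return "paused";
  if (obs.msSinceLastEvent > staleAfterMs) return "abandoned";
  return "working";
}
```

The point of the exhaustive union is that callers rendering lifecycle output must handle each state explicitly rather than falling through to silence.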

Scope

  • Refs #64230
  • Refs #64232
  • Refs #64227
  • combines provider-owned tool compatibility with replay/liveness hardening
  • no auth / permission truthfulness changes in this PR
  • no self-elected continuation scope from #38780
  • no benchmark harness work from #64233

What changed

  • add an openai tool-compat family to buildProviderToolCompatFamilyHooks(...)
  • gate the family to native OpenAI/OpenAI Codex response routes only
  • normalize provider-owned parameter-free and missing-object-shape tool schemas for strict OpenAI/Codex routes
  • surface provider-owned diagnostics for remaining strict-schema incompatibilities
  • attach the compat hooks in extensions/openai/index.ts so OpenAI and OpenAI Codex providers both expose them
  • add replay/liveness state to embedded run results and lifecycle surfaces
  • classify replay/liveness outcomes as observable working, paused, blocked, or abandoned states instead of silent disappearance
  • preserve replay-invalid truth across compaction retries after mutating tool side effects
  • add focused regressions for replay/liveness surfacing alongside the existing tool-compat coverage
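
The schema normalization item in the list above (parameter-free and missing-object-shape tool schemas) can be sketched as follows. This is an illustration under assumptions, not the hook registered through `buildProviderToolCompatFamilyHooks(...)`: the function name, the `JsonSchema` alias, and the exact defaults are hypothetical, and the real compat layer may apply further rules for strict routes.

```typescript
// Loose JSON Schema alias for the sketch.
type JsonSchema = { [key: string]: unknown };

// Hypothetical normalizer: returns a schema that always has an object shape.
function sketchNormalizeToolParameters(parameters: JsonSchema | undefined): JsonSchema {
  // Parameter-free tool: substitute an explicit empty object schema.
  if (parameters === undefined || Object.keys(parameters).length === 0) {
    return { type: "object", properties: {}, required: [], additionalProperties: false };
  }
  // Missing object shape: fill in only what is absent, never overwrite.
  const normalized: JsonSchema = { ...parameters };
  if (normalized["type"] === undefined) {
    normalized["type"] = "object";
  }
  if (normalized["type"] === "object" && normalized["properties"] === undefined) {
    normalized["properties"] = {};
  }
  return normalized;
}
```

The input is copied rather than mutated, so the original tool definition remains available for routes that do not need the normalization.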

Validation

  • pnpm build
  • CI=1 pnpm exec vitest run src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts src/agents/pi-embedded-subscribe.handlers.compaction.test.ts src/agents/pi-embedded-subscribe.handlers.tools.test.ts src/agents/pi-embedded-runner/run/attempt.spawn-workspace.test.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts

Non-goals

  • does not supersede #64229 or #64231
  • does not add tool-name or argument aliases
  • does not change generic runner behavior outside provider-owned hooks and replay/liveness surfacing

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • extensions/openai/index.test.ts (modified, +78/-0)
  • extensions/openai/index.ts (modified, +3/-0)
  • src/agents/pi-embedded-runner/run.incomplete-turn.test.ts (modified, +43/-0)
  • src/agents/pi-embedded-runner/run.overflow-compaction.test.ts (modified, +23/-0)
  • src/agents/pi-embedded-runner/run.timeout-triggered-compaction.test.ts (modified, +1/-0)
  • src/agents/pi-embedded-runner/run.ts (modified, +80/-0)
  • src/agents/pi-embedded-runner/run/attempt.spawn-workspace.test-support.ts (modified, +6/-0)
  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +18/-5)
  • src/agents/pi-embedded-runner/run/incomplete-turn.ts (modified, +45/-0)
  • src/agents/pi-embedded-runner/run/retry-limit.ts (modified, +5/-0)
  • src/agents/pi-embedded-runner/run/types.ts (modified, +7/-0)
  • src/agents/pi-embedded-runner/types.ts (modified, +4/-0)
  • src/agents/pi-embedded-subscribe.handlers.compaction.ts (modified, +4/-0)
  • src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts (modified, +67/-0)
  • src/agents/pi-embedded-subscribe.handlers.lifecycle.ts (modified, +27/-1)
  • src/agents/pi-embedded-subscribe.handlers.tools.test.ts (modified, +92/-0)
  • src/agents/pi-embedded-subscribe.handlers.tools.ts (modified, +5/-0)
  • src/agents/pi-embedded-subscribe.handlers.types.ts (modified, +6/-0)
  • src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts (modified, +38/-0)
  • src/agents/pi-embedded-subscribe.ts (modified, +21/-0)
  • src/agents/pi-embedded-subscribe.types.ts (modified, +2/-0)
  • src/auto-reply/reply/dispatch-from-config.ts (modified, +2/-2)
  • src/plugin-sdk/provider-tools.test.ts (modified, +244/-0)
  • src/plugin-sdk/provider-tools.ts (modified, +286/-1)
  • src/plugins/contracts/provider-family-plugin-tests.test.ts (modified, +1/-0)

PR #64441: benchmarks: add first-wave GPT-5.4 vs Opus 4.6 parity harness

Description (problem / solution / changelog)

Summary

This is the benchmark / release-gate slice of the GPT-5.4 / Codex parity program tracked in #64227 and scoped by #64233.

It adds the first-wave QA-lab parity scenario pack, the parity comparison report layer, and the first machine-readable gate verdict so GPT-5.4 and Opus 4.6 can be compared through shared agentic scenarios instead of anecdotes.

Scope

  • Refs #64233
  • Refs #64227
  • first-wave parity harness plus report / gate output
  • no runtime behavior changes in this PR
  • no auth, permission, tool-compat, or replay/liveness changes in this PR

What changed

  • add agentic-parity.ts in QA-lab as the first-wave parity scenario-pack entrypoint
  • wire the parity mode into cli.runtime.ts and cli.ts
  • cover the first-wave scenario pack in tests and scenario-catalog assertions
  • add agentic-parity-report.ts as the comparison layer for two suite summaries
  • add a QA CLI parity-report flow that writes:
    • qa-agentic-parity-report.md
    • qa-agentic-parity-summary.json
    • an explicit pass / fail gate verdict
  • add plain-English + engineering parity docs and maintainer review notes, including a goal-to-evidence matrix for the completion gate
  • start the first-wave parity pack with these scenarios:
    • approval-turn-tool-followthrough
    • model-switch-tool-continuity
    • source-docs-discovery-report
    • image-understanding-attachment
    • compaction-retry-mutating-tool

Validation

  • pnpm build
  • CI=1 pnpm exec vitest run extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/mock-openai-server.test.ts extensions/qa-lab/src/agentic-parity-report.test.ts extensions/qa-lab/src/scenario-catalog.test.ts

Non-goals

  • does not claim final product parity by itself; it provides the proof harness and gate output
  • does not simulate auth/proxy/DNS failures inside QA-lab
  • does not replace the deterministic runtime-truthfulness suites from the other PRs

Changed files

  • docs/help/gpt54-codex-agentic-parity-maintainers.md (added, +164/-0)
  • docs/help/gpt54-codex-agentic-parity.md (added, +219/-0)
  • extensions/qa-lab/src/agentic-parity-report.test.ts (added, +217/-0)
  • extensions/qa-lab/src/agentic-parity-report.ts (added, +303/-0)
  • extensions/qa-lab/src/agentic-parity.ts (added, +47/-0)
  • extensions/qa-lab/src/cli.runtime.test.ts (modified, +59/-0)
  • extensions/qa-lab/src/cli.runtime.ts (modified, +56/-2)
  • extensions/qa-lab/src/cli.ts (modified, +36/-0)
  • extensions/qa-lab/src/mock-openai-server.test.ts (modified, +71/-0)
  • extensions/qa-lab/src/mock-openai-server.ts (modified, +22/-0)
  • extensions/qa-lab/src/scenario-catalog.test.ts (modified, +6/-0)
  • qa/frontier-harness-plan.md (modified, +4/-1)
  • qa/scenarios/compaction-retry-mutating-tool.md (added, +98/-0)
RAW_BUFFER

Parent: #64227

Summary

Add the shared benchmark harness and release gate for GPT-5.4 vs Opus 4.6 so parity claims are evidence-backed.

Scope

  • shared task suite
  • completion / unintended-stop / valid-tool-call metrics
  • targeted auth/proxy mislabel regressions
  • release gate document and CI entrypoint if feasible

Acceptance

  • same prompts, tool surface, sandbox policy, and continuation policy for both models
  • GPT-5.4 must match or beat Opus 4.6 on completion rate
  • equal or lower unintended-stop rate
  • equal or higher valid-tool-call rate
  • zero fake-success cases
  • zero mislabeled auth/proxy/DNS failures
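
The acceptance criteria above translate fairly directly into a gate function. The following is a minimal sketch under assumptions: the metric field names, the verdict shape, and the choice to apply the two zero-count rules to the candidate model only are all hypothetical, since the issue does not specify a data format.

```typescript
// Hypothetical per-model metrics; rates are fractions in [0, 1].
interface SketchParityMetrics {
  completionRate: number;
  unintendedStopRate: number;
  validToolCallRate: number;
  fakeSuccessCount: number;
  mislabeledFailureCount: number;
}

interface SketchGateVerdict {
  pass: boolean;
  failures: string[];
}

// Sketch of the gate: the candidate (GPT-5.4 in the issue) must match or beat
// the baseline (Opus 4.6) on each rate and have zero disqualifying cases.
function sketchParityGate(candidate: SketchParityMetrics, baseline: SketchParityMetrics): SketchGateVerdict {
  const failures: string[] = [];
  if (candidate.completionRate < baseline.completionRate) {
    failures.push("completion rate below baseline");
  }
  if (candidate.unintendedStopRate > baseline.unintendedStopRate) {
    failures.push("unintended-stop rate above baseline");
  }
  if (candidate.validToolCallRate < baseline.validToolCallRate) {
    failures.push("valid-tool-call rate below baseline");
  }
  if (candidate.fakeSuccessCount !== 0) {
    failures.push("fake-success cases present");
  }
  if (candidate.mislabeledFailureCount !== 0) {
    failures.push("mislabeled auth/proxy/DNS failures present");
  }
  return { pass: failures.length === 0, failures };
}
```

Collecting every failed criterion, instead of returning on the first one, gives a report reader the full list of reasons behind a failing verdict.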

extent analysis

TL;DR

Implement a shared benchmark harness to compare GPT-5.4 and Opus 4.6 performance on specific metrics, ensuring evidence-backed parity claims.

Guidance

  • Develop a comprehensive task suite that includes completion, unintended-stop, and valid-tool-call metrics to evaluate both models.
  • Ensure the benchmark harness uses the same prompts, tool surface, sandbox policy, and continuation policy for both GPT-5.4 and Opus 4.6 to maintain consistency.
  • Add a release gate document and, if feasible, a CI entrypoint that automates the comparison, and include targeted regressions for mislabeled auth/proxy failures.
  • Verify that GPT-5.4 matches or beats Opus 4.6 on completion rate and valid-tool-call rate, has an equal or lower unintended-stop rate, and produces zero fake-success cases and zero mislabeled auth/proxy/DNS failures.

Notes

The implementation details of the benchmark harness and release gate are not specified, so the exact technical approach may vary depending on the project's existing infrastructure and requirements.

Recommendation

Implement the shared benchmark harness and release gate as described, so that parity claims between GPT-5.4 and Opus 4.6 are evidence-backed.
