openclaw: Update QA lab parity gate for GPT-5.5 vs Opus 4.7 and harden preflight [2 comments, 2 participants]

GitHub stats
openclaw/openclaw#74259 (fetched 2026-04-30 06:26:34)
Comments: 2
Participants: 2
Timeline events: 5
Reactions: 2
Timeline (top): commented ×2, cross-referenced ×2, closed ×1


Problem

The QA lab parity gate and related tests are only partially updated for the current frontier model targets as of 2026-04-29. The OpenAI candidate lane has moved to openai/gpt-5.5, but several workflow names, artifact paths, mock-model fixtures, reports, scenario docs, and the Anthropic baseline still encode GPT-5.4 / Opus 4.6 era assumptions.

That makes the expensive QA/parity gate harder to trust: it can be green while still validating the old Opus baseline, and stale labels like gpt54 / opus46 make it unclear what actually ran.

Current upstream evidence

Checked against current origin/main on 2026-04-29 at fa8a7d70ee (docs: fix clawsweeper skill metadata).

Parity workflow is mixed current/stale

.github/workflows/parity-gate.yml currently has:

  • job/workflow text still referring to OpenAI / Opus 4.6
  • OPENCLAW_CI_OPENAI_MODEL defaulting to openai/gpt-5.5
  • candidate lane still using --alt-model openai/gpt-5.4-alt
  • candidate output dir still .artifacts/qa-e2e/gpt54
  • baseline lane still using --model anthropic/claude-opus-4-6
  • baseline lane still using --alt-model anthropic/claude-sonnet-4-6
  • baseline output dir still .artifacts/qa-e2e/opus46
  • parity report baseline label still anthropic/claude-opus-4-6

QA lab defaults are partially updated

extensions/qa-lab/src/providers/live-frontier/catalog.ts now defaults the primary live frontier model to:

openai/gpt-5.5

extensions/qa-lab/src/providers/live-frontier/index.ts and model-selection.runtime.ts also know about openai/gpt-5.5.

However, the related parity/reporting surfaces still carry Opus 4.6 assumptions:

  • extensions/qa-lab/src/providers/live-frontier/parity.ts
  • extensions/qa-lab/src/providers/live-frontier/character-eval.ts
  • extensions/qa-lab/src/agentic-parity-report.test.ts
  • extensions/qa-lab/src/providers/mock-openai/server.ts

Provider support for Opus 4.7 already exists elsewhere

The Anthropic provider layer already includes anthropic/claude-opus-4-7 / claude-opus-4.7 mappings in core provider code. This issue is therefore about QA-lab/parity wiring drift, not missing Anthropic provider support.

Scenario metadata still targets Opus 4.6

The Anthropic Opus live smoke scenarios still describe and require Opus 4.6:

  • qa/scenarios/models/anthropic-opus-api-key-smoke.md
  • qa/scenarios/models/anthropic-opus-setup-token-smoke.md

These should be moved to Opus 4.7, or made family/parameter driven if exact latest-model names are expected to change frequently.
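
One way to picture the family/parameter-driven option is a requirement that names a provider, a model family, and an optional minimum version instead of one exact model string. The sketch below is illustrative only: `ModelRequirement`, `parseModelRef`, and `meetsRequirement` are hypothetical names, and the real scenario metadata schema may look different.

```typescript
// Hypothetical requirement shape; not the actual scenario front matter.
interface ModelRequirement {
  provider: string;      // e.g. "anthropic"
  family: string;        // e.g. "claude-opus"
  minVersion?: number[]; // optional lower bound, e.g. [4, 7]
}

interface ParsedModelRef {
  provider: string;
  family: string;
  version: number[];
}

// Split "anthropic/claude-opus-4-7" (or "...-4.7") into provider, family, version.
function parseModelRef(ref: string): ParsedModelRef | null {
  const slash = ref.indexOf("/");
  if (slash <= 0) return null;
  const provider = ref.slice(0, slash);
  const parts = ref.slice(slash + 1).split("-");
  const version: number[] = [];
  // Peel trailing numeric segments ("4", "7", or "4.7") off the model id.
  while (parts.length > 1) {
    const last = parts[parts.length - 1] ?? "";
    if (!/^\d+(\.\d+)*$/.test(last)) break;
    parts.pop();
    version.unshift(...last.split(".").map(Number));
  }
  return { provider, family: parts.join("-"), version };
}

// True when the model belongs to the required family and meets the minimum version.
function meetsRequirement(ref: string, req: ModelRequirement): boolean {
  const parsed = parseModelRef(ref);
  if (!parsed) return false;
  if (parsed.provider !== req.provider || parsed.family !== req.family) return false;
  if (!req.minVersion) return true;
  // Compare component by component, treating missing components as 0.
  for (let i = 0; i < req.minVersion.length; i++) {
    const have = parsed.version[i] ?? 0;
    const want = req.minVersion[i] ?? 0;
    if (have !== want) return have > want;
  }
  return true;
}
```

With a requirement like this, a scenario would keep passing when the baseline moves to a newer Opus release, at the cost of no longer pinning the exact model that was validated.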

What still works

The harness is not dead. A recent upstream parity-gate run succeeded with the current mixed configuration.

Local focused validation also passed against current main:

pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts \
  extensions/qa-lab/src/providers/mock-openai/server.test.ts \
  extensions/qa-lab/src/qa-gateway-config.test.ts \
  extensions/qa-lab/src/suite-planning.test.ts \
  extensions/qa-lab/src/agentic-parity-report.test.ts \
  extensions/qa-lab/src/scenario-catalog.test.ts

Result: 5 files, 132 tests passed.

Full extension QA unit lane also passed:

pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts

Result: 63 files, 524 tests passed.

Private QA runtime build also passed:

OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm build

What needs hardening

The quick preflight path did not reliably work locally. Running mock preflight twice with isolated OpenClaw state failed on approval-turn-tool-followthrough with:

gateway timeout after 25000ms

Representative command shape:

env \
  HOME=/tmp/openclaw-origin-main-qa-home \
  OPENCLAW_HOME=/tmp/openclaw-origin-main-qa-home \
  OPENCLAW_STATE_DIR=/tmp/openclaw-origin-main-qa-state \
  OPENCLAW_CONFIG_PATH=/tmp/openclaw-origin-main-qa-home/openclaw.json \
  OPENCLAW_BUILD_PRIVATE_QA=1 \
  OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 \
  OPENCLAW_QA_SUITE_PROGRESS=1 \
  OPENAI_API_KEY= \
  ANTHROPIC_API_KEY= \
  OPENCLAW_LIVE_OPENAI_KEY= \
  OPENCLAW_LIVE_ANTHROPIC_KEY= \
  OPENCLAW_LIVE_GEMINI_KEY= \
  OPENCLAW_LIVE_SETUP_TOKEN_VALUE= \
  pnpm openclaw qa suite \
    --provider-mode mock-openai \
    --parity-pack agentic \
    --concurrency 1 \
    --model openai/gpt-5.5 \
    --alt-model openai/gpt-5.5-alt \
    --preflight \
    --output-dir .artifacts/qa-e2e/preflight

This does not look like an unknown-model failure. The full CI parity suite runs approval-turn-tool-followthrough later in the 12-scenario pack after the gateway/agent path is already warm, and it passes there. The preflight path runs it cold, with a short timeout, so it is not a reliable quick “does this even work?” sentinel.
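
The cold-versus-warm distinction can be made explicit in the timeout logic. The following is a minimal sketch, assuming the 25000 ms in the error above is the steady-state gateway call timeout; `PreflightPolicy`, `AttemptOutcome`, `agentRunTimeoutMs`, and `shouldRetryPreflight` are hypothetical names and do not come from the OpenClaw codebase.

```typescript
// Hypothetical policy knobs, not taken from the OpenClaw codebase.
interface PreflightPolicy {
  baseTimeoutMs: number;  // steady-state gateway call timeout
  coldMultiplier: number; // headroom for the first (cold) agent run
  maxAttempts: number;    // total attempts allowed per agent call
}

interface AttemptOutcome {
  attempt: number;         // 1-based attempt counter
  timedOut: boolean;       // the agent RPC hit the gateway timeout
  gatewayHealthy: boolean; // a separate health probe still succeeds
}

// Give only the first agent run of a suite the larger cold-start budget.
function agentRunTimeoutMs(runIndex: number, policy: PreflightPolicy): number {
  const isCold = runIndex === 0;
  return isCold
    ? Math.round(policy.baseTimeoutMs * policy.coldMultiplier)
    : policy.baseTimeoutMs;
}

// Retry only timeouts on a healthy gateway, and only while attempts remain.
function shouldRetryPreflight(outcome: AttemptOutcome, policy: PreflightPolicy): boolean {
  return outcome.timedOut && outcome.gatewayHealthy && outcome.attempt < policy.maxAttempts;
}
```

Scoping the extra budget to the first run keeps later scenarios on the short timeout, so a real hang would still surface quickly.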

Desired target state

The QA lab should clearly and truthfully validate the current target comparison:

openai/gpt-5.5 candidate vs anthropic/claude-opus-4-7 baseline

All user-facing labels, report labels, mock models, scenario docs, artifact paths, and workflow names should either use the current exact model names or be renamed to generic stable names like openai-candidate / anthropic-baseline to avoid repeated drift.
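
If the generic-name route is chosen, artifact paths can be derived from the lane role and provider while the report title still spells out the exact models. This is a hedged sketch: `Lane`, `laneSlug`, `laneArtifactDir`, and `parityTitle` are hypothetical helpers, and only the `.artifacts/qa-e2e` root and the model names come from the issue.

```typescript
type LaneRole = "candidate" | "baseline";

// Hypothetical lane descriptor; the real workflow passes these as CLI flags.
interface Lane {
  role: LaneRole;
  model: string; // e.g. "openai/gpt-5.5"
}

// Provider prefix of a "provider/model" reference, or "unknown" if absent.
function providerOf(model: string): string {
  const slash = model.indexOf("/");
  return slash > 0 ? model.slice(0, slash) : "unknown";
}

// Version-free slug such as "openai-candidate" or "anthropic-baseline".
function laneSlug(lane: Lane): string {
  return `${providerOf(lane.model)}-${lane.role}`;
}

// Artifact directory that stays stable across model bumps.
function laneArtifactDir(lane: Lane, root: string = ".artifacts/qa-e2e"): string {
  return `${root}/${laneSlug(lane)}`;
}

// Report title built from the models that actually ran, not hard-coded text.
function parityTitle(candidate: Lane, baseline: Lane): string {
  return `${candidate.model} candidate vs ${baseline.model} baseline`;
}
```

The design choice here is that paths and job names stay stable across model bumps, and the only place a version appears is text generated from the model argument itself.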

Implementation checklist

  • Update .github/workflows/parity-gate.yml:

    • rename job/workflow text from Opus 4.6 to Opus 4.7
    • use anthropic/claude-opus-4-7 for the baseline lane
    • decide whether anthropic/claude-sonnet-4-6 remains the correct alternate model or whether the alternate should also move
    • replace openai/gpt-5.4-alt with openai/gpt-5.5-alt, or derive the mock alt model from the primary model
    • rename .artifacts/qa-e2e/gpt54 and .artifacts/qa-e2e/opus46 to current or generic names
  • Update QA lab parity/reporting code:

    • extensions/qa-lab/src/providers/live-frontier/parity.ts
    • extensions/qa-lab/src/providers/live-frontier/character-eval.ts
    • extensions/qa-lab/src/agentic-parity-report.test.ts
    • any report title, baseline label, summary, fixture, and expected snapshot text that still says Opus 4.6 or GPT-5.4
  • Update mock provider fixtures/tests:

    • extensions/qa-lab/src/providers/mock-openai/server.ts
    • advertise claude-opus-4-7 in mock model lists where appropriate
    • keep compatibility aliases only where intentionally needed
    • ensure the mock provider variant resolver still maps openai/* and anthropic/* by provider family rather than brittle exact model strings
  • Update model scenario metadata/docs:

    • qa/scenarios/models/anthropic-opus-api-key-smoke.md
    • qa/scenarios/models/anthropic-opus-setup-token-smoke.md
    • move requiredModel and expected summaries from claude-opus-4-6 to claude-opus-4-7, or introduce a parameterized/family-level requirement if that is the preferred QA contract
  • Sweep for stale strings before opening the PR:

rg "gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4" \
  .github/workflows \
  extensions/qa-lab \
  qa/scenarios \
  test/helpers/auto-reply
  • Harden --preflight:
    • increase the first cold agent-run timeout for preflight, or
    • add a lightweight warmup call before approval-turn-tool-followthrough, or
    • make the gateway child-call timeout retryable for QA preflight when the gateway is healthy but the first agent RPC times out
    • keep the preflight cheap; the point is to avoid paying for the full long parity gate just to discover obvious breakage
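
Two of the checklist items above, deriving the mock alt model from the primary model and resolving the mock variant by provider family, could look roughly like the sketch below. It assumes the `-alt` suffix convention seen in `openai/gpt-5.5-alt`; `resolveMockVariant` and `deriveMockAltModel` are hypothetical names, not the functions in `mock-openai/server.ts`.

```typescript
type MockVariant = "openai" | "anthropic";

// Map a model reference to a mock variant by provider prefix only,
// so a new model version needs no change here.
function resolveMockVariant(model: string): MockVariant | null {
  if (model.startsWith("openai/")) return "openai";
  if (model.startsWith("anthropic/")) return "anthropic";
  return null; // unknown provider family: let the caller decide how to fail
}

// Derive the mock alternate from the primary model instead of hard-coding it.
// Assumes the "-alt" suffix convention used by the mock provider.
function deriveMockAltModel(primary: string): string {
  return primary.endsWith("-alt") ? primary : `${primary}-alt`;
}
```

Deriving the alternate this way would have avoided the `gpt-5.5` primary being paired with a `gpt-5.4-alt` alternate.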

Acceptance criteria

  • rg "gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4" only returns intentional compatibility aliases or historical comments with explicit justification.
  • pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts passes.
  • OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm build passes.
  • Mock preflight passes with GPT-5.5 candidate naming:
pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --parity-pack agentic \
  --concurrency 1 \
  --model openai/gpt-5.5 \
  --alt-model openai/gpt-5.5-alt \
  --preflight
  • The full parity gate runs candidate openai/gpt-5.5 against baseline anthropic/claude-opus-4-7 and produces a report that truthfully names those models.
  • The expensive 12-scenario parity suite remains mock-provider compatible and does not require live API keys in mock mode.
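
The first criterion allows stale strings only when they are intentional and justified. One way to make that mechanically checkable is sketched below: it takes file contents already loaded into memory and reports stale hits that lack a justification marker. The pattern is the one from the `rg` command; the `qa-compat-alias:` marker and the `findUnjustifiedStale` helper are hypothetical conventions invented for illustration.

```typescript
// Same alternation as the rg sweep in the checklist.
const STALE_PATTERN = /gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4/;

// Hypothetical marker that flags a stale string as intentional.
const ALLOW_MARKER = "qa-compat-alias:";

interface StaleHit {
  file: string;
  line: number; // 1-based line number
  text: string;
}

// Return stale-string hits that carry no justification marker on the same line.
function findUnjustifiedStale(files: Record<string, string>): StaleHit[] {
  const hits: StaleHit[] = [];
  for (const [file, content] of Object.entries(files)) {
    content.split("\n").forEach((text, index) => {
      if (STALE_PATTERN.test(text) && !text.includes(ALLOW_MARKER)) {
        hits.push({ file, line: index + 1, text });
      }
    });
  }
  return hits;
}
```

A test or CI step could fail whenever this returns a non-empty list, which turns the sweep from a one-off manual step into a guard against future drift.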

Why this matters

These QA gates are long-running and expensive enough that they need to be unambiguous. A green parity gate should mean “current OpenAI candidate vs current Anthropic Opus baseline passed,” not “GPT-5.5 passed against an older Opus 4.6 baseline while several labels still say GPT-5.4/Opus 4.6.”

A follow-up PR can be mostly mechanical if it updates the model constants, report labels, mock model fixtures, scenario metadata, artifact names, and preflight timeout behavior together.

extent analysis

TL;DR

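Update the QA lab parity gate and related tests to reflect the current target comparison, openai/gpt-5.5 candidate vs anthropic/claude-opus-4-7 baseline, and harden the --preflight path so its first cold agent run does not fail on the 25000ms gateway timeout.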

Guidance

  • Update .github/workflows/parity-gate.yml to use anthropic/claude-opus-4-7 for the baseline lane and rename job/workflow text from Opus 4.6 to Opus 4.7.
  • Update QA lab parity/reporting code to reflect the current models and remove stale strings.
  • Update mock provider fixtures/tests to advertise claude-opus-4-7 and ensure compatibility aliases are only kept where intentionally needed.
  • Update model scenario metadata/docs to move requiredModel and expected summaries from claude-opus-4-6 to claude-opus-4-7.

Example

rg "gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4" \
  .github/workflows \
  extensions/qa-lab \
  qa/scenarios \
  test/helpers/auto-reply

This command can be used to sweep for stale strings before opening the PR.

Notes

The implementation checklist in the issue body covers the necessary updates. All user-facing labels, report labels, mock models, scenario docs, artifact paths, and workflow names should either reflect the current models or be renamed to generic stable names.

Recommendation

Apply the fix by updating the QA lab parity gate and related tests to reflect the current target comparison. A green gate then unambiguously means that the current OpenAI candidate passed against the current Anthropic Opus baseline.
