openclaw: Update QA lab parity gate for GPT-5.5 vs Opus 4.7 and harden preflight [2 comments, 2 participants]

openclaw • 2026-04-29 09:41:07

GitHub stats

openclaw/openclaw#74259 • Fetched 2026-04-30 06:26:34

Author: electricsheepops

Participants: 100yenadmin, electricsheepops

Timeline (top): commented ×2, cross-referenced ×2, closed ×1

Root Cause

These QA gates are long-running and expensive enough that they need to be unambiguous. A green parity gate should mean “current OpenAI candidate vs current Anthropic Opus baseline passed,” not “GPT-5.5 passed against an older Opus 4.6 baseline while several labels still say GPT-5.4/Opus 4.6.”

A follow-up PR can be mostly mechanical if it updates the model constants, report labels, mock model fixtures, scenario metadata, artifact names, and preflight timeout behavior together.

Problem

The QA lab parity gate and related tests are only partially updated for the current frontier model targets as of 2026-04-29. The OpenAI candidate lane has moved to openai/gpt-5.5, but several workflow names, artifact paths, mock-model fixtures, reports, scenario docs, and the Anthropic baseline still encode GPT-5.4 / Opus 4.6 era assumptions.

That makes the expensive QA/parity gate harder to trust: it can be green while still validating the old Opus baseline, and stale labels like gpt54 / opus46 make it unclear what actually ran.

Current upstream evidence

Checked against current origin/main on 2026-04-29 at fa8a7d70ee (docs: fix clawsweeper skill metadata).

Parity workflow is mixed current/stale

.github/workflows/parity-gate.yml currently has:

- job/workflow text still referring to OpenAI / Opus 4.6
- OPENCLAW_CI_OPENAI_MODEL defaulting to openai/gpt-5.5
- candidate lane still using --alt-model openai/gpt-5.4-alt
- candidate output dir still .artifacts/qa-e2e/gpt54
- baseline lane still using --model anthropic/claude-opus-4-6
- baseline lane still using --alt-model anthropic/claude-sonnet-4-6
- baseline output dir still .artifacts/qa-e2e/opus46
- parity report baseline label still anthropic/claude-opus-4-6
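
The candidate lane above pairs openai/gpt-5.5 with openai/gpt-5.4-alt. A minimal sketch of deriving the mock alternate from the primary model so the two cannot drift apart follows. Both function names are hypothetical and are not taken from the OpenClaw codebase, and the "-alt" suffix rule is an assumption inferred from the model ids quoted in this issue.

```typescript
// Hypothetical helper; this function is not part of the OpenClaw codebase.
// Derives the mock alternate model id from the primary model id, assuming
// the alternate is always the primary id plus an "-alt" suffix.
function deriveMockAltModel(primaryModel: string): string {
  return primaryModel.endsWith("-alt") ? primaryModel : `${primaryModel}-alt`;
}

// Returns true when altModel is the "-alt" variant of primaryModel.
// A workflow lint step could use this to flag mixed-version lanes.
function altMatchesPrimary(primaryModel: string, altModel: string): boolean {
  return altModel === deriveMockAltModel(primaryModel);
}
```

With this rule, the current workflow pairing (openai/gpt-5.5 with openai/gpt-5.4-alt) would be reported as a mismatch, while openai/gpt-5.5 with openai/gpt-5.5-alt would pass.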

QA lab defaults are partially updated

extensions/qa-lab/src/providers/live-frontier/catalog.ts now defaults the primary live frontier model to:

openai/gpt-5.5

extensions/qa-lab/src/providers/live-frontier/index.ts and model-selection.runtime.ts also know about openai/gpt-5.5.

However, the related parity/reporting surfaces still carry Opus 4.6 assumptions:

- extensions/qa-lab/src/providers/live-frontier/parity.ts
- extensions/qa-lab/src/providers/live-frontier/character-eval.ts
- extensions/qa-lab/src/agentic-parity-report.test.ts
- extensions/qa-lab/src/providers/mock-openai/server.ts

Provider support for Opus 4.7 already exists elsewhere

The Anthropic provider layer already includes anthropic/claude-opus-4-7 / claude-opus-4.7 mappings in core provider code. This issue is therefore about QA-lab/parity wiring drift, not missing Anthropic provider support.

Scenario metadata still targets Opus 4.6

The Anthropic Opus live smoke scenarios still describe and require Opus 4.6:

- qa/scenarios/models/anthropic-opus-api-key-smoke.md
- qa/scenarios/models/anthropic-opus-setup-token-smoke.md

These should be moved to Opus 4.7, or made family/parameter driven if exact latest-model names are expected to change frequently.
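
To illustrate the family/parameter-driven option, here is a minimal sketch of a requirement check that accepts either an exact model id or a provider family. The ModelRequirement type and both functions are hypothetical; the actual shape of requiredModel in the scenario files is not shown in this issue, so treat this as one possible contract rather than the existing one.

```typescript
// Hypothetical types and helpers; none of these names come from the OpenClaw repo.
type ModelRequirement =
  | { kind: "exact"; model: string }
  | { kind: "family"; provider: string; family: string };

// Splits "provider/name" into its parts; returns null for malformed refs.
function parseModelRef(ref: string): { provider: string; name: string } | null {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) return null;
  return { provider: ref.slice(0, slash), name: ref.slice(slash + 1) };
}

function satisfiesRequirement(modelRef: string, req: ModelRequirement): boolean {
  if (req.kind === "exact") return modelRef === req.model;
  const parsed = parseModelRef(modelRef);
  if (parsed === null) return false;
  // Family match: same provider, and the model name is the family itself
  // or the family followed by a "-<version>" suffix.
  return (
    parsed.provider === req.provider &&
    (parsed.name === req.family || parsed.name.startsWith(req.family + "-"))
  );
}

// Example requirement: any Anthropic Opus model, regardless of version.
const opusFamily: ModelRequirement = {
  kind: "family",
  provider: "anthropic",
  family: "claude-opus",
};
```

A family-level requirement keeps the smoke scenarios valid across model bumps, at the cost of no longer pinning which Opus version was exercised. The report would then need to record the resolved model name explicitly.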

What still works

The harness is not dead. A recent upstream parity-gate run succeeded with the current mixed configuration:

- Run: https://github.com/openclaw/openclaw/actions/runs/25100029375
- Candidate: openai/gpt-5.5
- Baseline: anthropic/claude-opus-4-6
- Result: candidate passed 12/12, baseline passed 12/12, parity verdict passed

Local focused validation also passed against current main:

pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts \
  extensions/qa-lab/src/providers/mock-openai/server.test.ts \
  extensions/qa-lab/src/qa-gateway-config.test.ts \
  extensions/qa-lab/src/suite-planning.test.ts \
  extensions/qa-lab/src/agentic-parity-report.test.ts \
  extensions/qa-lab/src/scenario-catalog.test.ts

Result: 5 files, 132 tests passed.

Full extension QA unit lane also passed:

pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts

Result: 63 files, 524 tests passed.

Private QA runtime build also passed:

OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm build

What needs hardening

The quick preflight path did not reliably work locally. Running mock preflight twice with isolated OpenClaw state failed on approval-turn-tool-followthrough with:

gateway timeout after 25000ms

Representative command shape:

env \
  HOME=/tmp/openclaw-origin-main-qa-home \
  OPENCLAW_HOME=/tmp/openclaw-origin-main-qa-home \
  OPENCLAW_STATE_DIR=/tmp/openclaw-origin-main-qa-state \
  OPENCLAW_CONFIG_PATH=/tmp/openclaw-origin-main-qa-home/openclaw.json \
  OPENCLAW_BUILD_PRIVATE_QA=1 \
  OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 \
  OPENCLAW_QA_SUITE_PROGRESS=1 \
  OPENAI_API_KEY= \
  ANTHROPIC_API_KEY= \
  OPENCLAW_LIVE_OPENAI_KEY= \
  OPENCLAW_LIVE_ANTHROPIC_KEY= \
  OPENCLAW_LIVE_GEMINI_KEY= \
  OPENCLAW_LIVE_SETUP_TOKEN_VALUE= \
  pnpm openclaw qa suite \
    --provider-mode mock-openai \
    --parity-pack agentic \
    --concurrency 1 \
    --model openai/gpt-5.5 \
    --alt-model openai/gpt-5.5-alt \
    --preflight \
    --output-dir .artifacts/qa-e2e/preflight

This does not look like an unknown-model failure. The full CI parity suite runs approval-turn-tool-followthrough later in the 12-scenario pack after the gateway/agent path is already warm, and it passes there. The preflight path runs it cold, with a short timeout, so it is not a reliable quick “does this even work?” sentinel.
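
The cold-versus-warm difference described above could be handled by giving only the first agent run a larger budget. The sketch below assumes the 25000 ms figure from the error message is the normal per-run timeout; the constant names, the function name, and the multiplier are hypothetical and not measured values.

```typescript
// Assumption: 25_000 ms is the regular per-run timeout, taken from the
// "gateway timeout after 25000ms" error quoted above.
const DEFAULT_AGENT_TIMEOUT_MS = 25_000;

// Assumption: an arbitrary illustrative multiplier, not a tuned value.
const COLD_START_MULTIPLIER = 3;

// Hypothetical helper: returns a longer timeout for the first (cold) agent
// run and the regular timeout once at least one run has completed.
function agentRunTimeoutMs(completedAgentRuns: number): number {
  return completedAgentRuns === 0
    ? DEFAULT_AGENT_TIMEOUT_MS * COLD_START_MULTIPLIER
    : DEFAULT_AGENT_TIMEOUT_MS;
}
```

This keeps warm scenarios on the existing budget, so a genuine hang later in the pack still fails as quickly as it does today.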

Desired target state

The QA lab should clearly and truthfully validate the current target comparison:

openai/gpt-5.5 candidate vs anthropic/claude-opus-4-7 baseline

All user-facing labels, report labels, mock models, scenario docs, artifact paths, and workflow names should either use the current exact model names or be renamed to generic stable names like openai-candidate / anthropic-baseline to avoid repeated drift.
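
As a sketch of the generic-name option, the lane name can be the stable key while the exact model id is carried as data. The ParityLane type, the LaneConfig shape, and the laneConfig function are hypothetical; only the .artifacts/qa-e2e/ prefix and the two lane names come from this issue.

```typescript
// Hypothetical types; lane names follow the generic names suggested above.
type ParityLane = "openai-candidate" | "anthropic-baseline";

interface LaneConfig {
  lane: ParityLane;
  model: string;
  outputDir: string;
  reportLabel: string;
}

// Builds lane settings where paths depend only on the stable lane name,
// while the report label always shows the exact model that ran.
function laneConfig(lane: ParityLane, model: string): LaneConfig {
  return {
    lane,
    model,
    outputDir: `.artifacts/qa-e2e/${lane}`,
    reportLabel: `${model} (${lane})`,
  };
}
```

Under this layout a model bump changes one argument, and artifact paths such as .artifacts/qa-e2e/anthropic-baseline never need renaming.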

Implementation checklist

Update .github/workflows/parity-gate.yml:
- rename job/workflow text from Opus 4.6 to Opus 4.7
- use anthropic/claude-opus-4-7 for the baseline lane
- decide whether anthropic/claude-sonnet-4-6 remains the correct alternate model or whether the alternate should also move
- replace openai/gpt-5.4-alt with openai/gpt-5.5-alt, or derive the mock alt model from the primary model
- rename .artifacts/qa-e2e/gpt54 and .artifacts/qa-e2e/opus46 to current or generic names
Update QA lab parity/reporting code:
- extensions/qa-lab/src/providers/live-frontier/parity.ts
- extensions/qa-lab/src/providers/live-frontier/character-eval.ts
- extensions/qa-lab/src/agentic-parity-report.test.ts
- any report title, baseline label, summary, fixture, and expected snapshot text that still says Opus 4.6 or GPT-5.4
Update mock provider fixtures/tests:
- extensions/qa-lab/src/providers/mock-openai/server.ts
- advertise claude-opus-4-7 in mock model lists where appropriate
- keep compatibility aliases only where intentionally needed
- ensure the mock provider variant resolver still maps openai/* and anthropic/* by provider family rather than brittle exact model strings
Update model scenario metadata/docs:
- qa/scenarios/models/anthropic-opus-api-key-smoke.md
- qa/scenarios/models/anthropic-opus-setup-token-smoke.md
- move requiredModel and expected summaries from claude-opus-4-6 to claude-opus-4-7, or introduce a parameterized/family-level requirement if that is the preferred QA contract
Sweep for stale strings before opening the PR:

rg "gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4" \
  .github/workflows \
  extensions/qa-lab \
  qa/scenarios \
  test/helpers/auto-reply

Harden --preflight:
- increase the first cold agent-run timeout for preflight, or
- add a lightweight warmup call before approval-turn-tool-followthrough, or
- make the gateway child-call timeout retryable for QA preflight when the gateway is healthy but the first agent RPC times out
- keep the preflight cheap; the point is to avoid paying for the full long parity gate just to discover obvious breakage
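
The retry option in the list above can be expressed as a small predicate. This is a sketch under stated assumptions: the PreflightFailure shape and the function name are hypothetical, the message pattern is inferred from the single error string quoted in this issue, and how the gateway health flag would be obtained is left open.

```typescript
// Hypothetical failure record; the real harness error shape is not shown
// in this issue.
interface PreflightFailure {
  message: string;
  attempt: number; // 1-based attempt number that just failed
  gatewayHealthy: boolean; // result of a separate health probe (assumed)
}

// Pattern inferred from the quoted error "gateway timeout after 25000ms".
const GATEWAY_TIMEOUT_PATTERN = /gateway timeout after \d+ms/;

// Retry only when the gateway is healthy, the failure is a gateway timeout,
// and the attempt budget is not exhausted. Anything else fails immediately.
function shouldRetryPreflight(failure: PreflightFailure, maxAttempts = 2): boolean {
  return (
    failure.gatewayHealthy &&
    failure.attempt < maxAttempts &&
    GATEWAY_TIMEOUT_PATTERN.test(failure.message)
  );
}
```

Capping attempts at two keeps the preflight cheap: a cold first call gets one more chance, while a persistently broken gateway still fails fast.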

Acceptance criteria

- rg "gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4" only returns intentional compatibility aliases or historical comments with explicit justification.
- pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts passes.
- OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm build passes.
- Mock preflight passes with GPT-5.5 candidate naming:

pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --parity-pack agentic \
  --concurrency 1 \
  --model openai/gpt-5.5 \
  --alt-model openai/gpt-5.5-alt \
  --preflight

- The full parity gate runs candidate openai/gpt-5.5 against baseline anthropic/claude-opus-4-7 and produces a report that truthfully names those models.
- The expensive 12-scenario parity suite remains mock-provider compatible and does not require live API keys in mock mode.
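
The first criterion allows stale strings only where they are explicitly justified. One way to make that checkable is an inline marker that exempts a line, sketched below. The marker text and the function name are hypothetical; the pattern is the same one used in the rg sweep.

```typescript
// Same alternation as the rg sweep in this issue.
const STALE_PATTERN = /gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4/;

// Hypothetical marker: a line containing it is treated as an intentional,
// justified reference (for example a compatibility alias).
const ALLOW_MARKER = "qa-stale-ok:";

// Returns the lines that mention a stale model name without a justification.
function unjustifiedStaleLines(lines: string[]): string[] {
  return lines.filter(
    (line) => STALE_PATTERN.test(line) && !line.includes(ALLOW_MARKER),
  );
}
```

A CI step could fail when this list is non-empty, which turns the sweep from a manual pre-PR habit into an enforced check.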

Why this matters

A follow-up PR can be mostly mechanical if it updates the model constants, report labels, mock model fixtures, scenario metadata, artifact names, and preflight timeout behavior together.

Extended analysis

TL;DR

Update the QA lab parity gate and related tests to reflect the current target comparison: openai/gpt-5.5 candidate vs anthropic/claude-opus-4-7 baseline.

Guidance

Update .github/workflows/parity-gate.yml to use anthropic/claude-opus-4-7 for the baseline lane and rename job/workflow text from Opus 4.6 to Opus 4.7.
Update QA lab parity/reporting code to reflect the current models and remove stale strings.
Update mock provider fixtures/tests to advertise claude-opus-4-7 and ensure compatibility aliases are only kept where intentionally needed.
Update model scenario metadata/docs to move requiredModel and expected summaries from claude-opus-4-6 to claude-opus-4-7.

Example

rg "gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4" \
  .github/workflows \
  extensions/qa-lab \
  qa/scenarios \
  test/helpers/auto-reply

This command can be used to sweep for stale strings before opening the PR.

Notes

The implementation checklist in the issue body covers the necessary updates. All user-facing labels, report labels, mock models, scenario docs, artifact paths, and workflow names should either reflect the current models or be renamed to generic stable names.

Recommendation

Apply the fix by updating the QA lab parity gate and related tests together so they reflect the current target comparison. A green gate then unambiguously means that the current OpenAI candidate passed against the current Anthropic Opus baseline.
