openclaw - ✅(Solved) Fix [GPT 5.4 v3 Phase 3.D] qa-lab parity scenarios + verification harness updates [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#67521 · Fetched 2026-04-17 08:30:20
Comments: 0 · Participants: 1 · Timeline events: 3 · Reactions: 0
Timeline (top): cross-referenced ×2, referenced ×1

Fix Action

Fixed

PR fix notes

PR #67540: feat(qa-lab): GPT-5.4 parity scenarios — tool use, act-don't-ask, injection, plan [Phase 3.D]

Description (problem / solution / changelog)

Phase 3.D — qa-lab parity scenarios

Tracking: #66345 | Issue: #67521

5 scenario files covering the core GPT-5.4 parity behaviors. Each uses the established qa/scenarios/*.md format.

| Scenario | Tests | Covers |
| --- | --- | --- |
| gpt54-mandatory-tool-use | Time/arithmetic/disk queries use exec, not memory | #67512 tool enforcement |
| gpt54-act-dont-ask | Port/OS check acts immediately, no clarification | #67512 act-don't-ask |
| gpt54-injection-scan | SOUL.md with injection → BLOCKED | #67512 injection scan |
| gpt54-cancelled-status | Failed step marked cancelled, revised step added | #67514 cancelled status |
| gpt54-plan-mode-default-off | Default GPT-5 does NOT enter plan mode | Hermes parity preserved |

Depends on

  • #67512 (prompt stack) for tool enforcement + injection scan scenarios
  • #67514 (task-system parity) for cancelled status scenario
  • #67538 (plan mode) for plan-mode-default-off scenario

Changed files

  • qa/scenarios/gpt54-act-dont-ask.md (added, +53/-0)
  • qa/scenarios/gpt54-cancelled-status.md (added, +51/-0)
  • qa/scenarios/gpt54-injection-scan.md (added, +55/-0)
  • qa/scenarios/gpt54-mandatory-tool-use.md (added, +54/-0)
  • qa/scenarios/gpt54-plan-mode-default-off.md (added, +55/-0)

GPT 5.4 Enhancement v3 — Phase 3.D

Tracking: #66345
Priority: P1 (closes the verification loop on the entire parity sprint)
Depends on: #67512, #67518, #67514 (final sprint PRs) + #67519, #67520 (Phase 3 PRs)
Builds on: #64441 (first-wave parity harness), #65664 (parity proof slice)

Problem

The parity benchmark harness (#64441) exists but does not yet have scenarios covering the Phase 3 features. Without these scenarios, we cannot measure whether GPT-5.4 on OpenClaw has reached Opus 4.6 quality.

Scope

New scenario files (under qa-lab/agentic/scenarios/)

| Scenario | Tests | Covers |
| --- | --- | --- |
| gpt54-mandatory-tool-use.scenario.yaml | "What time is it?" uses exec, not memory | #67512 tool enforcement |
| gpt54-act-dont-ask.scenario.yaml | "Is port 8080 open?" checks local, doesn't ask | #67512 act-don't-ask |
| gpt54-tool-retry.scenario.yaml | First grep misses, retries with broader query | #67512 tool persistence |
| gpt54-injection-scan.scenario.yaml | SOUL.md with injection → BLOCKED | #67512 injection scan |
| gpt54-cancelled-status.scenario.yaml | Failed step marked cancelled, revised step added | #67514 cancelled |
| gpt54-compaction-hydration.scenario.yaml | Long task survives compaction | #67514 hydration |
| gpt54-merge-mode.scenario.yaml | Partial update preserves other steps | #67514 merge |
| gpt54-plan-rendering.scenario.yaml | Checklist visible in channel transcript | #67519 rendering |
| gpt54-plan-mode-approve.scenario.yaml | Plan → approve → execute cycle | #67520 plan mode |
| gpt54-plan-mode-reject.scenario.yaml | Plan → reject → abort | #67520 plan mode |
| gpt54-default-no-plan-mode.scenario.yaml | Default run does NOT enter plan mode | Hermes parity preserved |
| gpt54-gemini-directives.scenario.yaml | Gemini uses verify-first, non-interactive flags | #67518 Gemini |
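
To make the first two rows concrete, here is a minimal sketch of the kind of assertion such scenarios imply. The issue does not show the scenario schema or the harness API, so `RunRecord`, `ToolCall`, `used_tool`, and `asked_for_clarification` are hypothetical names, and the clarification check is a crude heuristic rather than the harness's actual logic.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str        # tool identifier, e.g. "exec" (assumed naming)
    arguments: str   # raw command or query text


@dataclass
class RunRecord:
    """Hypothetical record of one agent run; not the real qa-lab type."""
    prompt: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_answer: str = ""


def used_tool(run: RunRecord, tool_name: str) -> bool:
    """True if the run invoked the named tool at least once."""
    return any(call.name == tool_name for call in run.tool_calls)


def asked_for_clarification(run: RunRecord) -> bool:
    """Crude heuristic: no tool calls and the answer ends with a question."""
    return not run.tool_calls and run.final_answer.strip().endswith("?")
```

Under these assumptions, a mandatory-tool-use check would assert `used_tool(run, "exec")`, and an act-don't-ask check would assert `not asked_for_clarification(run)`.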

Gate verdict updates

  • Update parity gate thresholds in the report layer
  • Add "user-perceived visibility" metric: scrape transcript, assert checklist rendered
  • Add cross-comparison: GPT-5.4 on OpenClaw vs Hermes on same prompts
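
The "user-perceived visibility" metric could be sketched roughly as follows. The issue does not specify how the checklist appears in the transcript, so the Markdown task-list shape matched by `CHECKLIST_LINE`, the `min_items` threshold, and both function names are assumptions for illustration only.

```python
import re

# Assumed checklist shape: "- [ ] step", "- [x] step", or "* [X] step".
CHECKLIST_LINE = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+\S")


def checklist_rendered(transcript: str, min_items: int = 2) -> bool:
    """True if the transcript shows at least `min_items` checklist lines."""
    items = [line for line in transcript.splitlines() if CHECKLIST_LINE.match(line)]
    return len(items) >= min_items


def visibility_rate(transcripts: list[str]) -> float:
    """Fraction of transcripts in which a checklist was rendered."""
    if not transcripts:
        return 0.0
    return sum(checklist_rendered(t) for t in transcripts) / len(transcripts)
```

Matching on rendered lines rather than on internal plan state keeps the metric tied to what a user would actually see in the channel.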

Verification

The meta-verification: after this PR, the qa-lab gate verdict should move from "GPT-5.4 lags Opus 4.6 on visibility/recovery" to "GPT-5.4 matches Opus 4.6 on visibility/recovery".
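
One way to picture the verdict flip is a per-metric comparison against a baseline. This is only a sketch: the metric names, the score scale, the `tolerance` value, and the `gate_verdict` function are assumptions, since the issue does not describe how the report layer computes its verdict.

```python
def gate_verdict(
    candidate: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.02,
) -> str:
    """Compare candidate scores to baseline scores, metric by metric.

    A metric lags when the candidate falls more than `tolerance` below
    the baseline; a metric missing from the candidate counts as 0.0.
    """
    lagging = [
        metric
        for metric, base in baseline.items()
        if candidate.get(metric, 0.0) < base - tolerance
    ]
    if lagging:
        return "GPT-5.4 lags Opus 4.6 on " + "/".join(lagging)
    return "GPT-5.4 matches Opus 4.6 on " + "/".join(baseline)
```

With a hypothetical baseline of `{"visibility": 0.9, "recovery": 0.85}`, a candidate scoring well below both values yields the "lags" verdict, and one within tolerance on both yields the "matches" verdict.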

Estimated size: ~300 LoC (scenario YAML + fixtures + gate updates)

extent analysis

TL;DR

Create new scenario files under qa-lab/agentic/scenarios/ to cover Phase 3 features and update gate verdict thresholds to measure GPT-5.4's parity with Opus 4.6.

Guidance

  • Create the 12 new scenario files listed in the scope section, each testing a specific feature or behavior of GPT-5.4 on OpenClaw.
  • Update the parity gate thresholds in the report layer to reflect the new scenarios and metrics.
  • Add a "user-perceived visibility" metric to the gate verdict updates, which scrapes the transcript and asserts that the checklist is rendered.
  • Perform a cross-comparison between GPT-5.4 on OpenClaw and Hermes on the same prompts to ensure parity.
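
The cross-comparison in the last bullet could be organized as a per-prompt bucket report. This sketch assumes each system's results are available as a mapping from prompt ID to pass/fail; that data shape and the `cross_compare` function are hypothetical, not part of the issue.

```python
def cross_compare(
    openclaw: dict[str, bool],
    hermes: dict[str, bool],
) -> dict[str, list[str]]:
    """Bucket prompts run on both systems by which system passed."""
    shared = sorted(openclaw.keys() & hermes.keys())
    return {
        "both_pass": [p for p in shared if openclaw[p] and hermes[p]],
        "only_openclaw": [p for p in shared if openclaw[p] and not hermes[p]],
        "only_hermes": [p for p in shared if hermes[p] and not openclaw[p]],
        "both_fail": [p for p in shared if not openclaw[p] and not hermes[p]],
    }
```

Prompts that land in `only_hermes` are the ones that would indicate a parity gap; prompts present on only one side are ignored, so both runs should use the same prompt set.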

Example

No code snippet is provided as the issue does not contain specific code details.

Notes

The estimated size of the changes is approximately 300 lines of code, including scenario YAML files, fixtures, and gate updates.

Recommendation

Implement the change by creating the new scenario files and updating the gate verdict thresholds to measure GPT-5.4's parity with Opus 4.6. This closes the verification loop on the entire parity sprint.
