openclaw - ✅(Solved) Fix [GPT 5.4 v3 Phase 3.D] qa-lab parity scenarios + verification harness updates [1 pull requests, 1 participants]

openclaw · 2026-04-16 05:12:28

GitHub stats

openclaw/openclaw#67521•Fetched 2026-04-17 08:30:20

Author

100yenadmin

Participants

100yenadmin

Timeline (top)

cross-referenced ×2, referenced ×1

Fix Action

Fixed

Fixed by PR: feat(qa-lab): GPT-5.4 parity scenarios — tool use, act-don't-ask, injection, plan [Phase 3.D] (https://github.com/openclaw/openclaw/pull/67540)

PR fix notes

PR #67540: feat(qa-lab): GPT-5.4 parity scenarios — tool use, act-don't-ask, injection, plan [Phase 3.D]

Repository: openclaw/openclaw
Author: 100yenadmin
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/67540

Description (problem / solution / changelog)

Phase 3.D — qa-lab parity scenarios

Tracking: #66345 | Issue: #67521

Five scenario files cover the core GPT-5.4 parity behaviors. Each uses the established `qa/scenarios/*.md` format.

Scenario	Tests	Covers
`gpt54-mandatory-tool-use`	Time/arithmetic/disk queries use exec, not memory	#67512 tool enforcement
`gpt54-act-dont-ask`	Port/OS check acts immediately, no clarification	#67512 act-don't-ask
`gpt54-injection-scan`	SOUL.md with injection → BLOCKED	#67512 injection scan
`gpt54-cancelled-status`	Failed step marked cancelled, revised step added	#67514 cancelled status
`gpt54-plan-mode-default-off`	Default GPT-5 does NOT enter plan mode	Hermes parity preserved

Depends on

#67512 (prompt stack) for tool enforcement + injection scan scenarios
#67514 (task-system parity) for cancelled status scenario
#67538 (plan mode) for plan-mode-default-off scenario
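The dependency list above implies that each scenario is only meaningful once its upstream change has landed. A minimal sketch of that gating follows. The mapping is transcribed from the "Covers" column and the "Depends on" list in this description; the dictionary name and the `runnable_scenarios` helper are hypothetical and not part of qa-lab.

```python
# Hypothetical scenario -> upstream issue mapping, transcribed from this PR
# description. Neither the name nor the structure comes from the qa-lab code.
SCENARIO_DEPENDENCIES = {
    "gpt54-mandatory-tool-use": 67512,
    "gpt54-act-dont-ask": 67512,
    "gpt54-injection-scan": 67512,
    "gpt54-cancelled-status": 67514,
    "gpt54-plan-mode-default-off": 67538,
}


def runnable_scenarios(landed_issues: set[int]) -> list[str]:
    """Return the scenarios whose upstream dependency has already landed."""
    return sorted(
        name
        for name, issue in SCENARIO_DEPENDENCIES.items()
        if issue in landed_issues
    )


if __name__ == "__main__":
    # With only the prompt-stack change landed, three scenarios can run.
    print(runnable_scenarios({67512}))
```

Keeping the mapping as data makes it easy to skip, rather than fail, scenarios whose prerequisite has not landed yet.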

Changed files

qa/scenarios/gpt54-act-dont-ask.md (added, +53/-0)
qa/scenarios/gpt54-cancelled-status.md (added, +51/-0)
qa/scenarios/gpt54-injection-scan.md (added, +55/-0)
qa/scenarios/gpt54-mandatory-tool-use.md (added, +54/-0)
qa/scenarios/gpt54-plan-mode-default-off.md (added, +55/-0)

GPT 5.4 Enhancement v3 — Phase 3.D

Tracking: #66345
Priority: P1 (closes the verification loop on the entire parity sprint)
Depends on: #67512, #67518, #67514 (final sprint PRs) + #67519, #67520 (Phase 3 PRs)
Builds on: #64441 (first-wave parity harness), #65664 (parity proof slice)

Problem

The parity benchmark harness (#64441) exists but does not yet have scenarios covering the Phase 3 features. Without them, we cannot measure whether GPT-5.4 on OpenClaw has reached Opus 4.6 quality.

Scope

New scenario files (under `qa-lab/agentic/scenarios/`)

Scenario	Tests	Covers
`gpt54-mandatory-tool-use.scenario.yaml`	"What time is it?" uses exec, not memory	#67512 tool enforcement
`gpt54-act-dont-ask.scenario.yaml`	"Is port 8080 open?" checks local, doesn't ask	#67512 act-don't-ask
`gpt54-tool-retry.scenario.yaml`	First grep misses, retries with broader query	#67512 tool persistence
`gpt54-injection-scan.scenario.yaml`	SOUL.md with injection → BLOCKED	#67512 injection scan
`gpt54-cancelled-status.scenario.yaml`	Failed step marked cancelled, revised step added	#67514 cancelled
`gpt54-compaction-hydration.scenario.yaml`	Long task survives compaction	#67514 hydration
`gpt54-merge-mode.scenario.yaml`	Partial update preserves other steps	#67514 merge
`gpt54-plan-rendering.scenario.yaml`	Checklist visible in channel transcript	#67519 rendering
`gpt54-plan-mode-approve.scenario.yaml`	Plan → approve → execute cycle	#67520 plan mode
`gpt54-plan-mode-reject.scenario.yaml`	Plan → reject → abort	#67520 plan mode
`gpt54-default-no-plan-mode.scenario.yaml`	Default run does NOT enter plan mode	Hermes parity preserved
`gpt54-gemini-directives.scenario.yaml`	Gemini uses verify-first, non-interactive flags	#67518 Gemini
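Several rows in the table above reduce to one check: did the agent call a tool before it answered? The sketch below illustrates that assertion for the mandatory-tool-use case. The event shape (dicts with `type` and `tool` keys) is an assumption made for illustration; the issue does not describe the actual trace format the harness records.

```python
from typing import Iterable, Mapping


def used_tool_before_answer(
    events: Iterable[Mapping[str, str]], tool: str = "exec"
) -> bool:
    """Return True if `tool` is called before the first assistant message.

    The event schema here is hypothetical: each event is a mapping with a
    "type" key ("tool_call" or "assistant_message") and, for tool calls,
    a "tool" key naming the tool.
    """
    for event in events:
        if event.get("type") == "tool_call" and event.get("tool") == tool:
            return True
        if event.get("type") == "assistant_message":
            # The model answered without calling the tool first.
            return False
    return False


if __name__ == "__main__":
    trace = [
        {"type": "tool_call", "tool": "exec"},
        {"type": "assistant_message"},
    ]
    print(used_tool_before_answer(trace))
```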

Gate verdict updates

Update parity gate thresholds in the report layer
Add "user-perceived visibility" metric: scrape transcript, assert checklist rendered
Add cross-comparison: GPT-5.4 on OpenClaw vs Hermes on same prompts
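The "user-perceived visibility" metric above can be sketched as a transcript scrape. The example assumes the checklist is rendered as Markdown task-list lines (`- [ ]` / `- [x]`); the issue does not specify the rendered format, and both function names are hypothetical.

```python
import re

# Assumption: checklist items appear as Markdown task-list lines.
CHECKLIST_LINE = re.compile(r"^\s*[-*]\s+\[[ xX]\]\s+\S", re.MULTILINE)


def checklist_items(transcript: str) -> int:
    """Count rendered checklist items in a channel transcript."""
    return len(CHECKLIST_LINE.findall(transcript))


def visibility_score(transcripts: list[str]) -> float:
    """Fraction of transcripts that show at least one checklist item."""
    if not transcripts:
        return 0.0
    visible = sum(1 for t in transcripts if checklist_items(t) > 0)
    return visible / len(transcripts)


if __name__ == "__main__":
    sample = "Plan:\n- [x] read config\n- [ ] run tests\n"
    print(checklist_items(sample), visibility_score([sample, "no plan shown"]))
```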

Verification

The meta-verification: after this PR, the qa-lab gate verdict should move from "GPT-5.4 lags Opus 4.6 on visibility/recovery" to "GPT-5.4 matches Opus 4.6 on visibility/recovery".
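One way to picture the verdict flip described above is a threshold comparison between the two models' pass rates. This is a sketch only: the `tolerance` value, the function name, and the use of a single pass rate per model are assumptions, not the thresholds or logic of the real report layer.

```python
def gate_verdict(
    candidate_pass_rate: float,
    baseline_pass_rate: float,
    tolerance: float = 0.02,  # hypothetical margin, not taken from the issue
    area: str = "visibility/recovery",
) -> str:
    """Return a verdict string in the wording used by this issue."""
    if candidate_pass_rate + tolerance >= baseline_pass_rate:
        return f"GPT-5.4 matches Opus 4.6 on {area}"
    return f"GPT-5.4 lags Opus 4.6 on {area}"


if __name__ == "__main__":
    print(gate_verdict(0.80, 0.95))
    print(gate_verdict(0.95, 0.95))
```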

Estimated size: ~300 LoC (scenario YAML + fixtures + gate updates)

extent analysis

TL;DR

Create new scenario files under qa-lab/agentic/scenarios/ to cover Phase 3 features and update gate verdict thresholds to measure GPT-5.4's parity with Opus 4.6.

Guidance

Create the 12 new scenario files listed in the scope section, each testing a specific feature or behavior of GPT-5.4 on OpenClaw.
Update the parity gate thresholds in the report layer to reflect the new scenarios and metrics.
Add a "user-perceived visibility" metric to the gate verdict updates, which scrapes the transcript and asserts that the checklist is rendered.
Perform a cross-comparison between GPT-5.4 on OpenClaw and Hermes on the same prompts to ensure parity.
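The cross-comparison in the last item can be sketched as a per-prompt diff of pass/fail results. The input shape (prompt identifier mapped to a boolean result) and the function name are assumptions for illustration; the issue does not define how results are stored.

```python
def cross_compare(
    openclaw: dict[str, bool], hermes: dict[str, bool]
) -> dict[str, list[str]]:
    """Bucket prompts that both systems ran by who passed them."""
    shared = sorted(openclaw.keys() & hermes.keys())
    return {
        "both_pass": [p for p in shared if openclaw[p] and hermes[p]],
        "only_openclaw_passes": [p for p in shared if openclaw[p] and not hermes[p]],
        "only_hermes_passes": [p for p in shared if hermes[p] and not openclaw[p]],
        "both_fail": [p for p in shared if not openclaw[p] and not hermes[p]],
    }


if __name__ == "__main__":
    report = cross_compare(
        {"port-check": True, "time-query": False},
        {"port-check": True, "time-query": True},
    )
    print(report["only_hermes_passes"])
```

Restricting the comparison to shared prompts keeps the two systems on equal footing, which is the point of running them "on the same prompts".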

Example

No code snippet is provided as the issue does not contain specific code details.

Notes

The estimated size of the changes is approximately 300 lines of code, including scenario YAML files, fixtures, and gate updates.

Recommendation

Create the new scenario files and update the gate verdict thresholds so that GPT-5.4's parity with Opus 4.6 can be measured. This closes the verification loop on the entire parity sprint.
