openclaw - ✅(Solved) Fix [Codex×Pi parity Phase 5] JSONL session-replay harness [2 pull requests, 3 comments, 2 participants]

100yenadmin · 2026-05-10T08:22:07Z



GitHub stats

openclaw/openclaw#80176•Fetched 2026-05-11 03:18:01


Author

100yenadmin

Participants

100yenadmin

clawsweeper[bot]

Timeline (top)

commented ×3, cross-referenced ×3

Fix Action

Fixed

Fixed by PR: docs(qa-lab): runtime-parity gate design (Pi vs Codex harness) (https://github.com/openclaw/openclaw/pull/80179)
Fixed by PR: [qa-lab] Complete Codex vs Pi runtime parity harness phases 2-5 (https://github.com/openclaw/openclaw/pull/80323)

PR fix notes

PR #80179: docs(qa-lab): runtime-parity gate design (Pi vs Codex harness)

Repository: openclaw/openclaw
Author: 100yenadmin
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/80179

Description (problem / solution / changelog)

Summary

Adds extensions/qa-lab/transport-parity-gate.md, a design-only doc covering the Codex-vs-Pi runtime parity QA harness scoped in #80171.

The doc lifts forward the transport-parity-gate.md sketch from closed PR #78512 (which was originally tracking #78457) and expands it to include the surfaces the maintainer thread asked for:

- Runtime-parity (`pi` vs `codex` for the same model+provider): the higher-value gate now that Codex is the default for OpenAI turns
- Per-tool fixture set so "tool X breaks under codex" surfaces at tool granularity, not session-level
- Codex-plugin lifecycle stress (cold install, version pinning, install racing first turn, doctor migration safety)
- Auth-shape coverage (oauth-only, apikey-only, mixed-profiles) for the #78499 class
- Token-efficiency report: the side-by-side per-runtime cost table pash explicitly asked for
- JSONL session-replay harness for Eva's "loop 3 agents on real jsonl" ask

The doc is the shared artifact the implementing agent (and @Eva-⚡🐑 / @pash for review) work against; sub-issues #80172, #80173, #80174, #80175, #80176 are the actual implementation work.

Why this is design-only

The original #78512 was closed because its it.fails reproduction test no longer encodes the right invariant against post-#79238 main. The design doc itself, however, is still load-bearing — it's the only place the matrix shape, drift classifier, capture format, and CI wiring intent are written down. Splitting it out as a design-only PR avoids re-litigating the closure on every implementation PR and gives reviewers something to react to before code lands.

Verification

- `pnpm exec oxfmt --check --threads=1 extensions/qa-lab/transport-parity-gate.md`: clean
- No code, runtime, workflow, or test changes (pure docs)
- Markdown-only diff; refs all link to issues that exist (#80171–#80176, #74290, #79347, #78457, #78055, #78060, #78407, #78499, #79238, #74622)

Test plan

- [x] Format check passes (oxfmt)
- [x] All referenced issues exist
- [x] Design intent matches the maintainer thread (pash + Eva + ai-hpc, Yesterday)
- [ ] Maintainer review on matrix shape and per-cell capture format before Phase 1 (#80172) starts implementation

References

- RFC + tracking: #80171
- Sub-issues:
  - Phase 1 (Runtime axis): #80172
  - Phase 2 (Per-tool fixtures): #80173
  - Phase 3 (Codex-plugin lifecycle): #80174
  - Phase 4 (Token-efficiency report): #80175
  - Phase 5 (JSONL replay): #80176
- Sibling model-axis parity: #74290 (closed) → #79347 (in flight)
- Original transport-parity proposal: #78457
- Closed PR with the original draft of this doc: #78512

Changed files

extensions/qa-lab/transport-parity-gate.md (added, +148/-0)

PR #80323: [qa-lab] Complete Codex vs Pi runtime parity harness phases 2-5

Repository: openclaw/openclaw
Author: 100yenadmin
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/80323

Description (problem / solution / changelog)

Summary

Adds the Codex-vs-Pi runtime parity QA harness across extensions/qa-lab, including runtime-pair execution, first-hour/depth suite selectors, harness-prompt parity, token-efficiency reporting, tool-default fixtures, JSONL replay scaffolding, and release-check wiring.

This update also corrects the tool-defaults mock lane so the harness matches Codex app-server architecture:

- Codex-native workspace tools (`read`, `write`, `edit`, `apply_patch`, `exec`, `process`, `update_plan`) are no longer expected to appear as duplicate OpenClaw dynamic tools.
- OpenClaw integration tools (`image_generate`, sessions, web, etc.) remain dynamic-tool parity rows and are tracked separately from Codex-native behavior rows.
- Optional/profile/plugin-dependent tools stay report-only unless explicitly enabled.
- Mock provider planned tool calls are captured as provider-plan diagnostics, not as runtime transcript tool evidence.
- Tool coverage reports now show bucket, expected layer, required/report-only status, product impact, QA impact, and action.
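The bucketing described above can be pictured with a small classifier. This is a minimal sketch, not the code in `tool-coverage-report.ts`: the bucket labels, the `classifyTool` helper, and the one-entry dynamic-tool set are hypothetical, and only the tool names are taken from the PR description.

```typescript
// Hypothetical bucket labels; the PR does not show the real identifiers.
type ToolBucket = "codex-native-workspace" | "openclaw-dynamic" | "optional";

type ToolExpectation = { tool: string; bucket: ToolBucket; required: boolean };

// Tool names listed in the PR description as Codex-native workspace tools.
const CODEX_NATIVE_WORKSPACE_TOOLS: ReadonlySet<string> = new Set<string>([
  "read", "write", "edit", "apply_patch", "exec", "process", "update_plan",
]);

// Illustrative subset only: the PR names image_generate plus "sessions, web, etc.".
const OPENCLAW_DYNAMIC_TOOLS: ReadonlySet<string> = new Set<string>(["image_generate"]);

function classifyTool(
  tool: string,
  explicitlyEnabled: ReadonlySet<string> = new Set<string>(),
): ToolExpectation {
  if (CODEX_NATIVE_WORKSPACE_TOOLS.has(tool)) {
    // Owned by Codex itself, so no duplicate OpenClaw dynamic-tool row is expected.
    return { tool, bucket: "codex-native-workspace", required: true };
  }
  if (OPENCLAW_DYNAMIC_TOOLS.has(tool)) {
    return { tool, bucket: "openclaw-dynamic", required: true };
  }
  // Everything else is report-only unless the caller opts in explicitly.
  return { tool, bucket: "optional", required: explicitlyEnabled.has(tool) };
}
```

Keeping the optional bucket report-only by default means a missing plugin-dependent tool shows up in the report without failing the gate.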

Why

OpenClaw needs a maintainer-runnable gate that compares the same scenario/model under Pi and Codex before Codex becomes the default runtime. The gate must surface real runtime drift without turning mock-provider limitations or intentional Codex-native tool ownership into production bug reports.

Verification

Passing targeted/current-scope checks:

- `pnpm test extensions/qa-lab/src/runtime-tool-fixture.test.ts extensions/qa-lab/src/runtime-parity.test.ts extensions/qa-lab/src/tool-coverage-report.test.ts extensions/qa-lab/src/runtime-suite.test.ts extensions/qa-lab/src/suite.test.ts extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/cli.test.ts`
- `pnpm tsgo:extensions:test`
- `pnpm check:test-types`
- `git diff --check`

Real Behavior Proof

Behavior or issue addressed: Corrects the runtime parity tool-defaults harness so Codex-native workspace tools are no longer falsely required as duplicate OpenClaw dynamic tools, while OpenClaw dynamic integration rows remain visible and tracked.
Real environment tested: Local OpenClaw checkout at /Volumes/LEXAR/repos/openclaw-1 on branch codex-vs-pi-runtime-parity-tools, running the real pnpm openclaw qa CLI against the embedded gateway and mock OpenAI provider after this patch.
Exact steps or command run after this patch:

OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite tool-defaults --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/runtime-tools-correction
pnpm openclaw qa tool-coverage --repo-root . --summary .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json --runtime-pair pi,codex --output .artifacts/qa-e2e/runtime-tools-correction/qa-tool-coverage-report.md
OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite openclaw-dynamic-tools --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/openclaw-dynamic-tools-correction
pnpm openclaw qa parity-report --repo-root . --runtime-axis --summary .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json --output-dir .artifacts/qa-e2e/runtime-tools-correction/parity --token-efficiency

Evidence after fix: Terminal output produced these real local artifacts: .artifacts/qa-e2e/runtime-tools-correction/qa-suite-summary.json, .artifacts/qa-e2e/runtime-tools-correction/qa-suite-report.md, .artifacts/qa-e2e/runtime-tools-correction/qa-tool-coverage-report.md, .artifacts/qa-e2e/openclaw-dynamic-tools-correction/qa-suite-summary.json, and .artifacts/qa-e2e/runtime-tools-correction/parity/qa-runtime-token-efficiency-report.md.
Observed result after fix: tool-defaults completed with 20 scenarios, 15 pass, 5 report-only skip, 0 fail. Tool coverage verdict was pass with 13 required tools, 8 Codex-native workspace tools, 5 OpenClaw dynamic integration tools, 7 optional/profile/plugin tools, and 0 failing tools. The focused openclaw-dynamic-tools suite completed with 5 report-only rows tracked under #80319. Token efficiency report verdict was pass with usage source mock-estimate.
What was not tested: Live frontier token-efficiency proof was not completed because local direct OpenAI auth is missing; optional scheduled/Testbox soak-100 proof was not completed; broad first-hour-20 remains red and is tracked in #80434.

Known Broad/Latest Blockers

- The first `first-hour-20` attempt hit a pre-suite tsdown SIGSEGV; the retry reached QA.
- `OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite --repo-root . --provider-mode mock-openai --runtime-suite first-hour-20 --runtime-pair pi,codex --output-dir .artifacts/qa-e2e/first-hour-20-correction-retry` is not green: 18 total, 6 pass, 12 fail; tracked in #80434.
- `pnpm check` fails on unrelated Discord lint: #80428.
- `pnpm test` fails on unrelated agents-core / ACPx / Mattermost shards: #80429, #80430, #80431, #67784.
- The live token-efficiency proof path renders artifacts, but local direct OpenAI auth is missing, so the attempted live run is not valid proof; tracked in #80175.
- Optional `soak-100` exists but is not scheduled/Testbox-wired; tracked in #80433.

Linked Issues

Umbrella/spec: #80171

Phase issues: #80172, #80173, #80174, #80175, #80176

Harness correction issues: #80236, #80312, #80319, #80320; #80321 is closed as fixed by this PR branch.

Fresh broad-rerun follow-ups: #80428, #80429, #80430, #80431, #80433, #80434, #67784

Changed files

.github/workflows/openclaw-release-checks.yml (modified, +115/-0)
.github/workflows/qa-live-transports-convex.yml (modified, +77/-0)
apps/shared/OpenClawKit/Sources/OpenClawProtocol/GatewayModels.swift (modified, +4/-0)
extensions/codex/src/app-server/schema-normalization-runtime-contract.test.ts (modified, +9/-4)
extensions/lmstudio/src/models.test.ts (modified, +1/-1)
extensions/qa-lab/src/agentic-parity-report.test.ts (modified, +120/-0)
extensions/qa-lab/src/agentic-parity-report.ts (modified, +218/-0)
extensions/qa-lab/src/auth-profile-fixture.ts (added, +177/-0)
extensions/qa-lab/src/cli.runtime.test.ts (modified, +282/-0)
extensions/qa-lab/src/cli.runtime.ts (modified, +416/-3)
extensions/qa-lab/src/cli.ts (modified, +175/-7)
extensions/qa-lab/src/codex-plugin-fixture.ts (added, +282/-0)
extensions/qa-lab/src/codex-plugin-lifecycle.test.ts (added, +190/-0)
extensions/qa-lab/src/gateway-child.ts (modified, +7/-0)
extensions/qa-lab/src/harness-parity.test.ts (added, +144/-0)
extensions/qa-lab/src/harness-parity.ts (added, +415/-0)
extensions/qa-lab/src/jsonl-replay.test.ts (added, +169/-0)
extensions/qa-lab/src/jsonl-replay.ts (added, +270/-0)
extensions/qa-lab/src/multipass.runtime.test.ts (modified, +11/-0)
extensions/qa-lab/src/multipass.runtime.ts (modified, +6/-0)
extensions/qa-lab/src/providers/mock-openai/server.ts (modified, +74/-3)
extensions/qa-lab/src/runtime-parity.test.ts (added, +427/-0)
extensions/qa-lab/src/runtime-parity.ts (added, +1119/-0)
extensions/qa-lab/src/runtime-suite.test.ts (added, +75/-0)
extensions/qa-lab/src/runtime-suite.ts (added, +147/-0)
extensions/qa-lab/src/runtime-tool-fixture.test.ts (added, +156/-0)
extensions/qa-lab/src/runtime-tool-fixture.ts (added, +291/-0)
extensions/qa-lab/src/runtime-tool-metadata.ts (added, +142/-0)
extensions/qa-lab/src/scenario-catalog.test.ts (modified, +10/-0)
extensions/qa-lab/src/scenario-catalog.ts (modified, +4/-0)
extensions/qa-lab/src/scenario-flow-runner.ts (modified, +1/-1)
extensions/qa-lab/src/scenario-runtime-api.test.ts (modified, +1/-0)
extensions/qa-lab/src/scenario-runtime-api.ts (modified, +3/-0)
extensions/qa-lab/src/suite-runtime-flow.ts (modified, +13/-1)
extensions/qa-lab/src/suite-summary.ts (modified, +4/-1)
extensions/qa-lab/src/suite.summary-json.test.ts (modified, +53/-0)
extensions/qa-lab/src/suite.test.ts (modified, +100/-0)
extensions/qa-lab/src/suite.ts (modified, +449/-2)
extensions/qa-lab/src/token-efficiency-report.test.ts (added, +218/-0)
extensions/qa-lab/src/token-efficiency-report.ts (added, +379/-0)
extensions/qa-lab/src/tool-coverage-report.test.ts (added, +288/-0)
extensions/qa-lab/src/tool-coverage-report.ts (added, +285/-0)
extensions/qa-lab/transport-parity-gate.md (added, +66/-0)
extensions/qqbot/src/bridge/tools/remind.test.ts (modified, +1/-1)
extensions/qqbot/src/engine/gateway/outbound-dispatch.test.ts (modified, +1/-1)
extensions/slack/src/monitor/media.test.ts (modified, +3/-3)
extensions/tavily/src/tavily-tools.test.ts (modified, +3/-1)
qa/scenarios/agents/instruction-followthrough-repo-contract.md (modified, +1/-0)
qa/scenarios/agents/subagent-fanout-synthesis.md (modified, +1/-0)
qa/scenarios/agents/subagent-handoff.md (modified, +1/-0)
qa/scenarios/agents/subagent-stale-child-links.md (modified, +1/-0)
qa/scenarios/channels/channel-chat-baseline.md (modified, +1/-0)
qa/scenarios/config/config-restart-capability-flip.md (modified, +1/-0)
qa/scenarios/jsonl-replay/plan-mode-boundaries.jsonl (added, +8/-0)
qa/scenarios/jsonl-replay/recovery-partial-session.jsonl (added, +4/-0)
qa/scenarios/jsonl-replay/repo-triage-tool-loop.jsonl (added, +7/-0)
qa/scenarios/memory/memory-recall.md (modified, +1/-0)
qa/scenarios/memory/thread-memory-isolation.md (modified, +1/-0)
qa/scenarios/models/model-switch-tool-continuity.md (modified, +1/-0)
qa/scenarios/runtime/approval-turn-tool-followthrough.md (modified, +1/-0)
qa/scenarios/runtime/auth-profile-codex-mixed-profiles.md (added, +39/-0)
qa/scenarios/runtime/auth-profile-doctor-migration-safety.md (added, +44/-0)
qa/scenarios/runtime/codex-plugin-cold-install.md (added, +42/-0)
qa/scenarios/runtime/codex-plugin-install-race.md (added, +38/-0)
qa/scenarios/runtime/codex-plugin-pinned-new.md (added, +39/-0)
qa/scenarios/runtime/codex-plugin-pinned-old.md (added, +39/-0)
qa/scenarios/runtime/compaction-retry-mutating-tool.md (modified, +1/-0)
qa/scenarios/runtime/first-hour-20-turn.md (added, +68/-0)
qa/scenarios/runtime/soak-100-turn.md (added, +68/-0)
qa/scenarios/runtime/tools/apply-patch.md (added, +54/-0)
qa/scenarios/runtime/tools/bash.md (added, +55/-0)
qa/scenarios/runtime/tools/edit.md (added, +54/-0)
qa/scenarios/runtime/tools/exec.md (added, +54/-0)
qa/scenarios/runtime/tools/fs-list.md (added, +54/-0)
qa/scenarios/runtime/tools/fs-read.md (added, +54/-0)
qa/scenarios/runtime/tools/fs-write.md (added, +54/-0)
qa/scenarios/runtime/tools/grep.md (added, +54/-0)
qa/scenarios/runtime/tools/image-generate.md (added, +55/-0)
qa/scenarios/runtime/tools/memory-add.md (added, +54/-0)
qa/scenarios/runtime/tools/memory-recall.md (added, +54/-0)
qa/scenarios/runtime/tools/message-tool.md (added, +52/-0)
qa/scenarios/runtime/tools/session-status.md (added, +54/-0)
qa/scenarios/runtime/tools/sessions-spawn.md (added, +54/-0)
qa/scenarios/runtime/tools/skill-invocation.md (added, +54/-0)
qa/scenarios/runtime/tools/tavily-extract.md (added, +53/-0)
qa/scenarios/runtime/tools/tavily-search.md (added, +53/-0)
qa/scenarios/runtime/tools/tts.md (added, +54/-0)
qa/scenarios/runtime/tools/web-fetch.md (added, +54/-0)
qa/scenarios/runtime/tools/web-search.md (added, +54/-0)
qa/scenarios/workspace/source-docs-discovery-report.md (modified, +1/-0)
scripts/deadcode-unused-files.allowlist.mjs (modified, +2/-0)
src/agents/model-runtime-policy.test.ts (added, +91/-0)
src/agents/model-runtime-policy.ts (modified, +16/-0)

Issue description

Tracking parent: #80171
Depends on: Phase 1 #80172 (drift classifier)

Goal

Implement Eva's request to "loop 3 agents on difficult scenarios for testing based on a real jsonl session history": take captured session transcripts, replay them through fresh sessions on each runtime, and diff the trajectories.

This catches the regression class where a synthetic prompt looks fine but a real long-running session — with its accumulated context, tool-call history, and edge-case branching — exposes drift.

Scope

Curated jsonl fixture set, not real-customer transcripts. Real-customer transcript ingestion is a separate concern (PII, consent, retention). This PR ships with a small curated set checked into the repo (3–5 transcripts the maintainers approve) so the harness can land without depending on a customer-data pipeline.

Concrete deliverables

Code

New extensions/qa-lab/src/jsonl-replay.ts — reads a directory of jsonl session transcripts, extracts user-turn boundaries, replays each through both runtimes via the Phase 1 orchestrator. Public API:

export type JsonlReplayInput = {
  directory: string;
  runtimePair: ["pi", "codex"];
  providerMode: "mock-openai" | "live-frontier";
};
export type JsonlReplayResult = {
  transcripts: Array<{
    transcriptPath: string;
    userTurnCount: number;
    cells: { pi: RuntimeParityCell[]; codex: RuntimeParityCell[] };  // one per turn
    drift: Array<RuntimeParityResult["drift"]>;                       // one per turn
    firstDriftAtTurn?: number;                                        // for triage
  }>;
};

New extensions/qa-lab/src/jsonl-replay.test.ts — unit tests for the user-turn extraction, plus integration test against the curated fixtures.
New qa/scenarios/jsonl-replay/<curated-name>.jsonl — 3–5 maintainer-approved fixtures. Strip PII, fix any external dependencies (URLs, channel ids).
Extend extensions/qa-lab/src/cli.ts — qa jsonl-replay --runtime-pair pi,codex --transcripts qa/scenarios/jsonl-replay.
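The user-turn extraction step could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of `jsonl-replay.ts`: the `{ role, content }` record shape, the `UserTurn` type, and the `extractUserTurns` name are hypothetical, since the issue does not show the transcript schema.

```typescript
// Hypothetical output shape for one replayable user turn.
type UserTurn = { turnIndex: number; lineNumber: number; prompt: string };

// Assumes each JSONL line is an object with a string `role` and a string `content`.
function extractUserTurns(jsonl: string): UserTurn[] {
  const turns: UserTurn[] = [];
  jsonl.split(/\r?\n/).forEach((rawLine, index) => {
    const line = rawLine.trim();
    if (line === "") return; // blank separator line
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      return; // partial transcript: skip a truncated or malformed record
    }
    if (typeof parsed !== "object" || parsed === null) return;
    const record = parsed as { role?: unknown; content?: unknown };
    if (record.role !== "user") return; // system prompts and tool-only turns are not boundaries
    if (typeof record.content !== "string" || record.content.trim() === "") return; // empty turn
    turns.push({
      turnIndex: turns.length, // 0-based position among user turns
      lineNumber: index + 1,   // 1-based line in the source file, useful for triage
      prompt: record.content.trim(),
    });
  });
  return turns;
}
```

Skipping malformed lines instead of throwing keeps a partially captured session replayable up to the point where it was cut off.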

Tests

- User-turn extraction unit test (handle edge cases: tool-only turns, system prompts, empty turns, partial transcripts).
- "First drift at turn N" reporter test: long sessions are useless if the report just says "drifted somewhere"; the report must surface the earliest divergent turn.
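A "first drift at turn N" reporter reduces to finding the earliest non-clean entry in the per-turn drift array. The sketch below assumes a hypothetical `TurnDrift` label set and 0-based turn indices; the actual Phase 1 classifier output (`RuntimeParityResult["drift"]`) is not shown in the issue.

```typescript
// Hypothetical drift labels standing in for the Phase 1 classifier output.
type TurnDrift = "none" | "text" | "tool-call" | "failure";

// Returns the 0-based index of the earliest divergent turn, or undefined if none drifted.
function firstDriftAtTurn(drift: readonly TurnDrift[]): number | undefined {
  const index = drift.findIndex((d) => d !== "none");
  return index === -1 ? undefined : index;
}

// One report line per transcript, leading with the earliest divergence.
function describeDrift(transcriptPath: string, drift: readonly TurnDrift[]): string {
  const first = firstDriftAtTurn(drift);
  if (first === undefined) {
    return `${transcriptPath}: no drift across ${drift.length} turns`;
  }
  return `${transcriptPath}: first drift at turn ${first} of ${drift.length} (${drift[first]})`;
}
```

Later drifted turns are often consequences of the first one, which is why the report leads with the earliest index.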

Acceptance criteria

- `qa jsonl-replay --runtime-pair pi,codex --transcripts <dir>` runs each jsonl through both runtimes and produces a per-transcript drift report.
- The report surfaces the earliest divergent turn per transcript (this is what makes long-session bugs triageable).
- Curated fixture set checked in (3–5 transcripts), maintainer-approved, PII-stripped, no external network dependencies.
- Integration test running the harness against the curated set in mock-openai mode passes in under 5 minutes.
- No real-customer data in the repo.
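For the `--runtime-pair pi,codex` flag, validation might be sketched as follows. The `parseRuntimePair` helper and its error messages are hypothetical and not taken from `cli.ts`; accepting the two runtimes in either order is also an assumption, since `JsonlReplayInput` types the pair as `["pi", "codex"]`.

```typescript
type RuntimeId = "pi" | "codex";

function isRuntimeId(value: string | undefined): value is RuntimeId {
  return value === "pi" || value === "codex";
}

// Parses a comma-separated flag value such as "pi,codex" into a validated pair.
function parseRuntimePair(value: string): [RuntimeId, RuntimeId] {
  const parts = value.split(",").map((part) => part.trim());
  if (parts.length !== 2) {
    throw new Error(`--runtime-pair expects two runtimes, got "${value}"`);
  }
  const [first, second] = parts;
  if (!isRuntimeId(first) || !isRuntimeId(second)) {
    throw new Error(`--runtime-pair only accepts pi and codex, got "${value}"`);
  }
  if (first === second) {
    throw new Error("--runtime-pair needs two different runtimes");
  }
  return [first, second];
}
```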

Out of scope

- Customer transcript ingestion pipeline.
- Live-mode replay against live-frontier: the curated fixtures are mock-mode.
- Three-agent loop (Eva mentioned "loop 3 agents on difficult scenarios"). For this phase the harness loops one agent through each transcript's user turns. A multi-agent variant can be a follow-up if needed.

Open questions for the maintainers

- Where should the curated fixture set live? Suggested: qa/scenarios/jsonl-replay/. Need maintainer sign-off on 3–5 specific transcripts to use.
- Should we ship a qa-lab helper to scrub a real jsonl into a fixture (PII removal, URL substitution)? Probably yes, in a follow-up; it is too much scope for this PR.
- Three-agent loop: confirm with @Eva-⚡🐑 whether "loop 3 agents" means three concurrent agents in a session, three sequential replay runs for stability sampling, or three different captured agents. The fix shape differs.
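If the scrub helper from the second question is picked up in a follow-up, its core could start from something like the sketch below. It is a deliberately narrow illustration: the `scrubLine` name and the placeholder values are hypothetical, and the two regular expressions only cover URLs and email addresses, which is far from complete PII removal.

```typescript
// Placeholder values under the reserved .invalid TLD, so replayed fixtures never resolve.
const URL_PLACEHOLDER = "https://example.invalid/redacted";
const EMAIL_PLACEHOLDER = "user@example.invalid";

// Rewrites one raw JSONL line; operates on text so the JSON structure is left untouched.
function scrubLine(line: string): string {
  return line
    // Substitute http(s) URLs first, stopping at whitespace, quotes, angle brackets, or backslashes.
    .replace(/https?:\/\/[^\s"'<>\\]+/g, URL_PLACEHOLDER)
    // Then redact anything that looks like an email address.
    .replace(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, EMAIL_PLACEHOLDER);
}
```

Names, channel ids, and free-text identifiers would still need manual review before a transcript is checked in.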

References

Tracking parent: #80171
Phase 1: #80172
Eva's request: maintainer thread (Yesterday): "Loop 3 agents on difficult scenarios for testing based on a real jsonl session history"
