
openclaw: oc llm scorer incorrectly marks correct MCQ final answers wrong when ambiguity_flag=true [1 comment, 2 participants]

openclaw · 2026-04-29 02:59:13


openclaw/openclaw#74017 · Fetched 2026-04-30 06:29:49

Author: jmystaki-create
Participants: clawsweeper[bot], jmystaki-create
Timeline: cross-referenced ×3, closed ×1, commented ×1

The oc llm benchmark scorer appears to reject MCQ answers when ambiguity_flag=true, even when the final extracted answer matches expected_answer. This causes valid final-answer outputs from native reasoning models to be marked incorrect.

This is an oc llm scorer/harness bug, not a Nemotron model bug and not a vLLM bug. The original Lane B score undercounted model quality.


oc llm scorer incorrectly marks correct MCQ final answers wrong when ambiguity_flag=true

Summary

This is an oc llm scorer/harness bug, not a Nemotron model bug and not a vLLM bug. The original Lane B score undercounted model quality.

Environment

Model served name: nemotron-3-super
Underlying model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
Runtime: vLLM OpenAI-compatible endpoint
Hardware: Dell GB10 Pro Max
Benchmark lane: B_native_reasoning
Benchmark type: OpenClaw local oc llm non-smoke benchmark

Benchmark context

Lane A baseline run: run_2026-04-28_234341_gb10_non_smoke_oc_llm_test
Lane B run: B_native_reasoning_20260429_000202
Source metrics file: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/results/B_native_reasoning_metrics_20260429_004524.json
Corrected metrics file: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/results/B_native_reasoning_corrected_metrics_20260429_005736.json
Audit file: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/reports/B_native_reasoning_scorer_audit_20260429_005736.md

Observed behavior

The Lane B metrics contained tasks marked incorrect where:

correct == false
expected_answer == extracted_final_answer
extraction_method == final_colon_letter
visible raw_content contained the correct final answer, e.g. FINAL: D or FINAL: B
ambiguity_flag == true

The audit found 9 such scoring inconsistencies and confirmed the scorer bug.
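The conditions above can be checked mechanically. The following is a minimal sketch, assuming the metrics file can be loaded as a list of per-task dictionaries that use the field names quoted in this issue; the real file layout is not shown here, and the `task_id` key and the two non-audit records are hypothetical.

```python
from typing import Any


def find_scoring_inconsistencies(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return records scored incorrect although the extracted answer matches."""
    return [
        r for r in records
        if r.get("correct") is False
        and r.get("extracted_final_answer") is not None
        and r.get("extracted_final_answer") == r.get("expected_answer")
    ]


# Illustrative records; only the first mirrors an example from the audit.
records = [
    {"task_id": "reasoning_mmlu_015", "correct": False, "expected_answer": "D",
     "extracted_final_answer": "D", "extraction_method": "final_colon_letter",
     "ambiguity_flag": True},
    {"task_id": "hypothetical_correct", "correct": True, "expected_answer": "A",
     "extracted_final_answer": "A", "extraction_method": "final_colon_letter",
     "ambiguity_flag": False},
    {"task_id": "hypothetical_wrong", "correct": False, "expected_answer": "C",
     "extracted_final_answer": "B", "extraction_method": "final_colon_letter",
     "ambiguity_flag": False},
]

flagged = find_scoring_inconsistencies(records)
print([r["task_id"] for r in flagged])
```

A genuinely wrong answer (extracted B, expected C) is not flagged; only records where the scorer contradicts its own extraction are.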

Expected behavior

For MCQ tasks:

1. If a final-answer extractor finds a valid final answer from the final visible content, and that final answer matches expected_answer, score the task correct.
2. Reasoning-trace examples such as FINAL: A, or B, or option letters in the reasoning field should not make the final answer ambiguous if the final visible answer is unambiguous.
3. Only mark ambiguous if the final answer region itself contains conflicting final answers or no reliable final answer can be extracted.
4. The scorer should distinguish:
   - final answer ambiguity
   - reasoning trace examples
   - visible content final answer
   - reasoning-field final answer
   - no extractable answer
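One way to keep these cases apart is to give each its own value instead of a single boolean flag. This is only a sketch: `ExtractionOutcome`, `ExtractionResult`, and `is_correct` are hypothetical names, not part of the oc llm scorer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExtractionOutcome(Enum):
    VISIBLE_CONTENT_FINAL_ANSWER = "visible_content_final_answer"
    REASONING_FIELD_FINAL_ANSWER = "reasoning_field_final_answer"
    FINAL_ANSWER_AMBIGUITY = "final_answer_ambiguity"
    NO_EXTRACTABLE_ANSWER = "no_extractable_answer"


@dataclass(frozen=True)
class ExtractionResult:
    outcome: ExtractionOutcome
    answer: Optional[str] = None
    # Informational only: example strings such as "FINAL: A" were seen in the
    # reasoning trace. This must not influence the correctness decision.
    reasoning_trace_examples_seen: bool = False


def is_correct(result: ExtractionResult, expected_answer: str) -> bool:
    """Only an extracted answer can be correct; trace examples are ignored."""
    answered = result.outcome in (
        ExtractionOutcome.VISIBLE_CONTENT_FINAL_ANSWER,
        ExtractionOutcome.REASONING_FIELD_FINAL_ANSWER,
    )
    return answered and result.answer == expected_answer


result = ExtractionResult(
    ExtractionOutcome.VISIBLE_CONTENT_FINAL_ANSWER, "D",
    reasoning_trace_examples_seen=True,
)
print(is_correct(result, "D"))
```

Recording reasoning-trace examples as a separate informational field keeps them visible in the output without letting them affect the score.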

Evidence / metrics

Lane A baseline: 184/210 = 87.62%
Original Lane B: 188/210 = 89.52%
Corrected final-answer Lane B: 197/210 = 93.81%
Scoring inconsistencies found: 9
Scorer bug confirmed: yes
Original reasoning score: 128/145 = 88.28%
Corrected reasoning score: 137/145 = 94.48%

The audit conclusion says the scorer rejected answers where ambiguity_flag=true even though the final extracted answer matched expected_answer.
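The reported percentages can be reproduced from the raw counts. The sketch below assumes all 9 inconsistencies fall within the 145-task reasoning subset, which is consistent with 128 + 9 = 137 but is not stated explicitly above.

```python
TOTAL_TASKS = 210
REASONING_TASKS = 145
INCONSISTENCIES = 9


def pct(correct: int, total: int) -> float:
    """Accuracy as a percentage rounded to two decimals."""
    return round(100 * correct / total, 2)


lane_a = pct(184, TOTAL_TASKS)
lane_b_original = pct(188, TOTAL_TASKS)
lane_b_corrected = pct(188 + INCONSISTENCIES, TOTAL_TASKS)
reasoning_original = pct(128, REASONING_TASKS)
reasoning_corrected = pct(128 + INCONSISTENCIES, REASONING_TASKS)

print(lane_a, lane_b_original, lane_b_corrected)
print(reasoning_original, reasoning_corrected)
```

The correction moves Lane B by 9/210, about 4.29 percentage points, which is larger than the original 1.90-point gap between Lane A and Lane B.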

Concrete examples

`reasoning_mmlu_015`

expected_answer: D
extracted_final_answer: D
extraction_method: final_colon_letter
ambiguity_flag: true
raw_content:

FINAL: D

scoring_decision: {"correct": false, "required_substrings": null}
classification from audit: ambiguous/output-format mismatch
visible_content_has_expected_answer: true

`reasoning_mmlu_pro_019`

expected_answer: B
extracted_final_answer: B
extraction_method: final_colon_letter
ambiguity_flag: true
raw_content:

FINAL: B

scoring_decision: {"correct": false, "required_substrings": null}
classification from audit: ambiguous/output-format mismatch
visible_content_has_expected_answer: true

`reasoning_mmlu_030`

expected_answer: B
extracted_final_answer: B
extraction_method: final_colon_letter
ambiguity_flag: true
raw_content:

FINAL: B

scoring_decision: {"correct": false, "required_substrings": null}
classification from audit: ambiguous/output-format mismatch
visible_content_has_expected_answer: true

Suspected root cause

The scorer seems to treat any ambiguity_flag=true as a failure for MCQ tasks. But the ambiguity flag can be triggered by examples or option letters inside the reasoning trace, such as FINAL: A used as an example, even when the final extracted answer in visible content is unambiguous and correct.
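The scorer source is not included in this issue, so the following only contrasts the suspected decision rule with the proposed one. Both function names and the `final_region_conflict` parameter are hypothetical.

```python
from typing import Optional


def score_suspected(expected: str, extracted: Optional[str], ambiguity_flag: bool) -> bool:
    """Suspected current behavior: any ambiguity flag fails the task."""
    if ambiguity_flag:
        return False
    return extracted == expected


def score_proposed(expected: str, extracted: Optional[str], final_region_conflict: bool) -> bool:
    """Proposed behavior: fail only when the final answer region itself conflicts
    or nothing could be extracted; flags raised by reasoning text are ignored."""
    if extracted is None or final_region_conflict:
        return False
    return extracted == expected


# reasoning_mmlu_015: extracted D, expected D, flag raised by the reasoning trace.
print(score_suspected("D", "D", ambiguity_flag=True))
print(score_proposed("D", "D", final_region_conflict=False))
```

Under the suspected rule the audit example is scored incorrect, and under the proposed rule it is scored correct.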

Suggested fix

Update the oc llm MCQ scorer/extractor to prioritize final answer extraction in this order:

1. Last valid FINAL: <A|B|C|D> in visible message.content
2. Last valid Final answer: <A|B|C|D> in visible message.content
3. Last valid Answer: <A|B|C|D> in visible message.content
4. If visible content is null or empty, optionally fall back to message.reasoning only when the lane/test is configured to allow reasoning-field scoring
5. Do not treat reasoning-trace examples as final-answer ambiguity unless the final answer itself is conflicting
6. Store the extraction source and extraction method in the result output
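The priority order above could be sketched as follows. This is not the scorer's actual code: `extract_final_answer`, the `allow_reasoning_fallback` parameter, and the method names other than `final_colon_letter` are assumptions made for illustration.

```python
import re
from typing import Optional

# Patterns in priority order; the last match of the first matching pattern wins.
PATTERNS = [
    ("final_colon_letter", re.compile(r"\bFINAL:\s*([A-D])\b")),
    ("final_answer_colon_letter", re.compile(r"\bFinal answer:\s*([A-D])\b", re.IGNORECASE)),
    ("answer_colon_letter", re.compile(r"\bAnswer:\s*([A-D])\b", re.IGNORECASE)),
]


def extract_final_answer(
    content: Optional[str],
    reasoning: Optional[str] = None,
    allow_reasoning_fallback: bool = False,
) -> dict:
    """Extract an MCQ letter, recording where and how it was found."""
    sources = [("content", content)]
    # The reasoning field is consulted only when visible content is empty
    # and the lane explicitly allows reasoning-field scoring.
    if not (content and content.strip()) and allow_reasoning_fallback:
        sources.append(("reasoning", reasoning))
    for source_name, text in sources:
        if not text:
            continue
        for method, pattern in PATTERNS:
            matches = pattern.findall(text)
            if matches:
                return {"answer": matches[-1], "source": source_name, "method": method}
    return {"answer": None, "source": None, "method": None}


print(extract_final_answer(
    "FINAL: D",
    reasoning="The answer must be like FINAL: A or FINAL: B. The correct answer is D.",
))
```

Because the reasoning text is never scanned when visible content yields an answer, example strings in the trace cannot change the result. This sketch uses a last-match rule for repeated final answers; a scorer that prefers to flag conflicts in visible content would need an extra check.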

Suggested regression tests

Create scorer unit tests for these cases:

Correct visible final answer despite examples in reasoning

Input:

expected_answer: D
raw_content: FINAL: D
raw_reasoning contains example text like:

The answer must be like FINAL: A or FINAL: B. The correct answer is D.

Expected:

scorer returns correct: true

Correct visible answer with conflicting-looking reasoning examples

Input:

expected_answer: B
raw_content: FINAL: B
raw_reasoning includes examples FINAL: A and FINAL: C

Expected:

scorer returns correct: true

True ambiguity behavior is explicit

Input:

expected_answer: B
raw_content contains both FINAL: B and later FINAL: C

Expected:

scorer may mark ambiguous or use a last-final-answer rule, but the behavior must be explicit and documented.
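The three cases can be written as plain test functions that also run under pytest. `score_mcq` below is a hypothetical stand-in for the scorer under test, and the third test picks one of the two permitted policies (mark as ambiguous) purely as an assumption.

```python
import re

FINAL_RE = re.compile(r"\bFINAL:\s*([A-D])\b")


def score_mcq(expected_answer: str, raw_content: str, raw_reasoning: str = "") -> dict:
    """Hypothetical stand-in scorer: only visible content decides the outcome.

    raw_reasoning is accepted but deliberately never inspected.
    """
    finals = FINAL_RE.findall(raw_content or "")
    ambiguous = len(set(finals)) > 1  # conflicting letters in visible content
    answer = finals[-1] if finals and not ambiguous else None
    return {
        "correct": answer == expected_answer,
        "ambiguous": ambiguous,
        "extracted_final_answer": answer,
    }


def test_visible_final_answer_despite_reasoning_examples():
    reasoning = "The answer must be like FINAL: A or FINAL: B. The correct answer is D."
    assert score_mcq("D", "FINAL: D", reasoning)["correct"] is True


def test_conflicting_looking_reasoning_examples():
    reasoning = "Valid formats include FINAL: A and FINAL: C."
    assert score_mcq("B", "FINAL: B", reasoning)["correct"] is True


def test_true_ambiguity_is_explicit():
    result = score_mcq("B", "FINAL: B\nFINAL: C")
    assert result["ambiguous"] is True
    assert result["correct"] is False


if __name__ == "__main__":
    test_visible_final_answer_despite_reasoning_examples()
    test_conflicting_looking_reasoning_examples()
    test_true_ambiguity_is_explicit()
    print("all regression cases passed")
```

If the project chooses a last-final-answer rule instead, only the third test changes; the first two should hold under either policy.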

Attachments / local evidence paths

Source metrics: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/results/B_native_reasoning_metrics_20260429_004524.json
Corrected metrics: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/results/B_native_reasoning_corrected_metrics_20260429_005736.json
Scorer audit report: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/reports/B_native_reasoning_scorer_audit_20260429_005736.md
A/B comparison report: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/reports/A_vs_B_native_reasoning_comparison_20260429_004524.md
Project status: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/PROJECT_STATUS.md

Impact

The original Lane B score undercounted native reasoning quality: 188/210 = 89.52% was corrected to 197/210 = 93.81% under a simple final-answer rule. This materially changes the benchmark interpretation.

The visible-content-only compatibility issue is separate from this bug. Native reasoning outputs may require reasoning-field support, but when visible content contains a correct FINAL: answer, the scorer should not mark it wrong solely because reasoning text triggered ambiguity_flag=true.

Acceptance criteria

Correct final-answer MCQ outputs are not marked wrong solely because ambiguity_flag=true.
Reasoning-trace examples do not override the final visible answer.
Corrected scorer reproduces the adjusted Lane B score of roughly 197/210 = 93.81% on the audit dataset.
Result output clearly records extraction source and method.
Regression tests cover examples in the audit.

extent analysis

TL;DR

Update the oc llm MCQ scorer to prioritize final answer extraction from visible content and ignore reasoning-trace examples unless they indicate true ambiguity.

Guidance

Review the oc llm scorer code to ensure it follows the suggested fix order for final answer extraction.
Implement regression tests for correct visible final answers despite examples in reasoning, correct visible answers with conflicting-looking reasoning examples, and true ambiguity behavior.
Verify that the corrected scorer reproduces the adjusted Lane B score of roughly 197/210 = 93.81% on the audit dataset.
Ensure the result output clearly records extraction source and method.

Example

The issue does not include the scorer's source code, so no snippet from the actual implementation is available.

Notes

The suggested fix assumes that the oc llm scorer code is modifiable and that the issue is solely related to the scorer's logic. Additional testing and verification may be necessary to ensure the corrected scorer works as expected in all scenarios.

Recommendation

Apply the suggested fix: update the oc llm MCQ scorer to prioritize final-answer extraction from visible content and to ignore reasoning-trace examples unless the final answer itself is conflicting. This should resolve the 9 scoring inconsistencies found in the audit and make the benchmark results more accurate.


