openclaw - 💡(How to fix) Fix oc llm scorer incorrectly marks correct MCQ final answers wrong when ambiguity_flag=true [1 comments, 2 participants]

openclaw/openclaw#74017 · Fetched 2026-04-30 06:29:49
Comments: 1 · Participants: 2 · Timeline events: 5 (cross-referenced ×3, closed ×1, commented ×1) · Reactions: 0

oc llm scorer incorrectly marks correct MCQ final answers wrong when ambiguity_flag=true

Summary

The oc llm benchmark scorer appears to reject MCQ answers when ambiguity_flag=true, even when the final extracted answer matches expected_answer. This causes valid final-answer outputs from native reasoning models to be marked incorrect.

This is an oc llm scorer/harness bug, not a Nemotron model bug and not a vLLM bug. The original Lane B score undercounted model quality.

Environment

  • Model served name: nemotron-3-super
  • Underlying model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
  • Runtime: vLLM OpenAI-compatible endpoint
  • Hardware: Dell GB10 Pro Max
  • Benchmark lane: B_native_reasoning
  • Benchmark type: OpenClaw local oc llm non-smoke benchmark

Benchmark context

  • Lane A baseline run: run_2026-04-28_234341_gb10_non_smoke_oc_llm_test
  • Lane B run: B_native_reasoning_20260429_000202
  • Source metrics file: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/results/B_native_reasoning_metrics_20260429_004524.json
  • Corrected metrics file: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/results/B_native_reasoning_corrected_metrics_20260429_005736.json
  • Audit file: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/reports/B_native_reasoning_scorer_audit_20260429_005736.md

Observed behavior

The Lane B metrics contained tasks marked incorrect where:

  • correct == false
  • expected_answer == extracted_final_answer
  • extraction_method == final_colon_letter
  • visible raw_content contained the correct final answer, e.g. FINAL: D or FINAL: B
  • ambiguity_flag == true

The audit found 9 such scoring inconsistencies and confirmed the scorer bug.
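The pattern above can be checked mechanically. The following is a minimal sketch, assuming the metrics file holds a flat list of per-task records with the field names listed above; the actual layout of the metrics JSON is not shown in this issue, and the task_id key and the two extra sample records are hypothetical.

```python
from typing import Any


def find_scoring_inconsistencies(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return records scored incorrect although the extracted answer matches."""
    return [
        r for r in records
        if r.get("correct") is False
        and r.get("extracted_final_answer") is not None
        and r.get("extracted_final_answer") == r.get("expected_answer")
    ]


# First record mirrors reasoning_mmlu_015 from the audit; the others are made up.
sample = [
    {"task_id": "reasoning_mmlu_015", "correct": False, "expected_answer": "D",
     "extracted_final_answer": "D", "ambiguity_flag": True},
    {"task_id": "hypothetical_correct", "correct": True, "expected_answer": "A",
     "extracted_final_answer": "A", "ambiguity_flag": False},
    {"task_id": "hypothetical_wrong", "correct": False, "expected_answer": "C",
     "extracted_final_answer": "A", "ambiguity_flag": False},
]

flagged = find_scoring_inconsistencies(sample)
print([r["task_id"] for r in flagged])
```

Only the first sample record is flagged: it is the one case where correct is false while the extracted answer equals the expected answer.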

Expected behavior

For MCQ tasks:

  1. If a final-answer extractor finds a valid final answer from the final visible content, and that final answer matches expected_answer, score the task correct.
  2. Example strings in the reasoning trace, such as FINAL: A or FINAL: B, and stray option letters in the reasoning field should not make the final answer ambiguous if the final visible answer is unambiguous.
  3. Only mark ambiguous if the final answer region itself contains conflicting final answers or no reliable final answer can be extracted.
  4. The scorer should distinguish:
    • final answer ambiguity
    • reasoning trace examples
    • visible content final answer
    • reasoning-field final answer
    • no extractable answer

Evidence / metrics

  • Lane A baseline: 184/210 = 87.62%
  • Original Lane B: 188/210 = 89.52%
  • Corrected final-answer Lane B: 197/210 = 93.81%
  • Scoring inconsistencies found: 9
  • Scorer bug confirmed: yes
  • Original reasoning score: 128/145 = 88.28%
  • Corrected reasoning score: 137/145 = 94.48%

The audit conclusion says the scorer rejected answers where ambiguity_flag=true even though the final extracted answer matched expected_answer.
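The figures above are internally consistent, which a few lines of arithmetic confirm. This sketch uses only the counts reported in this issue; the pct helper is illustrative.

```python
def pct(correct: int, total: int) -> float:
    """Percentage rounded to two decimals, as reported in the metrics."""
    return round(100 * correct / total, 2)


inconsistencies = 9

# Adding the 9 re-scored tasks to the original counts gives the corrected counts.
assert 188 + inconsistencies == 197   # overall Lane B
assert 128 + inconsistencies == 137   # reasoning subset

print(pct(184, 210))  # Lane A baseline
print(pct(188, 210))  # original Lane B
print(pct(197, 210))  # corrected Lane B
print(pct(128, 145))  # original reasoning score
print(pct(137, 145))  # corrected reasoning score
```

Both the overall and the reasoning counts rise by exactly 9, which is consistent with all nine re-scored tasks belonging to the reasoning subset.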

Concrete examples

reasoning_mmlu_015

  • expected_answer: D
  • extracted_final_answer: D
  • extraction_method: final_colon_letter
  • ambiguity_flag: true
  • raw_content: FINAL: D
  • scoring_decision: {"correct": false, "required_substrings": null}
  • classification from audit: ambiguous/output-format mismatch
  • visible_content_has_expected_answer: true

reasoning_mmlu_pro_019

  • expected_answer: B
  • extracted_final_answer: B
  • extraction_method: final_colon_letter
  • ambiguity_flag: true
  • raw_content: FINAL: B
  • scoring_decision: {"correct": false, "required_substrings": null}
  • classification from audit: ambiguous/output-format mismatch
  • visible_content_has_expected_answer: true

reasoning_mmlu_030

  • expected_answer: B
  • extracted_final_answer: B
  • extraction_method: final_colon_letter
  • ambiguity_flag: true
  • raw_content: FINAL: B
  • scoring_decision: {"correct": false, "required_substrings": null}
  • classification from audit: ambiguous/output-format mismatch
  • visible_content_has_expected_answer: true

Suspected root cause

The scorer seems to treat any ambiguity_flag=true as a failure for MCQ tasks. But the ambiguity flag can be triggered by examples or option letters inside the reasoning trace, such as FINAL: A used as an example, even when the final extracted answer in visible content is unambiguous and correct.
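The scorer source is not included in this issue, so the following is only a sketch of the suspected decision logic next to the proposed one. Both function names and the final_region_conflict parameter are hypothetical.

```python
from typing import Optional


def score_mcq_suspected(expected: str, extracted: Optional[str], ambiguity_flag: bool) -> bool:
    """Suspected current behavior: any ambiguity flag fails the task outright."""
    if ambiguity_flag:
        return False
    return extracted == expected


def score_mcq_proposed(expected: str, extracted: Optional[str], final_region_conflict: bool) -> bool:
    """Proposed behavior: only a conflict inside the final answer region blocks a match."""
    if extracted is None or final_region_conflict:
        return False
    return extracted == expected


# reasoning_mmlu_015: extracted D, expected D, flag raised by reasoning-trace examples.
print(score_mcq_suspected("D", "D", ambiguity_flag=True))
print(score_mcq_proposed("D", "D", final_region_conflict=False))
```

The difference is what feeds the decision: the suspected version reacts to a flag that reasoning text can raise, while the proposed version looks only at conflicts within the final visible answer.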

Suggested fix

Update the oc llm MCQ scorer/extractor to prioritize final answer extraction in this order:

  1. Last valid FINAL: <A|B|C|D> in visible message.content
  2. Last valid Final answer: <A|B|C|D> in visible message.content
  3. Last valid Answer: <A|B|C|D> in visible message.content
  4. If visible content is null or empty, optionally fall back to message.reasoning only when the lane/test is configured to allow reasoning-field scoring
  5. Do not treat reasoning-trace examples as final-answer ambiguity unless the final answer itself is conflicting
  6. Store the extraction source and extraction method in the result output
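The priority order above could be sketched as follows. This is an illustration under stated assumptions, not the scorer's actual code: the Extraction type, the function names, the allow_reasoning_fallback parameter, and every method label except final_colon_letter are hypothetical, and the patterns accept only the letters A to D as listed above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Patterns in priority order. Only "final_colon_letter" appears in the audit;
# the other two method names are made up for this sketch.
_PATTERNS = [
    ("final_colon_letter", re.compile(r"\bFINAL:\s*([A-D])\b")),
    ("final_answer_colon_letter", re.compile(r"\bFinal answer:\s*([A-D])\b", re.IGNORECASE)),
    ("answer_colon_letter", re.compile(r"\bAnswer:\s*([A-D])\b", re.IGNORECASE)),
]


@dataclass
class Extraction:
    answer: Optional[str]
    source: Optional[str]  # "content" or "reasoning"
    method: Optional[str]


def _last_match(text: str) -> Optional[tuple[str, str]]:
    """Return (letter, method) for the highest-priority pattern, using its last match."""
    for method, pattern in _PATTERNS:
        letters = pattern.findall(text)
        if letters:
            return letters[-1], method
    return None


def extract_final_answer(content: Optional[str], reasoning: Optional[str] = None,
                         allow_reasoning_fallback: bool = False) -> Extraction:
    if content and content.strip():
        # Visible content is present: the reasoning trace is never consulted.
        hit = _last_match(content)
        return Extraction(hit[0], "content", hit[1]) if hit else Extraction(None, None, None)
    if allow_reasoning_fallback and reasoning:
        hit = _last_match(reasoning)
        if hit:
            return Extraction(hit[0], "reasoning", hit[1])
    return Extraction(None, None, None)


result = extract_final_answer(
    "FINAL: D",
    reasoning="The answer must be like FINAL: A or FINAL: B. The correct answer is D.",
)
print(result)
```

Because the reasoning field is read only when visible content is null or empty and the fallback is enabled, example strings in the trace cannot affect the extracted answer. Recording source and method on every result covers step 6.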

Suggested regression tests

Create scorer unit tests for these cases:

Correct visible final answer despite examples in reasoning

Input:

  • expected_answer: D
  • raw_content: FINAL: D
  • raw_reasoning contains example text like: "The answer must be like FINAL: A or FINAL: B. The correct answer is D."

Expected:

  • scorer returns correct: true

Correct visible answer with conflicting-looking reasoning examples

Input:

  • expected_answer: B
  • raw_content: FINAL: B
  • raw_reasoning includes examples FINAL: A and FINAL: C

Expected:

  • scorer returns correct: true

True ambiguity behavior is explicit

Input:

  • expected_answer: B
  • raw_content contains both FINAL: B and later FINAL: C

Expected:

  • scorer may mark ambiguous or use a last-final-answer rule, but the behavior must be explicit and documented.
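The three cases above could be written as pytest-style tests along these lines. The score_visible_final function is a hypothetical stand-in for the real scorer entry point, which this issue does not show. For the true-ambiguity case, the stand-in both flags the result as ambiguous and applies a last-final-answer rule; that is one possible explicit policy, chosen here only for illustration.

```python
import re


def score_visible_final(expected: str, raw_content: str, raw_reasoning: str = "") -> dict:
    """Stand-in scorer: the last FINAL: letter in visible content wins.

    raw_reasoning is accepted but deliberately ignored.
    """
    finals = re.findall(r"\bFINAL:\s*([A-D])\b", raw_content or "")
    return {
        "correct": bool(finals) and finals[-1] == expected,
        "ambiguous": len(set(finals)) > 1,
        "extracted_final_answer": finals[-1] if finals else None,
    }


def test_visible_final_despite_reasoning_examples():
    result = score_visible_final(
        "D", "FINAL: D",
        "The answer must be like FINAL: A or FINAL: B. The correct answer is D.",
    )
    assert result["correct"] is True


def test_visible_final_with_conflicting_looking_reasoning():
    result = score_visible_final("B", "FINAL: B", "Examples: FINAL: A and FINAL: C")
    assert result["correct"] is True
    assert result["ambiguous"] is False


def test_true_ambiguity_is_explicit():
    result = score_visible_final("B", "FINAL: B\nFINAL: C")
    assert result["ambiguous"] is True
    assert result["extracted_final_answer"] == "C"
    assert result["correct"] is False
```

Whichever policy the real scorer adopts for conflicting visible answers, the third test should pin it down so the behavior cannot change silently.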

Attachments / local evidence paths

  • Source metrics: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/results/B_native_reasoning_metrics_20260429_004524.json
  • Corrected metrics: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/results/B_native_reasoning_corrected_metrics_20260429_005736.json
  • Scorer audit report: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/reports/B_native_reasoning_scorer_audit_20260429_005736.md
  • A/B comparison report: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/reports/A_vs_B_native_reasoning_comparison_20260429_004524.md
  • Project status: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/PROJECT_STATUS.md

Impact

The original Lane B score undercounted native reasoning quality: 188/210 = 89.52% was corrected to 197/210 = 93.81% under a simple final-answer rule. This materially changes the benchmark interpretation.

The visible-content-only compatibility issue is separate from this bug. Native reasoning outputs may require reasoning-field support, but when visible content contains a correct FINAL: answer, the scorer should not mark it wrong solely because reasoning text triggered ambiguity_flag=true.

Acceptance criteria

  • Correct final-answer MCQ outputs are not marked wrong solely because ambiguity_flag=true.
  • Reasoning-trace examples do not override the final visible answer.
  • Corrected scorer reproduces the adjusted Lane B score of roughly 197/210 = 93.81% on the audit dataset.
  • Result output clearly records extraction source and method.
  • Regression tests cover examples in the audit.

Extent analysis

TL;DR

Update the oc llm MCQ scorer to prioritize final answer extraction from visible content and ignore reasoning-trace examples unless they indicate true ambiguity.

Guidance

  • Review the oc llm scorer code to ensure it follows the suggested fix order for final answer extraction.
  • Implement regression tests for correct visible final answers despite examples in reasoning, correct visible answers with conflicting-looking reasoning examples, and true ambiguity behavior.
  • Verify that the corrected scorer reproduces the adjusted Lane B score of roughly 197/210 = 93.81% on the audit dataset.
  • Ensure the result output clearly records extraction source and method.

Notes

The suggested fix assumes that the oc llm scorer code is modifiable and that the issue is solely related to the scorer's logic. Additional testing and verification may be necessary to ensure the corrected scorer works as expected in all scenarios.

Recommendation

Apply the suggested fix: update the oc llm MCQ scorer to prioritize final-answer extraction from visible content and to ignore reasoning-trace examples unless the final answer itself is conflicting. This should resolve the 9 scoring inconsistencies found by the audit and make the benchmark results more accurate.
