LlamaIndex - ✅ (Solved) Fix Integration proposal: StyxxHallucinationEvaluator — 8-benchmark cross-validated RAG faithfulness (AUC 0.807 on RAGTruth) [1 pull request, 1 participant]

GitHub stats
run-llama/llama_index#21460 (fetched 2026-04-24 05:51:31)
Comments: 0
Participants: 1
Timeline events: 2 (cross-referenced ×1, referenced ×1)
Reactions: 0

Fix Action

Fix / Workaround

PR fix notes

PR #55: Add Cognometry v0 — 8-benchmark cross-validated hallucination detection

Description (problem / solution / changelog)

Adding a paper entry for Cognometry v0 — the first open-source hallucination detector cross-validated across 8 public benchmarks (including the four HaluBench subsets from Patronus AI).

The submission is in the list's established format (metrics, datasets, comments). I've placed it at the top of "Papers and Summaries" since it was deposited on Zenodo on 2026-04-23.

Two things worth calling out beyond the headline AUCs (0.998 HaluEval-QA, 0.994 TruthfulQA):

  1. Two failure modes published openly (HaluBench-DROP 0.424 and HaluBench-FinanceBench 0.492 — both below chance) with their structural causes characterized in the weights module itself.
  2. 3-seed averaging + committed reproducer — full benchmark harness is in the repo and produces the per-dataset AUCs from raw HuggingFace datasets.

Paper: https://doi.org/10.5281/zenodo.19703527
Code: https://github.com/fathom-lab/styxx (MIT + CC-BY-4.0)
Manifesto: https://fathom.darkflobi.com/cognometry
Leaderboard: https://fathom.darkflobi.com/cognometry/leaderboard

Happy to refine the summary or move the entry if you'd prefer a different placement.

Changed files

  • README.md (modified, +7/-1)


Hi LlamaIndex team — proposing a community evaluator backed by styxx, a hallucination detector cross-validated across 8 public benchmarks.

Directly relevant to RAG: our strongest new number is AUC 0.807 on HaluBench-RAGTruth (3-seed averaged, n=150/dataset). That's the benchmark measuring RAG faithfulness — the core LlamaIndex use case.

What this would be

A BaseEvaluator subclass that slots alongside your existing FaithfulnessEvaluator / RelevancyEvaluator:

import asyncio

from styxx.adapters.llamaindex import StyxxHallucinationEvaluator


async def main() -> None:
    evaluator = StyxxHallucinationEvaluator(threshold=0.7)

    # Works with the standard BaseEvaluator signature
    result = await evaluator.aevaluate(
        query="who directed inception?",
        response="Christopher Nolan directed Inception in 2010.",
        contexts=["Inception is a 2010 film directed by Christopher Nolan."],
    )

    print(result.passing)   # True/False against threshold
    print(result.score)     # 1.0 - risk (high = faithful)
    print(result.feedback)  # human-readable with action + signals
    print(result.metadata)  # {"risk": ..., "action": ..., "signals": {...}}


# aevaluate is a coroutine, so it needs an event loop outside a notebook
asyncio.run(main())
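To make the result fields above concrete, here is a minimal sketch of how a risk value could map onto them. It is not the styxx implementation: the `SketchResult` dataclass, the `to_result` helper, and the "accept"/"review" action labels are hypothetical, and comparing `score` against `threshold` with `>=` is an assumption about how `passing` is derived.

```python
from dataclasses import dataclass, field


@dataclass
class SketchResult:
    """Hypothetical stand-in for the result fields shown above."""
    passing: bool
    score: float
    feedback: str
    metadata: dict = field(default_factory=dict)


def to_result(risk: float, threshold: float = 0.7) -> SketchResult:
    """Map a hallucination risk in [0, 1] onto the result fields.

    Assumption: passing means score >= threshold; the real adapter may differ.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be in [0, 1]")
    score = 1.0 - risk  # high score = faithful, as in the snippet above
    passing = score >= threshold
    action = "accept" if passing else "review"  # hypothetical action labels
    return SketchResult(
        passing=passing,
        score=score,
        feedback=f"risk={risk:.2f}, action={action}",
        metadata={"risk": risk, "action": action},
    )
```

Under these assumptions, `to_result(0.1)` passes with a score of about 0.9, while `to_result(0.5)` fails because its score of 0.5 falls below the 0.7 threshold.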

The adapter is live today at https://github.com/fathom-lab/styxx/blob/main/styxx/adapters/llamaindex.py (commit a817732). Running pip install "styxx[nli,llamaindex]" installs everything (the quotes stop shells such as zsh from interpreting the brackets).

Numbers on RAG-relevant benchmarks (3-seed averaged, n=150/dataset)

Benchmark             AUC
HaluEval-QA           0.998
TruthfulQA            0.994
HaluBench-RAGTruth    0.807  ← direct RAG faithfulness
HaluBench-PubMedQA    0.719
HaluEval-Dialog       0.676
HaluEval-Summ         0.643

The paper publishes two failure modes: HaluBench-DROP (AUC 0.424, extractive-span reading comprehension) and HaluBench-FinanceBench (AUC 0.492, financial arithmetic). Both are below chance and are declared openly in the weights module. LlamaIndex users should avoid this evaluator for those specific domains.
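For readers less familiar with the metric, the sketch below shows one way to compute a ROC AUC from detector scores and average it over several seeds. It uses the rank-based (Mann-Whitney) formulation and is not the styxx benchmark harness; `roc_auc` and `mean_auc_over_seeds` are illustrative names. An AUC of 0.5 corresponds to chance, which is why 0.424 and 0.492 count as failures.

```python
from statistics import mean
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative.

    labels: 1 = hallucinated, 0 = faithful; scores: higher = more risk.
    Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc_over_seeds(
    runs: Sequence[tuple[Sequence[int], Sequence[float]]],
) -> float:
    """Average per-seed AUCs; each run is (labels, scores) for one seed."""
    return mean(roc_auc(labels, scores) for labels, scores in runs)
```

The pairwise loop is quadratic in sample size, which is fine at n=150 per dataset; a sort-based variant would be preferable for much larger sets.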

Paper + provenance

What I'd do

Happy to submit as a PR to llama-index-integrations under llama-index-evaluation-styxx (or whichever namespace you prefer for community evaluators), or keep it in the styxx repo where it lives now and just document it as a supported integration.

Also happy to run the 8-benchmark suite against any existing LlamaIndex evaluators (FaithfulnessEvaluator, GuidelineEvaluator) for a direct comparison if that would be useful.

Thanks,
Flobi, Fathom Lab

extent analysis

TL;DR

Integrate the StyxxHallucinationEvaluator into LlamaIndex to improve hallucination detection.

Guidance

  • Evaluate the StyxxHallucinationEvaluator using the provided benchmarks (e.g., HaluBench-RAGTruth) to assess its performance.
  • Consider the declared failure modes (e.g., HaluBench-DROP, FinanceBench) when deciding which domains to apply the evaluator to.
  • Review the adapter code at https://github.com/fathom-lab/styxx/blob/main/styxx/adapters/llamaindex.py to understand the implementation.
  • Test the evaluator using the standard BaseEvaluator signature, as shown in the provided example.

Example

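from styxx.adapters.llamaindex import StyxxHallucinationEvaluator

# Run inside an async context (for example a notebook cell or an async function)
evaluator = StyxxHallucinationEvaluator(threshold=0.7)
result = await evaluator.aevaluate(
    query="who directed inception?",
    response="Christopher Nolan directed Inception in 2010.",
    contexts=["Inception is a 2010 film directed by Christopher Nolan."],
)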

Notes

The StyxxHallucinationEvaluator has been cross-validated across 8 public benchmarks, but its performance varies widely by domain: reported AUCs range from 0.998 on HaluEval-QA down to 0.424 on HaluBench-DROP.
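One way to act on that variation is to gate the evaluator by domain, using the AUCs reported in this issue. The sketch below is illustrative only: the `REPORTED_AUC` keys reuse the benchmark names from the issue, while `should_use_evaluator` and the `min_auc` cutoff of 0.6 are assumptions, not part of styxx or LlamaIndex.

```python
# AUCs as reported in the issue (3-seed averaged, n=150/dataset).
REPORTED_AUC = {
    "HaluEval-QA": 0.998,
    "TruthfulQA": 0.994,
    "HaluBench-RAGTruth": 0.807,
    "HaluBench-PubMedQA": 0.719,
    "HaluEval-Dialog": 0.676,
    "HaluEval-Summ": 0.643,
    "HaluBench-FinanceBench": 0.492,
    "HaluBench-DROP": 0.424,
}


def should_use_evaluator(closest_benchmark: str, min_auc: float = 0.6) -> bool:
    """Return True if the benchmark closest to your domain clears min_auc.

    Unknown benchmarks return False so that untested domains are skipped.
    min_auc=0.6 is an arbitrary illustrative cutoff, not a published value.
    """
    auc = REPORTED_AUC.get(closest_benchmark)
    return auc is not None and auc >= min_auc
```

With the default cutoff, a RAG workload mapped to HaluBench-RAGTruth would be evaluated, while workloads resembling HaluBench-DROP or HaluBench-FinanceBench would be skipped.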

Recommendation

Adopt the StyxxHallucinationEvaluator as a community evaluator, either by submitting a PR to llama-index-integrations or by keeping it in the styxx repo and documenting it as a supported integration.
