llamaIndex - ✅(Solved) Fix Integration proposal: StyxxHallucinationEvaluator — 8-benchmark cross-validated RAG faithfulness (AUC 0.807 on RAGTruth) [1 pull requests, 1 participants]

llamaIndex · 2026-04-23 15:13:43


GitHub stats

run-llama/llama_index#21460•Fetched 2026-04-24 05:51:31


Author

fathomlab

Participants

fathomlab

Timeline (top)

cross-referenced ×1, referenced ×1

Fix Action

Fix / Workaround

Zenodo DOI: https://doi.org/10.5281/zenodo.19703527 (CC-BY-4.0)
OSF project: https://osf.io/g2epj/
Merged into EdinburghNLP/awesome-hallucination-detection#55 by Pasquale Minervini (HaluEval author)
Working Guardrails AI validator already in the same repo (issue)

PR fix notes

PR #55: Add Cognometry v0 — 8-benchmark cross-validated hallucination detection

Repository: EdinburghNLP/awesome-hallucination-detection
Author: fathomlab
State: closed | merged: True
Link: https://github.com/EdinburghNLP/awesome-hallucination-detection/pull/55

Description (problem / solution / changelog)

Adding a paper entry for Cognometry v0 — the first open-source hallucination detector cross-validated across 8 public benchmarks (including the four HaluBench subsets from Patronus AI).

The submission is in the list's established format (metrics, datasets, comments). I've placed it at the top of "Papers and Summaries" since it was deposited on Zenodo on 2026-04-23.

Two things worth calling out beyond the headline AUCs (0.998 HaluEval-QA, 0.994 TruthfulQA):

Two failure modes published openly (HaluBench-DROP 0.424 and HaluBench-FinanceBench 0.492 — both below chance) with their structural causes characterized in the weights module itself.
3-seed averaging + committed reproducer — full benchmark harness is in the repo and produces the per-dataset AUCs from raw HuggingFace datasets.

Paper: https://doi.org/10.5281/zenodo.19703527
Code: https://github.com/fathom-lab/styxx (MIT + CC-BY-4.0)
Manifesto: https://fathom.darkflobi.com/cognometry
Leaderboard: https://fathom.darkflobi.com/cognometry/leaderboard

Happy to refine the summary or move the entry if you'd prefer a different placement.

Changed files

README.md (modified, +7/-1)

Code Example

import asyncio

from styxx.adapters.llamaindex import StyxxHallucinationEvaluator

evaluator = StyxxHallucinationEvaluator(threshold=0.7)


async def main():
    # Works with the standard BaseEvaluator signature
    result = await evaluator.aevaluate(
        query="who directed inception?",
        response="Christopher Nolan directed Inception in 2010.",
        contexts=["Inception is a 2010 film directed by Christopher Nolan."],
    )

    print(result.passing)   # True/False against threshold
    print(result.score)     # 1.0 - risk (high = faithful)
    print(result.feedback)  # human-readable with action + signals
    print(result.metadata)  # {"risk": ..., "action": ..., "signals": {...}}


# `await` needs an async context; asyncio.run provides one in a plain script
asyncio.run(main())


Hi LlamaIndex team — proposing a community evaluator backed by styxx, a hallucination detector cross-validated across 8 public benchmarks.

Directly relevant to RAG: our strongest new number is AUC 0.807 on HaluBench-RAGTruth (3-seed averaged, n=150/dataset). That's the benchmark measuring RAG faithfulness — the core LlamaIndex use case.
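For readers unfamiliar with the "3-seed averaged, n=150/dataset" phrasing, here is a minimal sketch of one way such a number could be produced. It is not the styxx benchmark harness. The subsampling scheme, the function names `roc_auc` and `seed_averaged_auc`, and the convention that label 1 means "hallucinated" and the score is a risk value are all assumptions made for illustration.

```python
import random
from statistics import mean


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes label 1 = hallucinated, score = risk.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def seed_averaged_auc(labels, scores, n=150, seeds=(0, 1, 2)):
    """Average the AUC over one random subsample of size n per seed."""
    indices = list(range(len(labels)))
    aucs = []
    for seed in seeds:
        rng = random.Random(seed)  # a fresh, seeded generator per run
        sample = rng.sample(indices, min(n, len(indices)))
        aucs.append(
            roc_auc([labels[i] for i in sample], [scores[i] for i in sample])
        )
    return mean(aucs)


# Small hand-checkable case: 3 of the 4 positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise AUC is quadratic in the sample size, which is fine at n=150 but would want a rank-based implementation on larger datasets.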

What this would be

A BaseEvaluator subclass that slots alongside your existing FaithfulnessEvaluator / RelevancyEvaluator:

import asyncio

from styxx.adapters.llamaindex import StyxxHallucinationEvaluator

evaluator = StyxxHallucinationEvaluator(threshold=0.7)


async def main():
    # Works with the standard BaseEvaluator signature
    result = await evaluator.aevaluate(
        query="who directed inception?",
        response="Christopher Nolan directed Inception in 2010.",
        contexts=["Inception is a 2010 film directed by Christopher Nolan."],
    )

    print(result.passing)   # True/False against threshold
    print(result.score)     # 1.0 - risk (high = faithful)
    print(result.feedback)  # human-readable with action + signals
    print(result.metadata)  # {"risk": ..., "action": ..., "signals": {...}}


# `await` needs an async context; asyncio.run provides one in a plain script
asyncio.run(main())
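The comments on the result fields suggest a simple relationship between risk, score, and the pass/fail flag. The sketch below spells out one plausible reading of it. `SketchResult` and `result_from_risk` are hypothetical stand-ins, not styxx or LlamaIndex classes, and the rule "passing means score >= threshold" and the action labels are assumptions, since the issue only says the flag is computed "against threshold".

```python
from dataclasses import dataclass, field


@dataclass
class SketchResult:
    """Hypothetical stand-in for an evaluation result (not a real library class)."""

    passing: bool
    score: float
    feedback: str
    metadata: dict = field(default_factory=dict)


def result_from_risk(risk: float, threshold: float = 0.7) -> SketchResult:
    """Map a hallucination risk in [0, 1] to a pass/fail result.

    Assumption: score = 1.0 - risk, and passing means score >= threshold.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be in [0, 1]")
    score = 1.0 - risk
    passing = score >= threshold
    action = "accept" if passing else "review"  # hypothetical action labels
    return SketchResult(
        passing=passing,
        score=score,
        feedback=f"risk={risk:.2f}, score={score:.2f}, action={action}",
        metadata={"risk": risk, "action": action},
    )


low_risk = result_from_risk(0.1)   # score 0.9, above the 0.7 threshold
high_risk = result_from_risk(0.5)  # score 0.5, below the 0.7 threshold
print(low_risk.passing, high_risk.passing)
```

Under this reading, raising the threshold makes the evaluator stricter: fewer responses pass, and more are flagged for review.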

The adapter is live today at https://github.com/fathom-lab/styxx/blob/main/styxx/adapters/llamaindex.py (commit a817732). pip install "styxx[nli,llamaindex]" gets you everything; the quotes keep shells such as zsh from treating the brackets as a glob pattern.

Numbers on RAG-relevant benchmarks (3-seed averaged, n=150/dataset)

Benchmark	AUC
HaluEval-QA	0.998
TruthfulQA	0.994
HaluBench-RAGTruth	0.807 ← direct RAG faithfulness
HaluBench-PubMedQA	0.719
HaluEval-Dialog	0.676
HaluEval-Summ	0.643

The paper publishes two failure modes: HaluBench-DROP (AUC 0.424, extractive-span reading comprehension) and FinanceBench (AUC 0.492, financial arithmetic). Both are below chance (AUC 0.5) and are declared openly in the weights module. LlamaIndex users should avoid this evaluator for those specific domains.
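As a quick check of which reported numbers fall below chance, the AUCs quoted in this issue can be filtered against 0.5. This is only a sketch: the dictionary copies the figures stated above, while `below_chance` and `CHANCE_AUC` are illustrative names, not part of styxx.

```python
# AUCs as reported in the issue (3-seed averaged, n=150 per dataset).
REPORTED_AUC = {
    "HaluEval-QA": 0.998,
    "TruthfulQA": 0.994,
    "HaluBench-RAGTruth": 0.807,
    "HaluBench-PubMedQA": 0.719,
    "HaluEval-Dialog": 0.676,
    "HaluEval-Summ": 0.643,
    "HaluBench-DROP": 0.424,
    "HaluBench-FinanceBench": 0.492,
}

# A detector that ranks examples at random has an expected AUC of 0.5.
CHANCE_AUC = 0.5


def below_chance(aucs, chance=CHANCE_AUC):
    """Return the benchmark names whose AUC is strictly below chance, sorted."""
    return sorted(name for name, auc in aucs.items() if auc < chance)


print(below_chance(REPORTED_AUC))
# ['HaluBench-DROP', 'HaluBench-FinanceBench']
```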

Paper + provenance

Zenodo DOI: https://doi.org/10.5281/zenodo.19703527 (CC-BY-4.0)
OSF project: https://osf.io/g2epj/
Merged into EdinburghNLP/awesome-hallucination-detection#55 by Pasquale Minervini (HaluEval author)
Working Guardrails AI validator already in the same repo (issue)

What I'd do

Happy to submit as a PR to llama-index-integrations under llama-index-evaluation-styxx (or whichever namespace you prefer for community evaluators), or keep it in the styxx repo where it lives now and just document it as a supported integration.

Also happy to run the 8-benchmark suite against any existing LlamaIndex evaluators (FaithfulnessEvaluator, GuidelineEvaluator) for a direct comparison if that would be useful.

Thanks,
Flobi
Fathom Lab

extent analysis

TL;DR

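The issue proposes adding StyxxHallucinationEvaluator, a BaseEvaluator subclass backed by styxx, to LlamaIndex as a community evaluator for RAG faithfulness (reported AUC 0.807 on HaluBench-RAGTruth).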

Guidance

Evaluate the StyxxHallucinationEvaluator using the provided benchmarks (e.g., HaluBench-RAGTruth) to assess its performance.
Consider the declared failure modes (e.g., HaluBench-DROP, FinanceBench) when deciding which domains to apply the evaluator to.
Review the adapter code at https://github.com/fathom-lab/styxx/blob/main/styxx/adapters/llamaindex.py to understand the implementation.
Test the evaluator using the standard BaseEvaluator signature, as shown in the provided example.
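One way to act on the guidance above with your own labelled examples is to count pass/fail outcomes at the chosen threshold. The sketch below assumes you have already collected a faithfulness score per response (for instance `result.score`) together with a human label. The helper names, the sample data, and the rule "passing means score >= threshold" are illustrative assumptions, not styxx behaviour.

```python
def confusion_at_threshold(scores, faithful, threshold=0.7):
    """Count outcomes, treating 'faithful' as the positive class.

    Assumption: a response passes when its score is >= threshold.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, is_faithful in zip(scores, faithful):
        passed = score >= threshold
        if passed and is_faithful:
            counts["tp"] += 1  # faithful answer accepted
        elif passed and not is_faithful:
            counts["fp"] += 1  # hallucination that slipped through
        elif is_faithful:
            counts["fn"] += 1  # faithful answer wrongly rejected
        else:
            counts["tn"] += 1  # hallucination caught
    return counts


def precision_recall(counts):
    """Precision and recall for the 'faithful' class; 0.0 when undefined."""
    accepted = counts["tp"] + counts["fp"]
    actual = counts["tp"] + counts["fn"]
    precision = counts["tp"] / accepted if accepted else 0.0
    recall = counts["tp"] / actual if actual else 0.0
    return precision, recall


# Hypothetical scores and labels, purely for illustration.
scores = [0.95, 0.80, 0.65, 0.30, 0.75]
faithful = [True, True, True, False, False]

counts = confusion_at_threshold(scores, faithful, threshold=0.7)
print(counts)  # {'tp': 2, 'fp': 1, 'tn': 1, 'fn': 1}
print(precision_recall(counts))
```

Repeating the count at several thresholds shows how the 0.7 default trades missed hallucinations against rejected faithful answers on your own domain.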

Example

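import asyncio
from styxx.adapters.llamaindex import StyxxHallucinationEvaluator

async def main():
    evaluator = StyxxHallucinationEvaluator(threshold=0.7)
    return await evaluator.aevaluate(
        query="who directed inception?",
        response="Christopher Nolan directed Inception in 2010.",
        contexts=["Inception is a 2010 film directed by Christopher Nolan."],
    )

result = asyncio.run(main())  # `await` needs an async context in a plain script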

Notes

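The StyxxHallucinationEvaluator has been cross-validated across 8 public benchmarks, but the reported AUC varies widely by domain: from 0.998 on HaluEval-QA down to 0.643 on HaluEval-Summ, and below chance on HaluBench-DROP (0.424) and FinanceBench (0.492).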

Recommendation

Apply the StyxxHallucinationEvaluator as a community evaluator, either by submitting a PR to llama-index-integrations or documenting it as a supported integration in the styxx repo.
