llamaIndex - ✅ (Solved) Fix [Feature Request]: Agent evaluation framework: tool correctness, instruction adherence, reasoning quality [1 pull request, 1 participant]

llamaIndex · 2026-03-03 17:51:20


GitHub stats

run-llama/llama_index#20862•Fetched 2026-04-08 00:30:35


Author

debu-sinha

Participants

debu-sinha

Timeline (top)

cross-referenced ×1

Fix Action

Fixed

Fixed by PR: Add agent evaluation: ToolCallCorrectness and AgentGoalSuccess evaluators (https://github.com/run-llama/llama_index/pull/20887)

PR fix notes

PR #20887: Add agent evaluation: ToolCallCorrectness and AgentGoalSuccess evaluators

Repository: run-llama/llama_index
Author: debu-sinha
State: open | merged: False
Link: https://github.com/run-llama/llama_index/pull/20887

Description (problem / solution / changelog)

Resolves #20862

What this PR does

Adds two evaluators for agent workflows under llama_index.core.evaluation.agent:

ToolCallCorrectnessEvaluator (deterministic, no LLM needed)

Compares expected vs actual tool calls by name and kwargs
Supports ordered and unordered matching
Supports strict (exact) or subset kwargs comparison
Configurable pass/fail threshold
Returns a score (fraction of matched calls) and detailed feedback
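
The matching options listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's code: the function names (`kwargs_match`, `call_matches`, `score_tool_calls`) and the `ordered` / `strict` parameter names are hypothetical, the tool-call shape is taken from the usage example in the PR description, and "ordered" is assumed to mean that expected calls must appear in the same relative order among the actual calls.

```python
from typing import Any, Dict, List

# Hypothetical shape, borrowed from the PR usage example:
# {"tool_name": str, "tool_kwargs": dict}
ToolCall = Dict[str, Any]


def kwargs_match(expected: Dict[str, Any], actual: Dict[str, Any], strict: bool) -> bool:
    """Exact equality when strict; otherwise every expected key/value must appear in actual."""
    if strict:
        return expected == actual
    return all(key in actual and actual[key] == value for key, value in expected.items())


def call_matches(expected: ToolCall, actual: ToolCall, strict: bool) -> bool:
    """A call matches when the tool name is equal and the kwargs match."""
    return expected["tool_name"] == actual["tool_name"] and kwargs_match(
        expected.get("tool_kwargs", {}), actual.get("tool_kwargs", {}), strict
    )


def score_tool_calls(
    expected: List[ToolCall],
    actual: List[ToolCall],
    ordered: bool = False,
    strict: bool = False,
) -> float:
    """Return the fraction of expected calls that are matched by the actual calls."""
    if not expected:
        return 1.0 if not actual else 0.0
    matched = 0
    if ordered:
        # Each expected call must be found after the position of the previous match.
        position = 0
        for exp in expected:
            for index in range(position, len(actual)):
                if call_matches(exp, actual[index], strict):
                    matched += 1
                    position = index + 1
                    break
    else:
        # Each actual call may satisfy at most one expected call.
        remaining = list(actual)
        for exp in expected:
            for candidate in remaining:
                if call_matches(exp, candidate, strict):
                    matched += 1
                    remaining.remove(candidate)
                    break
    return matched / len(expected)
```

A pass/fail decision would then be a comparison such as `score_tool_calls(expected, actual) >= threshold`, where `threshold` stands in for the configurable value the PR mentions.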

AgentGoalSuccessEvaluator (LLM-based judge)

Scores whether an agent achieved a given goal on a 1-5 scale
Considers the agent response, tool call history, and optional reference outcome
Follows the same pattern as CorrectnessEvaluator (uses default_parser, same prompt structure)
Configurable LLM, prompt template, and score threshold
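
To make the 1-5 scoring and the score threshold concrete, here is a small sketch of how a judge reply might be turned into a score and a pass/fail flag. It assumes the judge puts the numeric score on the first line and its reasoning after it; the helper names `parse_judge_output` and `is_passing` and the default threshold of 4.0 are hypothetical and are not taken from the PR.

```python
from typing import Optional, Tuple


def parse_judge_output(text: str) -> Tuple[Optional[float], str]:
    """Split a judge reply into (score, reasoning); score is None if it cannot be parsed."""
    cleaned = text.strip()
    first_line, _, rest = cleaned.partition("\n")
    try:
        score = float(first_line.strip())
    except ValueError:
        # No leading number: treat the whole reply as reasoning.
        return None, cleaned
    if not 1.0 <= score <= 5.0:
        # Out-of-range scores are rejected rather than clamped.
        return None, cleaned
    return score, rest.strip()


def is_passing(score: Optional[float], threshold: float = 4.0) -> bool:
    """An unparseable score never passes; otherwise compare against the threshold."""
    return score is not None and score >= threshold
```

Rejecting unparseable or out-of-range replies, instead of guessing a score, keeps a malformed judge response from silently counting as a pass.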

Both evaluators:

Extend BaseEvaluator with proper aevaluate, _get_prompts, _update_prompts
Are exported from llama_index.core.evaluation (top-level access)
Work with both evaluate() (sync) and aevaluate() (async)
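
The sync/async pairing can be illustrated with a generic sketch in which the synchronous entry point simply runs the coroutine to completion. `SketchEvaluator` is a hypothetical stand-in that returns a plain dict; it does not reproduce how BaseEvaluator implements evaluate().

```python
import asyncio
from typing import Any, Dict


class SketchEvaluator:
    """Hypothetical evaluator exposing both a sync and an async entry point."""

    async def aevaluate(self, **kwargs: Any) -> Dict[str, Any]:
        # A real evaluator would score here, possibly awaiting an LLM call.
        await asyncio.sleep(0)
        return {"score": 1.0, "inputs": kwargs}

    def evaluate(self, **kwargs: Any) -> Dict[str, Any]:
        # Sync entry point: run the coroutine on a fresh event loop.
        # asyncio.run raises RuntimeError when a loop is already running
        # in the current thread (for example inside a notebook cell).
        return asyncio.run(self.aevaluate(**kwargs))
```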

Why

LlamaIndex has evaluators for RAG (faithfulness, relevancy, correctness, guideline adherence) but nothing for agent-specific evaluation. With the growth of agentic workflows, users need to verify that agents call the right tools with the right arguments and actually achieve the stated goal.

Usage

from llama_index.core.evaluation import ToolCallCorrectnessEvaluator

evaluator = ToolCallCorrectnessEvaluator()
result = evaluator.evaluate(
    expected_tool_calls=[
        {"tool_name": "search", "tool_kwargs": {"query": "weather SF"}},
    ],
    actual_tool_calls=[
        {"tool_name": "search", "tool_kwargs": {"query": "weather SF", "limit": 10}},
    ],
)
print(result.score)    # 1.0 (subset kwargs match by default)
print(result.passing)  # True

from llama_index.core.evaluation import AgentGoalSuccessEvaluator

evaluator = AgentGoalSuccessEvaluator()
result = evaluator.evaluate(
    query="Find cheap flights from NYC to London",
    response="Found 3 flights. Cheapest is Delta at $450.",
    contexts=[
        "Called flight_search(from='NYC', to='London') -> [Delta $450, United $520]",
        "Called sort_results(by='price') -> [Delta $450, United $520]",
    ],
)
print(result.score)    # 5.0
print(result.passing)  # True

Testing

18 unit tests for ToolCallCorrectnessEvaluator and compare_tool_calls
11 unit tests for AgentGoalSuccessEvaluator (mocked LLM)
3 integration tests with real OpenAI calls (skipped without API key)
All 29 unit tests pass locally

$ pytest llama-index-core/tests/evaluation/test_tool_call_correctness.py \
        llama-index-core/tests/evaluation/test_goal_success.py -v
============================== 29 passed in 0.04s ==============================

Files changed

| File | What |
| --- | --- |
| `evaluation/agent/__init__.py` | Module exports |
| `evaluation/agent/tool_call_correctness.py` | Deterministic tool call evaluator |
| `evaluation/agent/goal_success.py` | LLM-based goal achievement evaluator |
| `evaluation/agent/utils.py` | Comparison logic and `ToolCallComparisonResult` dataclass |
| `evaluation/__init__.py` | Export new evaluators at top level |
| `tests/evaluation/test_tool_call_correctness.py` | 18 unit tests |
| `tests/evaluation/test_goal_success.py` | 11 unit tests |
| `tests/evaluation/test_agent_eval_integration.py` | 3 integration tests |

Changed files

llama-index-core/llama_index/core/evaluation/__init__.py (modified, +7/-0)
llama-index-core/llama_index/core/evaluation/agent/__init__.py (added, +19/-0)
llama-index-core/llama_index/core/evaluation/agent/goal_success.py (added, +168/-0)
llama-index-core/llama_index/core/evaluation/agent/tool_call_correctness.py (added, +104/-0)
llama-index-core/llama_index/core/evaluation/agent/utils.py (added, +115/-0)
llama-index-core/tests/evaluation/test_agent_eval_integration.py (added, +94/-0)
llama-index-core/tests/evaluation/test_goal_success.py (added, +166/-0)
llama-index-core/tests/evaluation/test_tool_call_correctness.py (added, +248/-0)



Problem

LlamaIndex has solid evaluation for RAG (AnswerRelevancyEvaluator, FaithfulnessEvaluator, etc.) but nothing for evaluating agent behavior. The instrumentation module already captures agent events (AgentToolCallEvent, AgentRunStepStartEvent/EndEvent) but no evaluator consumes them.

As more teams build with AgentWorkflow, ReActAgent, and FunctionAgent, there's no built-in way to answer:

Did the agent call the right tool with the right arguments?
Did it follow system prompt instructions?
Was the reasoning chain sound (for ReAct)?
Did it stop at the right time instead of looping?

RAGAS and DeepEval both ship agent-specific evaluators (ToolCallAccuracy, TaskCompletion). LlamaIndex could have native equivalents that work directly with its instrumentation and agent types.

Proposal

Add agent evaluators under llama_index.core.evaluation.agent that follow the existing BaseEvaluator pattern.

Evaluators

ToolCallCorrectnessEvaluator - Given a user query and expected tool calls, score whether the agent called the correct tools with correct arguments. Works with any agent type.

InstructionAdherenceEvaluator - LLM-judged: did the agent's response follow the system prompt constraints? Uses the same judge LLM pattern as GuidelineEvaluator.

ReasoningQualityEvaluator - For ReAct agents: evaluates whether the reasoning steps (thought/action/observation chain) are logically sound and lead to the correct conclusion.

AgentGoalSuccessEvaluator - End-to-end: given a task description and the agent's final output, did the agent accomplish the goal?

Interface

Building on the existing BaseEvaluator:

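from typing import Any, Dict, Sequence

from llama_index.core.evaluation import BaseEvaluator, EvaluationResult

# Assumption: ToolCall is not defined in the proposal; a plain dict such as
# {"tool_name": "search", "tool_kwargs": {...}} stands in for it here.
ToolCall = Dict[str, Any]

class ToolCallCorrectnessEvaluator(BaseEvaluator):
    async def aevaluate(
        self,
        query: str | None = None,
        response: str | None = None,
        contexts: Sequence[str] | None = None,
        tool_calls: list[ToolCall] | None = None,
        expected_tool_calls: list[ToolCall] | None = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        ...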

The tool_calls and expected_tool_calls parameters extend the base interface. Existing BatchEvalRunner would work for running these at scale.
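
As a rough picture of what "running these at scale" involves, the sketch below evaluates many cases concurrently with a cap on how many are in flight. It is a generic asyncio pattern, not BatchEvalRunner's implementation; `run_batch`, `fake_aevaluate`, and the `workers` parameter are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def run_batch(
    evaluate_one: Callable[..., Awaitable[Any]],
    cases: List[Dict[str, Any]],
    workers: int = 4,
) -> List[Any]:
    """Evaluate every case concurrently, with at most `workers` in flight at once."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(case: Dict[str, Any]) -> Any:
        async with semaphore:
            return await evaluate_one(**case)

    # gather returns results in the same order as the input cases.
    return await asyncio.gather(*(guarded(case) for case in cases))


async def fake_aevaluate(
    tool_calls: Any = None, expected_tool_calls: Any = None, **kwargs: Any
) -> float:
    """Hypothetical stand-in for an evaluator's aevaluate: exact-match fraction."""
    expected = expected_tool_calls or []
    actual = tool_calls or []
    if not expected:
        return 1.0
    return sum(1 for call in expected if call in actual) / len(expected)
```

Each case is a dict of keyword arguments, so extra parameters such as tool_calls and expected_tool_calls travel with the case instead of requiring a change to the runner's signature.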

Scope

I'd start with ToolCallCorrectnessEvaluator and AgentGoalSuccessEvaluator as a first PR, since they cover the most common evaluation need (did the agent do the right thing?). The other two can follow.

Happy to take this on. I've been contributing to the repo (security fixes, rate limiting).

extent analysis

Fix Plan

Step-by-Step Solution

1. Create Agent Evaluator Module

Create a new module llama_index.core.evaluation.agent to hold the new evaluators.

2. Implement ToolCallCorrectnessEvaluator

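from typing import Any, Dict, Sequence

from llama_index.core.evaluation import BaseEvaluator, EvaluationResult

# Assumption: tool calls are plain dicts, as in the PR usage example, e.g.
# {"tool_name": "search", "tool_kwargs": {"query": "weather SF"}}.
ToolCall = Dict[str, Any]

class ToolCallCorrectnessEvaluator(BaseEvaluator):
    def _get_prompts(self):
        return {}  # deterministic evaluator: no prompts to expose

    def _update_prompts(self, prompts_dict):
        pass

    async def aevaluate(
        self,
        query: str | None = None,
        response: str | None = None,
        contexts: Sequence[str] | None = None,
        tool_calls: list[ToolCall] | None = None,
        expected_tool_calls: list[ToolCall] | None = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        actual = tool_calls or []
        expected = expected_tool_calls or []
        if not expected:
            # Nothing was expected: full score only if nothing was called
            score = 1.0 if not actual else 0.0
        else:
            # Fraction of expected calls that appear exactly among the actual calls
            score = sum(1 for call in expected if call in actual) / len(expected)
        return EvaluationResult(query=query, response=response, score=score)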

3. Implement AgentGoalSuccessEvaluator

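from typing import Any, Sequence

from llama_index.core.evaluation import BaseEvaluator, EvaluationResult

class AgentGoalSuccessEvaluator(BaseEvaluator):
    def _get_prompts(self):
        return {}  # placeholder: no prompts until an LLM judge is wired in

    def _update_prompts(self, prompts_dict):
        pass

    async def aevaluate(
        self,
        query: str | None = None,
        response: str | None = None,
        contexts: Sequence[str] | None = None,
        task_description: str | None = None,
        agent_output: str | None = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        # Placeholder only: exact string equality cannot judge goal success.
        # The PR described above uses an LLM judge with a 1-5 score instead.
        if task_description is None or agent_output is None:
            raise ValueError("task_description and agent_output are required")
        score = 1.0 if agent_output == task_description else 0.0
        return EvaluationResult(query=task_description, response=agent_output, score=score)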

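4. Check BatchEvalRunner

Per the issue, the existing BatchEvalRunner should already be able to run these evaluators at scale. Update it only if the extra tool_calls and expected_tool_calls arguments are not passed through to the evaluators.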

5. Test
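
The PR notes mention unit tests that run against a mocked LLM. A minimal sketch of that testing style follows, assuming a judge that replies with the score on the first line. `FakeJudge`, `FakeCompletion`, and `judge_goal` are hypothetical names introduced here; none of them come from the PR's test files.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class FakeCompletion:
    text: str


class FakeJudge:
    """Hypothetical mock LLM that always returns the same canned reply."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.prompts: List[str] = []

    async def acomplete(self, prompt: str) -> FakeCompletion:
        self.prompts.append(prompt)  # record prompts so tests can inspect them
        return FakeCompletion(text=self.reply)


async def judge_goal(
    llm: Any, goal: str, response: str, threshold: float = 4.0
) -> Dict[str, Any]:
    """Hypothetical evaluator core: ask the judge, read the score from the first line."""
    reply = await llm.acomplete(f"Goal: {goal}\nResponse: {response}\nScore 1-5:")
    score = float(reply.text.strip().splitlines()[0])
    return {"score": score, "passing": score >= threshold}


def test_goal_success_passes_on_high_score() -> None:
    judge = FakeJudge("5\nThe agent found the cheapest flight.")
    result = asyncio.run(judge_goal(judge, "Find cheap flights", "Delta at $450"))
    assert result == {"score": 5.0, "passing": True}
    assert "Find cheap flights" in judge.prompts[0]


def test_goal_success_fails_on_low_score() -> None:
    judge = FakeJudge("2\nThe agent never searched.")
    result = asyncio.run(judge_goal(judge, "Find cheap flights", "I cannot help"))
    assert result["passing"] is False
```

Because the judge is mocked, these tests are deterministic and need no API key, which matches the split between unit and integration tests described in the PR notes.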
