LlamaIndex - ✅ (Solved) [Feature Request]: Agent evaluation framework: tool correctness, instruction adherence, reasoning quality [1 pull request, 1 participant]

GitHub stats
run-llama/llama_index#20862 (fetched 2026-04-08 00:30:35)
Comments: 0 · Participants: 1 · Timeline: 1 · Reactions: 0
Timeline (top): cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #20887: Add agent evaluation: ToolCallCorrectness and AgentGoalSuccess evaluators

Description (problem / solution / changelog)

Resolves #20862

What this PR does

Adds two evaluators for agent workflows under llama_index.core.evaluation.agent:

ToolCallCorrectnessEvaluator (deterministic, no LLM needed)

  • Compares expected vs actual tool calls by name and kwargs
  • Supports ordered and unordered matching
  • Supports strict (exact) or subset kwargs comparison
  • Configurable pass/fail threshold
  • Returns a score (fraction of matched calls) and detailed feedback
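
The matching options listed above can be illustrated with a minimal, library-free sketch. It is not the code from the PR: the function names `kwargs_match` and `score_tool_calls` are hypothetical, and the handling of an empty expected list is an assumption. Only the dict shape (`tool_name`, `tool_kwargs`) is taken from the Usage section.

```python
from typing import Any


def kwargs_match(expected: dict[str, Any], actual: dict[str, Any], strict: bool) -> bool:
    """Strict: dicts must be equal. Subset: every expected key/value must appear in actual."""
    if strict:
        return expected == actual
    return all(key in actual and actual[key] == value for key, value in expected.items())


def score_tool_calls(
    expected: list[dict[str, Any]],
    actual: list[dict[str, Any]],
    ordered: bool = False,
    strict: bool = False,
    threshold: float = 1.0,
) -> tuple[float, bool]:
    """Return (fraction of expected calls matched, whether that meets the threshold)."""
    if not expected:
        # Assumption: with nothing expected, the run is correct only if nothing was called.
        score = 1.0 if not actual else 0.0
        return score, score >= threshold

    def same(exp: dict[str, Any], act: dict[str, Any]) -> bool:
        return exp["tool_name"] == act["tool_name"] and kwargs_match(
            exp.get("tool_kwargs", {}), act.get("tool_kwargs", {}), strict
        )

    matched = 0
    if ordered:
        # Ordered: compare position by position.
        for exp, act in zip(expected, actual):
            if same(exp, act):
                matched += 1
    else:
        # Unordered: each actual call can satisfy at most one expected call.
        remaining = list(actual)
        for exp in expected:
            for index, act in enumerate(remaining):
                if same(exp, act):
                    matched += 1
                    del remaining[index]
                    break

    score = matched / len(expected)
    return score, score >= threshold
```

With the expected and actual calls from the Usage section, subset matching accepts the extra `limit` argument, while `strict=True` rejects it.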

AgentGoalSuccessEvaluator (LLM-based judge)

  • Scores whether an agent achieved a given goal on a 1-5 scale
  • Considers the agent response, tool call history, and optional reference outcome
  • Follows the same pattern as CorrectnessEvaluator (uses default_parser, same prompt structure)
  • Configurable LLM, prompt template, and score threshold
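
A rough sketch of the judge flow described above, with the LLM replaced by a plain callable so it runs without any model. Everything here is an assumption for illustration: the prompt wording, the `build_judge_prompt`, `parse_score`, and `judge_goal_success` names, the "score on the first line" reply format, and the default threshold of 4.0 are not taken from the PR.

```python
from typing import Callable, Optional, Sequence


def build_judge_prompt(
    goal: str,
    response: str,
    tool_history: Sequence[str],
    reference: Optional[str] = None,
) -> str:
    """Assemble the judge prompt from the goal, the response, and the tool call history."""
    lines = [
        "Rate from 1 (goal not achieved) to 5 (goal fully achieved).",
        "Reply with the score on the first line and a short reason after it.",
        f"Goal: {goal}",
        f"Agent response: {response}",
        "Tool call history:",
    ]
    lines += [f"- {step}" for step in tool_history]
    if reference is not None:
        lines.append(f"Reference outcome: {reference}")
    return "\n".join(lines)


def parse_score(judge_output: str) -> Optional[float]:
    """Read a 1-5 score from the first line; return None if it cannot be parsed."""
    lines = judge_output.strip().splitlines()
    if not lines:
        return None
    try:
        score = float(lines[0].strip())
    except ValueError:
        return None
    return score if 1.0 <= score <= 5.0 else None


def judge_goal_success(
    goal: str,
    response: str,
    tool_history: Sequence[str],
    judge: Callable[[str], str],  # hypothetical stand-in for an LLM call
    threshold: float = 4.0,  # assumed default, not from the PR
    reference: Optional[str] = None,
) -> tuple[Optional[float], bool]:
    raw = judge(build_judge_prompt(goal, response, tool_history, reference))
    score = parse_score(raw)
    # An unparseable judge reply counts as a failure rather than raising.
    return score, score is not None and score >= threshold
```

Treating an unparseable reply as a failed evaluation is one possible design choice; raising an error instead would make judge formatting problems more visible.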

Both evaluators:

  • Extend BaseEvaluator with proper aevaluate, _get_prompts, _update_prompts
  • Are exported from llama_index.core.evaluation (top-level access)
  • Work with both evaluate() (sync) and aevaluate() (async)
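
The sync/async pairing can be pictured with a small stand-alone sketch in which the evaluator logic lives in the coroutine and the synchronous method only drives it. `SketchEvaluator`, `SketchResult`, and `NonEmptyResponseEvaluator` are hypothetical names, not LlamaIndex classes, and the real BaseEvaluator may run the coroutine differently.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchResult:
    score: Optional[float]
    passing: Optional[bool]


class SketchEvaluator:
    """Hypothetical stand-in for an evaluator base class."""

    async def aevaluate(self, **kwargs: Any) -> SketchResult:
        raise NotImplementedError

    def evaluate(self, **kwargs: Any) -> SketchResult:
        # Runs the coroutine to completion. asyncio.run() raises RuntimeError
        # if called from inside an already running event loop.
        return asyncio.run(self.aevaluate(**kwargs))


class NonEmptyResponseEvaluator(SketchEvaluator):
    """Toy evaluator: passes when the response contains non-whitespace text."""

    async def aevaluate(self, response: str = "", **kwargs: Any) -> SketchResult:
        ok = bool(response.strip())
        return SketchResult(score=1.0 if ok else 0.0, passing=ok)
```

Subclasses only implement `aevaluate`; callers in synchronous code use `evaluate`, and callers already inside an event loop should await `aevaluate` directly.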

Why

LlamaIndex has evaluators for RAG (faithfulness, relevancy, correctness, guideline adherence) but nothing for agent-specific evaluation. With the growth of agentic workflows, users need to verify that agents call the right tools with the right arguments and actually achieve the stated goal.

Usage

from llama_index.core.evaluation import ToolCallCorrectnessEvaluator

evaluator = ToolCallCorrectnessEvaluator()
result = evaluator.evaluate(
    expected_tool_calls=[
        {"tool_name": "search", "tool_kwargs": {"query": "weather SF"}},
    ],
    actual_tool_calls=[
        {"tool_name": "search", "tool_kwargs": {"query": "weather SF", "limit": 10}},
    ],
)
print(result.score)    # 1.0 (subset kwargs match by default)
print(result.passing)  # True

from llama_index.core.evaluation import AgentGoalSuccessEvaluator

evaluator = AgentGoalSuccessEvaluator()
result = evaluator.evaluate(
    query="Find cheap flights from NYC to London",
    response="Found 3 flights. Cheapest is Delta at $450.",
    contexts=[
        "Called flight_search(from='NYC', to='London') -> [Delta $450, United $520]",
        "Called sort_results(by='price') -> [Delta $450, United $520]",
    ],
)
print(result.score)    # 5.0
print(result.passing)  # True

Testing

  • 18 unit tests for ToolCallCorrectnessEvaluator and compare_tool_calls
  • 11 unit tests for AgentGoalSuccessEvaluator (mocked LLM)
  • 3 integration tests with real OpenAI calls (skipped without API key)
  • All 29 unit tests pass locally

$ pytest llama-index-core/tests/evaluation/test_tool_call_correctness.py \
        llama-index-core/tests/evaluation/test_goal_success.py -v
============================== 29 passed in 0.04s ==============================

Files changed

File                                               What
evaluation/agent/__init__.py                       Module exports
evaluation/agent/tool_call_correctness.py          Deterministic tool call evaluator
evaluation/agent/goal_success.py                   LLM-based goal achievement evaluator
evaluation/agent/utils.py                          Comparison logic and ToolCallComparisonResult dataclass
evaluation/__init__.py                             Export new evaluators at top level
tests/evaluation/test_tool_call_correctness.py     18 unit tests
tests/evaluation/test_goal_success.py              11 unit tests
tests/evaluation/test_agent_eval_integration.py    3 integration tests

Changed files

  • llama-index-core/llama_index/core/evaluation/__init__.py (modified, +7/-0)
  • llama-index-core/llama_index/core/evaluation/agent/__init__.py (added, +19/-0)
  • llama-index-core/llama_index/core/evaluation/agent/goal_success.py (added, +168/-0)
  • llama-index-core/llama_index/core/evaluation/agent/tool_call_correctness.py (added, +104/-0)
  • llama-index-core/llama_index/core/evaluation/agent/utils.py (added, +115/-0)
  • llama-index-core/tests/evaluation/test_agent_eval_integration.py (added, +94/-0)
  • llama-index-core/tests/evaluation/test_goal_success.py (added, +166/-0)
  • llama-index-core/tests/evaluation/test_tool_call_correctness.py (added, +248/-0)


Problem

LlamaIndex has solid evaluation for RAG (AnswerRelevancyEvaluator, FaithfulnessEvaluator, etc.) but nothing for evaluating agent behavior. The instrumentation module already captures agent events (AgentToolCallEvent, AgentRunStepStartEvent/EndEvent) but no evaluator consumes them.

As more teams build with AgentWorkflow, ReActAgent, and FunctionAgent, there's no built-in way to answer:

  • Did the agent call the right tool with the right arguments?
  • Did it follow system prompt instructions?
  • Was the reasoning chain sound (for ReAct)?
  • Did it stop at the right time instead of looping?
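
Of the questions above, the looping one lends itself to a simple deterministic check. The sketch below flags tool calls that repeat with identical arguments more than a chosen number of times. The function names and the `max_repeats` default are hypothetical, and the dict shape (`tool_name`, `tool_kwargs`) mirrors the Usage section of the PR rather than any instrumentation event type.

```python
import json
from collections import Counter
from typing import Any


def call_key(call: dict[str, Any]) -> tuple[str, str]:
    """Build a hashable key from the tool name and its canonicalized kwargs."""
    kwargs_json = json.dumps(call.get("tool_kwargs", {}), sort_keys=True, default=str)
    return call["tool_name"], kwargs_json


def find_repeated_calls(
    tool_calls: list[dict[str, Any]], max_repeats: int = 2
) -> dict[tuple[str, str], int]:
    """Return the calls that occur more than max_repeats times, with their counts."""
    counts = Counter(call_key(call) for call in tool_calls)
    return {key: count for key, count in counts.items() if count > max_repeats}
```

An empty result means no call exceeded the limit. Repetition with identical arguments is only a heuristic for looping: some agents legitimately poll the same tool.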

RAGAS and DeepEval both ship agent-specific evaluators (ToolCallAccuracy, TaskCompletion). LlamaIndex could have native equivalents that work directly with its instrumentation and agent types.

Proposal

Add agent evaluators under llama_index.core.evaluation.agent that follow the existing BaseEvaluator pattern.

Evaluators

ToolCallCorrectnessEvaluator - Given a user query and expected tool calls, score whether the agent called the correct tools with correct arguments. Works with any agent type.

InstructionAdherenceEvaluator - LLM-judged: did the agent's response follow the system prompt constraints? Uses the same judge LLM pattern as GuidelineEvaluator.

ReasoningQualityEvaluator - For ReAct agents: evaluates whether the reasoning steps (thought/action/observation chain) are logically sound and lead to the correct conclusion.

AgentGoalSuccessEvaluator - End-to-end: given a task description and the agent's final output, did the agent accomplish the goal?

Interface

Building on the existing BaseEvaluator:

from llama_index.core.evaluation import BaseEvaluator, EvaluationResult

class ToolCallCorrectnessEvaluator(BaseEvaluator):
    async def aevaluate(
        self,
        query: str | None = None,
        response: str | None = None,
        contexts: Sequence[str] | None = None,
        tool_calls: list[ToolCall] | None = None,
        expected_tool_calls: list[ToolCall] | None = None,
        **kwargs,
    ) -> EvaluationResult:
        ...

The tool_calls and expected_tool_calls parameters extend the base interface. Existing BatchEvalRunner would work for running these at scale.
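
To make "running these at scale" concrete, here is a plain asyncio sketch of evaluating many cases concurrently while capping the number in flight. It does not use or describe BatchEvalRunner's implementation: `run_batch`, its `workers` parameter, and the `exact_match` stand-in evaluator are all hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_batch(
    evaluate_one: Callable[..., Awaitable[Any]],
    cases: list[dict[str, Any]],
    workers: int = 4,
) -> list[Any]:
    """Evaluate every case concurrently, with at most `workers` running at once."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(case: dict[str, Any]) -> Any:
        async with semaphore:
            return await evaluate_one(**case)

    # gather() returns results in the same order as the input cases.
    return await asyncio.gather(*(guarded(case) for case in cases))


async def exact_match(tool_calls: list[dict], expected_tool_calls: list[dict]) -> float:
    """Hypothetical stand-in for an evaluator's aevaluate()."""
    return 1.0 if tool_calls == expected_tool_calls else 0.0
```

Each case is a dict of keyword arguments, so the extra `tool_calls` and `expected_tool_calls` parameters travel alongside the standard ones without changing the runner.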

Scope

I'd start with ToolCallCorrectnessEvaluator and AgentGoalSuccessEvaluator as a first PR, since they cover the most common evaluation need (did the agent do the right thing?). The other two can follow.

Happy to take this on. I've been contributing to the repo (security fixes, rate limiting).

extent analysis

Fix Plan

Step-by-Step Solution

1. Create Agent Evaluator Module

Create a new module llama_index.core.evaluation.agent to hold the new evaluators.

2. Implement ToolCallCorrectnessEvaluator

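from typing import Any, Sequence

from llama_index.core.evaluation import BaseEvaluator, EvaluationResult
# Assumption: ToolCall is the workflow event type; the issue does not say which type it means.
from llama_index.core.agent.workflow import ToolCall

class ToolCallCorrectnessEvaluator(BaseEvaluator):
    # BaseEvaluator is abstract: the prompt hooks must exist even without an LLM.
    def _get_prompts(self) -> dict[str, Any]:
        return {}

    def _update_prompts(self, prompts_dict: dict[str, Any]) -> None:
        pass

    async def aevaluate(
        self,
        query: str | None = None,
        response: str | None = None,
        contexts: Sequence[str] | None = None,
        tool_calls: list[ToolCall] | None = None,
        expected_tool_calls: list[ToolCall] | None = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        actual = tool_calls or []
        expected = expected_tool_calls or []
        if not expected:
            # Avoid dividing by zero: with nothing expected, any call is a mistake.
            return EvaluationResult(score=0.0 if actual else 1.0)
        # Count expected calls that appear among the actual calls, so the score stays in [0, 1].
        matched = [call for call in expected if call in actual]
        return EvaluationResult(score=len(matched) / len(expected))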

3. Implement AgentGoalSuccessEvaluator

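from typing import Any, Sequence

from llama_index.core.evaluation import BaseEvaluator, EvaluationResult

class AgentGoalSuccessEvaluator(BaseEvaluator):
    # BaseEvaluator is abstract: the prompt hooks must be defined.
    def _get_prompts(self) -> dict[str, Any]:
        return {}

    def _update_prompts(self, prompts_dict: dict[str, Any]) -> None:
        pass

    async def aevaluate(
        self,
        query: str | None = None,
        response: str | None = None,
        contexts: Sequence[str] | None = None,
        task_description: str | None = None,
        agent_output: str | None = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        # Fall back to the standard query/response arguments when the extras are omitted.
        task = task_description if task_description is not None else query
        output = agent_output if agent_output is not None else response
        # Placeholder only: exact equality between output and task text will almost never hold.
        # PR #20887 replaces this check with an LLM judge that scores goal success from 1 to 5.
        score = 1.0 if output is not None and output == task else 0.0
        return EvaluationResult(score=score)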

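4. Run with BatchEvalRunner

BatchEvalRunner should not need changes: as the issue notes, the existing runner works with BaseEvaluator subclasses, so the new evaluators can be passed to it to run at scale.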

5. Test
