AI Agents and Workflows

Evaluating Agentic Systems

Your agent answered correctly, but did it pick the right tools, in the right order, without redundant calls? Score agent trajectories and answer quality with MLflow's tool-call and LLM-as-judge scorers.

Your agent can return a plausible answer and still take the wrong path to get there. It might call the wrong tool, call the right tool three times, ignore a tool error, or invent data after a failed lookup. The final response reads fine. The trajectory is a mess.

In this lesson we put a stock research agent under evaluation. We score both the final answer and the trajectory — the sequence of tool calls that produced it.

🎥 Video: Why Answer Scores Are Not Enough (2-3 minutes)

What to cover:

  • Show a stock question where the final answer sounds correct
  • Open the trace and show the hidden tool calls behind the answer
  • Point out two failure modes: wrong tool selection and redundant calls
  • Explain why agent evaluation must score both the trajectory and the final response

What You'll Build

  • Score tool-call trajectories with ToolCallCorrectness and ToolCallEfficiency
  • Combine trajectory and answer-quality checks in one evaluate() call
  • Build a tool-calling stock research agent and run it under automated evaluation
  • Read per-case results to find where the agent picks the wrong tool

The Agent Evaluation Suite

A tool-calling agent (sketched in code after this list) has failure modes a plain RAG pipeline does not:

  • It picks the wrong tool (search news when asked for a P/E ratio)
  • It picks the right tool but calls it redundantly (look up the same ticker three times)
  • It returns a fluent answer that ignores what the tools actually returned
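
As a concrete target for these checks, here is a minimal sketch of the kind of stock research agent under test, built with LangGraph's prebuilt ReAct agent. The tool names, their stubbed return values, and the model choice are illustrative assumptions, not the lesson's exact code:

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


@tool
def get_stock_fundamentals(ticker: str) -> dict:
    """Return fundamentals such as the P/E ratio for a ticker."""
    # Illustrative stub; a real agent would call a market-data API here.
    return {"ticker": ticker, "pe_ratio": None, "note": "placeholder data"}


@tool
def search_news(query: str) -> list[str]:
    """Return recent headlines matching a query."""
    # Illustrative stub; a real agent would call a news API here.
    return [f"placeholder headline about {query}"]


# Model choice is an assumption; any tool-calling chat model works here.
agent = create_react_agent(
    ChatOpenAI(model="gpt-4o-mini"),
    tools=[get_stock_fundamentals, search_news],
)
```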

🎥 Video: Evaluation Architecture (5-10 minutes)

What to cover:

  • Draw the flow: dataset query → predict() → LangGraph agent → tool calls → MLflow trace (see the code sketch after this list)
  • Show where trajectory scorers inspect tool calls
  • Show where answer-quality scorers inspect the final response
  • Explain why the judge model should be separate from the agent model
  • Open MLflow and show how one failed case maps back to the trace
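
In code, that flow can look roughly like the sketch below, which builds on the agent defined earlier. mlflow.langchain.autolog() records the LangGraph run, including each tool call, as an MLflow trace, and a small predict_fn adapts the agent to what evaluate() expects; the query keyword assumes the dataset rows key their inputs that way:

```python
import mlflow

# Autologging records the LangGraph run, including every tool call,
# as spans on one MLflow trace; the trajectory scorers inspect that trace.
mlflow.langchain.autolog()


def predict_fn(query: str) -> str:
    """Called once per dataset row by mlflow.genai.evaluate()."""
    result = agent.invoke({"messages": [{"role": "user", "content": query}]})
    # The prebuilt ReAct agent returns the full message history; the last
    # message carries the final answer shown to the user.
    return result["messages"][-1].content
```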

The same mlflow.genai.evaluate() machinery from the RAG evaluation tutorial accepts agent-specific scorers. We mix two trajectory scorers with two answer-quality scorers:

  • ToolCallCorrectness - Did the agent call the tools the task required, with the right arguments?
  • ToolCallEfficiency - Did it call them without redundancy or unnecessary detours?
  • RelevanceToQuery - Does the final answer address the user's question?
  • Completeness - Does the answer cover all the facts the question asks for?

All four are LLM-as-judge scorers. Each returns a pass/fail verdict and a rationale. The split is the whole point:

| Layer      | Question                                 | Scorers                                 |
| ---------- | ---------------------------------------- | --------------------------------------- |
| Trajectory | Did the agent take the right path?       | ToolCallCorrectness, ToolCallEfficiency |
| Answer     | Did the final response satisfy the user? | RelevanceToQuery, Completeness          |

Use trajectory scores to debug tool use. Use answer scores to debug final communication. A production agent needs both layers.
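
Putting the two layers together, a minimal end-to-end sketch of the evaluate() call might look like this. It reuses the predict_fn from the earlier sketch; the scorer import path, the expected_tool_calls expectation key, and the example row are assumptions meant to show the shape of the call, not exact field names:

```python
import mlflow
from mlflow.genai.scorers import (  # import path assumed for the lesson's four scorers
    Completeness,
    RelevanceToQuery,
    ToolCallCorrectness,
    ToolCallEfficiency,
)

# One illustrative row: "inputs" feeds predict_fn, "expectations" gives the
# trajectory judge something to compare the actual tool calls against.
# The expected_tool_calls key name is an assumption; check the scorer docs.
eval_data = [
    {
        "inputs": {"query": "What is NVIDIA's P/E ratio right now?"},
        "expectations": {"expected_tool_calls": ["get_stock_fundamentals"]},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[
        # Trajectory layer: did the agent take the right path?
        ToolCallCorrectness(),
        ToolCallEfficiency(),
        # Answer layer: did the final response satisfy the user?
        RelevanceToQuery(),
        Completeness(),
    ],
)
```

Per-case verdicts and rationales appear in the resulting evaluation run in the MLflow UI, and each failing row links back to its trace, so a trajectory failure can be read next to the tool calls that produced it.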

