AI Agents and Workflows

Evaluating Agentic Systems

Your agent answered correctly, but did it pick the right tools, in the right order, without redundant calls? Score agent trajectories and answer quality with MLflow's tool-call scorers and LLM-as-judge.

Your agent can return a plausible answer and still take the wrong path to get there. It might call the wrong tool, call the right tool three times, ignore a tool error, or invent data after a failed lookup. The final response reads fine. The trajectory is a mess.

In this lesson we put a stock research agent under evaluation. We score both the final answer and the trajectory, that is, the sequence of tool calls that produced it.

What You'll Build

  • Score tool-call trajectories with ToolCallCorrectness and ToolCallEfficiency
  • Combine trajectory and answer-quality checks in one evaluate() call
  • Put a tool-calling stock research agent under automated evaluation
  • Read per-case results to find where the agent picks the wrong tool

The Agent Evaluation Suite

A tool-calling agent has failure modes a plain RAG pipeline does not:

  • It picks the wrong tool (search news when asked for a P/E ratio)
  • It picks the right tool but calls it redundantly (look up the same ticker three times)
  • It returns a fluent answer that ignores what the tools actually returned
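
The first two failure modes can be illustrated with a small deterministic check. This is only a sketch to make the idea concrete, not how MLflow's LLM-as-judge scorers work internally. The tool names (get_fundamentals, search_news), the trajectory format, and the helper functions are all hypothetical.

```python
from collections import Counter

# Hypothetical trajectory: (tool_name, arguments) pairs in call order.
# The user asked for a P/E ratio, so only get_fundamentals is expected.
trajectory = [
    ("get_fundamentals", {"ticker": "AAPL"}),
    ("get_fundamentals", {"ticker": "AAPL"}),
    ("search_news", {"query": "AAPL"}),
]


def missing_tools(trajectory, expected_tools):
    """Return expected tool names the agent never called."""
    called = {name for name, _ in trajectory}
    return sorted(set(expected_tools) - called)


def unexpected_tools(trajectory, expected_tools):
    """Return tool names the agent called that the task did not require."""
    called = {name for name, _ in trajectory}
    return sorted(called - set(expected_tools))


def redundant_calls(trajectory):
    """Return identical (tool, arguments) calls made more than once, with counts.

    Assumes argument values are hashable (strings, numbers, tuples).
    """
    counts = Counter(
        (name, tuple(sorted(args.items()))) for name, args in trajectory
    )
    return {call: n for call, n in counts.items() if n > 1}


print(missing_tools(trajectory, ["get_fundamentals"]))
print(unexpected_tools(trajectory, ["get_fundamentals"]))
print(redundant_calls(trajectory))
```

The third failure mode, a fluent answer that ignores what the tools returned, cannot be caught by counting calls. It needs a judgment about meaning, which is why the scorers below use an LLM as judge.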

The same mlflow.genai.evaluate() machinery from the RAG evaluation tutorial accepts agent-specific scorers. We mix two trajectory scorers with two answer-quality scorers:

  • ToolCallCorrectness [1] - Did the agent call the tools the task required, with the right arguments?
  • ToolCallEfficiency [2] - Did it call them without redundancy or unnecessary detours?
  • RelevanceToQuery [3] - Does the final answer address the user's question?
  • Completeness [4] - Does the answer cover all the facts the question asks for?
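
A minimal sketch of how the four scorers might be combined in one evaluate() call. Several things here are assumptions: the import path mlflow.genai.scorers and the no-argument constructors are inferred from the scorer names in this lesson, the questions are made up, and predict_fn stands for your own agent entry point. The sketch also assumes the agent is traced by MLflow so that its tool calls are recorded, and that a judge model is configured. Check the scorer documentation for the MLflow version you have installed.

```python
# Hypothetical evaluation cases. Each "inputs" dict is passed to
# predict_fn as keyword arguments, so predict_fn must accept `question`.
eval_data = [
    {"inputs": {"question": "What is Apple's current P/E ratio?"}},
    {"inputs": {"question": "Summarize this week's news about NVIDIA."}},
]


def run_agent_evaluation(predict_fn):
    """Score trajectory and answer quality in a single evaluate() call."""
    # Imported lazily so the dataset above can be inspected without MLflow.
    # Assumption: all four scorers are importable from mlflow.genai.scorers.
    import mlflow
    from mlflow.genai.scorers import (
        Completeness,
        RelevanceToQuery,
        ToolCallCorrectness,
        ToolCallEfficiency,
    )

    return mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=predict_fn,
        scorers=[
            ToolCallCorrectness(),  # trajectory: right tools, right arguments
            ToolCallEfficiency(),   # trajectory: no redundant calls
            RelevanceToQuery(),     # answer: addresses the question
            Completeness(),         # answer: covers everything asked
        ],
    )
```

Calling run_agent_evaluation(my_agent), where my_agent is a hypothetical function that takes a question and returns the agent's final response, would run every case through the agent and apply all four scorers to each one.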

All four are LLM-as-judge scorers. Each returns a pass/fail verdict and a rationale. The split is the whole point:

Layer | Question | Scorers
Trajectory | Did the agent take the right path? | ToolCallCorrectness, ToolCallEfficiency
Answer | Did the final response satisfy the user? | RelevanceToQuery, Completeness

Use trajectory scores to debug tool use. Use answer scores to debug final communication. A production agent needs both layers.
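
The two-layer split can be turned into a simple triage step when reading per-case results. The sketch below assumes each case's verdicts have already been collected into a plain dict of scorer name to pass/fail boolean. That dict shape and the triage helper are hypothetical and are not MLflow's result format.

```python
# Scorer names grouped by the layer they diagnose.
TRAJECTORY_SCORERS = ("ToolCallCorrectness", "ToolCallEfficiency")
ANSWER_SCORERS = ("RelevanceToQuery", "Completeness")


def triage(verdicts):
    """Group one case's failed scorers by layer.

    `verdicts` maps scorer name -> True (pass) or False (fail).
    """
    return {
        "trajectory": [s for s in TRAJECTORY_SCORERS if not verdicts[s]],
        "answer": [s for s in ANSWER_SCORERS if not verdicts[s]],
    }


# Hypothetical case: the answer reads fine, but the agent repeated a lookup.
case = {
    "ToolCallCorrectness": True,
    "ToolCallEfficiency": False,
    "RelevanceToQuery": True,
    "Completeness": True,
}

print(triage(case))
```

A case that fails only in the trajectory layer, like this one, points at tool use rather than at how the final response is worded.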


References

  1. ToolCallCorrectness
  2. ToolCallEfficiency
  3. RelevanceToQuery
  4. Completeness
