AI Agents and Workflows

Evaluating Agentic Systems

Your agent answered correctly, but did it pick the right tools, in the right order, without redundant calls? Score agent trajectories and answer quality with MLflow's tool-call and LLM-as-judge scorers.

Your agent can return a plausible answer and still take the wrong path to get there. It might call the wrong tool, call the right tool three times, ignore a tool error, or invent data after a failed lookup. The final response reads fine. The trajectory is a mess.

In this lesson we put a stock research agent under evaluation. We score both the final answer and the trajectory — the sequence of tool calls that produced it.

🎥 Video: Why Answer Scores Are Not Enough (2-3 minutes)

What to cover:

  • Show a stock question where the final answer sounds correct
  • Open the trace and show the hidden tool calls behind the answer
  • Point out two failure modes: wrong tool selection and redundant calls
  • Explain why agent evaluation must score both the trajectory and the final response

What You'll Build

  • Score tool-call trajectories with ToolCallCorrectness and ToolCallEfficiency
  • Combine trajectory and answer-quality checks in one evaluate() call
  • Build a tool-calling stock research agent and run it under automated evaluation
  • Read per-case results to find where the agent picks the wrong tool

The Agent Evaluation Suite

A tool-calling agent (sketched in code after this list) has failure modes a plain RAG pipeline does not:

  • It picks the wrong tool (search news when asked for a P/E ratio)
  • It picks the right tool but calls it redundantly (look up the same ticker three times)
  • It returns a fluent answer that ignores what the tools actually returned
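
As a concrete target for these checks, here is a minimal sketch of the kind of stock research agent under test, built with LangGraph's prebuilt ReAct agent. The tool names, their stubbed return values, and the model choice are illustrative assumptions, not the lesson's exact code:

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


@tool
def get_stock_fundamentals(ticker: str) -> dict:
    """Return fundamentals such as the P/E ratio for a ticker."""
    # Illustrative stub; a real agent would call a market-data API here.
    return {"ticker": ticker, "pe_ratio": None, "note": "placeholder data"}


@tool
def search_news(query: str) -> list[str]:
    """Return recent headlines matching a query."""
    # Illustrative stub; a real agent would call a news API here.
    return [f"placeholder headline about {query}"]


# Model choice is an assumption; any tool-calling chat model works here.
agent = create_react_agent(
    ChatOpenAI(model="gpt-4o-mini"),
    tools=[get_stock_fundamentals, search_news],
)
```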

🎥 Video: Evaluation Architecture (5-10 minutes)

What to cover:

  • Draw the flow: dataset query → predict() → LangGraph agent → tool calls → MLflow trace (see the code sketch after this list)
  • Show where trajectory scorers inspect tool calls
  • Show where answer-quality scorers inspect the final response
  • Explain why the judge model should be separate from the agent model
  • Open MLflow and show how one failed case maps back to the trace
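
In code, that flow can look roughly like the sketch below, which builds on the agent defined earlier. mlflow.langchain.autolog() records the LangGraph run, including each tool call, as an MLflow trace, and a small predict_fn adapts the agent to what evaluate() expects; the query keyword assumes the dataset rows key their inputs that way:

```python
import mlflow

# Autologging records the LangGraph run, including every tool call,
# as spans on one MLflow trace; the trajectory scorers inspect that trace.
mlflow.langchain.autolog()


def predict_fn(query: str) -> str:
    """Called once per dataset row by mlflow.genai.evaluate()."""
    result = agent.invoke({"messages": [{"role": "user", "content": query}]})
    # The prebuilt ReAct agent returns the full message history; the last
    # message carries the final answer shown to the user.
    return result["messages"][-1].content
```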

The same mlflow.genai.evaluate() machinery from the RAG evaluation tutorial accepts agent-specific scorers. We mix two trajectory scorers with two answer-quality scorers:

  • ToolCallCorrectness - Did the agent call the tools the task required, with the right arguments?
  • ToolCallEfficiency - Did it call them without redundancy or unnecessary detours?
  • RelevanceToQuery - Does the final answer address the user's question?
  • Completeness - Does the answer cover all the facts the question asks for?

All four are LLM-as-judge scorers. Each returns a pass/fail verdict and a rationale. The split is the whole point:

| Layer      | Question                                 | Scorers                                 |
| ---------- | ---------------------------------------- | --------------------------------------- |
| Trajectory | Did the agent take the right path?       | ToolCallCorrectness, ToolCallEfficiency |
| Answer     | Did the final response satisfy the user? | RelevanceToQuery, Completeness          |

Use trajectory scores to debug tool use. Use answer scores to debug final communication. A production agent needs both layers.
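
Putting the two layers together, a minimal end-to-end sketch of the evaluate() call might look like this. It reuses the predict_fn from the earlier sketch; the scorer import path, the expected_tool_calls expectation key, and the example row are assumptions meant to show the shape of the call, not exact field names:

```python
import mlflow
from mlflow.genai.scorers import (  # import path assumed for the lesson's four scorers
    Completeness,
    RelevanceToQuery,
    ToolCallCorrectness,
    ToolCallEfficiency,
)

# One illustrative row: "inputs" feeds predict_fn, "expectations" gives the
# trajectory judge something to compare the actual tool calls against.
# The expected_tool_calls key name is an assumption; check the scorer docs.
eval_data = [
    {
        "inputs": {"query": "What is NVIDIA's P/E ratio right now?"},
        "expectations": {"expected_tool_calls": ["get_stock_fundamentals"]},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[
        # Trajectory layer: did the agent take the right path?
        ToolCallCorrectness(),
        ToolCallEfficiency(),
        # Answer layer: did the final response satisfy the user?
        RelevanceToQuery(),
        Completeness(),
    ],
)
```

Per-case verdicts and rationales appear in the resulting evaluation run in the MLflow UI, and each failing row links back to its trace, so a trajectory failure can be read next to the tool calls that produced it.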

