Evaluating Agentic Systems
Your agent answered correctly, but did it pick the right tools, in the right order, without redundant calls? Score agent trajectories and answer quality with MLflow's tool-call scorers and LLM-as-judge.
Your agent can return a plausible answer and still take the wrong path to get there. It might call the wrong tool, call the right tool three times, ignore a tool error, or invent data after a failed lookup. The final response reads fine. The trajectory is a mess.
In this lesson we put a stock research agent under evaluation. We score both the final answer and the trajectory, that is, the sequence of tool calls that produced it.
What You'll Build
- Score tool-call trajectories with ToolCallCorrectness and ToolCallEfficiency
- Combine trajectory and answer-quality checks in one evaluate() call
- Build a tool-calling stock research agent under automated evaluation
- Read per-case results to find where the agent picks the wrong tool
The Agent Evaluation Suite
A tool-calling agent has failure modes a plain RAG pipeline does not:
- It picks the wrong tool (search news when asked for a P/E ratio)
- It picks the right tool but calls it redundantly (look up the same ticker three times)
- It returns a fluent answer that ignores what the tools actually returned
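To make the first two failure modes concrete, here is a minimal sketch in plain Python. It is a deterministic heuristic, not how the MLflow scorers work (those are LLM judges). The ToolCall shape, the tool names (search_news, get_quote, get_fundamentals), and both helper functions are hypothetical and exist only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step in a trajectory (hypothetical shape, for illustration only)."""
    name: str
    args: tuple  # sorted (key, value) pairs, so identical calls compare equal


def call(name: str, **kwargs) -> ToolCall:
    """Build a hashable ToolCall from keyword arguments."""
    return ToolCall(name, tuple(sorted(kwargs.items())))


def redundant_calls(trajectory: list[ToolCall]) -> list[ToolCall]:
    """Return each call that appears more than once with identical arguments."""
    counts = Counter(trajectory)
    return [c for c, n in counts.items() if n > 1]


def missing_tools(trajectory: list[ToolCall], required: set[str]) -> set[str]:
    """Return the required tool names the agent never called."""
    return required - {c.name for c in trajectory}


# A trajectory for "What is AAPL's P/E ratio?" that goes wrong twice:
# it searches news instead of fetching fundamentals, and it repeats a lookup.
trajectory = [
    call("search_news", query="AAPL"),
    call("get_quote", ticker="AAPL"),
    call("get_quote", ticker="AAPL"),
    call("get_quote", ticker="AAPL"),
]

print(redundant_calls(trajectory))
print(missing_tools(trajectory, {"get_fundamentals"}))
```

Exact-match checks like these catch only literal repeats and missing tool names. Judging whether the arguments were appropriate, or whether a detour was justified, is where an LLM judge earns its place.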
The same mlflow.genai.evaluate() machinery from the RAG evaluation tutorial accepts agent-specific scorers. We mix two trajectory scorers with two answer-quality scorers:
- ToolCallCorrectness [1] - Did the agent call the tools the task required, with the right arguments?
- ToolCallEfficiency [2] - Did it call them without redundancy or unnecessary detours?
- RelevanceToQuery [3] - Does the final answer address the user's question?
- Completeness [4] - Does the answer cover all the facts the question asks for?
All four are LLM-as-judge scorers. Each returns a pass/fail verdict and a rationale. The split is the whole point:
| Layer | Question | Scorers |
|---|---|---|
| Trajectory | Did the agent take the right path? | ToolCallCorrectness, ToolCallEfficiency |
| Answer | Did the final response satisfy the user? | RelevanceToQuery, Completeness |
Use trajectory scores to debug tool use. Use answer scores to debug final communication. A production agent needs both layers.
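One way to act on the two-layer split when reading per-case results is to map each failing scorer back to its layer. The sketch below assumes each case's verdicts have already been reduced to a plain dict of scorer name to pass/fail boolean. That dict shape and the failing_layers helper are assumptions for illustration, not the format evaluate() returns.

```python
# Scorer-to-layer mapping, taken from the table above.
LAYERS = {
    "ToolCallCorrectness": "trajectory",
    "ToolCallEfficiency": "trajectory",
    "RelevanceToQuery": "answer",
    "Completeness": "answer",
}


def failing_layers(verdicts: dict[str, bool]) -> set[str]:
    """Return the layers ('trajectory', 'answer') with at least one failed scorer."""
    return {LAYERS[name] for name, passed in verdicts.items() if not passed}


# Hypothetical case: the answer reads fine, but the agent picked the wrong tool.
case = {
    "ToolCallCorrectness": False,
    "ToolCallEfficiency": True,
    "RelevanceToQuery": True,
    "Completeness": True,
}

print(failing_layers(case))
```

A case that fails only the trajectory layer is the "plausible answer, wrong path" situation from the introduction: debug tool selection, not the response wording.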