Evaluating Agentic Systems
Your agent answered correctly, but did it pick the right tools, in the right order, without redundant calls? Score agent trajectories and answer quality with MLflow's tool-call scorers and LLM-as-judge.
Your agent can return a plausible answer and still take the wrong path to get there. It might call the wrong tool, call the right tool three times, ignore a tool error, or invent data after a failed lookup. The final response reads fine. The trajectory is a mess.
In this lesson we put a stock research agent under evaluation. We score both the final answer and the trajectory — the sequence of tool calls that produced it.
What You'll Build
- Score tool-call trajectories with ToolCallCorrectness and ToolCallEfficiency
- Combine trajectory and answer-quality checks in one evaluate() call
- Build a tool-calling stock research agent under automated evaluation
- Read per-case results to find where the agent picks the wrong tool
The Agent Evaluation Suite
A tool-calling agent has failure modes a plain RAG pipeline does not:
- It picks the wrong tool (search news when asked for a P/E ratio)
- It picks the right tool but calls it redundantly (look up the same ticker three times)
- It returns a fluent answer that ignores what the tools actually returned
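The first two failure modes can be made concrete with a few lines of plain Python. This is only an illustrative sketch, not how MLflow's scorers work (those are LLM judges): the trajectory format, the tool names search_news and get_fundamentals, and both helper functions are hypothetical, and argument values are assumed to be hashable.

```python
from collections import Counter

# Hypothetical trajectory format: one (tool_name, arguments) pair per call, in order.
Trajectory = list[tuple[str, dict]]


def unexpected_tools(trajectory: Trajectory, expected_tools: set[str]) -> list[str]:
    """Return tools that were called but not expected, in first-seen order."""
    seen: list[str] = []
    for name, _args in trajectory:
        if name not in expected_tools and name not in seen:
            seen.append(name)
    return seen


def redundant_calls(trajectory: Trajectory) -> dict:
    """Return each (tool, arguments) combination issued more than once, with its count."""
    # Arguments are turned into a sorted tuple so identical calls compare equal.
    counts = Counter((name, tuple(sorted(args.items()))) for name, args in trajectory)
    return {call: n for call, n in counts.items() if n > 1}


# A P/E ratio question: one fundamentals lookup would have been enough.
trajectory = [
    ("search_news", {"ticker": "AAPL"}),
    ("get_fundamentals", {"ticker": "AAPL"}),
    ("get_fundamentals", {"ticker": "AAPL"}),
    ("get_fundamentals", {"ticker": "AAPL"}),
]

print(unexpected_tools(trajectory, expected_tools={"get_fundamentals"}))
print(redundant_calls(trajectory))
```

The third failure mode, a fluent answer that ignores the tool output, has no simple deterministic check like these, which is one reason to reach for a judge model.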
The same mlflow.genai.evaluate() machinery from the RAG evaluation tutorial accepts agent-specific scorers. We mix two trajectory scorers with two answer-quality scorers:
- ToolCallCorrectness - Did the agent call the tools the task required, with the right arguments?
- ToolCallEfficiency - Did it call them without redundancy or unnecessary detours?
- RelevanceToQuery - Does the final answer address the user's question?
- Completeness - Does the answer cover all the facts the question asks for?
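A minimal sketch of what passing these four scorers to a single evaluate() call might look like. The scorer names come from this lesson; that they are importable from mlflow.genai.scorers and take no constructor arguments is an assumption that depends on your installed MLflow version. The evaluation cases and the run_evaluation helper are hypothetical, and the function is defined but not called here because it needs a traced agent and a configured judge model.

```python
# Hypothetical evaluation cases; the questions and tickers are illustrative only.
eval_data = [
    {"inputs": {"question": "What is the current P/E ratio for AAPL?"}},
    {"inputs": {"question": "Summarize this week's news about MSFT."}},
]


def run_evaluation(predict_fn):
    """Score trajectory and answer quality for every case in one evaluate() call.

    predict_fn is assumed to be your traced agent. It receives the keys of each
    "inputs" dict as keyword arguments (here: question=...).
    """
    # Imported lazily so this sketch can be read and loaded without MLflow installed.
    import mlflow
    from mlflow.genai.scorers import (  # assumed import path, check your version
        Completeness,
        RelevanceToQuery,
        ToolCallCorrectness,
        ToolCallEfficiency,
    )

    return mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=predict_fn,
        scorers=[
            ToolCallCorrectness(),  # trajectory layer
            ToolCallEfficiency(),   # trajectory layer
            RelevanceToQuery(),     # answer layer
            Completeness(),         # answer layer
        ],
    )
```

The trajectory scorers can only judge tool calls that appear in the trace, so the agent's tool functions would need to be traced for this to produce meaningful results.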
All four are LLM-as-judge scorers. Each returns a pass/fail verdict and a rationale. The split is the whole point:
| Layer | Question | Scorers |
|---|---|---|
| Trajectory | Did the agent take the right path? | ToolCallCorrectness, ToolCallEfficiency |
| Answer | Did the final response satisfy the user? | RelevanceToQuery, Completeness |
Use trajectory scores to debug tool use. Use answer scores to debug final communication. A production agent needs both layers.
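One way to act on that split is to group each case's verdicts by layer before digging into rationales. The sketch below assumes a hypothetical per-case dict that maps scorer name to a pass/fail boolean; it is not the shape MLflow returns, so you would adapt it to however you export your results.

```python
# Layer membership mirrors the table above.
TRAJECTORY_SCORERS = ("ToolCallCorrectness", "ToolCallEfficiency")
ANSWER_SCORERS = ("RelevanceToQuery", "Completeness")


def failing_layers(verdicts: dict[str, bool]) -> list[str]:
    """Return which layers have at least one failing scorer for a single case.

    verdicts is a hypothetical mapping of scorer name -> True (pass) / False (fail).
    Scorers missing from the mapping are treated as not failing.
    """
    layers = []
    if any(verdicts.get(name) is False for name in TRAJECTORY_SCORERS):
        layers.append("trajectory")
    if any(verdicts.get(name) is False for name in ANSWER_SCORERS):
        layers.append("answer")
    return layers


# A case where the answer reads fine but the agent repeated a lookup.
case = {
    "ToolCallCorrectness": True,
    "ToolCallEfficiency": False,
    "RelevanceToQuery": True,
    "Completeness": True,
}
print(failing_layers(case))
```

A case that fails only the trajectory layer points at tool selection or looping logic. A case that fails only the answer layer points at how the agent turns tool output into its final response.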