RAG and Context Engineering

The RAG Evaluation Triad

Your RAG pipeline returned a wrong answer, but was it bad retrieval or an LLM hallucination? Score retrieval relevance, groundedness, and answer quality automatically with LLM-as-judge evaluators and MLflow.

Your RAG pipeline returns an answer, but is it grounded in the documents or did the LLM hallucinate? Did the retriever even pull the right documents? These are distinct failure modes, and you need automated checks for each.

What You'll Build

  • Score RAG answers with three LLM-as-judge evaluators (the Triad)
  • Write a custom scorer for domain-specific quality checks
  • Run automated evaluation across a dataset
  • Analyze per-query scorer rationales from trace data

The Evaluation Triad

A RAG pipeline can fail in three common ways:

  • The retriever can pull irrelevant documents
  • The LLM can hallucinate beyond the retrieved context
  • The final answer can miss the user's actual question entirely

The Evaluation Triad checks all three dimensions using an LLM-as-judge pattern (a judge model grades the responses):

  • RelevanceToQuery¹ - Does the generated answer address the user's question? A response about NVIDIA revenue may be factually correct but is irrelevant if the user asked about Apple. This catches off-topic answers.
  • RetrievalGroundedness² - Is every claim in the answer supported by the retrieved documents? If the retriever fetched the right documents but the LLM hallucinated facts, this scorer catches it.
  • RetrievalRelevance³ - Did the retriever pull documents relevant to the query? If the retriever fetched Apple earnings when the user asked about NVIDIA, the LLM never had a chance to answer correctly.
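The triad above amounts to three independent checks over the (question, retrieved documents, answer) triple. Here is a minimal sketch of that shape in plain Python, using crude string-overlap heuristics as stand-ins for the judge model; every function name here is hypothetical, for illustration only:

```python
# Crude stand-ins for LLM judges: each check returns a pass/fail
# verdict plus a rationale, mirroring the Triad's output shape.
# All names are hypothetical illustrations, not MLflow's API.

def relevance_to_query(question: str, answer: str) -> dict:
    # Does the answer share any key terms with the question?
    q_terms = set(question.lower().split())
    overlap = q_terms & set(answer.lower().split())
    return {"pass": bool(overlap), "rationale": f"shared terms: {sorted(overlap)}"}

def retrieval_groundedness(docs: list[str], answer: str) -> dict:
    # Is every "claim" (here, crudely: every word) present in some doc?
    corpus = " ".join(docs).lower()
    unsupported = [w for w in answer.lower().split() if w not in corpus]
    return {"pass": not unsupported, "rationale": f"unsupported: {unsupported}"}

def retrieval_relevance(question: str, docs: list[str]) -> dict:
    # Did any retrieved doc mention the question's terms at all?
    q_terms = set(question.lower().split())
    hits = [d for d in docs if q_terms & set(d.lower().split())]
    return {"pass": bool(hits), "rationale": f"{len(hits)}/{len(docs)} docs relevant"}
```

In a real pipeline each check is a judge-model prompt rather than string matching; the point is the structure: three independent verdicts, each paired with a rationale you can inspect per query.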

MLflow⁴ provides built-in scorers for all three. Each uses a judge model to return a pass/fail verdict with a rationale explaining why.

Our RAG Evaluation Pipeline

Footnotes

  1. RelevanceToQuery

  2. RetrievalGroundedness

  3. RetrievalRelevance

  4. MLflow GenAI Evaluation