Lies, Damn Lies and Hallucinations - Evaluating your LLMs
When a new model drops, you’ll see headlines trumpeting its amazing benchmark scores. DeepSeek-R1 beats OpenAI o1! Claude outperforms Llama! But here’s the uncomfortable truth: those benchmark scores might not tell you what you really need to know.
Think about it this way: if you’re studying for a test, and you have access to all the questions and answers beforehand, are your test results meaningful? That’s essentially what happens with public LLM benchmarks. Model developers can’t help but be aware of these tests, even if they don’t directly train on them. They are incentivized to make their models perform well on these public metrics, which can lead to benchmark-chasing rather than real-world capability improvements.
How to build a reliable evaluation
So how do you figure out which model is actually best for your needs? You’ll need to run your own evaluation. Here’s how to do it:
- Gather realistic data: Collect examples that match what you’ll use in production. If you’re building a customer service bot, use actual customer queries. If you’re doing code generation, use real coding tasks from your team.
- Design prompts: Create a set of prompts that reflect how you’ll actually use the model. Start with simple templates and iterate based on results. These will be your testing foundation.
- Select candidate models: Don’t just jump to the biggest, most expensive model. Start with a model that should do the job and is fast to iterate with. You can always scale up if needed.
- Define success metrics: Decide what “good” looks like for your use case. This could be simple accuracy scores, response time, or more sophisticated measures like hallucination detection or consistency checking. A minimal sketch of how these pieces fit together follows this list.
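To make these steps concrete, here’s a minimal, framework-free sketch of how the pieces fit together. The example queries, the prompt template, and the `call_model` placeholder are illustrative assumptions, not part of any specific library:

```python
# A minimal, framework-free sketch of an evaluation loop.
# The example data, prompt template, and metric are illustrative assumptions.
test_cases = [
    {"query": "My card was declined, what now?", "expected_topic": "payment"},
    {"query": "How do I reset my password?", "expected_topic": "account"},
]

PROMPT_TEMPLATE = "Classify the topic of this customer query as one word: {query}"


def accuracy(predictions: list[str], cases: list[dict]) -> float:
    correct = sum(
        pred.strip().lower() == case["expected_topic"]
        for pred, case in zip(predictions, cases)
    )
    return correct / len(cases)


def run_eval(call_model) -> float:
    # `call_model` stands in for whatever client you use (OpenAI, Ollama, etc.).
    predictions = [call_model(PROMPT_TEMPLATE.format(**case)) for case in test_cases]
    return accuracy(predictions, test_cases)
```

The point is not this particular metric, but that data, prompts, models, and metrics become explicit pieces you can version and swap independently.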
Before we dive deeper into each of these steps, let’s examine the limitations of LLMs; this will help you design better evaluations.
Known limitations of LLMs
Your Large Language Model might seem like a genius, but it’s important to understand what it can’t do. Here are the key limitations you’ll encounter when working with LLMs:
- Hallucinations (making things up) - Think of LLMs as pattern-matching machines rather than encyclopedias. They predict what words should come next based on patterns they’ve seen in training, not based on facts they “know.” This means they can write convincingly about things that simply aren’t true.
- Biases - LLMs learn from human-created content, which means they inherit human biases.
- Static Knowledge - LLMs are like books - once printed, their knowledge doesn’t update. If an LLM was trained on data up to 2022, it won’t know about events in 2023. While we can use techniques like RAG (Retrieval-Augmented Generation) to add new information, the base model remains frozen in time.
- Language Inequality - These models excel in English but often struggle with other languages. Why? Because most of their training data is in English. You can expect poorer performance in languages with fewer online resources.
- False Confidence - One of the trickiest problems with LLMs is that they sound equally confident whether they’re right or wrong. Unlike traditional ML models that give you probability scores, LLMs can write perfectly convincing nonsense with no indication that they’re uncertain.
In the next sections, we’ll explore practical strategies for managing these limitations and building more reliable AI applications.
Managing your evaluations with Opik
Evaluating your LLMs is something you should treat just like production code. Having a system that is well documented and easy to use will help you iterate faster and make better decisions.
In my practice, I use Opik (by Comet), a free (for self-hosting) and open-source tool that lets you go through the whole evaluation lifecycle and go as deep as you need. You can start with pre-made metrics and then build your own.
From RAG chatbots to code assistants to complex agentic systems and beyond, build LLM systems that run better, faster, and cheaper with tracing, evaluations, and dashboards.
Opik is easy to set up on your own machine (it requires Docker and Docker Compose to be installed). First, clone the repository:
git clone https://github.com/comet-ml/opik.git
cd opik
git checkout 9e83ba0
This should give you version 1.4.5 of Opik. Now go to the deployment/docker-compose directory:
cd deployment/docker-compose
Start the services:
docker compose up --detach
If everything is running correctly, go to http://localhost:5173/ and you should see the Opik dashboard.
When you want to stop it, run:
docker compose down
Tracing calls
Logging prompts and the LLM responses is a good first step to understanding how your model is performing. Opik makes this really easy, even if you’re using a local model. Let’s set up the client:
import os

import opik
from loguru import logger
from openai import OpenAI
from opik.integrations.openai import track_openai

opik.configure(use_local=True)
os.environ["OPIK_PROJECT_NAME"] = "tracing"

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
)
client = track_openai(client)
The track_openai function wraps the OpenAI client (which points to a local Ollama instance) and records every call and response made through it. Let’s look at an example:
MODEL = "llama3.2"
TEMPERATURE = 0

PROMPT = """
You are a UX writer specializing in clear, actionable error messages.
Write a payment failure error message in 2 parts:
- What happened (max 10 words)
- What to do (max 15 words)
The result should be a single error message that is 25 words or less. Format it as a JSON object.
""".strip()

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": PROMPT,
        }
    ],
    temperature=TEMPERATURE,
    response_format={"type": "json_object"},
    model=MODEL,
)
logger.debug(chat_completion.choices[0].message.content)
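Before looking at the trace, it’s worth sanity-checking the output itself. Since we asked for a JSON object, we can parse the response and verify the word budget from the prompt; the exact keys are whatever the model chose, so treat this as an illustrative sketch:

```python
import json

# Parse the JSON the model returned; keys are model-chosen, so only check the word budget.
error_message = json.loads(chat_completion.choices[0].message.content)
total_words = sum(len(str(part).split()) for part in error_message.values())

logger.debug(f"Parts: {list(error_message)}, total words: {total_words}")
if total_words > 25:
    logger.warning("The model exceeded the 25-word budget from the prompt")
```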
The chat completion call looks entirely regular, but it is traced by Opik, and you can see the results in the dashboard. We named our project tracing; look for that name in the dashboard:

Evaluating with your own data
In this part, I’ll go through the whole evaluation process using a custom dataset (available on HuggingFace: https://huggingface.co/datasets/NickyNicky/crypto-news-small). It contains cryptocurrency news articles, and we’ll use an LLM to extract sentiment from them.
Let’s start by adding the dataset to Opik. Here’s a sample of the data:
| title | text | class | polarity | subjectivity |
|---|---|---|---|---|
| Grayscale CEO Calls for Simult… | Grayscale CEO Michael Sonnenshein believes the SEC… | negative | -0.1 | 0.6 |
| Indian Government is Actively … | In an exclusive interview with CryptoNews, Manhar … | neutral | 0 | 0 |
| Judge Approves Settlement: Bin… | According to the Federal Court ruling on December … | positive | 0.05 | 0.05 |
| Why a gold rush for inscriptio… | Some suggest EVM inscriptions are the latest way f… | positive | 0.5 | 0.9 |
| ”Concerning precedent” — bloXr… | A decision by bloXroute Labs to start censoring OF… | neutral | 0 | 0 |
Opik allows you to add your data as a dataset. This helps you track your experiments and reuse the same data across runs. Let’s first create a function that loads the news data:
import ast

import opik
import pandas as pd
from opik import Opik

from bootcamp.config import Config  # paths and dataset names used below


def create_crypto_data() -> pd.DataFrame:
    df = pd.read_parquet(Config.Path.DATA_DIR / "crypto-news.parquet")
    sentiment_df = pd.DataFrame(list(df["sentiment"].apply(ast.literal_eval).values))
    return pd.concat([df.drop("sentiment", axis=1), sentiment_df], axis=1)
Next, we’ll create a new dataset in Opik and insert the data:
N_CRYPTO_NEWS = 100

ARTICLE_FORMAT = """
<article>
<title>{title}</title>
<text>{text}</text>
</article>
""".strip()


def add_crypto_news_dataset(client: Opik):
    client.delete_dataset(Config.Dataset.CRYPTO_NEWS)
    dataset = client.get_or_create_dataset(Config.Dataset.CRYPTO_NEWS)
    df = create_crypto_data().sample(N_CRYPTO_NEWS)
    rows = []
    for _, row in df.iterrows():
        rows.append(
            {
                "input": ARTICLE_FORMAT.format(title=row.title, text=row.text),
                "expected_output": {
                    "sentiment": row["class"],
                    "polarity": row["polarity"],
                    "subjectivity": row["subjectivity"],
                },
            }
        )
    dataset.insert(rows)
client = Opik()
add_crypto_news_dataset(client)
We’re formatting each article as an XML-like structure, with a title and text. This is a common way to structure data for LLMs. The expected output is a JSON object with the labels (sentiment, polarity and subjectivity) we want our model to predict. Let’s check the dataset in the dashboard:

We’ll test two different prompts for extracting the sentiment from the news articles. Luckily, Opik allows you to add prompts to a library, so you can reuse them in different evaluations. Here’s how you can add a prompt:
import opik
from opik import Opik
from bootcamp.config import Config
opik.configure(use_local=True)
client = Opik()
CLASSIFY_ARTICLE_PROMPT = """
Classify the following news article:
- sentiment (positive, neutral, negative)
- subjectivity (float)
- polarity (float)
{{input}}
The response must be in JSON. Here's an example:
```
{
"sentiment": "positive",
"subjectivity": 0.5,
"polarity": -0.1
}
```
Reply only with the JSON object.
""".strip()
client.create_prompt(Config.Prompt.CLASSIFY_ARTICLE, CLASSIFY_ARTICLE_PROMPT)

The next prompt will focus the evaluation on correctly finding the sentiment. This is the most important part of the task. Here’s how you can add it:
CLASSIFY_ARTICLE_FOCUS_SENTIMENT_PROMPT = """
Classify the following news article:
- sentiment (positive, neutral, negative)
- subjectivity (float)
- polarity (float)
{{input}}
The response must be in JSON. Here's an example:
```
{
"sentiment": "positive",
"subjectivity": 0.5,
"polarity": -0.1
}
```
Focus on correctly finding the sentiment. This is the most important part.
Reply only with the JSON object.
""".strip()
client.create_prompt(
    Config.Prompt.CLASSIFY_ARTICLE_FOCUS_SENTIMENT,
    CLASSIFY_ARTICLE_FOCUS_SENTIMENT_PROMPT,
)
For this example, we’ll measure the accuracy of the predicted sentiment. For that, we’ll create a custom metric that compares the expected sentiment with the predicted sentiment. Here’s how to do it:
import json
from typing import Any

from opik.evaluation import models
from opik.evaluation.metrics import Hallucination, base_metric, score_result

from bootcamp.config import Config


def _remove_thinking_from_response(response: str) -> str:
    # Strip the reasoning block emitted by models like DeepSeek-R1, if present.
    close_tag = "</think>"
    if close_tag not in response:
        return response.strip()
    tag_length = len(close_tag)
    return response[response.find(close_tag) + tag_length :].strip()


class AccuracyMetric(base_metric.BaseMetric):
    def __init__(self, name: str, field: str):
        self.name = name
        self.field = field

    def score(self, expected_output: dict, output: str, **ignored_kwargs: Any):
        output = _remove_thinking_from_response(output)
        # Drop Markdown code fences in case the model wrapped the JSON in a code block.
        text = output.replace("```json", "").replace("```", "")
        response = json.loads(text)
        return score_result.ScoreResult(
            value=expected_output[self.field] == response[self.field],
            name=self.name,
        )
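As a quick check, you can run the metric on a single hand-written example before wiring it into an experiment; the values below are made up for illustration:

```python
metric = AccuracyMetric(name="sentiment_accuracy", field="sentiment")

result = metric.score(
    expected_output={"sentiment": "positive", "polarity": 0.2, "subjectivity": 0.4},
    output='{"sentiment": "positive", "polarity": 0.1, "subjectivity": 0.5}',
)
print(result.value)  # True - only the sentiment field is compared
```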
We’re removing everything up to the closing </think> tag from the response (relevant if you’re using the DeepSeek-R1 model). We also strip the Markdown code fences in case the model wrapped the JSON in a code block. We’re ready to run our experiment:
import opik
from opik import Opik
from opik.evaluation import evaluate_prompt, models

from bootcamp.config import Config
from bootcamp.evaluation.metrics import AccuracyMetric

opik.configure(use_local=True)

client = Opik()
dataset = client.get_or_create_dataset(Config.Dataset.CRYPTO_NEWS)

evaluate_prompt(
    project_name="crypto-news",
    experiment_name="sentiment-accuracy",
    dataset=dataset,
    messages=[
        {
            "role": "user",
            "content": client.get_prompt(Config.Prompt.CLASSIFY_ARTICLE).prompt,
        },
    ],
    scoring_metrics=[AccuracyMetric(name="sentiment_accuracy", field="sentiment")],
    model=models.LiteLLMChatModel(
        model_name=f"ollama/{Config.Model.DEEPSEEK_R1}",
        temperature=0,
    ),
)
We’re making use of our existing dataset and prompt. The LiteLLMChatModel points to our Ollama instance running the DeepSeek-R1 model. Run the experiment:
uv run bootcamp/evaluation/evaluate_news.py
You can see the results in the Opik dashboard:
Now, let’s change the prompt:
"content": client.get_prompt(Config.Prompt.CLASSIFY_ARTICLE_FOCUS_SENTIMENT).prompt
Just a couple of words added to the prompt, and we got a better result. Interesting, right?
Does your LLM hallucinate?
Opik also has a built-in metric to measure hallucination. We’ll create summaries of emails with an LLM and evaluate whether the summaries contain hallucinations.
The dataset can be found on HuggingFace: https://huggingface.co/datasets/argilla/FinePersonas-Conversations-Email-Summaries. Here’s a sample of the data:
| email | summary |
|---|---|
| Subject: Following up from the interfait… | Samantha met at the interfaith event and enjoyed a workshop on Jewish holidays a… |
| Subject: RE: Checking in and seeking you… | Jenna thanks Marcus for sharing an article and expresses enthusiasm about collab… |
| Subject: Interesting developments in the… | John is reaching out to discuss the recent discovery of a large natural gas fiel… |
| Subject: Seeking your expertise on a new… | Emily is reaching out for input and expertise on an e-learning course designed t… |
| Subject: Reconnecting and collaboration … | Sarah expresses interest in collaborating on a project to create educational res… |
Those summaries were generated by the Qwen-2.5-72B model, and we’re not going to use them. We’ll create our own summaries with the model that we’re going to evaluate. Let’s start by adding the dataset to Opik:
N_EMAIL_SUMMARIES = 100

EMAIL_FORMAT = """
<email>
{text}
</email>
""".strip()


def create_email_summaries_data() -> pd.DataFrame:
    return pd.read_parquet(Config.Path.DATA_DIR / "email-summaries.parquet")


def add_email_summaries_dataset(client: Opik):
    client.delete_dataset(Config.Dataset.EMAIL_SUMMARIES)
    dataset = client.get_or_create_dataset(Config.Dataset.EMAIL_SUMMARIES)
    df = create_email_summaries_data().sample(N_EMAIL_SUMMARIES)
    rows = [{"input": EMAIL_FORMAT.format(text=row.email)} for _, row in df.iterrows()]
    dataset.insert(rows)
add_email_summaries_dataset(client)

Our prompt will be very simple (feel free to tweak it for better results):
SUMMARIZE_EMAIL_PROMPT = """
Summarize the following email:
{{input}}
Focus only on the most important points and keep it short.
Reply only with the text of the summary.
"""
client.create_prompt(
    Config.Prompt.SUMMARIZE_EMAIL,
    SUMMARIZE_EMAIL_PROMPT,
)
The hallucination metric comes out of the box with Opik. We’ll use it in our evaluation:
import opik
from opik import Opik
from opik.evaluation import evaluate_prompt, models
from opik.evaluation.metrics import Hallucination

from bootcamp.config import Config

opik.configure(use_local=True)

client = Opik()
dataset = client.get_or_create_dataset(Config.Dataset.EMAIL_SUMMARIES)

evaluate_prompt(
    project_name="email-summaries",
    experiment_name="summarize-email",
    dataset=dataset,
    messages=[
        {
            "role": "user",
            "content": client.get_prompt(Config.Prompt.SUMMARIZE_EMAIL).prompt,
        },
    ],
    scoring_metrics=[
        Hallucination(
            name="summary_hallucination",
            model=models.LiteLLMChatModel(
                model_name=f"ollama/{Config.Model.JUDGE_LLM}", temperature=0
            ),
        ),
    ],
    model=models.LiteLLMChatModel(
        model_name=f"ollama/{Config.Model.QWEN}",
        temperature=0,
    ),
)
The metric requires an LLM as a judge to detect if the summary contains hallucinations. It returns a score between 0 and 1, where 0 means no hallucinations and 1 means the summary is full of them. Run the experiment:
uv run bootcamp/evaluation/evaluate_email_summaries.py
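If you want to sanity-check the hallucination metric on a single example before running the full experiment, you can call it directly. A minimal sketch, assuming the metric scores an input text against the generated output (the judge model name and texts are illustrative):

```python
from opik.evaluation import models
from opik.evaluation.metrics import Hallucination

# A local judge model via LiteLLM/Ollama; the model name is an illustrative choice.
judge = models.LiteLLMChatModel(model_name="ollama/llama3.2", temperature=0)
metric = Hallucination(name="summary_hallucination", model=judge)

result = metric.score(
    input="<email>Hi team, the Q3 budget review moves to Friday at 10am.</email>",
    output="The email says the Q3 budget review is cancelled.",  # deliberately wrong
)
print(result.value, result.reason)
```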