Lies, Damn Lies and Hallucinations - Evaluating your LLMs

When a new model drops, you’ll see headlines trumpeting its amazing benchmark scores. DeepSeek-R1 beats OpenAI o1! Claude outperforms Llama! But here’s the uncomfortable truth: those benchmark scores might not tell you what you really need to know.

Think about it this way: if you’re studying for a test and have access to all the questions and answers beforehand, are your results meaningful? That’s essentially what happens with public LLM benchmarks. Model developers can’t help but be aware of these tests, even if they don’t train on them directly, and they’re incentivized to make their models score well on these public metrics. The result is benchmark-chasing rather than real-world capability improvements.

How to build a reliable evaluation
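
Before reaching for any particular framework, it helps to see the core idea in its simplest form: a small, private test set drawn from your own domain, scored by code you control. The sketch below is a minimal illustration of that shape, not a full recipe; `ask_llm`, the example cases, and the exact-match scoring are all placeholders you would swap for your real model client, your own held-out data, and a metric that fits your task.

```python
# Minimal sketch of a private, task-specific evaluation harness.
# Everything here is a placeholder: swap `ask_llm` for your real model client,
# the cases for a held-out dataset nobody could have trained on, and
# exact-match accuracy for whatever metric actually fits your task.
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    expected: str


def ask_llm(question: str) -> str:
    """Hypothetical stand-in for a call to the model under evaluation."""
    return "placeholder answer"


def evaluate(cases: list[EvalCase]) -> float:
    """Exact-match accuracy over the private test set."""
    correct = sum(
        ask_llm(case.question).strip().lower() == case.expected.strip().lower()
        for case in cases
    )
    return correct / len(cases)


if __name__ == "__main__":
    cases = [
        EvalCase("What is the refund window for annual plans?", "30 days"),
        EvalCase("Which database backs the orders service?", "PostgreSQL"),
    ]
    print(f"Accuracy: {evaluate(cases):.2%}")
```

The same loop extends naturally to LLM-as-judge metrics such as hallucination scoring; the point is that the questions, the expected answers, and the scoring logic stay private and under your control.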
