TL;DR
"I feel like there are more LLM evaluation solutions out there than there are problems around LLM evaluation" - said Dylan, a Head of AI at a Fortune 500 company.
And I couldn't agree more - it seems like every week there is a new open-source repo trying to do the same thing as the 30+ frameworks that already exist. At the end of the day, what Dylan really wants is a framework, package, library, whatever you want to call it, that simply quantifies the performance of the LLM (application) he's looking to productionize.
So, as someone who was once in Dylan's shoes, I've compiled a list of the top 5 LLM evaluation frameworks that exist in 2024 :)
Let's begin!
1. DeepEval - The Evaluation Framework for LLMs
DeepEval is your favorite evaluation framework's favorite evaluation framework. It takes top spot for a variety of reasons:
- Offers 14+ LLM evaluation metrics (both for RAG and fine-tuning use cases), updated with the latest research in the LLM evaluation field. These metrics include:
- G-Eval
- Summarization
- Hallucination
- Faithfulness
- Contextual Relevancy
- Answer Relevancy
- Contextual Recall
- Contextual Precision
- RAGAS
- Bias
- Toxicity
Most metrics are self-explaining, meaning DeepEval will literally tell you why a metric score cannot be higher.
- Offers modular components that are extremely simple to plug in and use. You can easily mix and match different metrics, or even use DeepEval to build your own evaluation pipeline if needed.
- Treats evaluations as unit tests. With an integration for Pytest, DeepEval is a complete testing suite most developers are familiar with.
- Allows you to generate synthetic datasets using your knowledge base as context, or load datasets from CSVs, JSONs, or Hugging Face (see the sketch after this list).
- Offers a hosted platform with a generous free tier to run real-time evaluations in production.
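As a quick sketch of the dataset point above: you can always build LLMTestCases yourself from a CSV with pandas before handing them to DeepEval. The file name and column names below are hypothetical, and DeepEval also ships its own dataset loaders, so check its docs for the built-in equivalents.
import pandas as pd
from deepeval.test_case import LLMTestCase

# "eval_cases.csv" and its column names are assumptions, not part of DeepEval's API
df = pd.read_csv("eval_cases.csv")
test_cases = [
    LLMTestCase(
        input=row["input"],
        actual_output=row["actual_output"],
        context=[row["context"]],
    )
    for _, row in df.iterrows()
]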
With Pytest Integration:
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(
    input="How many evaluation metrics does DeepEval offer?",
    actual_output="14+ evaluation metrics",
    context=["DeepEval offers 14+ evaluation metrics"]
)
metric = HallucinationMetric(minimum_score=0.7)

def test_hallucination():
    assert_test(test_case, [metric])
Then in the CLI:
deepeval test run test_file.py
Or, without Pytest (perfect for notebook environments):
from deepeval import evaluate
...
evaluate([test_case], [metric])
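DeepEval's modularity also means you can mix and match metrics in a single evaluate call, or define your own criteria-based metric with G-Eval. Here is a minimal sketch of that, assuming DeepEval's GEval and AnswerRelevancyMetric classes; the "Coherence" criteria string is just an illustration.
from deepeval import evaluate
from deepeval.metrics import GEval, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCaseParams

# A custom, criteria-based metric graded by an LLM judge (criteria text is illustrative)
coherence = GEval(
    name="Coherence",
    criteria="Determine whether the actual output is coherent and directly addresses the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Mix and match with an out-of-the-box metric in the same run
evaluate([test_case], [coherence, AnswerRelevancyMetric()])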
2. MLflow LLM Evaluate - LLM Model Evaluation
MLflow LLM Evaluate is a modular, lightweight package that lets you run evaluations within your own evaluation pipelines. It supports both RAG evaluation and QA evaluation.
MLflow's main strength is its intuitive developer experience. For example, this is how you run an evaluation with MLflow:
import mlflow

# "model" is your logged MLflow model (or a model URI), and eval_data is e.g. a
# pandas DataFrame with an "inputs" column plus a "ground_truth" column of reference answers
results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)
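Once the run finishes, the aggregate scores and the row-level results are both available on the returned object; a short sketch of pulling them out (assuming the snippet above has run):
print(results.metrics)  # dict of aggregate scores
eval_table = results.tables["eval_results_table"]  # per-row results as a DataFrame
print(eval_table.head())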
3. RAGAs - Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
Third on the list, RAGAs was built for RAG pipelines. It offers 5 core metrics:
- Faithfulness
- Contextual Relevancy
- Answer Relevancy
- Contextual Recall
- Contextual Precision
These metrics make up the final RAGAs score. DeepEval and RAGAs have very similar implementations, but RAGAs metrics are not self-explaining, making it much harder to debug unsatisfactory results.
RAGAs sits third primarily because it, too, incorporates the latest research into its RAG metrics and is simple to use, but it doesn't rank higher due to its limited features and inflexibility as a framework.
from ragas import evaluate
from datasets import Dataset
import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"
# prepare your Hugging Face dataset in the format:
# Dataset({
#     features: ['question', 'contexts', 'answer', 'ground_truths'],
#     num_rows: 25
# })
dataset: Dataset  # placeholder: build or load your Dataset here (see the sketch below)
results = evaluate(dataset)
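To make the placeholder above concrete, here is a minimal sketch that builds the dataset with Hugging Face's Dataset.from_dict and picks the metrics explicitly. The example rows are made up, and column names may differ slightly across RAGAs versions, so double-check against the version you install.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Toy example rows, purely illustrative
dataset = Dataset.from_dict({
    "question": ["Who wrote 'Pride and Prejudice'?"],
    "contexts": [["Pride and Prejudice is an 1813 novel by Jane Austen."]],
    "answer": ["Jane Austen"],
    "ground_truths": [["Jane Austen"]],
})

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)  # e.g. {'faithfulness': ..., 'answer_relevancy': ..., ...}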
4. Deepchecks
Deepchecks stands out as it is geared more towards evaluating the LLM itself, rather than LLM systems/applications.
It is not higher on the list because of its complicated developer experience (seriously, try setting it up yourself and let me know how it goes), but its open-source offering is unique in that it focuses heavily on dashboards and a visualization UI, which makes it easy to explore evaluation results.
5. Arize AI Phoenix
Last on the list, Arize AI Phoenix evaluates LLM applications through extensive observability into LLM traces. However, it is quite limited as it only offers three evaluation criteria (a minimal sketch follows the list):
- QA Correctness
- Hallucination
- Toxicity
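To give a flavour of the observability-first workflow mentioned above, the sketch below only launches Phoenix's local trace UI; evaluations for the three criteria are then run over the traces you collect. The evaluator APIs have changed across Phoenix versions, so treat this as a starting point rather than a full recipe.
# Launch the local Phoenix app to collect and inspect LLM traces;
# check the Phoenix docs for the evaluator API of the version you install
import phoenix as px

session = px.launch_app()
print(session.url)  # open this in a browser to explore traces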
So there you have it, the list of top LLM evaluation frameworks GitHub has to offer in 2024. Think there's something I've missed? Comment below to let me know!
Thank you for reading, and till next time!
Top comments (8)
Few more :)
OpenAI Evals
TruLens
Truera
Top 5 only!
Nice list!
This article's author is also the author of DeepEval which is ranked No.1 in this article.
I will say good effort for both building DeepEval and writing articles but this is not the right way to promote yourself.
That may make this comparison article not have any value.
Somehow you missed TruLens: github.com/truera/trulens
DeepLearning AI has a whole free course on how to use it to test RAG apps and a workshop on agents, too.
deeplearning.ai/short-courses/buil...
youtube.com/watch?v=0pnEUAwoDP0
Did not miss it, top 5 only!
It's good to be aware of these. Are all of them comparable, or do they define the benchmarks differently? In other words, is there a "golden standard" that all benchmarking tools follow?
Glad you liked it!