RAGAS - RAG Assessment Framework
Overview
RAGAS is an evaluation framework specifically designed to assess the performance of Retrieval-Augmented Generation (RAG) systems. Unlike traditional metrics that might focus solely on the final output, RAGAS provides a comprehensive evaluation across multiple dimensions of RAG system performance.
Key Metrics
RAGAS includes several key metrics:
- Faithfulness: Measures how well the generated response aligns with the retrieved context, identifying potential hallucinations
- Answer Relevancy: Evaluates how relevant the generated response is to the question
- Context Relevancy: Assesses how relevant the retrieved documents are to the question
- Context Precision: Measures the proportion of retrieved context that is actually useful for answering the question
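Each of these metrics can also be scored individually on a single interaction. The sketch below applies the Faithfulness metric to one sample; it assumes RAGAS's class-based metric API, an OpenAI-backed evaluator model created with llm_factory, and an async context (such as a notebook) for the await call.
from ragas import SingleTurnSample
from ragas.metrics import Faithfulness
from ragas.llms import llm_factory

# Evaluator LLM used to judge whether the response is grounded in the retrieved context
evaluator_llm = llm_factory(model="gpt-3.5-turbo")
scorer = Faithfulness(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

# A score close to 1.0 means the response is fully supported by the retrieved context
score = await scorer.single_turn_ascore(sample)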
Context Precision
Context Precision is a metric that measures the proportion of relevant chunks in the retrieved contexts. It quantifies how much of the retrieved information is actually useful for answering the question, providing insight into the effectiveness of the retrieval component in a RAG system.
Mathematically, Context Precision@K is defined as:
\[ \text{Context Precision@K} = \frac{\sum_{k=1}^{K} (\text{Precision@k} \times v_k)}{\text{Total number of relevant items in the top } K \text{ results}} \]
where \(K\) is the number of retrieved chunks, \(v_k\) is a relevance indicator (1 if relevant, 0 otherwise), and Precision@k is the ratio of true positives to the total retrieved at rank \(k\).
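For intuition, here is a small hand-worked example in plain Python (independent of the RAGAS library): with three retrieved chunks of which the first and third are relevant, Precision@1 = 1.0, Precision@2 = 0.5 and Precision@3 ≈ 0.67, giving Context Precision@3 = (1.0 + 0.67) / 2 ≈ 0.83.
# Toy example of Context Precision@K (not part of the RAGAS API)
v = [1, 0, 1]                     # relevance flags v_k for the 3 retrieved chunks
K = len(v)

# Precision@k = number of relevant chunks among the top k, divided by k
precision_at = [sum(v[:k]) / k for k in range(1, K + 1)]   # [1.0, 0.5, 0.667]

# Weight each Precision@k by v_k and normalise by the number of relevant chunks
score = sum(p * rel for p, rel in zip(precision_at, v)) / sum(v)
print(round(score, 3))            # 0.833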
LLM-Based Context Precision
RAGAS provides LLM-based metrics to estimate the relevance of each retrieved context chunk, either with or without a reference answer.
Without Reference
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference
from ragas.llms import llm_factory
# Set up a lightweight LLM for evaluation
evaluator_llm = llm_factory(model="gpt-3.5-turbo")

context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

score = await context_precision.single_turn_ascore(sample)
With Reference
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithReference
from ragas.llms import llm_factory
# Set up a lightweight LLM for evaluation
evaluator_llm = llm_factory(model="gpt-3.5-turbo")

context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

score = await context_precision.single_turn_ascore(sample)
Non-LLM-Based Context Precision
For scenarios where LLM-based evaluation is not desired, RAGAS also supports non-LLM-based context precision using traditional similarity measures.
import asyncio
from ragas import SingleTurnSample
from ragas.metrics import NonLLMContextPrecisionWithReference
# Create a simple example demonstrating context precision
context_precision = NonLLMContextPrecisionWithReference()

sample = SingleTurnSample(
    retrieved_contexts=[
        "Paris is the capital city of France.",
        "The Eiffel Tower stands 324 meters tall.",
        "Python is a programming language.",
    ],
    reference_contexts=[
        "The Eiffel Tower is located in Paris, France.",
        "Paris is known for the Eiffel Tower landmark."
    ]
)

# Run the evaluation
score = asyncio.run(context_precision.single_turn_ascore(sample))
print(f"Context Precision Score: {score:.3f}")
# Note: The score reflects how well retrieved contexts align with reference contexts
# A score of 1.0 indicates perfect precision, while lower scores suggest
# some retrieved contexts are less relevant to the reference material
Context Precision Score: 0.500
A higher Context Precision score indicates that a greater proportion of the retrieved context is directly useful for answering the question, helping to identify and minimise irrelevant retrievals.
Usage
RAGAS can be installed with pip (pip install ragas). A typical evaluation over a dataset then looks like this:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

# Example evaluation; metrics are passed as metric objects, not strings
# (context_relevancy is deprecated in newer RAGAS releases in favour of context precision)
eval_results = evaluate(
    dataset=your_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_relevancy,
    ],
)
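Here, your_dataset is a Hugging Face Dataset whose columns hold the questions, generated answers, retrieved contexts and (for reference-based metrics) ground-truth answers. A minimal single-row sketch follows; the column names are an assumption based on the commonly used RAGAS 0.1-style schema and should be checked against the installed version.
# Hypothetical single-row dataset; column names assume the RAGAS 0.1-style schema
your_dataset = Dataset.from_dict({
    "question": ["Where is the Eiffel Tower located?"],
    "answer": ["The Eiffel Tower is located in Paris."],
    "contexts": [["The Eiffel Tower is located in Paris, France."]],
    "ground_truth": ["The Eiffel Tower is in Paris, France."],
})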
Benefits
- Comprehensive Assessment: Evaluates multiple aspects of RAG performance
- Standardisation: Provides consistent metrics across different RAG implementations
- Automation: Reduces the need for manual evaluation
- Interpretability: Offers clear insights into specific areas needing improvement