RAGAS - RAG Assessment Framework
Overview
RAGAS is an evaluation framework specifically designed to assess the performance of Retrieval-Augmented Generation (RAG) systems. Unlike traditional metrics that might focus solely on the final output, RAGAS provides a comprehensive evaluation across multiple dimensions of RAG system performance.
Key Metrics
RAGAS includes several key metrics:
- Faithfulness: Measures how well the generated response aligns with the retrieved context, identifying potential hallucinations
- Answer Relevancy: Evaluates how relevant the generated response is to the question
- Context Relevancy: Assesses how relevant the retrieved documents are to the question
- Context Precision: Measures the proportion of retrieved context that is actually useful for answering the question
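Each of these metrics can also be scored individually on a single interaction. The sketch below applies the Faithfulness metric to one sample; it assumes RAGAS's class-based metric API, an OpenAI-backed evaluator model created with llm_factory, and an async context (such as a notebook) for the await call.
from ragas import SingleTurnSample
from ragas.metrics import Faithfulness
from ragas.llms import llm_factory

# Evaluator LLM used to judge whether the response is grounded in the retrieved context
evaluator_llm = llm_factory(model="gpt-3.5-turbo")
scorer = Faithfulness(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

# A score close to 1.0 means the response is fully supported by the retrieved context
score = await scorer.single_turn_ascore(sample)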
Context Precision
Context Precision is a metric that measures the proportion of relevant chunks in the retrieved contexts. It quantifies how much of the retrieved information is actually useful for answering the question, providing insight into the effectiveness of the retrieval component in a RAG system.
Mathematically, Context Precision@K is defined as:
\[ \text{Context Precision@K} = \frac{\sum_{k=1}^{K} (\text{Precision@k} \times v_k)}{\text{Total number of relevant items in the top } K \text{ results}} \]
where \(K\) is the number of retrieved chunks, \(v_k\) is a relevance indicator (1 if relevant, 0 otherwise), and Precision@k is the ratio of true positives to the total retrieved at rank \(k\).
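For intuition, here is a small hand-worked example in plain Python (independent of the RAGAS library): with three retrieved chunks of which the first and third are relevant, Precision@1 = 1.0, Precision@2 = 0.5 and Precision@3 ≈ 0.67, giving Context Precision@3 = (1.0 + 0.67) / 2 ≈ 0.83.
# Toy example of Context Precision@K (not part of the RAGAS API)
v = [1, 0, 1]                     # relevance flags v_k for the 3 retrieved chunks
K = len(v)

# Precision@k = number of relevant chunks among the top k, divided by k
precision_at = [sum(v[:k]) / k for k in range(1, K + 1)]   # [1.0, 0.5, 0.667]

# Weight each Precision@k by v_k and normalise by the number of relevant chunks
score = sum(p * rel for p, rel in zip(precision_at, v)) / sum(v)
print(round(score, 3))            # 0.833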
LLM-Based Context Precision
RAGAS provides LLM-based metrics to estimate the relevance of each retrieved context chunk, either with or without a reference answer.
Without Reference
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference
from ragas.llms import llm_factory
# Set up a lightweight LLM for evaluation
evaluator_llm = llm_factory(model="gpt-3.5-turbo")

context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

score = await context_precision.single_turn_ascore(sample)
With Reference
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithReference
from ragas.llms import llm_factory
# Set up a lightweight LLM for evaluation
evaluator_llm = llm_factory(model="gpt-3.5-turbo")

context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

score = await context_precision.single_turn_ascore(sample)
Non-LLM-Based Context Precision
For scenarios where LLM-based evaluation is not desired, RAGAS also supports non-LLM-based context precision using traditional similarity measures.
import asyncio
from ragas import SingleTurnSample
from ragas.metrics import NonLLMContextPrecisionWithReference
# Create a simple example demonstrating context precision
context_precision = NonLLMContextPrecisionWithReference()

sample = SingleTurnSample(
    retrieved_contexts=[
        "Paris is the capital city of France.",
        "The Eiffel Tower stands 324 meters tall.",
        "Python is a programming language.",
    ],
    reference_contexts=[
        "The Eiffel Tower is located in Paris, France.",
        "Paris is known for the Eiffel Tower landmark."
    ]
)

# Run the evaluation
score = asyncio.run(context_precision.single_turn_ascore(sample))
print(f"Context Precision Score: {score:.3f}")
# Note: The score reflects how well retrieved contexts align with reference contexts
# A score of 1.0 indicates perfect precision, while lower scores suggest
# some retrieved contexts are less relevant to the reference material
Context Precision Score: 0.500
A higher Context Precision score indicates that a greater proportion of the retrieved context is directly useful for answering the question, helping to identify and minimise irrelevant retrievals.
Usage
RAGAS can be installed with pip (pip install ragas). A typical evaluation over a dataset then looks like this:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

# Example evaluation; metrics are passed as metric objects, not strings
# (context_relevancy is deprecated in newer RAGAS releases in favour of context precision)
eval_results = evaluate(
    dataset=your_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_relevancy,
    ],
)
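Here, your_dataset is a Hugging Face Dataset whose columns hold the questions, generated answers, retrieved contexts and (for reference-based metrics) ground-truth answers. A minimal single-row sketch follows; the column names are an assumption based on the commonly used RAGAS 0.1-style schema and should be checked against the installed version.
# Hypothetical single-row dataset; column names assume the RAGAS 0.1-style schema
your_dataset = Dataset.from_dict({
    "question": ["Where is the Eiffel Tower located?"],
    "answer": ["The Eiffel Tower is located in Paris."],
    "contexts": [["The Eiffel Tower is located in Paris, France."]],
    "ground_truth": ["The Eiffel Tower is in Paris, France."],
})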
Benefits
- Comprehensive Assessment: Evaluates multiple aspects of RAG performance
- Standardisation: Provides consistent metrics across different RAG implementations
- Automation: Reduces the need for manual evaluation
- Interpretability: Offers clear insights into specific areas needing improvement