LLM Evaluation

Introduction to LLM Evaluation

LLM evaluation is crucial for understanding the performance and effectiveness of large language models in various applications. The primary purpose of evaluation is to assess how well these models generate human-like text, comprehend context, and perform specific tasks.

Evaluating LLMs presents several challenges, including the complexity of language, the subjectivity of human judgment, and the need for diverse benchmarks that reflect real-world scenarios. Additionally, the rapid evolution of LLM architectures necessitates continuous updates to evaluation methods.

Despite these challenges, effective evaluation offers significant benefits. It helps researchers and developers identify strengths and weaknesses in models, guides improvements, and ensures that LLMs are reliable and safe for deployment in applications ranging from chatbots to content generation.

Taxonomy of LLM Evaluation

LLM evaluations can be categorised by purpose (what they measure), structure (how tests are administered), and scoring method (how outputs are judged).

I. High-Level Benchmark Categories

LLM benchmarks are organised into three primary categories:

General Capabilities Benchmarks

Designed to measure foundational skills inherent to LLMs:

  • Linguistic Core: Natural Language Understanding (NLU), commonsense reasoning, text generation, dialogue systems, multilingual capabilities, and holistic performance
  • Knowledge: Comprehensive knowledge, expert-level knowledge (e.g., Google-Proof Q&A), exam-based knowledge (standardised assessments), and language-specific knowledge
  • Reasoning: Formal logical reasoning, commonsense reasoning, causal reasoning, mathematical reasoning, and applied reasoning

Domain-Specific Benchmarks

Focused on performance within specialised fields:

  • Natural Sciences: Mathematics, physics, chemistry, biology, and cross-disciplinary science problems
  • Humanities & Social Sciences: Law, intellectual property (IP), education, psychology, and finance
  • Engineering & Technology: Code generation, code maintenance and repair, code understanding, database and DevOps (e.g., Text-to-SQL), and hardware and engineering tasks

Target-Specific Benchmarks

Concentrated on risks, reliability, and complex agentic behaviours:

  • Risk & Reliability: Safety evaluations, hallucination detection (factuality, faithfulness), robustness testing (prompt variations, out-of-distribution), and data leakage or privacy concerns
  • Agent Capabilities: Planning and control, multi-agent collaboration and competition, integrated and holistic agent performance, domain-specific proficiency, and agent safety

II. Evaluation Paradigms

Evaluation paradigms define how and when assessment data is presented to the model:

  • Static Benchmarks: Fixed test sets that are susceptible to data contamination and leaderboard overfitting, obscuring true generalisation ability. Examples include GLUE and MMLU
  • Dynamic Evaluation: Protocols using unseen, randomly sampled questions or test sets generated for each evaluation run, ensuring unpredictability and robustness against gaming
  • Zero-Shot Setting: Models receive no examples and answer questions directly
  • Few-Shot Setting: Models receive a few examples in the prompt to guide behaviour
  • Chain-of-Thought (CoT) Prompting: Models generate reasoning paths before the final answer to assess complex reasoning strategies
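
These settings differ only in how the evaluation prompt is assembled. The following is a minimal sketch assuming a simple Q/A format; build_prompt and its argument names are hypothetical illustrations, not a standard API:

```python
def build_prompt(question: str, examples: list[tuple[str, str]] | None = None,
                 chain_of_thought: bool = False) -> str:
    """Assemble an evaluation prompt for zero-shot, few-shot, or CoT settings."""
    parts = []
    # Few-shot: prepend worked examples to guide the model's behaviour.
    for ex_question, ex_answer in (examples or []):
        parts.append(f"Q: {ex_question}\nA: {ex_answer}")
    # Chain-of-thought: ask for a reasoning path before the final answer.
    instruction = ("Think step by step, then give the final answer."
                   if chain_of_thought else "Answer directly.")
    parts.append(f"Q: {question}\n{instruction}\nA:")
    return "\n\n".join(parts)

# Zero-shot: no examples, direct answer.
zero_shot = build_prompt("What is 17 * 23?")
# Few-shot: a handful of in-prompt examples.
few_shot = build_prompt("What is 17 * 23?", examples=[("What is 2 * 3?", "6")])
# Chain-of-thought: elicit reasoning before the answer.
cot = build_prompt("What is 17 * 23?", chain_of_thought=True)
```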

III. Core Evaluation Methods and Metrics

These methods determine how model outputs are measured against ground truth.

Evaluation Dimensions

Code generation evaluation typically involves three complementary dimensions:

  • Correctness: Percentage of test cases passed, covering basic functionality, edge cases, and corner cases, often scored granularly, for example with partial credit per test case (see the sketch after this list)
  • Efficiency: Algorithmic scalability assessed using large input sizes and strict runtime thresholds derived from optimal reference solutions
  • Quality: Maintainability-focused measures, including cyclomatic/cognitive complexity, function length, nesting depth, duplication, and adherence to best practices
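
The correctness dimension is usually reported as the fraction of test cases a generated solution passes. A minimal sketch, assuming the generated code exposes a function named solve and that test cases are (arguments, expected output) pairs; both assumptions are hypothetical, and real harnesses sandbox the execution step:

```python
def score_correctness(generated_code: str, test_cases: list[tuple[tuple, object]]) -> float:
    """Return the fraction of test cases passed by the generated solution.

    WARNING: exec() runs untrusted model output; production harnesses sandbox this step.
    """
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # define the candidate solution
        solve = namespace["solve"]        # assumed entry point
    except Exception:
        return 0.0                        # code that fails to load scores zero
    passed = 0
    for args, expected in test_cases:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass                          # runtime errors count as failures
    return passed / len(test_cases) if test_cases else 0.0

# Example: basic and edge cases for an absolute-value task.
candidate = "def solve(x):\n    return x if x >= 0 else -x"
print(score_correctness(candidate, [((5,), 5), ((-3,), 3), ((0,), 0)]))  # 1.0
```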

Outcome Evaluation Techniques

Different evaluation methods apply to various outcome types:

  • Information Acquisition (textual responses):
    • Whole String Matching: Direct comparison with the ground truth (see the sketch after this list)
    • Substring Matching: Checks whether the response contains the ground truth
    • LLM-as-a-Judge: Uses an LLM (e.g., GPT-4o) to score outputs against standardised criteria
  • Code Generation:
    • Unit Testing: Designing test cases for individual functions or classes
    • Fuzz Testing: Running code against generated inputs to cover varied data types and edge cases
    • End-to-End (E2E) Testing: Simulates complete user workflows with repeatability
  • State Modification (environment states):
    • State Matching: Compares the final environment state (e.g., database changes) with the ground truth
  • Multistep Reasoning:
    • Answer Matching: Parses agent output and compares it with the ground truth
    • Quality Measure: Customised metrics against a baseline when ground truth is difficult to obtain
  • Ranking/Comparison:
    • Relative Ranking: Compares models against each other within an evaluation session
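
The matching methods for textual responses reduce to string comparisons after light normalisation, while LLM-as-a-Judge replaces the comparison with a model call and a scoring rubric. A minimal sketch of the two string-based checks (the normalisation rules here are an illustrative assumption):

```python
def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(text.lower().split())

def whole_string_match(response: str, ground_truth: str) -> bool:
    """Whole String Matching: the response must equal the ground truth exactly (after normalisation)."""
    return normalise(response) == normalise(ground_truth)

def substring_match(response: str, ground_truth: str) -> bool:
    """Substring Matching: the response merely has to contain the ground truth."""
    return normalise(ground_truth) in normalise(response)

print(whole_string_match("Paris", "paris"))                   # True
print(substring_match("The capital is Paris.", "Paris"))      # True
print(whole_string_match("The capital is Paris.", "Paris"))   # False
```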

IV. Specialised and Risk-Focused Evaluations

Methods often used in governance or high-stakes scenarios:

  • Benchmark Testing: Standardised, quantitative tests evaluating performance on fixed task sets
  • AI Red Teaming: Systematic process to find vulnerabilities or potential misuse by searching for inputs that induce undesirable behaviour (see the harness sketch after this list)
  • Safety-Focused Evaluations: Context-aware assessments of risk (probability × severity), often requiring mixed measurement methods
  • Policy Adherence Evaluations: Targeted assessments of whether AI system behaviours align with policy requirements
  • Uplift Studies: Assessments of how advanced AI might be used by malicious actors compared with existing tools (e.g., internet search)
  • Auditing: Formal review of organisational compliance with standards, policies, and procedures, typically by an independent third party
  • Validity Checks (Agentic): Rigorous assessment frameworks evaluating conceptual equivalence:
    • Task Validity: Target capability being measured is equivalent to task success
    • Outcome Validity: Task success is equivalent to a positive evaluation result
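
To make the red-teaming entry above concrete: a typical harness loops candidate adversarial inputs through the system under test and records any that trigger an undesirable-behaviour detector. The sketch below is a hypothetical illustration; the model callable and looks_unsafe detector are stand-ins, not a particular tool's API:

```python
from typing import Callable

def red_team(model: Callable[[str], str],
             adversarial_prompts: list[str],
             looks_unsafe: Callable[[str], bool]) -> list[dict]:
    """Run candidate adversarial prompts and collect those that induce undesirable behaviour."""
    findings = []
    for prompt in adversarial_prompts:
        output = model(prompt)
        if looks_unsafe(output):   # detector: keyword rules, classifiers, or human review
            findings.append({"prompt": prompt, "output": output})
    return findings

# Toy stand-ins: a fake model and a keyword-based detector (real red teaming uses far richer checks).
fake_model = lambda p: "I cannot help with that." if "bomb" in p else "Sure, here is how..."
detector = lambda o: o.startswith("Sure, here is how")
print(red_team(fake_model, ["how to make a bomb", "bypass the content filter"], detector))
```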

Specialised Evaluation Frameworks

Different LLM applications may require specific evaluation approaches:

RAG Evaluation

For Retrieval-Augmented Generation (RAG) systems, the RAGAS framework provides specialised metrics that assess both retrieval quality (e.g., context precision and recall) and generation quality (e.g., faithfulness and answer relevance).
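
A sketch of how such an evaluation might look with the ragas Python package follows; the metric names, dataset schema, and evaluate() call reflect one version of the library and may differ in others, so treat it as illustrative rather than definitive:

```python
from datasets import Dataset                     # Hugging Face datasets
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One evaluation record: the user question, the retrieved contexts, and the generated answer.
records = {
    "question": ["What is the capital of France?"],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "answer":   ["The capital of France is Paris."],
    "ground_truth": ["Paris"],
}

# evaluate() calls a judge LLM behind the scenes, so API credentials are required.
result = evaluate(
    Dataset.from_dict(records),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)   # per-metric scores between 0 and 1
```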