LLM Evaluation
Introduction to LLM Evaluation
LLM evaluation is crucial for understanding the performance and effectiveness of large language models in various applications. The primary purpose of evaluation is to assess how well these models generate human-like text, comprehend context, and perform specific tasks.
Evaluating LLMs presents several challenges, including the complexity of language, the subjectivity of human judgment, and the need for diverse benchmarks that reflect real-world scenarios. Additionally, the rapid evolution of LLM architectures necessitates continuous updates to evaluation methods.
Despite these challenges, effective evaluation offers significant benefits. It helps researchers and developers identify strengths and weaknesses in models, guides improvements, and ensures that LLMs are reliable and safe for deployment in applications ranging from chatbots to content generation.
Taxonomy of LLM Evaluation
LLM evaluations can be categorised by purpose (what they measure), structure (how tests are administered), and scoring method (how outputs are judged).
I. High-Level Benchmark Categories
LLM benchmarks are systematically organised into three primary categories:
General Capabilities Benchmarks
Designed to measure foundational skills inherent to LLMs:
- Linguistic Core: Natural Language Understanding (NLU), commonsense reasoning, text generation, dialogue systems, multilingual capabilities, and holistic performance
- Knowledge: Comprehensive knowledge, expert-level knowledge (e.g., Google-Proof Q&A), exam-based knowledge (standardised assessments), and language-specific knowledge
- Reasoning: Formal logical reasoning, commonsense reasoning, causal reasoning, mathematical reasoning, and applied reasoning
Domain-Specific Benchmarks
Focused on performance within specialised fields:
- Natural Sciences: Mathematics, physics, chemistry, biology, and cross-disciplinary science problems
- Humanities & Social Sciences: Law, intellectual property (IP), education, psychology, and finance
- Engineering & Technology: Code generation, code maintenance and repair, code understanding, database and DevOps (e.g., Text-to-SQL), and hardware and engineering tasks
Target-Specific Benchmarks
Concentrated on risks, reliability, and complex agentic behaviours:
- Risk & Reliability: Safety evaluations, hallucination detection (factuality, faithfulness), robustness testing (prompt variations, out-of-distribution), and data leakage or privacy concerns
- Agent Capabilities: Planning and control, multi-agent collaboration and competition, integrated and holistic agent performance, domain-specific proficiency, and agent safety
II. Evaluation Paradigms
Evaluation paradigms define how and when assessment data is presented to the model:
- Static Benchmarks: Fixed test sets that are susceptible to data contamination and leaderboard overfitting, obscuring true generalisation ability. Examples include GLUE and MMLU
- Dynamic Evaluation: Protocols using unseen, randomly sampled questions or test sets generated for each evaluation run, ensuring unpredictability and robustness against gaming
- Zero-Shot Setting: Models receive no examples and answer questions directly
- Few-Shot Setting: Models receive a few examples in the prompt to guide behaviour
- Chain-of-Thought (CoT) Prompting: Models generate reasoning paths before the final answer to assess complex reasoning strategies (the sketch after this list illustrates all three prompting settings)
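As a concrete illustration of the three prompting settings, the sketch below frames the same question as a zero-shot, few-shot, and chain-of-thought prompt. The arithmetic question, exemplars, and helper names are hypothetical rather than drawn from any particular benchmark.

```python
# Sketch: framing one benchmark question under the three prompting settings
# described above. The question text and exemplars are hypothetical.

FEW_SHOT_EXEMPLARS = [
    ("Q: What is 17 + 25?", "A: 42"),
    ("Q: What is 9 * 6?", "A: 54"),
]

QUESTION = "Q: What is 128 / 4?"

def zero_shot(question: str) -> str:
    # No examples: the model answers the question directly.
    return f"{question}\nA:"

def few_shot(question: str, exemplars=FEW_SHOT_EXEMPLARS) -> str:
    # A handful of solved examples precede the target question to guide behaviour.
    demos = "\n".join(f"{q}\n{a}" for q, a in exemplars)
    return f"{demos}\n{question}\nA:"

def chain_of_thought(question: str) -> str:
    # The model is asked to produce its reasoning before the final answer.
    return f"{question}\nLet's think step by step, then state the final answer.\nA:"

if __name__ == "__main__":
    for name, prompt in [("zero-shot", zero_shot(QUESTION)),
                         ("few-shot", few_shot(QUESTION)),
                         ("chain-of-thought", chain_of_thought(QUESTION))]:
        print(f"--- {name} ---\n{prompt}\n")
```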
III. Core Evaluation Methods and Metrics
These methods determine how model outputs are measured against ground truth.
Evaluation Dimensions
Code generation evaluation typically involves three complementary dimensions:
- Correctness: Percentage of test cases passed, covering basic functionality, edge cases, and corner cases, typically scored as partial credit per test case rather than all-or-nothing (see the sketch after this list)
- Efficiency: Algorithmic scalability assessed using large input sizes and strict runtime thresholds derived from optimal reference solutions
- Quality: Code quality focusing on maintainability, including cyclomatic/cognitive complexity, function length, nesting depth, duplication, and adherence to best practices
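A minimal sketch of how the correctness and efficiency dimensions might be scored for a single generated solution, assuming a hypothetical summation task, hand-written test cases, and an arbitrary time limit; in practice the limit would be derived from a reference solution, and quality metrics such as cyclomatic complexity would come from static-analysis tooling not shown here.

```python
# Sketch: granular correctness plus a runtime-threshold check for one generated
# solution. The task (summing a list), test cases, and time limit are hypothetical.
import time

def candidate_solution(xs):            # stands in for model-generated code
    return sum(xs)

TEST_CASES = [                         # (input, expected): basic, edge, and corner cases
    ([1, 2, 3], 6),
    ([], 0),
    ([-5, 5], 0),
    (list(range(1_000_000)), 499_999_500_000),   # large input probes scalability
]
TIME_LIMIT_S = 0.5                     # in practice derived from a reference solution

def score(solution, tests, time_limit):
    passed, within_time = 0, True
    for xs, expected in tests:
        start = time.perf_counter()
        try:
            ok = solution(xs) == expected
        except Exception:
            ok = False                 # crashes count as failed test cases
        elapsed = time.perf_counter() - start
        passed += ok
        within_time &= elapsed <= time_limit
    return {
        "correctness": passed / len(tests),   # partial credit per test case
        "efficient": within_time,             # all cases within the runtime threshold
    }

print(score(candidate_solution, TEST_CASES, TIME_LIMIT_S))
```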
Outcome Evaluation Techniques
Different evaluation methods apply to various outcome types:
| Outcome Type | Evaluation Methods |
|---|---|
| Information Acquisition (textual responses) | Whole String Matching: Direct comparison with ground truth<br>Substring Matching: Checks whether the response contains the ground truth<br>LLM-as-a-Judge: Uses an LLM (e.g., GPT-4o) to score outputs against standardised criteria |
| Code Generation | Unit Testing: Designing test cases for individual functions or classes<br>Fuzz Testing: Running code against generated inputs to cover data types and edge cases<br>End-to-end (E2E) Testing: Simulates complete user workflows with repeatability |
| State Modification (environment states) | State Matching: Compares the final environment state (e.g., database changes) with ground truth |
| Multistep Reasoning | Answer Matching: Parses agent output and compares it with ground truth<br>Quality Measure: Customised metrics against a baseline when exact ground truth is difficult to obtain |
| Ranking/Comparison | Relative Ranking: Compares models relative to each other within an evaluation session |
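The matching-based methods in the table can be sketched in a few lines. The normalisation rules and the injected judge callable below are illustrative assumptions rather than a fixed specification.

```python
# Sketch: whole-string matching, substring matching, and a pluggable
# LLM-as-a-judge hook. Normalisation and the judge callable are illustrative.
from typing import Callable

def normalise(text: str) -> str:
    return " ".join(text.lower().strip().split())

def whole_string_match(response: str, ground_truth: str) -> bool:
    # Direct comparison with the ground truth after light normalisation.
    return normalise(response) == normalise(ground_truth)

def substring_match(response: str, ground_truth: str) -> bool:
    # Passes if the response contains the ground truth anywhere.
    return normalise(ground_truth) in normalise(response)

def llm_judge_score(response: str, rubric: str,
                    judge: Callable[[str], float]) -> float:
    # `judge` wraps a call to a judge model (e.g., GPT-4o) and returns a score.
    prompt = f"Rubric:\n{rubric}\n\nCandidate answer:\n{response}\n\nScore from 0 to 1:"
    return judge(prompt)

if __name__ == "__main__":
    print(whole_string_match("Paris", "paris"))                    # True
    print(substring_match("The capital is Paris.", "Paris"))       # True
    print(llm_judge_score("Paris",
                          "Award 1 if the answer names the capital of France.",
                          judge=lambda _prompt: 1.0))              # dummy judge
```

Injecting the judge as a callable keeps the scorer model-agnostic, so the same harness can be reused with different judge models or a stubbed judge in tests.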
IV. Specialised and Risk-Focused Evaluations
Methods often used in governance or high-stakes scenarios:
- Benchmark Testing: Standardised, quantitative tests evaluating performance on fixed task sets
- AI Red Teaming: Systematic process to find vulnerabilities or potential misuse by searching for inputs that induce undesirable behaviour
- Safety-Focused Evaluations: Context-aware assessments of risk (probability × severity), often requiring mixed measurement methods
- Policy Adherence Evaluations: Targeted assessments of whether AI system behaviours align with policy requirements
- Uplift Studies: Assessments of how much advanced AI increases the capabilities of malicious actors relative to existing tools (e.g., internet search)
- Auditing: Formal review of organisational compliance with standards, policies, and procedures, typically by an independent third party
- Validity Checks (Agentic): Rigorous assessment frameworks that test two conceptual equivalences:
  - Task Validity: the target capability being measured is equivalent to task success
  - Outcome Validity: task success is equivalent to a positive evaluation result
Specialised Evaluation Frameworks
Different LLM applications may require specific evaluation approaches:
RAG Evaluation
For Retrieval-Augmented Generation (RAG) systems, the RAGAS framework provides specialised metrics that assess both the quality of the retrieved context (e.g., context precision and context recall) and the quality of the generated response (e.g., faithfulness and answer relevance).
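As a rough illustration of the two sides such an evaluation covers, the sketch below computes token-overlap proxies for context recall (retrieval quality) and faithfulness (generation quality). These are simplified stand-ins, not the RAGAS implementation, which relies on LLM-based judgements.

```python
# Sketch: token-overlap proxies for two RAG evaluation dimensions. Illustrative
# only; RAGAS itself computes these metrics with LLM-based judgements.
import string

def _tokens(text: str) -> set[str]:
    # Lowercase, strip surrounding punctuation, split on whitespace.
    return {w.strip(string.punctuation) for w in text.lower().split()} - {""}

def context_recall_proxy(retrieved_contexts: list[str], ground_truth: str) -> float:
    # Fraction of ground-truth tokens that appear in the retrieved context.
    context_tokens = set().union(*map(_tokens, retrieved_contexts))
    gt_tokens = _tokens(ground_truth)
    return len(gt_tokens & context_tokens) / len(gt_tokens) if gt_tokens else 0.0

def faithfulness_proxy(answer: str, retrieved_contexts: list[str]) -> float:
    # Fraction of answer tokens grounded in the retrieved context.
    context_tokens = set().union(*map(_tokens, retrieved_contexts))
    answer_tokens = _tokens(answer)
    return len(answer_tokens & context_tokens) / len(answer_tokens) if answer_tokens else 0.0

if __name__ == "__main__":
    contexts = ["The Eiffel Tower is located in Paris, France."]
    print(context_recall_proxy(contexts, "Paris"))                 # 1.0
    print(faithfulness_proxy("The tower is in Paris.", contexts))  # 1.0: fully grounded
```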