Visualising Token Probabilities in Large Language Models

Machine Learning
LLMs
Visualisation
NLP
Published

October 12, 2025

When working with Large Language Models (LLMs), it is important to understand how they make decisions at each step of text generation. Token probabilities provide a window into the model’s generation process, showing us which tokens the model considers most likely at each position.

We’ll explore what token probabilities are, how they work, and how to visualise them effectively to understand LLM behaviour.

What Are Token Probabilities?

Token probabilities represent the likelihood that a specific token (word, subword, or character) will be generated at a given position in the sequence. When an LLM generates text, it doesn’t just pick the most likely next word; it considers a probability distribution over the entire vocabulary.

For each position in the sequence, the model outputs a probability distribution over all possible tokens (in its vocabulary). This distribution is typically computed using a softmax function over the model’s logits (raw output scores).

The process is generally:

  1. Input Processing: The model takes the current sequence of tokens as input
  2. Forward Pass: The model processes this through its neural network layers
  3. Logit Generation: The final layer produces raw scores (logits) for each token in the vocabulary
  4. Probability Calculation: A softmax function converts these logits into probabilities
  5. Token Selection: The model either selects the most probable token (greedy decoding) or samples from the distribution

The key insight is that these probabilities change dynamically as the context evolves. Each new token influences the probability distribution for subsequent tokens.
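
To make steps 3–5 concrete, here is a minimal sketch with made-up logits for a toy five-token vocabulary (the tokens and scores are purely illustrative):

import numpy as np

# Made-up logits for a toy five-token vocabulary (illustrative values only)
vocab = ["the", "a", "cat", "dog", "ran"]
logits = np.array([2.1, 1.3, 0.2, -0.5, -1.0])

# Softmax converts the raw scores into a probability distribution
probs = np.exp(logits - logits.max())  # subtracting the max improves numerical stability
probs /= probs.sum()

for token, p in zip(vocab, probs):
    print(f"'{token}': {p:.3f}")

# Greedy decoding simply picks the most probable token
print("Greedy choice:", vocab[int(np.argmax(probs))])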

Before diving into practical examples, it’s crucial to understand logprobs (log probabilities), which are the fundamental building blocks that LLMs actually work with internally.

Logprobs are simply the natural logarithm of probabilities:

\text{logprob} = \log(p)

To convert back to actual probabilities, we use:

p = \exp(\text{logprob})

For example:

  • A logprob of -1.69 corresponds to a probability of \exp(-1.69) \approx 0.18
  • A logprob of -6.06 corresponds to a probability of \exp(-6.06) \approx 0.002

Why Do LLMs Use Logprobs Instead of Probabilities?

The main reason is numerical stability. Consider calculating the joint probability of a text T consisting of tokens x_1, x_2, \ldots, x_n:

Using probabilities (problematic):

p(T) = p(x_1) \times p(x_2 \mid x_1) \times \ldots \times p(x_n \mid x_1, \ldots, x_{n-1})

Since probabilities are numbers smaller than 1, multiplying many of them together leads to numerical underflow and instability.

Using logprobs (stable):

\log(p(T)) = \log(p(x_1)) + \log(p(x_2 \mid x_1)) + \ldots + \log(p(x_n \mid x_1, \ldots, x_{n-1}))

Adding logprobs is numerically much safer than multiplying probabilities, thanks to the identity

\log(p \times q) = \log(p) + \log(q)

Logprobs as Confidence Measures

Logprobs can be interpreted as a measure of the model’s “confidence” in its predictions:

  • High logprobs (closer to 0): The model is very confident
  • Low logprobs (more negative): The model is uncertain

This has practical applications:

  • Classification confidence: Use logprobs to indicate prediction certainty
  • Text quality assessment: Calculate total logprob across a sequence
  • AI detection: Compare logprobs of human vs. AI-generated text

Non-Deterministic Nature

Important note: logprobs returned by LLM APIs are not fully deterministic. Running the same query multiple times can yield slightly different logprobs, even with the temperature set to 0, because of non-determinism in the underlying floating-point computation.

# Let's demonstrate logprobs with a practical example
import numpy as np

# Example logprobs (these would come from an actual LLM)
example_logprobs = {
    'token_4': -1.69,
    'token_7': -1.81, 
    'token_5': -1.81,
    'token_1': -6.06
}
example_logprobs
{'token_4': -1.69, 'token_7': -1.81, 'token_5': -1.81, 'token_1': -6.06}

Converting logprobs to probabilities:

for token, logprob in example_logprobs.items():
    probability = np.exp(logprob)
    print(f"{token}: logprob = {logprob:.2f} → probability = {probability:.4f}")

print(f"\nTotal logprob: {sum(example_logprobs.values()):.2f}")
print(f"Joint probability: {np.exp(sum(example_logprobs.values())):.6f}")
token_4: logprob = -1.69 → probability = 0.1845
token_7: logprob = -1.81 → probability = 0.1637
token_5: logprob = -1.81 → probability = 0.1637
token_1: logprob = -6.06 → probability = 0.0023

Total logprob: -11.37
Joint probability: 0.000012

Numerical Stability Demonstration:
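
As a minimal sketch of this effect, assume an illustrative per-token probability of 0.1: the running product of raw probabilities underflows to exactly 0.0 after a few hundred tokens, while the corresponding sum of logprobs remains a perfectly ordinary number.

p = 0.1  # illustrative per-token probability

for n_tokens in [10, 100, 500, 1000]:
    product = p ** n_tokens              # multiplying raw probabilities
    logprob_sum = n_tokens * np.log(p)   # adding logprobs instead
    print(f"{n_tokens:>5} tokens: product = {product:.3e}, logprob sum = {logprob_sum:.1f}")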



Practical Applications of Logprobs

1. Confidence Assessment in Classification

Logprobs can be used to assess the confidence of model predictions. Let’s see how this works in practice:

# Simulate classification confidence using logprobs
def assess_classification_confidence(logprobs_dict):
    """
    Assess confidence in a classification task using logprobs.
    """
    # Convert logprobs to probabilities
    probs = {k: np.exp(v) for k, v in logprobs_dict.items()}
    
    # Find the most confident prediction
    best_class = max(logprobs_dict.keys(), key=lambda k: logprobs_dict[k])
    best_logprob = logprobs_dict[best_class]
    best_prob = probs[best_class]
    
    # Calculate confidence metrics
    confidence_score = best_prob
    logprob_margin = best_logprob - max(v for k, v in logprobs_dict.items() if k != best_class)
    
    return {
        'predicted_class': best_class,
        'confidence_score': confidence_score,
        'logprob_margin': logprob_margin,
        'all_probabilities': probs
    }

# Example: Sentiment classification
sentiment_logprobs = {
    'positive': -0.5,    # High confidence
    'negative': -2.1,    # Lower confidence  
    'neutral': -1.8      # Medium confidence
}

Sentiment Classification Results:

result = assess_classification_confidence(sentiment_logprobs)
print(f"Predicted class: {result['predicted_class']}")
print(f"Confidence score: {result['confidence_score']:.3f}")
print(f"Logprob margin: {result['logprob_margin']:.2f}")
print(f"All probabilities: {result['all_probabilities']}")
Predicted class: positive
Confidence score: 0.607
Logprob margin: 1.30
All probabilities: {'positive': np.float64(0.6065306597126334), 'negative': np.float64(0.1224564282529819), 'neutral': np.float64(0.16529888822158653)}

2. AI Detection Using Logprobs

One interesting application is using logprobs to detect AI-generated text. The idea is that AI-generated text typically has different logprob patterns compared to human-written text.

# Simulate AI detection using logprobs
def calculate_text_logprob(text_tokens, token_logprobs):
    """
    Calculate the total logprob for a sequence of tokens.
    """
    total_logprob = sum(token_logprobs.get(token, -10) for token in text_tokens)
    avg_logprob = total_logprob / len(text_tokens)
    return total_logprob, avg_logprob

# Simulate different types of text
human_text_tokens = ['The', 'cat', 'sat', 'on', 'the', 'mat', 'and', 'purred', 'softly']
ai_text_tokens = ['The', 'feline', 'creature', 'positioned', 'itself', 'upon', 'the', 'textile', 'surface']

# Simulate logprobs (AI text tends to have more consistent, higher logprobs)
human_logprobs = {
    'The': -0.5, 'cat': -1.2, 'sat': -1.8, 'on': -0.8, 'the': -0.5, 
    'mat': -2.1, 'and': -0.6, 'purred': -3.2, 'softly': -2.5
}

ai_logprobs = {
    'The': -0.3, 'feline': -1.0, 'creature': -1.1, 'positioned': -1.2, 
    'itself': -0.9, 'upon': -1.3, 'the': -0.3, 'textile': -2.0, 'surface': -1.4
}

# Calculate metrics
human_total, human_avg = calculate_text_logprob(human_text_tokens, human_logprobs)
ai_total, ai_avg = calculate_text_logprob(ai_text_tokens, ai_logprobs)


Comparing the logprob profiles of the two texts:

print(f"Human text - Total logprob: {human_total:.2f}, Average: {human_avg:.2f}")
print(f"AI text - Total logprob: {ai_total:.2f}, Average: {ai_avg:.2f}")
print(f"Difference: {ai_avg - human_avg:.2f}")
Human text - Total logprob: -13.20, Average: -1.47
AI text - Total logprob: -9.50, Average: -1.06
Difference: 0.41
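
The verdict below comes from a simple threshold on the average logprob. A minimal sketch of such a check (the -1.5 cutoff is an illustrative assumption, not a calibrated threshold) looks like this:

THRESHOLD = -1.5  # illustrative cutoff; a real detector would calibrate this on labelled data

if ai_avg > THRESHOLD:
    print(f"\n🔍 Detection Result: Likely AI-generated (avg logprob: {ai_avg:.2f} > {THRESHOLD})")
else:
    print(f"\n🔍 Detection Result: Likely human-written (avg logprob: {ai_avg:.2f} <= {THRESHOLD})")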


🔍 Detection Result: Likely AI-generated (avg logprob: -1.06 > -1.5)

Practical Example: Extracting Token Probabilities

Let’s start with a practical example using the transformers library to extract and visualise token probabilities from a language model.

Load a smaller model for demonstration (GPT-2):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Model: {model_name}")
print(f"Vocabulary size: {len(tokenizer)}")
print(f"Max sequence length: {model.config.max_position_embeddings}")
Model: gpt2
Vocabulary size: 50257
Max sequence length: 1024
def get_token_probabilities(
    model, 
    tokenizer, 
    prompt: str, 
    max_new_tokens: int = 50,
    return_logits: bool = False
) -> Tuple[List[str], List[torch.Tensor], List[Dict[str, float]]]:
    """
    Extract token probabilities for each generation step.
    
    Returns:
        - tokens: List of generated tokens
        - probabilities: List of probability tensors for each step
        - top_k_probs: List of dictionaries with top-k probabilities for each step
    """
    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"]
    
    tokens = []
    probabilities = []
    top_k_probs = []
    
    # Generate tokens one by one
    for step in range(max_new_tokens):
        with torch.no_grad():
            # Forward pass
            outputs = model(input_ids)
            logits = outputs.logits[0, -1, :]  # Get logits for the last position
            
            # Convert to probabilities
            probs = torch.softmax(logits, dim=-1)
            probabilities.append(probs)
            
            # Get top-k probabilities
            top_k = torch.topk(probs, k=10)
            top_k_dict = {
                tokenizer.decode([idx]): float(prob)
                for idx, prob in zip(top_k.indices, top_k.values)
            }
            top_k_probs.append(top_k_dict)
            
            # Select the most probable token (greedy decoding)
            next_token_id = torch.argmax(probs).item()
            next_token = tokenizer.decode([next_token_id])
            tokens.append(next_token)
            
            # Append to input for next iteration
            input_ids = torch.cat([input_ids, torch.tensor([[next_token_id]])], dim=1)
            
            # Stop if we hit the end-of-sequence token
            if next_token_id == tokenizer.eos_token_id:
                break
    
    return tokens, probabilities, top_k_probs

Test our function with a simple prompt:

prompt = "The future of artificial intelligence is"
tokens, probabilities, top_k_probs = get_token_probabilities(
    model, tokenizer, prompt, max_new_tokens=8
)

print(f"Prompt: '{prompt}'")
print(f"Generated tokens: {tokens}")
print(f"Full generated text: '{prompt}{''.join(tokens)}'")
Prompt: 'The future of artificial intelligence is'
Generated tokens: [' uncertain', '.', '\n', '\n', '"', 'We', "'re", ' not']
Full generated text: 'The future of artificial intelligence is uncertain.

"We're not'

Let’s examine the top-k probabilities for the first few tokens:

for i, (token, top_k_dict) in enumerate(zip(tokens[:3], top_k_probs[:3])):
    print(f"\nStep {i+1}: Generated token '{token}'")
    print("Top 5 most probable tokens:")
    for j, (tok, prob) in enumerate(list(top_k_dict.items())[:5]):
        print(f"  {j+1}. '{tok}' (probability: {prob:.4f})")

Step 1: Generated token ' uncertain'
Top 5 most probable tokens:
  1. ' uncertain' (probability: 0.0766)
  2. ' in' (probability: 0.0694)
  3. ' not' (probability: 0.0441)
  4. ' a' (probability: 0.0407)
  5. ' still' (probability: 0.0345)

Step 2: Generated token '.'
Top 5 most probable tokens:
  1. '.' (probability: 0.3799)
  2. ',' (probability: 0.2969)
  3. ',"' (probability: 0.0575)
  4. '."' (probability: 0.0461)
  5. ' and' (probability: 0.0385)

Step 3: Generated token '
'
Top 5 most probable tokens:
  1. '
' (probability: 0.0756)
  2. ' The' (probability: 0.0730)
  3. ' But' (probability: 0.0720)
  4. ' It' (probability: 0.0550)
  5. ' In' (probability: 0.0444)

Visualising Token Probabilities

Now let’s create some visualisations to better understand how token probabilities evolve during generation.
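
As a starting point, here is a minimal sketch of one such visualisation: a heatmap of the top-5 candidate probabilities at each generation step, built from the top_k_probs and tokens collected above (matplotlib is assumed to be available; the layout choices are illustrative):

import matplotlib.pyplot as plt

top_n = 5

# Build a (steps x top_n) matrix of probabilities plus the matching token labels
prob_matrix = np.array([list(step.values())[:top_n] for step in top_k_probs])
labels = [list(step.keys())[:top_n] for step in top_k_probs]

fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(prob_matrix.T, aspect="auto", cmap="viridis")

# Annotate each cell with the candidate token it represents
for step_idx, step_labels in enumerate(labels):
    for rank, tok in enumerate(step_labels):
        ax.text(step_idx, rank, repr(tok), ha="center", va="center", fontsize=7, color="white")

ax.set_xticks(range(len(tokens)))
ax.set_xticklabels([repr(t) for t in tokens], rotation=45, ha="right")
ax.set_xlabel("Generation step (chosen token)")
ax.set_ylabel("Candidate rank")
ax.set_title("Top-5 candidate probabilities at each generation step")
fig.colorbar(im, ax=ax, label="Probability")
plt.tight_layout()
plt.show()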

Interpreting Token Probability Visualisations

The visualisations above reveal several important insights about how LLMs generate text. When tokens have high probabilities close to 1.0, the model is very confident about what comes next. Low probability tokens suggest the model is uncertain or exploring multiple possibilities. You can see this difference between sharp probability distributions where one token dominates versus flat distributions where multiple tokens have similar probabilities.

The probabilities change dramatically as context builds up. Early tokens often show higher uncertainty because the model hasn’t established a clear direction yet. Later tokens become more predictable as the growing context constrains the possible choices. This context sensitivity is crucial for understanding how language models work.

Model behaviour patterns emerge from these probability distributions. When a model keeps selecting high-probability tokens, it’s operating in a confident generation mode. Flat distributions suggest the model is exploring multiple valid options rather than committing to one path. Sudden drops in probability for the selected token might indicate the model is making a surprising choice that breaks from its usual patterns.

Practical Applications

Understanding token probabilities is valuable for several practical applications. For model debugging, you can identify where the model becomes uncertain or makes unexpected choices. This helps detect when the model is hallucinating by selecting low-probability tokens, and it reveals why certain prompts lead to specific outputs.

In prompt engineering, this knowledge helps design prompts that lead to more confident, coherent outputs. You can identify ambiguous prompts that cause high uncertainty and optimise prompts for specific types of responses. The probability patterns guide you toward prompts that work well with the model’s natural tendencies.

For model comparison, token probabilities let you compare how different models handle the same prompt. This reveals the trade-offs between different model architectures and helps evaluate the effects of model improvements and fine-tuning. By examining these probability distributions, you gain insight into how each model makes decisions and where their strengths and weaknesses lie.
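
As a simple illustration of such a comparison, the sketch below loads a second model (distilgpt2, chosen purely as an example) and compares the two models’ top next-token candidates for the same prompt:

# Load a second model for comparison (distilgpt2 is an illustrative choice)
other_name = "distilgpt2"
other_tokenizer = AutoTokenizer.from_pretrained(other_name)
other_model = AutoModelForCausalLM.from_pretrained(other_name)

def top_next_tokens(mdl, tok, prompt: str, k: int = 5):
    """Return the top-k next-token candidates and their probabilities."""
    ids = tok(prompt, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        probs = torch.softmax(mdl(ids).logits[0, -1, :], dim=-1)
    top = torch.topk(probs, k=k)
    return [(tok.decode([i]), round(float(p), 4)) for i, p in zip(top.indices, top.values)]

comparison_prompt = "The future of artificial intelligence is"
print("gpt2:      ", top_next_tokens(model, tokenizer, comparison_prompt))
print("distilgpt2:", top_next_tokens(other_model, other_tokenizer, comparison_prompt))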

Quantifying Uncertainty: Entropy and Other Metrics

Beyond visual inspection, we can quantify the uncertainty in token probability distributions using various metrics.

def calculate_uncertainty_metrics(probabilities: List[torch.Tensor]) -> Dict[str, List[float]]:
    """
    Calculate various uncertainty metrics for each generation step.
    """
    metrics = {
        'entropy': [],
        'max_probability': [],
        'top_k_entropy': [],  # Entropy of top-10 tokens
        'gini_coefficient': []
    }
    
    for prob_tensor in probabilities:
        probs = prob_tensor.numpy()
        
        # Entropy: -sum(p * log(p))
        entropy = -np.sum(probs * np.log(probs + 1e-10))
        metrics['entropy'].append(entropy)
        
        # Maximum probability
        max_prob = np.max(probs)
        metrics['max_probability'].append(max_prob)
        
        # Top-k entropy (entropy of top-10 tokens)
        top_k_indices = np.argpartition(probs, -10)[-10:]
        top_k_probs = probs[top_k_indices]
        top_k_probs = top_k_probs / np.sum(top_k_probs)  # Normalise
        top_k_entropy = -np.sum(top_k_probs * np.log(top_k_probs + 1e-10))
        metrics['top_k_entropy'].append(top_k_entropy)
        
        # Gini coefficient (measure of inequality in the probability distribution)
        sorted_probs = np.sort(probs)  # ascending order, as the standard formula expects
        n = len(sorted_probs)
        cumsum = np.cumsum(sorted_probs)
        gini = (n + 1 - 2 * np.sum(cumsum) / cumsum[-1]) / n
        metrics['gini_coefficient'].append(gini)
    
    return metrics

Calculate the uncertainty metrics for our example and display them for each generation step:

import pandas as pd

uncertainty_metrics = calculate_uncertainty_metrics(probabilities)

# Create a DataFrame for easier analysis
metrics_df = pd.DataFrame(uncertainty_metrics)
metrics_df['step'] = range(1, len(metrics_df) + 1)
metrics_df['generated_token'] = tokens

print(metrics_df.round(4))
   entropy  max_probability  top_k_entropy  gini_coefficient  step  \
0   5.2350           0.0766         2.1672            0.9926     1   
1   2.1829           0.3799         1.5217            0.9996     2   
2   4.7959           0.0756         2.1668            0.9945     3   
3   0.0088           0.9994         0.0027            1.0000     4   
4   5.4840           0.0979         2.0205            0.9915     5   
5   4.0912           0.1282         2.0879            0.9973     6   
6   3.3770           0.2063         2.0379            0.9990     7   
7   4.5594           0.1198         2.1186            0.9957     8   

  generated_token  
0       uncertain  
1               .  
2              \n  
3              \n  
4               "  
5              We  
6             're  
7             not  

Advanced Techniques and Considerations

The examples above used greedy decoding, which always selects the highest probability token. However, different sampling strategies can reveal different aspects of the probability distribution. Top-k sampling only considers the top-k most probable tokens, while top-p (nucleus) sampling considers tokens whose cumulative probability exceeds a threshold. Temperature scaling adjusts the sharpness of the probability distribution, making it more or less peaked.
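
To make these strategies concrete, here is a minimal sketch of temperature scaling followed by top-p (nucleus) filtering, applied to the last-position logits of the GPT-2 model loaded earlier (the temperature and top_p values are illustrative):

def sample_with_temperature_top_p(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.9) -> int:
    """Sample a token id after temperature scaling and top-p (nucleus) filtering."""
    # Temperature scaling: values below 1 sharpen the distribution, values above 1 flatten it
    probs = torch.softmax(logits / temperature, dim=-1)

    # Keep the smallest prefix of sorted tokens whose cumulative probability reaches top_p
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    outside_nucleus = (cumulative - sorted_probs) >= top_p  # cumulative mass *before* each token
    filtered = sorted_probs.masked_fill(outside_nucleus, 0.0)
    filtered = filtered / filtered.sum()  # renormalise over the nucleus

    choice = torch.multinomial(filtered, num_samples=1)
    return int(sorted_indices[choice])

# Illustrative usage: sample one continuation token for the earlier prompt
inputs = tokenizer("The future of artificial intelligence is", return_tensors="pt")
with torch.no_grad():
    last_logits = model(**inputs).logits[0, -1, :]
print(repr(tokenizer.decode([sample_with_temperature_top_p(last_logits)])))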

Token probabilities are closely related to the model’s attention patterns. Visualising attention weights alongside token probabilities can provide deeper insights into how the model processes context and makes decisions about which parts of the input to focus on.

Different model architectures handle token probabilities in various ways. Autoregressive models like GPT generate probabilities sequentially, building on previous tokens. Encoder-decoder models such as T5 use different probability distributions for encoding and decoding phases. Bidirectional models like BERT can generate probabilities for masked tokens, allowing them to consider context from both directions.

Extracting token probabilities for every generation step can be computationally expensive, especially for large models. Batch processing for multiple sequences can help improve efficiency, as can selective probability extraction that only focuses on specific steps. For very large vocabularies, approximation methods can provide reasonable estimates without the full computational cost.
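
As one example, a single batched forward pass can score several prompts at once; the sketch below (with illustrative prompts, and left padding so that the last position of each row lines up with its newest token) extracts next-token distributions for a small batch:

prompts = [
    "The future of artificial intelligence is",
    "The best way to learn programming is",
]

# Left padding keeps the final position of every row aligned with its last real token
tokenizer.padding_side = "left"
batch = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**batch).logits                      # shape: (batch, seq_len, vocab)
    next_token_probs = torch.softmax(logits[:, -1, :], dim=-1)

for prompt_text, probs in zip(prompts, next_token_probs):
    top = torch.topk(probs, k=3)
    candidates = [(tokenizer.decode([i]), round(float(p), 4)) for i, p in zip(top.indices, top.values)]
    print(f"{prompt_text!r} -> {candidates}")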

Conclusion

Token probabilities provide a view into model decision-making, showing not just what the model chooses but how confident it is about that choice. Context matters (a lot), as probabilities change dramatically as the generation context evolves. Uncertainty metrics like entropy and Gini coefficient help quantify model confidence, while visualisation through heatmaps, bar charts, and time series plots reveals patterns that numbers alone cannot show.

There are plenty of practical applications for token probabilities, from debugging to prompt engineering to model comparison. Understanding these concepts is essential for working effectively with LLMs, whether you’re trying to understand model behaviour, debug generation issues, or optimise prompts.

Finally, token probability visualisation is a useful tool for understanding and working with LLMs. By examining how models assign probabilities to tokens, we can gain real insight into model behaviour.