Visualising Token Probabilities in Large Language Models
Machine Learning
LLMs
Visualisation
NLP
Published
October 12, 2025
When working with Large Language Models (LLMs), it is important to understand how they make decisions at each step of text generation. Token probabilities provide a window into the model’s generation process, showing us which tokens the model considers most likely at each position.
We’ll explore what token probabilities are, how they work, and how to visualise them effectively to understand LLM behaviour.
What Are Token Probabilities?
Token probabilities represent the likelihood that a specific token (word, subword, or character) will be generated at a given position in the sequence. When an LLM generates text, it doesn’t just pick the most likely next word; it considers a probability distribution over the entire vocabulary.
For each position in the sequence, the model outputs a probability distribution over all possible tokens (in its vocabulary). This distribution is typically computed using a softmax function over the model’s logits (raw output scores).
The process is generally as follows (a minimal single-step code sketch appears after the list):
Input Processing: The model takes the current sequence of tokens as input
Forward Pass: The model processes this through its neural network layers
Logit Generation: The final layer produces raw scores (logits) for each token in the vocabulary
Probability Calculation: A softmax function converts these logits into probabilities
Token Selection: The model either selects the most probable token (greedy decoding) or samples from the distribution
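To make steps 3–5 concrete, here is a minimal single-step sketch using PyTorch and GPT-2 (the same model loaded later in this post); the prompt is just an illustrative placeholder:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative setup: any Hugging Face causal LM behaves the same way
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The cat sat on the", return_tensors="pt")["input_ids"]

with torch.no_grad():
    logits = model(input_ids).logits[0, -1, :]   # raw scores for the next token (step 3)

probs = torch.softmax(logits, dim=-1)            # softmax turns logits into probabilities (step 4)
next_token_id = torch.argmax(probs).item()       # greedy decoding: pick the most probable token (step 5)
print(tokenizer.decode([next_token_id]), float(probs[next_token_id]))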
The key insight is that these probabilities change dynamically as the context evolves. Each new token influences the probability distribution for subsequent tokens.
Before diving into practical examples, it’s crucial to understand logprobs (log probabilities), which are the fundamental building blocks that LLMs actually work with internally.
Logprobs are simply the natural logarithm of probabilities:
\text{logprob} = \log(p)
To convert back to actual probabilities, we use:
p = \exp(\text{logprob})
For example:
A logprob of -1.69 corresponds to a probability of \exp(-1.69) \approx 0.18
A logprob of -6.06 corresponds to a probability of \exp(-6.06) \approx 0.002
Why Do LLMs Use Logprobs Instead of Probabilities?
The main reason is numerical stability. Consider calculating the joint probability of a text T consisting of tokens x_1, x_2, \ldots, x_n:
P(T) = \prod_{i=1}^{n} p(x_i \mid x_{<i})
Each factor is a small number, so multiplying many of them together quickly underflows to zero in floating-point arithmetic. Adding logprobs is numerically much safer than multiplying probabilities, thanks to the identity
\log(p \times q) = \log(p) + \log(q)
which turns the product above into a sum: \log P(T) = \sum_{i=1}^{n} \log p(x_i \mid x_{<i}).
Logprobs as Confidence Measures
Logprobs can be interpreted as a measure of the model’s “confidence” in its predictions:
High logprobs (closer to 0): The model is very confident
Low logprobs (more negative): The model is uncertain
This has practical applications:
Classification confidence: Use logprobs to indicate prediction certainty
Text quality assessment: Calculate total logprob across a sequence
AI detection: Compare logprobs of human vs. AI-generated text
Non-Deterministic Nature
Important note: logprobs are not fully deterministic. Running the same query multiple times can yield slightly different logprobs, even with the temperature set to 0, because of non-determinism in the inference stack (for example, floating-point and batching effects).
# Let's demonstrate logprobs with a practical example
import numpy as np

# Example logprobs (these would come from an actual LLM)
example_logprobs = {
    'token_4': -1.69,
    'token_7': -1.81,
    'token_5': -1.81,
    'token_1': -6.06,
}
example_logprobs  # displays the dict in a notebook cell

for token, logprob in example_logprobs.items():
    probability = np.exp(logprob)
    print(f"{token}: logprob = {logprob:.2f} → probability = {probability:.4f}")

print(f"\nTotal logprob: {sum(example_logprobs.values()):.2f}")
print(f"Joint probability: {np.exp(sum(example_logprobs.values())):.6f}")
token_4: logprob = -1.69 → probability = 0.1845
token_7: logprob = -1.81 → probability = 0.1637
token_5: logprob = -1.81 → probability = 0.1637
token_1: logprob = -6.06 → probability = 0.0023
Total logprob: -11.37
Joint probability: 0.000012
Numerical Stability Demonstration:
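The effect is easy to reproduce with a few lines of NumPy. The sketch below assumes, purely for illustration, that every token has probability 0.1: the direct product underflows to exactly 0.0 once the sequence is long enough, while the sum of logprobs stays a perfectly ordinary finite number.

import numpy as np

p = 0.1  # hypothetical fixed per-token probability, for illustration only
for n in [10, 100, 200, 400, 800]:
    product = np.prod(np.full(n, p))              # multiplying probabilities: underflows for large n
    logprob_sum = np.sum(np.log(np.full(n, p)))   # adding logprobs: stays finite (n * log(0.1))
    print(f"n={n:4d}  product={product:.3e}  sum of logprobs={logprob_sum:.1f}")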
# Model loading and function definition for heatmap demonstrations
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List, Dict, Tuple
import warnings

warnings.filterwarnings('ignore')

# Load a smaller model for demonstration (GPT-2)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Model: {model_name}")
print(f"Vocabulary size: {len(tokenizer)}")
print(f"Max sequence length: {model.config.max_position_embeddings}")


def get_token_probabilities(
    model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 50,
    return_logits: bool = False,
) -> Tuple[List[str], List[torch.Tensor], List[Dict[str, float]]]:
    """
    Extract token probabilities for each generation step.

    Returns:
    - tokens: List of generated tokens
    - probabilities: List of probability tensors for each step
    - top_k_probs: List of dictionaries with top-k probabilities for each step
    """
    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"]

    tokens = []
    probabilities = []
    top_k_probs = []

    # Generate tokens one by one
    for step in range(max_new_tokens):
        with torch.no_grad():
            # Forward pass
            outputs = model(input_ids)
            logits = outputs.logits[0, -1, :]  # Get logits for the last position

        # Convert to probabilities
        probs = torch.softmax(logits, dim=-1)
        probabilities.append(probs)

        # Get top-k probabilities
        top_k = torch.topk(probs, k=10)
        top_k_dict = {
            tokenizer.decode([idx]): float(prob)
            for idx, prob in zip(top_k.indices, top_k.values)
        }
        top_k_probs.append(top_k_dict)

        # Select the most probable token (greedy decoding)
        next_token_id = torch.argmax(probs).item()
        next_token = tokenizer.decode([next_token_id])
        tokens.append(next_token)

        # Append to input for next iteration
        input_ids = torch.cat([input_ids, torch.tensor([[next_token_id]])], dim=1)

        # Stop if we hit the end-of-sequence token
        if next_token_id == tokenizer.eos_token_id:
            break

    return tokens, probabilities, top_k_probs
Model: gpt2
Vocabulary size: 50257
Max sequence length: 1024
One interesting application is using logprobs to detect AI-generated text. The idea is that AI-generated text typically has different logprob patterns compared to human-written text.
# Simulate AI detection using logprobs
def calculate_text_logprob(text_tokens, token_logprobs):
    """
    Calculate the total logprob for a sequence of tokens.
    """
    total_logprob = sum(token_logprobs.get(token, -10) for token in text_tokens)
    avg_logprob = total_logprob / len(text_tokens)
    return total_logprob, avg_logprob


# Simulate different types of text
human_text_tokens = ['The', 'cat', 'sat', 'on', 'the', 'mat', 'and', 'purred', 'softly']
ai_text_tokens = ['The', 'feline', 'creature', 'positioned', 'itself', 'upon', 'the', 'textile', 'surface']

# Simulate logprobs (AI text tends to have more consistent, higher logprobs)
human_logprobs = {'The': -0.5, 'cat': -1.2, 'sat': -1.8, 'on': -0.8, 'the': -0.5,
                  'mat': -2.1, 'and': -0.6, 'purred': -3.2, 'softly': -2.5}
ai_logprobs = {'The': -0.3, 'feline': -1.0, 'creature': -1.1, 'positioned': -1.2,
               'itself': -0.9, 'upon': -1.3, 'the': -0.3, 'textile': -2.0, 'surface': -1.4}

# Calculate metrics
human_total, human_avg = calculate_text_logprob(human_text_tokens, human_logprobs)
ai_total, ai_avg = calculate_text_logprob(ai_text_tokens, ai_logprobs)
Understanding Token Probability Visualisations
The visualisations above show us how language models make decisions when generating text. These patterns reveal important insights about model behaviour.
Confidence Levels
When a token has a high probability (close to 1.0), it means the model is very sure about choosing that word next. Low probability tokens show the model is uncertain or considering multiple options. Sometimes you’ll see one token dominate with a sharp distribution, while other times multiple tokens have similar probabilities, creating a flat distribution.
How Context Influences Things
The most interesting part is watching how probabilities change as the text builds up. Early tokens often show more uncertainty because the direction is not yet clear. As more context builds, later tokens become more predictable because the previous words help narrow down the choices.
Model Behaviour Patterns
You might notice the model keeps picking high-probability tokens, showing it’s in a confident mode. Other times, flat distributions suggest the model is exploring different valid options. Sometimes you’ll see sudden drops in probability for the chosen token, which might mean the model is making a surprising choice.
Practical Applications
Understanding these patterns helps in several ways:
Model Debugging
You can spot where the model gets confused or makes unexpected choices. This helps detect when the model is “hallucinating” by picking low-probability tokens. It also helps explain why certain prompts lead to specific outputs.
Prompt Engineering
This knowledge helps you write better prompts that lead to more confident and coherent outputs. You can identify prompts that cause too much uncertainty and adjust them for better results.
Model Comparison
You can compare how different models handle the same prompt. This helps understand the trade-offs between different model types and evaluate improvements from fine-tuning.
AI Detection Using Logprobs:
print(f"Human text - Total logprob: {human_total:.2f}, Average: {human_avg:.2f}")print(f"AI text - Total logprob: {ai_total:.2f}, Average: {ai_avg:.2f}")print(f"Difference: {ai_avg - human_avg:.2f}")
Human text - Total logprob: -13.20, Average: -1.47
AI text - Total logprob: -9.50, Average: -1.06
Difference: 0.41
Now let’s create some visualisations to better understand how token probabilities evolve during generation.
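As one possible starting point, here is a sketch of a heatmap built with matplotlib from the get_token_probabilities function defined above: each row is a generation step, each column one of the ten most probable candidate tokens at that step. The prompt is an illustrative placeholder, and the tokens and probabilities it returns are reused by the uncertainty metrics later in the post.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative prompt -- swap in anything you like
prompt = "The weather today is"
tokens, probabilities, top_k_probs = get_token_probabilities(
    model, tokenizer, prompt, max_new_tokens=8
)

# Rows: generation steps; columns: the 10 most probable candidates at that step
heatmap = np.stack([np.sort(p.numpy())[::-1][:10] for p in probabilities])

fig, ax = plt.subplots(figsize=(8, 4))
im = ax.imshow(heatmap, aspect="auto", cmap="viridis")
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)                      # the token actually chosen at each step
ax.set_xlabel("Candidate rank (0 = most probable)")
ax.set_ylabel("Generation step (selected token)")
ax.set_title("Top-10 token probabilities per generation step")
fig.colorbar(im, ax=ax, label="Probability")
plt.tight_layout()
plt.show()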
Interpreting Token Probability Visualisations
The visualisations above reveal several important insights about how LLMs generate text. When tokens have high probabilities close to 1.0, the model is very confident about what comes next. Low probability tokens suggest the model is uncertain or exploring multiple possibilities. You can see this difference between sharp probability distributions where one token dominates versus flat distributions where multiple tokens have similar probabilities.
The probabilities change dramatically as context builds up. Early tokens often show higher uncertainty because the model hasn’t established a clear direction yet. Later tokens become more predictable as the growing context constrains the possible choices. This context sensitivity is crucial for understanding how language models work.
Model behaviour patterns emerge from these probability distributions. When a model keeps selecting high-probability tokens, it’s operating in a confident generation mode. Flat distributions suggest the model is exploring multiple valid options rather than committing to one path. Sudden drops in probability for the selected token might indicate the model is making a surprising choice that breaks from its usual patterns.
Practical Applications
Understanding token probabilities is valuable for several practical applications. For model debugging, you can identify where the model becomes uncertain or makes unexpected choices. This helps detect when the model is hallucinating by selecting low-probability tokens, and it reveals why certain prompts lead to specific outputs.
In prompt engineering, this knowledge helps design prompts that lead to more confident, coherent outputs. You can identify ambiguous prompts that cause high uncertainty and optimise prompts for specific types of responses. The probability patterns guide you toward prompts that work well with the model’s natural tendencies.
For model comparison, token probabilities let you compare how different models handle the same prompt. This reveals the trade-offs between different model architectures and helps evaluate the effects of model improvements and fine-tuning. By examining these probability distributions, you gain insight into how each model makes decisions and where their strengths and weaknesses lie.
Quantifying Uncertainty: Entropy and Other Metrics
Beyond visual inspection, we can quantify the uncertainty in token probability distributions using various metrics.
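The workhorse metric is Shannon entropy. For a next-token distribution p_1, p_2, \ldots, p_V over a vocabulary of size V, the entropy at a given step is
H = -\sum_{i=1}^{V} p_i \log p_i
It equals 0 when all of the probability mass sits on a single token (maximum confidence) and \log V when the distribution is uniform (maximum uncertainty).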
def calculate_uncertainty_metrics(probabilities: List[torch.Tensor]) -> Dict[str, List[float]]:
    """
    Calculate various uncertainty metrics for each generation step.
    """
    metrics = {
        'entropy': [],
        'max_probability': [],
        'top_k_entropy': [],      # Entropy of top-10 tokens
        'gini_coefficient': [],
    }

    for prob_tensor in probabilities:
        probs = prob_tensor.numpy()

        # Entropy: -sum(p * log(p))
        entropy = -np.sum(probs * np.log(probs + 1e-10))
        metrics['entropy'].append(entropy)

        # Maximum probability
        max_prob = np.max(probs)
        metrics['max_probability'].append(max_prob)

        # Top-k entropy (entropy of top-10 tokens)
        top_k_indices = np.argpartition(probs, -10)[-10:]
        top_k_probs = probs[top_k_indices]
        top_k_probs = top_k_probs / np.sum(top_k_probs)  # Normalise
        top_k_entropy = -np.sum(top_k_probs * np.log(top_k_probs + 1e-10))
        metrics['top_k_entropy'].append(top_k_entropy)

        # Gini coefficient (measure of inequality in probability distribution)
        sorted_probs = np.sort(probs)[::-1]
        n = len(sorted_probs)
        cumsum = np.cumsum(sorted_probs)
        gini = (n + 1 - 2 * np.sum(cumsum) / cumsum[-1]) / n
        metrics['gini_coefficient'].append(gini)

    return metrics
Calculate metrics for our example and show uncertainty metrics for each generation step:
import pandas as pd

# `tokens` and `probabilities` come from the earlier get_token_probabilities call
uncertainty_metrics = calculate_uncertainty_metrics(probabilities)

# Create a DataFrame for easier analysis
metrics_df = pd.DataFrame(uncertainty_metrics)
metrics_df['step'] = range(1, len(metrics_df) + 1)
metrics_df['generated_token'] = tokens

print(metrics_df.round(4))
The examples above used greedy decoding, which always selects the highest probability token. However, different sampling strategies can reveal different aspects of the probability distribution. Top-k sampling only considers the top-k most probable tokens, while top-p (nucleus) sampling considers tokens whose cumulative probability exceeds a threshold. Temperature scaling adjusts the sharpness of the probability distribution, making it more or less peaked.
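As a rough sketch (not the exact logic inside transformers' generate method), temperature scaling plus top-k and top-p filtering can be expressed in a few lines of PyTorch; treat the function below as illustrative only:

import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> int:
    """Illustrative sampling step: temperature scaling, then optional top-k / top-p filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    logits = logits / temperature

    # Top-k: keep only the k highest-scoring tokens
    if top_k > 0:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = torch.where(logits < kth_best, torch.full_like(logits, float("-inf")), logits)

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability reaches p
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumulative = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
        drop = cumulative > top_p
        drop[1:] = drop[:-1].clone()  # shift right so the token crossing the threshold is kept
        drop[0] = False
        logits = logits.clone()
        logits[sorted_idx[drop]] = float("-inf")

    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())

Swapping torch.argmax for a call like sample_next_token(logits, temperature=0.8, top_k=50) in the generation loop above turns greedy decoding into stochastic sampling.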
Token probabilities are closely related to the model’s attention patterns. Visualising attention weights alongside token probabilities can provide deeper insights into how the model processes context and makes decisions about which parts of the input to focus on.
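For example, Hugging Face models return per-layer attention weights when called with output_attentions=True. A minimal sketch, reusing the GPT-2 model loaded earlier and an illustrative prompt, might pair the final position's head-averaged attention with the next-token distribution like this:

import torch

# Reuse the GPT-2 model and tokenizer loaded earlier; the prompt is illustrative
inputs = tokenizer("The cat sat on the", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Next-token probabilities from the final position
next_token_probs = torch.softmax(outputs.logits[0, -1, :], dim=-1)

# outputs.attentions: one tensor per layer, shaped (batch, heads, seq_len, seq_len)
last_layer = outputs.attentions[-1]
attention_from_last_token = last_layer[0, :, -1, :].mean(dim=0)  # average over heads

for token_id, weight in zip(inputs["input_ids"][0], attention_from_last_token):
    print(f"{tokenizer.decode([token_id])!r}: attention = {float(weight):.3f}")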
Different model architectures handle token probabilities in various ways. Autoregressive models like GPT generate probabilities sequentially, building on previous tokens. Encoder-decoder models such as T5 use different probability distributions for encoding and decoding phases. Bidirectional models like BERT can generate probabilities for masked tokens, allowing them to consider context from both directions.
Extracting token probabilities for every generation step can be computationally expensive, especially for large models. Batch processing for multiple sequences can help improve efficiency, as can selective probability extraction that only focuses on specific steps. For very large vocabularies, approximation methods can provide reasonable estimates without the full computational cost.
Conclusion
Token probabilities provide a view into model decision-making, showing not just what the model chooses but how confident it is about that choice. Context matters (a lot), as probabilities change dramatically as the generation context evolves. Uncertainty metrics like entropy and Gini coefficient help quantify model confidence, while visualisation through heatmaps, bar charts, and time series plots reveals patterns that numbers alone cannot show.
There are plenty of practical applications for token probabilities, from debugging to prompt engineering to model comparison. Understanding these concepts is essential for working effectively with LLMs, whether you're trying to understand model behaviour, debug generation issues, or optimise prompts.
Finally, token probability visualisation is a useful tool for understanding and working with LLMs. By examining how models assign probabilities to tokens, we gain real insight into model behaviour.