Perplexity Calculator
Calculate and understand perplexity scores for language model evaluation
Calculate Perplexity

Example: with a cross-entropy of 2.5000 bits, PPL = 2^H = 2^2.5 ≈ 5.6569 (shown rounded as 5.66).

Rating: Excellent (the model predicts next tokens very accurately).
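As a quick sanity check of the example above, the same conversion takes only a couple of lines of Python (just the arithmetic, not the calculator's own code):

```python
import math

# Cross-entropy from the example above, measured in bits
cross_entropy_bits = 2.5

# Perplexity is 2 raised to the cross-entropy in bits
perplexity = 2 ** cross_entropy_bits
print(f"{perplexity:.4f}")  # 5.6569

# Going the other way: recover the cross-entropy from a perplexity value
print(f"{math.log2(perplexity):.4f}")  # 2.5000
```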
Model Benchmarks (WikiText-2)
Lower perplexity = better. Values are approximate and may vary by evaluation method.
Related Tools
ROUGE Score Calculator
Calculate ROUGE-N and ROUGE-L scores for summarization tasks
Temperature Visualizer
Visualize how temperature and top-p sampling affect next-token probabilities
Tokenization Visualizer
See how text is broken down into tokens by different tokenizers (BPE, WordPiece)
Nucleus Sampling (Top-p) Demo
Interactive demo explaining how Nucleus Sampling filters token selection
Vector Dimension Guide
Reference for default embedding dimensions of popular models (OpenAI, Cohere, etc.)
Attention Mechanism Demo
Interactive visualizer of how self-attention works in transformers
What is Perplexity?
Perplexity is a measurement of how well a language model predicts a text sample. Intuitively, it represents "how surprised" the model is by the text. Lower perplexity means the model predicts the text more accurately.
A perplexity of 10 means the model is as confused as if it had to choose uniformly among 10 options at each step. A perplexity of 1 would mean perfect prediction.
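To make that intuition concrete, here is a small illustrative Python check (not tied to any particular model): a model that spreads its probability uniformly over N options has a perplexity of exactly N.

```python
import math

def perplexity_of_uniform(n_options: int) -> float:
    """Perplexity of a model that assigns probability 1/n to each of n options."""
    p = 1.0 / n_options
    cross_entropy_bits = -math.log2(p)  # equals log2(n) for a uniform distribution
    return 2 ** cross_entropy_bits

print(f"{perplexity_of_uniform(10):.2f}")  # 10.00: as confused as a uniform 10-way choice
print(f"{perplexity_of_uniform(1):.2f}")   # 1.00: perfect prediction
```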
The Formula
PPL = 2^(H(p))
PPL = Perplexity
H(p) = Cross-entropy loss in bits
Alternatively: PPL = exp(H(p)) when the cross-entropy is measured in nats (natural log)
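A short sketch of how the formula applies to a sequence of per-token probabilities; the probability values here are made up for illustration:

```python
import math

# Hypothetical probabilities the model assigned to each actual next token
token_probs = [0.25, 0.10, 0.50, 0.05, 0.30]
n = len(token_probs)

# Average cross-entropy in bits: H = -(1/N) * sum(log2 p_i)
h_bits = -sum(math.log2(p) for p in token_probs) / n

# The same quantity in nats, as deep-learning frameworks usually report it
h_nats = -sum(math.log(p) for p in token_probs) / n

# Both conventions give the same perplexity
print(f"{2 ** h_bits:.4f}")       # PPL = 2^(H in bits)
print(f"{math.exp(h_nats):.4f}")  # PPL = e^(H in nats), identical up to rounding
```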
Perplexity Scale
| Range | Quality | Meaning |
|---|---|---|
| <10 | Excellent | State-of-the-art LLMs |
| 10-20 | Very Good | Strong language models |
| 20-50 | Good | Older or smaller LLMs |
| 50-100 | Fair | Basic models, specialized domains |
| >100 | Poor | Untrained or domain mismatch |
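If it helps to have the scale in code form, here is a small hypothetical helper. The table leaves the exact boundary values ambiguous, so the cutoffs below are one reasonable reading of it:

```python
def perplexity_quality(ppl: float) -> str:
    """Map a perplexity value to the quality bands in the table above."""
    if ppl < 10:
        return "Excellent"
    if ppl < 20:
        return "Very Good"
    if ppl < 50:
        return "Good"
    if ppl <= 100:
        return "Fair"
    return "Poor"

print(perplexity_quality(5.66))  # Excellent
print(perplexity_quality(35.0))  # Good
```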
Important Caveats
Perplexity is dataset-specific. A model may have low perplexity on one dataset but high perplexity on another. It is also tokenizer-dependent: because perplexity is computed per token, models that split text into different numbers of tokens are not directly comparable. Always compare models on the same evaluation set with the same tokenizer.
Frequently Asked Questions
Why base 2 vs base e?
Both are valid. Base 2 gives cross-entropy in bits (common in information theory). Base e (natural log) is often used in deep learning frameworks. Just be consistent when comparing models.
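A quick numeric check of that equivalence (the cross-entropy value is arbitrary):

```python
import math

h_nats = 1.7329                # example cross-entropy as a framework might report it (natural log)
h_bits = h_nats / math.log(2)  # convert nats to bits: H_bits = H_nats / ln(2)

print(f"{math.exp(h_nats):.4f}")  # perplexity via base e
print(f"{2 ** h_bits:.4f}")       # perplexity via base 2, same value
```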
How do I calculate perplexity for my model?
Run your model on a test dataset, collect the average per-token cross-entropy loss, then PPL = exp(loss) (assuming the loss uses the natural log, as most frameworks do). Most ML frameworks report this loss automatically during evaluation.
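As a concrete sketch, assuming a Hugging Face causal language model; the model name gpt2 and the sample text are placeholders, and a real evaluation would loop over a full test set:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and text; swap in your own model and held-out evaluation data
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Perplexity measures how well a language model predicts a sample of text."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss in nats
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")
```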
