Perplexity Calculator

Calculate and understand perplexity scores for language model evaluation

Calculate Perplexity

Worked example: for a cross-entropy of H = 2.5 bits,

PPL = 2^H = 2^2.5 ≈ 5.6569

Result: Perplexity 5.6569, Cross-Entropy 2.5000 bits. Rating: Excellent (the model predicts next tokens very accurately).

Model Benchmarks (WikiText-2)

Model           Perplexity
GPT-4           3.5
GPT-3.5         9.2
Llama 2 70B     3.3
Llama 2 13B     4.9
Llama 2 7B      5.5
Mistral 7B      5.3
GPT-2 Large     22.1
GPT-2 Small     37.5

Lower perplexity is better. Values are approximate and may vary by evaluation method and tokenizer.


What is Perplexity?

Perplexity is a measurement of how well a language model predicts a text sample. Intuitively, it represents "how surprised" the model is by the text. Lower perplexity means the model predicts the text more accurately.

A perplexity of 10 means the model is as confused as if it had to choose uniformly among 10 options at each step. A perplexity of 1 would mean perfect prediction.
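A quick sanity check of this intuition, as a minimal Python sketch (the `perplexity` function below is illustrative, not from any particular library): a model that assigns uniform probability 1/k to every token has perplexity exactly k.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Uniform probability over 10 options at each step -> perplexity 10.
print(perplexity([0.1] * 20))  # ≈ 10.0
```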

The Formula

PPL = 2^H(p)

PPL = Perplexity

H(p) = Cross-entropy loss in bits

Equivalently, PPL = exp(H(p)) when H(p) is the cross-entropy in nats (natural log).
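Plugging the worked example from the calculator above into the formula confirms the result. A short sketch (`ppl_from_bits` is just an illustrative name):

```python
def ppl_from_bits(h_bits: float) -> float:
    """Perplexity from cross-entropy measured in bits."""
    return 2 ** h_bits

print(round(ppl_from_bits(2.5), 4))  # 5.6569
```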

Perplexity Scale

Range      Quality     Meaning
< 10       Excellent   State-of-the-art LLMs
10-20      Very Good   Strong language models
20-50      Good        Older or smaller LLMs
50-100     Fair        Basic models, specialized domains
> 100      Poor        Untrained models or domain mismatch

Important Caveats

Perplexity is dataset-specific. A model may have low perplexity on one dataset but high on another. Always compare models on the same evaluation set with the same tokenizer.

Frequently Asked Questions

Why base 2 vs base e?

Both are valid. Base 2 gives cross-entropy in bits (common in information theory). Base e (natural log) is often used in deep learning frameworks. Just be consistent when comparing models.
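To see that the two conventions give the same perplexity, note that a loss of H bits equals H · ln 2 nats. A minimal sketch:

```python
import math

h_bits = 2.5                   # cross-entropy in bits (log base 2)
h_nats = h_bits * math.log(2)  # the same loss expressed in nats

print(2 ** h_bits)       # 5.656854...
print(math.exp(h_nats))  # 5.656854... (identical perplexity)
```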

How do I calculate perplexity for my model?

Run your model on a test dataset, collect the average cross-entropy loss, then compute PPL = exp(loss) (assuming the loss is in nats, as it is in most deep learning frameworks). Most ML frameworks report this loss automatically during evaluation.
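As a concrete sketch in PyTorch: this assumes a `model` that maps a batch of token ids directly to next-token logits of shape (batch, seq_len, vocab_size). Real libraries differ; Hugging Face models, for instance, return an output object whose `.logits` attribute holds this tensor.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, input_ids):
    """Perplexity of `model` on a batch of token ids (LongTensor, shape B x T)."""
    logits = model(input_ids)  # (B, T, V); interface assumed, see note above
    # Predict token t+1 from position t: shift logits and labels by one.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = input_ids[:, 1:].reshape(-1)
    # F.cross_entropy returns the mean negative log-likelihood in nats.
    loss = F.cross_entropy(shift_logits, shift_labels)
    return math.exp(loss.item())
```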

Related Tools

BLEU Score

Calculate BLEU scores for translation.

ROUGE Score

Calculate ROUGE scores for summarization.