Top-P & Top-K Explainer

Visualize how nucleus sampling (top-p) and top-k filtering affect token selection

Sampling Parameters

Top-P: 0.90 (selects tokens until cumulative probability reaches 90%)

Top-K: 40 (selects only the top 40 most likely tokens)

Token Probability Distribution

Illustrative distribution. Tokens above the cutoff are included in sampling; tokens below it are excluded.

Token  | Probability | Cumulative
the    | 25.0%       | 25.0%
a      | 18.0%       | 43.0%
an     | 12.0%       | 55.0%
this   | 10.0%       | 65.0%
that   | 8.0%        | 73.0%
my     | 6.0%        | 79.0%
your   | 5.0%        | 84.0%
his    | 4.0%        | 88.0%
her    | 3.0%        | 91.0%  <- top-p cutoff
their  | 2.2%        | 93.2%
our    | 1.8%        | 95.0%
its    | 1.2%        | 96.2%
some   | 0.9%        | 97.1%
any    | 0.8%        | 97.9%
each   | 0.6%        | 98.5%
every  | 0.5%        | 99.0%
one    | 0.4%        | 99.4%
two    | 0.3%        | 99.7%
three  | 0.2%        | 99.9%
many   | 0.1%        | 100.0%

Summary

Tokens in candidate pool: 9
Probability mass covered: 91.0%
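
To make the summary concrete, here is a minimal Python sketch, using the illustrative probabilities from the table above, that derives both numbers for top-p = 0.90:

```python
# Illustrative distribution from the table above, already sorted by probability.
probs = [
    ("the", 0.250), ("a", 0.180), ("an", 0.120), ("this", 0.100),
    ("that", 0.080), ("my", 0.060), ("your", 0.050), ("his", 0.040),
    ("her", 0.030), ("their", 0.022), ("our", 0.018), ("its", 0.012),
    ("some", 0.009), ("any", 0.008), ("each", 0.006), ("every", 0.005),
    ("one", 0.004), ("two", 0.003), ("three", 0.002), ("many", 0.001),
]

top_p = 0.90
pool, cumulative = [], 0.0
for token, p in probs:
    pool.append(token)
    cumulative += p
    if cumulative >= top_p:  # stop once the pool covers at least 90% of the mass
        break

print(f"{len(pool)} tokens cover {cumulative:.1%}")  # -> 9 tokens cover 91.0%
```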

What is Top-P (Nucleus Sampling)?

Top-P, also called nucleus sampling, selects tokens from the smallest set whose cumulative probability reaches P. For example, top-p=0.9 means the model samples only from the smallest set of tokens that together cover at least 90% of the probability mass.

This adapts dynamically to the distribution — if the model is confident, fewer tokens are considered. If uncertain, more tokens are included.
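
As a rough sketch (a simplified illustration, not any particular library's implementation), top-p sampling over a token-to-probability mapping might look like this in Python:

```python
import random

def nucleus_sample(token_probs: dict[str, float], top_p: float = 0.9) -> str:
    """Sample one token from the smallest set whose cumulative
    probability reaches top_p (the 'nucleus')."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, cumulative = [], 0.0
    for token, p in ranked:
        pool.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, weights = zip(*pool)
    # random.choices normalizes the weights, which effectively renormalizes
    # the surviving probabilities before sampling.
    return random.choices(tokens, weights=weights, k=1)[0]
```

Real implementations apply this filtering to the model's logits at every decoding step, but the logic is the same.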

What is Top-K?

Top-K simply limits selection to the K most probable tokens, regardless of their actual probabilities. For example, top-k=40 means only the 40 most likely tokens are candidates for selection.

This is simpler but less adaptive — it always considers exactly K tokens whether the model is confident or uncertain.
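
A comparable sketch for top-k, under the same assumptions as the top-p example above:

```python
import random

def top_k_sample(token_probs: dict[str, float], k: int = 40) -> str:
    """Sample one token from the k most probable candidates."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*ranked)
    # The pool is always k tokens (or fewer, if the vocabulary is smaller),
    # no matter how the probability mass is spread.
    return random.choices(tokens, weights=weights, k=1)[0]
```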

Comparison

Aspect        | Top-P                             | Top-K
Basis         | Cumulative probability            | Token count
Adaptive?     | Yes - fewer tokens when confident | No - always K tokens
Common values | 0.9, 0.95                         | 40, 50, 100
Best for      | Most use cases                    | Very large vocabularies

Pro Tip: Use Top-P Alone

OpenAI recommends adjusting either temperature OR top-p, not both. Top-p=0.9 with temperature=1.0 usually gives good results. Adding top-k on top is rarely necessary.
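
With the OpenAI Python SDK, that advice might translate to something like the following (the model name is a placeholder; check the current docs for available models and defaults):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Suggest a name for a sampling demo."}],
    top_p=0.9,        # tune the nucleus size here...
    temperature=1.0,  # ...and leave temperature at its default
)
print(response.choices[0].message.content)
```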

Frequently Asked Questions

Should I use top-p or top-k?

Top-p is generally preferred because it adapts to the model's confidence. Top-k is useful when you know your vocabulary size and want fixed limits.

What's a good top-p value?

0.9 is a common default. Lower values (0.7-0.8) reduce randomness, while higher values (0.95-1.0) allow more variety.

Can I use both together?

Yes, they can be combined. The model will use whichever filter is more restrictive at each step. However, using one at a time is usually clearer and easier to tune.
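
A minimal sketch of how the combination behaves, assuming (as common implementations do) that top-k is applied first and top-p then trims the survivors:

```python
def combined_pool(token_probs: dict[str, float],
                  k: int = 40, top_p: float = 0.9) -> list[str]:
    """Candidate pool when top-k and top-p are both active; effectively
    the intersection, so the more restrictive filter wins."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    survivors = ranked[:k]  # top-k: keep at most k tokens
    pool, cumulative = [], 0.0
    for token, p in survivors:  # top-p: trim further by cumulative mass
        pool.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return pool
```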

Related Tools

Temperature Simulator

See how temperature affects distributions.

Tokenization Visualizer

See how text becomes tokens.