Attention Demo
Visualize self-attention patterns in transformer models
This is a simplified demonstration of self-attention. Real transformer attention patterns are learned and much more complex, with multiple heads and layers.
Input Sentence
Attention Patterns
Click a token to see what it attends to. Brighter colors = stronger attention.
Related Tools
BLEU Score Calculator
Calculate BLEU score for machine translation evaluation
Cosine Similarity Calc
Calculate similarity between two vectors or text embeddings
Embedding 3D Visualizer
Visualize high-dimensional embeddings in 2D/3D using a PCA/t-SNE simulation
Perplexity Explainer
Calculate and understand perplexity from probability distributions
ROUGE Score Calculator
Calculate ROUGE-N and ROUGE-L scores for summarization tasks
Temperature Visualizer
Visualize how temperature and top-p sampling affect next-token probabilities
What is Attention?
Attention is the mechanism that allows transformer models to weigh the importance of different parts of the input when processing each token. It's the key innovation behind models like BERT, GPT, and modern LLMs.
Self-attention lets each token "look at" all other tokens and decide which ones are most relevant for understanding its context.
How Attention Works
Query, Key, Value
Each token is projected into three vectors: Query (what am I looking for?), Key (what do I offer for matching?), and Value (what information do I carry?).
Compute Scores
Each token's Query is compared with every Key to compute attention scores, usually via a dot product scaled by the square root of the key dimension.
Softmax
Scores are normalized into a probability distribution using softmax, so each token's attention weights sum to 1.
Weighted Sum
Values are weighted by these probabilities and summed to produce each token's output representation (see the sketch below).
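As a minimal NumPy sketch of these four steps (the dimensions, random toy embeddings, and untrained projection matrices are illustrative assumptions, not the demo's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_k) projection matrices
    """
    Q = X @ W_q                               # 1. Queries: what each token is looking for
    K = X @ W_k                               #    Keys: what each token offers for matching
    V = X @ W_v                               #    Values: the information each token carries
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # 2. Dot-product scores, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)        # 3. Each row becomes a probability distribution
    return weights @ V, weights               # 4. Weighted sum of Values per token

# Toy example: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(weights.round(2))   # row i shows how much token i attends to each token j; rows sum to 1
```

The `weights` matrix printed at the end is exactly what the demo visualizes: one row per token, with brighter cells corresponding to larger attention probabilities.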
Attention Patterns
Self-Attention
Tokens attend to themselves and nearby context.
Long-Range Dependencies
Important tokens can attend to distant relevant tokens.
Positional Patterns
Early layers often attend locally, while later layers attend more globally.
Pro Tip: Multi-Head Attention
Real transformers use multiple attention "heads" that learn different patterns. GPT-3 has 96 heads per layer! This demo shows a simplified single-head view.
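For intuition, here is a toy sketch of the multi-head idea (the head count, dimensions, and random untrained projections are illustrative only): the model dimension is split across heads, each head attends independently, and the per-head outputs are concatenated.

```python
import numpy as np

def multi_head_attention(X, n_heads):
    """Illustrative multi-head self-attention: split d_model across heads,
    attend independently in each head, then concatenate the results."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    rng = np.random.default_rng(0)
    outputs = []
    for h in range(n_heads):
        # Each head gets its own (random, untrained) projections in this toy sketch,
        # which is why each head can learn a different attention pattern in a real model.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)   # softmax over each row
        outputs.append(weights @ V)                 # each head outputs (seq_len, d_head)
    return np.concatenate(outputs, axis=-1)         # back to (seq_len, d_model)

out = multi_head_attention(np.random.default_rng(1).normal(size=(6, 64)), n_heads=8)
print(out.shape)  # (6, 64)
```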
Frequently Asked Questions
Why is attention computationally expensive?
Standard attention has O(n²) time and memory complexity: every token attends to every other token, so the score matrix grows quadratically with sequence length. This is why context lengths were historically limited. Newer architectures reduce the cost with sparse attention and other approximations.
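To see why the quadratic term bites, here is a rough back-of-the-envelope calculation of the memory needed to store a single float32 attention matrix for one head in one layer (the sequence lengths are just example values):

```python
# One attention matrix at float32 costs n * n * 4 bytes.
for n in (1_024, 8_192, 131_072):
    bytes_needed = n * n * 4
    print(f"seq_len={n:>7,}: {bytes_needed / 2**20:,.0f} MiB")
# seq_len=  1,024: 4 MiB
# seq_len=  8,192: 256 MiB
# seq_len=131,072: 65,536 MiB
```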
What is causal/masked attention?
In autoregressive models like GPT, each token can only attend to earlier tokens, never to future ones. This is enforced by adding a triangular mask to the attention scores so that future positions receive zero weight after the softmax.
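A small NumPy sketch of how such a causal mask is typically built and applied (an illustration of the idea, not any specific library's implementation):

```python
import numpy as np

seq_len = 5
# Positions above the diagonal (j > i) are future tokens; set them to -inf
# so that softmax assigns them zero probability.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
masked = scores + mask
weights = np.exp(masked - masked.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
print(weights.round(2))  # row i has non-zero weight only on tokens 0..i
```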
Related Tools
Tokenization Visualizer
See how text becomes tokens.
Embedding Visualizer
Explore token embeddings in 2D.
