Attention Demo
Visualize self-attention patterns in transformer models
This is a simplified demonstration of self-attention. Real transformer attention patterns are learned and much more complex, with multiple heads and layers.
Input Sentence
Attention Patterns
Click a token to see what it attends to. Brighter colors = stronger attention.
Related Tools
BLEU Score Calculator
Calculate BLEU score for machine translation evaluation
Cosine Similarity Calc
Calculate similarity between two vectors or text embeddings
Embedding 3D Visualizer
Visualize high-dimensional embeddings in 2D/3D using a PCA/t-SNE simulation
Perplexity Explainer
Calculate and understand perplexity from probability distributions
ROUGE Score Calculator
Calculate ROUGE-N and ROUGE-L scores for summarization tasks
Temperature Visualizer
Visualize how temperature and top-p sampling affect next-token probabilities
What is Attention?
Attention is the mechanism that allows transformer models to weigh the importance of different parts of the input when processing each token. It's the key innovation behind models like BERT, GPT, and modern LLMs.
Self-attention lets each token "look at" all other tokens and decide which ones are most relevant for understanding its context.
How Attention Works
Query, Key, Value
Each token is projected into three vectors: Query (what am I looking for?), Key (what do I offer for matching?), and Value (what information do I carry?).
Compute Scores
Each token's Query is compared with every Key to compute attention scores, usually via a dot product scaled by the square root of the key dimension.
Softmax
Scores are normalized into a probability distribution using softmax, so each token's attention weights sum to 1.
Weighted Sum
Values are weighted by these probabilities and summed to produce each token's output representation (see the sketch below).
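As a minimal NumPy sketch of these four steps (the dimensions, random toy embeddings, and untrained projection matrices are illustrative assumptions, not the demo's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_k) projection matrices
    """
    Q = X @ W_q                               # 1. Queries: what each token is looking for
    K = X @ W_k                               #    Keys: what each token offers for matching
    V = X @ W_v                               #    Values: the information each token carries
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # 2. Dot-product scores, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)        # 3. Each row becomes a probability distribution
    return weights @ V, weights               # 4. Weighted sum of Values per token

# Toy example: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(weights.round(2))   # row i shows how much token i attends to each token j; rows sum to 1
```

The `weights` matrix printed at the end is exactly what the demo visualizes: one row per token, with brighter cells corresponding to larger attention probabilities.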
Attention Patterns
Self-Attention
Tokens attend to themselves and nearby context.
Long-Range Dependencies
Important tokens can attend to distant relevant tokens.
Positional Patterns
Early layers often attend locally, while later layers attend more globally.
Pro Tip: Multi-Head Attention
Real transformers use multiple attention "heads" that learn different patterns. GPT-3 has 96 heads per layer! This demo shows a simplified single-head view.
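For intuition, here is a toy sketch of the multi-head idea (the head count, dimensions, and random untrained projections are illustrative only): the model dimension is split across heads, each head attends independently, and the per-head outputs are concatenated.

```python
import numpy as np

def multi_head_attention(X, n_heads):
    """Illustrative multi-head self-attention: split d_model across heads,
    attend independently in each head, then concatenate the results."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    rng = np.random.default_rng(0)
    outputs = []
    for h in range(n_heads):
        # Each head gets its own (random, untrained) projections in this toy sketch,
        # which is why each head can learn a different attention pattern in a real model.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)   # softmax over each row
        outputs.append(weights @ V)                 # each head outputs (seq_len, d_head)
    return np.concatenate(outputs, axis=-1)         # back to (seq_len, d_model)

out = multi_head_attention(np.random.default_rng(1).normal(size=(6, 64)), n_heads=8)
print(out.shape)  # (6, 64)
```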
Frequently Asked Questions
Why is attention computationally expensive?
Standard attention has O(n²) time and memory complexity: every token attends to every other token, so the score matrix grows quadratically with sequence length. This is why context lengths were historically limited. Newer architectures reduce the cost with sparse attention and other approximations.
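To see why the quadratic term bites, here is a rough back-of-the-envelope calculation of the memory needed to store a single float32 attention matrix for one head in one layer (the sequence lengths are just example values):

```python
# One attention matrix at float32 costs n * n * 4 bytes.
for n in (1_024, 8_192, 131_072):
    bytes_needed = n * n * 4
    print(f"seq_len={n:>7,}: {bytes_needed / 2**20:,.0f} MiB")
# seq_len=  1,024: 4 MiB
# seq_len=  8,192: 256 MiB
# seq_len=131,072: 65,536 MiB
```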
What is causal/masked attention?
In autoregressive models like GPT, each token can only attend to earlier tokens, never to future ones. This is enforced by adding a triangular mask to the attention scores so that future positions receive zero weight after the softmax.
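A small NumPy sketch of how such a causal mask is typically built and applied (an illustration of the idea, not any specific library's implementation):

```python
import numpy as np

seq_len = 5
# Positions above the diagonal (j > i) are future tokens; set them to -inf
# so that softmax assigns them zero probability.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
masked = scores + mask
weights = np.exp(masked - masked.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
print(weights.round(2))  # row i has non-zero weight only on tokens 0..i
```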
Related Tools
Tokenization Visualizer
See how text becomes tokens.
Embedding Visualizer
Explore token embeddings in 2D.
