Attention Demo

Visualize self-attention patterns in transformer models

This is a simplified demonstration of self-attention. Real transformer attention patterns are learned and much more complex, with multiple heads and layers.

Input Sentence

Attention Patterns

Click a token to see what it attends to. Brighter colors = stronger attention.

What is Attention?

Attention is the mechanism that allows transformer models to weigh the importance of different parts of the input when processing each token. It's the key innovation behind models like BERT, GPT, and modern LLMs.

Self-attention lets each token "look at" all other tokens and decide which ones are most relevant for understanding its context.

How Attention Works

1. Query, Key, Value

Each token creates three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?).

2. Compute Scores

Each token's Query is compared with every Key to compute attention scores, usually via a scaled dot product.

3. Softmax

The scores for each token are normalized into probabilities with softmax, so its attention weights sum to 1.

4. Weighted Sum

The Values are weighted by the attention probabilities and summed to produce the token's output representation, as in the sketch below.
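
Putting the four steps together, here is a minimal single-head sketch in numpy; the token count, dimensions, and random weights are toy assumptions, not values from any real model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 8            # toy sizes (assumption)

X = rng.normal(size=(n_tokens, d_model))        # token embeddings
W_q = rng.normal(size=(d_model, d_head))        # learned projections (random here)
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

# Step 1: each token produces Query, Key, and Value vectors
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Step 2: compare every Query with every Key (scaled dot product)
scores = Q @ K.T / np.sqrt(d_head)              # shape: (n_tokens, n_tokens)

# Step 3: softmax turns each row of scores into attention probabilities
weights = softmax(scores, axis=-1)              # each row sums to 1

# Step 4: weighted sum of Values gives each token's output representation
output = weights @ V                            # shape: (n_tokens, d_head)

print(weights.round(2))
```

The `weights` matrix is exactly the kind of n × n attention map this demo visualizes: row i shows how strongly token i attends to every token in the sentence.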

Attention Patterns

Self-Attention

Tokens attend to themselves and nearby context.

Long-Range Dependencies

Important tokens can attend to distant relevant tokens.

Positional Patterns

Early layers often attend locally, while later layers attend more globally; the sketch below shows one way to inspect this in a real model.
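
One way to look at these patterns for real is to pull the attention matrices out of a pretrained model. A minimal sketch using the Hugging Face transformers library (this assumes transformers and PyTorch are installed; bert-base-uncased is just a convenient example checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"  # example checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq, seq)
first, last = outputs.attentions[0], outputs.attentions[-1]
print(first.shape)                      # e.g. torch.Size([1, 12, 8, 8])
print(first[0, 0].numpy().round(2))     # head 0 of layer 0: one attention map
```

Plotting maps from early versus late layers is a quick way to look for the local-to-global progression described above.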

Pro Tip: Multi-Head Attention

Real transformers use multiple attention "heads" that learn different patterns. GPT-3 has 96 heads per layer! This demo shows a simplified single-head view.
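
To make the multi-head idea concrete, the single-head sketch above extends to several heads like this; the head count and dimensions are toy assumptions, and real transformers also apply a final output projection, omitted here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, n_heads, d_head = 5, 4, 8        # toy sizes (assumption)
d_model = n_heads * d_head

X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Project once, then split the last dimension into separate heads
def split_heads(M):
    return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

# Each head computes its own attention pattern independently
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, tokens, tokens)
weights = softmax(scores, axis=-1)

# Concatenate the per-head outputs back into one vector per token
out = (weights @ V).transpose(1, 0, 2).reshape(n_tokens, d_model)
print(weights.shape, out.shape)            # (4, 5, 5) (5, 32)
```

Splitting into heads costs no extra parameters compared with one big head of the same total width, but it lets each head specialize in a different pattern.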

Frequently Asked Questions

Why is attention computationally expensive?

Standard attention has O(n²) complexity: every token attends to every other token, so compute and memory grow quadratically with sequence length. This is why context lengths were historically limited. Newer architectures use sparse attention or other tricks to cut the cost.
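
Some rough arithmetic makes the quadratic growth concrete; the sequence lengths below are illustrative:

```python
# The score matrix is n x n, so memory per head grows quadratically with n
for n in (1_024, 8_192, 131_072):
    scores = n * n
    print(f"n={n:>7}: {scores:>17,} scores ~ {scores * 4 / 1e6:,.0f} MB at fp32")
```

Doubling the context length quadruples the size of every attention map.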

What is causal/masked attention?

In autoregressive models like GPT, tokens can only attend to previous tokens (never future ones). This is enforced by masking out the upper triangle of the attention score matrix before the softmax.
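
A minimal sketch of the mask itself, using numpy and toy random scores:

```python
import numpy as np

n = 5
scores = np.random.default_rng(0).normal(size=(n, n))

# Causal mask: position i may only attend to positions j <= i
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
scores = np.where(mask, -np.inf, scores)           # masked scores -> 0 after softmax

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))   # lower-triangular attention matrix
```

Setting masked scores to negative infinity means softmax assigns them exactly zero probability, so no information leaks from future tokens.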

Related Tools

Tokenization Visualizer

See how text becomes tokens.

Embedding Visualizer

Explore token embeddings in 2D.