One paper changed everything: Attention Is All You Need. This post breaks down the foundations behind attention, starting from embeddings. A simple journey from words to meaning.
Attention Mechanisms
I want to talk about attention.
The paper that changed everything.
The paper that gave birth to ChatGPT and forced the rest of the world to wake up.
“Attention Is All You Need.”
Today, we’re not going to talk about the entire Transformer architecture.
We’ll focus on attention.
But before we get there, we need to understand something fundamental:
embeddings.
Embeddings
When we talk to a language model like ChatGPT or Gemini, we use words.
That’s obvious.
But algorithms don’t work with words.
They work with numbers.
This is where embeddings come in.
Embeddings are numerical representations of words inside a vector space.
Each word is mapped to a vector — a list of numbers — that captures meaning, context, and relationships with other words.
Words that mean similar things end up close together in that space.
And that leads to an important question:
What is a vector space?
A vector space is an algebraic structure defined over a set of elements where:
you can add elements together,
and scale them using numbers.
Formally speaking:
A vector space is the algebraic structure that arises when we equip a set of elements with one internal operation (vector addition) and one external operation (scalar multiplication by elements of a field).
In the dropdown below, you can find the formal definition of a vector space (sorry, I just really love mathematics).
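For completeness, here is a compact version of that formal definition (standard linear algebra, nothing specific to this post): a vector space over a field $F$ (for example $\mathbb{R}$) is a set $V$ equipped with
vector addition $+ : V \times V \to V$,
and scalar multiplication $\cdot : F \times V \to V$,
such that addition is associative and commutative, has a zero vector and additive inverses, and scalar multiplication satisfies $a(u + v) = au + av$, $(a + b)v = av + bv$, $a(bv) = (ab)v$, and $1\,v = v$ for all $u, v \in V$ and $a, b \in F$.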
So now, each word is represented inside a vector space that captures its meaning.
And something very powerful happens.
Once everything is expressed as vectors, we can add, subtract, scale, and combine them mathematically.
Even more importantly, metrics emerge.
That means we can measure distances between vectors.
We can compute similarities.
We can compare meanings.
For example, dog and cat are both animals, so the distance between their vectors will be small.
But the distance between cat and television will be much larger.
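To make that concrete, here is a tiny NumPy sketch. The three-dimensional "embeddings" below are made up by hand purely for illustration; real embeddings are learned and have hundreds of dimensions.

```python
import numpy as np

# Toy, hand-made vectors for illustration only (not real learned embeddings).
embeddings = {
    "dog":        np.array([0.90, 0.80, 0.10]),
    "cat":        np.array([0.85, 0.75, 0.15]),
    "television": np.array([0.10, 0.20, 0.95]),
}

def cosine_similarity(a, b):
    # Close to 1.0 = similar direction (similar meaning); lower = less related.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["dog"], embeddings["cat"]))         # high (~1.0)
print(cosine_similarity(embeddings["cat"], embeddings["television"]))  # much lower
```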
By giving words a vector structure, we unlock a whole new set of operations.
And this is where attention mechanisms come in.
This is the first step of large language models:
understanding the context surrounding a word.
Attention: Understanding Context
We already know that text can be turned into vector representations that encode meaning.
Well… “already know” is a bit generous, because we haven’t explained that transformation yet.
Quick preview: embeddings are produced by another model. I won’t spoil it now—we’ll cover that in a separate post.
So, assume each word/token is now a vector. Great.
Next challenge: capture how words relate to each other inside a sentence.
Example:
The elephant is very large.
Here, elephant should pay a lot of attention to large (since large describes the elephant), while is matters less for this specific relation.
So how do we model that mathematically?
This is where the fun begins.
Get ready: we’re back to high-dimensional vector spaces and operations on them.
Let a token be represented as a vector $x \in \mathbb{R}^d$,
where $d$ is the embedding dimension (often hundreds or even thousands).
Now we apply linear projections to this vector.
The core idea is:
Instead of working directly in the full $d$-dimensional space, we project into smaller learned subspaces of size $d_k$, where $d_k < d$.
To do this, we use three learned matrices $W_Q$, $W_K$, $W_V \in \mathbb{R}^{d_k \times d}$,
and compute:
$$q = W_Q\,x, \qquad k = W_K\,x, \qquad v = W_V\,x$$
These are called:
Query (q): what this token is looking for
Key (k): what this token offers / how it can be matched
Value (v): the information this token will pass on if selected
That’s the foundation of attention:
Tokens ask questions (queries), compare with others’ keys, and gather weighted information from their values.
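Here is a minimal NumPy sketch of those projections. The matrices are random stand-ins for learned weights, and the sizes $d = 8$, $d_k = 4$ are made up for illustration.

```python
import numpy as np

d, d_k = 8, 4                    # illustrative sizes, not real model dimensions
rng = np.random.default_rng(0)
x = rng.normal(size=d)           # one token embedding, x in R^d

W_Q = rng.normal(size=(d_k, d))  # in a real model these matrices are learned
W_K = rng.normal(size=(d_k, d))
W_V = rng.normal(size=(d_k, d))

q = W_Q @ x                      # query: what this token is looking for
k = W_K @ x                      # key:   how this token can be matched
v = W_V @ x                      # value: what it passes on if selected

print(q.shape, k.shape, v.shape)  # (4,) (4,) (4,) -> vectors in R^{d_k}
```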
What we are doing in the formula, multiplying by the weight matrices $W_Q$, $W_K$, $W_V$, is projecting vectors onto learned subspaces.
For more information you can check the following deep dive.
These are linear transformations in algebra.
More precisely, each multiplication by a weight matrix maps the same token representation into a different feature space.
This does not simply scale numbers; it reorganizes information so different aspects of meaning become easier to separate and use.
By the way, one key point:
$W_Q$, $W_K$, and $W_V$ are trainable parameters (model weights).
They are not fixed by hand. During training, the model sees lots of data, makes predictions, measures error, and updates these matrices through gradient descent/backpropagation.
So over time, these weights learn:
how to build better queries (what to look for),
how to build better keys (how tokens should be matched),
and how to build better values (what information should be passed forward).
In short: the model learns from data how attention should work, by continuously adjusting $W_Q$, $W_K$, and $W_V$.
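As a rough sketch of that idea (in PyTorch, with made-up sizes and a toy loss that exists only so gradients have somewhere to flow), the projection matrices are ordinary trainable weights:

```python
import torch
import torch.nn as nn

# Minimal sketch, not a full Transformer training loop: it only shows that the
# projections are trainable parameters updated by backpropagation.
d, d_k = 8, 4
proj_q = nn.Linear(d, d_k, bias=False)   # plays the role of W_Q
proj_k = nn.Linear(d, d_k, bias=False)   # W_K
proj_v = nn.Linear(d, d_k, bias=False)   # W_V

params = (list(proj_q.parameters())
          + list(proj_k.parameters())
          + list(proj_v.parameters()))
optimizer = torch.optim.SGD(params, lr=0.01)

x = torch.randn(5, d)                    # 5 random "token embeddings"
q, k, v = proj_q(x), proj_k(x), proj_v(x)

loss = (q * k).sum() ** 2                # toy placeholder loss
loss.backward()                          # gradients flow into W_Q, W_K, W_V
optimizer.step()                         # the matrices are adjusted
```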
Let’s Dig Deeper into Attention
At this point, we know that Transformers learn projection weights to compute attention.
Now the key question is: how is attention written mathematically?
The core formula is:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
Because embeddings are vectors, we can compare them numerically.
Breaking down the formula
Compute similarity scores
Each entry $s_{ij} = q_i \cdot k_j$ measures how much token $i$ (as a query) matches token $j$ (as a key).
Scale the scores by $\sqrt{d_k}$
Without scaling, dot products can grow large when $d_k$ is large, making softmax too peaky and gradients unstable.
Normalize with softmax (row-wise)
This gives attention weights $a_{ij}$:
each weight is in $[0, 1]$,
each row sums to $1$,
larger scores get larger weights.
Weighted sum of values
For each token $i$, the output $o_i = \sum_j a_{ij}\, v_j$ is a weighted combination of all value vectors $v_j$.
So token $i$ “looks more” at tokens with higher attention weights.
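Putting the four steps together, here is a minimal NumPy sketch of scaled dot-product attention. The queries, keys, and values are random stand-ins for the projected vectors from earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_k = 5, 4
q = rng.normal(size=(n_tokens, d_k))   # queries  (one row per token)
k = rng.normal(size=(n_tokens, d_k))   # keys
v = rng.normal(size=(n_tokens, d_k))   # values

# 1) similarity scores: s_ij = q_i . k_j
scores = q @ k.T                       # shape (n_tokens, n_tokens)

# 2) scale by sqrt(d_k) to keep the scores in a reasonable range
scores = scores / np.sqrt(d_k)

# 3) row-wise softmax -> attention weights a_ij (each row sums to 1)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

# 4) weighted sum of values: o_i = sum_j a_ij * v_j
output = weights @ v                   # shape (n_tokens, d_k)
```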
Why do we need softmax?
Without softmax, scores are unbounded raw numbers (negative, positive, any scale).
With softmax, they become a clean probability-like distribution over tokens:
Easy to interpret (“how much attention goes to each token”)
Comparable across positions
Differentiable and stable for training.
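A tiny illustration of that effect, with made-up raw scores:

```python
import numpy as np

scores = np.array([2.0, 0.5, -1.0])            # unbounded raw numbers
weights = np.exp(scores) / np.exp(scores).sum()

print(weights)        # ~[0.79, 0.18, 0.04] -> every weight is in [0, 1]
print(weights.sum())  # 1.0                 -> a probability-like distribution
```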