
LLMs Explained, Part 2: How the Transformer Works

How the 2017 'Attention Is All You Need' paper threw out the RNN, what self-attention actually does, and why this one architecture became the foundation of every LLM today.

Ram Bakthavachalam

In Part 1 we ended with the field stuck on one problem. Recurrent Neural Networks (RNNs) read sentences one word at a time. To process word 100, you had to first process words 1 through 99. Training was slow. As datasets grew, the RNN became the bottleneck.

The fix arrived in June 2017. A team of eight researchers at Google and the University of Toronto published a paper called "Attention Is All You Need." The title was almost a joke. They were saying that if you want a sequence model, you do not need RNNs, you do not need convolutions, you only need attention.

This post is about that paper. By the end you will know what a transformer is, what self-attention means, and why this one design replaced everything that came before it.

The math in the paper is heavy. The ideas behind it are simple. I will skip the equations.

If terms like "weight matrix" or "embedding" are new to you, read How Neural Networks Work first. The rest of this post assumes those basics.

The big idea: throw out the RNN

The 2014 attention mechanism we ended Part 1 with was an add-on. It helped a decoder pay attention to the right source words, but the model underneath was still an RNN reading the sequence step by step.

The 2017 paper asked a strange question: what if the entire model is just attention? No RNN at all. No recurrence anywhere.

The answer, surprisingly, was yes. The transformer is built almost entirely from attention layers and simple feed-forward layers. There is no sequential reading. Every word in the sentence is processed at the same time, in parallel.

That sounds like a small change. It was not. Removing the RNN unlocked three things at once:

  1. Training got much faster. A GPU could process all the words at the same time, instead of waiting for each previous step.
  2. Long-range connections became cheap. In an RNN, connecting word 1 to word 100 means passing information through 99 in-between steps. In a transformer, every word can directly look at every other word in one operation.
  3. The architecture scaled cleanly. You could stack many transformer layers on top of each other, train on huge data, and quality kept improving in ways nobody had seen with RNNs.

These three together are why transformers won. The first point is practical. The second is a quality benefit. The third is what made GPT-3 and everything after it possible.

What is self-attention?

This is the key idea. Let me explain it slowly.

In Part 1, "attention" meant the decoder looking back at the source words. That was attention from one sequence to another. Self-attention is different: every word in a sentence attends to every other word in the same sentence.

Imagine the sentence "The cat sat on the mat because it was tired." When you read the word "it," you need to figure out what "it" refers to. Is "it" the cat? The mat? Self-attention is the math that lets the model do this lookup.

Each word in the sentence produces three different vectors. The paper calls them Query, Key, and Value (Q, K, V). The names sound abstract. Here is the intuition.

Self-attention: each word produces a Query, Key, and Value vector. The Query of one word is matched against the Keys of all other words to produce attention weights, and those weights mix the Values into a new representation.

Think of it as a small library lookup, happening for every word at the same time:

  • The Query is "what am I looking for?" It is the current word asking a question about its context.
  • The Key is "what am I about?" Every word advertises what kind of information it carries.
  • The Value is "what should I contribute?" The actual content the word offers if you decide to use it.

For every word, you take its Query and compare it against the Keys of every word in the sentence, itself included. The comparison (in the paper, a dot product between the two vectors) gives a similarity score: high if the Query matches the Key, low if not. You normalize these scores so they sum to 1 (using an operation called softmax), which turns them into weights. Then you take a weighted average of the Values using those weights. That weighted average becomes the new representation of the word.

That is self-attention.

In our example, the Query of "it" might match strongly with the Key of "cat." So when we compute the new representation of "it," it gets mixed heavily with the Value of "cat." The model has effectively figured out that "it" refers to "cat."
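Here is a minimal sketch of that lookup in numpy, for the single word "it." Every number and vector below is invented purely for illustration; a real model learns these values during training and uses hundreds of dimensions, not two.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy Keys and Values for four context words (numbers made up for illustration).
keys   = np.array([[2.0, 0.0],   # "cat"
                   [0.0, 2.0],   # "sat"
                   [0.3, 0.3],   # "mat"
                   [1.0, 1.0]])  # "tired"
values = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0],
                   [7.0, 8.0]])

query_it = np.array([2.0, 0.2])           # the Query vector of the word "it"

scores  = keys @ query_it                 # compare "it" against every Key
weights = softmax(scores / np.sqrt(2))    # scale, then normalize so the weights sum to 1
new_it  = weights @ values                # weighted average of the Values

print(weights)   # the largest weight falls on "cat", so its Value dominates new_it
```

In this toy run the weight on "cat" comes out close to 0.7, so the new representation of "it" is mostly a copy of the "cat" Value, which is exactly the lookup described above.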

The beautiful part is that nothing here is hand-coded. Nobody tells the model that "it" should attend to "cat." The Q, K, V projections are just three more sets of weights inside the network, and they get learned from data. After training on enough examples, the model figures out for itself that "it" and "cat" should attend to each other, that subjects should attend to verbs, that adjectives should attend to the nouns they modify, and so on.

Where do Q, K, V come from?

Each word starts as an embedding: a fixed-size vector of numbers that represents the word. The embeddings themselves are learned during training (we covered this in Part 1).

Then the model has three weight matrices, one for Q, one for K, one for V. These are just tables of numbers, also learned during training. To get the Q vector of a word, you multiply its embedding by the Q matrix. To get K, you multiply by the K matrix. Same for V. Three matrix multiplications, that is the whole step.

The same three matrices are used for every word in the sentence. The Q matrix is applied to "the", "cat", "sat", and so on. Each word produces its own Q, K, V because the embeddings are different, but the projection rules are shared.

Nobody hand-designs what Q, K, V should be. The matrices are part of the model's parameters and get adjusted by gradient descent. The training signal comes from the final task (predict the next word, or fill in the blank). Over millions of examples, the three matrices end up as projections that make the attention lookup useful.
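For concreteness, here is what those three matrix multiplications look like in numpy. The sizes are tiny and the weights random, standing in for matrices a real model would have learned (the names W_q, W_k, W_v are mine, chosen for readability):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model  = 8                    # embedding size (tiny here; the 2017 paper used 512)
n_tokens = 5                    # "the cat sat on the ..."

X = rng.normal(size=(n_tokens, d_model))    # one embedding per token

# The three learned projection matrices, shared across every position.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = X @ W_q     # each row is one token's Query
K = X @ W_k     # each row is one token's Key
V = X @ W_v     # each row is one token's Value
```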

One small thing for later. In multi-head attention each head has its own Q, K, V matrices. So an 8-head block has 24 separate projection matrices in its attention sub-layer.

A small note on terminology. The transformer does not actually work on whole words. It works on tokens, which are small chunks of text. Common words are usually one token each, rare words get split into multiple tokens. For the purposes of understanding, "word" and "token" mean the same thing in this post.

A transformer block

Self-attention is the heart of the transformer. But a transformer block adds a few more things around it.

A typical block has four parts:

  1. Self-attention. The part we just described.
  2. A feed-forward neural network. A simple two-layer network applied to each token's representation independently. This is where most of the model's parameters actually live.
  3. Residual connections. The original input is added back after each step. This helps gradients flow through deep stacks of layers without vanishing.
  4. Layer normalization. A small math step that keeps the numbers stable across many stacked layers.

The combination is mechanical but powerful. Self-attention does the global "where should I look for relevant information" step. The feed-forward layer does the "now process what I found" step. Residuals make sure the original information is not lost. Layer normalization keeps the training stable.
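To make the plumbing concrete, here is one transformer block sketched in numpy. It is a simplified, single-head version with random weights instead of trained ones, and it omits details such as the learned scale and bias inside layer norm, but the shape of the computation matches the description above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu    = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every Query against every Key
    return softmax(scores) @ V                # weighted mix of the Values

def feed_forward(X, W1, W2):
    return np.maximum(0, X @ W1) @ W2         # two layers, applied to each token independently

def transformer_block(X, p):
    # 1. self-attention, with a residual connection and layer norm
    X = layer_norm(X + self_attention(X, p["W_q"], p["W_k"], p["W_v"]))
    # 2. feed-forward, again with a residual connection and layer norm
    X = layer_norm(X + feed_forward(X, p["W1"], p["W2"]))
    return X

# Toy run: 5 tokens, model width 8 (real models use hundreds or thousands).
rng = np.random.default_rng(0)
d, d_ff = 8, 32
shapes = {"W_q": (d, d), "W_k": (d, d), "W_v": (d, d), "W1": (d, d_ff), "W2": (d_ff, d)}
params = {name: rng.normal(size=s) for name, s in shapes.items()}
X = rng.normal(size=(5, d))
out = transformer_block(X, params)   # same shape in, same shape out: (5, 8)
```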

Then you stack the block many times. The original 2017 transformer used 6 blocks for the encoder and 6 for the decoder. GPT-3 used 96 blocks. Today's frontier models stack a hundred or more. Each additional layer learns more abstract patterns on top of the previous ones.

Multi-head attention

There is one more wrinkle worth knowing. The transformer does not do self-attention once per layer. It does it multiple times in parallel, with different sets of Q, K, V projections. These parallel copies are called attention heads. The original paper used 8 of them per layer.

The intuition: a sentence has many kinds of relationships. Some heads might learn to track which pronoun refers to which noun (like "it" → "cat"). Others might track grammatical structure (subject, verb, object). Others might track topic relationships, or which adjective modifies which noun. Doing many heads in parallel lets the model learn many kinds of patterns at the same time.

The outputs of all the heads are then combined back into one vector and passed to the next layer. The number of heads (8, 16, 32, ...) is just a hyperparameter, picked by the model designer. Bigger models use more.
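A sketch of that combination step, again in numpy with random weights in place of learned ones. Each head gets its own small Q, K, V projections; the outputs are concatenated and passed through one more learned matrix (the output projection) before moving on to the next layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def one_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def multi_head_attention(X, heads, W_o):
    # Run each head independently, concatenate the results, project back down.
    outputs = [one_head(X, *h) for h in heads]
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, n_heads = 16, 8
d_head = d_model // n_heads           # each head works in a smaller subspace
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]     # 8 heads x 3 matrices = 24 projections
W_o = rng.normal(size=(d_model, d_model))
X = rng.normal(size=(5, d_model))
out = multi_head_attention(X, heads, W_o)   # shape (5, 16), same as the input
```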

How does the model know word order?

If you process all words in parallel, there is no natural sense of "first" or "last." This is a problem for sentences, where order matters. "The dog bit the man" and "The man bit the dog" use the same words.

The fix is called positional encoding. Before any attention happens, the model adds a vector to each word that encodes its position. Position 1 gets a different vector from position 2, and so on. The original paper used a fixed mathematical function (sine and cosine waves at different frequencies) to generate these vectors. Modern models often learn the position vectors directly from data, or use newer schemes like RoPE (rotary positional encoding) and ALiBi.
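For the curious, the original sine/cosine scheme is only a few lines. A rough numpy version (the constant 10000 and the even/odd split come from the paper; everything else here is simplified):

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    # Each position gets sine and cosine waves at different frequencies.
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    enc = np.zeros((n_positions, d_model))
    enc[:, 0::2] = np.sin(angles)   # even dimensions: sine
    enc[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return enc

# Added to the word embeddings before the first attention layer:
pos_enc = sinusoidal_positions(n_positions=10, d_model=8)
# X = word_embeddings + pos_enc    (same shape, element-wise addition)
```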

The point is: the transformer architecture itself is order-blind. Position is supplied as a feature added to the inputs. This sounds like a small detail, but it has practical consequences. The way a model handles position determines how well it generalizes to longer inputs than it was trained on. RoPE became popular partly because it handles this generalization better than the original 2017 scheme.

The original transformer: encoder + decoder

The 2017 paper had two halves.

The encoder read the source sentence (say, English) and produced a rich representation of every word. Six identical transformer blocks stacked on top of each other. Self-attention let every English word see every other English word.

The decoder produced the target sentence (say, French) one word at a time. It also had six blocks, but each block had three main sub-layers instead of two (residual connections and layer normalization still wrapped each one):

  1. Self-attention over the French words produced so far. This had a mask: a French word at position 5 could only attend to French words at positions 1 to 4, not the future ones it had not yet generated (see the sketch after this list).
  2. Cross-attention from the French words to the encoder output. This was the same idea as Bahdanau's 2014 attention from Part 1 — let the decoder pull relevant information from the source sentence.
  3. A feed-forward layer.
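That mask is simpler than it sounds: before the softmax, the score of every "future" position is set to negative infinity, so it receives a weight of exactly zero. A rough numpy sketch, with the projection matrices again standing in for learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Causal mask: position i may only attend to positions 0..i.
    # Future positions get a score of -inf, so softmax assigns them weight 0.
    n = X.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[future] = -np.inf
    return softmax(scores) @ V
```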

The architecture was originally built for translation. English came in, French came out. The 2017 paper's main result was that this transformer beat the state-of-the-art for English-to-German and English-to-French translation, and it trained faster than the best RNN-based systems.

You don't need to remember all the details. The thing to notice is that the original transformer was designed as an encoder-decoder model for a specific task (translation). Within three years, people would discover that you could simplify the architecture in different ways, and that the simpler versions were better for many tasks.

Why this won: parallelism

The practical reason transformers beat RNNs is parallelism. Modern GPUs are good at one thing: doing many similar operations at the same time. RNNs do not let you do that, because step N depends on step N-1. You have to wait. A transformer has no such dependency. All the words can be processed at the same time on the GPU.

Sequential RNN vs parallel transformer: an RNN processes words one at a time and the GPU sits idle waiting for each step. A transformer processes all words at once.

This single property is why transformers won. You could throw 10 times more compute at training, and the model could actually use that compute. RNNs would have wasted most of it waiting for previous steps.

It also meant that as GPUs got faster (and as data centers got bigger), transformer models could keep getting bigger. The architecture was "compute-friendly" in a way nothing before it had been. This is the property that, three years later, would make GPT-3 possible.

The fork in the road: BERT, GPT, T5

The original transformer was an encoder-decoder for translation. People quickly realized you could use just one half of it, and the simpler models often worked better for specific tasks.

This led to three architectural families that you still see today:

The three transformer architectures: encoder-only (BERT) for understanding tasks, decoder-only (GPT) for generation, and encoder-decoder (T5) for sequence-to-sequence tasks.

Encoder-only (BERT, 2018). Just the encoder stack. Trained to predict masked words: take a sentence, hide 15% of the words at random, and ask the model to fill them in. Because every word can attend to every other word in the encoder, BERT learns very rich representations. This is good for tasks that need understanding: classification, named-entity recognition, question answering on a fixed passage. Google integrated BERT into search in 2019. For about three years, BERT and its variants (RoBERTa, ALBERT, DistilBERT) were the standard for NLP tasks at most companies.

Decoder-only (GPT, 2018). Just the decoder stack, without the cross-attention to an encoder. Trained to predict the next word in a sequence, looking only at the words before it (the masked self-attention enforces this). This is good for generation: continue a paragraph, finish a sentence, write code. OpenAI's GPT-1 (2018) was small (117M parameters) and not impressive. GPT-2 (2019) was bigger (1.5B parameters) and started to feel surprising. The decoder-only design is what every modern LLM (GPT, Claude, Gemini, Llama, DeepSeek, Qwen) descends from.

Encoder-decoder (T5, 2019). Like the original transformer. Useful when input and output are both sequences but might be different (translation, summarization). Google's T5 framed every NLP task as text-to-text: you give it text in, it produces text out, and the type of task is just a prefix on the input ("translate English to German: ..."). T5 is still used today for some tasks but the encoder-decoder shape has lost ground to decoder-only models.

Of these three, the decoder-only family won the race to today's LLMs. The reason is simple. Decoder-only models are trained on raw text from the internet without any labels: you take a sequence and ask the model to predict the next word, again and again. That trick scales to trillions of words, and the same next-word machinery is what lets the model generate text in the first place. Encoder-only models also pre-train on raw text, but they cannot generate text on their own, so they stay useful mainly for understanding tasks. Encoder-decoder models can generate, but the split into two halves adds complexity, and in practice the simpler decoder-only design proved just as capable once models got large. That simplicity, paired with effectively unlimited raw-text training data, is exactly the property that mattered when scale started to dominate.
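The "no labels needed" part is easy to see concretely. The training targets are just the text itself, shifted by one position; a toy illustration:

```python
# Raw text becomes training data with no human labels:
# each position's target is simply the next token in the text.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

inputs  = tokens[:-1]    # what the model sees
targets = tokens[1:]     # what it must predict at each position

for i, target in enumerate(targets, start=1):
    print(f"given {inputs[:i]} -> predict {target!r}")
```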

What's next in Part 3

By the end of 2019, the transformer had taken over NLP. Researchers had three architectural shapes (encoder-only, decoder-only, encoder-decoder), the math worked, and the engineering was getting easier.

What nobody fully appreciated yet was scale. Almost everything we now associate with LLMs (chat, instruction following, reasoning, multilingual ability, code generation, agents) came not from new architectural ideas, but from making the same decoder-only transformer bigger, training it on more data, and running it longer.

The change started with one paper in May 2020. OpenAI took the GPT-2 design, scaled it up by about 100 times, and got a model called GPT-3. The capabilities that emerged from that scale are the start of the LLM era as we know it.

That is Part 3.

What to take from Part 2

Three things worth remembering before we continue.

The first is that the entire transformer is built around one operation: self-attention. Every word produces a Query, Key, and Value, and uses its Query to look up information across all other words. Once you understand this lookup, the rest of the architecture is plumbing around it (feed-forward layers, residuals, layer norm, multiple heads, positional encoding). All of it supports the central idea.

The second is that the transformer's biggest practical win was parallelism. RNNs forced sequential computation. Transformers let GPUs do all the work at once. This is why you can train a transformer on trillions of tokens, but training an RNN at that scale would have been impossible.

The third is that there are three architectural shapes that came out of 2017: encoder-only (BERT for understanding), decoder-only (GPT for generation), and encoder-decoder (T5 for sequence-to-sequence). The decoder-only design won the LLM race because it can be trained on raw internet text without labels, and that property is exactly what scaling laws would later reward.

See you in Part 3.