In Part 2 we ended with the transformer settled. Three architectural shapes (encoder-only, decoder-only, encoder-decoder), the math working, and decoder-only models lined up to ride one specific advantage: they can be trained on raw internet text without labels. By the end of 2019, that property was about to matter more than anyone expected.
This post is about what happened next. From May 2020 to November 2022, the field went from "the transformer is interesting research" to "every tech company has an AI strategy." Almost nothing about the architecture changed. What changed was scale, and a small set of tricks for making big models actually useful.
By the end of this post you will understand the path from GPT-3 to ChatGPT, the alignment recipe (instruction tuning plus RLHF) that made models follow your instructions, and why a single product release in late 2022 became the moment everything went mainstream.
If "neural network" or "transformer" are new terms, start with How Neural Networks Work and Part 2 of this series. The rest of this post assumes those basics.
The 2020 surprise: GPT-3
In May 2020, OpenAI published a paper called "Language Models are Few-Shot Learners". The model was GPT-3. It used the same decoder-only transformer design as GPT-2 from 2019. The only meaningful change was size. GPT-2 had 1.5 billion parameters. GPT-3 had 175 billion. Roughly 100 times bigger. Trained on around 300 billion tokens of internet text. The training run cost somewhere between 4 and 12 million dollars depending on whose estimate you trust.
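If you want to know where estimates like that come from, the back-of-the-envelope math is short. A standard approximation is that training takes about 6 floating-point operations per parameter per training token. A minimal sketch, using the numbers above:

```python
# Rough training compute for GPT-3, using the standard C ~ 6 * N * D
# approximation (about 6 FLOPs per parameter per training token).
n_params = 175e9   # parameters
n_tokens = 300e9   # training tokens

flops = 6 * n_params * n_tokens
print(f"{flops:.2e} FLOPs")  # ~3.15e+23, in line with commonly cited figures
```

Dollar figures then depend on what you assume a GPU-hour cost in 2020, which is why the public estimates span such a wide range.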
Nobody outside OpenAI quite saw this coming. The expectation was that bigger models would do incrementally better at the same things smaller models already did. What actually happened was that new abilities appeared.
GPT-2 could not really translate. GPT-3 could. GPT-2 could not write working Python code. GPT-3 could (often). GPT-2 could not solve simple word problems. GPT-3 could (sometimes). The architecture did not change. The training task did not change. The only thing that changed was that there were a hundred times more parameters, learning from a lot more text.
The most surprising new ability was in-context learning, also called few-shot prompting. You could put a few examples in the prompt itself, and the model would pick up the pattern from those examples and apply it to a new input. No retraining. No fine-tuning.
Here is the kind of prompt that worked:
Translate English to French.
sea otter -> loutre de mer
peppermint -> menthe poivrée
cheese ->
GPT-2 would generate something random after "cheese ->". GPT-3 would write "fromage". The model was learning the task from the examples in the prompt and continuing the pattern. This was the first hint that prompting itself was a kind of programming.
In retrospect this is what happens when a next-token predictor reads enough text. It has seen translation pairs, it has seen lists of examples, it has seen instructions followed by answers. With enough examples and enough scale, the model gets good enough at "what comes next given this context" that giving it the right context becomes the way you tell it what to do.
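To make "prompting is programming" concrete, here is a minimal sketch. The `complete` function is a stand-in for whatever text-completion API you have access to, not a real library call; the point is that the entire "program" lives in the prompt string:

```python
# A few-shot prompt is just a string. `complete` is a placeholder for a
# text-completion API; the program is the prompt itself.
def make_prompt(task, examples, query):
    lines = [task]
    for source, target in examples:
        lines.append(f"{source} -> {target}")
    lines.append(f"{query} ->")
    return "\n".join(lines)

prompt = make_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
# completion = complete(prompt)  # a GPT-3-class model continues: "fromage"
```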
Scaling laws: bigger reliably means better
In January 2020, a few months before GPT-3 was public, a quieter paper from OpenAI called "Scaling Laws for Neural Language Models" (Kaplan et al.) made a different but equally important claim. If you measured the test loss of a language model and varied three things (the number of parameters, the size of the training dataset, and the amount of compute used), the loss came down along a smooth, predictable power-law curve.
This was the recipe. You did not need to invent a new architecture. You just needed to make the existing one bigger and feed it more text and more compute, and you knew roughly how much better it was going to get.
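Here is what "predictable" meant in practice. A minimal sketch of the model-size law, using the approximate constants reported in the Kaplan et al. paper (treat these as illustrative fits, not universal numbers):

```python
# Kaplan et al.'s model-size scaling law, roughly: test loss falls as a
# power of parameter count. Constants are the paper's approximate fits
# (loss in nats per token), assuming data and compute are not the bottleneck.
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in (1.5e9, 175e9):  # GPT-2 scale vs GPT-3 scale
    print(f"{n:.1e} params -> predicted loss ~{predicted_loss(n):.2f}")
```

A modest, smooth drop in loss per order of magnitude of scale. And, as GPT-3 demonstrated, a modest drop in loss can hide a large jump in capability.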
Two years later, in March 2022, DeepMind published "Training Compute-Optimal Large Language Models". The model was Chinchilla. The result was that most existing models were too big and trained on too little data. For a given compute budget, the optimal split was closer to equal scaling of parameters and tokens than the field had assumed.
This is why later models trained on a lot more text. Llama 2 (2023) trained on around 2 trillion tokens. Llama 3 (2024) on 15 trillion. Today's frontier models train on tens of trillions. The Chinchilla finding shifted budgets toward more tokens and away from "just make it bigger."
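The commonly quoted version of the Chinchilla result is roughly 20 training tokens per parameter. Combined with the compute approximation from earlier (training FLOPs ≈ 6 × parameters × tokens), you can sketch the compute-optimal split for any budget. The 20x ratio below is the popular approximation of the paper's finding, not an exact law:

```python
import math

# Compute-optimal allocation under the Chinchilla rule of thumb:
# scale parameters and tokens together, at ~20 tokens per parameter,
# with training FLOPs ~ 6 * N * D.
def chinchilla_split(compute_flops, tokens_per_param=20):
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_split(3.15e23)  # roughly GPT-3's training budget
print(f"params ~{n:.1e}, tokens ~{d:.1e}")
# -> ~5e10 params on ~1e12 tokens: a model a third the size of GPT-3,
#    trained on roughly three times the text, for the same compute.
```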
The practical consequence: from 2020 onward, the LLM field had a recipe. Want a better model? Spend more on compute, more on data. The reason the next few years felt like a steady march of improvements is that the recipe held.
From completion engine to assistant: instruction tuning
GPT-3 was impressive but weird to use. The base model was trained to predict next tokens given internet text. Internet text contains questions followed by wrong answers, instructions followed by mockery, and prompts that lead nowhere useful. So when you typed "write me a haiku about cats," GPT-3 might write the haiku, or might write five more example prompts, or might continue into a story about cats. It did not know that you wanted it to follow your instruction. It was completing text.
The fix came in two steps. The first step was supervised fine-tuning, often called instruction tuning. Hire human contractors. Have them write thousands of pairs of (instruction, ideal response). Fine-tune the model on those pairs. Now the model has seen the format you want. When it sees an instruction, it generates the kind of response the contractors were asked to write.
This is a small change to the training setup but a big change to behavior. Instead of just predicting "what would the internet write next," the model is being nudged to predict "what would a careful contractor write as a good response to this instruction."
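A minimal sketch of what that data preparation looks like, assuming a simple template (real labs each use their own formats and special tokens, and `tokenize` here is a stand-in for a real tokenizer):

```python
# Instruction-tuning data prep, sketched. The objective is still ordinary
# next-token prediction; the usual twist is that the loss is computed on
# the response tokens only, not on the instruction.
def build_example(instruction, response, tokenize):
    prompt_ids = tokenize(f"Instruction: {instruction}\nResponse: ")
    response_ids = tokenize(response)
    input_ids = prompt_ids + response_ids
    # -100 is the conventional "ignore this position" label value: the
    # model is not graded on reproducing the instruction, just the answer.
    labels = [-100] * len(prompt_ids) + response_ids
    return input_ids, labels
```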
Instruction tuning was enough to make the model usable. It was not enough to make it good.
RLHF: teaching the model what humans actually want
The second step was reinforcement learning from human feedback, RLHF. OpenAI's InstructGPT paper from early 2022 made this part of the public recipe.
The trick is that you cannot easily write "ideal responses" for every instruction in the world. There is no single ideal response. There are better and worse responses, and humans can usually tell which is which.
So instead of writing ideal answers, RLHF asks humans to rank. Generate several candidate responses to the same instruction. Have a human rank them from best to worst. Do this thousands of times.
From those rankings, you train a separate small model called a reward model. It takes an instruction and a response and outputs a number: how good a human would think this response is. Now you have a function that scores responses.
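The training signal for the reward model is pairwise. A minimal sketch of the standard loss used in InstructGPT-style RLHF (the scores passed in here are hypothetical reward-model outputs):

```python
import math

# Pairwise reward-model loss: push the score of the human-preferred
# response above the score of the rejected one. This is the standard
# Bradley-Terry-style objective from InstructGPT-style RLHF.
def preference_loss(score_chosen, score_rejected):
    margin = score_chosen - score_rejected
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

print(preference_loss(2.0, -1.0))  # ~0.05: agrees with the human ranking
print(preference_loss(-1.0, 2.0))  # ~3.05: got the pair backwards
```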
The last step is to use that reward model to fine-tune the language model further, using reinforcement learning. The language model generates a response. The reward model scores it. The language model gets nudged in whatever direction makes the reward higher. Repeat for a long time on many instructions.
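One important detail from the InstructGPT paper: the reward is not used raw. A penalty keeps the tuned model close to the pretrained one, because otherwise it quickly learns to exploit the reward model's blind spots. A sketch of the per-response signal (the `beta` value is illustrative):

```python
# Per-response training signal in the RL step: the reward model's score
# minus a KL-style penalty for drifting from the pretrained reference model.
def rl_signal(reward_score, logprob_policy, logprob_reference, beta=0.02):
    drift = logprob_policy - logprob_reference  # per-sample KL estimate
    return reward_score - beta * drift
```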
The result is a model that prefers the responses humans would prefer. It avoids the patterns humans dislike (rambling, off-topic, refusing the wrong things, hallucinating confidently). It learns the patterns humans like (focused, helpful, structured, honest about uncertainty).
The combination (instruction tuning + RLHF) is the alignment recipe that turned GPT-3 into something you would actually want to use. Same base model, very different behavior.
Every modern chat-style LLM uses some version of this recipe. The details vary (DPO instead of PPO, Constitutional AI from Anthropic, RLAIF where the feedback comes from an AI judge), but the shape is the same. Pretrain on a huge text corpus. Instruction-tune to learn the format. Use human feedback to learn what is actually good.
The ChatGPT moment: November 30, 2022
By late 2022, OpenAI had a model internally called GPT-3.5. It was a GPT-3 base, instruction-tuned, RLHF-trained. The pieces had been sitting in the API for a while. Developers could use it. Most people had never heard of it.
On November 30, 2022, OpenAI wrapped the model in a chat web app and put it online for free. The pitch was modest. The blog post called it a "research preview." Internally the team thought it was a small refinement of what they already had.
It got 1 million users in 5 days. 100 million users in 2 months. The fastest-growing consumer application in history at the time.
Why this specific release? The base technology had been around for almost a year. Three things came together.
The first was the chat interface. Calling an API and tuning prompt structure is a developer activity. Typing into a text box and getting a useful answer is a normal-person activity. The chat UI removed the friction.
The second was free access. No payment, no API key, no integration work. Anyone with a browser could try it.
The third was that the model had crossed some quality threshold. It was good enough that you could ask it almost anything in plain English and get a useful answer. Code, essays, recipes, vacation plans, math homework, business emails. It did not always get things right, but it was right often enough to feel like magic.
The aftermath was a step change in how the tech industry talked about AI. Before November 2022, "AI" mostly meant recommender systems, image classifiers, fraud detection, and a few experimental products. After November 2022, "AI" meant LLMs. Every tech company started either building on top of one or scrambling to figure out their strategy. The industry has not gone back.
What's next in Part 4
The ChatGPT moment ended one chapter of the story and started another. Until late 2022, LLM progress was a research effort with one or two visible products. After ChatGPT, it became a platform race.
Part 4 is about what happened from there. GPT-4 and the move to multimodal. Tool use and function calling, which let the model start doing things in the real world. Long context windows. The open-weights wave (Llama, Mistral, DeepSeek, Qwen) that broke the closed-API monopoly. Reasoning models like o1 and DeepSeek-R1 that shifted the recipe again, this time from "more training compute" to "more thinking-time compute." And finally, where we are in 2026.
That is Part 4.
What to take from Part 3
Three things worth remembering.
The first is that scale was the unlock. The transformer architecture from 2017 was already capable of what GPT-3 did. The reason it took until 2020 to find out was that nobody had spent the compute and the data on it. The new abilities came from quantity, not from a new idea.
The second is that the alignment step is what made models useful. A pretrained LLM is a completion engine that has read the internet. It is impressive but impractical. Instruction tuning gives it the format. RLHF teaches it what humans actually want. Without the alignment step, ChatGPT would not have happened.
The third is that ChatGPT was a product breakthrough, not a research one. The model behind it was not new. The chat UI, the free access, and the quality crossing a usability threshold are what created the moment. Most of what feels recent in AI is actually downstream of that one product release.
See you in Part 4.