
LLMs Explained, Part 1: How We Got Here

A simple history of AI from 1950 to 2017, written for software engineers who want to understand where today's LLMs actually came from.

Ram Bakthavachalam

If you are a software engineer thinking about moving into AI work, you probably have one basic question: how does any of this actually work? Not the hype, not the marketing, but the mechanics. What is a large language model? Why does it work? Why did this all suddenly blow up in 2022 and not earlier?

The honest answer is that it took 70 years of work to get here. Most of what you see today is not new. The ideas have been around since the 1950s. What changed recently is that we finally have enough compute and enough data to make those old ideas useful.

This is a 3-part series that explains how we got here.

  • Part 1 (this one) is the history. From the first imagined "thinking machine" in the 1950s up to the year 2017, when one paper changed everything.
  • Part 2 will explain that 2017 paper, called "Attention Is All You Need," and how transformers actually work.
  • Part 3 will trace what happened from 2020 (GPT-3) to today (reasoning models, agents, open-weight models).

You don't need a math background. I will use plain language. By the end of all 3 parts you should understand what an LLM is, where it came from, and why it works.

If neural networks themselves are new to you, read How Neural Networks Work first. This series builds on those basics.

Let me start.

The 1950s: when AI got its name

The story usually starts with Alan Turing. In 1950, he published a paper called "Computing Machinery and Intelligence." He asked one simple question: can machines think? He proposed a test (the famous "Turing test"): if a person talking to a machine cannot tell whether it is a human or a machine, that is a useful working definition of "thinking."

Six years later, in 1956, a small group of researchers met at Dartmouth College for a summer workshop. The organizer, John McCarthy, coined the term "artificial intelligence" to describe what they were trying to build. The Dartmouth conference is usually counted as the official start of AI as a field.

The early researchers were very optimistic. They believed that within a generation, machines would be able to do any work a human could do. They were wrong about the timeline by about 60 years. But the questions they asked were the right ones.

The first big idea was simple: computers can already follow logical rules. So if you wrote down everything a human knows as a giant set of rules, the computer could reason like a human. That was the bet.

Symbolic AI and expert systems (1960s and 70s)

The 1960s and 70s were the era of "symbolic AI." Researchers tried to teach computers using formal symbols and logical rules. The idea was: if you can describe knowledge precisely, the computer can use it.

A famous early example was ELIZA (1966), a program that imitated a therapist by matching patterns in your sentences and responding with pre-written templates. People found it surprisingly engaging, even though it had no real understanding. This was the first chatbot.

A bigger example was MYCIN (1972), an "expert system" that diagnosed bacterial infections. Researchers interviewed real doctors and wrote down their reasoning as hundreds of "if-then" rules. MYCIN actually worked — in some tests, it diagnosed patients as well as expert doctors.
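
To make the flavor of this concrete, here is a toy sketch of the expert-system idea in Python. The rules below are invented for illustration; MYCIN's real rule base had hundreds of rules written with doctors, plus "certainty factors" to handle uncertainty:

```python
# A toy illustration of the expert-system idea: hand-written if-then rules.
# These rules are made up for this sketch, not taken from MYCIN.

def diagnose(findings: dict) -> str:
    if findings.get("gram_stain") == "negative" and findings.get("shape") == "rod":
        return "suspect E. coli"
    if findings.get("gram_stain") == "positive" and findings.get("growth") == "clusters":
        return "suspect Staphylococcus"
    return "no rule matched: ask a human expert"

print(diagnose({"gram_stain": "negative", "shape": "rod"}))
# -> suspect E. coli
```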

Expert systems became a small industry in the 1980s. Companies built systems for credit approval, equipment diagnosis, chemical analysis. For a while, it looked like this was the path to AI.

But the limits became clear. Every new domain needed a fresh set of rules. The rules became too complex to maintain. Real-world knowledge does not fit neatly into "if-then" statements. A doctor knows things that are hard to write down: how a patient looks, what feels off about a case, when to ignore the rules.

By the late 1980s, expert systems hit a wall. Funding dried up. Researchers call this kind of period an "AI winter," and this was the second one; the first had come in the mid-1970s, when early overpromising triggered the first big funding cuts. The dream of writing down all human knowledge was, it turned out, a bad bet.

The other path: neural networks (and their first death)

While most of AI was busy with rules, a smaller group was working on something different. Their idea was inspired by the brain, not by logic. Instead of writing rules, they built models of small connected units (loosely called "neurons") that pass signals to each other and adjust the strength of each connection based on examples.

In 1958, a researcher named Frank Rosenblatt built the "perceptron," the simplest possible neural network. It took some inputs (numbers), multiplied each one by a weight, added them up, and produced an output. The weights got adjusted as the perceptron saw more examples.

A perceptron: a few input nodes feed into a single output node through weighted connections.
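
In modern code, the whole thing fits in a few lines. Here is a minimal sketch of a perceptron learning the logical AND function; the training data, learning rate, and epoch count are arbitrary toy choices:

```python
# A minimal perceptron, close in spirit to Rosenblatt's 1958 design.
# It learns the logical AND function from examples.

def predict(weights, bias, x):
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

# training data for AND: inputs -> label
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias, lr = [0.0, 0.0], 0.0, 0.1
for epoch in range(20):
    for x, target in data:
        error = target - predict(weights, bias, x)
        # Rosenblatt's rule: nudge each weight toward reducing the error
        weights = [w + lr * error * xi for w, xi in zip(weights, x)]
        bias += lr * error

print([predict(weights, bias, x) for x, _ in data])  # -> [0, 0, 0, 1]
```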

The New York Times called the perceptron "the embryo of an electronic computer that... will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." This sounds familiar in 2026. AI hype is not new.

The perceptron worked for simple problems. But in 1969, two researchers (Marvin Minsky and Seymour Papert) wrote a book proving that a single-layer perceptron could not solve some basic problems. The most famous example was XOR, the simple logical operation that is true when exactly one of its two inputs is true. The proof was correct, but funders read the book as saying "neural networks are useless." Funding collapsed. The neural network field went quiet for almost 15 years.

The strange part is that the math to fix the perceptron's problem was actually known. If you stack multiple layers of perceptrons, you can solve XOR and many other things. The hard part was training a multi-layer network: nobody had a good algorithm for that.
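
You can check the first claim by hand. Here is XOR built from two layers of perceptron-style units, with the weights simply chosen by hand rather than learned; learning them is exactly the problem the next section solves:

```python
# XOR with two layers of perceptron-style units, weights set by hand.
# A single layer can't do this; with one hidden layer it's easy.

def step(total):
    return 1 if total > 0 else 0

def xor(x1, x2):
    # hidden layer: OR and AND of the inputs
    h_or  = step(x1 + x2 - 0.5)   # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # output layer: OR but not AND
    return step(h_or - h_and - 0.5)

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0, 1, 1, 0]
```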

The 1980s: backpropagation and a quiet comeback

In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper that fixed the training problem. The algorithm was called backpropagation, or "backprop" for short. With backprop, you could train a multi-layer network by passing the prediction error backward through the layers, adjusting each connection weight a little bit at a time.

Backprop is still the core training algorithm today. Every neural network you have heard of, including GPT-5 or DeepSeek V4 or Claude, is trained using a version of this same idea from 1986.
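
Here is a minimal sketch of the idea: a tiny two-layer network trained on XOR with numpy. The layer sizes, learning rate, and step count are arbitrary toy choices, but the forward-then-backward structure is the real algorithm:

```python
import numpy as np

# A tiny two-layer network trained with backprop on XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 1.0

for _ in range(5000):
    # forward pass: compute the prediction
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # backward pass: push the error back through each layer
    d_out = (out - y) * out * (1 - out)   # error signal at the output
    d_h = (d_out @ W2.T) * h * (1 - h)    # error signal at the hidden layer

    # adjust every weight a little bit in the direction that reduces the error
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())  # should be close to [0, 1, 1, 0]
```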

But backprop did not change the world right away. Computers were too slow. Datasets were too small. Neural networks worked on toy problems but lost to other methods on real ones.

The other methods, in the 1990s and 2000s, were called "statistical machine learning": decision trees, support vector machines, random forests, Bayesian networks. These methods were practical and shipped real products. Spam filters, search ranking, recommendation systems. Almost nothing in production used neural networks.

A small group of researchers (Hinton, Yann LeCun, Yoshua Bengio) kept working on neural networks even when they were unfashionable. They believed that with enough data and enough compute, neural networks would beat everything else. Almost no one in the field agreed with them.

Statistical NLP: how language was done before deep learning

For language specifically, the 1990s and 2000s were the era of "statistical NLP." If you wanted a translator, you collected millions of pairs of English-French sentences and fit statistical models to learn alignments between words and phrases. If you wanted a spam filter, you counted word frequencies and used Bayes' rule.

This work was unglamorous but it shipped. Google Translate, when it launched in 2006, was based on these statistical methods. Every spam filter in your inbox circa 2008 used some flavor of Naive Bayes. Even early voice assistants and search engines were built on this foundation.

The methods had a common property: they treated language as a "bag of features." A document was a bag of words. A sentence was a list of n-grams (groups of 2 or 3 consecutive words). The features were hand-crafted by linguists and engineers. The statistical model learned which features mattered for which task.
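
Here is a toy sketch of that whole recipe: a bag-of-words Naive Bayes "spam filter" in a dozen lines. The six training messages are invented; real filters trained on millions of messages, but the mechanics were this simple. Swap `msg.split()` for pairs of consecutive words and you have bigram features:

```python
import math
from collections import Counter

# A toy Naive Bayes spam filter: bag-of-words counts plus Bayes' rule.
spam = ["win money now", "free money offer", "win a free prize"]
ham  = ["meeting at noon", "lunch tomorrow", "see you at the meeting"]

spam_counts = Counter(w for msg in spam for w in msg.split())
ham_counts  = Counter(w for msg in ham for w in msg.split())
vocab_size  = len(set(spam_counts) | set(ham_counts))

def log_likelihood(counts, msg):
    total = sum(counts.values())
    # add-one (Laplace) smoothing so unseen words don't zero everything out
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in msg.split())

msg = "free money tomorrow"
# equal priors here (3 spam, 3 ham), so comparing likelihoods is enough
print("spam" if log_likelihood(spam_counts, msg) > log_likelihood(ham_counts, msg) else "ham")
```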

This worked, but it had a ceiling. Hand-crafted features lose nuance. Bag-of-words throws away word order. N-grams capture only local patterns. The whole field knew it was working with limited tools.

2012: the deep learning shock

The change came suddenly in 2012. A team at the University of Toronto (Geoffrey Hinton's lab, with his student Alex Krizhevsky) entered a yearly competition called ImageNet. The task: classify photos into 1000 categories from a dataset of over a million labeled images.

The Toronto team used a deep neural network trained on two consumer GPUs. They beat the previous best result by an enormous margin: a top-5 error rate of about 15 percent, against about 26 percent for the next-best system.

This was the moment the field changed. Within two years, every team in computer vision was using neural networks. There were three reasons:

  1. GPUs had finally become fast enough. Training a large neural network requires billions of multiplications. A CPU does these one at a time. A GPU does thousands in parallel. By 2012, NVIDIA's GeForce cards had crossed the threshold of being usable for serious neural network training.
  2. ImageNet was big enough. A dataset of over a million labeled images was something the field had never had. Without enough examples, deep networks could not learn rich features.
  3. The tricks finally worked. Hinton's lab had figured out small but important engineering tricks (the ReLU activation, dropout for regularization, careful initialization) that made deep training stable. Two of them are sketched just after this list.
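
Here is what those two tricks look like in code, as a small sketch; the array sizes and keep probability are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # a batch of hidden-layer activations

# ReLU: keep positive values, zero out negatives. Cheap to compute, and its
# gradient does not shrink toward zero for positive inputs the way sigmoid's does.
relu = np.maximum(0, x)

# Dropout: during training, randomly zero a fraction of activations so the
# network cannot lean too hard on any single unit.
keep_prob = 0.8
mask = rng.random(x.shape) < keep_prob
dropped = relu * mask / keep_prob   # rescale so the expected value matches test time
print(dropped.shape)  # (4, 8)
```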

Once the same recipe started working on images, the question became obvious: would it work for language too?

2013: words become numbers (Word2Vec)

In 2013, a team at Google led by Tomas Mikolov published a paper called "Word2Vec." The idea was simple: take every word in a large body of text, and for each word, train a small neural network to predict the words that appear nearby. After training, each word becomes a vector of a few hundred numbers (300 was a common choice).

The interesting part was what those vectors learned. Without anyone telling them what words mean, the vectors organized themselves. Words with similar meaning ended up close to each other. King and Queen ended up near each other. Paris was close to France in the same way Tokyo was close to Japan.

You could even do arithmetic on the vectors. The classic example:

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

This was the first time you could see, very clearly, that a neural network was learning structure that matched human meaning, just from raw text and some math.
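
You can reproduce the arithmetic on toy data. The 3-dimensional vectors below are invented by hand purely to show the mechanics; real Word2Vec vectors have hundreds of dimensions and are learned from billions of words of text:

```python
import numpy as np

# Made-up toy vectors, chosen only to illustrate the analogy arithmetic.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

target = vecs["king"] - vecs["man"] + vecs["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# which word's vector points most nearly in the same direction as the target?
best = max(vecs, key=lambda w: cosine(vecs[w], target))
print(best)  # -> queen
```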

Word2Vec changed how we thought about language for computers. Words no longer had to be a "bag of features" picked by hand. They could be vectors learned automatically from data. Every modern language model still works on this basic principle, just with much bigger and more complex versions of the idea.

RNNs and LSTMs: how machines read sentences (2014 to 2016)

Word vectors are static. Each word has one fixed representation. But language is sequential. The meaning of a sentence depends on the order of the words. "The dog bit the man" is very different from "The man bit the dog."

To handle order, researchers used Recurrent Neural Networks, called RNNs. An RNN reads a sentence one word at a time. At each step, it maintains a "hidden state," which is a vector that summarizes everything it has read so far. The hidden state gets updated word by word. By the end of the sentence, the hidden state, in theory, captures the meaning of the whole sentence.

An RNN unrolled across time: each time step takes one word as input and updates a hidden state that flows to the next step.
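
In code, the loop is simple. A sketch with random placeholder weights; a real model would learn `Wx` and `Wh` with backprop:

```python
import numpy as np

# A vanilla RNN cell, unrolled as a plain loop.
rng = np.random.default_rng(0)
d_word, d_hidden = 8, 16
Wx = rng.normal(scale=0.1, size=(d_word, d_hidden))    # word -> hidden
Wh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden -> hidden
b  = np.zeros(d_hidden)

sentence = rng.normal(size=(5, d_word))   # stand-ins for 5 word vectors
h = np.zeros(d_hidden)                    # the hidden state starts empty
for word in sentence:
    # each step mixes the new word into everything read so far
    h = np.tanh(word @ Wx + h @ Wh + b)

# h now summarizes the whole sentence (in theory)
print(h.shape)  # (16,)
```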

In practice, plain RNNs had a problem: they forgot too quickly. By the time you reached the end of a long sentence, you had forgotten the start. To fix this, researchers used variants called LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units). These had small "gates" inside them that decided what to remember and what to throw away. They worked much better than plain RNNs.

The big application was machine translation. Google released a neural translation system in 2016 based on LSTMs, and the quality jump over the old statistical methods was clear. The architecture was called "encoder-decoder":

  • One LSTM read the source sentence (in English, say) one word at a time and produced a single vector that summarized the whole sentence. This was the encoder.
  • Another LSTM took that vector and produced the target sentence (in French) one word at a time. This was the decoder.
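
In outline, the whole architecture looks like this. This is pseudocode-flavored Python: `rnn_step` and `pick_next_word` stand in for trained components, not a real translation system:

```python
# Encoder-decoder in outline. rnn_step and pick_next_word are placeholders.

def encode(source_words, rnn_step, h0):
    h = h0
    for w in source_words:
        h = rnn_step(h, w)
    return h                      # one vector for the whole source sentence

def decode(summary, rnn_step, pick_next_word, max_len=50):
    h, out, word = summary, [], "<start>"
    for _ in range(max_len):
        h = rnn_step(h, word)
        word = pick_next_word(h)  # most likely next target-language word
        if word == "<end>":
            break
        out.append(word)
    return out
```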

This worked. But there was a problem.

The bottleneck: one vector for the whole sentence

The encoder squeezed an entire source sentence into a single fixed-size vector. For short sentences, this was fine. For long sentences, the vector did not have enough room. The decoder simply could not remember a 50-word source sentence with enough detail.

Imagine summarizing a paragraph in one tweet, then asking someone to reconstruct the paragraph from just the tweet. They will get the gist but lose the details. That is what was happening to long-sentence translation.

The field knew this was the bottleneck. The fix came in 2014.

Attention (2014): the idea that changed everything later

In 2014, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio published a paper that fixed the bottleneck. Their idea was called "attention."

Instead of squeezing the whole source sentence into one vector, the decoder kept access to all the encoder's hidden states (one per source word). At each step of producing an output word, the decoder would compute a weighted average over the encoder states, paying more attention to the source words most relevant for the current output.

Here is the intuition. When you (a human) translate "The dog bit the man" into French, and you are about to produce the French word for "dog," you focus on the English word "dog" in the source sentence. When you produce the French word for "bit," you focus on "bit." Attention let the model do something similar: at each step, it learned which source words to focus on.

Attention as soft alignment: when the decoder produces an output word, it computes weights over all source words and focuses on the most relevant ones.
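
The mechanics fit in three lines of numpy. This sketch uses simple dot-product scoring for clarity; Bahdanau's paper computed the scores with a small neural network, but the weighted-average structure is the same:

```python
import numpy as np

# The attention computation, with random placeholder values.
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(6, 16))   # one hidden state per source word
dec_state  = rng.normal(size=(16,))     # decoder state at the current output step

scores  = enc_states @ dec_state        # one relevance score per source word
e = np.exp(scores - scores.max())
weights = e / e.sum()                   # softmax: positive, sums to 1
context = weights @ enc_states          # weighted average of encoder states

print(weights.round(2))  # how much attention each source word gets right now
```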

The result was a clear quality jump for translation. Models could now handle long sentences without forgetting the start.

The thing to notice is that in 2014, attention was a small side improvement to RNNs. RNNs were still doing the main work of reading sequences step by step. Attention was an add-on that helped the decoder pick the right source words at each step.

Three years later, somebody had a different idea: what if attention is not just an add-on, but the entire model? What if you remove the RNN completely and use only attention?

That paper was published in June 2017. It was called "Attention Is All You Need." It is the paper that started everything you see today.

Where we were at the start of 2017

Here is a snapshot of where the field was in early 2017, just before the transformer landed:

  • Neural networks had won in vision since 2012.
  • Neural networks had won in NLP between 2014 and 2016, replacing the statistical methods.
  • The standard architecture for sequence problems was an RNN (LSTM or GRU) with attention.
  • Word vectors (some descendant of Word2Vec) were the standard input representation.
  • GPUs were widely available and most labs were training on them.

But there was one stubborn problem. RNNs are sequential by nature. To process word 10, you must first process words 1 through 9. This made them slow to train and hard to scale. As datasets and models grew larger, the bottleneck stopped being algorithms and became compute. And RNNs were the worst possible architecture for parallel compute.

The next post is about how a small team at Google removed the RNN entirely and replaced it with pure attention. The architecture they invented, the transformer, is the foundation of every LLM today: GPT, Claude, Gemini, Llama, DeepSeek, Qwen, all of them.

If you read only one paper in your AI career, "Attention Is All You Need" is the one. Part 2 will walk through it.

What to remember from Part 1

Three things from this history are worth carrying forward.

The first is that AI is not a continuous line of progress. The field has been through at least two "winters" where everyone thought it was a dead end. Hype and disappointment are part of the pattern. Be careful with both.

The second is that the deep learning revolution was not a new idea. It was old ideas (neural networks, backpropagation) finally meeting enough data and enough compute. The bet that Hinton, LeCun, and Bengio made in the 1990s was correct, but it took 20 more years for the world to catch up.

The third is that almost every current LLM is built on three older ideas, stacked on top of each other:

  1. Words become vectors (Word2Vec, 2013).
  2. Models read sequences and maintain state (RNNs and LSTMs, dominant from 2014 to 2016).
  3. Models can pay attention to the right parts of an input (Bahdanau et al., 2014).

The transformer paper from 2017, which we'll cover next, kept the idea of attention and threw away the RNN. That is the whole story. Everything else, including the modern LLM, follows from there.

See you in Part 2.