Every modern AI model is a neural network. Image generators, chatbots, code completion, voice assistants, all of them. If you have used ChatGPT or Claude, you have used a neural network.
But what is a neural network, really? Most explanations either skip the math (so you cannot tell what is happening) or drown you in equations (so you cannot tell what is happening either). This post takes the middle path. Plain English, simple examples, no equations.
By the end you will know what a neuron is, how a network stacks them into layers, how it learns from data, and what an embedding is. That is enough to make sense of any AI article you read after this.
If you prefer to learn by clicking through, there is an interactive step-by-step demonstration of how a tiny network trains, covering the same ideas in this post.
A neuron is just a weighted sum
The "neuron" in a neural network has nothing to do with biology. It is a small math operation. Here is what it does.
It takes a few numbers as input. Each input has a weight. The neuron multiplies each input by its weight, adds them all up, and passes the result through a squashing function. The output is one number.
That is it. A neuron is a weighted sum followed by a squash.
Let me make this concrete. Suppose you are building a tiny spam detector. You have three signals about an email:
- x₁: number of "click here" phrases (let's say 1)
- x₂: number of exclamation marks (3)
- x₃: 1 if the sender is unknown, else 0 (1 here, since the sender is unknown)
The neuron has three weights, one per input. Say the model has learned w₁ = 1.5, w₂ = 0.4, w₃ = 2.0. These weights say: an unknown sender is the strongest spam signal, "click here" matters too, exclamation marks are a weaker hint.
The neuron computes a weighted sum: 1.5·1 + 0.4·3 + 2.0·1 = 1.5 + 1.2 + 2.0 = 4.7.
Then comes the squash. A common one is the sigmoid: it takes any number and squashes it to a value between 0 and 1. After the squash, 4.7 becomes about 0.99. So the neuron says: 99% spam.
The squash matters. Without it, the neuron's output is just a raw weighted sum, which can be any number. With it, you get something interpretable: a probability, or a "how strongly do I believe this?" value. The squash function is also called the activation function, because it decides how strongly the neuron "activates" given its inputs.
The whole neuron is: multiply by weights, sum, squash. Simple arithmetic.
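If code is easier for you to read than prose, the whole neuron fits in a few lines of Python. This sketch uses the illustrative weights from above:

```python
import math

def sigmoid(z):
    # Squash any number into the range (0, 1).
    return 1 / (1 + math.exp(-z))

def neuron(inputs, weights):
    # Weighted sum, then squash.
    z = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(z)

# The spam example: one "click here", three exclamation marks, unknown sender.
print(neuron([1, 3, 1], [1.5, 0.4, 2.0]))  # ~0.99, i.e. "99% spam"
```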
A network is layers of neurons
One neuron is not very smart. It can only learn patterns simple enough to separate with a single straight line. Real problems need more.
The fix is to use many neurons, arranged in layers. Each layer takes the previous layer's output as its input.
A typical small network has three kinds of layers:
- Input layer: the raw features going in. In our spam example, the three numbers.
- Hidden layers: where the actual learning happens. Each neuron takes the previous layer's output, applies its own weights, sums, and squashes. There can be one hidden layer or hundreds.
- Output layer: produces the final answer. For spam, one neuron with a sigmoid: a number between 0 and 1.
Every neuron in a layer is connected to every neuron in the next layer. Each connection has its own weight. So in a small network with 3 inputs, a hidden layer of 4 neurons, another hidden layer of 4 neurons, and 1 output, you have 3·4 + 4·4 + 4·1 = 32 weights. A real model has billions of these. GPT-3 had 175 billion. The idea is the same. Just bigger.
Why does stacking layers help? Each layer learns a slightly more complicated pattern. The first hidden layer might learn simple things like "many exclamation marks plus 'click here' is suspicious." The next layer combines those simple patterns into more complex ones, like "suspicious words and unknown sender means very likely spam." A deep network can learn very subtle patterns by stacking many simple ones.
When data flows from input through hidden layers to output, that is called the forward pass. It is just one big chain of weighted sums and squashes.
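Here is that chain as a short Python sketch. The 3-4-4-1 shape matches the example above; the weights are random placeholders, since real weights come from training (the next section):

```python
import math, random

random.seed(0)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer(inputs, weights):
    # One layer: every neuron does its own weighted sum and squash.
    return [sigmoid(sum(x * w for x, w in zip(inputs, ws))) for ws in weights]

def forward(inputs, network):
    # The forward pass: each layer's output feeds the next layer.
    for weights in network:
        inputs = layer(inputs, weights)
    return inputs

# A 3-4-4-1 network with random, untrained weights: 32 weights in total.
network = [
    [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)],  # hidden layer 1
    [[random.uniform(-1, 1) for _ in range(4)] for _ in range(4)],  # hidden layer 2
    [[random.uniform(-1, 1) for _ in range(4)]],                    # output layer
]
print(forward([1, 3, 1], network))  # one number between 0 and 1; garbage until trained
```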
How does it learn?
The forward pass needs weights to do its math. But where do the weights come from?
When you build a network, the weights start as random numbers. The first prediction is garbage. The next prediction is also garbage. Then training happens.
Training has three steps, repeated many times:
- Forward pass: feed an example through the network and get a prediction.
- Compute the loss: compare the prediction to the correct answer. Get a single number that says how wrong this prediction was. This number is called the loss.
- Update the weights: nudge every weight a tiny bit in the direction that would have reduced the loss.
Repeat this for millions of examples and the weights settle into values that make the network's predictions accurate. That is all there is to training.
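To make step 2 concrete, here is one common way to turn "how wrong" into a single number: squared error, the gap between prediction and truth, squared. The numbers are just for illustration:

```python
def squared_error(prediction, target):
    # One common loss: the gap between prediction and truth, squared.
    return (prediction - target) ** 2

print(squared_error(0.99, 1.0))  # tiny loss: the prediction was nearly right
print(squared_error(0.99, 0.0))  # big loss: the prediction was confidently wrong
```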
The hard part is step 3. Let me explain.
Gradient descent in plain words
Imagine you are blindfolded on a hilly landscape. You want to reach the lowest point. What do you do? Feel which direction the ground slopes down, take a small step that way, and repeat.
That is gradient descent. The "landscape" is the loss as a function of the weights. The "lowest point" is where the loss is smallest, which means the network's predictions are most accurate. The "direction the ground slopes down" is what the math calls the gradient.
For each weight, the gradient says: "if I increase this weight a little, does the loss go up or down, and by how much?" You then take a small step in the direction that reduces the loss. Do this for every weight, every training example, again and again.
The "small step" size is called the learning rate. Too big a step and you overshoot, the loss bounces around, and the network never settles. Too small a step and learning takes forever. Picking a good learning rate is one of the practical tricks of training neural networks.
Backpropagation: how the gradient is computed
Here is one more piece. The loss is computed at the output. But the network has many layers. How do we know which weight in the first hidden layer should change, and by how much?
The answer is backpropagation. The name sounds technical but the idea is simple. The error at the output gets passed backward through the layers, one layer at a time. Each layer figures out how much each of its own weights contributed to the error, and adjusts them accordingly.
Think of it like a chain of blame. The output layer says: "I was off by this much, and you (the previous layer) gave me these values, so you are responsible for this share of the error." That layer then turns to the layer before it and does the same thing. The error flows backward, layer by layer, and each weight gets its update.
This is why "back-propagation" is the name. The forward pass goes input to output. The backward pass goes output to input, distributing the error.
The math behind this is calculus, specifically the chain rule. You do not need to know the math to understand the idea. The point is: the network can figure out, for every weight in every layer, exactly how to nudge it to reduce the loss. That is the whole training algorithm.
Forward pass, compute loss, backward pass, update weights. That cycle, run a few million times, is how every modern neural network learns.
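As a sketch, here is that whole cycle for the single spam neuron from earlier. The training examples are invented, and a real network would push this same chain-rule update backward through every layer, but the loop is the same:

```python
import math, random

random.seed(1)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Invented training set: (features, 1 if spam else 0).
examples = [
    ([1, 3, 1], 1),  # "click here", three !, unknown sender: spam
    ([0, 1, 0], 0),  # one ! from a known sender: not spam
    ([2, 0, 1], 1),  # two "click here"s, unknown sender: spam
    ([0, 2, 0], 0),  # two ! from a known sender: not spam
]

weights = [random.uniform(-1, 1) for _ in range(3)]  # start random
learning_rate = 0.5

for epoch in range(2000):
    for inputs, target in examples:
        # Forward pass: weighted sum, then squash.
        prediction = sigmoid(sum(x * w for x, w in zip(inputs, weights)))
        # How the loss shifts as each weight moves (squared error
        # combined with the sigmoid's slope: the chain rule in one line).
        error = 2 * (prediction - target) * prediction * (1 - prediction)
        # Update: nudge each weight a small step downhill.
        weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]

p = sigmoid(sum(x * w for x, w in zip([1, 3, 1], weights)))
print(round(p, 2))  # close to 1: it learned to flag the spammy email
```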
Embeddings: turning anything into a vector
There is one more idea you need before reading anything about LLMs.
Neural networks only understand numbers. So if you want to train a network on text, you have a problem. "cat" is a word, not a number. How do you feed "cat" to a neural network?
The answer is to give every word a vector: a list of numbers. For example, "cat" might be [0.21, -0.34, 0.78, ...] with 300 or 1000 numbers in it. This vector is called the word's embedding.
The trick is that the embeddings are learned. The numbers start as random values. During training, they get updated along with the rest of the network. After training, similar words end up with similar vectors. "cat" and "kitten" sit close together. "cat" and "calculator" sit far apart.
You can almost think of the embedding space as a giant map of meaning. Words used in similar ways are near each other. Words used in different ways are far apart. The network learns this map purely from seeing many examples of how words appear in text.
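You can see the "map" in miniature with a standard closeness measure, cosine similarity. These four-number vectors are made up for illustration (real embeddings are learned and hundreds of numbers long), but the comparison works the same way:

```python
import math

# Made-up four-number embeddings, purely for illustration.
embeddings = {
    "cat":        [0.8, 0.1, 0.9, 0.2],
    "kitten":     [0.7, 0.2, 0.8, 0.3],
    "calculator": [0.1, 0.9, 0.0, 0.7],
}

def cosine_similarity(a, b):
    # How closely two vectors point the same way: 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))      # high, ~0.99
print(cosine_similarity(embeddings["cat"], embeddings["calculator"]))  # low, ~0.22
```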
This is also how neural networks can work with text at all. Once a word becomes a vector, it is just numbers, and the network can do its weighted sums and squashes on it. The same trick works for images (each pixel becomes a number), audio (waveforms become numbers), and so on. Embeddings are the bridge between the real world and the math the network can do.
In modern LLMs, embeddings are not just for input words. Every layer's intermediate output is also a vector. The transformer mixes and remixes these vectors, layer after layer. All of that is built on the same idea: turn things into vectors, do weighted sums, and squash.
What to take from this
Three things worth remembering before you read any other AI article.
The first is that a neural network is a stack of very simple operations. Each neuron is a weighted sum followed by a squash. A network is many of these arranged in layers. There is no magic. Just a lot of arithmetic.
The second is that training is a feedback loop. Predict, measure how wrong you were, nudge the weights to be a little less wrong, repeat. The math behind the nudge (gradient descent and backpropagation) is the engine, but the loop itself is straightforward.
The third is that embeddings are how neural networks handle anything that is not already a number. A word, an image, a sound. Turn it into a vector, and the network can work with it.
Once you have these three ideas, every other AI concept (transformers, attention, fine-tuning, RAG, agents) is just a specific way of arranging them. The basics do not change.
If you want to see how these ideas built up to the modern LLM, the LLMs Explained series picks up the story from the 1950s and walks through to today.