
LLMs Explained, Part 4: The Platform Era

From GPT-4 to today's reasoning models, agents, and coding tools. How LLMs went from a single chatbot to the platform every product is built on.

Ramachandran Bakthavachalam (Ram)

In Part 3 we ended at the ChatGPT moment. November 30, 2022. A free chat interface on top of an instruction-tuned, RLHF-trained model. A hundred million users in two months. After that release, "AI" stopped being a research topic and became something every tech company needed an answer for.

This post is about what happened next. From early 2023 to 2026, the field stopped being about one chatbot and turned into a platform. New capabilities came in waves: multimodal input, tool use, long context, open weights, and reasoning. Each one changed what you could build with an LLM and how the products around them looked.

By the end you will have a working picture of where the field stands in 2026 and how it got there.

If "neural network", "transformer", or "RLHF" are new terms, start with Parts 1 to 3 of this series.

GPT-4 and the start of the platform race (2023)

OpenAI released GPT-4 on March 14, 2023, about four months after ChatGPT. The pitch was that the model was substantially better at reasoning, longer-form writing, and tasks that required following careful instructions. The technical report was deliberately thin on details, citing competitive concerns. The benchmarks (LSAT, bar exam, coding interviews) jumped clearly above GPT-3.5.

Two things mattered more than the benchmarks. The first was that GPT-4 accepted images as input, not just text. You could give it a screenshot of a chart and ask what was wrong with it. You could photograph a hand-drawn diagram and ask it to write the code. This was the start of "multimodal" as a normal feature, not a research demo.

The second was that GPT-4 became the bar everyone else had to clear. Anthropic released Claude, then Claude 2 mid-2023. Google merged its AI labs and released Gemini at the end of 2023. Meta open-sourced Llama. Within about a year, the closed-API frontier had three or four credible competitors instead of one.

The platform race had started. From this point on, every six months brought a new top model, a new context length, or a new modality.

Tool use: the model starts doing things

Until 2023, an LLM only generated text. You typed something. It typed back. That was it. It could not run a calculator, look up a stock price, send an email, or read a file.

In June 2023, OpenAI released function calling. The idea is simple. You describe a set of functions the model can call (with their inputs and outputs). When the user asks something that needs one of those functions, the model emits a structured request to call it. Your code executes the function and feeds the result back. The model continues, now with the function's output as part of its context.

This was the unlock. The model could now look things up, do calculations, read files, send emails. The model itself did not change; the products around it grew. ChatGPT got plugins (later retired in favor of GPTs). Every API provider added function calling. The pattern became standard within months.
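The loop looks roughly like this. The sketch below stubs out the model call with a hard-coded policy so it runs standalone; the tool schema, message shapes, and `get_stock_price` helper are illustrative, not any provider's exact format.

```python
import json

# One tool, described to the model as a schema. Names and fields
# here are made up for illustration.
TOOLS = {
    "get_stock_price": {
        "description": "Look up the latest price for a ticker symbol.",
        "parameters": {"ticker": "string"},
    }
}

def get_stock_price(ticker: str) -> float:
    # Stand-in for a real market-data lookup.
    return {"ACME": 123.45}.get(ticker, 0.0)

def fake_model(messages):
    """Stand-in for an LLM API call. A real model decides, from the
    tool schemas in its context, whether to emit a tool call."""
    last = messages[-1]
    if last["role"] == "user":
        # The model chooses to call a tool, as structured output.
        return {"tool_call": {"name": "get_stock_price",
                              "arguments": {"ticker": "ACME"}}}
    # With the tool result in context, it answers in plain text.
    result = json.loads(last["content"])
    return {"text": f"ACME is trading at {result['price']}."}

messages = [{"role": "user", "content": "What is ACME trading at?"}]
reply = fake_model(messages)

if "tool_call" in reply:                        # 1. model requests a tool
    call = reply["tool_call"]
    out = get_stock_price(**call["arguments"])  # 2. your code executes it
    messages.append({"role": "tool",
                     "content": json.dumps({"price": out})})
    reply = fake_model(messages)                # 3. model continues

print(reply["text"])
```

The key point is in step 2: the model never executes anything itself. It emits a request; your code runs the function and decides what the model gets to see.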

A year and a half later, function calling turned into the foundation for agents: programs where an LLM decides which functions to call, calls them, sees the results, and decides what to do next. We will get to those.

Long context: how big is the model's whiteboard?

The first big public LLMs had 4K-token context windows. Around 8 pages of text. Useful for chat but cramped for real documents.

In mid-2023, the numbers started moving. Anthropic expanded Claude's window to 100K tokens. OpenAI followed in November 2023 with GPT-4 Turbo at 128K. Google's Gemini 1.5 announced 1M tokens in February 2024. Today, several frontier models handle 1M, and Meta's Llama 4 Scout claims 10M.

What changed at each step was what you could naturally pass to the model: a long contract, an entire codebase, hundreds of customer support tickets, a book. Long context did not require new architecture. It required engineering work to make attention efficient enough at long sequence lengths.

There is a catch. Context length is not the same as context quality. Models perform measurably worse on information buried in the middle of a long context window than on information at the start or end. The numbers grew faster than the model's ability to actually use them. The field has terms for this ("lost in the middle", "needle in a haystack"). Pragmatically, anything beyond the model's effective context is decoration, not capability.
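One pragmatic mitigation follows directly from the finding above: when packing many documents into one prompt, put the most important ones at the edges of the context, where retrieval is most reliable, and let the least important sink to the middle. A minimal sketch of that ordering (the function name and priority scheme are my own, not a standard API):

```python
def pack_for_context(docs_by_priority):
    """docs_by_priority: most important first. Returns an ordering that
    alternates important docs between the front and the back of the
    context, pushing the least important toward the middle, where
    models retrieve least reliably ("lost in the middle")."""
    front, back = [], []
    for i, doc in enumerate(docs_by_priority):
        (front if i % 2 == 0 else back).append(doc)
    return front + list(reversed(back))

docs = ["contract", "amendment", "email thread", "boilerplate"]
print(pack_for_context(docs))
```

With this ordering, the contract and the amendment end up at the two edges of the prompt and the boilerplate lands in the middle, which is where a dropped detail hurts least.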

The open-weights wave

Until early 2023, every frontier model was closed. You called an API. You did not see the weights. Then Meta released Llama in February 2023, originally to researchers under a restricted license. The weights leaked within a week. Llama 2 (July 2023) was released openly. Llama 3 (April 2024) too, and Llama 4 followed in 2025.

Mistral, a French startup, released Mistral 7B in September 2023. It punched above its size. Their Mixtral 8x7B in December 2023 was the first widely-used open mixture-of-experts model.

Then in late 2024, the Chinese labs broke through. DeepSeek released V3 in December 2024 and R1 in January 2025, a reasoning model competitive with OpenAI's o1 at a fraction of the API cost. Alibaba's Qwen 2.5 (September 2024) and the Qwen 3 family that followed became the standard small-and-medium open models. By 2025, the open-weights frontier was within months (sometimes weeks) of the closed-API frontier on most evaluations.

Why does this matter for an engineer? Two reasons. First, you can actually run these models. A 7B or 14B model fits on a high-end laptop. A 70B model fits on a workstation with a few GPUs. For applications where data cannot leave a private network (legal, medical, defense, anything regulated), open-weights is the only option. Second, the price floor on inference dropped sharply. A frontier-quality answer costs much less from an open model than it did when only the closed APIs existed. If you are building products on top of LLMs, this directly affects your unit economics.
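The unit-economics point is worth making concrete. The prices below are deliberately made-up placeholders (real rates vary by provider, model, and hardware); the structure of the calculation is what matters.

```python
# Back-of-envelope inference economics. The prices are hypothetical
# illustrations -- substitute real quotes for your provider or your
# own hardware's amortized cost.
PRICE_PER_M_TOKENS = {                # $ per million tokens (made up)
    "closed_api_frontier": 10.00,
    "open_weights_hosted": 0.50,
}

def cost_per_request(model: str, in_tokens: int, out_tokens: int) -> float:
    rate = PRICE_PER_M_TOKENS[model]
    return (in_tokens + out_tokens) * rate / 1_000_000

# A typical document-heavy request: 3,000 tokens in, 500 out.
closed = cost_per_request("closed_api_frontier", 3_000, 500)
open_w = cost_per_request("open_weights_hosted", 3_000, 500)

print(f"per request: ${closed:.4f} vs ${open_w:.4f}")
print(f"per 1M requests/month: ${closed * 1e6:,.0f} vs ${open_w * 1e6:,.0f}")
```

At a million requests a month, a 20x price gap is the difference between a rounding error and a real line item on the budget, which is why the open-weights price floor changed what products are viable.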

Reasoning models: trading thinking time for quality

Through most of 2024, the recipe was still "scale up training". Bigger models, more data, more RLHF.

Then in September 2024, OpenAI released o1-preview. The pitch was that the model was trained, in part, to think for a long time before answering. Internally it generates a long chain of reasoning, possibly trying multiple approaches, before producing the final answer. On hard problems (Olympiad math, scientific reasoning, complex coding) it scored dramatically better than GPT-4o.

The shift was that capability now scales with inference compute, not just training compute. Want a better answer? Let the model think longer. The model becomes a knob you can turn at runtime, not only at training time.

DeepSeek's R1 paper in January 2025 described how to train a reasoning model openly using reinforcement learning on chains of reasoning. It was the recipe paper for everyone outside OpenAI. Within a few months, every major lab had a reasoning model: Anthropic's Claude with extended thinking, Google's Gemini reasoning variants, open Llama and Qwen reasoning checkpoints.

The cost is that reasoning answers consume far more tokens. A reasoning model can spend 30 seconds and 50,000 tokens producing one answer. That is fine for an offline analysis or a hard coding task. It is bad for a real-time chatbot. The practical consequence is that products now route requests: easy questions go to a cheap, fast model; hard questions go to a reasoning model.
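The routing described above can be as simple as a heuristic gate in front of two model endpoints. This sketch uses keyword rules for clarity; real routers often use a small classifier model instead, and the model names here are placeholders.

```python
# Route each request to a cheap fast model or an expensive reasoning
# model. HARD_SIGNALS and the length cutoff are illustrative
# heuristics, not a production policy.
HARD_SIGNALS = ("prove", "debug", "refactor", "analyze", "optimize")

def pick_model(prompt: str) -> str:
    looks_hard = (len(prompt) > 500
                  or any(word in prompt.lower() for word in HARD_SIGNALS))
    # Easy: cheap and fast. Hard: slow and expensive, but much better.
    return "reasoning-model" if looks_hard else "fast-model"

print(pick_model("What time zone is Lisbon in?"))        # fast-model
print(pick_model("Prove this invariant always holds."))  # reasoning-model
```

The design choice worth noting: the router errs toward the cheap model, because a reasoning model answering a trivial question wastes 100x the tokens, while a fast model fumbling a hard question is usually caught by the user and retried.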

Where we are in 2026

A few things define the current state.

Coding is the biggest application. Cursor, Claude Code, and similar IDE-native coding tools have moved a lot of professional software work into a workflow where the LLM writes most of the boilerplate and the human reviews and steers. This is the largest, most measurable productivity shift LLMs have caused so far.

Agents are real but flaky. A reasoning model with tool use can be wrapped into a program that executes multi-step tasks: schedule meetings, run a research workflow, fix a bug across files, complete a software ticket. They work for narrow, well-defined tasks. They still fail in surprising ways at the boundary of their scope. Most agent products you see in 2026 are heavily scaffolded: a small set of allowed tools, careful prompting, human review at key steps.
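The scaffolding pattern is simple to sketch: a fixed tool allowlist, a hard step budget, and an explicit stop condition. The model call is stubbed with a canned policy so the loop runs standalone; every tool, name, and answer here is invented for illustration.

```python
# A heavily scaffolded agent loop. A real agent replaces stub_model
# with an LLM API call that picks the next action from the history.
ALLOWED_TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "grep": lambda pattern: f"<lines matching {pattern}>",
}
MAX_STEPS = 5  # hard cap: a flaky agent must not loop forever

def stub_model(history):
    """Stand-in policy: read one file, then finish."""
    if not history:
        return {"action": "read_file", "args": ["config.yaml"]}
    return {"action": "finish", "answer": "The port is set in config.yaml."}

def run_agent(task: str) -> str:
    history = []
    for _ in range(MAX_STEPS):
        step = stub_model(history)
        if step["action"] == "finish":
            return step["answer"]
        if step["action"] not in ALLOWED_TOOLS:        # allowlist check
            history.append(("error", "tool not allowed"))
            continue
        result = ALLOWED_TOOLS[step["action"]](*step["args"])
        history.append((step["action"], result))       # model sees results
    return "step budget exhausted"  # fail closed; escalate to a human

answer = run_agent("Where is the server port configured?")
print(answer)
```

Everything outside the model call is guardrail: the allowlist bounds what the agent can touch, the step cap bounds what a confused run can cost, and the fallthrough return is where a human review step slots in.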

Multimodal is normal. Image input is standard. Audio in and out is standard. Video in is becoming standard. Native multimodal training (one model that sees and hears and talks) has replaced the older approach of bolting separate models together.

The bottlenecks moved. A few years ago the bottleneck was raw model capability. Today the bottleneck is one of: cost (reasoning models are expensive), latency (good answers are slow), context quality (1M tokens does not mean 1M useful tokens), evaluation (it is hard to know whether a new model is actually better for your specific task), or hallucination at scale (a 95% accurate model in a workflow with 20 steps is unreliable).

Cost is dropping fast. A given level of quality is several times cheaper today than a year ago. Open weights and competition are the main drivers. Use cases that were uneconomical at GPT-4 prices in 2023 are reasonable now.

The field has settled into something that looks more like a normal infrastructure layer than a frontier of research breakthroughs. There will still be jumps. But for an engineer building products, the day-to-day work is mostly about routing the right model to the right task, evaluating outputs, and integrating LLMs into products without falling into the trap of "let the model do everything."

What to take from Part 4

Three things worth remembering.

The first is that the platform era was about adding capabilities around the same core model. Multimodal, tool use, long context, reasoning. The transformer architecture from 2017 still sits underneath all of it.

The second is that open weights changed the economics. Frontier-quality models that you can run yourself, at a fraction of the API cost, exist. If you are building products on LLMs, this is the single most important practical shift of the last few years.

The third is that capability is no longer one number. A 2026 LLM is a knob you can turn: cheap and fast, smart and slow, multimodal, reasoning, with or without tools, on closed APIs or on your own hardware. The interesting engineering question is no longer "how good is the model" but "which model do I send this request to."

Closing the series

Across four posts we walked from a 1950 paper on whether machines can think to the LLM-powered products you use every day. The architecture changed (perceptron, RNN, attention, transformer). The training changed (supervised, self-supervised, instruction tuning, RLHF, RL on reasoning chains). The scale grew by many orders of magnitude. But the underlying idea is simpler than any individual product makes it look: a stack of weighted sums and squashes, fed enough text, that learns the patterns of how text continues.

The interesting work from here is not about understanding LLMs. It is about building with them. Picking the right tool. Routing the right tasks. Evaluating outputs honestly. Falling for hype slowly.

Thanks for reading.