GPUs Explained, Part 3: The Hardware Behind Modern AI

What an H100, H200, B200, and a GB200 NVL72 rack actually are. Plus HBM, NVLink, training vs inference, and the alternatives like TPUs and AMD MI300X.

Ramachandran Bakthavachalam (Ram)

Part 1 explained why AI runs on GPUs. Part 2 opened up a single GPU to show the SMs, threads, memory, and tensor cores. This post is about what happens when one GPU is not enough.

Spoiler: one GPU is almost never enough. A frontier model in 2026 is too big to fit on one chip and would take years to train on one chip even if it did fit. So the chips are connected. Eight GPUs in a server. Several servers in a rack. Many racks in a data center. Many data centers in a region. Each layer of this stack has its own engineering, its own bottlenecks, and its own jargon.

This post covers the current generation of NVIDIA GPUs (H100, H200, B200, B300), the high-bandwidth memory that everyone is fighting over (HBM), the interconnects that turn many GPUs into one logical machine (NVLink, NVSwitch, NVL72), the difference between training and inference, and the alternatives to NVIDIA: Google TPUs, AMD MI300X, Amazon Trainium, and the more exotic chips like Cerebras and Groq.

The goal is the same as the previous parts. By the end you should be able to read announcements, follow infrastructure conversations, and understand the trade-offs in the space without getting lost in marketing.

The 2026 GPU lineup, in plain terms

NVIDIA refreshes its GPU architecture every two years or so. Each new architecture has a code name. Each architecture has multiple chips at different price points. The names look like alphabet soup until you know the pattern.

The relevant ones for AI in 2026 are:

  • A100 (Ampere, 2020). The chip that trained GPT-3 and most LLMs through 2022. 40 or 80 GB of HBM2e, 312 TFLOPS BF16. Mostly retired from frontier training but still in service for inference.
  • H100 (Hopper, 2022). The dominant AI chip from 2023 to 2024. 80 GB HBM3, 3 TB/s memory bandwidth, around 1 PFLOPS BF16, 2 PFLOPS FP8 with the new tensor cores. Most public LLMs released in 2023 and 2024 trained on H100s.
  • H200 (Hopper refresh, 2024). Same compute as H100 but with 141 GB of faster HBM3e and 4.8 TB/s memory bandwidth. Released to address the memory wall, not the compute wall. Especially good for inference of large models, where memory size and bandwidth matter most.
  • B200 (Blackwell, 2024-2025). The current flagship. Two dies on one package, 192 GB HBM3e, 8 TB/s memory bandwidth, around 2.25 PFLOPS dense BF16, 4.5 PFLOPS dense FP8, and 9 PFLOPS dense FP4 with the second-generation Transformer Engine. Roughly 2.5x the training throughput of an H100 and an even larger gap on inference of big models thanks to the bigger memory and bandwidth.
  • B300 (Blackwell Ultra, 2025). A mid-cycle refresh of Blackwell with more HBM (288 GB), higher bandwidth, and more FP4 throughput. Aimed at the "reasoning model" workload where inference compute has become very heavy.

The pattern here is worth noticing. Across these chips, raw compute (counting the new lower-precision modes) has gone up by close to an order of magnitude in four years. Memory size has gone up about 2.4x (80 GB to 192 GB). Memory bandwidth has gone up about 4x (2 TB/s to 8 TB/s). Compute is racing ahead of memory, just like Part 2 said it would.

A comparison of recent NVIDIA AI GPUs: A100, H100, H200, B200, and B300 across compute (FP8/FP16), memory size, memory bandwidth, and power. Each generation pushes more compute and memory, but memory and bandwidth grow more slowly than compute.

The successor to Blackwell, code-named Rubin, is on NVIDIA's roadmap for 2026 and 2027. It is expected to bring HBM4, NVLink 6, and another big jump in compute. By the time you read this, some details may already be public.

HBM: the memory that runs AI

You have probably noticed that I keep mentioning HBM. High-Bandwidth Memory is special enough to deserve its own section, because in 2026 the supply of HBM is a bigger constraint on AI than the supply of GPUs.

A normal computer has DRAM chips on a stick (DIMM) connected to the CPU through a few channels. Bandwidth is around 50 to 100 GB/s. This is fine for general computing.

For AI, this is way too slow. A modern GPU needs to read tens of gigabytes per second per SM, and an H100 has 132 SMs. Multiplying out, you need terabytes per second of memory bandwidth, total.
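
As a sanity check, here is that multiplication spelled out; the per-SM figure is an illustrative assumption, not an NVIDIA spec:

    # Back-of-envelope: aggregate read bandwidth an H100's SMs could demand.
    # The per-SM rate is an illustrative assumption, not a published spec.
    sms = 132                        # SMs on an H100
    gb_per_s_per_sm = 20             # assumed sustained read rate per SM, in GB/s
    total_gb_per_s = sms * gb_per_s_per_sm
    print(total_gb_per_s)            # 2640 GB/s, i.e. ~2.6 TB/s, vs ~3 TB/s of HBM3 on an H100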

The way to get there is HBM. Instead of putting DRAM on a separate stick connected by a long wire, you stack many DRAM dies vertically on top of a base die, and put that stack right next to the GPU on the same package, connected by a very wide bus.

An HBM stack: multiple DRAM dies are stacked vertically on a base die using through-silicon vias, and the whole stack sits next to the GPU die on a silicon interposer with thousands of fine connections.

A single HBM stack might have 8 to 12 DRAM dies, connected by through-silicon vias (TSVs) that pass signals straight through the silicon. The stack sits on a silicon interposer that connects it to the GPU using thousands of microscopic wires. The bus is more than 1000 bits wide per stack, compared to 64 bits for a normal DDR memory channel.

The H100 has 5 active HBM3 stacks, the H200 has 6 HBM3e stacks, the B200 has 8 HBM3e stacks across two dies. Each stack contributes around 1 TB/s of bandwidth. Adding more stacks is how you grow memory bandwidth.
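
The per-stack figure follows from the bus width and the per-pin data rate. A quick sketch with round generation-level numbers (the pin rates are approximate, not any single vendor's exact spec):

    # Per-stack HBM bandwidth = (bus width in bits / 8) * per-pin data rate in Gb/s.
    # Pin rates below are rough generation-level figures, not exact product specs.
    bus_bits = 1024                                  # one HBM stack exposes a 1024-bit bus
    for gen, gbps_per_pin in [("HBM3", 6.4), ("HBM3e", 8.0)]:
        gb_per_s = bus_bits / 8 * gbps_per_pin
        print(gen, round(gb_per_s), "GB/s per stack")   # ~820 (HBM3) to ~1000+ (HBM3e)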

The reason HBM is constrained is that very few companies can make it. As of 2026, the main suppliers are SK hynix, Samsung, and Micron. Each new generation (HBM2 -> HBM2e -> HBM3 -> HBM3e -> HBM4) takes years to ramp up. NVIDIA, AMD, and Google all want as much as possible. There are not enough HBM stacks in the world to satisfy demand.

This is part of why GPUs are expensive and hard to get. The compute die itself is not that exotic by chip standards. The HBM stacks, the interposer, and the packaging that puts everything together (CoWoS, made mostly by TSMC) are the bottleneck. AI capex in 2026 is, to a first approximation, capex on HBM and on the packaging that connects HBM to compute.

A single B200 has 192 GB of HBM. A serious LLM in 2026 has hundreds of billions or even trillions of parameters. At 2 bytes each (BF16), a trillion-parameter model is 2 TB of weights, before you count activations and gradients. That does not fit on one GPU.
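
Spelled out as arithmetic, and still ignoring activations, gradients, and optimizer state, which make training memory several times larger:

    import math

    # Weight memory for a 1-trillion-parameter model, weights only.
    params = 1_000_000_000_000
    bytes_per_param = 2                                  # BF16
    weights_tb = params * bytes_per_param / 1e12         # 2.0 TB of weights
    gpus_just_for_weights = math.ceil(weights_tb * 1000 / 192)   # B200: 192 GB of HBM
    print(weights_tb, "TB ->", gpus_just_for_weights, "B200s just to hold the weights")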

So you have to split the model across GPUs and have them work together. To do this efficiently, the GPUs need to talk to each other very fast. PCIe, the bus that normally connects a CPU to a GPU, is far too slow for this. PCIe 5.0 is around 64 GB/s per direction. The math inside a GPU happens at terabytes per second. PCIe is a hundred times too slow.

NVIDIA's answer is NVLink, a custom GPU-to-GPU link that runs much faster than PCIe. The latest version (NVLink 5, on Blackwell) gives 1.8 TB/s of total bandwidth per GPU, or 900 GB/s in each direction. That is roughly 14 times a full PCIe 5.0 x16 link, and within an order of magnitude of HBM bandwidth.

But NVLink is point-to-point. To connect 8 GPUs in a server, you do not want each pair to have its own dedicated link. You want a switch fabric, like a mini network. NVIDIA built NVSwitch, a chip that acts like a router for NVLink. With NVSwitches in the middle, every GPU can talk to every other GPU at full NVLink bandwidth.

A standard 8-GPU server (NVIDIA calls it a DGX) has 8 GPUs and 4 NVSwitch chips. The 8 GPUs can do all-to-all communication at the speed of NVLink. This is what makes it possible to treat 8 GPUs almost like one big GPU with 8 times the memory and 8 times the compute, for many workloads.

This was the standard for years. NVIDIA's 2024-2026 generation pushed this further with the NVL72.

An NVL72 rack viewed as a topology: 72 Blackwell GPUs and 36 Grace CPUs sit in a single rack, connected through NVSwitch trays so every GPU can talk to every other GPU at full NVLink bandwidth, the whole thing acting like one big accelerator.

The NVL72 is a single rack with 72 Blackwell GPUs and 36 Grace CPUs (NVIDIA's ARM-based CPU), all connected by NVLink through a tray of NVSwitches at the back of the rack. From a software point of view, the whole rack acts like one big accelerator with around 13.5 TB of HBM and roughly 360 PFLOPS of dense FP8 compute (and over 1 EFLOPS in sparse FP4). The communication between any two GPUs in the rack happens over NVLink at 1.8 TB/s, not over a slower network.

The NVL72 draws around 120 kilowatts of power and has to be liquid-cooled because air cooling cannot move that much heat in such a small space. A single NVL72 costs north of three million dollars in 2026.

For multi-rack training, racks are connected by either InfiniBand (the high-speed network used in supercomputers, which NVIDIA sells through its Mellanox acquisition) or Spectrum-X Ethernet (NVIDIA's AI-optimized Ethernet). These run at 400 to 800 Gbps per port and are slower than NVLink but fast enough to coordinate gradients between racks during training.

A modern frontier training cluster in 2026 has hundreds of NVL72 racks, connected by InfiniBand into a single training run. Total: 50,000 to 100,000 GPUs, several hundred MW of power, and an interconnect fabric that costs almost as much as the GPUs themselves.

Distributed training: how you actually use 100,000 GPUs

Connecting GPUs is one thing. Splitting a training workload across them efficiently is another. There are three classic ways to parallelize a model across many GPUs, and modern training combines all three.

Data parallelism. The simplest. Every GPU has a full copy of the model. Each GPU processes a different chunk of the input data. After each step, the gradients are averaged across all GPUs (an "all-reduce" operation), and every GPU updates its weights. This is how the first multi-GPU training worked, and it still works for small enough models.
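
A minimal sketch of data parallelism with PyTorch's DistributedDataParallel, assuming a single 8-GPU server and the usual torchrun launch; the model, data, and loss are stand-ins:

    # Data parallelism: every rank holds a full copy of the model, sees different
    # data, and DDP all-reduces the gradients during backward().
    # Launch with: torchrun --nproc_per_node=8 train.py
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")               # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(4096, 4096).cuda()    # stand-in for a real model
    model = DDP(model, device_ids=[rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 4096, device="cuda")  # each rank gets a different chunk of data
        loss = model(x).square().mean()            # stand-in loss
        loss.backward()                            # gradients are all-reduced across ranks here
        opt.step()
        opt.zero_grad()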

The catch is that data parallelism does not help when the model itself is too big to fit on one GPU. If your model is 2 TB of weights and one GPU has 192 GB of memory, you cannot do data parallelism. Every GPU would have to hold the full model.

Tensor parallelism. A single layer (a single matrix multiplication) is split across GPUs. If a layer has a 10,000 × 10,000 weight matrix, you might split it into 8 column blocks of 10,000 × 1,250 each, and each GPU holds and computes one block. The inputs are sent to all GPUs, each computes its slice of the output, and the results are gathered. This needs a lot of communication on every layer, which is why it works best inside one server (over NVLink) and badly across servers (over InfiniBand).
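
A toy version of that column split, kept on one device so it runs anywhere; in a real system each slice would live on a different GPU and the final concatenation would be an all-gather over NVLink:

    # Toy column-wise tensor parallelism: one matmul split into 8 slices.
    import torch

    x = torch.randn(16, 10_000)                  # a small batch of inputs
    w = torch.randn(10_000, 10_000)              # the full weight matrix
    slices = w.chunk(8, dim=1)                   # 8 column blocks of 10,000 x 1,250

    partials = [x @ s for s in slices]           # each "GPU" computes its slice of the output
    y = torch.cat(partials, dim=1)               # gather the column outputs back together

    print((y - x @ w).abs().max())               # tiny rounding difference; the math is identical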

Pipeline parallelism. Different layers go to different GPUs. GPU 1 holds layers 1 to 10, GPU 2 holds layers 11 to 20, and so on. The activations flow forward through the pipeline, and the gradients flow backward. This is much more communication-friendly than tensor parallelism, but it has its own challenges: you need many "micro-batches" in flight at the same time to keep all GPUs busy, otherwise most are idle.
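
A toy sketch of the pipeline idea, again on one device for the sake of being runnable; in practice each stage sits on a different GPU and the micro-batches are overlapped so later stages start before earlier ones finish:

    # Toy pipeline parallelism: 20 layers split into 2 stages, micro-batches flow through.
    import torch
    import torch.nn as nn

    stage1 = nn.Sequential(*[nn.Linear(512, 512) for _ in range(10)])   # layers 1-10
    stage2 = nn.Sequential(*[nn.Linear(512, 512) for _ in range(10)])   # layers 11-20

    batch = torch.randn(64, 512)
    micro_batches = batch.chunk(8)               # 8 micro-batches to keep both stages busy

    outputs = []
    for mb in micro_batches:
        acts = stage1(mb)                        # stage 1 forward pass
        outputs.append(stage2(acts))             # activations handed to stage 2 (another GPU in reality)
    y = torch.cat(outputs)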

Combining them (3D parallelism). Frontier training in 2026 typically uses all three. Within an NVL72 rack, tensor parallelism splits each layer across the 72 GPUs, since they have NVLink bandwidth. Pipeline parallelism splits the layers across racks, sending activations between racks over InfiniBand. Data parallelism replicates the whole pipeline across many copies, each copy seeing different data, with a global all-reduce of gradients at the end of each step.
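
One way to picture how the axes multiply out; the cluster size and the split below are illustrative assumptions, not anyone's published training recipe:

    # Factoring one hypothetical training run into three parallelism axes.
    tensor_parallel = 72       # split each layer across one NVL72 rack (NVLink)
    pipeline_parallel = 8      # spread the layers across 8 racks (InfiniBand)
    data_parallel = 8          # 8 replicas of the whole pipeline, each on different data
    total_gpus = tensor_parallel * pipeline_parallel * data_parallel
    print(total_gpus)          # 4608 GPUs participating in a single training run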

There is also expert parallelism for mixture-of-experts (MoE) models, and sequence parallelism for very long contexts. The details get hairy, but the basic idea is the same: split the work along several axes at once, and try to keep the communication on the fastest links you have.

In practice, you do not write all this from scratch. Frameworks like PyTorch with FSDP, DeepSpeed, Megatron-LM, and JAX with pjit handle it for you. But knowing which axis is which is essential when something is slow and you have to figure out why.
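
As one example, sharding a model with PyTorch FSDP is a couple of lines; this is a minimal sketch assuming the same torchrun launch as before, not a production training setup:

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # instead of replicating them on every GPU as plain data parallelism does.
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())

    model = torch.nn.Transformer(d_model=1024).cuda()   # stand-in for a real model
    model = FSDP(model)                                  # parameters now sharded across all ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)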

Training vs inference: different shapes of bottleneck

You may have noticed I keep saying "training" and "inference" as if they are different things. They are. The hardware looks the same but the workload is very different, and that changes which bottleneck matters.

Training runs the full forward and backward pass on a batch of data, computes the gradients, and updates the weights. The activations from every layer have to be saved during the forward pass, because they are needed for the backward pass. Training is mostly compute-bound: lots of matrix multiplications, big batches, gradients flowing in both directions. The big costs are FLOPs and the bandwidth of the all-reduce between GPUs.

Inference has two phases.

The first is prefill: the model reads the prompt and processes all input tokens at once. Prefill is compute-bound, like a forward pass, because you can do all the tokens in parallel as a big matrix multiply.

The second is decode: the model generates output tokens one at a time. Each decode step does the same computation as prefill but for only one token. This is memory-bound, not compute-bound. To generate one token, you have to load the entire model weights from HBM. If your model is 200 GB and your HBM bandwidth is 8 TB/s, you can do at most 40 tokens per second per GPU per request.
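
That ceiling is worth computing once yourself, because it explains so much about inference economics; a quick sketch with the numbers above:

    # Upper bound on single-request decode speed: every output token must stream
    # the full set of weights out of HBM once (KV-cache traffic only makes it worse).
    model_gb = 200                     # weights resident in HBM
    hbm_tb_per_s = 8                   # B200-class memory bandwidth
    tokens_per_s = hbm_tb_per_s * 1000 / model_gb
    print(tokens_per_s)                # 40 tokens/s per GPU for one unbatched request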

This is why inference performance is dominated by memory bandwidth, not raw compute. A B200 with 8 TB/s of HBM bandwidth and 192 GB of memory will generate roughly two and a half times as many decode tokens per second as an H100 with 3 TB/s and 80 GB, simply because it can stream the weights out of memory that much faster. The compute number (FLOPS) barely matters in decode.

Inference engineers spend a lot of time on tricks that get more useful work out of the same memory traffic. Batching runs many requests together so the model weights are loaded once for a batch instead of once per request. KV-cache stores the attention keys and values from previous decode steps, so you do not redo work, but it eats memory. Speculative decoding uses a smaller fast model to guess several tokens ahead, then verifies them with the big model in a single forward pass, getting more tokens per memory load. Continuous batching in modern serving frameworks like vLLM lets new requests join an existing batch instead of waiting.
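
To see why the KV-cache "eats memory", here is a rough per-request size estimate; the model dimensions are illustrative assumptions, not any specific model's:

    # Rough KV-cache size for one request:
    # 2 (K and V) * layers * kv_heads * head_dim * context_length * bytes per element.
    layers, kv_heads, head_dim = 80, 8, 128      # assumed model shape (grouped-query attention)
    context_len = 32_768                         # a 32k-token conversation
    bytes_per_elem = 2                           # FP16/BF16 cache
    cache_gb = 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9
    print(round(cache_gb, 1), "GB of KV-cache for a single request")   # ~10.7 GB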

The rough rule for 2026 is: training is compute-bound and benefits from chips with the most FLOPS, while inference is memory-bound and benefits from chips with the most HBM and the highest HBM bandwidth. This is why the H200 and B200 generations focused so much on memory: inference is now where most of the dollars are spent, and where the bottleneck has shifted.

The alternatives to NVIDIA

NVIDIA is dominant but not alone. A few real alternatives exist in 2026, and more are being built. The space is more interesting than it looks if you only follow the NVIDIA news.

Google TPUs. Tensor Processing Units, Google's custom AI chip. The current generation is TPU v6e and v7 (code-named Trillium and Ironwood). TPUs are very good at matrix math (their main building block is a big systolic array, which is even more specialized than tensor cores) and have their own interconnect (ICI) that scales to TPU "pods" with thousands of chips. TPUs are not for sale: you rent them on Google Cloud. Google trains its own models (Gemini) on TPUs and offers them to customers as an alternative to NVIDIA. The software stack is JAX-first, with PyTorch support through XLA.

AMD Instinct. AMD's competing line, with the MI300X (released 2023), MI325X (2024), and MI350 (2025). The MI300X has 192 GB of HBM3, more than an H100, and is competitive on inference of large models. AMD's software stack (ROCm) has been the main weakness historically. By 2026 it has improved, and serious workloads run on AMD, especially for inference where the software requirements are simpler. AMD is the closest direct competitor to NVIDIA on hardware, and its main growth in 2025 and 2026 has been in inference for hyperscalers.

Amazon Trainium. Amazon's custom AI chips, with Trainium2 widely deployed in 2025-2026. Trainium is offered only on AWS, used to train Amazon-internal models and rented to customers as a cheaper alternative to NVIDIA on EC2. Performance per dollar is competitive, but the software stack and ecosystem are smaller than NVIDIA's.

Meta MTIA, Microsoft Maia, Apple's silicon. Most hyperscalers are building custom AI chips. Meta has MTIA for inference. Microsoft has Maia. Apple has the Neural Engine in its M-series chips, which is excellent for on-device inference. None of these is sold to outsiders, but they reduce the hyperscalers' dependence on NVIDIA over time.

Cerebras. A different bet entirely. Cerebras builds wafer-scale chips: instead of cutting a silicon wafer into many small dies, the WSE-3 is one giant chip the size of a dinner plate, with 4 trillion transistors and 900,000 cores. It avoids the inter-GPU communication problem by putting everything on one chip. Cerebras is best known for inference of medium-sized models at very high speed, often used as a "fast inference API" for customer-facing chatbots that need low latency.

Groq. Builds an LPU (Language Processing Unit) that is also designed for fast inference. Different architectural approach than GPUs, with deterministic timing and very high decode throughput. Used by Groq's own inference cloud and a few partners. Like Cerebras, the niche is fast inference rather than training.

SambaNova, Tenstorrent, Graphcore (acquired by SoftBank in 2024), and several Chinese companies (Huawei Ascend, Cambricon, Biren). A long tail of alternatives, mostly competing on specific workloads or specific markets.

The pattern across all alternatives is that they specialize. They give up some flexibility for more speed on a narrow set of workloads. NVIDIA's strength is generality: a CUDA program that runs on an A100 still runs on a B200, with the same library calls, the same frameworks, the same deployment tools. That portability is hard to match, and is what keeps NVIDIA dominant even when other chips are cheaper or faster on a specific workload.

What this means for AI engineers

Here is the practical takeaway if you are a software engineer moving toward AI work.

You do not need to know how to design a GPU. You do need to know what one can and cannot do, so you can reason about cost, latency, and bottlenecks.

For training:

  • Frontier training is in the 10,000+ GPU range, with custom infrastructure. Most companies do not do it, because the cost is hundreds of millions of dollars per run.
  • Mid-tier training (fine-tuning a 70B parameter model, say) needs 8 to 64 GPUs and runs in days. This is the range where most production-relevant work happens.
  • Small training (fine-tuning a 7B model with LoRA, say) often fits on a single 8-GPU server or even one GPU. This is the range most teams will touch.

For inference:

  • Memory and bandwidth dominate. The size of your model and the HBM of your GPU set the floor of what is possible.
  • Batching, KV-cache, and the right serving framework matter more than the raw chip choice.
  • For very low latency or very high throughput, alternatives like Cerebras and Groq might beat NVIDIA on cost per token. For everything else, NVIDIA on a managed cloud is the default.

For cost:

  • An H100 costs around $25,000 to $35,000 to buy and rents for $2 to $5 per hour on cloud, depending on the provider, contract length, and supply (a buy-versus-rent break-even is sketched after this list).
  • A B200 costs around $35,000 to $45,000 to buy and rents for $4 to $8 per hour.
  • A full NVL72 rack rents for around $200 to $300 per hour.
  • A frontier-scale cluster (10,000+ GPUs) is built and operated, not rented by the hour. Hyperscalers and a few well-funded startups own these.
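
The buy-versus-rent break-even is a useful back-of-envelope; this sketch uses mid-range values from the list above and ignores power, hosting, networking, and resale value:

    # Buy-vs-rent break-even for a single H100, ignoring power, hosting, and resale.
    purchase_usd = 30_000              # mid-range of the purchase price quoted above
    rent_usd_per_hour = 3.50           # mid-range of the cloud rate quoted above
    breakeven_hours = purchase_usd / rent_usd_per_hour
    print(round(breakeven_hours), "hours, i.e. about a year of 24/7 use")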

For careers:

  • Knowing how to use GPUs well (writing fast kernels, profiling, optimizing serving) is one of the highest-paying specializations in software in 2026. The supply of engineers who actually understand the hardware is small.
  • You do not have to specialize in low-level GPU work. Understanding the trade-offs at a high level is enough to be effective in most AI engineering roles.

What is on the horizon

A few things to watch over the next two years:

HBM4 will start shipping in 2026 and 2027, with significantly more bandwidth and capacity per stack. The memory wall will not close, but it will be pushed back a bit.

Optical interconnects. Today's GPU-to-GPU communication uses copper. The longer the distance, the harder copper gets. Optical interconnects are starting to appear inside servers and across racks. NVIDIA, Broadcom, and several startups are pushing this. By 2027 or 2028, racks may communicate optically, lowering power and increasing bandwidth.

Power and cooling. Today's biggest AI clusters use 100 to 500 MW. The next generation will use gigawatts. This is bumping into the limits of regional power grids in places like Northern Virginia and Phoenix. Future training runs will be sited based on power availability as much as anything else, and may be split across multiple regions, with the model state synchronized over long-distance networks. Liquid cooling, which used to be exotic, is now standard.

Disaggregated inference. Instead of running prefill and decode on the same GPU, future systems will route prefill (compute-bound) to one pool of GPUs and decode (memory-bound) to another pool with more memory bandwidth. The hardware mix in a serving cluster may become more diverse.

Custom silicon at hyperscale. By 2027, more than half of inference at the big clouds may run on chips that are not from NVIDIA. Training will likely still be NVIDIA-dominated for at least a few more years, but even there, alternatives are growing.

The space moves fast. The fundamentals do not. If you have followed this series, you have the mental model to read any new announcement and understand the actual significance of the change. Bandwidth, capacity, FLOPS, interconnect. That is the ledger.

Wrapping up the series

Three things to carry from these three posts:

The first is that AI runs on GPUs because the math (mostly matrix multiplication) is parallel by construction, and GPUs are chips designed to do parallel math very fast. CPUs cannot match this without becoming GPUs, and that is the deep reason CPUs lost AI.

The second is that a GPU is a structured machine, not a black box. SMs run thread blocks. Threads are grouped into warps that execute in lockstep. Tensor cores do most of the AI math. Memory bandwidth is usually the bottleneck before raw compute.

The third is that one GPU is rarely enough. Modern AI is built on clusters: NVLink to connect GPUs in a server, NVSwitch to connect servers in a rack, InfiniBand to connect racks in a cluster, all the way up to data centers drawing hundreds of megawatts. Software has to split work across all these layers, and the people who can do this well are highly valued.

If you are moving from general software engineering into AI, the GPU is one of the things you should understand at this depth. It will come up in cost discussions, in latency debates, in roadmap arguments, in performance bug hunts. Once you have the mental model, you can hold your own in any of these conversations.

There is a lot more to learn, of course. Picking up CUDA, learning to profile GPU workloads with tools like Nsight, understanding the specifics of FlashAttention or PagedAttention, getting fluent with PyTorch's distributed APIs. Each of these is a project of its own. But you now have the foundation. The rest is just specifics.

If you want to keep going with the AI series on this blog, the LLMs Explained series covers the algorithm side, from the 1950s through the modern transformer. Together with this series, you will have a working picture of both the math and the machines that make modern AI work.

That is the end of GPUs Explained. See you in the next post.