GPU (Graphics Processing Unit): Definition & Meaning | AI Glossary

A GPU is a processor with thousands of small cores designed to do many simple operations in parallel. Originally built to render 3D graphics, GPUs turned out to be a near-perfect fit for the matrix multiplications at the heart of neural networks. Today every major AI lab runs on GPU clusters, and the supply of high-end chips — primarily from NVIDIA — is one of the binding constraints on the entire industry.

The reason GPUs matter is that training a transformer is essentially trillions of multiply-and-add operations on tensors. A CPU has a few dozen powerful cores optimised for sequential branchy code; a GPU has thousands of weak cores optimised for the same operation across many numbers at once. For the workload neural networks present, GPUs run roughly 10–100x faster per dollar than CPUs.

The names that show up in AI hardware conversations in 2026:

NVIDIA H100, H200, B200 — the data-centre GPUs that power most frontier training runs. A single H100 costs around $30k; clusters use tens of thousands.
NVIDIA RTX 5090, 4090 — high-end consumer GPUs popular for running open-weight LLMs and image models on a workstation.
Apple Silicon (M3/M4 Ultra) — unified-memory chips that can run 70B-parameter models locally, reshaping personal AI workflows.
Google TPUs — Google's custom AI accelerators, used internally and via Google Cloud.
AWS Trainium / Inferentia — Amazon's in-house alternatives.
AMD MI300, Intel Gaudi — credible challengers chipping away at NVIDIA's monopoly.

For a US developer, the practical question is rarely "which GPU?" — it is "which API?" Renting frontier compute from OpenAI, Anthropic, Google, AWS Bedrock or Together AI is dramatically cheaper than buying hardware. Self-hosting only makes sense at high volume, with strict data residency requirements, or when latency matters enough to keep a model warm in the same datacentre.

If you are building locally, three numbers tell the story: VRAM (how much model fits on the chip), memory bandwidth (how fast tokens generate) and TFLOPS (peak compute). For LLM inference, VRAM and bandwidth usually matter more than raw FLOPS — the bottleneck is moving weights, not multiplying them.

Related terms