
AI Hardware Deep Dive: GPUs, TPUs, and the Custom Silicon Race

From NVIDIA Blackwell's 8 TB/s of HBM3e to Google Trillium, AMD MI350, and wafer-scale Cerebras CS-3, the AI accelerator landscape is fragmenting around memory bandwidth, networking, and software lock-in.

Sarah Chen

April 4, 2026



Every frontier model trained in the last 18 months has been a story about silicon as much as software. When OpenAI trained GPT-4 in 2022, the bottleneck was raw FLOPS. By 2025, Anthropic's Claude family, Google's Gemini Ultra 2, and Meta's Llama 4 had all run into the same wall: memory bandwidth, interconnect topology, and the cost of keeping tens of thousands of accelerators in sync for months at a time. The shape of the next generation of AI is being drawn inside fabs at TSMC Arizona, Samsung Pyeongtaek, and Intel Ohio, and the design decisions made there determine which labs can afford to train at the trillion-parameter scale and which cannot.

NVIDIA Hopper and the Blackwell Generation

The H100, launched in March 2022, defined modern AI compute. Built on TSMC's 4N process with 80 billion transistors, it packed 132 streaming multiprocessors, fourth-generation tensor cores with FP8 support, and 80 GB of HBM3 delivering 3.35 TB/s of bandwidth. The Transformer Engine, a software-hardware co-design, dynamically chose between FP8 and FP16 per layer, roughly doubling training throughput on GPT-style workloads compared to the A100.

Blackwell, announced at GTC 2024 and shipping in volume through 2025, is two reticle-limit dies stitched together with a 10 TB/s NV-HBI interface, presented to software as a single GPU. The B200 carries 208 billion transistors, 192 GB of HBM3e spread across eight stacks for 8 TB/s of aggregate bandwidth, or roughly 1 TB/s per stack (the 4.8 TB/s figure sometimes attached to Blackwell is the Hopper-generation H200's), and second-generation Transformer Engine support for FP4 inference. NVIDIA's GB200 NVL72 rack ties 36 Grace CPUs and 72 Blackwell GPUs together with a fifth-generation NVLink Switch fabric pushing 1.8 TB/s per GPU, more than an order of magnitude beyond a PCIe Gen5 x16 link.

The architectural takeaway is that NVIDIA is no longer selling chips; it is selling pre-integrated systems where the chassis, copper cabling, optical transceivers, liquid cooling loop, and CUDA stack are co-designed. SemiAnalysis estimates the per-GPU bill of materials for a GB200 system at roughly 60 percent silicon and 40 percent networking, packaging, and cooling, a ratio inverted from the V100 era.

Figure: A modern AI training pod, shown as a dense rack of GPUs with copper and optical interconnects. The accelerators are visible, but most of the cost lives in the cables, switches, and cold plates around them.

Google TPU: The Systolic Array Bet, Three Times Validated

While NVIDIA generalized the GPU into an AI accelerator, Google built a chip specifically for matrix multiplication from day one. The TPU's systolic array, a 2D grid of multiply-accumulate units where data rhythmically pulses through the fabric, eliminates the register file traffic that consumes a meaningful fraction of GPU energy. TPU v4, deployed in production from 2022, organized 4,096 chips into a 3D torus with optical circuit switches that let Google dynamically reconfigure the topology per job.
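The "rhythmic pulse" of a systolic array can be made concrete with a small simulation. The sketch below is illustrative only and assumes an output-stationary design, where each processing element (PE) keeps one output value while rows of A flow left to right and columns of B flow top to bottom. The function name and the dataflow variant are assumptions made for illustration, not a description of how any TPU generation is actually wired.

```python
def systolic_matmul(A, B):
    """Simulate C = A @ B on an n x m grid of multiply-accumulate PEs.

    Assumed output-stationary dataflow: PE (i, j) accumulates C[i][j].
    Row i of A enters from the left delayed by i cycles; column j of B
    enters from the top delayed by j cycles, so matching operands meet.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]      # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]    # A operand currently held by each PE
    b_reg = [[0] * m for _ in range(n)]    # B operand currently held by each PE
    cycles = n + m + k - 2                 # time for the last operands to arrive

    for t in range(cycles):
        # Shift A operands one PE to the right and inject at the left edge.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i
            a_reg[i][0] = A[i][s] if 0 <= s < k else 0
        # Shift B operands one PE down and inject at the top edge.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j
            b_reg[0][j] = B[s][j] if 0 <= s < k else 0
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


if __name__ == "__main__":
    C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, "in", cycles, "cycles")
```

Each operand is read from outside the array once and then handed from neighbor to neighbor, which is the property that lets the design avoid repeated register file reads.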

TPU v5p, announced alongside Gemini 1.0 in December 2023, scaled this to 8,960 chips in a single pod with 459 teraFLOPS of BF16 per chip and 95 GB of HBM at 2.76 TB/s. Trillium (v6), announced in May 2024 and generally available in late 2024, delivered a claimed 4.7x peak compute improvement per chip over v5e, 32 GB of HBM at 1.6 TB/s, and a third-generation SparseCore for embedding-heavy workloads. Google's argument, which Norm Jouppi laid out in a series of ISCA papers, is that the systolic array's deterministic dataflow makes it cheaper to keep utilized: there are no warp schedulers, no L1 cache misses, no branch divergence.

The Challenger Set: AMD, AWS, and the Hyperscaler ASICs

AMD's MI300X, launched December 2023, was the first credible alternative to NVIDIA at the inference frontier. Its 192 GB of HBM3 at 5.3 TB/s outpaced the H100 on memory-bound workloads, and Microsoft, Meta, and Oracle all placed multi-billion-dollar orders through 2024. The MI350 series, announced at AMD Advancing AI 2024 and shipping mid-2025, moves to a chiplet-heavy CDNA 4 architecture with 288 GB of HBM3e and native FP4 support, targeting the same FP4 inference workloads Blackwell is optimized for.

AWS Trainium2, announced at re:Invent 2023 and generally available from re:Invent 2024, is the first ASIC from a hyperscaler other than Google that publicly competes head-to-head with NVIDIA for large-model training. Anthropic's Project Rainier, disclosed in late 2024, is a Trainium2 cluster of hundreds of thousands of chips intended for next-generation Claude training. Inferentia2 handles serving. Meta's MTIA v2 (2024) targets recommendation and ranking workloads, where the FLOPS-to-bandwidth ratio looks nothing like an LLM. Microsoft's Maia 100, announced November 2023, is Azure's bet on vertical integration; OpenAI is reportedly working with Broadcom and TSMC on a custom inference ASIC targeting 2026 production.

The Specialists: Groq, Cerebras, and Deterministic Dataflow

Groq's LPU strips out almost everything a GPU has, including HBM. Each chip carries 230 MB of on-die SRAM at 80 TB/s and runs a statically compiled schedule with no caches, no branch predictors, and no out-of-order execution. The result: Llama 3 70B inference at 300+ tokens per second per user, with first-token latency in the low tens of milliseconds. The tradeoff is that the model has to fit, sharded across hundreds of chips, in SRAM, and the cluster has to be sized to the workload.
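The "hundreds of chips" figure follows from simple division. The sketch below estimates how many 230 MB SRAM chips are needed just to hold the weights of a 70-billion-parameter model. The weight precisions are assumptions for illustration, and the estimate ignores KV cache, activations, and any replication, so a real deployment would need more.

```python
SRAM_BYTES_PER_CHIP = 230_000_000      # 230 MB of on-die SRAM, from the text
LLAMA3_70B_PARAMS = 70_000_000_000     # nominal parameter count


def chips_to_hold_weights(params, bytes_per_param, sram_bytes_per_chip):
    """Minimum chip count so that the weights alone fit in on-die SRAM."""
    total_bytes = params * bytes_per_param
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bytes // sram_bytes_per_chip)


if __name__ == "__main__":
    # Assumed precisions; the precision used in any given deployment may differ.
    for label, bytes_per_param in [("16-bit", 2), ("8-bit", 1)]:
        chips = chips_to_hold_weights(
            LLAMA3_70B_PARAMS, bytes_per_param, SRAM_BYTES_PER_CHIP
        )
        print(f"{label} weights: at least {chips} chips")
```

Under these assumptions the floor is 609 chips at 16 bits per weight and 305 at 8 bits, which is why cluster sizing is part of the product rather than an afterthought.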

Cerebras CS-3, announced March 2024, takes the opposite approach: build the entire accelerator on a single 46,225 square millimeter wafer with 900,000 cores and 44 GB of on-chip SRAM at 21 PB/s. A single CS-3 reaches 125 petaFLOPS of FP16. The Condor Galaxy 3 supercomputer, built with Abu Dhabi-based G42, chains 64 of these into an 8 exaFLOPS system (64 × 125 petaFLOPS). The pitch is that you sidestep the off-chip memory and networking penalties that define GPU clusters; the catch is that you live entirely inside the Cerebras software stack.

Memory Bandwidth Is the New FLOPS

Three years ago, NVIDIA marketing led with peak FLOPS. Today, the slide that matters is HBM bandwidth. The reason is mechanical. A 175-billion-parameter model in BF16 is 350 GB; to generate a single output token, the autoregressive loop reads roughly that volume from memory once. At 8 TB/s of aggregate HBM3e bandwidth on B200 (or 4.8 TB/s on H200), the theoretical token rate is bounded by bandwidth, not compute: even ignoring capacity, a B200 could stream those 350 GB only about 23 times per second. This is why FP4 and FP8 matter so much: shrinking weights from 16 bits to 4 bits quadruples the effective bandwidth and roughly quadruples tokens per second.

H100 SXM5: 80 GB HBM3, 3.35 TB/s, 989 TFLOPS BF16 dense, 700 W TDP
H200: 141 GB HBM3e, 4.8 TB/s, same compute as H100, refreshed for memory-bound inference
B200: 192 GB HBM3e, ~8 TB/s aggregate, 2.25 PFLOPS BF16 dense, 1,000 W TDP
TPU v5p: 95 GB HBM, 2.76 TB/s, 459 TFLOPS BF16, optical-switched 3D torus
Trillium (v6): 32 GB HBM, 1.6 TB/s, ~4.7x v5e peak compute, third-gen SparseCore
AMD MI300X: 192 GB HBM3, 5.3 TB/s, 1.3 PFLOPS BF16, 750 W TDP
Groq LPU: 230 MB SRAM, 80 TB/s on-die, no HBM, deterministic compile
Cerebras CS-3: 44 GB SRAM, 21 PB/s, 900K cores, 125 PFLOPS FP16 per wafer
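The bandwidth bound described in this section can be turned into a back-of-the-envelope calculator. The sketch below assumes batch size 1, that every weight is read exactly once per generated token, and that KV-cache traffic, capacity limits, and multi-chip sharding overheads are ignored. It is an upper bound on decode speed under those assumptions, not a benchmark. The bandwidth figures are taken from the list above.

```python
def max_tokens_per_second(params, bits_per_weight, bandwidth_tb_s):
    """Upper bound on decode rate when every weight is read once per token."""
    bytes_per_token = params * bits_per_weight / 8
    return bandwidth_tb_s * 1e12 / bytes_per_token


# Memory bandwidth in TB/s, from the spec list above.
BANDWIDTH_TB_S = {
    "H100 SXM5": 3.35,
    "H200": 4.8,
    "B200": 8.0,
    "AMD MI300X": 5.3,
}

if __name__ == "__main__":
    params = 175e9  # the 175-billion-parameter example from the text
    for chip, bandwidth in BANDWIDTH_TB_S.items():
        bf16 = max_tokens_per_second(params, 16, bandwidth)
        fp4 = max_tokens_per_second(params, 4, bandwidth)
        print(f"{chip:10s}  BF16: {bf16:6.1f} tok/s   FP4: {fp4:6.1f} tok/s")
```

For the B200 this gives about 23 tokens per second at BF16 and about 91 at FP4. Batching changes the picture, because one pass over the weights can then serve many users at once.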

Networking: The Hidden Half of the Stack

A 100,000-GPU training cluster is, electrically, a distributed shared-memory machine the size of a warehouse. NVLink Switch handles intra-rack traffic at 1.8 TB/s per GPU. Between racks, the fight is between InfiniBand, at 400 Gbps per port for NDR and 800 Gbps for XDR, and Ethernet at comparable rates carrying RoCE v2. Meta's 24,576-GPU cluster paper from March 2024 disclosed two parallel designs, one with RoCE and one with InfiniBand, and reported comparable training throughput. For inter-building links, optical interconnects with co-packaged optics are moving from research into production; NVIDIA, Broadcom, and Marvell all demoed CPO switches in 2024.
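To see why the gap between intra-rack and inter-rack bandwidth matters, consider the textbook cost model for a ring all-reduce, in which each participant sends and receives 2(N-1)/N times the message size. The sketch below applies that model. The ring algorithm, the 10 GB message size, and the use of headline link rates as sustained one-way throughput are all simplifying assumptions; production collectives use hierarchical algorithms and achieve less than line rate.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_bytes_per_s):
    """Bandwidth-only time for a ring all-reduce (latency terms ignored)."""
    if num_gpus < 2:
        return 0.0
    # Reduce-scatter plus all-gather: each GPU moves 2 * (N - 1) / N messages.
    traffic = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic / link_bytes_per_s


GRADIENT_BYTES = 10e9        # hypothetical 10 GB gradient bucket
NVLINK_BYTES_PER_S = 1.8e12  # 1.8 TB/s per GPU, headline figure from the text
PORT_BYTES_PER_S = 800e9 / 8 # one 800 Gbps port is 100 GB/s

if __name__ == "__main__":
    for label, rate in [("NVLink", NVLINK_BYTES_PER_S),
                        ("800G port", PORT_BYTES_PER_S)]:
        t = ring_allreduce_seconds(GRADIENT_BYTES, 72, rate)
        print(f"{label:10s} {t * 1e3:7.1f} ms per all-reduce across 72 GPUs")
```

Under this model the same collective is 18 times slower over a single 800 Gbps port than over NVLink, which is the arithmetic behind keeping the most communication-heavy parallelism inside the rack.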

The next gigawatt training run will be limited by how cheaply we can move bits between buildings, not by how many FLOPS we can pack into a single rack.

The CUDA Moat and the Software Counterattack

NVIDIA's hardest-to-replicate asset is not the GPU; it is CUDA, the 18-year-old C++ extension and the layered libraries above it: cuBLAS, cuDNN, NCCL, CUTLASS, TensorRT-LLM, and the kernels written against them. Every alternative accelerator has to answer the same question: how does a PyTorch model written for CUDA run on this chip with acceptable performance? AMD's ROCm has closed a meaningful share of the gap on inference; Google relies on XLA and JAX; AWS Neuron compiles to Trainium and Inferentia; Groq compiles statically. OpenAI's Triton, originally a research project, has emerged as a portable kernel-authoring language that targets multiple backends and is now used in production by Meta, Anthropic, and others.

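# A Triton kernel: portable across GPUs and increasingly across other accelerators.
# Row-wise softmax; assumes contiguous rows and BLOCK >= n_cols (a power of two).
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK: tl.constexpr):
    row = tl.program_id(0)            # one program instance per row
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols              # guard the padding lanes past the row end
    # Padding lanes load -inf so they contribute exp(-inf) = 0 to the sum.
    x = tl.load(in_ptr + row * n_cols + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)         # subtract the row max for numerical stability
    num = tl.exp(x)
    den = tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + cols, num / den, mask=mask)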
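When porting a kernel like this across backends, it helps to have a device-independent reference to compare against. The sketch below is a plain-Python mirror of what one program instance of the kernel above computes for a single row, including the padding of out-of-range lanes with negative infinity. The function name is hypothetical and the code is meant as a correctness oracle, not as something to run in production.

```python
import math


def softmax_row_reference(row, block):
    """Mirror of one kernel program: masked load, max-subtract, exp, normalize."""
    n_cols = len(row)
    if block < n_cols:
        raise ValueError("block must cover the whole row, as in the kernel")
    # Lanes past the end of the row behave like the masked load with other=-inf.
    x = [row[c] if c < n_cols else -math.inf for c in range(block)]
    row_max = max(x)
    num = [math.exp(v - row_max) for v in x]   # exp(-inf) is 0.0 for padding
    den = sum(num)
    # Only the first n_cols lanes are stored, matching the masked store.
    return [num[c] / den for c in range(n_cols)]


if __name__ == "__main__":
    print(softmax_row_reference([1.0, 2.0, 3.0], block=4))
```

Subtracting the row maximum before exponentiating is what keeps large logits from overflowing, and the same trick appears in the kernel.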

Economics: Why Inference Is Migrating Off GPUs

Training is bursty, expensive, and runs for months on the most flexible hardware available, which today means NVIDIA. Inference is the opposite: steady-state, cost-sensitive, and running for the model's entire useful life. Serving cloud inference at 1 million tokens per dollar requires accelerators on which the per-token costs of memory bandwidth, power, and rack space all collapse together. That is the niche AWS Inferentia2, Google TPU v5e, Groq LPU, and the rumored OpenAI ASIC are designed for. SemiAnalysis estimated in late 2024 that inference already accounts for more than half of NVIDIA's data-center revenue, and that share will be the most contested over the next 24 months. Whichever architecture wins the cost-per-token race for the long tail of production inference will define the economics of consumer AI for the rest of the decade.
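The 1-million-tokens-per-dollar target translates directly into a throughput requirement once an hourly cost is fixed. The sketch below does that conversion. The hourly prices are hypothetical placeholders chosen only to show the shape of the calculation, not quotes for any real accelerator or cloud, and the model assumes the hardware is fully utilized around the clock.

```python
SECONDS_PER_HOUR = 3600


def tokens_per_dollar(tokens_per_second, dollars_per_hour):
    """Tokens served per dollar at a sustained aggregate throughput."""
    return tokens_per_second * SECONDS_PER_HOUR / dollars_per_hour


def required_tokens_per_second(target_tokens_per_dollar, dollars_per_hour):
    """Sustained throughput needed to hit a tokens-per-dollar target."""
    return target_tokens_per_dollar * dollars_per_hour / SECONDS_PER_HOUR


if __name__ == "__main__":
    target = 1_000_000  # the 1 million tokens per dollar figure from the text
    # Hypothetical all-in hourly costs per accelerator (hardware, power, space).
    for dollars_per_hour in (1.0, 2.0, 4.0):
        need = required_tokens_per_second(target, dollars_per_hour)
        print(f"${dollars_per_hour:.2f}/hour -> {need:7.1f} tokens/s sustained")
```

Halving the hourly cost halves the throughput needed to reach the same price point, which is why chips that are slower than a flagship GPU but much cheaper to run can still win on cost per token.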