The Economics of Training Frontier AI Models: Where the Money Goes

A line-item breakdown of how a billion-dollar training run is actually spent, from H100 cluster amortization to data licensing, and what it means for open source labs trying to keep up.

Alex Turner

April 7, 2026

14 min read

When a frontier lab announces a billion-dollar training run, the headline obscures a far more interesting question: where, exactly, does that money go? GPU rental is the largest single line item, but it is rarely a majority of the total, and its share has been shrinking as data, talent, and reliability engineering have grown. A clean breakdown is the first step toward understanding why only a handful of organizations can run at the frontier and why open source labs face a structural disadvantage rather than merely a capital one.

Compute: The Headline Number

Start with FLOPs. The Chinchilla scaling laws, from Hoffmann et al. at DeepMind in 2022 (arXiv:2203.15556), established the compute-optimal frontier: for a fixed compute budget, parameters and training tokens should scale roughly in proportion, at about 20 tokens per parameter. A 70-billion-parameter Chinchilla-optimal model therefore wants 1.4 trillion tokens, and total training compute is approximately 6 times parameters times tokens, or roughly 5.9 times 10 to the 23 FLOPs.
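The arithmetic in that paragraph can be reproduced in a few lines. This is a minimal sketch that assumes the 20-tokens-per-parameter rule of thumb and the 6ND approximation for a dense transformer; the function name is illustrative and not taken from any library.

```python
def chinchilla_optimal(params: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Return (training tokens, training FLOPs) for a compute-optimal dense model."""
    tokens = tokens_per_param * params
    # Roughly 6 FLOPs per parameter per token (forward plus backward pass).
    flops = 6.0 * params * tokens
    return tokens, flops


tokens, flops = chinchilla_optimal(70e9)
print(f"tokens: {tokens:.2e}, FLOPs: {flops:.2e}")
# prints: tokens: 1.40e+12, FLOPs: 5.88e+23
```

Changing `tokens_per_param` is the quickest way to see how heavily over-trained models, which use far more than 20 tokens per parameter, move the compute bill.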

Translating FLOPs into dollars requires a hardware assumption. An NVIDIA H100 SXM delivers around 989 teraFLOPs of dense BF16 throughput, or about 1.98 petaFLOPs with structured sparsity. Realized utilization on a well-tuned training run lands between 35 and 50 percent of peak, so a 10 to the 25 FLOP run needs roughly 5.6 to 8.0 million H100-hours. At a hyperscaler reserved-instance rate of roughly 2 to 3 dollars per H100-hour in 2025, that is approximately 11 to 24 million dollars in raw compute for the final run alone, before any networking premium for the kind of low-latency InfiniBand fabric these jobs actually need.
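To make the FLOPs-to-dollars conversion explicit, here is a rough sketch using only the peak throughput, utilization band, and hourly rates quoted above. It prices a single uninterrupted run and ignores restarts, ablations, networking premiums, and storage, so it is best read as a floor; the function and constant names are illustrative.

```python
H100_BF16_DENSE_FLOPS = 989e12  # peak FLOP/s, the dense BF16 figure quoted in the text


def training_cost_usd(
    total_flops: float,
    utilization: float,
    usd_per_gpu_hour: float,
    peak_flops: float = H100_BF16_DENSE_FLOPS,
) -> float:
    """Raw GPU rental cost for a run, given realized utilization of peak throughput."""
    gpu_seconds = total_flops / (peak_flops * utilization)
    gpu_hours = gpu_seconds / 3600.0
    return gpu_hours * usd_per_gpu_hour


# Best case: high utilization at the cheap rate. Worst case: the opposite.
low = training_cost_usd(1e25, utilization=0.50, usd_per_gpu_hour=2.0)
high = training_cost_usd(1e25, utilization=0.35, usd_per_gpu_hour=3.0)
print(f"1e25 FLOPs on H100: ${low / 1e6:.1f}M to ${high / 1e6:.1f}M")
```

Utilization and hourly rate enter the formula symmetrically: halving utilization costs exactly as much as doubling the rental price.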

B200 changes the math. NVIDIA's Blackwell generation roughly doubles throughput per chip and meaningfully improves memory bandwidth, which matters more than raw FLOPs for the large-batch, attention-heavy workloads at the frontier. On-demand B200 capacity was priced in the 5 to 7 dollar per hour range in early 2026. At a 2x throughput ratio that is 2.50 to 3.50 dollars per H100-equivalent hour, close to the reserved H100 rates above even at on-demand pricing, and the bandwidth gains push the cost per useful FLOP lower still on memory-bound workloads, which is why the entire industry is rebuilding clusters around it.

Scaling Costs Across GPT-3, GPT-4, and Beyond

Public estimates from Epoch AI, whose database of training compute is the most cited reference in the field, place GPT-3 at roughly 3.14 times 10 to the 23 FLOPs and a hardware cost in the 4 to 5 million dollar range when amortized on V100s at 2020 cloud prices. GPT-4 is estimated at around 2 times 10 to the 25 FLOPs and a hardware cost near 80 to 100 million dollars on a mix of A100 and early H100 capacity. Industry reports on GPT-5 era runs have put pure compute costs in the 500 million to over 1 billion dollar range, depending on assumptions about cluster ownership versus rental.

The growth in spending is not simply a byproduct of larger models. Epoch's analyses consistently show training compute roughly quadrupling per year through the 2020 to 2024 window, faster than Moore's Law and faster than chip-level price-performance gains. The gap between compute growth and hardware efficiency gains is closed by spending more, which is why frontier training budgets have grown by roughly an order of magnitude every two to three years.
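The relationship between those two growth rates can be checked with a short calculation. The sketch below assumes a simple decomposition in which compute growth equals spending growth times the growth in FLOPs per dollar. That decomposition is a simplification introduced here for illustration, not a result reported by Epoch.

```python
def annual_factor(total_factor: float, years: float) -> float:
    """Convert growth by `total_factor` over `years` into a per-year multiplier."""
    return total_factor ** (1.0 / years)


COMPUTE_GROWTH_PER_YEAR = 4.0  # "roughly quadrupling per year", from the text

results = {}
for years in (2.0, 3.0):
    # Budgets grow 10x every two to three years, per the text.
    spend_growth = annual_factor(10.0, years)
    # Whatever compute growth is not paid for with dollars must come from cheaper FLOPs.
    implied_flops_per_dollar_growth = COMPUTE_GROWTH_PER_YEAR / spend_growth
    results[years] = (spend_growth, implied_flops_per_dollar_growth)
    print(
        f"10x per {years:.0f} years: spend x{spend_growth:.2f}/yr, "
        f"implied FLOPs-per-dollar x{implied_flops_per_dollar_growth:.2f}/yr"
    )
```

Under these assumptions, most of the annual compute growth is bought with money rather than delivered by cheaper hardware.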

Figure: Rows of GPU servers in a hyperscale data center aisle. A single Blackwell-class training pod can consume 1.2 megawatts of power and require dedicated water-cooling loops.

Data: The Quiet Capex

Data acquisition has shifted from a near-zero line item to a significant one. Public crawls like Common Crawl remain the foundation, but every frontier lab now layers in licensed corpora: news archives from major publishers, Stack Overflow and GitHub deals, academic publisher licenses, image and video libraries from Shutterstock and Getty equivalents, and music catalogs for multimodal work. Disclosed deals in 2024 and 2025 ranged from low single-digit millions for niche corpora to nine-figure multi-year agreements with the largest publishers.

Curation is the larger hidden cost. A modern pretraining pipeline runs deduplication at exact, near-duplicate, and semantic levels; quality filtering with classifier models that themselves require training; toxicity and PII scrubbing; and language identification and balancing. Synthetic data generation, used heavily for code and reasoning data, requires running large teacher models across billions of prompts. A reasonable estimate is that data work consumes 15 to 25 percent of a frontier training budget, split between licensing and the compute and salaries that go into curation.
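The first two deduplication levels can be illustrated with a toy example. This is a deliberately small sketch, not a description of any lab's pipeline: it hashes normalized text for exact matches and compares word-shingle Jaccard similarity for near duplicates. The pairwise comparison is quadratic, so production systems typically replace it with MinHash and locality-sensitive hashing. Semantic deduplication is omitted, and the 0.8 threshold is an arbitrary assumption.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def shingles(text: str, n: int = 3) -> set:
    """Set of n-word windows over the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs: list, threshold: float = 0.8) -> list:
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of a document already kept
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog near the river bank today",
    "the quick  brown fox jumps over the lazy dog near the river bank today",
    "The quick brown fox jumps over the lazy dog near the river bank yesterday",
    "Training compute has grown roughly fourfold per year",
]
kept = deduplicate(docs)
print(len(kept), "of", len(docs), "documents kept")
```

In this sample the second document is removed as an exact duplicate after normalization, and the third as a near duplicate with a Jaccard similarity of 11/13.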

Salaries, Electricity, and the Rest of the Stack

Research staff costs are substantial. A senior research scientist or research engineer at a frontier lab carries a fully loaded annual cost in the 800,000 to 2,000,000 dollar range, with the long tail of equity-heavy packages for star hires running well into eight figures. A 200-person research and engineering team paid in that band is a 160 to 400 million dollar annual line item before any star-hire packages are counted.

Electricity and cooling are the most physical part of the bill. A 30,000 H100 cluster running at 700 watts per accelerator, plus networking and host CPUs, draws roughly 30 megawatts continuously. At industrial rates of 6 to 10 cents per kilowatt-hour, that is 16 to 26 million dollars per year in raw power, before accounting for PUE overhead from cooling, which typically multiplies the bill by 1.2 to 1.5. Sites are increasingly chosen for power availability rather than latency to users, which is why north Texas, Iowa, and the Nordic countries dominate new buildouts.
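The power bill follows directly from load, hours, and rate. The sketch below uses the 30 megawatt draw, the 6 to 10 cent rates, and the 1.2 to 1.5 PUE band from the paragraph above. It assumes the cluster runs flat out all year, which overstates the bill for any site with idle periods.

```python
HOURS_PER_YEAR = 8760


def annual_power_cost_usd(it_load_mw: float, usd_per_kwh: float, pue: float = 1.0) -> float:
    """Yearly electricity cost for a constant IT load, scaled by facility PUE."""
    kwh_per_year = it_load_mw * 1000.0 * HOURS_PER_YEAR * pue
    return kwh_per_year * usd_per_kwh


# Accelerators alone: 30,000 GPUs at 700 W each is 21 MW; hosts and networking take it to ~30 MW.
gpu_only_mw = 30_000 * 700 / 1e6

raw_low = annual_power_cost_usd(30.0, 0.06)
raw_high = annual_power_cost_usd(30.0, 0.10)
with_pue_low = annual_power_cost_usd(30.0, 0.06, pue=1.2)
with_pue_high = annual_power_cost_usd(30.0, 0.10, pue=1.5)

print(f"GPU-only draw: {gpu_only_mw:.0f} MW")
print(f"Raw power: ${raw_low / 1e6:.1f}M to ${raw_high / 1e6:.1f}M per year")
print(f"With PUE:  ${with_pue_low / 1e6:.1f}M to ${with_pue_high / 1e6:.1f}M per year")
```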

Capex vs Opex: Buy or Rent

Owning a cluster fundamentally changes the cost structure. A 100,000 H100 cluster carries roughly 3 to 4 billion dollars in capex for GPUs alone, plus another 1 to 2 billion for networking, storage, real estate, and electrical infrastructure. Depreciated over four years, that 4 to 6 billion dollars yields an effective capital cost of roughly 1.15 to 1.70 dollars per GPU-hour at full utilization, comfortably below the cloud rental price, but the advantage holds only if utilization stays above 70 percent and the operator absorbs the engineering burden of running the cluster.
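The sensitivity to utilization is easy to see once the depreciation is written out. This sketch assumes straight-line depreciation over four years and counts capital only, with power, staffing, and financing costs left out; the function name and the utilization values looped over are illustrative.

```python
HOURS_PER_YEAR = 8760


def capex_per_gpu_hour(capex_usd: float, num_gpus: int, years: float, utilization: float = 1.0) -> float:
    """Straight-line capital cost per productive GPU-hour."""
    productive_hours = num_gpus * years * HOURS_PER_YEAR * utilization
    return capex_usd / productive_hours


for utilization in (1.0, 0.7, 0.5):
    # 4 to 6 billion dollars of total capex for 100,000 GPUs, per the text.
    low = capex_per_gpu_hour(4e9, 100_000, 4, utilization)
    high = capex_per_gpu_hour(6e9, 100_000, 4, utilization)
    print(f"utilization {utilization:.0%}: ${low:.2f} to ${high:.2f} per GPU-hour")
```

Idle hours are paid for but produce nothing, so the cost per productive hour rises in inverse proportion to utilization.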

This is the underlying logic of the capital partnerships that now dominate the field. Microsoft's roughly 13 billion dollar investment in OpenAI is partially capital and partially Azure compute credits, which lets OpenAI consume compute as opex while Microsoft books the capex on its own balance sheet. Amazon's 4 to 8 billion dollar commitment to Anthropic follows the same template through AWS Trainium and GPU credits. Google trains internally on its own TPU fleet, which is a pure capex play backed by ad cash flow.

What This Means for Open Source and Smaller Labs

A pure-play open lab without hyperscaler ties faces a 2 to 3x cost disadvantage on compute, before any concession on data licensing or talent compensation.
Mistral, AI21, and Cohere all trade some independence for cloud partnerships precisely to neutralize this gap, typically through preferred-rate compute agreements rather than equity.
Open-weight models like Llama 3, Qwen 2.5, and DeepSeek-V3 are economically viable only because the sponsoring entity has ad, e-commerce, or quantitative-trading cash flow to subsidize the run.
Sovereign initiatives in France, the UAE, Singapore, and India are structurally similar: state-backed compute subsidies in exchange for national alignment, because no private actor of moderate size can fund frontier training at market rates.
Distillation, mixture-of-experts, and continued pretraining from open base models have become the only practical avenues for labs with under 100 million dollars to spend, which is most of them.

Inference Unit Economics

Training is a one-time cost; inference is forever. The relevant metric is cost per million tokens, which for a frontier closed-weight model in early 2026 sits between 2 and 15 dollars for input, depending on caching, and between 6 and 60 dollars for output. The marginal cost to the provider, after amortizing the cluster and accounting for typical batch sizes and KV cache reuse, is roughly an order of magnitude below the list price, which is what gives providers room to discount aggressively for high-volume customers.

Open-weight models served on rented H100s at 70 percent utilization land near 0.20 to 0.80 dollars per million output tokens for a 70-billion-parameter dense model, which is why the open ecosystem is so competitive in the commodity segment. The closed labs defend margin through frontier capability, retrieval and tool integration, and reliability guarantees rather than raw price.
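Serving cost per million tokens is a ratio of node rental to billable throughput. The sketch below shows the shape of that calculation. The node size of 8 GPUs, the 2.50 dollar hourly rate, and both throughput figures are placeholder assumptions chosen to illustrate sensitivity, not benchmarks of any real deployment.

```python
def usd_per_million_tokens(
    usd_per_gpu_hour: float,
    num_gpus: int,
    tokens_per_second: float,
    utilization: float,
) -> float:
    """Serving cost per million output tokens for one node at a given utilization."""
    node_usd_per_hour = usd_per_gpu_hour * num_gpus
    billable_tokens_per_hour = tokens_per_second * 3600.0 * utilization
    return node_usd_per_hour / billable_tokens_per_hour * 1e6


# Hypothetical aggregate throughputs for a batched 70B dense model on one 8-GPU node.
modest_batching = usd_per_million_tokens(2.50, 8, tokens_per_second=10_000, utilization=0.7)
heavy_batching = usd_per_million_tokens(2.50, 8, tokens_per_second=40_000, utilization=0.7)

print(f"modest batching: ${modest_batching:.2f} per million tokens")
print(f"heavy batching:  ${heavy_batching:.2f} per million tokens")
```

Under these placeholder inputs, batching efficiency moves the unit cost by a factor of four while the hardware bill stays fixed.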

The frontier is not gated by ideas anymore. It is gated by access to ten-figure pools of patient capital willing to depreciate over four years.

What Changes Next

Three forces are reshaping these economics on a multi-year horizon. First, custom silicon: Google's TPU v6, Amazon's Trainium 2, and Microsoft's Maia chips push 20 to 40 percent of a hyperscaler's internal training workload off NVIDIA, which compresses NVIDIA's pricing power and lowers the cost floor for everyone. Second, algorithmic efficiency: techniques like multi-token prediction, FP8 training, and improved mixture-of-experts routing have together yielded roughly 2 to 4x effective compute gains per generation, partially offsetting the FLOP growth. Third, energy: the binding constraint on the next generation of clusters is not chips but grid interconnection approvals, a shift that has turned utility relationships into a strategic asset.

The likely path through 2027 is a continued bifurcation. A small number of labs running 10-billion-dollar training cycles on owned or partner-subsidized clusters will hold the absolute capability frontier. A larger ecosystem of mid-tier and open-weight labs will sit a generation behind, competing on cost, latency, and specialization. The economic gravity of training has become a more durable moat than any individual model architecture, which is why the cap table of a frontier lab now tells you more about its long-term trajectory than its research roadmap does.