henrywhelan.com
2026-02-23 · RESEARCH

The $700 Billion Question

An analysis of the economics, physics, and politics of training frontier AI models

Executive Summary

The AI industry is in the middle of the largest capital expenditure surge in the history of technology. In 2026, Microsoft, Alphabet, Amazon, Meta, and Oracle will collectively spend between $660 billion and $690 billion on infrastructure, nearly doubling 2025's $381 billion. Most of this money is being poured into GPU clusters, data centers, and power infrastructure to train and serve ever-larger AI models.

But a growing body of evidence suggests that the simple strategy of “make the model bigger, train it on more data” is running into three simultaneous walls: diminishing returns on pre-training scaling, exhaustion of high-quality human-generated training data, and eye-watering energy and financial costs. The industry's response has been a pivot toward alternative scaling strategies — inference-time compute, synthetic data, mixture-of-experts architectures, and distillation — that may reshape the economics of AI in fundamental ways.

This report examines the data behind each of these pressures, evaluates the alternatives, and asks the uncomfortable question: is the current trajectory of AI investment sustainable?

1. The Cost Curve: From Millions to Billions

Training costs are growing at 2.4× per year

According to Epoch AI's landmark analysis, the amortized cost of training frontier AI models has grown at approximately 2.4×/year since 2016 (90% CI: 2.0×–2.9×). The most expensive publicly confirmed training runs include:

| Model | Organization | Year | Est. Cost | Compute (FLOPs) |
|---|---|---|---|---|
| GPT-3 | OpenAI | 2020 | ~$4.6M | 3.1 × 10²³ |
| PaLM | Google | 2022 | ~$12M | 2.5 × 10²⁴ |
| GPT-4 | OpenAI | 2023 | $40–100M | 2.1 × 10²⁵ |
| Gemini Ultra | Google | 2023 | $30–191M | ~5.0 × 10²⁵ |
| GPT-5 | OpenAI | 2025 | est. $300M+ | est. >10²⁶ |

The wide ranges on GPT-4 and Gemini Ultra costs reflect genuine ambiguity. Epoch AI's lower estimates account only for amortized compute; other analyses fold in R&D staff, failed experiments, and hardware procurement, pushing estimates much higher. If the 2.4× trend continues, we should expect training runs costing over $1 billion by 2027 and potentially $5–10 billion by 2029–2030.
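That extrapolation is easy to check with a back-of-envelope calculation. The sketch below applies the 2.4×/year growth rate to GPT-5's estimated ~$300M cost from the table above; the choice of anchor point is an illustrative assumption, not Epoch AI's methodology:

```python
# Extrapolate frontier training-run costs at 2.4x per year, anchored
# to the ~$300M GPT-5 estimate for 2025 (both figures from the table
# above; the anchor choice is an illustrative assumption).
BASE_YEAR = 2025
BASE_COST_M = 300          # millions of USD
GROWTH = 2.4               # Epoch AI's central growth estimate

def projected_cost_millions(year: int) -> float:
    """Projected amortized training cost, in $M, under the 2.4x/yr trend."""
    return BASE_COST_M * GROWTH ** (year - BASE_YEAR)

for year in range(2026, 2031):
    print(f"{year}: ~${projected_cost_millions(year):,.0f}M")
```

Under these assumptions the trend crosses $1 billion in 2027 and approaches $10 billion by 2029, consistent with the ranges quoted above.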

The cost breakdown reveals structural constraints

Energy's low share is striking — and misleading. At 2–6% of a $100M training run, energy costs $2–6M. But as training runs scale to billions of dollars and the industry collectively consumes hundreds of TWh, energy becomes a bottleneck, not because of cost per joule, but because of physical availability of power. You can buy more GPUs. You cannot easily build new power plants.
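The cost-versus-availability distinction can be made concrete by converting a hypothetical 10²⁶-FLOP training run into electricity demand. Every hardware number below is an illustrative assumption (throughput, utilization, power draw, PUE), not a vendor specification or a figure from this report:

```python
# Convert a hypothetical 1e26-FLOP training run into electricity
# demand. Every hardware number here is an illustrative assumption.
TOTAL_FLOPS = 1e26         # rough order of magnitude for a frontier run
CHIP_FLOPS_PER_S = 1e15    # assumed peak accelerator throughput
UTILIZATION = 0.4          # assumed sustained fraction of peak
CHIP_WATTS = 700           # assumed per-accelerator power draw
PUE = 1.2                  # assumed datacenter overhead multiplier

flops_per_joule = (CHIP_FLOPS_PER_S * UTILIZATION) / (CHIP_WATTS * PUE)
energy_gwh = TOTAL_FLOPS / flops_per_joule / 3.6e12   # 1 GWh = 3.6e12 J
run_days = 90
avg_power_mw = energy_gwh * 1000 / (run_days * 24)
print(f"~{energy_gwh:.0f} GWh total, ~{avg_power_mw:.0f} MW sustained")
```

The punchline: at wholesale rates, tens of GWh costs only a few million dollars (squarely in the 2–6% share above), but the tens of megawatts of continuous firm power must physically exist before the run can start.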


2. The Three Walls

Wall 1: Diminishing returns on pre-training compute

The original scaling laws, formalized by Kaplan et al. (2020) and refined by Hoffmann et al. (2022, the “Chinchilla” paper), showed a remarkably smooth power-law relationship between compute and model performance. Double the compute, get a predictable improvement in loss.

But in late 2024, reports from Bloomberg and The Information suggested that OpenAI, Google, and Anthropic were all seeing diminishing returns from simply scaling up pre-training. The improvements from 10× more compute were no longer delivering the expected leap in capabilities.

This doesn't mean the scaling laws are wrong — they describe loss curves, not capability curves. A model can have measurably lower perplexity while showing minimal improvement on the benchmarks that matter to users.

However, the picture is more nuanced than “scaling is dead.” As of early 2026, the latest frontier models continue to show impressive gains on specific benchmarks.

What's changed is that different models are excelling on different tasks. The era of a single model being uniformly “best” seems to be ending. Performance is becoming spiky, not uniform.

Wall 2: Data exhaustion

Epoch AI estimates the effective stock of quality, repetition-adjusted, human-generated public text at roughly 300T tokens. Their 80% confidence interval for when this stock will be exhausted: 2028–2032.

The practical effects are already visible.

The data wall interacts with the compute wall: Chinchilla scaling laws say for compute-optimal training, scale data and model size equally (~20 tokens per parameter). A 10-trillion-parameter model would need ~200 trillion tokens — close to the entire stock of available human text.
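The token arithmetic is worth spelling out. A minimal sketch of Chinchilla-style compute-optimal budgeting (~20 training tokens per parameter, per Hoffmann et al., 2022), measured against Epoch AI's ~300T-token effective stock:

```python
# Chinchilla-style compute-optimal budgeting: ~20 training tokens per
# parameter (Hoffmann et al., 2022), compared against Epoch AI's
# ~300T-token effective stock of human-generated public text.
TOKENS_PER_PARAM = 20
STOCK_TOKENS = 300e12

def optimal_tokens(params: float) -> float:
    """Compute-optimal token budget for a dense model of `params` size."""
    return TOKENS_PER_PARAM * params

for params in (70e9, 1e12, 10e12):
    t = optimal_tokens(params)
    print(f"{params/1e9:>6.0f}B params -> {t/1e12:>5.0f}T tokens "
          f"({t/STOCK_TOKENS:.0%} of the stock)")
```

At 10 trillion parameters, the compute-optimal budget alone consumes two-thirds of everything humans have ever published online.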

Wall 3: Energy and physical infrastructure

The IEA estimates global data center electricity consumption at ~415 TWh in 2024, projected to reach ~945 TWh by 2030. Some estimates suggest 1,000 TWh by 2026 — equivalent to Japan's total electricity consumption.
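The IEA figures quoted above imply a steep compound growth rate, which a one-line calculation makes explicit:

```python
# Implied compound annual growth rate from the IEA figures quoted
# above: ~415 TWh in 2024 rising to ~945 TWh in 2030.
start_twh, end_twh, years = 415, 945, 6
cagr = (end_twh / start_twh) ** (1 / years) - 1
print(f"Implied growth: {cagr:.1%} per year")
```

Roughly 15% annual growth in electricity demand, sustained for six years, is far faster than grid capacity has historically been added in most developed markets.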

The industry's response has been dramatic.

This is the wall that money can't easily solve. GPU prices can drop. Algorithms can improve. But building power generation and transmission infrastructure takes years to decades. It's the slowest-moving variable in the system.


3. The Alternatives: How the Industry Is Adapting

Inference-time compute (“thinking longer”)

The most dramatic pivot has been the shift toward test-time compute scaling — spending more compute at inference rather than training. OpenAI's o1/o3, DeepSeek-R1, and Gemini 2.5 pioneered this approach.

The core insight: letting models “think longer” through extended chain-of-thought produces reasoning capabilities that training alone cannot match. Unlike pre-training compute, inference-time compute has different scaling properties: it is spent per query rather than once up front, so it scales with demand and can be dialed up or down per task.

Key Insight

This is arguably the most important shift in AI economics since the transformer. It decouples capability improvement from the one-time, upfront, massive training investment, converting it to an ongoing, demand-driven cost.
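This shift from upfront to demand-driven cost can be illustrated with simple amortization arithmetic. In the toy model below, all dollar figures (training outlay, per-token price, query volume) are hypothetical, chosen only to show how the two cost terms behave:

```python
# Toy cost model contrasting one-time training spend with
# demand-driven inference spend. All dollar figures are hypothetical.
TRAIN_COST = 100e6            # one-time training outlay, USD
PRICE_PER_M_TOKENS = 10.0     # assumed inference price per 1M tokens

def cost_per_query(queries_served: float, thinking_tokens: int) -> float:
    """Amortized training cost plus per-query inference cost."""
    amortized = TRAIN_COST / queries_served
    inference = (thinking_tokens / 1e6) * PRICE_PER_M_TOKENS
    return amortized + inference

# At 1B queries served, training amortizes to $0.10/query; extending
# the chain of thought 10x scales only the inference term.
short = cost_per_query(1e9, 1_000)
long = cost_per_query(1e9, 10_000)
print(f"${short:.2f} vs ${long:.2f} per query")
```

The design point: the amortized term shrinks with volume no matter what, while the inference term grows with every extra “thinking” token, making capability spend a variable cost for the first time.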

Mixture-of-Experts: More parameters, less compute

DeepSeek's success has shown that sparse MoE architectures can dramatically alter the cost equation: only a fraction of a model's parameters are active for any given token, so per-token compute is decoupled from total parameter count.

DeepSeek's reported training cost of $5.6M for V3 stunned the industry. But context matters: SemiAnalysis estimated the company's total hardware spend at over $500M, and later reporting suggested ~$1.6B including 50,000 NVIDIA GPUs. Still, a marginal cost per training run of roughly $6M rather than $100M+ is a genuine breakthrough.
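The mechanism behind that cost gap is that per-token training compute tracks active parameters, not total parameters. The sketch below uses parameter and token counts illustrative of a DeepSeek-V3-class model and the standard ~6 FLOPs per parameter per token approximation for a forward-plus-backward pass:

```python
# Per-token training compute in a sparse MoE tracks *active*
# parameters, not total. Counts are illustrative of a
# DeepSeek-V3-class model; 6 FLOPs/param/token is the standard
# forward-plus-backward approximation.
FLOPS_PER_PARAM_TOKEN = 6
TOKENS = 14.8e12               # assumed training corpus size

def training_flops(active_params: float) -> float:
    return FLOPS_PER_PARAM_TOKEN * active_params * TOKENS

dense_equiv = training_flops(671e9)   # if every parameter were active
sparse = training_flops(37e9)         # only routed experts are active
print(f"Sparse run needs {sparse/dense_equiv:.1%} of dense-equivalent compute")
```

Activating ~5% of parameters per token means roughly 95% of the dense-equivalent training compute is never spent, which is how a $100M-class run becomes a single-digit-millions run.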

Distillation: Compression without catastrophe

Distillation trains a small “student” model to reproduce the outputs of a large “teacher” model. The implication: expensive frontier models can be efficiently compressed for most practical applications. The $100M model trains once; its knowledge is distilled into $10K models that run on phones.
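A minimal sketch of the classic soft-label formulation (in the style of Hinton et al.): the student is penalized by the KL divergence between its temperature-softened output distribution and the teacher's. The logits below are hypothetical:

```python
import numpy as np

# Minimal soft-label distillation loss: the student matches the
# teacher's temperature-softened output distribution.
def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, T)    # soft targets from the teacher
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)))) * T**2

teacher = [3.0, 1.0, 0.2]
# A student whose logits track the teacher's incurs a smaller loss
# than one that outputs a near-uniform distribution.
print(distill_loss(teacher, [2.9, 1.1, 0.3]) < distill_loss(teacher, [0.5, 0.5, 0.5]))
```

The temperature T softens both distributions so the student also learns the teacher's relative rankings among wrong answers, which is where much of the compressed “knowledge” lives.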

Synthetic data: Teaching models from models

The emerging consensus: synthetic data works well when anchored in human data. Pure synthetic training degrades; synthetic augmentation of real data improves efficiency. The ratio of synthetic to human data will grow, extending the effective life of the human data stock.


4. The $700 Billion Question

The revenue gap

The hyperscalers will spend ~$660–690B on capex in 2026. What are they getting back? So far, revenue is not keeping pace with that spending, and the financial strain is real.

Three scenarios

Scenario 1: The Scaling Continues (Bull Case)

Algorithmic breakthroughs keep capability improvements on track despite diminishing returns from brute-force scaling. Enterprise adoption hits an inflection point in 2027–2028. Revenue catches up. The capex was early, not wasted.

Scenario 2: The Efficiency Revolution (Base Case)

The DeepSeek lesson generalizes: algorithmic efficiency matters more than raw scale. Training costs plateau or decline as MoE, distillation, and better optimization bend the cost curve. The winners are those who do more with less. Some of the $700B capex is stranded.

Scenario 3: The Correction (Bear Case)

Revenue fails to materialize at scale. One or more hyperscalers dramatically cut AI capex. GPU cloud prices collapse (already down 64% from 2024 peaks). The model market commoditizes. Most of the $700B capex is written down over the following decade.


5. What This Means

For Practitioners

The cost of using frontier models is falling rapidly even as the cost of training them rises. GPU cloud prices have dropped 64% from peak. Distilled small models run on phones. You don't need to train a frontier model to benefit from frontier capabilities. Distill, fine-tune, or use the API.

For Investors

The divergence between capex and revenue is the central risk. The question isn't “will AI be transformative?” (it will) but “will the companies spending the most on training infrastructure capture proportional returns?” History suggests infrastructure builders often don't capture the value — it flows to application developers. The railroad companies went bankrupt; the farmers and merchants thrived.

For Researchers

The most impactful work right now isn't on scaling — it's on efficiency. DeepSeek's innovations in MoE architecture, multi-head latent attention, and group relative policy optimization had more impact than any amount of additional compute. The next breakthrough will come from someone who figures out how to do more with less.

For Policymakers

Compute governance faces a challenge: the FLOPs required for frontier capability are a moving target. If AI data centers consume 1,000 TWh by 2026, the intersection of AI policy and energy policy is likely to become the defining governance challenge of the next decade.


Conclusion

Are we hitting the wall on AI scaling? The answer is both yes and no.

Yes, the simple strategy of training bigger dense models on more human-generated data is approaching fundamental limits — data exhaustion, diminishing returns, and energy constraints are all real.

No, because the industry is adapting with remarkable speed. Inference-time compute, mixture-of-experts, distillation, and synthetic data represent genuine paradigm shifts that alter the economics fundamentally. The “wall” in pre-training scaling is catalyzing innovation in other dimensions.

The most likely outcome is that AI development follows a path familiar from previous technological revolutions: the infrastructure gets overbuilt, some investors lose money, but the technology itself changes everything. The question isn't whether AI will be transformative. It's who pays for the transformation, and whether they get their money back.