Chapter 12 · The Data Wall · Updated June 2026

The Data Wall: Exhausting human-written words.

Early scaling relied on scraping the public internet. As the reserve of high-quality human text is depleted and legal barriers rise, AGI development has shifted toward test-time compute scaling and verified synthetic feedback.

The Human Data Wall

Epoch AI projected that the stock of high-quality human-written text on the public internet would be exhausted before 2026. Frontier models are already overtrained by 5× compared to standard compute-optimal levels. Furthermore, publishers are actively withholding access: the MIT Data Provenance Initiative documented a sharp contraction in crawlable content. Cloudflare data shows AI crawler growth slowed from 32% in April 2025 to just 4% in July 2025 as publisher blocking surged. The imbalance behind that blocking is illustrated by Anthropic's lopsided 38,000:1 crawl-to-refer ratio, meaning roughly 38,000 pages crawled for every visitor referred back to a publisher.
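The overtraining figure can be made concrete with a small calculation. The sketch below assumes the commonly cited Chinchilla rule of thumb of about 20 training tokens per parameter as the compute-optimal baseline. Both that ratio and the 70-billion-parameter model size are illustrative assumptions, not figures from this chapter.

```python
# Rule-of-thumb sketch: token demand of an overtrained model.
# TOKENS_PER_PARAM = 20 is the widely quoted Chinchilla approximation,
# used here as an assumption rather than an exact law.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params


def overtrained_tokens(n_params: float, factor: float) -> float:
    """Tokens consumed when training `factor` times past compute-optimal."""
    return factor * compute_optimal_tokens(n_params)


if __name__ == "__main__":
    n_params = 70e9  # hypothetical 70B-parameter model
    optimal = compute_optimal_tokens(n_params)
    actual = overtrained_tokens(n_params, factor=5)
    print(f"compute-optimal: {optimal:.2e} tokens")
    print(f"5x overtrained:  {actual:.2e} tokens")
```

Under these assumptions a single mid-sized model already asks for trillions of tokens, which is why a finite stock of human-written text becomes a binding constraint.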

Test-Time Compute Scaling

As pre-training data runs short, scaling effort has shifted to inference time. Reasoning models (OpenAI o1/o3, DeepSeek-R1, Qwen-QwQ) spend 20,000–60,000 thinking tokens per query to execute self-correction and logical search. This has shifted the economic balance: a single query can cost 4–17× more in compute and latency. Multiplied across usage, that has driven Chinese daily token call volumes to over 140 trillion tokens in Q1 2026 and made inference the dominant infrastructure cost.

Synthetic Data & Self-Play

Hyperscalers are training models on synthetic data. This includes Reinforcement Learning from AI Feedback (RLAIF) under Constitutional AI, self-play distillation (e.g. DeepSeek-R1 generating 800,000 high-quality reasoning examples to train smaller open-source models), and sandbox simulations (generating verifiable data from environment goals, such as training AlphaGeometry 2 on synthetic geometry datasets or agents on Factorio/Minecraft).
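A minimal sketch of the idea behind verified synthetic data, assuming a toy setting: problems are generated programmatically so the ground truth is known, a teacher proposes a reasoning trace and answer, and only verified examples are kept for training a smaller model. The names `Example`, `build_distillation_set`, and `toy_teacher` are hypothetical, and the teacher is a deliberately flawed stand-in for a real model, not any lab's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Example:
    prompt: str
    reasoning: str
    answer: int


def build_distillation_set(
    problems: Iterable[Tuple[int, int]],
    teacher: Callable[[int, int], Tuple[str, int]],
) -> List[Example]:
    """Keep only teacher outputs whose final answer matches ground truth."""
    kept = []
    for a, b in problems:
        reasoning, answer = teacher(a, b)
        # The problem is synthetic, so the correct answer is computable.
        if answer == a + b:
            kept.append(Example(f"What is {a} + {b}?", reasoning, answer))
    return kept


def toy_teacher(a: int, b: int) -> Tuple[str, int]:
    """Hypothetical stand-in for a model: drops the carry from the units digit."""
    total = a + b
    if (a % 10) + (b % 10) >= 10:
        total -= 10  # simulated reasoning error
    return f"Add {a} and {b} digit by digit.", total


if __name__ == "__main__":
    data = build_distillation_set([(12, 5), (17, 8), (40, 2)], toy_teacher)
    for ex in data:
        print(ex.prompt, "->", ex.answer)
```

The filter discards the example where the teacher's answer is wrong, so errors in the generator do not propagate into the training set.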

Model Collapse Mitigation

Recursively training models on synthetic data from prior models introduces 'model collapse,' where the diversity of outputs narrows with each generation. To mitigate this, labs are combining synthetic generation with ground-truth verification loops: executing generated code in sandboxes to verify that it runs and behaves as intended, checking mathematical steps with formal proof assistants (Lean), and comparing output trajectories with physical simulators.
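The code-execution check can be sketched as follows. This is an illustration under stated assumptions: a separate interpreter process with a timeout stands in for a real sandbox (it provides no security isolation), and the function names `passes_tests` and `filter_verified` are hypothetical.

```python
import subprocess
import sys
from typing import List


def passes_tests(candidate_src: str, test_src: str, timeout_s: float = 5.0) -> bool:
    """Run candidate code followed by its tests in a separate interpreter.

    NOTE: a subprocess is not a security sandbox; a real pipeline would
    need proper isolation. This only shows the verification step.
    """
    program = candidate_src + "\n" + test_src
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # non-terminating samples are rejected
    # Syntax errors, exceptions, and failed assertions all exit non-zero.
    return result.returncode == 0


def filter_verified(samples: List[str], test_src: str) -> List[str]:
    """Keep only generated samples that pass the ground-truth tests."""
    return [s for s in samples if passes_tests(s, test_src)]


if __name__ == "__main__":
    good = "def add(a, b):\n    return a + b"
    wrong = "def add(a, b):\n    return a - b"
    broken = "def add(a, b) return a + b"
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    kept = filter_verified([good, wrong, broken], tests)
    print(f"kept {len(kept)} of 3 samples")
```

Running the tests, rather than only parsing the code, is what rejects the sample that is syntactically valid but computes the wrong result.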

The Decisional Pivot

System 1 vs System 2 Thinking.

Standard autoregressive LLMs operate on 'System 1' thinking: they reflexively emit the next token with a fixed amount of compute per token. Inference-time scaling introduces 'System 2' thinking, allowing models to formulate multi-step chains-of-thought, run search trees, and identify errors before printing the final answer.
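One simple way to see how extra inference compute can buy accuracy is self-consistency voting: sample several independent answers and return the most common one. The sketch below is a toy illustration of that idea, not a description of how any named reasoning model works internally. The `sample_answer` callable is a hypothetical stand-in for a model call, and the scripted sampler exists only to make the demo deterministic.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def answer_with_budget(
    sample_answer: Callable[[str], str], question: str, n_samples: int
) -> str:
    """Spend more compute (n_samples model calls) to get a more reliable answer."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer(question) for _ in range(n_samples)]
    return majority_vote(answers)


def make_scripted_sampler(script: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a stochastic model: replays a fixed script."""
    replies = cycle(script)
    return lambda question: next(replies)


if __name__ == "__main__":
    script = ["19", "17", "17", "23", "17"]  # "17" is the intended answer
    fast = answer_with_budget(make_scripted_sampler(script), "toy question", 1)
    slow = answer_with_budget(make_scripted_sampler(script), "toy question", 5)
    print("1 sample :", fast)
    print("5 samples:", slow)
```

With one sample the scripted model's first, wrong reply is returned; with five samples the majority answer wins. The cost is five times as many model calls, which is the trade-off the System 1 and System 2 framing describes.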

This shifts AI from a static chat interface to a dynamic utility: a model can respond instantly for low-cost conversational tasks, or deliberate for hours to prove a complex mathematical theorem or identify a candidate molecular compound in biology.
