The Human Data Wall
Epoch AI has projected that the stock of high-quality human-written text on the public internet would be exhausted before 2026. Frontier models are already overtrained by 5× relative to standard compute-optimal levels, meaning they are trained on far more tokens than compute-optimal scaling would prescribe for their size. Publishers are also actively restricting access: the MIT Data Provenance Initiative documented a sharp contraction in crawlable content. Cloudflare data shows AI crawler growth slowing from 32% in April 2025 to just 4% in July 2025 as publisher blocking surged. The imbalance behind that blocking is illustrated by Anthropic's lopsided crawl-to-refer ratio of 38,000:1, that is, roughly 38,000 pages crawled for every visitor referred back to a publisher.
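To make the crawl-to-refer figure concrete, here is a minimal sketch of how such a ratio could be computed from raw counts. The function name `crawl_to_refer_ratio` and the request counts are hypothetical, chosen only so the arithmetic reproduces a 38,000:1 ratio; they are not Cloudflare's actual measurements or methodology.

```python
def crawl_to_refer_ratio(crawl_requests: int, referrals: int) -> float:
    """Return how many pages were crawled per visitor referred back to the site."""
    if referrals <= 0:
        raise ValueError("referrals must be positive to form a ratio")
    return crawl_requests / referrals


# Hypothetical counts, picked only to illustrate a 38,000:1 ratio.
crawls = 7_600_000
referrals = 200

ratio = crawl_to_refer_ratio(crawls, referrals)
print(f"{ratio:,.0f}:1")  # prints "38,000:1"
```

A ratio this high means that, under these assumed counts, a publisher serves tens of thousands of crawler requests for each human visit it receives in return.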