TL;DR

Thorsten Meyer AI’s Control Series Part 3 argues that training data, not compute, is becoming a more significant constraint for the AI industry. It cites Epoch AI estimates that high-quality public text may be fully used between 2026 and 2032, while settlements, licensing deals and sovereign data controls are raising the cost of private corpora.

Thorsten Meyer AI published Part 3 of its Control Series, arguing that training data is becoming a significant constraint for the AI industry. The analysis cites estimates that high-quality public web text could be fully used between 2026 and 2032 and points to lawsuits, licensing deals and state-held datasets that are raising the cost of access.

The piece frames data as different from other AI inputs. Compute can be rented, power can be bought and models can be copied or matched over time, but private, expert, enterprise and sovereign datasets are harder to recreate. The series says that distinction is gaining weight as H100 rental prices fall from peak levels and model performance gaps narrow.

Epoch AI estimates that the public internet contains roughly 300 trillion tokens of high-quality text, according to the source material. Its projection places full use of public human text between 2026 and 2032, with a median around 2028. Elon Musk went further in early 2025, saying the accumulated stock of human knowledge available for training had already been exhausted, though that remains an industry claim rather than a settled measurement.

The analysis also points to legal pressure. In the Anthropic authors case, the company agreed to a $1.5 billion settlement covering alleged use of pirated books, roughly $3,000 per work across about 500,000 titles, and agreed to destroy the disputed files. The settlement covered past piracy claims, not future training or model outputs. The New York Times case against OpenAI remains in discovery, while some publishers, including News Corp, have moved toward licensing.
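The per-work figure follows from simple division. A minimal sketch, assuming the rounded numbers reported above ($1.5 billion and about 500,000 titles); the names below are illustrative, and the actual settlement terms may differ in detail:

```python
# Rounded figures as reported in the article; names are illustrative.
SETTLEMENT_USD = 1_500_000_000   # reported settlement total
WORKS_COVERED = 500_000          # approximate number of covered titles


def per_work_payout(total_usd: int, works: int) -> float:
    """Average payout per covered work, in US dollars."""
    if works <= 0:
        raise ValueError("works must be positive")
    return total_usd / works


payout = per_work_payout(SETTLEMENT_USD, WORKS_COVERED)
print(f"${payout:,.0f} per work")  # prints "$3,000 per work"
```

Because both inputs are rounded, the result is an average rather than the amount any individual rights holder would receive.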

Infographic: AI Dispatch · The Control Series · Part 3 · Chokepoint 03: Data

Data: The One Thing You Can’t Rent

The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat, so data is being fenced, priced, and, in places, treated as a national asset.

Data tiers, from most to least scarce and valuable:
Sovereign / real-world (Avengers combat data, FSD, ISR): can’t be bought
Expert-authored (PhDs, lawyers, surgeons define “good”): the new gold
Licensed content (paywalled, deal-only, now priced): fenced
Public web text (scraped for free, exhausting ~2028): commoditizing

Key figures:
~300T: public text tokens, used up 2026–2032
$1.5B: Anthropic authors settlement, scraping era ends
$14.3B: Meta for 49% of Scale, triggered an exodus
“Keep the model”: Ukraine’s condition, data as sovereign asset
The take

Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own — so guard your proprietary data, and don’t hand it to a provider who can become your competitor (the lesson everyone fled Scale to learn). Nations: license it like Ukraine — keep the model, keep the leverage.

Sources: Epoch AI; PBS; Intl AI Safety Report 2026; NPR; Authors Guild; Wolters Kluwer; TechCrunch; TIME; CNBC; Ukraine MoD (2024–Jun 2026). Token estimates are projections; valuations as reported.
thorstenmeyerai.com · 03 / 06

Proprietary Data Gains Commercial Importance

Data ownership now affects who can build competitive AI systems and who must pay for access. Large incumbents may be better positioned than smaller labs to absorb licensing costs, settlements and expert-data contracts. That could make data access a market barrier as well as a technical constraint.

The impact extends beyond AI labs. Enterprises that hold customer records, internal workflows, legal documents, medical expertise or industrial process data may have assets that cannot be replaced by public scraping. The series argues that companies should define rights and model-use limits before sharing proprietary data with providers.

From Web Scraping to Deals

Early frontier models relied heavily on large-scale web crawling and broad text collections. The new analysis says that phase is changing because the most accessible public text has already been heavily used and because creators, publishers and courts are forcing AI developers to account for copyright and licensing.

Synthetic data is one response. Nvidia’s reported $320 million deal for Gretel and Microsoft’s use of hundreds of billions of synthetic tokens show that machine-generated training material is already part of the toolset. The source material also warns that synthetic data can compound errors in fields where answers are hard to verify, making fresh human-made and verified data more valuable.

The series also cites demand for expert-authored data from lawyers, doctors, physicists and other specialists. It points to Meta’s reported $14.3 billion deal for a 49% stake in Scale AI and reports that some customers then looked elsewhere, a sign that data suppliers can become strategic risks for buyers.

“The public internet holds roughly 300 trillion tokens of high-quality text, with full use projected between 2026 and 2032.”

— Epoch AI, cited by Thorsten Meyer AI

Token Forecasts and Legal Limits

Several points remain unsettled. Epoch AI’s token ceiling is a forecast, not a fixed date for when all useful public data disappears. The value of synthetic data also depends on domain, verification and model design, so it is not clear how much it can offset scarce human data.

The legal picture is still developing. The Anthropic settlement resolved past piracy claims, but it did not settle future training rights, output disputes or the broader copyright questions in pending cases such as The New York Times v. OpenAI.

Licensing Disputes Move Through Courts

The next milestones are legal and commercial. Courts will continue to define the limits of training-data use, while publishers, data brokers, enterprises and governments negotiate access terms. Watch for more licensing deals, more expert-data marketplaces and more contracts that limit how customer data can be used to train or improve models.

Key Questions

Is the public internet out of training data?

No. The claim is narrower: Epoch AI estimates that the stock of high-quality public human text could be fully used by frontier training runs between 2026 and 2032. The precise date is uncertain.

What did the Anthropic settlement resolve?

According to the source material, Anthropic agreed to pay $1.5 billion over alleged use of pirated books and destroy disputed files. The settlement covered past piracy claims, not future training or model-output disputes.

Why can’t AI firms just rent data like compute?

Cloud compute is standardized and broadly available. Proprietary data is held by specific companies, experts, publishers or governments, and access depends on ownership, rights, contracts and trust.

Could synthetic data solve the shortage?

It can help, and major companies already use it. The risk cited in the source material is that synthetic data can amplify errors when answers are hard to verify, which increases the value of fresh, checked human data.

What should businesses take from this?

The practical reading is that proprietary data has strategic value. Businesses should know what they own, how vendors can use it and whether contracts allow their data to train systems that may later compete with them.

Source: Thorsten Meyer AI
