In 2024, News Corp licensed its archive to OpenAI for a reported $250M over five years. [1] Google licensed Reddit’s content for about $60M per year. [2] These are not one-off content deals. They are the opening prints of a new asset class.
AI researchers flagged years ago that human-generated text is a depleting resource. Epoch AI estimates the usable stock of public human text at roughly 300 trillion tokens, fully consumed somewhere between 2026 and 2032. [3] As supply tightens, models train on a rising share of synthetic data. And synthetic data has a known failure mode.
Model collapse is the steady decay that sets in when models learn from their own output. [4] The distribution narrows with each generation. It does not narrow uniformly. The high-frequency middle survives. The low-frequency tail erodes first, because rare patterns are the ones a model is least likely to regenerate.
That asymmetry is the part to price.
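The tail-first erosion can be seen in a toy simulation. What follows is a minimal sketch, not the method of the cited research: it assumes a "model" that is nothing more than a table of token frequencies, a Zipf-like starting distribution, and a refit on a fixed number of the model's own samples each generation. The names `zipf_weights`, `next_generation`, `simulate`, and `support`, and all parameter values, are illustrative choices.

```python
import random
from collections import Counter


def zipf_weights(vocab_size):
    """Zipf-like starting distribution: probability proportional to 1/rank."""
    raw = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def next_generation(probs, n_samples, rng):
    """Sample from the current model, then refit on those samples alone."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    counts = Counter(draws)
    # Empirical frequencies become the next model. A token that drew
    # zero samples gets probability 0 and can never be sampled again.
    return [counts.get(i, 0) / n_samples for i in range(len(probs))]


def simulate(vocab_size=1000, n_samples=2000, generations=10, seed=0):
    """Return the list of distributions, generation 0 through the last."""
    rng = random.Random(seed)
    probs = zipf_weights(vocab_size)
    history = [probs]
    for _ in range(generations):
        probs = next_generation(probs, n_samples, rng)
        history.append(probs)
    return history


def support(probs, lo, hi):
    """Count tokens in rank range [lo, hi) that still have nonzero probability."""
    return sum(1 for p in probs[lo:hi] if p > 0)


if __name__ == "__main__":
    history = simulate()
    for gen, probs in enumerate(history):
        head = support(probs, 0, 5)        # five most common tokens
        tail = support(probs, 500, 1000)   # rarest half of the vocabulary
        print(f"generation {gen}: head alive = {head}/5, tail alive = {tail}/500")
```

Because a token that draws zero samples can never return, the surviving vocabulary can only shrink, and the lowest-probability tokens are the likeliest to hit zero first. Real language models are far richer than a frequency table, so this illustrates the mechanism only and is no estimate of its speed.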
The value of a dataset is no longer mostly its size. It is where its data sits in the distribution. Common data is cheap and increasingly easy to generate. Tail data is scarce and, so far, has resisted synthetic reproduction.
In healthcare, the tail is rare disease. Atypical presentations. The symptom that surfaces in a patient forum a year before the literature does. This is at once the most clinically valuable data and the hardest to reproduce. Those are the same property.
For drug developers, this is not an abstract data-market story. The tail is where the missing evidence lives. The everyday symptoms patients deal with that never made it onto a drug’s label. The safety signals a company has to keep tracking after a drug is approved. The real-life burden insurers want to see proof of before they will pay for a treatment. As the open web fills with synthetic noise, that evidence gets harder to find and more valuable to hold.
The implication for anyone building on health data: provenance becomes the differentiator. Not architecture, not prompts. The origin and integrity of the underlying data. Research on model collapse shows the determining factor is whether real human data is retained or replaced by the model’s own output. [5]
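The retained-versus-replaced distinction can be sketched the same way. This is a hedged toy, not the experimental setup of the cited paper: the "model" is again a frequency table, and the only difference between the two regimes is whether each refit sees the real sample plus all synthetic data so far, or only the latest synthetic batch. The helpers `fit`, `sample`, `run`, and `make_real_data`, and every parameter value, are assumptions made for illustration.

```python
import random
from collections import Counter


def fit(counts):
    """Maximum-likelihood frequency model: token -> probability."""
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def sample(model, n, rng):
    """Draw n synthetic tokens from the model and return their counts."""
    tokens = list(model)
    weights = [model[t] for t in tokens]
    return Counter(rng.choices(tokens, weights=weights, k=n))


def run(real_counts, generations, n, rng, accumulate):
    """Refit for several generations under one of two data regimes.

    accumulate=True  -> real data stays in the pool, synthetic data is added.
    accumulate=False -> each generation trains only on the newest synthetic batch.
    """
    pool = Counter(real_counts)
    model = fit(pool)
    for _ in range(generations):
        synthetic = sample(model, n, rng)
        pool = pool + synthetic if accumulate else synthetic
        model = fit(pool)
    return model


def make_real_data(vocab_size, n, rng):
    """Hypothetical 'real' corpus: n tokens drawn from a Zipf-like distribution."""
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    return Counter(rng.choices(range(vocab_size), weights=weights, k=n))


if __name__ == "__main__":
    real = make_real_data(vocab_size=1000, n=2000, rng=random.Random(0))
    replaced = run(real, 10, 2000, random.Random(1), accumulate=False)
    accumulated = run(real, 10, 2000, random.Random(1), accumulate=True)
    print("distinct tokens in real data:   ", len(real))
    print("still alive when data replaced: ", len(replaced))
    print("still alive when data retained: ", len(accumulated))
```

When the real counts never leave the pool, every token observed in the real sample keeps a nonzero probability. When each generation replaces the last, there is no such floor, and rare tokens can disappear for good.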
First-person patient-community language sits at the far end of that appreciating asset. No public source is immune to the synthetic tide. But experience reported by patients themselves, with its origin documented rather than assumed, is less exposed to that contamination and easier to reason about.
The thesis has one failure point worth naming. If synthetic methods improve enough to reconstruct the tail, its scarcity collapses and the asset depreciates. So far they have not. The collapse research points the other way: the tail is the first thing lost and the hardest thing to rebuild. The forecast holds as long as that does.
The market has started pricing the archive. It has not yet priced the tail.
At TREND Community, we work with patient communities to capture and analyze that tail, through public conversations and direct partnerships alike, with provenance documented either way. It helps our clients, and it helps the communities themselves. How we do it: trend.community/how-we-work
References
1. “News Corp Inks OpenAI Licensing Deal Potentially Worth More Than $250 Million,” Variety, May 2024, citing The Wall Street Journal. The figure is a reported estimate and has not been confirmed in public filings.
2. Reddit–Google licensing agreement reported at roughly $60 million per year, per the Associated Press; a separate Reddit–OpenAI agreement was also disclosed. Terms of the OpenAI agreement have not been publicly confirmed.
3. Villalobos, P., et al. “Will we run out of data? Limits of LLM scaling based on human-generated data.” Epoch AI, 2024 (peer-reviewed, ICML 2024; original working paper 2022).
4. Shumailov, I., et al. “AI models collapse when trained on recursively generated data.” Nature, 2024. (First circulated as an arXiv preprint in 2023.)
5. Gerstgrasser, M., et al. “Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data.” First Conference on Language Modeling (COLM), 2024 (arXiv:2404.01413).
