In April 2025, researchers analysing nearly 900,000 web domains found that 74.2% of newly created webpages contained AI-generated text. By August 2025, 10.4% of the sources cited inside Google’s AI Overviews were themselves AI-generated. The internet — the primary source of training data for every major language model — is rapidly becoming a hall of mirrors, where AI systems are training on the outputs of previous AI systems, which were themselves trained on AI outputs before that.
The consequence has a name: model collapse. And in 2025, it stopped being a theoretical concern.
What Model Collapse Actually Is
Model collapse is what happens when a generative model is trained on data produced by previous generations of generative models, without sufficient real-world signal to anchor it. Errors compound. Rare patterns — the low-frequency events and edge cases that matter most for model robustness — are progressively erased. The model drifts toward a narrower, blander version of reality, confidently generating outputs that are increasingly disconnected from how the world actually behaves.
Research published at ICLR 2025 made the scale of the problem concrete. Dohmatob et al. demonstrated that even small proportions of synthetic data in a training set can initiate collapse: as little as 1% synthetic contamination causes measurable performance degradation, and scaling the model or increasing the dataset size does not reliably prevent it. A separate 2025 Apple study found that large reasoning models suffer what the researchers called “complete accuracy collapse” once task complexity passes a threshold. And Stanford research has shown that rare patterns, exactly the edge cases that enterprise ML models most need to handle correctly, are the first to disappear.
The Feedback Loop Nobody Planned For
The mechanism is not subtle. A model generates content. That content gets published online. Future crawls ingest it as training data. The next model generation trains on a dataset where an increasing proportion of the “human” content is actually machine output from the previous generation. Errors and biases amplify with each iteration. The model does not know it is eating its own tail.
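To make the loop concrete, here is a deliberately tiny sketch in Python: the “model” is nothing more than a categorical distribution refit each generation on samples drawn from the previous generation’s fit. The setup (100 Zipf-weighted categories, 500 samples per generation) is an illustrative assumption, not the design of any cited study, but it shows the characteristic failure: rare categories in the tail are the first to vanish, and once a category draws zero samples it never comes back.

```python
# Toy illustration of recursive self-training (illustrative parameters,
# not any cited paper's experimental setup).
import random
from collections import Counter

random.seed(42)

categories = list(range(100))
# Generation 0: a long-tailed "real" distribution (Zipf-like weights).
weights = [1.0 / (rank + 1) for rank in categories]

for generation in range(10):
    surviving = sum(w > 0 for w in weights)
    print(f"gen {generation}: {surviving} of {len(categories)} categories survive")
    # The current model generates a finite "synthetic" training set ...
    sample = random.choices(categories, weights=weights, k=500)
    counts = Counter(sample)
    # ... and the next generation is fit on those samples alone.
    # Any category that drew zero samples is erased permanently.
    weights = [counts.get(cat, 0) for cat in categories]
```

Run it and the surviving-category count drops generation after generation; nothing in the loop can ever restore a pattern once it has been sampled out of existence.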
This is not a hypothetical trajectory. Analysis of the LAION-5B dataset — one of the foundational image-text datasets used across computer vision — has already found measurable contamination from synthetic sources. The contamination of training pipelines is not coming. It is here.
What the Research Says About Prevention
The 2025 research consensus is not that synthetic data should be abandoned. It is that real-world data must be treated as a protected anchor — the signal that prevents a training pipeline from drifting into recursive self-reference.
Studies comparing accumulation strategies (adding synthetic data alongside real data) against replacement strategies (substituting real data with synthetic data) show a clear result: models remain stable when real-world data is preserved as a proportion of the training mix. The practical implication is that the value of real-world training data is rising, not falling, as synthetic generation becomes cheaper and more abundant. Scarcity drives value. Real-world observations that cannot be generated or scraped are becoming the rarest and most important input to any training pipeline.
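As a rough illustration of the difference, the sketch below extends the toy categorical model from the previous section and compares the two strategies over ten generations. The parameters and the total-variation metric are assumptions chosen for clarity, and exact numbers vary with the random seed, but the qualitative gap is consistent: the replacement run drifts further from the real distribution and loses more of its tail than the run anchored by fresh real data.

```python
# Toy comparison of "replace" vs "accumulate" training strategies
# (illustrative assumptions, not the cited studies' setup).
import random
from collections import Counter

CATEGORIES = list(range(100))
REAL_WEIGHTS = [1.0 / (rank + 1) for rank in CATEGORIES]  # Zipf-like long tail
SAMPLES_PER_GEN = 500
GENERATIONS = 10


def tv_distance(weights):
    """Total variation distance between `weights` and the real distribution."""
    total = sum(weights)
    real_total = sum(REAL_WEIGHTS)
    return 0.5 * sum(abs(w / total - r / real_total)
                     for w, r in zip(weights, REAL_WEIGHTS))


def run(strategy, seed=0):
    rng = random.Random(seed)
    weights = REAL_WEIGHTS[:]
    for _ in range(GENERATIONS):
        synthetic = rng.choices(CATEGORIES, weights=weights, k=SAMPLES_PER_GEN)
        if strategy == "accumulate":
            # Accumulation: mix fresh real-world samples back in every generation.
            real = rng.choices(CATEGORIES, weights=REAL_WEIGHTS, k=SAMPLES_PER_GEN)
            train = synthetic + real
        else:
            # Replacement: train only on the previous generation's outputs.
            train = synthetic
        counts = Counter(train)
        weights = [counts.get(cat, 0) for cat in CATEGORIES]
    return tv_distance(weights), sum(w > 0 for w in weights)


for strategy in ("replace", "accumulate"):
    drift, survivors = run(strategy)
    print(f"{strategy:10s} drift from real data={drift:.3f}  surviving categories={survivors}")
```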
The Specific Problem for Physical-World Models
The model collapse problem is acute across all domains, but it is particularly severe for models trained to understand physical-world events — retail behavior, logistics operations, safety compliance, environmental conditions, human activity in real spaces. These are domains where synthetic generation is weakest and real-world observation is irreplaceable.
You cannot generate synthetic training data for a genuinely novel behavior pattern, because novelty by definition means the pattern has not yet been captured anywhere. You cannot scrape it from the internet, because it does not leave a digital trace. You cannot crowdsource-annotate it from stock images, because the contextual nuance that makes a label meaningful is absent from any image that was not captured for that purpose.
The only reliable source of ground truth for physical-world events is a human who was present, understood what to look for, and recorded it accurately under a defined taxonomy. That is not a limitation of current technology. It is a structural property of what physical-world observation data is.
Provenance as a First-Class Data Property
A practical implication of the 2025 model collapse research is that data provenance — knowing where each training example came from, how it was captured, and whether it originated with a human or a machine — is becoming a first-class property of any serious training pipeline, not an afterthought.
Every Sentinel Watch observation record carries full provenance by design: timestamp at capture, geolocation at capture, observer identifier, taxonomy version, and quality review status. Clients know exactly what is in their dataset, how each record was produced, and that it originates with a human observer who was physically present at the moment of classification. In a training landscape increasingly contaminated by recursive synthetic content, that provenance is not just operationally useful. It is a competitive differentiator for any model that needs to perform reliably on the edge cases that matter.
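As a sketch of what that looks like in practice, the record below carries the provenance properties described above as explicit, typed fields, and a simple filter admits only human-origin, review-passed records into a training set. The field names, enum values, and the example record are hypothetical illustrations, not Sentinel Watch’s actual schema or API.

```python
# Hypothetical provenance-aware observation record (illustrative only).
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Origin(Enum):
    HUMAN_OBSERVER = "human_observer"
    MACHINE_GENERATED = "machine_generated"


class ReviewStatus(Enum):
    PENDING = "pending"
    PASSED = "passed"
    REJECTED = "rejected"


@dataclass(frozen=True)
class ObservationRecord:
    label: str                   # class assigned under the taxonomy
    captured_at: datetime        # timestamp at capture
    latitude: float              # geolocation at capture
    longitude: float
    observer_id: str             # who recorded the observation
    taxonomy_version: str        # which label taxonomy was in force
    origin: Origin               # human-present vs machine-generated
    review_status: ReviewStatus  # quality review outcome


def training_eligible(record: ObservationRecord) -> bool:
    """Admit only human-origin records that passed quality review."""
    return (record.origin is Origin.HUMAN_OBSERVER
            and record.review_status is ReviewStatus.PASSED)


# Example: a record that would be admitted into a training set.
example = ObservationRecord(
    label="aisle_blocked",
    captured_at=datetime(2025, 8, 14, 9, 30, tzinfo=timezone.utc),
    latitude=51.5072,
    longitude=-0.1276,
    observer_id="obs-0417",
    taxonomy_version="v3.2",
    origin=Origin.HUMAN_OBSERVER,
    review_status=ReviewStatus.PASSED,
)
print(training_eligible(example))  # True
```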
References
- Dohmatob et al. (2025). Strong Model Collapse. Published at ICLR 2025. iclr.cc
- Winssolutions. (2025). The AI Model Collapse Risk is Not Solved in 2025. winssolutions.org
- Humans in the Loop. (2025). Preventing Model Collapse with Human-in-the-Loop Annotation. humansintheloop.org
- Influencers Time. (2026). Mitigating Model Collapse Risks in AI Data Training 2025. influencers-time.com

