December 2025 · 6 min read

Why synthetic data hits a wall

A practical guide to when synthetic data works, when it fails, and why expert data still wins at the frontier

PathWize

Key takeaways

Synthetic data is one of the great unlocks of the last few years. For broad, well-trodden tasks it is cheap, fast and good enough, and it scales in ways human work never will.

On the hardest problems (frontier maths, clinical judgment, security) it hits a wall. The cases that matter most are exactly the ones a model cannot yet generate correctly, and that is where credentialed, expert-generated data still wins.

  • Use synthetic data to extend capabilities a model already has.
  • Expect a wall wherever the task exceeds the model's own competence.
  • On hard cases, synthetic data tends to look plausible while being subtly wrong.
  • At the edge, the value is the judgment, not the volume.

What is synthetic data?

Synthetic data is data generated by a model or simulation rather than collected from the real world or produced by a human. In modern AI pipelines it is used to augment coverage, balance rare classes, and create variations of examples the model already understands.

Its appeal is obvious: near-zero marginal cost, instant scale, and no scheduling of human annotators. The question is never whether to use it, but where its usefulness stops.

Projected share of data used for AI that is synthetic

  • 2021 (approx., actual): 1%
  • 2024 (Gartner projection): 60%

Source: Gartner (2021 projection). This is a projection, not a measured value; Gartner expected synthetic data to overtake real data later this decade.

Where synthetic data works well

When a capability is already well represented in a model, synthetic data is a force multiplier. It fills gaps, balances classes and augments coverage at a fraction of the cost of human work. For these tasks, paying experts to generate data is wasteful; the right default is to let the model do what it can.

  • Augmenting or rebalancing datasets the model already handles.
  • Generating format and style variations of known-good examples.
  • Bootstrapping low-stakes tasks where occasional errors are cheap.
  • Creating large volumes for pre-training breadth rather than edge accuracy.
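As a rough illustration of the rebalancing case, the sketch below computes how many synthetic examples each class would need to match a target size. The function name `synthetic_quota` and its `target` parameter are hypothetical, introduced only for this example; it assumes labels are simple hashable values and says nothing about how the examples themselves are generated.

```python
from collections import Counter


def synthetic_quota(labels, target=None):
    """Return how many synthetic examples each class needs to reach `target`.

    If `target` is None, the size of the largest class is used.
    Classes already at or above the target get a quota of 0.
    """
    counts = Counter(labels)
    if not counts:
        return {}
    if target is None:
        target = max(counts.values())
    return {label: max(0, target - n) for label, n in counts.items()}


labels = ["common"] * 5 + ["rare"] * 2
print(synthetic_quota(labels))  # {'common': 0, 'rare': 3}
```

A quota like this only says how much to generate; whether the generated examples are trustworthy still depends on the model already handling that class well.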

Where synthetic data hits a wall

The wall appears where the task exceeds the model's own competence. A model cannot reliably generate a correct novel proof, a defensible diagnosis on an ambiguous case, or a genuinely adversarial security finding, because if it could, you would not need the data in the first place.

Worse, synthetic data on these tasks tends to look plausible while being subtly wrong, which is the most dangerous failure mode of all. Training on confidently incorrect data does not just fail to help; it actively teaches the model the wrong thing, and the damage can be hard to detect downstream.

How to decide between synthetic and expert data

The practical answer is not synthetic versus human; it is knowing precisely where the wall is and spending expert effort only beyond it. A simple test: could a strong model produce this example correctly, and could you verify that it did? If yes to both, synthesize it. If no to either, route it to a credentialed human.

  • Map tasks by model competence and by the cost of an error.
  • Synthesize where competence is high and error cost is low.
  • Use experts where competence is uncertain or errors are expensive.
  • Always verify hard synthetic examples before they enter training.
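The decision rule above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a prescribed implementation: the `Task` fields, the 0-to-1 scoring scale, and both thresholds are hypothetical and would need calibrating against your own evaluations.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    model_competence: float  # hypothetical score in [0, 1], e.g. from a held-out eval
    error_cost: float        # hypothetical score in [0, 1]; higher means costlier mistakes
    verifiable: bool         # can a generated example be checked for correctness?


def route(task: Task, min_competence: float = 0.8, max_error_cost: float = 0.3) -> str:
    """Return 'synthetic' or 'expert' for a task.

    The default thresholds are illustrative assumptions only.
    """
    can_generate = task.model_competence >= min_competence
    cheap_errors = task.error_cost <= max_error_cost
    # Synthesize only when the model is competent, the output can be
    # verified, and an occasional mistake is cheap.
    if can_generate and task.verifiable and cheap_errors:
        return "synthetic"
    # Uncertain competence, unverifiable output, or expensive errors.
    return "expert"


tasks = [
    Task("paraphrase known-good examples", 0.95, 0.1, True),
    Task("ambiguous clinical diagnosis", 0.4, 0.9, False),
]
for t in tasks:
    print(f"{t.name} -> {route(t)}")
```

The function defaults to `expert` whenever any condition fails, which mirrors the article's advice to spend expert effort wherever competence is uncertain or errors are expensive.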

Frequently asked questions

Is synthetic data as good as real data?

For tasks a model already handles well, synthetic data can match or beat collected data on cost and scale. For tasks at or beyond the model's competence, it is unreliable and often plausibly wrong; there, real expert data is far more valuable.

What are the limitations of synthetic data?

Synthetic data cannot reliably create genuinely new judgment on hard problems, tends to produce confident but subtly incorrect examples on those problems, and can amplify the generating model's existing biases and blind spots.

When should you use human experts instead of synthetic data?

Use experts wherever the task exceeds the model's competence or where errors are expensive: frontier maths, clinical judgment, legal reasoning, security. Verify any hard synthetic examples before training on them.

Mareike Hoffmann

Data Research

Mareike studies training-data quality and where generated data stops being enough for frontier tasks.