Key takeaways
Synthetic data is one of the great unlocks of the last few years. For broad, well-trodden tasks it is cheap, fast and good enough, and it scales in ways human work never will.
On the hardest problems, such as frontier maths, clinical judgment and security, it hits a wall. The cases that matter most are exactly the ones a model cannot yet generate correctly, and that is where credentialed, expert-generated data still wins.
- Use synthetic data to extend capabilities a model already has.
- Expect a wall wherever the task exceeds the model's own competence.
- On hard cases, synthetic data tends to look plausible while being subtly wrong.
- At the edge, the value is the judgment, not the volume.
What is synthetic data?
Synthetic data is data generated by a model or simulation rather than collected from the real world or produced by a human. In modern AI pipelines it is used to augment coverage, balance rare classes, and create variations of examples the model already understands.
Its appeal is obvious: near-zero marginal cost, instant scale, and no scheduling of human annotators. The question is never whether to use it, but where its usefulness stops.
Where synthetic data works well
When a capability is already well represented in a model, synthetic data is a force multiplier. It fills gaps, balances classes and augments coverage at a fraction of the cost of human work. For these tasks, paying experts to generate data is wasteful; the right default is to let the model do what it can.
- Augmenting or rebalancing datasets the model already handles.
- Generating format and style variations of known-good examples.
- Bootstrapping low-stakes tasks where occasional errors are cheap.
- Creating large volumes for pre-training breadth rather than edge accuracy.
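As a rough illustration of the rebalancing case, the sketch below works out how many synthetic examples each class would need to match the largest class. The function name `synthetic_quota` and the "match the majority class" target are assumptions made for this example, not a prescribed method.

```python
from collections import Counter


def synthetic_quota(labels):
    """Return how many synthetic examples each class needs to reach
    the size of the largest class (an assumed balancing target)."""
    counts = Counter(labels)
    if not counts:
        return {}
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


# Toy dataset: "refund" is the rare class here.
labels = ["billing", "billing", "billing", "refund"]
quota = synthetic_quota(labels)
print(quota)  # {'billing': 0, 'refund': 2}
```

The quota only says how much to generate. Whether a model can generate those examples correctly is a separate question.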
Where synthetic data hits a wall
The wall appears where the task exceeds the model's own competence. A model cannot reliably generate a correct novel proof, a defensible diagnosis on an ambiguous case, or a genuinely adversarial security finding, because if it could, you would not need the data in the first place.
Worse, synthetic data on these tasks tends to look plausible while being subtly wrong, which is the most dangerous failure mode of all. Training on confidently incorrect data does not just fail to help; it actively teaches the model the wrong thing, and the resulting errors can be hard to detect downstream.
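One way to catch plausible-but-wrong examples is to gate them behind a check that does not depend on the generator. The sketch below is a minimal version under strong assumptions: the helper `filter_verified` is hypothetical, and the toy task (multiplication) is chosen only because it can be checked mechanically. On genuinely hard tasks such a checker often does not exist, which is the point of the wall.

```python
def filter_verified(examples, verify):
    """Split examples into (accepted, rejected) using an independent check.

    `verify` is any callable that returns True when an example is correct.
    """
    accepted, rejected = [], []
    for example in examples:
        (accepted if verify(example) else rejected).append(example)
    return accepted, rejected


def product_is_correct(example):
    """Recompute the answer instead of trusting the generated one."""
    a, b, claimed = example
    return a * b == claimed


# Hypothetical generated examples: (a, b, claimed product).
# The last one looks plausible but is off by ten.
candidates = [(17, 6, 102), (23, 19, 437), (48, 52, 2486)]
accepted, rejected = filter_verified(candidates, product_is_correct)
print(accepted)  # [(17, 6, 102), (23, 19, 437)]
print(rejected)  # [(48, 52, 2486)]
```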
How to decide between synthetic and expert data
The practical answer is not synthetic versus human; it is knowing precisely where the wall is and spending expert effort only beyond it. A simple test: could a strong model produce this example correctly, and could you verify that it did? If the answer to both is yes, synthesize it. If either answer is no, route it to a credentialed human.
- Map tasks by model competence and by the cost of an error.
- Synthesize where competence is high and error cost is low.
- Use experts where competence is uncertain or errors are expensive.
- Always verify hard synthetic examples before they enter training.
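The decision rule above can be sketched as a small routing function. Everything here is an assumption made for illustration: the `Task` fields, the 0-to-1 scores and the threshold values are hypothetical and would need to be calibrated against your own tasks.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    SYNTHESIZE = "synthesize"
    EXPERT = "expert"


@dataclass
class Task:
    name: str
    model_competence: float  # assumed estimate in [0, 1]
    error_cost: float        # assumed estimate in [0, 1]
    verifiable: bool         # can a generated example be checked?


def route(task, min_competence=0.8, max_error_cost=0.3):
    """Synthesize only when competence is high, errors are cheap and the
    output can be verified; otherwise send the task to an expert."""
    if (
        task.model_competence >= min_competence
        and task.error_cost <= max_error_cost
        and task.verifiable
    ):
        return Route.SYNTHESIZE
    return Route.EXPERT


tasks = [
    Task("style variations of known-good examples", 0.95, 0.1, True),
    Task("diagnosis on an ambiguous case", 0.4, 0.9, False),
]
for task in tasks:
    print(task.name, "->", route(task).value)
```

The rule defaults to the expert route, so any uncertainty about competence, cost or verifiability keeps an example out of the synthetic pipeline.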
