Data wranglers have tried bulking up training data sets with synthetically generated images, but such input often fails to capture real-world variety. Researchers propose a way to generate labeled data for visual tasks that aims to bring synthetic and real worlds into closer alignment.
What’s new: Where most approaches to synthesizing visual data concentrate on matching the appearance of real objects, Meta-Sim also aims to mimic their diversity and distribution. Its output proved quantitatively better than that of other methods across a number of tasks.
How it works: For a given task, Meta-Sim uses a probabilistic scene grammar, a set of compositional rules that represents the objects found in real-world scenes and how they are distributed. To tune the grammar’s attributes, Meta-Sim trains a neural network to minimize the divergence between the distribution of objects in synthetic images rendered from the grammar and the distribution in real-world images. The same network can also adjust those attributes to boost performance on a downstream task.
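The sketch below illustrates the distribution-matching idea in PyTorch under toy assumptions: a crude grammar that samples object attributes, a stand-in for rendering and feature extraction, and a maximum mean discrepancy (MMD) loss between synthetic and real feature distributions. Function names such as sample_scene_attributes and toy_render_features are hypothetical and not taken from the authors’ code; a real pipeline would render full scenes and backpropagate through (or around) the renderer.

```python
# Minimal sketch of the distribution-matching step, not the authors' implementation.
import torch
import torch.nn as nn

def sample_scene_attributes(batch_size, n_objects=3):
    """Toy grammar: each scene holds n_objects objects with (x, y, scale)
    drawn from a crude uniform prior. Meta-Sim samples full scene graphs."""
    return torch.rand(batch_size, n_objects, 3)

class AttributeTransformer(nn.Module):
    """Learns residual adjustments to the sampled attributes."""
    def __init__(self, attr_dim=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim, hidden), nn.ReLU(), nn.Linear(hidden, attr_dim)
        )

    def forward(self, attrs):
        return attrs + self.net(attrs)  # residual keeps output near the prior

def toy_render_features(attrs):
    """Stand-in for 'render the scene, then extract image features'.
    Here we simply flatten attributes so the example runs end to end."""
    return attrs.flatten(start_dim=1)

def mmd_rbf(x, y, sigma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Placeholder for features extracted from real labeled images.
real_features = torch.randn(256, 9) * 0.5 + 1.0

model = AttributeTransformer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(500):
    attrs = sample_scene_attributes(batch_size=256)
    synth_features = toy_render_features(model(attrs))
    loss = mmd_rbf(synth_features, real_features)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this toy setup the network nudges the sampled attributes until the synthetic feature distribution matches the real one; adding a downstream-task objective on top of the distribution loss would mirror Meta-Sim’s task-tuning step.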
Results: Amlan Kar and his colleagues at MIT, Nvidia, University of Toronto, and Vector Institute show that tuning probabilistic scene grammars via Meta-Sim significantly improves generalization from synthetic training data to real test data across a number of tasks. Trained on Meta-Sim’s output, networks built for digit recognition, car detection, and aerial road segmentation perform accurately on real-world data.
To be sure: Meta-Sim relies on a probabilistic scene grammar for each task. Its output is only as good as the grammar itself, and it can model only scenes that such a grammar can express.
Takeaway: There’s no such thing as too much labeled data. Meta-Sim offers an approach to generating endless quantities of diverse visual data that more closely mimics the real world, and points the way toward synthesizing more realistic data for other kinds of tasks. That could make for more accurate models going forward.