Collecting and annotating a dataset of facial portraits is a big job. New research shows that synthetic data can work just as well.
What's new: A team led by Erroll Wood and Tadas Baltrušaitis at Microsoft used a 3D model to generate an effective training set for face parsing algorithms, which label facial features pixel by pixel. The FaceSynthetics dataset comprises 100,000 diverse synthetic portraits in which every pixel is annotated with the part of the face it belongs to.
Key insight: Face datasets annotated with facial features are expensive and time-consuming to build. Beyond the ethical issues that arise in collecting pictures of people, they require that every pixel of every image be labeled. Creating high-quality synthetic images can be similarly difficult, since a digital artist must design each face individually. A controllable 3D model can ease the burden of producing and labeling realistic portraits.
How it works: The authors used a high-quality 3D model of a face, comprising over 7,000 polygons and vertices as well as four joints, that changes shape depending on parameters defining a unique identity, expression, and pose. They fit the model to the average face derived from 500 scans of people with diverse backgrounds.
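The description above follows the general recipe of a parametric (morphable) face model. Below is a minimal sketch of that idea, assuming a linear blendshape formulation; the vertex count, basis sizes, and function names are illustrative, not the authors' exact model.

```python
# Minimal sketch of a generic parametric face model (an assumption, not the
# authors' exact formulation): a template mesh is deformed by identity and
# expression coefficients, then rigidly posed.
import numpy as np

N_VERTS = 7_000          # roughly the scale of the mesh described above
N_ID, N_EXPR = 256, 233  # hypothetical numbers of basis vectors

template = np.zeros((N_VERTS, 3))                  # average (neutral) face
id_basis = np.random.randn(N_ID, N_VERTS, 3)       # identity blendshapes
expr_basis = np.random.randn(N_EXPR, N_VERTS, 3)   # expression blendshapes

def build_face(identity, expression, rotation, translation):
    """Return posed mesh vertices for the given identity, expression, and pose."""
    shape = (template
             + np.tensordot(identity, id_basis, axes=1)
             + np.tensordot(expression, expr_basis, axes=1))
    return shape @ rotation.T + translation        # 3x3 rotation, 3-vector offset
```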
- Given the average face, the authors derived identity, pose, and expression parameters from each of the 500 scans. They added further expressions from a dataset of 27,000 expression parameters. Meanwhile, artists produced a library of skin textures, hairstyles, facial hair, clothing, and accessories.
- To create novel faces, the authors fit a distribution to the identity parameters recovered from the scans and sampled new identities from it (see the sampling sketch after this list). Then they applied elements from the library to render 100,000 face images.
- They trained a U-Net encoder-decoder to classify each pixel as belonging to the right or left eye, right or left eyebrow, top or bottom lip, head hair or facial hair, neck, eyeglasses, and so on. The loss function minimized the difference between predicted and ground-truth labels (see the training sketch after this list).
- Given real-life faces from the Helen dataset, the authors used the U-Net to classify each pixel. Then, given the U-Net's output, they trained a second U-Net to transform the predicted classifications to resemble the human labels. This label adaptation step (sketched after this list) helped the system's output match conventions in the human-annotated test data (for example, where the nose ends and the rest of the face begins).
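The sampling step in the second bullet could look roughly like the following, assuming the identity coefficients are modeled with a multivariate Gaussian fit to the coefficients recovered from the scans. The authors' actual distribution may differ, and the names are illustrative.

```python
# Sketch of sampling novel identity parameters, assuming a Gaussian fit to the
# coefficients recovered from the real scans. Names are illustrative.
import numpy as np

def sample_identities(scan_coeffs, n_samples, seed=0):
    """scan_coeffs: array of shape (n_scans, n_id) from fitting the model to scans."""
    rng = np.random.default_rng(seed)
    mean = scan_coeffs.mean(axis=0)
    cov = np.cov(scan_coeffs, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# For example: new_ids = sample_identities(scan_coeffs, 100_000)
```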
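The segmentation training in the third bullet amounts to standard per-pixel classification. Here is a minimal PyTorch sketch, assuming an off-the-shelf U-Net and plain cross-entropy; the paper may weight or augment the loss differently, and `unet`, `synthetic_loader`, and `optimizer` are placeholders.

```python
# Sketch of one training epoch for per-pixel face parsing on the synthetic data.
import torch
import torch.nn.functional as F

def train_epoch(unet, synthetic_loader, optimizer, device="cuda"):
    unet.train()
    for images, labels in synthetic_loader:       # labels: (B, H, W) class ids
        images, labels = images.to(device), labels.to(device)
        logits = unet(images)                     # (B, num_classes, H, W)
        loss = F.cross_entropy(logits, labels)    # per-pixel classification loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```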
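The label-adaptation step in the last bullet can be pictured as training a second network on top of the first one's frozen predictions. A rough sketch under those assumptions; input and output conventions here are not taken from the paper.

```python
# Sketch of label adaptation: a second network maps the synthetic-trained
# model's predictions on real photos to the human-annotated labels.
import torch
import torch.nn.functional as F

def adapt_labels(seg_net, adapter, real_loader, optimizer, device="cuda"):
    seg_net.eval()                                # frozen synthetic-trained U-Net
    adapter.train()                               # second U-Net being trained
    for images, human_labels in real_loader:      # e.g., Helen images and masks
        images, human_labels = images.to(device), human_labels.to(device)
        with torch.no_grad():
            pred_probs = seg_net(images).softmax(dim=1)
        adapted_logits = adapter(pred_probs)
        loss = F.cross_entropy(adapted_logits, human_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```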
Results: The authors compared their system to a U-Net trained on real images from Helen. Their system recognized the part of the face each pixel belonged to with an overall F1 score (the harmonic mean of precision and recall; higher is better) of 0.920. The comparison model scored 0.916. This result fell somewhat short of the state of the art, EAGRNet, which achieved an F1 score of 0.932 on the same task.
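For reference, a quick sketch of how a per-class F1 can be computed over pixel-wise predictions; the benchmark's exact averaging scheme isn't specified here, so this is illustrative rather than the published protocol.

```python
# Illustrative per-class F1 over pixel-wise predictions.
import numpy as np

def f1_for_class(pred, target, class_id, eps=1e-9):
    tp = np.sum((pred == class_id) & (target == class_id))
    fp = np.sum((pred == class_id) & (target != class_id))
    fn = np.sum((pred != class_id) & (target == class_id))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)
```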
Why it matters: Synthetic data is handy when the real thing is hard to come by. Beyond photorealistic, annotated faces, the authors’ method can produce similarly high-quality UV (texture-coordinate) and depth images. It can also generate and label images outside the usual data distribution in a controllable way.
We're thinking: The authors generated an impressive diversity of realistic faces and expressions, but they were limited by a library of 512 discrete hairstyles, 30 items of clothing, and 54 accessories. We look forward to work that enables a 3D model to render these features as well.