An image matte is what makes it possible to take an image of a zebra in a zoo, extract the zebra, and paste it over a savannah background. Make the background (zoo) pixels transparent, leave the foreground (zebra) pixels opaque, and maintain a fringe of semitransparent pixels around the foreground (the zebra's fur, especially its wispy mane and tail) that blends the colors of the original foreground and the new background. Then you can meld the foreground seamlessly with any background. New work produces mattes automatically with fewer errors than previous machine learning methods.
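Concretely, a matte is a per-pixel transparency (alpha) map, and compositing follows the standard equation C = αF + (1 − α)B. Here's a minimal NumPy sketch of that blend; the function name and array shapes are illustrative, not from the paper:

```python
import numpy as np

def composite(foreground, background, alpha):
    """Alpha-blend a foreground over a new background.

    foreground, background: float arrays of shape (H, W, 3), values in [0, 1].
    alpha: matte of shape (H, W, 1), values in [0, 1]; 1 means opaque
    foreground, 0 means transparent background, and in-between values
    (the fringe around fur or hair) mix the two colors.
    """
    return alpha * foreground + (1.0 - alpha) * background
```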
What’s new: Guowei Chen, Yi Liu, and colleagues at Baidu introduced PP-Matting, an architecture that, given an image, estimates the transparency of pixels surrounding foreground objects to create mattes without requiring additional input.
Key insight: Previous matte-making approaches require a pre-existing three-level map, or trimap, that segments an image into foreground, background, and semitransparent transitional regions. The previous best neural method trains one model to produce trimaps and another to extract the foreground and estimate transparency. But using two models in sequence can result in cumulative errors: if the first model produces an erroneous trimap, the second will produce an erroneous matte. Using a single model to produce both trimaps and mattes avoids such errors and thus produces more accurate output.
How it works: The authors’ model comprises a convolutional neural network (CNN) encoder that feeds into two CNN branches. They trained and tested it on Distinctions-646 and Adobe Composition-1k, datasets that contain foreground images of people, objects, or animals, each stacked atop a background image, with a transparency value for each pixel.
- One branch classified each pixel of an input image as foreground, background, or transitional area, creating a trimap. A Pyramid Pooling Module captured both large- and small-scale features by pooling the encoder's output to several resolutions and processing each separately (see the pyramid pooling sketch after this list). It concatenated these multi-scale representations with the encoder's output and fed them to the branch's convolutional layers, which produced the trimap. During training, the loss function encouraged the trimap to match the ground-truth trimap.
- The other branch estimated the transparency of each pixel, creating a so-called detail map. To take advantage of context from the trimap, the model combined the output of each convolutional layer in this branch with the output of the corresponding layer in the trimap branch using a Gated Convolutional Layer (see the gated fusion sketch after this list). During training, the loss function encouraged the estimated transparencies, and the differences in transparency between adjacent pixels, to be similar to ground truth. The loss was applied only to pixels in transitional regions.
- The model replaced the transitional areas of the trimap with the corresponding areas of the detail map, producing a final matte (see the fusion sketch after this list). During training, it reapplied the loss function from the previous step, this time to the entire matte.
- The model also used the generated matte to re-create the original image: It applied the matte to the ground-truth foreground and composited the result atop the ground-truth background. A further loss function encouraged the re-composited pixel colors to match those of the original image (this composition loss appears in the fusion sketch below).
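The Pyramid Pooling Module in the trimap branch is a well-known component (introduced in PSPNet). Here's a minimal PyTorch sketch of the generic module; the bin sizes and channel counts are assumptions, and the authors' actual implementation may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """Pools a feature map at several scales and concatenates the
    upsampled results with the original features."""

    def __init__(self, in_channels, out_channels, bin_sizes=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(size),  # pool to a size x size grid
                nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
                nn.ReLU(inplace=True),
            )
            for size in bin_sizes
        )

    def forward(self, x):
        h, w = x.shape[2:]
        pyramids = [x]
        for stage in self.stages:
            # Upsample each pooled representation back to the input size.
            pooled = stage(x)
            pyramids.append(F.interpolate(pooled, size=(h, w),
                                          mode="bilinear", align_corners=False))
        # Output channels: in_channels + len(bin_sizes) * out_channels.
        return torch.cat(pyramids, dim=1)
```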
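The Gated Convolutional Layer injects trimap-branch context into the detail branch. The sketch below uses one common gated-convolution formulation (a sigmoid gate multiplied into a feature transform); the exact gating in PP-Matting may differ, and all names here are illustrative:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuses detail-branch features with trimap-branch features through
    a learned per-pixel gate."""

    def __init__(self, detail_channels, trimap_channels):
        super().__init__()
        fused = detail_channels + trimap_channels
        self.gate = nn.Conv2d(fused, detail_channels, kernel_size=3, padding=1)
        self.feat = nn.Conv2d(fused, detail_channels, kernel_size=3, padding=1)

    def forward(self, detail, trimap_feat):
        # Assumes both feature maps share the same spatial resolution.
        x = torch.cat([detail, trimap_feat], dim=1)
        # The sigmoid gate decides, per pixel, how much semantic context
        # from the trimap branch flows into the detail features.
        return torch.sigmoid(self.gate(x)) * self.feat(x)
```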
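Finally, a sketch of the fusion step and the composition loss, assuming the trimap branch outputs three-class probabilities and the detail branch outputs a single alpha channel. Tensor shapes and function names are ours, and an L1 penalty stands in for whatever loss the paper uses:

```python
import torch
import torch.nn.functional as F

def fuse_matte(trimap_probs, detail_alpha):
    """Replace transitional areas of the trimap with the detail map.

    trimap_probs: (N, 3, H, W) class probabilities over background,
    transition, and foreground. detail_alpha: (N, 1, H, W) transparency.
    """
    labels = trimap_probs.argmax(dim=1, keepdim=True)  # 0=bg, 1=transition, 2=fg
    alpha = torch.where(labels == 0, torch.zeros_like(detail_alpha), detail_alpha)
    alpha = torch.where(labels == 2, torch.ones_like(alpha), alpha)
    return alpha  # transitional pixels keep the detail branch's estimate

def composition_loss(alpha, fg, bg, image):
    """Re-composite with the predicted matte and compare to the original."""
    recomposed = alpha * fg + (1.0 - alpha) * bg
    return F.l1_loss(recomposed, image)
```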
Results: The authors compared their model with techniques that require trimap inputs, including IndexNet (the best competing method) and Deep Image Matting. They also compared with Hierarchical Attention Matting Network (HAttMatting), a single model that doesn’t require trimap inputs but also doesn’t produce the trimaps internally. The authors’ method achieved equal or better performance on three of four metrics for both datasets. On Composition-1k, the authors’ method scored a mean squared error of 0.005, equal to IndexNet. On Distinctions-646, it achieved 0.009 mean squared error, equal to Deep Image Matting and HAttMatting.
Why it matters: The main problems with previous trimap-free approaches to matting were cumulative errors and blurred output. This work addresses cumulative errors by separating trimap generation and transparency estimation into different branches of a single model. It addresses image quality by feeding output from the first branch into the second to refine representations of transitional areas.
We're thinking: The ability to produce high-quality mattes without needing to produce trimaps by hand seems likely to make video effects quicker and less expensive to produce. If so, then deep learning is set to make graphics, movies, and TV — which are already amazing — even more mind-boggling!