Multimodal to the Max
4M-21 multimodal model excels in handling diverse input and output types

A graphic shows an any-to-any multimodal model, with text mapping to RGB or geometric modalities.

Researchers introduced a model that handles an unprecedented number of input and output types, including many related to performing computer vision tasks.

What’s new: Roman Bachmann, Oguzhan Fatih Kar, David Mizrahi and colleagues at EPFL and Apple built 4M-21, a system that works with 21 input and output types. These include modalities related to images, geometry, and text along with metadata and embeddings produced by other models.

Key insight: The authors followed and extended the insight behind their earlier 4M, which handles seven input and output types, and work such as Unified-IO 2, which handles 11: to train a model on many input types, give the training data a common format, with same-sized embeddings, across all of them. With a transformer architecture, tokens suffice.
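
As a loose illustration of that shared-token idea (a sketch, not code from the paper), the snippet below offsets discrete token IDs from several modalities into one vocabulary and concatenates them into a single sequence that one transformer could consume; the modality names and vocabulary sizes are assumptions.

```python
import torch

# Hypothetical per-modality vocabularies; real sizes come from each tokenizer.
VOCAB_SIZES = {"rgb": 8192, "depth": 8192, "caption": 30000}

# Offset each modality's token IDs into one shared vocabulary so a single
# transformer embedding table can cover all of them.
OFFSETS, total = {}, 0
for name, size in VOCAB_SIZES.items():
    OFFSETS[name] = total
    total += size

def pack(sequences: dict[str, torch.Tensor]) -> torch.Tensor:
    """Concatenate per-modality token IDs into one transformer input sequence."""
    return torch.cat([ids + OFFSETS[name] for name, ids in sequences.items()])

# Example: 16 image tokens, 16 depth tokens, and a 5-token caption share one sequence.
tokens = pack({
    "rgb": torch.randint(0, 8192, (16,)),
    "depth": torch.randint(0, 8192, (16,)),
    "caption": torch.randint(0, 30000, (5,)),
})
print(tokens.shape)  # torch.Size([37])
```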

How it works: 4M-21 comprises a large transformer and several encoder-decoders that convert different data types into tokens and back. The authors repeated their training strategy for 4M, but they increased the transformer’s size from 303 million parameters to 3 billion parameters, boosted the training dataset size from 400 million examples to 500 million examples, and incorporated new input types. 

  • The authors started with RGB images and captions from CC12M and COYO700M plus text from C4.
  • Using a variety of tools, they extracted depth images, surface-normal images, semantically segmented images, images of edges, graphics metadata, bounding boxes, color palettes, web text, image embeddings (feature maps and global embeddings), and text embeddings. For instance, they performed semantic segmentation using Mask2Former and SAM, and extracted edges using OpenCV and SAM, counting each output as a separate data type.
  • They converted all input types into tokens. For image-like data types and image embeddings, they trained a VQ-VAE to reconstruct images and, in doing so, represent them as tokens. For human poses and the embeddings from DINOv2 and ImageBind, they trained a bottleneck MLP to reconstruct them and thus learn to represent them as tokens. They tokenized sequence data, including text and metadata, using WordPiece. (The first sketch after this list illustrates the quantization step.)
  • Given a random sample of tokens across all modalities, 4M-21 learned to predict a different random sample of tokens. The random samples were sometimes biased toward one modality and other times toward a more balanced mix. To determine which tokens to produce, 4M-21 received mask tokens that specified the desired modalities and token positions in the output. (The second sketch after this list shows a simplified version of this objective.)
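
The VQ-VAE tokenization mentioned above boils down to a vector-quantization step. This sketch, with assumed shapes and codebook size rather than the paper's, maps continuous patch features to discrete token IDs.

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each feature vector to the index of its nearest codebook entry.

    This is the core of VQ-VAE-style tokenization: continuous image features
    become discrete token IDs a transformer can consume. The encoder and decoder
    networks that produce and reconstruct the features are omitted here.
    """
    distances = torch.cdist(features, codebook)  # (num_patches, vocab_size) L2 distances
    return distances.argmin(dim=1)               # one token ID per patch

codebook = torch.randn(8192, 64)       # assumed vocabulary of 8,192 codes, 64-dimensional
patch_features = torch.randn(196, 64)  # e.g., a 14x14 grid of image patches
tokens = quantize(patch_features, codebook)
print(tokens.shape)  # torch.Size([196])
```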

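The training objective in the last item might look roughly like the single-sequence simplification below. The real setup samples input and target tokens per modality with biased proportions and uses an encoder-decoder, so MASK_ID, the 50/50 split, and the in-place masking are assumptions.

```python
import torch

MASK_ID = 0  # hypothetical reserved ID marking positions the model must predict

def make_training_pair(tokens: torch.Tensor, input_frac: float = 0.5):
    """Split a mixed-modality token sequence into visible inputs and hidden targets.

    A random subset of positions stays visible; the rest are replaced by MASK_ID
    and must be predicted from the visible tokens.
    """
    perm = torch.randperm(tokens.numel())
    hidden_pos = perm[int(tokens.numel() * input_frac):]

    inputs = tokens.clone()
    inputs[hidden_pos] = MASK_ID             # hide the tokens to be predicted
    targets = torch.full_like(tokens, -100)  # -100 is ignored by cross-entropy loss
    targets[hidden_pos] = tokens[hidden_pos]
    return inputs, targets

# Toy example: a 37-token sequence spanning several modalities.
inputs, targets = make_training_pair(torch.randint(1, 1000, (37,)))
# loss = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)
```
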
Results: 4M-21 demonstrated strong zero-shot performance on a variety of vision tasks. For instance, in estimating surface normals for each point in an image, 4M-21 achieved 20.8 L1 (average absolute difference between predicted and true values, lower is better), while the multimodal model Unified-IO 2 XL achieved 34.8 L1. In estimating an image’s depth map, 4M-21 achieved 0.68 L1, while Unified-IO 2 XL achieved 0.86 L1. In semantic segmentation, 4M-21 reached 48.1 percent mean intersection over union (overlap between predicted and ground-truth segments divided by their union, higher is better), while Unified-IO 2 XL achieved 39.7 percent.
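
For reference, the metrics quoted above can be computed as in this minimal sketch; the toy map size and class count are illustrative, not the benchmark setup.

```python
import numpy as np

def l1_error(pred: np.ndarray, target: np.ndarray) -> float:
    """Average absolute difference between predicted and true values (lower is better)."""
    return float(np.abs(pred - target).mean())

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean intersection over union across classes (higher is better)."""
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                     # skip classes absent from both maps
            ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 4x4 segmentation maps with 3 classes.
pred = np.random.randint(0, 3, (4, 4))
target = np.random.randint(0, 3, (4, 4))
print(mean_iou(pred, target, num_classes=3))
```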

Why it matters: Since 4M-21 learned to predict tokens of several modalities from tokens of other modalities, it isn’t limited to a single modality as input. The authors demonstrated that it can generate new images conditioned on a combination of a caption and 3D human poses, edges, or metadata.

We’re thinking: The authors say 4M-21 can take as input any combination of the modalities it’s trained to handle and output any of them. The limits of this capability aren’t clear, but it opens the door to fine control over the model’s output. The authors explain how they extracted the various modalities; presumably users can do the same to prompt the model for the output they desire. For instance, a user could request an image by entering not only a prompt but also a color palette, edges, and a depth map extracted from another image, and receive output that integrates those elements.
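
A workflow along those lines might look roughly like the sketch below, which only assembles the conditioning signals; the dictionary keys are assumptions, and the step of tokenizing each signal and passing it to 4M-21 is omitted because the model’s interface isn’t described here.

```python
import numpy as np
import cv2  # OpenCV, one of the tools the authors used to extract edges

# Synthetic stand-in for a reference photo; a real workflow would load one from disk.
reference = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)

# Conditioning signals of the kinds described above. Each would be tokenized by
# the matching modality tokenizer before being passed to the model (not shown).
conditions = {
    "caption": "a cozy reading nook at sunset",
    "edges": cv2.Canny(cv2.cvtColor(reference, cv2.COLOR_RGB2GRAY), 100, 200),
    "palette": ["#f4a261", "#e76f51", "#264653"],
}
print({name: getattr(value, "shape", value) for name, value in conditions.items()})
```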
