Multimodal

Google’s Multimodal Challenger: All you need to know about Gemini, Google’s new multimodal model

Google unveiled Gemini, its bid to catch up to, and perhaps surpass, OpenAI’s GPT-4. Google demonstrated the Gemini family of models, which accept any combination of text (including code), images, video, and audio, and output text and images. The demonstrations and metrics were impressive...

Synthetic Data Helps Image Generators: OpenAI researchers improved text-to-image prompt following with generated captions.

Text-to-image generators often miss details in text prompts, and sometimes they misunderstand parts of a prompt entirely. Synthetic captions can help them follow prompts more closely.
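
The core recipe is easy to sketch: use a separate captioning model to generate detailed captions for training images, then train the text-to-image model on a blend of synthetic and original captions. Below is a minimal illustration, not OpenAI’s implementation; the `captioner` callable and the mixing ratio are assumptions.

```python
import random

def recaption_dataset(examples, captioner, synthetic_ratio=0.9):
    """Blend synthetic and original captions before text-to-image training.

    examples: iterable of (image, original_caption) pairs.
    captioner: hypothetical image-to-text model that produces detailed,
    literal descriptions (an assumption, not a named real API).
    """
    blended = []
    for image, original_caption in examples:
        if random.random() < synthetic_ratio:
            caption = captioner(image)   # detailed generated caption
        else:
            # Keep some original alt-text so the trained model
            # still handles short, terse prompts.
            caption = original_caption
        blended.append((image, caption))
    return blended
```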

Vision and Language Tightly Bound: Training on a single loss function improves multimodal AI.

Recent multimodal models process both text and images as sequences of tokens, but they learn to represent these distinct data types using separate loss functions. Recent work unifies the loss function as well.
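
In outline, the unified approach treats both modalities as one stream: map images to discrete tokens (for instance, with a VQ-style tokenizer), concatenate them with text tokens in a shared vocabulary, and train with a single next-token cross-entropy loss. Here’s a minimal sketch under those assumptions; it isn’t the paper’s code, and `model` stands in for any autoregressive transformer that returns logits over the shared vocabulary.

```python
import torch
import torch.nn.functional as F

def unified_next_token_loss(model, text_tokens, image_tokens):
    """One autoregressive cross-entropy loss over a mixed sequence.

    text_tokens, image_tokens: (batch, len) integer IDs drawn from a
    shared vocabulary (images via a hypothetical discrete tokenizer).
    """
    seq = torch.cat([text_tokens, image_tokens], dim=1)
    logits = model(seq[:, :-1])          # (batch, len-1, vocab)
    targets = seq[:, 1:]                 # predict each next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
```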

GPT-4 Has Landed: Everything you need to know about GPT-4.

Get ready for the next wave of language-model mania. OpenAI introduced the latest in its GPT series of large language models to widespread excitement. The company showed statistics and examples designed to demonstrate...

One Model Does It All: Multi-task AI models got more sophisticated in 2022.

Individual deep learning models proved their mettle in hundreds of tasks. The scope of multi-task models expanded dramatically in the past year.

One Model, Hundreds of Tasks: Multimodal Transformer Performs Over 600 Different Tasks

Researchers took a step toward achieving a longstanding goal: One model that performs a whole lot of very different tasks. Scott Reed, Konrad Żołna, Emilio Parisotto and a team at DeepMind announced Gato.

AI Versus the Garbage Heap: How Amazon uses AI to cut waste.

Amazon reported long-term success using machine learning to shrink its environmental footprint. The online retailer developed a system that fuses product descriptions, images, and structured data to decide how an item should be packed for shipping.
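
Amazon hasn’t published its code, but the general pattern — encode each modality, concatenate the embeddings, and classify — is straightforward. Below is a hedged sketch of that late-fusion pattern; the dimensions, encoders, and package classes are illustrative assumptions, not Amazon’s actual system.

```python
import torch
import torch.nn as nn

class PackagingClassifier(nn.Module):
    """Late-fusion sketch: concatenate text, image, and structured-data
    features, then predict a packaging type."""

    def __init__(self, text_dim=768, image_dim=512, tabular_dim=32,
                 num_package_types=5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim + tabular_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_package_types),  # e.g., box, mailer, bag...
        )

    def forward(self, text_emb, image_emb, tabular_feats):
        # Assumes upstream encoders already produced fixed-size embeddings.
        fused = torch.cat([text_emb, image_emb, tabular_feats], dim=-1)
        return self.head(fused)
```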

Multimodal AI Takes Off: Multimodal models, such as CLIP and DALL·E, are taking over AI.

While models like GPT-3 and EfficientNet, which work on text and images respectively, are responsible for some of deep learning’s highest-profile successes, approaches that find relationships between text and images made impressive...

Richer Video Representations: Pretraining Method Improves AI's Ability to Understand Video

To understand a movie scene, viewers often must remember or infer previous events and extrapolate potential consequences. New work improved a model’s ability to do the same.

Search Goes Multimodal: Google Upgrades its Search Algorithm with Multimodal AI

Google will upgrade its search engine with a new model that tracks the relationships between words, images, and, in time, videos — the first fruit of its latest research into multimodal machine learning and multilingual language modeling.

Crawl the Web, Absorb the Bias: NLP Models Absorb Biases from Web Training Data

The emerging generation of trillion-parameter models needs datasets of billions of examples, but the most readily available source of examples on that scale — the web — is polluted with bias and antisocial expressions. A new study examines the issue.

Transformers Are Smarter Than You Think: Language transformers can do math, vision, and logic.

The transformer architecture has shown an uncanny ability to model not only language but also images and proteins. New research found that it can apply what it learns from the first domain to the others.
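
The recipe, dubbed a Frozen Pretrained Transformer, keeps the pretrained self-attention and feed-forward weights frozen and trains only a new input projection, a new output head, the layer norms, and the positional embeddings. Here’s a minimal sketch using Hugging Face’s GPT-2; the trainable-parameter selection follows the paper’s description, but the module names and classification setup are specific to this illustration.

```python
import torch.nn as nn
from transformers import GPT2Model

class FrozenPretrainedTransformer(nn.Module):
    """Adapt a frozen language transformer to a new modality."""

    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        for name, param in self.gpt2.named_parameters():
            # Fine-tune only layer norms ("ln") and positional
            # embeddings ("wpe"); everything learned from language
            # stays frozen.
            param.requires_grad = "ln" in name or "wpe" in name
        hidden = self.gpt2.config.n_embd            # 768 for gpt2
        self.embed_in = nn.Linear(input_dim, hidden)  # new, trained
        self.head = nn.Linear(hidden, num_classes)    # new, trained

    def forward(self, x):               # x: (batch, seq_len, input_dim)
        h = self.gpt2(inputs_embeds=self.embed_in(x)).last_hidden_state
        return self.head(h[:, -1])      # classify from the final position
```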

I Know It When I See It: Zero-shot detection for objects not in training data.

Object detectors typically detect only items that were labeled in their training data. A new method liberates them to locate and recognize a much wider variety of objects.
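
One common way to get this behavior is to pair a detector’s region proposals with a vision-language embedding space: embed each cropped region with an image encoder, embed arbitrary class names with a text encoder, and label each region with its nearest class. The sketch below follows that general CLIP-style recipe rather than the specific paper; the encoder interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def zero_shot_label_regions(region_embs, class_names, text_encoder):
    """Assign open-vocabulary labels to detected regions.

    region_embs: (num_regions, d) embeddings of cropped region proposals.
    text_encoder: any model mapping a string to a d-dim tensor in the
    same space (a CLIP-style encoder; an assumption, not a real API).
    """
    text_embs = torch.stack([text_encoder(f"a photo of a {c}")
                             for c in class_names])
    # Cosine similarity between every region and every class name.
    sims = (F.normalize(region_embs, dim=-1)
            @ F.normalize(text_embs, dim=-1).T)
    best = sims.argmax(dim=-1)          # most similar class per region
    return [class_names[i] for i in best]
```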

One Model for Vision-Language: A general purpose AI for vision and language tasks.

Researchers have proposed task-agnostic architectures for image classification and for language processing. New work proposes a single architecture for vision-language tasks.

Large Language Models for Chinese: A brief overview of the WuDao family

Researchers unveiled competition for the reigning large language model GPT-3. Beijing Academy of Artificial Intelligence, a research collective funded by the Chinese government, described four models collectively called Wu Dao, according to Synced Review.