Voice-to-Voice and More for GPT-4o API

OpenAI unveils tools for speech, vision, and cost-efficiency at DevDay

OpenAI launched a suite of new and updated tools to help AI developers build applications and reduce costs.

What’s new: At its annual DevDay conference, OpenAI introduced an API for speech processing using GPT-4o, distillation tools, vision fine-tuning capabilities, and the ability to cache prompts for later reuse. These tools are designed to make it easier to build fast applications that use audio input and output, customize models, and cut costs on common tasks.

Development simplified: The new offerings aim to make it easier to build applications using OpenAI models, with an emphasis on voice input/output and image input, customizing models, and resolving common pain points.

  • The Realtime API enables speech-to-speech interactions with GPT-4o using six preset voices, like ChatGPT’s Advanced Voice Mode but with lower latency. The API costs $100/$200 per 1 million input/output tokens (about $0.06/$0.24 per minute of input/output). (The API processes text at $5/$20 per million input/output tokens.) A minimal connection sketch follows this list.
  • The Chat Completions API now accepts voice input and generates voice output for GPT-4o’s usual price ($3.75/$15 per million input/output tokens). However, it generates output more slowly than the Realtime API. (OpenAI didn’t disclose specific latency measurements.) A brief usage sketch follows the list.
  • The distillation tools simplify the process of using larger models like o1-preview as teachers whose output is used to fine-tune smaller, more cost-efficient students like GPT-4o mini. Developers can generate datasets, fine-tune models, and evaluate performance within OpenAI’s platform. A sketch of the logging step follows the list.
  • Vision fine-tuning allows developers to enhance GPT-4o’s image understanding by fine-tuning the model on a custom image dataset. For instance, developers can improve visual search, object detection, or image analysis for a particular application by fine-tuning the model on domain-specific images. Vision fine-tuning costs $25 per million training tokens for GPT-4o, but OpenAI will give developers 1 million free training tokens per day through October 31. An example training record follows the list.
  • Prompt caching automatically reuses input tokens that were entered in recent interactions with GPT-4o, GPT-4o mini, and their fine-tuned variants. Repeated prompts cost half as much and get processed faster. The discount and speed especially benefit applications like chatbots and code editors, which frequently reuse input context. A sketch for checking cache hits follows the list.
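
To make the Realtime API concrete, here is a minimal sketch of opening a speech-to-speech session over WebSocket. The endpoint, beta header, model name, and event fields follow what OpenAI documented at launch; treat them as assumptions that may have changed since.

```python
# Minimal Realtime API sketch. Assumes the `websockets` package and the
# beta endpoint/event format as documented at launch.
import asyncio
import json
import os

import websockets

async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",  # beta header required at launch
    }
    # `additional_headers` is named `extra_headers` on older websockets versions.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Ask the model to respond; audio arrives as a stream of server events.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Greet the user in one sentence.",
            },
        }))
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])  # e.g., response.audio.delta
            if event["type"] == "response.done":
                break

asyncio.run(main())
```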
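For the Chat Completions route, a hedged sketch using the official Python SDK: the audio-capable model snapshot and the `modalities`/`audio` parameters follow OpenAI's launch documentation, but check current docs for exact names.

```python
# Sketch: voice output via Chat Completions, per OpenAI's launch docs.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",        # audio-capable snapshot at launch
    modalities=["text", "audio"],        # request both a transcript and speech
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)

# The spoken reply arrives base64-encoded alongside the text transcript.
wav_bytes = base64.b64decode(response.choices[0].message.audio.data)
with open("hello.wav", "wb") as f:
    f.write(wav_bytes)
```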
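The programmatic half of the distillation workflow is logging teacher outputs for reuse as training data. A sketch under the assumption that the `store` and `metadata` parameters behave as documented at launch; the prompt and tag here are hypothetical.

```python
# Sketch: log teacher (o1-preview) completions for later distillation.
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user",
               "content": "Explain prompt caching in two sentences."}],
    store=True,                             # save the pair in Stored Completions
    metadata={"use_case": "distillation"},  # hypothetical tag for filtering later
)
# In the platform UI, filter stored completions by this tag, export them as a
# dataset, and start a fine-tuning job with GPT-4o mini as the student.
```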
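Vision fine-tuning uses the same chat-format JSONL as text fine-tuning, with images embedded in user messages. Below is a sketch of a single training record; the question, URL, and label are placeholders.

```python
# Sketch: one vision fine-tuning example in OpenAI's chat-format JSONL.
import json

example = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "What traffic sign is shown?"},
            # Placeholder URL; images can also be supplied as base64 data URLs.
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sign.jpg"}},
        ]},
        {"role": "assistant", "content": "A yield sign."},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
# Upload train.jsonl with purpose="fine-tune", then create a fine-tuning job
# against a gpt-4o snapshot via client.fine_tuning.jobs.create(...).
```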
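Prompt caching requires no API changes, but it rewards putting the static part of a prompt first. This sketch checks how many input tokens hit the cache; the `cached_tokens` usage field and the roughly 1,024-token activation threshold are per OpenAI's launch docs, and the manual text is a stand-in.

```python
# Sketch: verify prompt-cache hits. At launch, caching kicked in automatically
# for prompts over ~1,024 tokens; keep the reusable prefix at the front.
from openai import OpenAI

client = OpenAI()

manual = "AcmeCo support manual. " * 400  # stand-in for a long, static prefix

def ask(question: str):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": manual},  # static prefix, cacheable
            {"role": "user", "content": question},  # varying suffix
        ],
    )

first = ask("How do I reset my password?")
second = ask("What is the warranty period?")  # same prefix, should hit the cache
print(second.usage.prompt_tokens_details.cached_tokens)  # > 0 on a cache hit
```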

Behind the news: OpenAI is undertaking a major corporate transformation. A recent funding round values OpenAI at $157 billion, making it among the world’s most valuable private companies, and the company is transferring more control from its nonprofit board to its for-profit subsidiary. Meanwhile, it has seen an exodus of executives that includes CTO Mira Murati, Sora co-lead Tim Brooks, chief research officer Bob McGrew, research VP Barret Zoph, and other key researchers.

Why it matters: The Realtime API enables speech input and output without converting speech to text, allowing for more natural voice interactions. Such interactions open a wide range of applications, and they’re crucial for real-time systems like customer service bots and virtual assistants. Although Amazon Web Services and Labelbox provide services to distill knowledge from OpenAI models into open architectures, OpenAI’s tools ease the process of distilling from OpenAI models into other OpenAI models. Vision fine-tuning and prompt caching, like similar capabilities for Anthropic’s Claude and Google’s Gemini, are welcome additions.

We’re thinking: OpenAI’s offerings have come a long way since DevDay 2023, when speech recognition was “coming soon.” We’re eager to see what developers do with voice-driven applications!
