OpenAI introduced a state-of-the-art agent that produces research reports by scouring the web and reasoning over what it finds.
What’s new: OpenAI’s deep research responds to users’ requests by generating a detailed report based on hundreds of online sources. The system currently generates text-only output, with images and other media expected soon. The agent is available only to ChatGPT Pro subscribers, but the company plans to roll it out to ChatGPT Plus, Team, and Enterprise users.
How it works: Deep research is an agent that uses OpenAI’s o3 model, which is not yet publicly available. The model was trained via reinforcement learning to use a browser and Python tools, similar to the way o1 learned to reason via reinforcement learning. OpenAI has not yet released detailed information about how it built the system.
- The system responds best to detailed prompts that specify the desired output (such as the desired information, comparisons, and format), the team said in its announcement video (which features Mark Chen, Josh Tobin, Neel Ajjarapu, and Isa Fulford, co-instructor of our short courses “ChatGPT Prompt Engineering for Developers” and “Building Systems with the ChatGPT API”).
- Before answering, Deep research asks clarifying questions about the task.
- In the process of answering, the system presents a sidebar that summarizes the model’s chain of thought, terms it searched, websites it visited, and so on.
- The system can take as long as 30 minutes to provide output.
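The loop described above — search, browse, reason over what was found, and surface a visible trace along the way — can be sketched roughly as follows. This is a hypothetical illustration, not OpenAI’s implementation (which has not been published); the tool functions are stubs standing in for real search and browsing calls, and all names are invented.

```python
# Hypothetical sketch of an agentic research loop like the one described
# above. The search and browsing tools are stubbed out; a real agent
# would call live APIs and let the model plan each next step.

def search(query):
    # Stub for a web-search tool; a real agent would query a search API.
    return [f"https://example.com/{query.replace(' ', '-')}"]

def fetch(url):
    # Stub for a browsing tool that returns page text.
    return f"(contents of {url})"

def research(task, max_steps=3):
    notes = []   # evidence gathered from visited pages
    log = []     # a running trace, like the sidebar summary shown to users
    for step in range(max_steps):
        # In a real system, the model would reason about what to search next.
        query = f"{task} (step {step + 1})"
        log.append(f"searched: {query}")
        for url in search(query):
            log.append(f"visited: {url}")
            notes.append(fetch(url))
    # In a real system, the model would synthesize the notes into a report.
    report = f"Report on {task!r} drawing on {len(notes)} sources."
    return report, log

report, log = research("agentic web research")
print(report)
```

The trace (`log`) is what lets a user audit the agent’s work, much as deep research’s sidebar summarizes queries issued and websites visited.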
Result: On Humanity’s Last Exam, a benchmark of 3,000 multiple-choice and short-answer questions that cover subjects from ecology to rocket science, OpenAI deep research achieved 26.6 percent accuracy. In comparison, DeepSeek-R1 (without web browsing or other tool use) achieved 9.4 percent, and o1 (also without tool use) achieved 9.1 percent. On GAIA, a benchmark of questions designed to be difficult for large language models that lack access to additional tools, OpenAI deep research achieved 67.36 percent accuracy, exceeding the previous state of the art of 63.64 percent.
Behind the news: OpenAI’s deep research follows a similar offering of the same name that Google launched in December. A number of open-source teams have built research agents that work in similar ways. Notable releases include a Hugging Face project that attempted to replicate OpenAI’s work (not including training) in 24 hours, achieving 55.15 percent accuracy on GAIA, and gpt-researcher, which implemented agentic web search in 2023, well before Google and OpenAI launched their agentic research systems.
Why it matters: Reasoning models like o1 and o3 made a splash not just because they delivered superior results but also because of the impressive reasoning steps they take to produce them. Combining that ability with web search and tool use enables large language models to formulate better answers to difficult questions, including those whose answers aren’t in the training data or whose answers change over time.
We’re thinking: Taking as much as 30 minutes of processing to render a response, OpenAI’s deep research clearly illustrates why we need more compute for inference.