Get ready for the next wave of language-model mania.
What’s new: OpenAI introduced the latest in its GPT series of large language models to widespread excitement. The company showed statistics and examples designed to demonstrate that the new model outstrips its predecessors in its language comprehension as well as its ability to adopt a desired style and tone and stay within bounds imposed by its designers. OpenAI co-founder Greg Brockman showed off some of its capabilities in a livestream that accompanied the launch.
How to get access: Text input/output is available via ChatGPT Plus, which costs $20 monthly, with image input to come. An API is forthcoming, and you can join the waitlist here.
How it works: OpenAI didn’t share many details, citing concerns about safety and competition. Like earlier GPT models, GPT-4 is based on the transformer architecture and trained to predict the next token on a mix of public and private datasets. It was fine-tuned using reinforcement learning from human feedback and engineered prompts.
- OpenAI is keeping mum about the precise architecture (including size), datasets, training procedure, and processing requirements.
- GPT-4 processes 32,000 tokens at a time internally, Brockman said — an order of magnitude more than estimates of ChatGPT’s token count — which enables it to work with longer texts than previous large language models.
- The model accepts image inputs including pages of text, photos, diagrams, and screenshots. (This capability isn’t yet publicly available because the company is still working to speed it up, Brockman said.) In one example, GPT-4 explained the humor in a photo of an iPhone whose sleek Lightning port had been adapted to accommodate a hulking VGA connector.
- A new type of input called a system message instructs the model on the style, tone, and verbosity to use in subsequent interactions. For example, a system message can condition the model to respond in the style of Socrates, encouraging users to arrive at their own answers through critical thinking.
- The company offers a new framework, OpenAI Evals, for creating and running benchmarks. It invites everyone to help test the model.
How it performs: GPT-4 aced a variety of AI benchmarks as well as simulated versions of tests designed for humans.
- GPT-4 outperformed the state of the art on MMLU multiple-choice question answering, HellaSwag common sense reasoning, AI2 grade-school multiple-choice science question answering, WinoGrande common-sense reasoning, HumanEval Python coding, and DROP reading comprehension and arithmetic.
- It exceeded GPT-3.5, Chinchilla, and PaLM English-language performance in 24 languages from Afrikaans to Welsh.
The model met or exceeded the state of the art in several vision benchmarks in TextVQA reading text in images, ChartQA, AI2 Diagram, DocVQA, Infographic VQA, and TVQA. - GPT-4 achieved between 80 and 100 percent on simulated human tests including the Uniform Bar Exam, LSAT, SAT, and advanced placement tests in biology, psychology, microeconomics, and statistics.
- GPT-4 jumps its guardrails when asked about disallowed topics like how to obtain dangerous substances roughly 1 percent of the time, while GPT-3.5 does so around 5 percent of the time. Similarly, GPT-4 misbehaves when asked about sensitive topics such as self-harm around 23 percent of the time, while GPT-3.5 does so around 42 percent of the time.
Where it works: Several companies are already using GPT-4.
- OpenAI itself has been using the model for content moderation, sales, customer support, and coding.
- The updated Microsoft Bing search, which launched last month, is based on GPT-4.
- Stripe uses GPT-4 to scan and write summaries of business websites.
- Paid subscribers to Duolingo can learn languages by conversing with GPT-4.
Yes, but: OpenAI doesn’t mince words about the new model’s potential to wreak havoc: “While less capable than humans in many real-world scenarios . . . GPT-4's capabilities and limitations create significant and novel safety challenges.” While the model outperformed its predecessors in internal adversarial evaluations of factual correctness, like other large language models, it still invents facts, makes reasoning errors, generates biased output, and couches incorrect statements in confident language. In addition, it lacks knowledge of events that transpired after September 2021, when its training corpus was finalized. OpenAI details the safety issues here.
Why it matters: As language models become more capable, they become more useful. It’s notable that OpenAI believes this model is ready to commercialize from the get-go: This is the first time it has introduced a new model alongside product launches that take advantage of it.
We’re thinking: Stable Diffusion, Phenaki, MusicLM, GPT-4: This is truly a golden time in AI!