OpenAI pulled back the curtain on revised rules that will guide its models.
What’s new: OpenAI published its Model Spec, a set of high-level guidelines that human labelers will use to steer model behavior. The company is inviting public comments on the spec until May 22, though it has not said whether or how it will incorporate them.
How it works: During training, human labelers rate a model’s responses so the model can be fine-tuned to conform to human preferences, a process known as reinforcement learning from human feedback (RLHF). The Model Spec outlines the principles — some new, some previously in use — that will drive those ratings. The principles are arranged hierarchically, and each category overrides those below it.
- Three top-level objectives describe basic principles for model behavior: (i) “Assist the developer and end user” defines the relationship between humans and the model. (ii) “Benefit humanity” guides the model to consider both benefits and harms that may result from its behavior. (iii) “Reflect well on OpenAI” reinforces the company’s brand identity and respect for social norms and laws.
- Six rules govern behavior. In order, models are to prioritize platform rules above requests from developers, users, and tools; follow laws; withhold hazardous information; respect intellectual property; protect privacy; and keep their output “safe for work.” (These rules can lead to contradictions. For instance, the model will comply if a user asks ChatGPT to translate a request for drug-related information, because the directive to follow user requests precedes the one to withhold hazardous information. A sketch of this kind of precedence resolution follows the list below.)
- What OpenAI calls defaults govern the model’s interaction style. These include “ask clarifying questions when necessary,” “express uncertainty,” “assume an objective point of view,” and “don't try to change anyone's mind.” For example, if a user insists the Earth is flat, the model may respond, “Everyone's entitled to their own beliefs, and I'm not here to persuade you!”
- The spec will evolve in response to the AI community’s needs. In the future, developers may be able to customize it. For instance, the company is considering allowing developers to lift prohibitions on “not safe for work” output such as erotica, gore, and some profanity.
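To make the precedence idea concrete, here is a minimal sketch of how a priority-ordered rule list could resolve the translation conflict described above. The rule names, the `resolve` function, and the directive strings are assumptions made for illustration; OpenAI has not published an implementation.

```python
# Toy illustration of hierarchical rule precedence, not OpenAI's implementation.
# Rules are checked from highest to lowest priority; the first applicable
# directive wins, so a higher-ranked instruction (e.g., "follow user requests")
# overrides a lower-ranked one (e.g., "withhold hazardous information").

RULES = [
    "follow platform rules",          # highest priority
    "follow developer instructions",
    "follow user requests",
    "follow laws",
    "withhold hazardous information",
    "respect intellectual property",
    "protect privacy",
    "keep output safe for work",      # lowest priority
]

def resolve(directives: dict[str, str]) -> str:
    """Return the action dictated by the highest-priority applicable rule.

    `directives` maps a rule name to the action it implies for a given request.
    """
    for rule in RULES:                # iterate from highest to lowest priority
        if rule in directives:
            return directives[rule]   # first match wins; lower rules are overridden
    return "refuse"                   # default when no rule applies

# The translation example from the article: the user-level directive outranks
# the rule about hazardous information, so the model translates the text.
print(resolve({
    "follow user requests": "translate the text",
    "withhold hazardous information": "refuse",
}))  # -> "translate the text"
```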
Behind the news: OpenAI’s use of the Model Spec and RLHF contrasts with Anthropic’s Constitutional AI. To steer the behavior of Anthropic models, that company’s engineers define a constitution, or list of principles, such as “Please choose the response that is the most helpful, honest, and harmless” and “Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior.” Rather than human feedback, Anthropic relies on AI feedback to interpret behavioral principles and guide reinforcement learning.
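For contrast, here is a rough sketch of how AI feedback, rather than human ratings, can produce the preference labels that guide reinforcement learning. The `ai_preference` function, the `judge` callable, and the prompt format are illustrative assumptions, not Anthropic’s actual pipeline; only the two quoted principles come from the constitution cited above.

```python
from typing import Callable

# Two principles quoted in the article; a real constitution contains many more.
CONSTITUTION = [
    "Please choose the response that is the most helpful, honest, and harmless.",
    "Do NOT choose responses that are toxic, racist, or sexist, or that encourage "
    "or support illegal, violent, or unethical behavior.",
]

def ai_preference(judge: Callable[[str], str], prompt: str,
                  response_a: str, response_b: str) -> str:
    """Return "A" or "B": the response a judge model prefers under the constitution.

    `judge` stands in for any LLM call that maps a prompt string to a completion.
    Labels collected this way can replace human ratings as the preference data
    that drives reinforcement learning.
    """
    principles = "\n".join(f"- {p}" for p in CONSTITUTION)
    verdict = judge(
        f"Principles:\n{principles}\n\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better follows the principles? Answer A or B."
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"

# Example with a trivial stand-in judge that always prefers the first response:
label = ai_preference(lambda text: "A", "Explain RLHF.", "response one", "response two")
print(label)  # -> "A"
```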
Why it matters: AI developers require a degree of confidence that the models they use will behave as they expect and in their users’ best interests. OpenAI’s decision to subject its guidelines to public scrutiny could help to instill such confidence, and its solicitation of public comments might make its models more responsive to social and market forces.
We’re thinking: OpenAI’s openness with respect to its Model Spec is a welcome step toward improving its models’ safety and performance.