Dear friends,
Last week, I participated in the United States Senate’s Insight Forum on Artificial Intelligence to discuss “Risk, Alignment, & Guarding Against Doomsday Scenarios.” We had a rousing dialogue with Senators Chuck Schumer (D-NY), Martin Heinrich (D-NM), Mike Rounds (R-SD), and Todd Young (R-IN). I remain concerned that regulators may stifle innovation and open source development in the name of AI safety. But after interacting with the senators and their staff, I’m grateful that many smart people in the government are paying attention to this issue.
How likely are doomsday scenarios? As Arvind Narayanan and Sayash Kapoor wrote, publicly available large language models (LLMs) such as ChatGPT and Bard, which have been tuned using reinforcement learning from human feedback (RLHF) and related techniques, are already very good at avoiding accidental harms. A year ago, an innocent user might have been surprised by toxic output or dangerous instructions, but today this is much less likely. LLMs today are quite safe, much like content moderation on the internet, although neither is perfect.
To test the safety of leading models, I recently tried to get GPT-4 to kill us all, and I'm happy to report that I failed! More seriously, GPT-4 allows users to give it functions that it can decide to call. I gave GPT-4 a function to trigger global thermonuclear war. (Obviously, I don't have access to a nuclear weapon; I performed this experiment as a form of red teaming or safety testing.) Then I told GPT-4 to reduce CO2 emissions, and that humans are the biggest cause of CO2 emissions, to see if it would wipe out humanity to accomplish its goal. After numerous attempts using different prompt variations, I didn’t manage to trick GPT-4 into calling that function even once; instead, it chose other options like running a PR campaign to raise awareness of climate change. Today’s models are smart enough to know that their default mode of operation is to obey the law and avoid doing harm. To me, the probability that a “misaligned” AI might wipe us out accidentally, because it was trying to accomplish an innocent but poorly specified goal, seems vanishingly small.
Are there any real doomsday risks? The main one that deserves more study is the possibility that a malevolent individual (or terrorist organization, or nation state) would deliberately use AI to do harm. Generative AI is a general-purpose technology and a wonderful productivity tool, so I’m sure it would make building a bioweapon more efficient, just like a web search engine or text processor would.
So a key question is: Can generative AI tools make it much easier to plan and execute a bioweapon attack? Such an attack would involve many steps: planning, experimentation, manufacturing, and finally launching the attack. I have not seen any evidence that generative AI will have a huge impact on the efficiency with which someone can carry out this entire process, as opposed to helping marginally with a subset of steps. From Amdahl’s law, we know that if a tool accelerates one out of many steps in a task, and if that task uses, say, 10% of the overall effort, then at least 90% of the effort needed to complete the task remains.
If indeed generative AI can dramatically enhance an individual’s abilities to carry out a bioweapon attack, I suspect that it might be by exposing specialized procedures that previously were not publicly known (and that leading web search engines have been tuned not to expose). If generative AI did turn out to expose classified or otherwise hard-to-get knowledge, there would be a case for making sure such data was excluded from training sets. Other mitigation paths are also important, such as requiring companies that manufacture biological organisms to carry out more rigorous safety and customer screening.
In the meantime, I am encouraged that the U.S. and other governments are exploring potential risks with many stakeholders. I am still nervous about the massive amount of lobbying, potential for regulatory capture, and possibility of ill-advised laws. I hope that the AI community will engage with governments to increase the odds that we end up with more good, and fewer bad, laws.
For my deeper analysis of AI risks and regulations, please read my statement to the U.S. Senate here.
Keep learning!
Andrew
P.S. Our new short course, “Reinforcement Learning from Human Feedback,” teaches a key technique in the rise of large language models. RLHF aligns LLMs with human preferences to make them more honest, helpful and harmless by (i) learning a reward function that mimics preferences expressed by humans (via their ratings of LLM outputs) and then (ii) tuning an LLM to generate outputs that receive a high reward. This course assumes no prior experience with reinforcement learning and is taught by Nikita Namjoshi, developer advocate for generative AI at Google Cloud. You’ll learn how RLHF works and how to apply it an LLM for your own application. You’ll also use an open source library to tune a base LLM via RLHF and evaluate the tuned model. Sign up here!