Dear friends,
Is prompt engineering — the art of writing text prompts to get an AI system to generate the output you want — going to be a dominant user interface for AI? With the rise of text generators such as GPT-3 and Jurassic and image generators such as DALL·E, Midjourney, and Stable Diffusion, which take text input and produce output to match, there has been growing interest in how to craft prompts to get the output you want. For example, when generating an image of a panda, how does adding an adjective such as “beautiful” or a phrase like “trending on artstation” influence the output? The response to a particular prompt can be hard to predict and varies from system to system.
So is prompt engineering an important direction for AI, or is it a hack?
Here’s how we got to this point:
- The availability of large amounts of text or text-image data enabled researchers to train text-to-text or text-to-image models.
- Because of this, our models expect text as input.
- As a result, many people have started experimenting with more sophisticated prompts.
Some people have predicted that prompt engineering jobs will be plentiful in the future. I do believe that text prompts will be an important way to tell machines what we want. After all, they’re a dominant way to tell other humans what we want. But I think that prompt engineering will be only a small piece of the puzzle, and breathless predictions about the rise of professional prompt engineers are missing the full picture.
Just as a TV has switches that allow you to precisely control the brightness and contrast of the image — which is more convenient than trying to use language to describe the image quality you want — I look forward to a user interface (UI) that enables us to tell computers what we want in a more intuitive and controllable way.
Take speech synthesis (also called text-to-speech). Researchers have developed systems that allow users to specify which part of a sentence should be spoken with what emotion. Virtual knobs allow you to dial up or down the degree of different emotions. This provides fine control over the output that would be difficult to express in language. By examining an output and then fine-tuning the controls, you can iteratively improve the output until you get the effect you want.
So, while I expect text prompts to remain an important part of how we communicate with image generators, I look forward to more efficient and understandable ways for us to control their output. For example, could a set of virtual knobs enable you to generate an image that is 30 percent in the style of Studio Ghibli and 70 percent in the style of Disney? Drawing sketches is another good way to communicate, and I’m excited by img-to-img UIs that help turn a sketch into a drawing.
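To make the knob idea concrete, here is a minimal sketch of one way such a control could work under the hood: each knob sets a weight, and the weights blend a set of style vectors. This is an illustration under stated assumptions, not how any particular image generator works. The function name `blend_styles`, the style names, and the three-number "style vectors" are all hypothetical, and a real system would need learned representations that actually behave well when averaged.

```python
def blend_styles(style_vectors, knobs):
    """Return the weighted average of style vectors.

    style_vectors maps a style name to a list of floats (a made-up
    stand-in for a learned style representation).
    knobs maps a style name to a non-negative setting, e.g. 30 and 70.
    Settings are normalized so they sum to 1 before blending.
    """
    if any(setting < 0 for setting in knobs.values()):
        raise ValueError("knob settings must be non-negative")
    total = sum(knobs.values())
    if total <= 0:
        raise ValueError("at least one knob must be set above zero")

    dim = len(next(iter(style_vectors.values())))
    blended = [0.0] * dim
    for name, setting in knobs.items():
        weight = setting / total
        for i, value in enumerate(style_vectors[name]):
            blended[i] += weight * value
    return blended


# Toy vectors, invented purely for illustration.
STYLES = {
    "ghibli": [1.0, 0.0, 0.5],
    "disney": [0.0, 1.0, 0.5],
}

# Two knobs: 30 percent one style, 70 percent the other.
mix = blend_styles(STYLES, {"ghibli": 30, "disney": 70})
print(mix)
```

The appeal of this kind of control is that turning a knob from 30 to 40 has a predictable, incremental effect, whereas rewording a prompt often does not.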
Likewise, controlling large language models remains an important problem. If you want to generate empathetic, concise, or some other type of prose, is there an easier way than searching (sometimes haphazardly) among different prompts until you chance upon a good one?
When I’m just playing with these models, I find prompt engineering a creative and fun activity; but when I’m trying to get to a specific result, I find it frustratingly opaque. Text prompts are good at specifying a loose concept such as “a picture of a panda eating bamboo,” but new UIs will make it easier to get the results we want. And this will help open up generative algorithms to even more applications; say, text editors that can adjust a piece of writing to a specific style, or graphics editors that can make images that look a certain way.
Lots of exciting research ahead! I look forward to UIs that complement writing text prompts.
Keep learning!
Andrew