Large language models increasingly reply to prompts with a believably human response. Can they also mimic human behavior?
What's new: Joon Sung Park and colleagues at Stanford and Google extended GPT-3.5 to build generative agents that went about their business in a small town and interacted with one another in human-like ways. The code is newly available as open source.
Key insight: With the right prompts, a text database, and a server to keep track of things, a large language model (LLM) can simulate human activity.
- Just as people observe the world, an LLM can describe its experiences. Observations can be stored and retrieved to function like memories.
- Just as people consolidate memories, an LLM can summarize them as reflections for later use.
- To behave in a coherent way, an LLM can generate a plan and revise it as events unfold. (A minimal agent loop that ties these pieces together is sketched below.)
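To make that division of labor concrete, here’s a minimal agent loop in Python. It is our own sketch rather than the authors’ code: the `llm` callable stands in for a GPT-3.5 completion call, and the class and field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Memory:
    text: str          # natural-language description of an observation or insight
    timestamp: float   # simulation time when it was recorded (used for recency)
    importance: float  # 1 ("mundane") to 10 ("poignant"), as rated by the LLM

@dataclass
class Agent:
    identity: str                      # short biography, traits, relationships
    llm: Callable[[str], str]          # stand-in for a GPT-3.5 completion call
    memories: List[Memory] = field(default_factory=list)
    plan: str = ""

    def step(self, observation: str, now: float) -> str:
        """One simulation step: store the observation, consult memory, act."""
        # In the full system the LLM rates importance; 1.0 is a placeholder here.
        self.memories.append(Memory(observation, now, importance=1.0))
        # The full system scores memories by recency, relevance, and importance
        # (sketched later); here we simply take the most recent ones.
        context = [m.text for m in self.memories[-10:]]
        prompt = (
            f"{self.identity}\nCurrent plan: {self.plan}\n"
            f"Relevant memories: {context}\nObservation: {observation}\n"
            "Describe what the agent does next."
        )
        action = self.llm(prompt)
        self.memories.append(Memory(action, now, importance=1.0))
        return action   # the server uses this text to update locations and statuses
```

Retrieval, reflection, planning, and reaction are all layered on top of this same memory store.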
How it works: The authors designed 25 agents (represented by 2D sprites) who lived in a simulated town (a 2D background depicting the layout and contents of its buildings) and let them run for two days. Each agent used GPT-3.5; a database of actions, memories, reflections, and plans generated by GPT-3.5; and a server that tracked agent and object behaviors, locations (for instance, in the kitchen of Isabella’s apartment), and statuses (whether a stove was on or off), and relayed this information to agents when they came nearby.
- At each time step, the server gave each agent an observation that comprised what it last said it was doing, the objects and people in view, and their statuses.
- Given an observation, an agent retrieved a memory based on recency, relevance, and importance (one way to compute such a score is sketched after this list). It measured relevance as the cosine similarity between embeddings of the observation and the memory. It rated importance by asking GPT-3.5 to score memories on a scale from “mundane” (1) to “poignant” (10). Having retrieved the memory, the agent generated text that described its action, whereupon the server updated the appropriate locations and statuses.
- The reflection function consolidated the latest 100 memories a couple of times a day (a rough sketch follows this list). Given 100 recent memories (say, what agent Klaus Mueller looked up at the library), the agent proposed three high-level questions that its memories could answer (for instance, “What topic is Klaus Mueller passionate about?”). For each question, the agent retrieved relevant memories and generated five high-level insights (such as, “Klaus Mueller is dedicated to his research on gentrification”). Then it stored these insights in the memory.
- Given general information about its identity and a summary of memories from the previous day, the agent generated a plan for the current day. Then it decomposed the plan into chunks an hour long, and finally into chunks a few minutes long (“4:00 p.m.: grab a light snack, such as a piece of fruit, a granola bar, or some nuts. 4:05 p.m.: …”). The detailed plans went into the memory.
- At each time step, the agent asked itself whether and how it should react to its observation, given general information about its identity, its plan, and a summary of relevant memories (planning and reaction are sketched after this list). If it decided to react, the agent updated its plan and output a statement that described its reaction. Otherwise, it generated a statement saying it would continue the existing plan. For example, a father might observe another agent and, based on a memory, identify it as his son, who is currently working on a project. Then the father might decide to ask the son how the project is going.
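The retrieval score mentioned above combines recency, relevance, and importance. Here is one plausible way to compute it over `Memory` objects like those in the earlier sketch; the exponential recency decay, the equal weighting of the three terms, and the `embed` callable (an embedding model) are our assumptions, not confirmed details of the paper.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve(memories, query: str, now: float, embed, top_k: int = 5, decay: float = 0.995):
    """Rank memories by a mix of recency, relevance, and importance.

    `embed` maps text to a vector; the decay rate and equal weighting
    are illustrative choices, not the paper's exact values.
    """
    q = embed(query)
    scored = []
    for m in memories:
        recency = decay ** (now - m.timestamp)      # newer memories score higher
        relevance = cosine(q, embed(m.text))        # similarity to the observation
        importance = m.importance / 10.0            # LLM's 1-10 rating, rescaled to [0, 1]
        scored.append((recency + relevance + importance, m))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]
```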
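Reflection amounts to two prompts plus another retrieval pass. A rough sketch, reusing the `Memory`, `retrieve`, `llm`, and `embed` pieces above; the prompt wording is hypothetical.

```python
def reflect(agent, now: float, llm, embed):
    """Condense recent memories into higher-level insights (run a few times a day)."""
    recent = [m.text for m in agent.memories[-100:]]

    # 1. Ask the LLM what high-level questions the recent memories could answer.
    questions = llm(
        "Observations:\n" + "\n".join(recent) +
        "\nList 3 high-level questions these observations could answer, one per line."
    ).splitlines()[:3]

    # 2. For each question, gather relevant memories and distill insights.
    for question in questions:
        relevant = retrieve(agent.memories, question, now, embed)
        insights = llm(
            "Statements:\n" + "\n".join(m.text for m in relevant) +
            f"\nQuestion: {question}\nWrite 5 high-level insights, one per line."
        ).splitlines()[:5]

        # 3. Store insights as memories so later retrieval can surface them.
        for text in insights:
            # The full system would also have the LLM rate each insight's
            # importance; 5.0 is just a placeholder.
            agent.memories.append(Memory(text, timestamp=now, importance=5.0))
```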
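Planning and reaction follow the same prompt-and-store pattern. The sketch below shows the general shape; the prompt text, chunk sizes, and yes/no protocol are our inventions.

```python
def plan_day(agent, yesterday_summary: str, now: float, llm):
    """Draft a day plan, then decompose it into progressively finer chunks."""
    outline = llm(f"{agent.identity}\nYesterday: {yesterday_summary}\n"
                  "Outline today's plan in broad strokes.")
    hourly = llm(f"Break this plan into hour-long chunks:\n{outline}")
    detailed = llm(f"Break these chunks into steps a few minutes long, with times:\n{hourly}")
    agent.plan = detailed
    agent.memories.append(Memory(detailed, timestamp=now, importance=5.0))  # placeholder importance

def maybe_react(agent, observation: str, relevant, llm) -> str:
    """Decide whether an observation warrants deviating from the current plan."""
    answer = llm(
        f"{agent.identity}\nPlan: {agent.plan}\n"
        f"Context: {[m.text for m in relevant]}\nObservation: {observation}\n"
        "Should the agent react? Answer 'yes: <what it does>' or 'no'."
    )
    if answer.lower().startswith("yes"):
        agent.plan = llm(f"Revise this plan to reflect the reaction:\n{agent.plan}\n{answer}")
        return answer.partition(":")[2].strip()   # the reaction to carry out
    return "continue with the current plan"
```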
Results: The complete agents exhibited three types of emergent behavior: They spread information initially known only to themselves, formed relationships, and cooperated (specifically, to attend a party). The authors gave 100 human evaluators access to all agent actions and memories. The evaluators asked the agents simple questions about their identities, behaviors, and thoughts. Then they ranked the agents’ responses for believability. They also ranked versions of each agent that were missing one or more functions, as well as humans who stood in for each agent (“to identify whether the architecture passes a basic level of behavioral competency,” the authors write). These rankings were converted into a TrueSkill score (a rating system akin to the Elo system used in chess) for each agent type. The complete agent architecture scored highest, while the versions that lacked particular functions scored lower. Surprisingly, the human stand-ins also underperformed the complete agents.
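For readers curious how rankings become scores, the open-source trueskill Python package can fold each evaluator’s ranking into a per-condition rating. The condition names, the example ranking, and the choice of package here are illustrative, not the authors’ exact setup.

```python
import trueskill  # pip install trueskill

# One rating per condition being compared (names are illustrative).
conditions = ["full architecture", "no reflection", "no planning", "human stand-in"]
ratings = {name: trueskill.Rating() for name in conditions}

def update(ranking):
    """Fold one evaluator's ranking (most believable first) into the ratings."""
    groups = [(ratings[name],) for name in ranking]
    new_groups = trueskill.rate(groups, ranks=list(range(len(ranking))))  # 0 = best
    for name, (new_rating,) in zip(ranking, new_groups):
        ratings[name] = new_rating

# Made-up example ranking from a single evaluator:
update(["full architecture", "human stand-in", "no reflection", "no planning"])
print({name: round(r.mu, 1) for name, r in ratings.items()})
```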
Yes, but: Some complete agents “remembered” details they had not experienced. Others behaved erratically, failing to recognize, for example, that a one-person bathroom was occupied or that a business was closed. Still others used oddly formal language in intimate conversations; one agent ended exchanges with her husband with, “It was good talking to you as always.”
Why it matters: Large language models produce surprisingly human-like output. Combined with a database and a server, they can begin to simulate human interactions. While the TrueSkill results don’t fully capture how human-like these agents’ behavior was, they do suggest a role for such agents in fields like game development, social media, robotics, and epidemiology.
We're thinking: The evaluators found the human stand-ins less believable than the full-fledged agents. Did the agents exceed human-level performance in the task of acting human, or does this result reflect a limitation of the evaluation method?