Large language models that remember more hallucinate less.
What’s new: Johnny Li and colleagues at Lamini introduced Mixture of Memory Experts (MoME), a method that enables large language models (LLMs) to memorize many facts with relatively modest computational requirements. (Disclosure: Andrew Ng invested in Lamini.)
Key insight: The key to getting factual answers from an LLM is to keep training it until it chooses the correct answer every time. In technical terms, train past the point where tokens relevant to the answer have similar probabilities, and continue until a single token has nearly 100 percent probability. But this amount of training takes a lot of computation and, because the model may overfit the training set, it may also degrade performance on the test set. Fine-tuning a small number of weights is one solution: fine-tuning a LoRA adapter to memorize facts reduces the computational burden. But a single LoRA adapter isn’t enough to store all of the knowledge in a large dataset. Training multiple adapters that are selected by cross-attention enables the LLM to memorize a variety of facts.
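To make the "train until a single token wins" idea concrete, here's a minimal PyTorch sketch (not the authors' code; the tiny linear head stands in for trainable LoRA weights on a frozen representation). It keeps minimizing cross-entropy on the answer token until that token's softmax probability approaches 1.

```python
import torch
import torch.nn.functional as F

# Illustrative only: drive the probability of the correct answer token toward 1
# by continuing to minimize cross-entropy well past a "good enough" fit.
torch.manual_seed(0)
vocab_size, hidden = 100, 16
hidden_state = torch.randn(1, hidden)               # frozen representation of the question
head = torch.nn.Linear(hidden, vocab_size)          # trainable head (stand-in for LoRA weights)
answer_token = torch.tensor([42])                   # index of the correct answer token

opt = torch.optim.Adam(head.parameters(), lr=1e-2)
for step in range(2000):
    loss = F.cross_entropy(head(hidden_state), answer_token)  # -log p(answer)
    opt.zero_grad()
    loss.backward()
    opt.step()

p_answer = F.softmax(head(hidden_state), dim=-1)[0, answer_token].item()
print(f"p(answer token) = {p_answer:.4f}")  # approaches 1.0 as the loss approaches 0
```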
How it works: The authors extended a pretrained Llama-3-8B with a large number (on the order of 1 million) of LoRA adapters and a cross-attention layer. They froze Llama-3-8B’s weights and trained the LoRA adapters to predict the next token on a custom dataset of more than 1 million question-answer pairs.
- For any given question, the model learned to select 32 LoRA adapters, each of which was associated with an embedding. The model selected adapters by performing cross-attention between an embedding of the input query and all adapter embeddings.
- The authors trained the LoRA adapters for 100 epochs, until the loss function indicated that they had memorized all the answers.
- At inference, given a query, the model used cross-attention to select a subset of LoRA adapters and responded accordingly (see the sketch after this list).
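The selection step might look roughly like the following simplified sketch. It is our interpretation, not the paper's specification: the dimensions, the one-key-per-adapter setup, and the top-32 softmax gating are illustrative assumptions, and the adapter count is scaled down from the paper's roughly 1 million. It scores every adapter's learned embedding against the query embedding via cross-attention, keeps the 32 best matches, and adds only those adapters' low-rank updates to the frozen base weight.

```python
import torch
import torch.nn.functional as F

# Simplified sketch of selecting LoRA adapters via cross-attention.
# Names, dimensions, and top-k gating are illustrative assumptions.
num_adapters, d_model, rank, k = 10_000, 64, 8, 32   # the paper uses ~1 million adapters

adapter_keys = torch.randn(num_adapters, d_model)         # one learned embedding per adapter
lora_A = torch.randn(num_adapters, rank, d_model) * 0.01  # low-rank factors per adapter
lora_B = torch.randn(num_adapters, d_model, rank) * 0.01
W_frozen = torch.randn(d_model, d_model)                  # frozen base-model weight

def mome_forward(query_emb: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Apply the frozen weight plus the top-k cross-attention-selected LoRA updates."""
    # Cross-attention scores between the query embedding and all adapter keys.
    scores = query_emb @ adapter_keys.T / d_model ** 0.5   # shape: (num_adapters,)
    top_scores, top_idx = scores.topk(k)
    weights = F.softmax(top_scores, dim=-1)                # attention weights over the chosen adapters

    out = x @ W_frozen.T                                   # frozen base path
    for w, i in zip(weights, top_idx):
        out = out + w * (x @ lora_A[i].T @ lora_B[i].T)    # add each selected adapter's low-rank update
    return out

query_emb = torch.randn(d_model)   # embedding of the input question
x = torch.randn(1, d_model)        # hidden state entering this layer
print(mome_forward(query_emb, x).shape)  # torch.Size([1, 64])
```

Because only 32 of the adapters are active for any given query, most of the memorized knowledge sits idle at inference, which is what keeps the per-query compute modest despite the huge adapter bank.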
Results: The authors tested their LoRA-enhanced model’s ability to answer questions about a database via SQL queries. The model, which was outfitted for retrieval-augmented generation (RAG), achieved 94.7 percent accuracy, while an unnamed model with RAG achieved 50 percent accuracy.
Yes, but: It stands to reason that the authors’ approach saves computation, but it’s unclear how much. The authors didn’t report the cost of fine-tuning Llama-3-8B in the usual way on their training dataset for the same number of epochs.
Why it matters: The authors argue that eliminating hallucinations is possible with typical training; it’s just computationally very expensive (not to mention the risk of overfitting). An architecture designed to store and retrieve facts, via LoRA adapters in this case, makes the process more feasible.
We’re thinking: While some researchers want large language models to memorize facts, others want them to avoid memorizing their training data. These aims address very different problems. Preventing LLMs from memorizing training data would make them less likely to regurgitate it verbatim and thus violate copyrights. On the other hand, this work trains the model to memorize facts so it can deliver consistent, truthful responses that might be stated in a variety of ways.