Making LLMs Explainable
Google’s Gemma Scope probes how large language models think

Researchers have probed the inner workings of individual layers of large language models. A new tool applies this approach to all layers.

What’s new: Tom Lieberum and colleagues at Google released Gemma Scope, a system designed to illuminate how each layer in Gemma 2-family large language models responds to a given input token. Gemma Scope is available for the 9 billion-parameter and newly released 2 billion-parameter versions of Gemma 2. You can play with an interactive demo or download the weights.

Key insight: A sparse autoencoder (SAE) is a neural network that learns to reconstruct its input while activating only a small fraction of its internal units for any given example. The authors drew on earlier research into using SAEs to interpret neural networks.

  • To see what a neural network layer knows about a given input token, you can feed the token through the model and study the embedding that layer generates. The difficulty with this approach is that the value at each index of the embedding may represent a tangle of concepts that is entangled with many other values, far too many to track.
  • Instead, an SAE can transform the embedding into one in which each index corresponds to a distinct concept. The SAE learns to represent the embedding as a weighted sum over a set of learned vectors that is much larger than the number of values in the embedding. However, each weighted sum has only a small number of non-zero weights; in other words, each embedding is expressed by a small, or sparse, subset of the SAE vectors (see the sketch after this list). Since the number of learned SAE vectors is far greater than the number of values in the original embedding, any given vector is more likely to represent a distinct concept than any value in the original embedding.
  • The weights of this sum are interpretable: Each weight represents how strongly the corresponding concept is represented in the input. Given a token, the SAE’s first layer produces these weights.
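To make the weighted-sum picture concrete, here is a minimal sketch of an SAE in PyTorch. The dimensions, the plain ReLU activation, and all names (SparseAutoencoder, d_sae, and so on) are illustrative assumptions, not Gemma Scope’s actual architecture or released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: encodes a model embedding into many sparse 'concept'
    weights, then reconstructs the embedding as a weighted sum of learned vectors."""
    def __init__(self, d_model=2304, d_sae=16384):  # illustrative widths
        super().__init__()
        # First layer: produces one weight per learned concept vector.
        self.encoder = nn.Linear(d_model, d_sae)
        # The decoder's weight columns are the learned vectors; its output is
        # their weighted sum, which should approximate the input embedding.
        self.decoder = nn.Linear(d_sae, d_model)

    def encode(self, x):
        # ReLU zeroes negative pre-activations; training with a sparsity
        # penalty (next sketch) pushes most of the remaining weights to zero.
        return F.relu(self.encoder(x))

    def forward(self, x):
        weights = self.encode(x)                # (batch, d_sae) concept weights
        reconstruction = self.decoder(weights)  # weighted sum of learned vectors
        return reconstruction, weights

# Example: inspect which concept indices fire for one embedding.
sae = SparseAutoencoder()
embedding = torch.randn(1, 2304)               # stand-in for a Gemma 2 layer output
recon, weights = sae(embedding)
top_values, top_indices = weights.topk(5)      # the few strongly active indices
```

After training, the indices where these weights are strongly non-zero are the candidates the team labeled with concepts, as described next.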

How it works: The authors built over 400 SAEs, one for each layer of Gemma 2 2B and Gemma 2 9B. They fed Gemma 2 examples from its pretraining set and extracted the embeddings produced at each layer. Each SAE learned to reconstruct the embeddings from its assigned layer. An additional loss term minimized the number of non-zero outputs from the SAE’s first layer, which helped ensure that the SAE used only concepts relevant to the embedding. To interpret an embedding produced by the SAE’s first layer, the team labeled the embedding’s indices with their corresponding concepts using two main methods, manual and automatic, described in the list below.
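Before turning to those labeling methods, here is a minimal sketch of that training step, reusing the hypothetical SparseAutoencoder above. The placeholder data, batch size, learning rate, and L1 penalty (a simpler, differentiable stand-in for directly minimizing the count of non-zero outputs) are assumptions, not the published training recipe.

```python
import torch

# `layer_embeddings` stands in for activations cached from one Gemma 2 layer
# over pretraining examples; here it is random placeholder data.
sae = SparseAutoencoder()
layer_embeddings = torch.randn(8192, 2304)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
sparsity_coeff = 1e-3                         # illustrative value, not a published setting

for batch in layer_embeddings.split(256):
    reconstruction, weights = sae(batch)
    # Reconstruction term: rebuild the layer's embedding from the sparse code.
    reconstruction_loss = (reconstruction - batch).pow(2).mean()
    # Sparsity term: an L1 penalty standing in for the loss that minimizes how
    # many concept weights are non-zero.
    sparsity_loss = weights.abs().mean()
    loss = reconstruction_loss + sparsity_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```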

  • Manual labeling: (1) Insert the SAE in the appropriate location in Gemma 2. (2) Prompt Gemma 2. (3) Select an index in the embedding from the SAE’s first layer. (4) Note which token(s) cause the value at that index to be high. (5) Label the index manually based on commonalities between the noted tokens.
  • Automatic labeling: This was similar to manual labeling, but GPT-4o mini labeled the indices based on commonalities among the noted tokens.
  • In addition to testing how Gemma 2 responds to particular input tokens, Gemma Scope can be used to steer the model; that is, to see how the model responds when it’s forced to generate text related (or unrelated) to a particular concept: (1) Search the index labels to determine which index corresponds to the concept in question. (2) Insert the corresponding SAE into Gemma 2 at the appropriate layer. (3) Prompt the modified Gemma 2 to generate text while adjusting the value at that index in the output of the SAE’s first layer. Gemma 2’s output should reflect the changed value (see the sketch after this list).
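Here is a minimal sketch of the labeling and steering steps, again reusing the hypothetical SparseAutoencoder above. The helper names, the clamping strength, and the idea of applying the steered reconstruction via a forward hook are illustrative assumptions, not Gemma Scope’s released tooling.

```python
import torch

def top_activating_tokens(sae, tokens, embeddings, concept_index, k=10):
    """Labeling aid: find the tokens whose embeddings most strongly activate
    one index of the SAE's first layer."""
    with torch.no_grad():
        activations = sae.encode(embeddings)[:, concept_index]
    top = activations.topk(k).indices
    return [tokens[int(i)] for i in top]

def steer_embedding(sae, layer_embedding, concept_index, strength=8.0):
    """Steering: reconstruct a layer's embedding with one concept weight pinned
    to a chosen value (use 0.0 to suppress the concept instead)."""
    with torch.no_grad():
        weights = sae.encode(layer_embedding)
        weights[:, concept_index] = strength
        return sae.decoder(weights)

# In practice, steer_embedding would run inside the model's forward pass at the
# chosen layer (for example, via a forward hook), so that later layers see the
# steered embedding and the generated text shifts toward or away from the concept.
```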

Behind the news: Earlier research into using SAEs to interpret neural networks was limited to interpreting a single layer or a small network. Earlier this year, Anthropic used an SAE to interpret Claude 3 Sonnet’s middle layer, building on an earlier report in which they interpreted a single-layer transformer.

Why it matters: Many questions about how LLMs work have yet to be answered: How does fine-tuning change the way a model represents an input? What happens inside a model during chain-of-thought prompting versus unstructured prompting? Training an SAE for each layer is a step toward developing ways to answer these questions.

We’re thinking: In 2017, researchers visualized the layers of a convolutional neural network to show that the deeper the layer, the more complex the concepts it learned. We’re excited by the prospect that SAEs can deliver similar insights with respect to transformers.
