Two years after it pointed language models in a new direction, Bert still hovers near the top of several natural language processing leaderboards. A new study considers whether Bert simply excels at tracking word order or learns something closer to common sense.
What’s new: Leyang Cui and colleagues at Westlake University, Fudan University, and Microsoft Research Asia probed whether Bert captures common-sense knowledge in addition to linguistic structures like syntax, grammar, and semantics.
Key insight: The multiheaded self-attention mechanism in transformer-based models like Bert assigns weights that represent the importance of each word in the input text to every other word. This process effectively creates a link between every pair of words. Given common-sense questions and answers, the researchers probed the relative strength of such links among the questions, correct answers, and wrong answers.
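For readers who want to poke at such links themselves, the sketch below (our illustration, not the authors' code) reads per-head attention weights out of a pretrained Bert via the Hugging Face transformers library. The model name, example question-and-answer pair, and choice of layer and head are assumptions made for illustration.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

# A CommonsenseQA-style question paired with one candidate answer.
question = "Where do people usually keep their money while shopping?"
answer = "pocket"
inputs = tokenizer(question, answer, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one tensor per layer, each (batch, heads, seq_len, seq_len).
layer, head = 9, 3                          # arbitrary choices, for illustration
attn = outputs.attentions[layer][0, head]   # (seq_len, seq_len)

# How strongly the answer's first token attends to each word of the question.
answer_idx = tokens.index("[SEP]") + 1      # first token after the question's [SEP]
for token, weight in zip(tokens, attn[answer_idx]):
    print(f"{token:>12s}  {weight.item():.3f}")
```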
How it works: The authors devised two tasks, one designed to show whether Bert encodes common sense, the other to show whether Bert uses it to make predictions. The tasks are based on two quantities computed for each of the dozen attention heads in each layer: (a) attention weights between words and (b) gradient-based attribution weights that show how much each attention weight contributed to a given prediction.
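A rough sketch of the second quantity, again ours rather than the paper's implementation: it approximates an attention weight's attribution as the weight multiplied by the gradient of an answer score with respect to that weight. The paper's attribution method accumulates such gradients more carefully, and its score comes from a fine-tuned output layer; the untrained classification head below is a stand-in.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, output_attentions=True
)
model.eval()

inputs = tokenizer("Where do people usually keep their money while shopping?",
                   "pocket", return_tensors="pt")

outputs = model(**inputs)
score = outputs.logits[0, 0]     # scalar plausibility score for this question-answer pair

# Keep gradients on every layer's attention map, then backpropagate the score.
for attn in outputs.attentions:  # one (1, heads, seq_len, seq_len) tensor per layer
    attn.retain_grad()
score.backward()

# Attribution weight for each word-pair link: the attention weight times its gradient.
layer, head = 9, 3               # arbitrary choices, for illustration
attention = outputs.attentions[layer][0, head]
attribution = attention * outputs.attentions[layer].grad[0, head]
print(attribution.shape)         # (seq_len, seq_len)
```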
- The authors used the CommonsenseQA dataset of multiple-choice questions about everyday phenomena, in which each question is built around a key concept drawn from ConceptNet. They concatenated each question with each of its candidate answers to produce five question-and-answer pairs, only one of which is correct.
- Considering only correct pairs, the authors measured how often the attention weight between the answer and the question's key concept was greater than the attention weights between the answer and every other word in the question (see the sketch after this list). If this percentage was greater than chance, they took it as a sign that Bert had encoded common sense.
- Considering all question-and-answer pairs, the authors measured how often the links (that is, attention and attribution weights) between the key concept and the correct answer were stronger than those between the key concept and the incorrect answers. If this percentage was greater than chance, they took it as a sign that Bert used the encoded common sense to predict answers.
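Here is a hedged sketch of the first check, applied to a single head. The function names and inputs are our illustration of the procedure, not the authors' code; averaging the hits over a probing set gives a per-head score to compare against chance.

```python
import torch

def answer_links_to_concept(attn, answer_idx, concept_idx, question_idxs):
    """attn: (seq_len, seq_len) attention map for one head of one layer.

    Returns True if the attention the answer pays to the key concept exceeds
    the attention it pays to every other word in the question."""
    concept_weight = attn[answer_idx, concept_idx]
    others = [i for i in question_idxs if i != concept_idx]
    return bool(concept_weight > attn[answer_idx, others].max())

def head_accuracy(examples):
    """examples: iterable of (attn, answer_idx, concept_idx, question_idxs)
    tuples for correct question-and-answer pairs. The resulting rate is compared
    with the chance of picking the key concept at random among the question words."""
    hits = [answer_links_to_concept(*example) for example in examples]
    return sum(hits) / len(hits)
```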
Results: Bert scored significantly higher than random in both tests. In the test for encoding common-sense knowledge, the highest-scoring attention head achieved 46.82 percent versus a random 10.53 percent. That score rose to 49.22 percent when the model was fine-tuned on a different portion of CommonsenseQA. In the test for using common-sense knowledge, the best attention head with a fine-tuned output layer scored 36.88 percent versus a random 20 percent.
Why it matters: Language models can string words together in ways that conform to conventional grammar and usage, but what do they really know beyond correlations among words? This work suggests that Bert, at least, also gains knowledge that might be considered common sense.
We’re thinking: Researchers have debated the notion that AI might exhibit common sense at least since the Cyc project began in 1984. Studying common sense as a scientific, rather than philosophical, issue requires a clear definition of the phenomenon. Despite efforts from Aristotle (~300 B.C.) to CommonsenseQA, we still don’t have one. Apparently, the definition of common sense defies common sense.