Single Headed Attention RNN (SHA-RNN)
Language Modeling on One GPU: Single-headed attention competes with transformers.
The latest large, pretrained language models rely on trendy layers based on transformer networks. New research shows that these newfangled layers may not be necessary.
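To make the contrast concrete, here is a minimal sketch of the core idea: a recurrent language-model block that adds just one attention head on top of an LSTM, rather than the stacked multi-head attention layers of a transformer. This is an illustrative toy in PyTorch, not Merity's actual SHA-RNN code; the names SingleHeadAttention and ShaRnnBlock, the dimensions, and the layer layout are assumptions made for the example.

```python
# Illustrative sketch only -- not the SHA-RNN reference implementation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SingleHeadAttention(nn.Module):
    """A single attention head over the whole sequence (no multi-head split)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = torch.bmm(q, k.transpose(1, 2)) * self.scale
        # Causal mask: each position attends only to itself and earlier tokens.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        scores = scores.masked_fill(mask, float("-inf"))
        return torch.bmm(F.softmax(scores, dim=-1), v)


class ShaRnnBlock(nn.Module):
    """An LSTM layer followed by one attention head and a residual connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.attn = SingleHeadAttention(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(x)
        return self.norm(h + self.attn(h))


if __name__ == "__main__":
    block = ShaRnnBlock(dim=64)
    tokens = torch.randn(2, 16, 64)  # (batch, seq_len, embedding dim)
    print(block(tokens).shape)       # torch.Size([2, 16, 64])
```

The point of the sketch is the contrast in scale: where a transformer layer computes many attention heads in parallel at every layer, this block gets by with a single head layered on a recurrent backbone, which is what lets the model fit and train on one GPU.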