d/dx Times Labs
Language Modeling on One GPU: Single-headed attention competes with transformers.
The latest large pretrained language models rely on trendy multi-headed attention layers, the heart of the transformer architecture. New research suggests that these newfangled layers may not be necessary.