Single Headed Attention RNN (SHA-RNN)
Language Modeling on One GPU: Single-headed attention competes with transformers.
The latest large, pretrained language models rely on trendy layers based on transformer networks. New research shows that these newfangled layers may not be necessary.
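To make the contrast concrete, here is a minimal sketch of the core idea: a recurrent language-model block that adds just one attention head on top of an LSTM, rather than the stacked multi-head attention layers of a transformer. This is an illustrative toy in PyTorch, not Merity's actual SHA-RNN code; the names SingleHeadAttention and ShaRnnBlock, the dimensions, and the layer layout are assumptions made for the example.

```python
# Illustrative sketch only -- not the SHA-RNN reference implementation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SingleHeadAttention(nn.Module):
    """A single attention head over the whole sequence (no multi-head split)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = torch.bmm(q, k.transpose(1, 2)) * self.scale
        # Causal mask: each position attends only to itself and earlier tokens.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        scores = scores.masked_fill(mask, float("-inf"))
        return torch.bmm(F.softmax(scores, dim=-1), v)


class ShaRnnBlock(nn.Module):
    """An LSTM layer followed by one attention head and a residual connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.attn = SingleHeadAttention(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(x)
        return self.norm(h + self.attn(h))


if __name__ == "__main__":
    block = ShaRnnBlock(dim=64)
    tokens = torch.randn(2, 16, 64)  # (batch, seq_len, embedding dim)
    print(block(tokens).shape)       # torch.Size([2, 16, 64])
```

The point of the sketch is the contrast in scale: where a transformer layer computes many attention heads in parallel at every layer, this block gets by with a single head layered on a recurrent backbone, which is what lets the model fit and train on one GPU.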