d/dx Times Labs
Language Modeling on One GPU: Single-headed attention competes with transformers.
The latest large pretrained language models rely on trendy multi-headed attention layers, the heart of the transformer architecture. New research suggests that these newfangled layers may not be necessary.