Mixture of Softmaxes (MoS)
Upgrading Softmax: Mixtape is a faster way to avoid the softmax bottleneck.
Softmax commonly computes probabilities in a classifier’s output layer. But softmax isn’t always accurate in complex tasks — say, in a natural-language task, when the length of word vectors is much smaller than the number of words in the vocabulary.