During training, a neural network usually updates its weights according to an optimizer that’s tuned using hand-picked hyperparameters. New work eliminates the need for optimizer hyperparameters.
What’s new: Luke Metz, James Harrison, and colleagues at Google devised VeLO, a system designed to act as a fully tuned optimizer. It uses a neural network to compute the target network’s updates.
Key insight: Machine learning engineers typically find the best values of optimizer hyperparameters such as learning rate, learning rate schedule, and weight decay by trial and error. This can be cumbersome, since it requires training the target network repeatedly using different values. In the proposed method, a different neural network takes the target network’s gradients, weights, and current training step and outputs its weight updates — no hyperparameters needed.
How it works: At every time step in the target network’s training, an LSTM generated the weights of a vanilla neural network, which we’ll call the optimizer network. The optimizer network, in turn, updated the target network. The LSTM learned to generate the optimizer network’s weights via evolution — iteratively generating a large number of similar LSTMs with random differences, averaging them based on which ones worked best, generating new LSTMs similar to the average, and so on — rather than backpropagation.
- The authors randomly generated many (on the order of 100,000) target neural networks of various architectures — vanilla neural networks, convolutional neural networks, recurrent neural networks, transformers, and so on — to be trained on tasks that spanned image classification and text generation.
- Given an LSTM (initially with random weights), they copied and randomly modified its weights, generating an LSTM for each target network. Each LSTM generated the weights of a vanilla neural network based on statistics of the target network. These statistics included the mean and variance of its weights, exponential moving averages of the gradients over training, fraction of completed training steps, and training loss value.
- The authors trained each target network for a fixed number of steps using its optimizer network. The optimizer network took the target network’s gradients, weights, and current training step and updated each weight, one by one. Its goal was to minimize the loss function for the task at hand. Completed training yielded pairs of (LSTM, loss value).
- They generated a new LSTM by taking a weighted average (the smaller the loss, the heavier the weighting) of each weight across all LSTMs across all tasks. The authors took the new LSTM and repeated the process: They copied and randomly modified the LSTM, generated new optimizer networks, used them to train new target networks, updated the LSTM, and so on.
Results: The authors evaluated VeLO using a dataset scaled to require no more than one hour to train on a single GPU on any of 83 tasks. They applied the method to a new set of randomly generated neural network architectures. On all tasks, VeLO trained networks faster than Adam tuned to find the best learning rate — four times faster on half of the tasks. It also reached a lower loss than Adam on five out of six MLCommons tasks, which included image classification, speech recognition, text translation, and graph classification tasks.
Yes, but: The authors’ approach underperformed exactly where optimizers are costliest to hand-tune, such as with models larger than 500 million parameters and those that required more than 200,000 training steps. The authors hypothesized that VeLO fails to generalize to large models and long training runs because they didn’t train it on networks that large or over that many steps.
Why it matters: VeLO accelerates model development in two ways: It eliminates the need to test hyperparameter values and speeds up the optimization itself. Compared to other optimizers, it took advantage of a wider variety of statistics about the target network’s training from moment to moment. That enabled it to compute updates that moved models closer to a good solution to the task at hand.
We’re thinking: VeLO appears to have overfit to the size of the tasks the authors chose. Comparatively simple algorithms like Adam appear to be more robust to a wider variety of networks. We look forward to VeLO-like algorithms that perform well on architectures that are larger and require more training steps.
We’re not thinking: Now neural networks are taking optimizers’ jobs!