According to the lottery ticket hypothesis, the bigger the neural network, the more likely some of its weights are initialized to values that are well suited to learning to perform the task at hand. But just how big does it need to be? Researchers investigated the impact of initial parameter values on models of various sizes.
What’s new: Jacob M. Springer at Swarthmore College and Garrett T. Kenyon at Los Alamos National Laboratory used the Game of Life to explore how slight changes in a network’s initial weights affect its ability to learn. To learn consistently, they found, networks need more parameters than are theoretically necessary.
Key insight: Devised by mathematician John Horton Conway in 1970, the Game of Life starts with a pattern of black (dead) or white (living) squares on a grid. It changes the color of individual squares according to simple rules that reflect the ideas of reproduction and overpopulation as illustrated above in an animation by Emanuele Ascani. Because the outcome is deterministic, a network that learns its rules can predict its progress with 100 percent accuracy. This makes it an ideal environment for testing the lottery ticket hypothesis.
How it works: Each step in the game applies the rules to the current grid pattern to produce a new pattern. The authors limited the grid to eight by eight squares and built networks to predict how the pattern would evolve.
- The authors generated training data by setting an initial state (randomly assigning a value to each square based on a random proportion of squares expected to be 1) and running the game for n steps.
- They built minimal convolutional neural networks using the smallest number of parameters theoretically capable of predicting the grid’s state n steps into the future (up to 5).
- They also built oversized networks, scaling up the number of filters in each layer by a factor of m (up to 24).
- For a variety of combinations of n and m, they trained 64 networks on 1 million examples generated on the fly. In this way, they found the probability that each combination would master the task.
Results: The authors chose the models that learned to solve the game and tested their sensitivity to changes in their initial weights. When they flipped the sign of a single weight, about 20 percent of the models that had learned to predict the grid’s pattern one step into the future failed to learn a consistent solution. Only four to six flips were necessary to boost the failure rate above 50 percent. They also tested the oversized models’ probability of finding a solution. Only 4.7 percent of the minimal one-step models solved the problem, compared to 60 percent of networks that were three times bigger.
Why it matters: The authors’ results support the lottery ticket hypothesis. Future machine learning engineers may need to build ever larger networks — or find a way to rig the lottery.
We’re thinking: When it comes to accuracy, the old maxim holds: The bigger, the better.