When you’re training a neural network, it takes a lot of computation to optimize its weights using an iterative algorithm like stochastic gradient descent. Wouldn’t it be great to compute the best parameter values in one pass? A new method takes a substantial step in that direction.
What's new: Boris Knyazev and colleagues at Facebook developed Graph HyperNetwork (GHN-2), a graph neural network that generates weights that enable arbitrary neural network architectures to perform image recognition tasks. (A neural network that finds weights for another neural network is known as a hypernetwork.) GHN-2 improves on a similar hypernetwork, GHN-1, proposed earlier by a different team.
Key insights: GHN-1 learned from how well a given architecture performed the task using the weights it generated. GHN-2 improved on its predecessor by drawing on insights from training conventional neural networks:
- A greater number of training examples per batch can improve trained performance.
- Connections between layers that aren’t adjacent (akin to skip connections) can carry information across successive layers without losing it (see the sketch after this list).
- Normalization can moderate representations that grow too large or too small.
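To make the second and third insights concrete, here’s a minimal PyTorch sketch (our illustration, not the authors’ code) of a block that combines a skip connection with layer normalization; the dimensions and module names are arbitrary.

```python
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)  # keeps activation magnitudes in a moderate range

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection adds the input back in, so information from earlier
        # layers reaches later ones even if self.linear discards it.
        return self.norm(x + torch.relu(self.linear(x)))

x = torch.randn(32, 64)        # a batch of 32 examples; larger batches can improve training
print(SkipBlock(64)(x).shape)  # torch.Size([32, 64])
```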
GNN basics: A graph neural network processes datasets in the form of a graph made up of nodes connected by edges (say, customers connected to products they’ve purchased or research papers connected to other papers they cite). During execution, it uses a vanilla neural network to update the representation of each node based on the representations of neighboring nodes.
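For intuition, here’s a toy sketch (our own, not tied to any particular graph library) of one round of that update: each node’s representation is recomputed by a small neural network from the sum of its neighbors’ representations.

```python
import torch
import torch.nn as nn

num_nodes, dim = 4, 8
# Edges as (source, target) pairs, e.g. customer -> purchased product.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
h = torch.randn(num_nodes, dim)    # initial node representations
update = nn.Linear(2 * dim, dim)   # the "vanilla neural network" that updates nodes

messages = torch.zeros(num_nodes, dim)
for src, dst in edges:
    messages[dst] += h[src]        # aggregate representations of neighboring nodes
h = torch.relu(update(torch.cat([h, messages], dim=-1)))  # update every node at once
```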
How it works: GHN-2 consists of an embedding layer; a gated graph neural network, which uses a gated recurrent unit (a type of recurrent network layer) to update node representations; and a convolutional neural network. Its input is a neural network architecture in graph form, where each node represents a set of weights for an operation/layer such as convolution, pooling, or self-attention, and each edge is a connection from one operation/layer to the next. Its output is a set of weights for each operation/layer. The authors trained it to generate weights for classifying images in CIFAR-10 or ImageNet using a dataset of 1 million randomly generated neural network architectures composed of convolutional layers, pooling layers, self-attention layers, and so on.
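As a rough illustration of that input format (the node and edge layout here is our own invention, not the authors’ data structure), an architecture can be written as a list of operation nodes plus directed edges describing which operation feeds which:

```python
from dataclasses import dataclass

@dataclass
class Node:
    op: str       # e.g. "conv3x3", "pool", "self_attention", "linear"
    params: dict  # hyperparameters of the operation (channels, kernel size, ...)

# A tiny convolutional classifier expressed as a computation graph.
nodes = [
    Node("conv3x3", {"in": 3, "out": 16}),
    Node("pool", {"kind": "max"}),
    Node("conv3x3", {"in": 16, "out": 32}),
    Node("linear", {"in": 32, "out": 10}),
]
edges = [(0, 1), (1, 2), (2, 3)]  # operation i feeds operation j
# A hypernetwork maps (nodes, edges) to one weight tensor per parameterized node.
```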
- Given a batch of architectures and a batch of images, GHN-2 learned to generate weights for all the architectures, applying what it had learned from previous batches to the next. It then used the images to evaluate the resulting models and to update its own parameters (a toy sketch of this loop appears after this list).
- As it trained, it added connections between layers in a given architecture, analogous to skip connections in a ResNet. These connections allowed information to pass directly from earlier layers to later ones when updating the representation of each node, reducing the amount of information lost over successive updates. (They were discarded when running the architecture with the generated weights.)
- Having added the temporary connections, it processed the architecture in three steps. (1) It created an embedding of each layer. (2) It passed the embeddings through the gated graph neural network, which updated them in the order in which a typical neural network, rather than a graph neural network, would execute its layers. (3) It passed the updated embeddings through a convolutional neural network to produce weights for the input architecture.
- Prior work found that models produced by hypernetworks generate representations whose values tend to be either very high or very low. GHN-2 normalized, or rescaled, the weights to moderate this effect.
- Given a batch of network architectures and a set of images from CIFAR-10 or ImageNet during training, GHN-2 assigned weights in a way that minimized the difference between the networks’ predicted classes and the actual classes.
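Putting these steps together, here’s a heavily simplified training loop in PyTorch. It’s a sketch under strong assumptions: the target model is a single linear classifier, the architecture is reduced to a fixed random embedding, and the data are random tensors. The real GHN-2 encodes the full computation graph with a gated graph neural network and decodes per-layer weights with a convolutional network; the point here is only that the classification loss updates the hypernetwork’s parameters, not the target model’s.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

in_dim, num_classes, code_dim = 32, 10, 16

# Stand-in for an architecture embedding; GHN-2 instead encodes the full
# computation graph with a gated graph neural network.
arch_embedding = torch.randn(code_dim)

# Hypernetwork: maps the architecture embedding to a weight matrix and bias.
hypernet = nn.Linear(code_dim, in_dim * num_classes + num_classes)
optimizer = torch.optim.Adam(hypernet.parameters(), lr=1e-3)

for step in range(100):
    images = torch.randn(64, in_dim)               # stand-in for CIFAR-10 images
    labels = torch.randint(0, num_classes, (64,))  # stand-in class labels

    generated = hypernet(arch_embedding)           # predict the target model's weights
    W = generated[: in_dim * num_classes].view(num_classes, in_dim)
    b = generated[in_dim * num_classes :]
    W = W / (W.norm() + 1e-6)                      # crude normalization of generated weights

    logits = F.linear(images, W, b)                # run the target model with generated weights
    loss = F.cross_entropy(logits, labels)         # compare predicted classes with actual classes

    optimizer.zero_grad()
    loss.backward()                                # gradients flow into the hypernetwork
    optimizer.step()
```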
Results: Architectures similar to those in the training set generally performed better using parameter values generated by GHN-2 than those generated by GHN-1. So did architectures that were wider, deeper, or denser than those in the training set. Parameter values generated by GHN-2 yielded average CIFAR-10 accuracy of 66.9 percent versus GHN-1’s 51.4 percent. While GHN-2 outperformed GHN-1 on ImageNet, neither model produced great parameter values for that task. For instance, architectures similar to those in the training set and outfitted with parameter values from GHN-2 achieved an average top-5 accuracy of 27.2 percent compared to GHN-1’s 17.2 percent.
Why it matters: GHN-2 took only a fraction of a second to generate better-than-random parameter values, while training a ResNet-50 to convergence on ImageNet can take over one week on a 32GB Nvidia V100 GPU. (To be fair, after that week-plus of training, the ResNet-50’s accuracy can be 92.9 percent — a far better result.)
We're thinking: The authors also found that initializing a model with GHN-2 boosted its accuracy after fine-tuning with a small amount of data. How much additional time did the initialization save compared to conventional initialization and fine-tuning?