The costs of deep learning are causing several challenges for the artificial intelligence community, including a large carbon footprint and the commercialization of AI research. And with more demand for AI capabilities away from cloud servers and on “edge devices,” there’s a growing need for neural networks that are cost-effective.
While AI researchers have made progress in reducing the costs of running deep learning models, the larger problem of reducing the costs of training deep neural networks remains unsolved.
Recent work by AI researchers at MIT Computer Science and Artificial Intelligence Lab (MIT CSAIL), University of Toronto Vector Institute, and Element AI explores the progress made in the field. In a paper titled “Pruning Neural Networks at Initialization: Why are We Missing the Mark?”, the researchers discuss why current state-of-the-art methods fail to reduce the costs of neural network training without considerably degrading the accuracy of the trained networks. They also suggest directions for future research.
Pruning deep neural networks after training
The past decade has shown that, in general, large neural networks provide better results. But large deep learning models come at an enormous cost. For instance, to train OpenAI’s GPT-3, which has 175 billion parameters, you’ll need access to huge server clusters with very strong graphics cards, and the costs can reach several million dollars. Furthermore, you need hundreds of gigabytes of VRAM and a strong server to run the model.
There’s a body of work that shows neural networks can be “pruned.” This means that given a very large neural network, there’s a much smaller subset that can match the accuracy of the original AI model without a significant performance penalty. For instance, earlier this year, a pair of AI researchers showed that while a large deep learning model could learn to predict future steps in John Conway’s Game of Life, there almost always exists a much smaller neural network that can be trained to perform the same task with perfect accuracy.
There is already much progress in post-training pruning. After a deep learning model goes through the entire training process, you can throw away many of its parameters, sometimes shrinking it to 10 percent of its original size. You do this by scoring the parameters based on the impact their weights have on the final output of the network, then removing the lowest-scoring ones.
Many tech companies are already using this method to compress their AI models and fit them on smartphones, laptops, and smart-home devices. Aside from slashing inference costs, this provides many benefits, such as obviating the need to send user data to cloud servers and providing real-time inference. In many areas, small neural networks make it possible to employ deep learning on devices that are powered by solar batteries or button cells.
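To make the scoring step concrete, here is a minimal sketch of post-training pruning in plain Python. It assumes the score of a parameter is simply its absolute value (magnitude pruning), which is one common choice and not necessarily what any particular company uses. The function name `magnitude_prune`, the `keep_fraction` parameter, and the sample weights are made up for illustration.

```python
def magnitude_prune(weights, keep_fraction):
    """Zero out all but the largest-magnitude weights.

    weights: flat list of floats (a stand-in for one trained layer).
    keep_fraction: share of weights to keep, e.g. 0.1 for 10 percent.
    Returns (pruned_weights, mask), where mask[i] is 1 if weight i is kept.
    """
    n_keep = max(1, round(len(weights) * keep_fraction))
    # Rank indices by absolute value, largest first.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(ranked[:n_keep])
    mask = [1 if i in kept else 0 for i in range(len(weights))]
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask


# Hypothetical "trained" weights for a tiny layer.
trained = [0.02, -1.3, 0.4, -0.05, 0.9, 0.001, -0.7, 0.1, 0.03, -0.2]
pruned, mask = magnitude_prune(trained, keep_fraction=0.3)
print(mask)  # [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
```

Calling the same function with `keep_fraction=0.1` would shrink the layer to roughly 10 percent of its original size. Real frameworks work on tensors and often prune across several layers at once, so treat this as an illustration of the idea only.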
Pruning neural networks early
The problem with pruning neural networks after training is that it doesn’t cut the costs of tuning all the excess parameters. Even if you can compress a trained neural network into a fraction of its original size, you’ll still need to pay the full costs of training it.
The question is, can you find the optimal sub-network without training the full neural network?
In 2018, Jonathan Frankle and Michael Carbin, two AI researchers at MIT CSAIL and co-authors of the new paper, published a paper titled “The Lottery Ticket Hypothesis,” which showed that for many deep learning models, there exist small subsets that can be trained to full accuracy. Finding those subnetworks can considerably reduce the time and cost to train deep learning models. The publication of the Lottery Ticket Hypothesis led to research on methods to prune neural networks at initialization or early in training.
In their new paper, the AI researchers examine some of the better-known early pruning methods: Single-shot Network Pruning (SNIP), presented at ICLR 2019; Gradient Signal Preservation (GraSP), presented at ICLR 2020; and Iterative Synaptic Flow Pruning (SynFlow).
“SNIP aims to prune weights that are least salient for the loss. GraSP aims to prune weights that harm or have the smallest benefit for gradient flow. SynFlow iteratively prunes weights, aiming to avoid layer collapse, where pruning concentrates on certain layers of the network and degrades performance prematurely,” the authors write.
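The quoted description of SNIP can be illustrated with a toy example. The sketch below assumes a SNIP-style saliency score, the absolute value of each weight multiplied by its loss gradient, computed once at initialization on a single batch. The linear model, the squared-error loss, the synthetic data, and the names `snip_scores` and `top_k_mask` are all assumptions for illustration, not the authors’ implementation. GraSP and SynFlow use different scores and are not shown.

```python
import random


def snip_scores(weights, inputs, targets):
    """SNIP-style saliency |w_j * dL/dw_j| for a linear model with squared loss."""
    n = len(inputs)
    grads = [0.0] * len(weights)
    for x, y in zip(inputs, targets):
        pred = sum(w * xi for w, xi in zip(weights, x))
        err = pred - y
        for j, xi in enumerate(x):
            # Gradient of the mean squared error with respect to weight j.
            grads[j] += 2.0 * err * xi / n
    return [abs(w * g) for w, g in zip(weights, grads)]


def top_k_mask(scores, k):
    """Keep the k highest-scoring weights (1 = keep, 0 = prune)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:k])
    return [1 if i in kept else 0 for i in range(len(scores))]


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(6)]  # untrained, random init
inputs = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(32)]
targets = [2.0 * x[0] - 1.0 * x[3] for x in inputs]  # toy regression target

scores = snip_scores(weights, inputs, targets)
mask = top_k_mask(scores, k=3)  # prune half of the weights before any training
print(mask)
```

Note that the mask depends on the random initial weights as well as on the data, because no training has happened yet. That dependence is exactly what the experiments later in this article probe.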
How does early neural network pruning perform?
In their work, the AI researchers compared the performance of the early pruning methods against two baselines: magnitude pruning after training and lottery-ticket rewinding (LTR). Magnitude pruning is the standard method that removes excess parameters after the neural network is fully trained. Lottery-ticket rewinding uses the technique Frankle and Carbin developed in their earlier work to retrain the optimal subnetwork. As mentioned earlier, these methods show that well-performing subnetworks exist, but they only do so after the full network is trained. The early pruning methods, by contrast, are supposed to find the minimal networks at the initialization phase, before training the neural network.
The researchers also compared the early pruning methods against two simple techniques. One of them randomly removes weights from the neural network. Checking against random performance is important to validate whether a method is providing significant results or not. “Random pruning is a naive method for early pruning whose performance any new proposal should surpass,” the AI researchers write.
The other method removes the parameters with the smallest absolute values. “Magnitude pruning is a standard way to prune for inference and is an additional naive point of comparison for early pruning,” the authors write.
The experiments were performed on VGG-16 and three variations of ResNet, two popular families of convolutional neural networks (CNNs).
No single method stands out among the early pruning techniques the AI researchers evaluated, and performance varies with the chosen neural network structure and the percentage of weights pruned. But their findings show that these state-of-the-art methods outperform crude random pruning by a considerable margin in most cases.
None of the methods, however, matches the accuracy of the benchmark post-training pruning.
“Overall, the methods make some progress, generally outperforming random pruning. However, this progress remains far short of magnitude pruning after training in terms of both overall accuracy and the sparsities at which it is possible to match full accuracy,” the authors write.
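The random-pruning baseline is simple enough to sketch in a few lines. The helper names `random_mask` and `sparsity_of`, the layer size, and the seed are hypothetical. Sparsity here means the fraction of weights that are removed, so a sparsity of 0.8 keeps 20 percent of the weights.

```python
import random


def random_mask(n_weights, sparsity, seed=0):
    """Mask that removes a `sparsity` fraction of weights uniformly at random."""
    n_prune = round(n_weights * sparsity)
    rng = random.Random(seed)
    pruned = set(rng.sample(range(n_weights), n_prune))
    return [0 if i in pruned else 1 for i in range(n_weights)]


def sparsity_of(mask):
    """Fraction of weights that the mask removes."""
    return 1.0 - sum(mask) / len(mask)


mask = random_mask(n_weights=20, sparsity=0.8)
print(sum(mask), "weights kept out of", len(mask))
```

Any scoring-based method would be compared against a mask like this at the same sparsity: if it cannot beat a coin flip, its scores are not adding information.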
Investigating early pruning methods
To find out why the pruning methods underperform, the AI researchers carried out several tests. First, they tested “random shuffling.” For each method, they randomly shuffled which parameters were removed within each layer of the neural network to see if it had an impact on the performance. If the pruning methods really remove parameters based on their relevance and impact, then random shuffling should severely degrade the performance.
Surprisingly, the researchers found that random shuffling did not have a severe impact on the outcome. Instead, what really decided the result was the number of weights removed from each layer.
“All methods maintain accuracy or improve when randomly shuffled. In other words, the useful information these techniques extract is not which individual weights to remove, but rather the layerwise proportions in which to prune the network,” the authors write, adding that while layer-wise pruning proportions are important, they’re not enough. The evidence is that post-training pruning methods reach full accuracy by choosing specific weights, and randomly shuffling those choices causes a sudden drop in the accuracy of the pruned network.
Next, the researchers checked whether reinitializing the network would change the performance of the pruning methods. Before training, all parameters in a neural network are initialized with random values from a chosen distribution. Previous work, including that of Frankle and Carbin as well as the Game of Life research mentioned earlier in this article, shows that these initial values often have considerable impact on the final outcome of the training. In fact, the term “lottery ticket” was coined based on the fact that there are lucky initial values that enable a small neural network to reach high accuracy in training.
Therefore, if the pruning methods choose parameters based on their initial values, changing those values should severely impact the performance of the pruned network.
Again, the tests didn’t show significant changes.
“All early pruning techniques are robust to reinitialization: accuracy is the same whether the network is trained with the original initialization or a newly sampled initialization. As with random shuffling, this insensitivity to initialization may reflect a limitation in the information that these methods use for pruning that restricts performance,” the AI researchers write.
Finally, they tried inverting the pruned weights. This means that for each method, they kept the weights marked as removable and instead removed the ones that were supposed to remain. This final test checks whether the scoring method used to select the pruned weights is effective. Two of the methods, SNIP and SynFlow, showed extreme sensitivity to the inversion and their accuracy declined, which is the expected result if their scores carry real information. But GraSP’s performance did not degrade after inverting the pruned weights, and in some cases, it even performed better.
The key takeaway from these tests is that current early pruning methods fail to detect the specific connections that define the optimal subnetwork in a deep learning model.
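The three tests described in this section are all small manipulations of a pruning mask or of the initial weights. The sketch below shows one way to express them for a single layer. The scores are made-up numbers, the helper names are hypothetical, and the inversion step assumes the number of kept weights stays the same while the ranking is flipped, which may differ in detail from the paper’s setup.

```python
import random


def keep_top_k(scores, k):
    """Mask that keeps the k highest-scoring weights (1 = keep, 0 = prune)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:k])
    return [1 if i in kept else 0 for i in range(len(scores))]


def shuffle_mask(mask, rng):
    """Random shuffling: same number of kept weights in the layer, random positions."""
    shuffled = list(mask)
    rng.shuffle(shuffled)
    return shuffled


def reinitialize(n_weights, rng, std=0.1):
    """Reinitialization: sample fresh initial weights; the mask is reused as-is."""
    return [rng.gauss(0.0, std) for _ in range(n_weights)]


def inverted_mask(scores, k):
    """Inversion: keep the k lowest-scoring weights instead of the highest."""
    return keep_top_k([-s for s in scores], k)


rng = random.Random(0)
scores = [0.9, 0.1, 0.5, 0.05, 0.7, 0.2]  # hypothetical scores for one layer

mask = keep_top_k(scores, k=2)         # [1, 0, 0, 0, 1, 0]
shuffled = shuffle_mask(mask, rng)     # still two kept weights, random positions
inverted = inverted_mask(scores, k=2)  # [0, 1, 0, 1, 0, 0]

fresh = reinitialize(len(scores), rng)
sparse_init = [w * m for w, m in zip(fresh, mask)]  # new weights, old mask
```

In each case, the resulting sparse network would then be trained and its accuracy compared with the unmodified method. If the comparison shows no drop, the original scores were not identifying specific weights that matter.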
Future directions for research
One alternative is to prune early in training instead of at initialization. In this case, the neural network is trained for a specific number of epochs before being pruned. The benefit is that instead of choosing between random weights, you’ll be pruning a network that has partially converged. Tests by the AI researchers showed that the performance of most pruning methods improved as the target network went through more training iterations, but they were still below the baseline benchmarks.
The tradeoff of pruning in early training is that you’ll have to spend resources on those initial epochs, even though the costs are much smaller than full training, and you’ll have to weigh the performance gain against the training costs.
In their paper, the AI researchers suggest future targets for research on pruning neural networks. One direction is to improve current methods or research new methods that find specific weights to prune instead of proportions in neural network layers. A second area is to find better methods for early-training pruning. And finally, maybe magnitudes and gradients are not the best signals for early pruning. “Are there different signals we should use early in training? Should we expect signals that work early in training to work late in training (or vice versa)?” the authors write.
Some of the claims made in the paper are contested by the creators of the pruning methods. “While we’re truly excited about our work (SNIP) attracting lots of interests these days and being addressed in the suggested paper by Jonathan et al., we’ve found some of the claims in the paper a bit troublesome,” Namhoon Lee, AI researcher at the University of Oxford and co-author of the SNIP paper, told TechTalks.
Contrary to the findings of the paper, Lee said that random shuffling will affect the results, and potentially by a lot, when tested on fully connected networks as opposed to convolutional neural networks.
Lee also questioned the validity of comparing early-pruning methods to post-training magnitude pruning. “Magnitude based pruning undergoes training steps before it starts the pruning process, whereas pruning-at-initialization methods do not (by definition),” Lee said. “This indicates that they are not standing at the same start line—the former is far ahead of others—and therefore, this could intrinsically and unfairly favor the former. In fact, the saliency of magnitude is not likely a driving force that yields good performance for magnitude based pruning; it’s rather the algorithm (e.g., how long it trains first, how much it prunes, etc.) that is well-tuned.” Lee added that if magnitude-based pruning starts at the same stage as with pruning-at-initialization methods, it will be the same as random pruning because the initial weights of neural networks are random values.
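The idea of pruning in early training, described at the start of this section, can be sketched as a three-phase loop: train briefly, prune, then train the sparse model. The example below is a toy under stated assumptions: a linear model trained with plain gradient descent, magnitude as the pruning score, and arbitrary step counts and learning rate. The function names and the synthetic data are hypothetical.

```python
import random


def grad(weights, data):
    """Gradient of the mean squared error for a linear model."""
    g = [0.0] * len(weights)
    for x, y in data:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xi in enumerate(x):
            g[j] += 2.0 * err * xi / len(data)
    return g


def train(weights, data, steps, mask=None, lr=0.05):
    """Plain gradient descent; weights with mask 0 are held at zero."""
    mask = mask or [1] * len(weights)
    w = [wi * m for wi, m in zip(weights, mask)]
    for _ in range(steps):
        g = grad(w, data)
        w = [(wi - lr * gi) * m for wi, gi, m in zip(w, g, mask)]
    return w


def magnitude_mask(weights, k):
    """Keep the k weights with the largest absolute value."""
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(ranked[:k])
    return [1 if i in kept else 0 for i in range(len(weights))]


rng = random.Random(0)
data = []
for _ in range(64):
    x = [rng.gauss(0.0, 1.0) for _ in range(8)]
    data.append((x, 3.0 * x[1] - 2.0 * x[5]))  # toy target using two features

init = [rng.gauss(0.0, 0.1) for _ in range(8)]
warm = train(init, data, steps=20)               # phase 1: short warm-up
mask = magnitude_mask(warm, k=2)                 # phase 2: prune
final = train(warm, data, steps=200, mask=mask)  # phase 3: train the sparse model
print(mask)
```

The `steps=20` warm-up is the knob that embodies the tradeoff discussed above: more warm-up steps give the pruning score more signal to work with, at the price of more dense training. Setting it to zero reduces this to magnitude pruning at initialization, the case Lee argues is no different from random pruning.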
Making deep learning research more accessible
It will be interesting to see how research in this area unfolds. I’m also curious to see how these and future methods would perform on other neural network architectures such as Transformers, which are far more computationally expensive to train than CNNs. Also worth noting is that these methods have been developed for and tested on supervised learning problems. Hopefully, we’ll see research on similar techniques for more costly branches of AI such as deep reinforcement learning.
Progress in this field could have a huge impact on the future of AI research and applications. With the costs of training deep neural networks constantly growing, some areas of research are becoming increasingly centralized in wealthy tech companies that have vast financial and computational resources.
Effective ways to prune neural networks before training them could create new opportunities for a wider group of AI researchers and labs who don’t have access to very large computational resources.
This article was originally published by Ben Dickson on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve.