Researchers at KAUST have found a way to train large machine learning models significantly faster by observing how frequently zero results are produced in distributed machine learning on large training datasets.
According to Jiawei Fei of the KAUST team, communication is a major performance bottleneck in distributed deep learning. Alongside the rapid growth in model size, there is also an increase in the proportion of zero values produced during the learning process, a property called sparsity. The team set out to exploit sparsity and maximize effective bandwidth usage by sending only non-zero data blocks.
AI models develop their “intelligence” by training on datasets that are labeled to tell the model how to distinguish between different inputs and respond accordingly. The more labeled data that goes in, the better the model becomes at its assigned task. For complex deep learning applications, this requires enormous input datasets and very long training times, even on powerful and expensive highly parallel supercomputing platforms.
During training, small learning tasks are assigned to tens or hundreds of computing nodes, which share their results over a communications network before running the next task. This communication among nodes at each step of the model is a major source of overhead in such parallel computing tasks.
Building on an earlier KAUST development called SwitchML, which optimized internode communication by running efficient aggregation code on the network switches that handle data transfer, Fei, Marco Canini and their colleagues went a step further: they identified zero results and developed a way to drop those transmissions without interrupting the synchronization of the parallel computation. The team demonstrated their OmniReduce scheme on a testbed of graphics processing units (GPUs) and achieved an eight-fold speed-up on typical deep learning tasks.
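To make the idea concrete, here is a minimal sketch of block-sparse aggregation in the spirit described above: each worker splits its gradient into fixed-size blocks and "transmits" only the non-zero blocks, which an aggregator then sums into a dense result. The block size, function names, and two-worker setup are illustrative assumptions, not the actual OmniReduce implementation.

```python
import numpy as np

BLOCK = 4  # block size (illustrative; real systems tune this)

def to_sparse_blocks(grad):
    """Split a flat gradient into fixed-size blocks and keep only the
    non-zero ones, returning (block index, block values) pairs."""
    blocks = grad.reshape(-1, BLOCK)
    nonzero = np.any(blocks != 0, axis=1)
    return [(i, blocks[i].copy()) for i in np.flatnonzero(nonzero)]

def aggregate(sparse_streams, total_blocks):
    """Sum the sparse block streams from all workers into a dense
    result, as a switch-side aggregator conceptually would."""
    out = np.zeros((total_blocks, BLOCK))
    for stream in sparse_streams:
        for i, block in stream:
            out[i] += block
    return out.ravel()

# Two workers with mostly-zero gradients (sparsity)
g1 = np.array([0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0], dtype=float)
g2 = np.array([0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0], dtype=float)

s1, s2 = to_sparse_blocks(g1), to_sparse_blocks(g2)
result = aggregate([s1, s2], total_blocks=len(g1) // BLOCK)

# Only 3 of 6 blocks were "transmitted", yet the dense sum is exact.
print(len(s1) + len(s2))
print(result)
```

Because only blocks containing at least one non-zero value cross the network, effective bandwidth usage grows with the sparsity of the gradients while the aggregated result stays identical to a dense all-reduce.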
Original release: EurekAlert!