Dr. Stefanie Günther, SciComp
Simultaneous Parallel-in-Layer Training for Deep Residual Networks
Deep residual networks (ResNets) have shown great promise in modeling complex data relations, with applications in image classification, speech recognition, and text processing, among others. Despite rapid methodological developments, compute times for ResNet training can still be tremendous, on the order of hours or even days. While common approaches to reducing training runtimes mostly involve data parallelism, the sequential propagation through the network layers creates a scalability barrier: training runtimes increase linearly with the number of layers.
This talk presents an approach that enables concurrency across the network layers and thus overcomes this scalability barrier. The proposed method is inspired by the fact that the propagation through a ResNet can be interpreted as an optimal control problem. In this context, the discrete network layers are interpreted as the discretization of a time-continuous dynamical system. Recent advances in parallel-in-time integration and optimization methods can thus be leveraged to speed up training. In particular, an iterative multigrid-reduction-in-time approach will be discussed, which recursively divides the time domain (i.e., the layers) into multiple time chunks that can be processed in parallel on multiple compute units. Additionally, the multigrid iterations enable a simultaneous optimization framework in which weight updates are based on inexact gradient information.
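
To make the layer-as-time-step view concrete, the following is a minimal sketch and not the implementation discussed in the talk. It assumes a tanh activation, a fixed step size h, and random weights, and it covers forward propagation only, not training. In place of full multigrid reduction in time it uses the simpler two-level parareal iteration, a closely related parallel-in-time method. All function and variable names are hypothetical.

```python
import numpy as np

def layer(u, W, b, h):
    # One residual layer, read as one forward Euler step of du/dt = tanh(W u + b).
    return u + h * np.tanh(W @ u + b)

def fine(u, Ws, bs, h):
    # Exact (sequential) propagation through all layers of one chunk.
    for W, b in zip(Ws, bs):
        u = layer(u, W, b, h)
    return u

def coarse(u, Ws, bs, h):
    # Cheap approximation of a chunk: a single large Euler step
    # that reuses the chunk's first layer (an assumption of this sketch).
    return layer(u, Ws[0], bs[0], h * len(Ws))

def parareal_forward(u0, Ws, bs, h, n_chunks, n_iters):
    # Split the layers (the "time domain") into contiguous chunks.
    chunks = list(zip(np.array_split(Ws, n_chunks), np.array_split(bs, n_chunks)))

    # Initial guess for the state at every chunk boundary: one coarse sweep.
    U = [u0]
    for Wc, bc in chunks:
        U.append(coarse(U[-1], Wc, bc, h))

    for _ in range(n_iters):
        # The fine solves depend only on the previous iterate, so they are
        # independent across chunks and could run on separate compute units.
        F = [fine(U[k], Wc, bc, h) for k, (Wc, bc) in enumerate(chunks)]
        G_old = [coarse(U[k], Wc, bc, h) for k, (Wc, bc) in enumerate(chunks)]

        # Sequential but cheap coarse correction sweep.
        U_new = [u0]
        for k, (Wc, bc) in enumerate(chunks):
            U_new.append(coarse(U_new[k], Wc, bc, h) + F[k] - G_old[k])
        U = U_new
    return U[-1]

# Small demo network: 8 layers of width 4, split into 4 chunks.
rng = np.random.default_rng(0)
n_layers, width, h = 8, 4, 0.1
Ws = rng.standard_normal((n_layers, width, width))
bs = rng.standard_normal((n_layers, width))
u0 = rng.standard_normal(width)

sequential = fine(u0, Ws, bs, h)
parallel = parareal_forward(u0, Ws, bs, h, n_chunks=4, n_iters=4)
```

After as many iterations as there are chunks, the parareal result matches sequential propagation up to rounding error. Any speedup comes from stopping earlier, once the chunk-boundary states are accurate enough, which mirrors the idea of updating weights from inexact gradient information.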