Asynchronous Parallel Stochastic Gradient Descent: A study of the influence of synchronization methods and hyperparameters
Examensarbete för masterexamen
Artificial Neural Networks (ANN) can solve complex tasks and be found in applications such as language translation, object recognition, and more. The underlying optimization algorithm for training ANNs is often Stochastic Gradient Descent (SGD). SGD is a first-order numerical optimization algorithm that repeatedly takes a step in the negative gradient direction of a loss function. Training ANNs generally require a large amount of data, and with an increasing amount of data, more complex tasks can be learned. However, more data increases the training time since several passes through the data are required. The training time can be reduced by parallelizing the training process. There are several ways to do this, and parallelizing the SGD iterations is one way. Parallelizing the SGD iterations can mainly be done in two different ways, synchronously or asynchronously. The asynchronous parallel SGD has shown performance benefits over the synchronous approach and has therefore gained increased attention in recent literature. However, asynchrony introduces challenges with understanding the execution and convergence criteria of SGD. These challenges originate from the fact that asynchrony allows for gradient updates on stale (old) views of the state. Parallel SGD and asynchronous parallel SGD, in particular, can make hyperparameter tuning even more time-consuming and challenging compared to regular sequential SGD. Parallelization of SGD introduces an additional dimension, e.g., the level of parallelism, to the training phase. Increased parallelism also increases the risk of crashed executions. This work aims to increase the understanding of the convergence properties of SGD under asynchronous parallelism. This is done by (i) analyzing how the memory model affects convergence under different levels of parallelism. (ii) What impact batch size and step size have on convergence under a varying level of parallelism is also analyzed. (iii) Moreover, an analysis of how the staleness distribution is affected by different batch sizes is made. Furthermore, (iv) backoff methods are tested to reduce contention for the lock-based algorithms and Leashed-SGD. An alternative lock-based approach to regular mutex lock is proposed using a read-write lock. The read-write lock, mutex lock, and two lock-free algorithms, Leashed-SGD  and HOGWILD! , are compared in all of the dimensions stated in the previous paragraph (i-iv). We can empirically see that the memory model used impacts convergence, where NUMA converges faster than UMA for both lock-based and lock-free AsyncSGD. Further, the level of parallelism has a high impact on what hyperparameters to use. The level of parallelism is also related to the number of crashed and diverged executions, where higher parallelism increases the risk of crashed executions.
Stochastic gradient descent , parallelism , asynchrony , machine learning