Asynchronous Parallel Stochastic Gradient Descent: A study of the influence of synchronization methods and hyperparameters
Ladda ner
Typ
Examensarbete för masterexamen
Program
Publicerad
2021
Författare
Ek, Hampus
Modellbyggare
Tidskriftstitel
ISSN
Volymtitel
Utgivare
Sammanfattning
Artificial Neural Networks (ANN) can solve complex tasks and be found in applications
such as language translation, object recognition, and more. The underlying
optimization algorithm for training ANNs is often Stochastic Gradient Descent
(SGD). SGD is a first-order numerical optimization algorithm that repeatedly
takes a step in the negative gradient direction of a loss function.
Training ANNs generally require a large amount of data, and with an increasing
amount of data, more complex tasks can be learned. However, more data increases
the training time since several passes through the data are required. The training
time can be reduced by parallelizing the training process. There are several
ways to do this, and parallelizing the SGD iterations is one way. Parallelizing the
SGD iterations can mainly be done in two different ways, synchronously or asynchronously.
The asynchronous parallel SGD has shown performance benefits over
the synchronous approach and has therefore gained increased attention in recent
literature.
However, asynchrony introduces challenges with understanding the execution and
convergence criteria of SGD. These challenges originate from the fact that asynchrony
allows for gradient updates on stale (old) views of the state. Parallel SGD
and asynchronous parallel SGD, in particular, can make hyperparameter tuning
even more time-consuming and challenging compared to regular sequential SGD.
Parallelization of SGD introduces an additional dimension, e.g., the level of parallelism,
to the training phase. Increased parallelism also increases the risk of
crashed executions.
This work aims to increase the understanding of the convergence properties of SGD
under asynchronous parallelism. This is done by (i) analyzing how the memory
model affects convergence under different levels of parallelism. (ii) What impact
batch size and step size have on convergence under a varying level of parallelism
is also analyzed. (iii) Moreover, an analysis of how the staleness distribution is
affected by different batch sizes is made. Furthermore, (iv) backoff methods are
tested to reduce contention for the lock-based algorithms and Leashed-SGD.
An alternative lock-based approach to regular mutex lock is proposed using a
read-write lock. The read-write lock, mutex lock, and two lock-free algorithms,
Leashed-SGD [1] and HOGWILD! [2], are compared in all of the dimensions stated
in the previous paragraph (i-iv).
We can empirically see that the memory model used impacts convergence, where
NUMA converges faster than UMA for both lock-based and lock-free AsyncSGD.
Further, the level of parallelism has a high impact on what hyperparameters to
use. The level of parallelism is also related to the number of crashed and diverged
executions, where higher parallelism increases the risk of crashed executions.
Beskrivning
Ämne/nyckelord
Stochastic gradient descent , parallelism , asynchrony , machine learning