Unstable gradients in deep neural nets
Master's thesis
Complex adaptive systems (MPCAS), MSc
In the past decade, deep learning algorithms have gained popularity due to their ability to detect and represent abstract features in complex data sets. One of the most prominent deep learning algorithms is the deep neural network (DNN), which has outperformed many state-of-the-art machine learning techniques. While its success can largely be attributed to its depth, this feature also makes it difficult to train. One of the main obstacles is the vanishing gradient problem, a phenomenon in which updates to the network vanish exponentially with depth. The problem is severe enough to have been referred to as a fundamental problem of deep learning. Simulations reveal, however, that DNNs are able to escape the vanishing gradient problem after having been trained for some time, yet the dynamics of this escape are still not understood. In this work, the underlying dynamics of the escape from the vanishing gradient problem in deep neural networks are explored by means of dynamical systems theory. In particular, the concept of Lyapunov exponents is used to analyse how signals propagating through the network evolve, and whether this evolution is connected to the vanishing gradient problem. The study is based on results by  and . Furthermore, a method to circumvent the vanishing gradient problem, developed in  for very wide neural networks, is explored for narrow networks. The results of this thesis suggest that the escape from the vanishing gradient problem is unrelated to which data set the deep neural network is trained on, and is rather a consequence of the training algorithm. Furthermore, it is found that the escape is characterised by the maximal Lyapunov exponent of the network growing from a negative value to a value close to 0. To further explore the underlying dynamics, it is suggested to study the training algorithms in the absence of data. The method of avoiding the vanishing gradient problem, presented by , is found to work poorly for narrow neural networks.
Vanishing gradient, dynamical system, Lyapunov exponent, dynamical isometry
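
As an illustration of the Lyapunov-exponent viewpoint mentioned in the abstract, the following is a minimal sketch of how a maximal (finite-depth) Lyapunov exponent could be estimated for an untrained network. It is not the procedure used in the thesis. The function name `max_lyapunov_exponent`, the tanh activation, the bias-free layers, and the Gaussian weight initialisation with variance `weight_std**2 / width` are all assumptions made for this example. A tangent vector is propagated through the layer Jacobians and renormalised after each layer, and the average logarithmic growth rate per layer is returned.

```python
import math
import random


def max_lyapunov_exponent(depth, width, weight_std, seed=0):
    """Estimate the maximal finite-depth Lyapunov exponent of a random tanh network.

    Assumed model (hypothetical, for illustration only): each layer maps
    x -> tanh(W x) with independent weights W_ij ~ N(0, weight_std^2 / width).
    """
    rng = random.Random(seed)
    # Random input signal and a random unit tangent (perturbation) vector.
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    v = [rng.gauss(0.0, 1.0) for _ in range(width)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]

    scale = weight_std / math.sqrt(width)
    log_growth = 0.0
    for _ in range(depth):
        W = [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
        pre = [sum(w * xj for w, xj in zip(row, x)) for row in W]
        x_next = [math.tanh(p) for p in pre]
        # Jacobian-vector product of the layer: (1 - tanh^2(pre_i)) * (W v)_i
        Wv = [sum(w * vj for w, vj in zip(row, v)) for row in W]
        v = [(1.0 - a * a) * b for a, b in zip(x_next, Wv)]
        # Accumulate the log stretching factor, then renormalise the tangent vector.
        norm = math.sqrt(sum(c * c for c in v))
        log_growth += math.log(norm)
        v = [c / norm for c in v]
        x = x_next
    return log_growth / depth


if __name__ == "__main__":
    for std in (0.3, 1.0, 3.0):
        lam = max_lyapunov_exponent(depth=100, width=20, weight_std=std, seed=1)
        print(f"weight_std={std}: estimated exponent {lam:.3f}")
```

Under these assumptions, a negative exponent means that perturbations shrink exponentially with depth. Since backpropagated gradients pass through the transposes of the same layer Jacobians, this corresponds to the regime in which gradients also vanish with depth, whereas an exponent close to 0 means that perturbations are approximately preserved from layer to layer.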