Long short-term memory

Long short-term memory (LSTM) is a technique that has contributed significantly to the development of artificial intelligence.

When artificial neural networks are trained, gradient-descent methods on an error signal are used, which can be pictured as a mountaineer searching for the deepest valley. With several deep layers this can fall short, much as a forgetful mountaineer ends up in the first convenient valley on the way down and cannot find his village, which lies in a deeper valley further on. The LSTM method solves this problem by giving an LSTM cell three types of gates for a better memory: an input gate, a forget gate that decides what is remembered and what is discarded, and an output gate. In contrast to conventional recurrent neural networks, LSTM thereby enables a kind of memory of earlier experiences: a short-term memory that lasts a long time.

LSTM networks were presented in 1997 in a publication by Sepp Hochreiter and Jürgen Schmidhuber and improved in 2000 by Felix Gers and his team. LSTM has achieved notable successes since around 2016, because since then large amounts of data have become available for training, further improvements to the LSTM technique have been made, sufficiently powerful computers are available and graphics-processor programming is used.

Neural networks with many layers are extremely capable of learning, and LSTM ensures that precisely such multilayer networks can function well. This has enabled a breakthrough in artificial intelligence.

Vanishing or exploding gradient

In the first step a forward signal is generated (red arrow); then the weights are corrected backwards (green) as an error adjustment.

One way to train artificial neural networks is error backpropagation. In the early training phase a network, for example, makes mistakes in pattern recognition: a cat should be recognized in a picture of a cat, not a dog. To correct the error, the causes of the deviations (errors) between the produced assignment (dog) and the target assignment (cat) are traced backwards, and the controlling factors (weights) in the layers of the network are repeatedly adjusted so that the assignment errors become smaller and smaller. The error is minimized with the so-called gradient method: the numbers in the controlling weights are readjusted. Neural networks consist of modules connected in series, each of which traditionally has only a single activation function that keeps its output between 0 and 1. With each error correction, the error signal is determined by the derivative of the activation function; this derivative determines the slope and the direction of the descent into the error valley. Sepp Hochreiter recognized in 1991 that this previously common method is unsuitable for multilayer networks: the further back the error is propagated (viewed from the output towards the input), the more often the error term is multiplied by the scaling factor. If this factor (here the spectral radius of a weight matrix) is always smaller than 1, the error vanishes and leads to ineffective weight updates, because whenever numbers between 0 and 1 are multiplied together, the product is smaller than the smaller of the two factors, so an originally large value fades away in the long run. If, on the other hand, the factors were greater than 1, the error value would explode.
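The core of the problem can be shown with a tiny, purely illustrative Python calculation (the function name and the numbers are chosen freely here): an error signal that is multiplied once per layer by a constant factor either fades away or blows up.

```python
# Minimal sketch of the vanishing/exploding gradient effect: the backpropagated
# error is multiplied once per layer by a scaling factor (standing in for the
# spectral radius mentioned above).
def backpropagated_error(initial_error, factor, num_layers):
    """Scale an error signal once per layer on the way back through the network."""
    error = initial_error
    for _ in range(num_layers):
        error *= factor
    return error

print(backpropagated_error(1.0, 0.5, 30))  # factor < 1: error vanishes (about 9.3e-10)
print(backpropagated_error(1.0, 1.5, 30))  # factor > 1: error explodes (about 1.9e5)
```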

The modules in the middle of the network, the so-called hidden layers that lie closer to the input layer than to the output layer, are hardly taken into account in the (backwards-computed) error adjustment. As a result they are barely trained, as if in football only the strikers learned when it comes to scoring goals, but not the midfielders or defenders.

Three gates and an inner cell

To solve this problem, the LSTM module was designed to allow a relatively constant and usable error flow. It controls closely which information may enter and leave the inner cell: the LSTM can remove information from the cell state or add information to it, carefully regulated by structures called gates. LSTM modules are chained together like conventional modules, but internally they have a different structure: the additional gates are a way of letting information through selectively.

Instead of a single neural function, an LSTM module contains four that interact in a very particular way: the three gates mentioned above and an inner cell. In short,

  • the input gate controls the extent to which a new value flows into the cell,
  • the forget gate controls the extent to which a value remains in the cell or is forgotten, and
  • the output gate controls the extent to which the value in the cell is used to compute the next module in the chain.

These network elements are linked by sigmoid functions and various vector and matrix operations and are transformed into one another.

Structure of an LSTM

Rough structure of an LSTM module with the inner cell at its center. The operator symbols in the diagram represent the convolution, the large circles with the S-shaped curve are the sigmoid functions, and the arrows pointing from the cell to the gates carry the peephole information from the previous pass.

There are several types of LSTM architecture. The convolutional LSTM network outlined here is particularly common in image processing. It differs from the plain peephole LSTM, which uses matrix multiplication, in that the activity of each neuron is computed via a discrete convolution (hence the addition "convolutional"): a comparatively small convolution matrix (filter kernel) is moved step by step over the input image. These networks are called peephole LSTMs because the gates can see the cell state, i.e. they also process the information from the cell. The index t denotes the current pass, t-1 the previous pass; d and e are the numbers of columns and rows of the vectors and matrices, respectively.
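As an illustration of how a small filter kernel is moved step by step over an input image, the following minimal NumPy sketch performs such a discrete convolution without padding (strictly speaking a cross-correlation, as is common in deep-learning libraries); the toy image, the kernel and the function name are chosen freely here.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a small filter kernel step by step over the image ('valid' convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise product of the kernel and the current image patch, summed up
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # small 2x2 filter kernel
print(conv2d_valid(image, kernel))                  # resulting 3x3 feature map
```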

The flow of data between the various gates and their inner cell is determined by vector and matrix operations. First, the mathematical structure of the forget gate is described; $f_t$ is its e-dimensional activation vector:

$f_t = \sigma_g(W_f * x_t + U_f * h_{t-1} + V_f \circ c_{t-1} + b_f)$

$x_t$ is the d-dimensional input vector. In the chain of successive modules it is, together with the output vector $h_{t-1}$ of the previous pass, the interface to the module acting before it in the chain. The three weight matrices $W_f$, $U_f$ and $V_f$ form the valuable part of every network, because they contain the training knowledge. $b_f$ is the bias vector; if there is no strong input from other units, the bias ensures that the unit stays active with a large weight and inactive with a small one. $\sigma_g$ is the sigmoid function of the gates, which maps the whole expression to nonlinear values between 0 and 1.

There are three different types of matrix operations here: the discrete convolution $*$, the Hadamard product $\circ$ (element-wise multiplication) and element-wise addition $+$.

These formulas may look complicated, but the actual computation is carried out by the program libraries of the AI providers.
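As a sketch of how such a gate formula translates into code, the following NumPy snippet evaluates the forget gate of the simpler peephole variant, i.e. with matrix multiplication in place of the convolution and with element-wise peephole weights; all names, dimensions and random values are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, e = 4, 3                        # illustrative input and cell dimensions
rng = np.random.default_rng(0)

x_t    = rng.standard_normal(d)    # input vector of the current pass
h_prev = rng.standard_normal(e)    # output vector h_{t-1} of the previous pass
c_prev = rng.standard_normal(e)    # cell state c_{t-1} of the previous pass

W_f = rng.standard_normal((e, d))  # weights for the input
U_f = rng.standard_normal((e, e))  # weights for the previous output
V_f = rng.standard_normal(e)       # peephole weights, applied element-wise
b_f = np.zeros(e)                  # bias vector

# forget gate: f_t = sigma_g(W_f x_t + U_f h_{t-1} + V_f o c_{t-1} + b_f)
f_t = sigmoid(W_f @ x_t + U_f @ h_prev + V_f * c_prev + b_f)
print(f_t)                         # values between 0 (forget) and 1 (keep)
```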

The activation vector $i_t$ of the input gate and the vector $o_t$ of the output gate are built in the same way as the forget-gate vector:

$i_t = \sigma_g(W_i * x_t + U_i * h_{t-1} + V_i \circ c_{t-1} + b_i)$

$o_t = \sigma_g(W_o * x_t + U_o * h_{t-1} + V_o \circ c_t + b_o)$

Note that the output gate looks at the freshly computed cell state $c_t$ rather than at $c_{t-1}$.

The cell state is something like a conveyor belt: the information runs in a straight line along the entire chain, with only minor linear interactions. The inner cell with the cell-state vector $c_t$ has the following structure:

$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c * x_t + U_c * h_{t-1} + b_c)$

The hyperbolic tangent (tanh) is usually used for the sigmoid functions $\sigma_c$ and $\sigma_h$. $h_{t-1}$ is the output vector of the previous pass (not shown in the rough diagram).

The initial values $c_0$ and $h_0$ are each initialized with zero vectors. The output vector $h_t$ is calculated as follows:

$h_t = o_t \circ \sigma_h(c_t)$
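Putting the pieces together, a single pass of the peephole variant (again with matrix multiplication instead of convolution) can be sketched in a few lines; every name, dimension and random weight below is illustrative, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One pass of a peephole LSTM cell (matrix-multiplication variant)."""
    W, U, V, b = params["W"], params["U"], params["V"], params["b"]
    # forget gate: how much of the old cell state is kept
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + V["f"] * c_prev + b["f"])
    # input gate: how much of the new candidate value flows into the cell
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + V["i"] * c_prev + b["i"])
    # candidate cell content (sigma_c = tanh)
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    # new cell state: the "conveyor belt" with only element-wise interactions
    c_t = f_t * c_prev + i_t * c_tilde
    # output gate peeks at the freshly computed cell state
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + V["o"] * c_t + b["o"])
    # output vector: h_t = o_t o sigma_h(c_t), with sigma_h = tanh
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# toy dimensions and random weights, for illustration only
d, e = 4, 3
rng = np.random.default_rng(1)
params = {
    "W": {k: rng.standard_normal((e, d)) for k in "fico"},
    "U": {k: rng.standard_normal((e, e)) for k in "fico"},
    "V": {k: rng.standard_normal(e) for k in "fio"},   # peephole weights (element-wise)
    "b": {k: np.zeros(e) for k in "fico"},
}
h_t, c_t = np.zeros(e), np.zeros(e)      # h_0 and c_0 are initialized with zero vectors
for x_t in rng.standard_normal((5, d)):  # five input vectors in sequence
    h_t, c_t = lstm_step(x_t, h_t, c_t, params)
print(h_t, c_t)
```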

Variants and alternatives

Before LSTM became generally accepted, delayed networks, so-called Time Delay Neural Networks, were used, and later Hidden Markov Models.

Since its introduction, more and more variants of the LSTM have appeared; as described above, the forget gate, the peephole technique and the convolution technique were developed. LSTM networks are used in particular in speech recognition for the classification of phonemes. The first work dealing with the classification of phonemes using LSTM was published by Alex Graves in 2005. In 2010, LSTM was first used for the recognition of continuous speech in a publication by Martin Wöllmer. Researchers such as Haşim Sak and Wojciech Zaremba developed LSTM techniques further for acoustic modelling and speech recognition.

As an alternative to LSTM, Kyunghyun Cho and his team developed gated recurrent units (GRUs) in 2014. These are used in particular in music modelling. They combine the forget gate and the input gate into a single update gate; the resulting model is simpler than traditional LSTM models, and its gates are arranged in a different way.
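For comparison, one common formulation of such a gated recurrent unit can be sketched as follows; the single update gate takes over the combined role of the LSTM's forget and input gates. Names, dimensions and the exact gating convention are illustrative, as several equivalent variants exist.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One pass of a gated recurrent unit (GRU); there is no separate cell state."""
    # update gate: replaces the separate forget and input gates of the LSTM
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])
    # reset gate: how much of the previous output enters the candidate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])
    # blend old output and candidate with the single update gate
    return (1.0 - z_t) * h_prev + z_t * h_cand

# toy dimensions and random weights, for illustration only
d, e = 4, 3
rng = np.random.default_rng(2)
p = {f"W_{k}": rng.standard_normal((e, d)) for k in "zrh"}
p.update({f"U_{k}": rng.standard_normal((e, e)) for k in "zrh"})
p.update({f"b_{k}": np.zeros(e) for k in "zrh"})
print(gru_step(rng.standard_normal(d), np.zeros(e), p))
```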

Successes

In the years after 2010, the technical situation for LSTM improved enormously: the advent of big data made huge amounts of data available for training the networks, and the boom in computer games, in which characters move through space, led to ever better and cheaper graphics cards. These graphics cards carry out large numbers of matrix multiplications to simulate that movement in space, which is exactly what is needed for AI and LSTM. Fast GPU implementations of this combination were introduced in 2011 by Dan Ciresan and colleagues in Schmidhuber's group. They have since won numerous competitions, including the "ISBI 2012 Segmentation of Neuronal Structures in Electron Microscopy Stacks Challenge" and the "ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images". As an alternative to the graphics processor, Google developed Tensor Processing Units to accelerate applications in the context of machine learning. Among other things, they are used to process LSTMs efficiently.

Large technology companies such as Google, Apple and Microsoft have been using LSTM as a basic component for new products since around 2016. For example, Google used LSTM for speech recognition on smartphones, for the Allo smart assistant and for Google Translate . Apple uses LSTM for the "Quicktype" function on the iPhone and for Siri . Amazon uses LSTM for Amazon Alexa .

Literature

  • Ramon Wartala: Practical introduction to deep learning: Create your own deep learning applications with Python, Caffe, TensorFlow and Spark . Heidelberg 2018, ISBN 978-3960090540 .

References

  1. Sepp Hochreiter, Jürgen Schmidhuber: Long short-term memory. In: Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
  2. Sepp Hochreiter: Studies on Dynamic Neural Networks. Diploma thesis, Munich 1991.
  3. The forget gate was developed in 2000 by Felix A. Gers and his team: Felix A. Gers, Jürgen Schmidhuber, Fred Cummins: Learning to Forget: Continual Prediction with LSTM. In: Neural Computation, vol. 12, no. 10, pp. 2451-2471, 2000.
  4. Felix Gers's dissertation on LSTM networks with forget gate.
  5. Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, Wang-chun Woo: Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 802-810, 2015.
  6. Alex Graves, Jürgen Schmidhuber: Framewise Phoneme Classification with Bidirectional LSTM Networks. In: Proc. of IJCNN 2005, Montreal, Canada, pp. 2047-2052, 2005.
  7. Martin Wöllmer, Florian Eyben, Björn Schuller, Gerhard Rigoll: Recognition of Spontaneous Conversational Speech using Long Short-Term Memory Phoneme Predictions. In: Proc. of Interspeech 2010, ISCA, pp. 1946-1949, Makuhari, Japan, 2010.
  8. Haşim Sak, Andrew Senior, Françoise Beaufays: Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition. arXiv, 2014.
  9. Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals: Recurrent Neural Network Regularization. arXiv, 2014/2015.
  10. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv, 2014.
  11. Dan C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber: Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, 2011.
  12. Dan Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber: Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images. In: Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe, 2012.
  13. Dan Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber: Mitosis Detection in Breast Cancer Histology Images using Deep Neural Networks. MICCAI 2013.
  14. Françoise Beaufays: The neural networks behind Google Voice transcription. In: Research Blog, August 11, 2015.
  15. Pranav Khaitan: Chat Smarter with Allo. In: Research Blog, May 18, 2016.
  16. Amir Efrati: Apple's Machines Can Learn Too. June 13, 2016.
  17. Werner Vogels: Bringing the Magic of Amazon AI and Alexa to Apps on AWS. All Things Distributed, November 30, 2016.