Transformer (machine learning)

A transformer is a deep learning architecture used in machine learning. It was published in 2017 at the Neural Information Processing Systems (NIPS) conference by Google Brain and represents a further development of the long short-term memory (LSTM).

Transformers are used to translate one sequence of symbols into another sequence of symbols. This is used, for example, for machine translation of sentences from one language into another, for generating text or for summarizing longer texts. Transformers are more efficient than LSTM networks and are the underlying architecture of many pre-trained machine learning models such as Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT).

Background

Before the introduction of the transformer, recurrent models such as LSTM, GRU and Seq2Seq were used in natural language processing (NLP); they process an input sequence sequentially. These methods were later extended by an attention mechanism. Transformers build on the attention mechanism and dispense with the recurrent structure. With less computational effort, they achieve similar or better results in the transformation of sequences than the recurrent models.

Architecture

A transformer essentially consists of encoders connected in series and decoders connected in series. The input sequence is converted into a vector representation by a so-called embedding layer. The weights of the embedding layer are adjusted during training. In addition, the transformer uses position coding, which allows the sequential order of the words to be taken into account. A word therefore has a different representation at the beginning of a sentence than at the end.
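A minimal sketch of this step in Python/NumPy, using the sinusoidal position coding from the original transformer publication; the vocabulary size, the embedding dimension and the token ids are illustrative assumptions, and a learned position embedding is a common alternative:

    import numpy as np

    def positional_encoding(seq_len, d_model):
        # Sinusoidal position codes as described in the original transformer paper.
        positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
        dims = np.arange(d_model)[None, :]                      # (1, d_model)
        angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
        angles = positions * angle_rates                        # (seq_len, d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimensions
        pe[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimensions
        return pe

    # Toy embedding lookup: a vocabulary of 100 symbols, 16-dimensional vectors.
    rng = np.random.default_rng(0)
    vocab_size, d_model = 100, 16
    embedding_matrix = rng.normal(size=(vocab_size, d_model))   # adjusted during training
    token_ids = np.array([12, 7, 12])                           # the same symbol at positions 0 and 2
    x = embedding_matrix[token_ids] + positional_encoding(len(token_ids), d_model)
    # x[0] differs from x[2] even though both positions hold symbol 12.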

The input sequence in its vector representation is passed to a series of encoders and converted into an internal representation. This internal representation abstractly captures the meaning of the input sequence and is translated into an output sequence by the decoders. The input sequence is processed in batches, with the length of the encoder-decoder pipeline limiting the maximum length of the input sequence. Depending on the size of the network, individual sentences or even entire paragraphs can be processed, for example. For input sequences that are shorter than the length of the encoder-decoder pipeline, padding is used to fill up the input sequence.
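A minimal padding sketch in Python; the padding symbol 0 and the maximum length 8 are illustrative assumptions, not values prescribed by the architecture:

    PAD_ID = 0        # assumed id of the padding symbol
    MAX_LEN = 8       # assumed length of the encoder-decoder pipeline

    def pad(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
        # Fill a shorter input sequence up to the fixed pipeline length.
        if len(token_ids) > max_len:
            raise ValueError("input sequence exceeds the pipeline length")
        return list(token_ids) + [pad_id] * (max_len - len(token_ids))

    batch = [pad([12, 7, 55]), pad([3, 99, 41, 8, 15])]
    # Both sequences now have length 8 and can be processed in one batch.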

An encoder consists of a self-attention module and a feedforward module, while a decoder consists of a self-attention module, an encoder-decoder attention module and a feedforward module.

Attention module

The task of the attention module is to calculate the correlation of an input symbol with the other input symbols, for example the assignment of a pronoun to its associated noun. A distinction is made between the embedding x_i, i.e. the input symbol encoded as a vector, the query vector q_i, the key vector k_i and the value vector v_i. From each embedding the other three vectors are calculated by multiplying it with matrices W_Q, W_K and W_V that are learned during training:

q_i = x_i W_Q,   k_i = x_i W_K,   v_i = x_i W_V

From these a score is calculated as the dot product of the query vector and the key vector,

score_ij = q_i · k_j

which is finally divided by the square root of the dimension d_k of the key vectors to obtain more stable gradients:

score_ij / √(d_k)

The softmax function is applied to this, so that the weights belonging to one input symbol sum to one:

a_ij = softmax_j(score_ij / √(d_k))

This weight is now multiplied by the value vector. As a result, symbols that are unimportant for the meaning are multiplied by a small value and symbols that are important for the meaning are multiplied by a large value:

z_i = Σ_j a_ij v_j

Here the vector z_i is the calculated output of the attention module for the input symbol x_i, and the weights a_ij form a probability distribution over the input symbols that indicates how strongly each of them is attended to.
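Written with matrices whose rows are the vectors x_i, q_i, k_i and v_i, the same calculation takes the following form. This is a minimal sketch in Python/NumPy; the sizes (four input symbols, d_model = d_k = 16) are illustrative, and the random matrices stand in for weights that would be learned during training:

    import numpy as np

    def softmax(s, axis=-1):
        s = s - s.max(axis=axis, keepdims=True)   # subtract the maximum for numerical stability
        e = np.exp(s)
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    n, d_model, d_k = 4, 16, 16                   # four input symbols, illustrative dimensions
    X = rng.normal(size=(n, d_model))             # rows are the embeddings x_i
    W_Q = rng.normal(size=(d_model, d_k))         # learned during training
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))

    Q = X @ W_Q                                   # rows are the query vectors q_i
    K = X @ W_K                                   # rows are the key vectors  k_i
    V = X @ W_V                                   # rows are the value vectors v_i

    scores = Q @ K.T / np.sqrt(d_k)               # score_ij = q_i · k_j / sqrt(d_k)
    A = softmax(scores, axis=-1)                  # a_ij: every row sums to one
    Z = A @ V                                     # z_i = sum_j a_ij v_j
    print(Z.shape)                                # (4, 16)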

The difference between the self-attention module and the encoder-decoder attention module is that the self-attention module uses only the values of the preceding encoder or decoder and calculates the vectors q_i, k_i and v_i from them. The encoder-decoder attention module, by contrast, calculates only the vector q_i from the upstream attention module, while the vectors k_i and v_i are obtained from the encoder.
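The distinction only concerns where the queries and the keys and values are taken from, as the following sketch illustrates (Python/NumPy; the sequence lengths are illustrative, and in a real model each attention module has its own weight matrices rather than the shared ones used here):

    import numpy as np

    def softmax(s, axis=-1):
        s = s - s.max(axis=axis, keepdims=True)
        e = np.exp(s)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(x_query, x_keyvalue, W_Q, W_K, W_V):
        # Queries come from x_query; keys and values come from x_keyvalue.
        Q, K, V = x_query @ W_Q, x_keyvalue @ W_K, x_keyvalue @ W_V
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        return A @ V

    rng = np.random.default_rng(0)
    d = 16
    W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
    encoder_output = rng.normal(size=(5, d))      # internal representation of the input sequence
    decoder_states = rng.normal(size=(3, d))      # output of the preceding decoder module

    self_att  = attention(decoder_states, decoder_states, W_Q, W_K, W_V)   # q, k and v from the same sequence
    cross_att = attention(decoder_states, encoder_output, W_Q, W_K, W_V)   # q from the decoder, k and v from the encoder
    print(self_att.shape, cross_att.shape)        # (3, 16) (3, 16)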

In practice, so-called multi-head attention is used. Each head consists of its own version of the matrices W_Q, W_K and W_V, and each attention module has several heads. If a head is not relevant for a particular input, a low output value is calculated, while a head that is relevant for the input calculates a high output value.
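A minimal sketch of multi-head self-attention in Python/NumPy; the number of heads and the dimensions are illustrative, and the output matrix W_O that recombines the concatenated heads follows the original design:

    import numpy as np

    def softmax(s, axis=-1):
        s = s - s.max(axis=axis, keepdims=True)
        e = np.exp(s)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
        # W_Q, W_K and W_V hold one matrix per head; W_O mixes the concatenated head outputs.
        heads = []
        for h in range(len(W_Q)):
            Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]
            A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
            heads.append(A @ V)                   # output of head h
        return np.concatenate(heads, axis=-1) @ W_O

    rng = np.random.default_rng(0)
    n, d_model, n_heads = 4, 16, 4
    d_head = d_model // n_heads                   # each head works on a smaller dimension
    X = rng.normal(size=(n, d_model))
    W_Q = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
    W_K = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
    W_V = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
    W_O = rng.normal(size=(n_heads * d_head, d_model))
    print(multi_head_self_attention(X, W_Q, W_K, W_V, W_O).shape)   # (4, 16)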
