Dynamic time warping

from Wikipedia, the free encyclopedia

Dynamic time warping refers to an algorithm that maps sequences of values ​​of different lengths to one another.


A prominent application example is speech recognition (the recognition of speech characteristics when dictating): Here, individual words from a spoken text are to be recognized by comparing them with stored speech patterns. One problem is that the words are often pronounced differently. Vowels in particular are often spoken longer or shorter than is the case in the stored language pattern: the word "heben", for example, can be pronounced once with a short "e" and another time as "heeeben". For a successful pattern comparison, the word should be stretched or compressed accordingly, but not evenly, but mainly on the vowels that were spoken longer or shorter. (Incidentally, this also applies to a lesser extent to consonants; certain sounds can also be swallowed completely.) The dynamic time warping algorithm does this adaptive time normalization .

Other areas of application of dynamic time warping are e.g. B. Gesture recognition in image processing or measurements of correlated noise in physics.


Animation Dynamic Time Warping.gif
Assignment Dynamic Time Warping distorted.png

The algorithm finds z. B. in speech recognition application. A spoken speech signal is to be compared with a number of existing templates in order to be able to recognize the spoken word and, for example, to carry out a corresponding action on a mobile phone . Speech signals are usually stored as spectral or cepstral value tuples, which are supplemented with further voice information such as intensity, etc. as feature vectors.

With the aid of a weighting function for the individual parameters of each value tuple, a difference measure between any two values ​​of the two signals can be established (cost function), for example a normalized Euclidean distance or the Mahalanobis distance .

The algorithm is now looking for the “cheapest” way from the beginning to the end of both signals via the spanned matrix of the paired costs of all points of both signals. This is done efficiently with the help of dynamic programming. The actual path, i.e. the warping , is obtained through what is known as backtracking after the first run of the algorithm. For the pure cost determination (i.e. the template selection), however, a simple run without backtracking is sufficient. The backtracking enables an exact mapping of each point of the shorter signal to one or more points of the longer signal and thus represents the approximate time distortion.

Due to algorithmic problems in the extraction of speech signal parameters (octave errors, voice activation errors, etc.), the optimal path through the signal difference matrix does not necessarily have to correspond to the actual time distortion.

See also

Individual evidence

  1. a b DTW or dynamic time warping . www.inf.fu-berlin.de. Retrieved April 8, 2009.
  2. Wendemuth, Andreas: Basics of stochastic language processing , p. 137
  3. Wendemuth, Andreas: Basics of stochastic language processing , p. 133
  4. ^ Black, AW; P. Taylor (1997a): Automatically clustering similar units for unit selection in speech synthesis. In: Proc. Eurospeech '97.