Prosody recognition

Prosody recognition (also called prosody classification) is a branch of automatic pattern recognition, or pattern classification, in which the patterns to be classified represent prosodic properties of speech. Prosodic features are therefore often classified in combination with speech recognition.

Analyzed prosodic properties

Intonation (measurement of the fundamental frequency)

Intonation curve comparisons
Characteristic features in the course of the intonation: After a speaker finishes a sentence and takes a breath, there is often a so-called pitch reset, an increase in the fundamental frequency at the beginning of the new sentence. Over the course of a sentence the fundamental frequency tends to fall, which is due to exhalation. When a sentence is uttered while inhaling, the fundamental frequency tends to rise.
Intonation at the end of a phrase: This carries particular meaning in German, for example. Yes-no questions often end with rising intonation, whereas statements tend to end with falling intonation.
Ironic utterances show a markedly different intonation from sentences that are meant literally.
To emphasize (accentuate) syllables, words or phrases, for example to avoid ambiguity, the intonation can also be changed. The syllable, word or phrase is then stressed differently.
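The distinction between rising and falling phrase-final intonation can be illustrated with a minimal sketch. It assumes the fundamental frequency is already available as a list of values in Hz, one per analysis frame. The function names, the `tail_fraction` parameter and the slope threshold are illustrative assumptions, not part of any established method.

```python
def linear_slope(values):
    """Least-squares slope of values against their frame index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def classify_final_contour(f0_hz, tail_fraction=0.25, threshold=0.5):
    """Label the end of a phrase as 'rising', 'falling' or 'level'.

    tail_fraction and threshold (Hz per frame) are illustrative assumptions.
    """
    tail_len = max(2, int(len(f0_hz) * tail_fraction))
    slope = linear_slope(f0_hz[-tail_len:])
    if slope > threshold:
        return "rising"
    if slope < -threshold:
        return "falling"
    return "level"


# Hypothetical contours, one F0 value per frame
statement = [180, 178, 175, 172, 170, 166, 160, 150]
question = [180, 178, 175, 172, 170, 172, 185, 205]
print(classify_final_contour(statement))
print(classify_final_contour(question))
```

A real system would have to choose the analysis window and threshold per speaker and language. The sketch only shows that the final slope of the contour carries the distinction.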

Energy, volume and loudness

Relative volume fluctuations
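One possible way to measure relative volume fluctuations is to compute the level of short frames and express each one relative to the mean level of the utterance. The following sketch assumes the signal is a list of amplitude samples. The frame length, the use of RMS level in dB and the function names are assumptions made for illustration.

```python
import math


def frame_levels_db(samples, frame_len):
    """RMS level in dB of consecutive, non-overlapping frames."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        # Clamp to avoid log10(0) for silent frames
        levels.append(20 * math.log10(max(rms, 1e-10)))
    return levels


def relative_levels_db(levels):
    """Each frame level minus the mean level of the utterance (in dB)."""
    mean_level = sum(levels) / len(levels)
    return [level - mean_level for level in levels]


# Hypothetical signal: a quiet stretch followed by a loud one
samples = [0.1] * 4 + [1.0] * 4
levels = frame_levels_db(samples, frame_len=4)
print(relative_levels_db(levels))
```

Working with levels relative to the utterance mean makes the feature less dependent on microphone distance and recording gain.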

Duration, quantity, rhythm, speed of speech

Pauses between words (rhythm)
Mean speech rate
Deviation from the mean speech rate
Mean phoneme length
Mean syllable length
Mean word length
Mean phrase length (until the next breath is taken)

These features are often mapped onto linguistic models of prosody, especially of intonation, because only such models allow statements about the meaning of the measurements. In other words, the models provide the classes that pattern recognition and pattern analysis require.
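Several of the duration and rhythm measures listed above can be derived from a time-aligned transcript. The sketch below assumes word boundaries are given as `(label, start, end)` tuples in seconds, for example from a forced aligner. The input format, the feature names and the choice of the word as the unit are assumptions.

```python
from statistics import mean, pstdev


def duration_features(words):
    """words: list of (label, start_s, end_s) tuples in temporal order."""
    durations = [end - start for _, start, end in words]
    # A pause is any gap between the end of one word and the start of the next
    pauses = [
        next_start - end
        for (_, _, end), (_, next_start, _) in zip(words, words[1:])
        if next_start > end
    ]
    total_time = words[-1][2] - words[0][1]
    return {
        "mean_word_length_s": mean(durations),
        "word_length_sd_s": pstdev(durations),
        "words_per_second": len(words) / total_time,
        "pause_count": len(pauses),
        "mean_pause_s": mean(pauses) if pauses else 0.0,
    }


# Hypothetical alignment of a four-word utterance
words = [("the", 0.0, 0.2), ("cat", 0.2, 0.6), ("sat", 0.9, 1.3), ("down", 1.3, 2.0)]
features = duration_features(words)
for name, value in features.items():
    print(name, round(value, 3))
```

The same function could be applied to phoneme or syllable alignments to obtain the corresponding mean lengths.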

Preprocessing

Smoothing out microprosodic effects

Jitter and shimmer, known from microprosody, produce irregularities in frequency and amplitude, respectively, and must be removed from the speech signal before automatic classification (e.g. of intonation). This can be done by smoothing the discretely sampled signal with a median filter.
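A sliding median filter can be sketched in a few lines. The example applies it to a fundamental frequency contour rather than to the raw waveform, which is an assumption about where in the pipeline the smoothing takes place. The window size and the handling of the edges (a shortened window) are also illustrative choices.

```python
from statistics import median


def median_smooth(values, window=5):
    """Sliding median with an odd window; the window is shortened at the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


# Hypothetical F0 contour in Hz with two isolated outliers (240 and 60)
f0 = [120, 121, 119, 240, 122, 121, 60, 120, 119]
print(median_smooth(f0, window=3))
```

Unlike a moving average, the median discards an isolated outlier completely instead of spreading it over neighboring frames, which is why it suits this kind of irregularity.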

Interpolations

Plosives create a brief glottal closure. During this time the vocal folds do not vibrate, so there is no measurable fundamental frequency. The sampled contour therefore contains small gaps in which no information is available, and these can mislead an intonation classifier into choosing the wrong category. Interpolating across the gaps can improve the recognition rate.
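Linear interpolation across such gaps might look as follows. The sketch assumes that frames without a measurable fundamental frequency are marked with the value 0.0, which is a convention chosen here for illustration. Gaps at the very beginning or end are filled with the nearest voiced value, which is likewise an assumption.

```python
def interpolate_unvoiced(f0, unvoiced=0.0):
    """Fill frames marked as unvoiced by linear interpolation.

    Gaps at the start or end are filled with the nearest voiced value.
    """
    voiced = [i for i, value in enumerate(f0) if value != unvoiced]
    if not voiced:
        raise ValueError("no voiced frames to interpolate from")
    result = list(f0)
    # Leading and trailing gaps: repeat the nearest voiced value
    for i in range(voiced[0]):
        result[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        result[i] = f0[voiced[-1]]
    # Interior gaps: straight line between the neighboring voiced frames
    for left, right in zip(voiced, voiced[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            result[i] = f0[left] + weight * (f0[right] - f0[left])
    return result


# Hypothetical contour in Hz; 0.0 marks frames without a measurable F0
contour = [0.0, 100.0, 0.0, 0.0, 130.0, 128.0, 0.0]
print(interpolate_unvoiced(contour))
```

After this step the contour has a value in every frame, so a later regression or classifier does not have to treat missing data as a special case.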

Detection examples

At the acoustic level, intonation roughly corresponds to the fundamental frequency. It can be extracted automatically from an audio signal using so-called pitch trackers (the Praat program, for example, contains a pitch tracking function), which yield a series of fundamental frequency values. After interpolation and median smoothing, this discrete series can be approximated by regression analysis with polynomials, for example straight lines. The course of the fundamental frequency is then modeled by several short straight segments. From this approximated intonation contour of the utterance, conclusions can be drawn about particular prosodic events: steeply sloping segments, for example, can point to a peak in the contour, i.e. an accented word. This can be useful for a robot's understanding of dialogue, because speech recognition alone does not provide any accent information.
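The approximation by straight segments can be sketched with an ordinary least-squares fit per segment. The example assumes a contour that has already been interpolated and smoothed. It uses segments of fixed length and flags a peak wherever a steeply rising segment is followed by a steeply falling one. The fixed segmentation, the `min_slope` threshold and the function names are simplifying assumptions.

```python
def fit_line(ys):
    """Least-squares slope and intercept of ys against frame index 0..n-1."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(ys)) / den
    return slope, mean_y - slope * mean_x


def piecewise_slopes(f0, segment_len):
    """Fit one line per fixed-length segment; return (start_index, slope) pairs."""
    segments = []
    for start in range(0, len(f0) - segment_len + 1, segment_len):
        slope, _ = fit_line(f0[start:start + segment_len])
        segments.append((start, slope))
    return segments


def find_peaks(segments, min_slope):
    """Boundary indices where a steep rise is followed by a steep fall."""
    peaks = []
    for (_, rise), (boundary, fall) in zip(segments, segments[1:]):
        if rise >= min_slope and fall <= -min_slope:
            peaks.append(boundary)
    return peaks


# Hypothetical smoothed contour in Hz with one accent peak in the middle
f0 = [120, 120, 121, 120, 130, 140, 150, 160,
      150, 140, 130, 120, 120, 119, 120, 120]
segments = piecewise_slopes(f0, segment_len=4)
print(find_peaks(segments, min_slope=5.0))
```

In practice the segment boundaries would be chosen adaptively, for example at syllable boundaries, rather than at fixed intervals.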

Areas of application

Emotion recognition

Changes in the suprasegmental properties of speech are used to "read" emotional states from the speech signal. Excited people tend to speak faster, angry people louder, and frightened people more quietly. Sad people tend to speak more slowly and in a more drawn-out manner.

Robotics

Prosody recognition can help robots resolve ambiguities at different linguistic levels. This improves the performance of speech recognition and increases the acceptance of the robot as a conversation or interaction partner in human-machine communication. A robot also appears more human when it can use the emotional characteristics of a speaker's voice to adapt its own voice accordingly (a compassionate voice for people who sound sad, a joyful voice for happy people) or to adapt its facial expressions to the emotions. Recognizing irony or humor also improves its acceptance as a natural interaction partner.

Speech understanding systems and dialogue systems

Many speech understanding systems exist outside of robotics: in navigation devices, in dictation systems, as an alternative input method for computers (e.g. speech recognition in Windows Vista) or in automatic telephone information systems. Prosody recognition can improve speech recognition there as well, by resolving ambiguities (e.g. those caused by elliptical sentences) or references to particular parts of a sentence. Quotations in the middle of a sentence can also be recognized more reliably. For example, "As the professor mentioned in 'The History of the Vikings'" is not a valid grammatical sentence unless 'The History of the Vikings' is recognized as a quotation or as the title of a book.

Medicine

Among other uses, prosody recognition modules are employed in speech therapy to measure and treat speech disorders in a targeted way.

Speaker recognition

To recognize which speaker said what when many people are speaking at the same time, each speaker's voice must be clearly distinguishable from the voices of the others. Typical features such as fundamental frequency and mean speech rate can help, as can microprosodic features such as jitter and shimmer, which differ between people and are characteristic of each individual. The problem of tracking one of many voices often arises with dictation systems used in company meetings to transcribe the entire conversation verbatim. Humans can easily focus on one of many voices speaking at the same time, but automatic systems find this very difficult. This problem is known, among other names, as the cocktail party effect, and optimal solutions do not yet exist.
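How such features can separate speakers may be illustrated with a nearest-profile comparison. The sketch assumes each enrolled speaker is described by a small feature vector and assigns a new sample to the closest profile. All feature values, the per-feature scales and the function names are hypothetical, and the sketch does not address overlapping speech, which is the hard part of the problem described above.

```python
import math


def scaled_distance(a, b, scales):
    """Euclidean distance after dividing each feature difference by its scale."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales)))


def nearest_speaker(sample, profiles, scales):
    """Return the name of the enrolled profile closest to the sample."""
    return min(
        profiles,
        key=lambda name: scaled_distance(sample, profiles[name], scales),
    )


# Hypothetical profiles: (mean F0 in Hz, syllables per second, jitter in %, shimmer in %)
profiles = {
    "speaker_a": (115.0, 4.2, 0.5, 3.0),
    "speaker_b": (210.0, 5.1, 0.9, 4.5),
}
# Hypothetical scales that bring the features into comparable ranges
scales = (50.0, 1.0, 0.5, 2.0)

sample = (120.0, 4.0, 0.6, 3.2)
print(nearest_speaker(sample, profiles, scales))
```

Scaling each feature before computing the distance keeps the fundamental frequency, which has the largest numeric range, from dominating the comparison.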

Speaker verification

In high-security areas such as research centers, only authorized employees may enter certain zones. To ensure this, prosodic and microprosodic features of the voice are often used for verification in addition to other biometric features. The speaker is often asked to say a passphrase.

Language recognition

To recognize automatically which language a speaker is using, prosodic features can be used in addition to features from speech recognition (see B-prosody). Every language has a typical sound, a typical sequence of frequent sound combinations or even characteristic sounds (e.g. guttural sounds in Arabic).

Machine translation

In machine translation, prosody modules are used to improve speech recognition and to resolve syntactic, semantic and pragmatic ambiguities so that the utterance can be translated adequately into the target language. The Verbmobil project is a good example.
