Speech recognition, or automatic speech recognition, is a branch of applied computer science, engineering, and computational linguistics. It deals with the investigation and development of methods that make spoken language accessible to machines, especially computers, for automatic data capture. Speech recognition is to be distinguished from voice or speaker recognition, a biometric method of personal identification; the implementations of the two are, however, similar.
Research on speech recognition systems began in the 1960s but was largely unsuccessful at the time: the systems developed by private companies could recognize only a few dozen individual words under laboratory conditions. This was due on the one hand to limited knowledge in the new research area, and on the other to the limited technical possibilities of the time.
It was not until the mid-1980s that development continued. During this time it was discovered that homophones could be distinguished by examining their context. By compiling and evaluating statistics on the frequency of certain word combinations, it was possible to decide which word was meant when several sounded similar or identical. These so-called trigram statistics subsequently became an important part of all speech recognition systems. In 1984, IBM introduced the first speech recognition system that could recognize around 5,000 individual English words. However, the system required several minutes of computing time on a mainframe for each recognition pass. A system developed by Dragon Systems was more advanced in one respect: it could be used on a portable PC.
Between 1988 and 1993, the European SUNDIAL project demonstrated speech recognition for train timetable inquiries, including in German. SUNDIAL also studied metrics for evaluating speech recognition.
In 1991, IBM presented a speech recognition system at CeBIT for the first time; it could recognize 20,000 to 30,000 German words. However, the presentation of the system, called TANGORA 4, had to take place in a specially shielded room, as the noise of the trade fair would otherwise have disrupted it.
At the end of 1993, IBM presented the first speech recognition system developed for the mass market: the IBM Personal Dictation System ran on normal PCs and cost less than $1,000. When it was presented under the name IBM VoiceType Dictation System at CeBIT 1994, it met with great interest from visitors and the trade press.
In 1997, both the IBM ViaVoice software (the successor to IBM VoiceType) and version 1.0 of the Dragon NaturallySpeaking software were released for PC end users. In 1998, Philips Speech Recognition Systems launched FreeSpeech 98, a speech recognition system for PC end users whose controls were adapted to the in-house digital voice recorder SpeechMike, but discontinued the product line after the second version, FreeSpeech 2000. In 2004, IBM released parts of its speech recognition applications as open source, causing a sensation. Industry insiders suspected a tactical move against Microsoft, which is also active in this field and which, with the release of its Windows Vista PC operating system in 2007, for the first time offered integrated speech recognition functions for control as well as for dictation; these have been developed further up to Windows 10.
While development of IBM ViaVoice was discontinued, Dragon NaturallySpeaking became the most widely used third-party speaker-dependent speech recognition software for Windows PCs; it has been manufactured and sold by Nuance Communications since 2005.
With the acquisition of Philips Speech Recognition Systems, Vienna, in 2008, Nuance also obtained the rights to the SpeechMagic software development kit (SDK), which is particularly popular in the health sector. For Apple's iMac personal computers, MacSpeech had been selling third-party speech recognition software under the name iListen since 2006, based on Philips components. In 2008 this was replaced by MacSpeech Dictate, which used the core components of Dragon NaturallySpeaking and, after Nuance Communications acquired MacSpeech, was renamed Dragon Dictate (version 2.0; version 3.0 has been sold since 2012).
The company Siri Inc. was founded in 2007 and acquired by Apple in April 2010. In October 2011, Apple presented the Siri speech recognition software for the iPhone 4S, which recognizes and processes naturally spoken language (using Apple's servers) and can thus perform the functions of a personal assistant.
At present, a broad distinction can be made between two types of speech recognition:
- Speaker-independent speech recognition
- Speaker-dependent speech recognition
A characteristic of speaker-independent speech recognition is that the user can start using it immediately, without a prior training phase. However, the vocabulary is limited to a few thousand words.
Speaker-dependent speech recognizers are trained by the user on his or her own pronunciation before use (or, in newer systems, during use). A central element is the possibility of individual interaction with the system in order to achieve an optimal speaker-dependent result (own terms, abbreviations, acronyms, etc.). Use in applications with frequently changing users (e.g. call centers) therefore makes little sense. In return, the vocabulary is much larger than that of speaker-independent recognizers; current systems contain more than 300,000 word forms. A further distinction must be made between:
- Front-end systems and
- Back-end systems.
In front-end systems, the speech is processed and converted into text immediately, so that the user can read the result with practically no perceptible delay. The processing can run on the user's computer or in the cloud. The highest recognition quality is achieved here through the direct interaction between user and system. The system can also be controlled by commands and combined with other components such as real-time assistance systems. In back-end systems, by contrast, recognition is carried out with a delay, usually on a remote server, and the text only becomes available some time later. Such systems are still widespread in the medical field. Since there is no direct interaction between the speaker and the recognition result, outstanding quality can only be expected if the user already has experience with speech recognition.
Speaker-independent speech recognition is preferred in technical applications, for example in automatic dialogue systems such as timetable information. Wherever only a limited vocabulary is used, speaker-independent recognition is employed successfully: systems for recognizing spoken English digits from 0 to 9 achieve recognition rates of almost 100 percent.
With speaker-dependent speech recognition, very high recognition rates can be achieved. However, even an accuracy of 95 percent can be perceived as too low, since too much has to be corrected afterwards. The interaction between user and system, which allows the user to influence the personal recognition result directly or indirectly, is decisive for the success of speaker-dependent speech recognition.
Current systems now achieve recognition rates of approximately 99 percent when dictating continuous text on personal computers and thus meet practical requirements for many areas of application, e.g. academic texts, business correspondence, or legal briefs. Their use reaches its limits where an author constantly needs new words and word forms that the software cannot initially recognize; these can be added manually, but doing so is inefficient for words that occur only once in texts by the same speaker. Poets, for example, therefore benefit less from speech recognition than doctors and lawyers.
In addition to the size and flexibility of the dictionary, the quality of the acoustic recording also plays a decisive role. With microphones placed directly in front of the mouth (for example headsets or telephones), significantly higher recognition accuracy is achieved than with room microphones further away.
The most important influencing factors in practice, however, are precise pronunciation and coherent, fluently spoken dictation, so that word connections and word-sequence probabilities can flow optimally into the recognition process.
The development of speech recognition is proceeding very quickly. Today (as of 2016), speech recognition systems are used, among other things, in smartphones, e.g. with Siri, Google Now, Cortana, and Samsung's S Voice. Current speech recognition systems no longer have to be trained. The plasticity of the system is decisive for a high level of accuracy outside everyday language. In order to meet high demands, professional systems offer the user the possibility of influencing the personal result by writing out or reading aloud sample text.
To increase recognition accuracy even further, attempts are sometimes made to film the speaker's face with a video camera and read the lip movements from it. By combining these results with those of acoustic recognition, a significantly higher recognition rate can be achieved, especially with noisy recordings.
Since communication with human language is usually a dialogue between two conversation partners, speech recognition is often found in connection with speech synthesis . In this way, the user of the system can be given acoustic feedback about the success of the speech recognition and information about any actions that may have been carried out. In the same way, the user can also be asked to give another voice input.
To understand how a speech recognition system works, one must first be clear about the challenges that must be overcome.
Discrete and continuous speech
In a sentence of everyday language, the individual words are pronounced without noticeable pauses between them. Humans can orient themselves intuitively to the transitions between words; earlier speech recognition systems could not. They required discrete (interrupted) speech, with artificial pauses between the words.
Modern systems, however, are also able to recognize continuous (fluently spoken) speech.
In discrete speech, the pauses between the words are clearly audible, being longer and more distinct than the transitions between the syllables within a word such as "encyclopedia".
In continuous speech, the individual words merge into one another and there are no pauses.
Through inflection, i.e. the modification of a word according to its grammatical function, word stems (lexemes) give rise to a multitude of word forms. This matters for the size of the vocabulary, since in speech recognition all word forms must be treated as independent words.
The size of the dictionary depends heavily on the language. On the one hand, average German speakers, with around 4,000 words, have a significantly larger everyday vocabulary than English speakers with around 800 words. On the other hand, inflection in the German language produces about ten times as many word forms as there are stems, whereas in English only about four times as many word forms arise.
In many languages there are words or word forms that have different meanings but are pronounced identically. The English words "sea" and "see", for example, sound the same but have nothing to do with each other. Such words are called homophones. Since a speech recognition system, unlike a human, generally has no knowledge of the world, it cannot distinguish between the possibilities on the basis of meaning.
The question of upper or lower case also falls into this area.
On the acoustic level, the position of the formants plays a particular role: the frequency components of spoken vowels are typically concentrated at certain characteristic frequencies, called formants. The two lowest formants are especially important for distinguishing vowels: the lower one lies in the range of 200 to 800 Hertz, the higher one in the range of 800 to 2,400 Hertz. The individual vowels can be distinguished by the position of these frequencies.
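As a rough illustration of how formant positions separate vowels, the sketch below classifies a measured (F1, F2) pair by its distance to reference values. The frequencies used are assumed averages, not taken from the text above, and real values vary considerably by speaker and language.

```python
# Illustrative average formant frequencies in Hz (assumed values; real
# formants differ by speaker and language).
VOWEL_FORMANTS = {
    'i': (300, 2300),   # as in "see"
    'a': (700, 1200),   # as in "father"
    'u': (300, 800),    # as in "boot"
}

def classify_vowel(f1, f2):
    """Assign a measured (F1, F2) pair to the nearest reference vowel
    by squared Euclidean distance in formant space."""
    return min(VOWEL_FORMANTS,
               key=lambda v: (VOWEL_FORMANTS[v][0] - f1) ** 2
                           + (VOWEL_FORMANTS[v][1] - f2) ** 2)

print(classify_vowel(320, 2200), classify_vowel(680, 1150))  # i a
```

Real recognizers do not use such hard-coded prototypes, but the idea of comparing measured spectral features against stored references by a distance measure carries over to the feature vectors discussed later.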
Consonants are comparatively difficult to recognize; individual consonants (so-called plosives), for example, can only be identified through the transitions to the neighboring sounds, as the following example shows:
One can see that within the word "speak" the consonant p (more precisely, the closure phase of the phoneme p) is actually only silence and is recognized solely by the transitions to the neighboring vowels; removing it makes no audible difference.
Other consonants can be recognized by their characteristic spectral patterns. The sounds s and f (fricatives), for example, are characterized by a high proportion of energy in higher frequency bands. Notably, the information relevant for distinguishing these two sounds lies largely outside the spectral range transmitted in telephone networks (up to approx. 3.4 kHz). This explains why spelling over the phone without a special spelling alphabet is extremely laborious and error-prone, even in communication between two people.
Dialects and sociolects
Even if a speech recognition program is well adjusted to a standard language, this does not mean that it can understand every form of that language. Such programs often reach their limits with dialects and sociolects in particular. People can usually adjust quickly to the possibly unfamiliar dialect of their counterpart; recognition software cannot do this easily. Dialects first have to be taught to the program in complex training processes.
In addition, it must be taken into account that word meanings can vary by region. Bavarians and Berliners, for example, mean different pastries when they speak of "Pfannkuchen" (pancakes). With their cultural background knowledge, humans can avoid and resolve such misunderstandings more easily than software currently can.
Solution strategies for communication problems
When communication problems arise, people naturally tend to speak particularly loudly or to paraphrase misunderstood terms in more detail. With a computer, however, this can be counterproductive, as it is trained on normal conversational volume and works with keywords rather than understanding context.
A speech recognition system consists of the following components: preprocessing, which breaks the analog speech signal down into its individual frequencies, followed by the actual recognition using acoustic models, dictionaries, and language models.
The most important task of this filtering step is to distinguish speech from ambient sounds such as background or engine noise. The signal energy or the zero-crossing rate, for example, are used for this purpose.
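The energy and zero-crossing criteria mentioned above can be sketched as follows; the frame length, the thresholds, and the test signals are illustrative assumptions, not values from the text.

```python
import numpy as np

def frame_features(signal, frame_len=256):
    """Split a signal into frames and compute short-time energy
    and zero-crossing rate (ZCR) for each frame."""
    n_frames = len(signal) // frame_len
    feats = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.sum(frame ** 2) / frame_len
        # fraction of adjacent sample pairs whose sign changes
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        feats.append((energy, zcr))
    return feats

def is_speech(energy, zcr, energy_thresh=0.01, zcr_thresh=0.5):
    """Crude decision rule: speech has noticeable energy and, unlike
    broadband noise, a moderate zero-crossing rate."""
    return energy > energy_thresh and zcr < zcr_thresh

# 200 Hz tone as a stand-in for a voiced sound vs. faint background noise
t = np.linspace(0, 1, 4096)
voiced = 0.5 * np.sin(2 * np.pi * 200 * t)
noise = 0.001 * np.random.default_rng(0).standard_normal(4096)

e_v, z_v = frame_features(voiced)[0]
e_n, z_n = frame_features(noise)[0]
print(is_speech(e_v, z_v), is_speech(e_n, z_n))  # True False
```

Production systems use more robust detectors, but this is the core idea: cheap per-frame statistics separate speech-like frames from silence or noise before the expensive recognition stage runs.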
It is not the time signal but the signal in the frequency domain that is relevant for speech recognition. To obtain it, the signal is transformed using the FFT (fast Fourier transform). The frequency components present in the signal can be read from the result, the frequency spectrum.
For the actual speech recognition, a feature vector is created. It consists of mutually dependent or independent features generated from the digital speech signal. In addition to the spectrum already mentioned, these include above all the cepstrum. Feature vectors can be compared, for example, by means of a previously defined metric.
The cepstrum is obtained from the spectrum by taking the FFT of the logarithmic magnitude spectrum. This allows periodicities in the spectrum to be recognized. These are generated in the human vocal tract and by the excitation of the vocal cords. The periodicities due to vocal-cord excitation predominate and are therefore found in the upper part of the cepstrum, whereas the lower part depicts the shape of the vocal tract. The latter is what matters for speech recognition, so only these lower parts of the cepstrum flow into the feature vector. Since the room transfer function, i.e. the change of the signal caused for example by reflections from walls, does not change over time, it can be represented by the mean value of the cepstrum. This mean is therefore often subtracted from the cepstrum to compensate for echoes. The first derivative of the cepstrum, which can also flow into the feature vector, is likewise used to compensate for the room transfer function.
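A minimal sketch of the cepstrum computation described above, assuming a single windowed frame and an arbitrary cutoff of 13 coefficients for the vocal-tract part (the frame length, sampling rate, and cutoff are illustrative choices, not values from the text):

```python
import numpy as np

def cepstrum(frame):
    """Real cepstrum: inverse FFT of the log-magnitude spectrum of a
    windowed frame. Low coefficients (quefrencies) describe the
    vocal-tract envelope, high ones the periodic vocal-cord excitation."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # offset avoids log(0)
    return np.fft.irfft(log_mag)

def vocal_tract_features(frame, n_coeffs=13):
    """Keep only the lowest cepstral coefficients for the feature
    vector, i.e. the vocal-tract part described in the text."""
    return cepstrum(frame)[:n_coeffs]

# One 64 ms frame of a synthetic 100 Hz "voiced" signal at 8 kHz
frame = np.sin(2 * np.pi * 100 * np.arange(512) / 8000)
print(vocal_tract_features(frame).shape)  # (13,)
```

The widely used mel-frequency cepstral coefficients (MFCCs) add a perceptually motivated mel filter bank before the logarithm, but the principle of truncating the cepstrum to its lower part is the same.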
Hidden Markov Models
In the further course of the process, Hidden Markov Models (HMMs) play an important role. They make it possible to find the phonemes that best match the input signals. To do this, the acoustic model of a phoneme is broken down into different parts: the beginning, a varying number of middle sections depending on its length, and the end. The input signals are compared with these stored sections, and possible combinations are sought using the Viterbi algorithm.
For the recognition of interrupted (discrete) speech, in which a pause is made after each word, it is sufficient to compute one word at a time together with a pause model within the HMM. Since the computing capacity of modern PCs has increased significantly, fluent (continuous) speech can now also be recognized by building larger Hidden Markov Models that comprise several words and the transitions between them.
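The Viterbi search over an HMM can be sketched on a toy model; the two-state "phoneme" below and all of its probabilities are invented for illustration, and real systems work in log probabilities over far larger models.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # best predecessor state for s, given observation o
            prob, prev = max(
                (V[-2][p] * trans_p[p][s] * emit_p[s][o], p) for p in states
            )
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

# Toy phoneme model with a 'begin' and an 'end' segment; observations are
# quantized acoustic symbols, all probabilities invented for illustration.
states = ('begin', 'end')
start_p = {'begin': 0.9, 'end': 0.1}
trans_p = {'begin': {'begin': 0.6, 'end': 0.4},
           'end':   {'begin': 0.0, 'end': 1.0}}
emit_p = {'begin': {'lo': 0.7, 'hi': 0.3},
          'end':   {'lo': 0.2, 'hi': 0.8}}

print(viterbi(['lo', 'lo', 'hi'], states, start_p, trans_p, emit_p))
# ['begin', 'begin', 'end']
```

The dynamic-programming structure is what makes continuous recognition feasible: the cost grows with the number of states and transitions, not with the exponential number of possible state sequences.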
Alternatively, attempts have been made to use neural networks for the acoustic model. With time delay neural networks, changes in the frequency spectrum over time were to be exploited for recognition. This development initially produced positive results but was then abandoned in favor of HMMs. Only in recent years has the concept been rediscovered in the context of deep neural networks; speech recognition systems based on deep learning deliver recognition rates approaching human performance.
There is also a hybrid approach in which the data obtained from preprocessing are pre-classified by a neural network and the output of the network is used as a parameter for the Hidden Markov Models. This has the advantage that data from shortly before and shortly after the time window currently being processed can also be used without increasing the complexity of the HMMs. In addition, the classification of the data and the context-sensitive composition (formation of meaningful words and sentences) can be separated from each other.
The language model then tries to determine the probability of certain word combinations and thereby exclude false or improbable hypotheses. Either a grammar model using formal grammars or a statistical model using N-grams can be used for this purpose.
A bigram or trigram statistic stores the probability of occurrence of combinations of two or three words. These statistics are obtained from large text corpora (sample texts). Each hypothesis determined by the recognizer is then checked and, if necessary, discarded if its probability is too low. In this way homophones, i.e. different words with identical pronunciation, can also be distinguished: "a piece of cake" would be judged more likely than "a peace of cake", although both are pronounced the same.
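A minimal sketch of how such a statistic ranks two acoustically identical hypotheses; the mini-corpus here is invented, whereas real systems estimate the counts from corpora of millions of words and smooth the probabilities of unseen combinations.

```python
from collections import Counter

def bigram_model(corpus):
    """Estimate P(w2 | w1) by counting word pairs in a corpus."""
    words = corpus.split()
    unigrams = Counter(words[:-1])            # counts of first words
    bigrams = Counter(zip(words, words[1:]))  # counts of word pairs
    return lambda w1, w2: (bigrams[(w1, w2)] / unigrams[w1]
                           if unigrams[w1] else 0.0)

def sentence_prob(p, sentence):
    """Score a hypothesis as the product of its bigram probabilities."""
    ws = sentence.split()
    score = 1.0
    for w1, w2 in zip(ws, ws[1:]):
        score *= p(w1, w2)
    return score

# Invented mini-corpus standing in for a large text collection
p = bigram_model("thank you very much thank you for everything thank you")

# Two acoustically identical hypotheses; the statistics prefer the first.
print(sentence_prob(p, "thank you") > sentence_prob(p, "thank ewe"))  # True
```

A trigram model works the same way but conditions on the two preceding words, which is why it needs the much larger training corpora discussed below.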
With trigrams, theoretically more accurate estimates of the occurrence probability of word combinations are possible than with bigrams. However, the sample text databases from which the trigrams are extracted must be significantly larger than for bigrams, because every permitted three-word combination must appear in them in statistically significant numbers (i.e. each significantly more than once). Combinations of four or more words were for a long time not used, because it is generally no longer possible to find sample text databases containing all word combinations in sufficient numbers. An exception is Dragon, which from version 12 onward also uses pentagrams (five-word combinations), increasing recognition accuracy in that system.
When grammars are used, they are mostly context-free grammars. However, each word must then be assigned its function within the grammar, which is why such systems are usually used only for limited vocabularies and special applications, not in common PC speech recognition software.
The quality of a speech recognition system can be indicated with various figures. In addition to recognition speed, usually given as a real-time factor (RTF), recognition quality can be measured as word accuracy or word recognition rate.
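Word accuracy is commonly derived from the word error rate: the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch (the example sentences are invented):

```python
def word_error_rate(reference, hypothesis):
    """Minimum edit distance between the word sequences, divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # standard Levenshtein dynamic-programming table
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

wer = word_error_rate("the train leaves at nine", "the train leaves at five")
print(f"WER: {wer:.2f}, word accuracy: {1 - wer:.2f}")
# WER: 0.20, word accuracy: 0.80
```

Note that because insertions are counted, the word error rate can exceed 100 percent, so word accuracy can be negative for very poor hypotheses.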
For the integration of professional speech recognition systems, predefined vocabularies already exist that are intended to facilitate work with speech recognition; in the case of SpeechMagic these are called ConTexts, in the case of Dragon, Datapacks. The better a vocabulary is adapted to the vocabulary and dictation style (frequency of word sequences) of the speaker, the higher the recognition accuracy. In addition to the speaker-independent lexicon (technical and basic vocabulary), a vocabulary also includes an individual word-sequence model (language model). All words known to the software are stored in the vocabulary with their phonetics and spelling, so that a spoken word can be recognized by its sound. If words differ in meaning and spelling but sound the same, the software falls back on the word-sequence model, which defines the probability with which one word follows another for a particular user. Speech recognition in smartphones uses the same technical concepts, but without the user having any influence on the predefined vocabulary. Newer technologies are moving away from the idea of a rigid stored word list, since compound words can be formed. Common to all systems is that they learn individual words and phrases only through corrections made by the respective user.
Speech recognition is nowadays used, among other things, in smartphones, e.g. with Siri, Google Now, Cortana, Amazon's Echo/Alexa, and Samsung's S Voice. With the now high reliability in everyday language (e.g. on smartphones) or in specialist language (customizable professional systems), speech can be converted into text (speech to text), commands and controls can be executed (command and control), and semantic analyses can be performed (language understanding).
- Giancarlo Pirani (ed.): Advanced Algorithms and Architectures for Speech Understanding. Vol. 1. Springer Science & Business Media, 2013, ISBN 978-3-642-84341-9.
- Lawrence R. Rabiner, Ronald W. Schafer: Digital Processing of Speech Signals. 1978, ISBN 0-13-213603-1.
- Matthias Woelfel, John McDonough: Distant Speech Recognition. 2009, ISBN 0-470-51704-2.
- Lawrence R. Rabiner, Biing-Hwang Juang: Fundamentals of Speech Recognition. 1993, ISBN 0-13-015157-2.
- Ernst Günter Schukat-Talamazzini: Automatic Speech Recognition. Fundamentals, Statistical Models and Efficient Algorithms. Vieweg, Braunschweig/Wiesbaden 1995, ISBN 3-528-05492-1.
- Guideline "Accessible Hearing and Communication in the World of Work": speech recognition software; the Hörkomm.de project supports the inclusion of hard-of-hearing employees.
- Speech Understanding and Dialogue. Retrieved May 22, 2020.
- Jeremy Peckham: Speech Understanding and Dialogue over the Telephone: An Overview of the ESPRIT SUNDIAL Project. LDS, 1991.
- Morena Danieli, Elisabetta Gerbino: Metrics for Evaluating Dialogue Strategies in a Spoken Language System. Proceedings of the 1995 AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation. Vol. 16, 1995.
- Alberto Ciaramella: A Prototype Performance Evaluation Report. Sundial Work Package 8000, 1993.
- F. Charpentier, G. Micca, E. Schukat-Talamazzini, T. Thomas: The Recognition Component of the SUNDIAL Project. In: Speech Recognition and Coding. Springer, Berlin/Heidelberg 1995, pp. 345-348.
- Michael Spehr: Dictating is much faster than typing. In: FAZ.net, September 22, 2010, retrieved October 13, 2018.
- L. Lamel, J.-L. Gauvain: Speech Recognition. Oxford Handbooks Online (Vol. 14). Oxford University Press, 2005, doi:10.1093/oxfordhb/9780199276349.013.0016.
- Rainer Malaka, Andreas Butz, Heinrich Hußmann: Medieninformatik: An Introduction. Pearson Studium, Munich 2009, ISBN 978-3-8273-7353-3, p. 263.
- Ulf Schoenert: Speech recognition: the normality of conversation with machines. In: Zeit Online, February 14, 2012, retrieved February 6, 2016.