Speech dialogue system

A speech dialogue system, also called a voice portal or IVR system (Interactive Voice Response), lets callers conduct partially or fully automated natural-language dialogues over the telephone or other acoustic media. For example:


Caller: "What is the daily high and the current price of the F company in Frankfurt?".

Answer of the speech dialogue system: "The daily high of F in Frankfurt is xxx, yy euros and currently F stands at xxx, yy euros."

In practice, IVR also includes other telephone input options, such as dual-tone multi-frequency (DTMF) dialing ("For sales, please press '1' now; for service, please press '2', ..."). In telecommunications, IVR systems allow customers to interact with a company's host system using the telephone keypad or speech recognition, so that they can obtain information through the IVR system. IVR systems can respond with pre-recorded or dynamically generated speech to guide users on how to proceed. IVR systems deployed in a network are sized to handle large call volumes.
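
The key-press menu quoted above can be sketched as a simple lookup table. The following Python sketch is illustrative only: the names `MENU`, `route_keypress` and `build_prompt`, as well as the destination labels, are assumptions made for this example and do not come from any particular IVR product.

```python
# Hypothetical DTMF menu: maps a pressed key to a destination queue.
# The keys and destinations mirror the "sales" / "service" example in the text.
MENU = {
    "1": "sales",
    "2": "service",
}


def route_keypress(digit, menu=MENU):
    """Return the destination for a DTMF digit.

    Unassigned keys return 'repeat_menu' so the caller hears the prompt again.
    """
    return menu.get(digit, "repeat_menu")


def build_prompt(menu=MENU):
    """Build the spoken prompt from the same table used for routing,
    so that the announcement and the routing cannot drift apart."""
    parts = [f"for {dest}, please press '{key}'" for key, dest in menu.items()]
    text = "; ".join(parts) + "."
    return text[0].upper() + text[1:]
```

Generating the prompt from the routing table is one possible design choice; a production system would typically play pre-recorded audio or synthesized speech rather than return a string.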

Basic structure

IVR systems consist of the following components:

Figure 1: Architecture of IVR systems (Daniel Wimpff, 2008)

Biometric procedures for speaker authentication ("the voice as a password") are available and have been certified as secure by the German Federal Office for Information Security (BSI).

Thanks to advances in speech recognition in recent years, dialogues consisting of entire sentences are now possible. Natural language understanding (NLU) requires intelligence on the part of the dialogue partner: to use NLU effectively, the artificial intelligence of the dialogue system must keep pace with the capabilities of the speech recognizer. Now that the core technology is considered largely mature, new disciplines are coming to the fore for developers of speech dialogue systems, for example dialogue design.


IVR systems are used to handle high call volumes, reduce costs and improve the customer experience. They can be used for mobile purchases, bank payments and services, ordering from retailers, utilities, travel information, and weather reports. IVR systems also enable callers to access data relatively anonymously. Their spread is attributed to increased CPU performance and the migration of voice applications from proprietary code to the VoiceXML standard.

Fields of application

IVR systems make it possible to use speech as an additional input/output medium alongside keyboard, mouse and monitor.

Technically, the types of application can be divided into

  • Pure voice services: offer interaction only via speech, and
  • Multimodal applications: combine voice interaction with other input/output media (e.g. graphical interfaces).

In the following, the types of application are further divided by user group into commercial voice services (business-to-consumer, business-to-business), in-house voice services, and device-integrated voice services (hardware and software control, computer games).

Commercial voice services

As of 2009, purely voice-based commercial services were still mostly rejected by German consumers. Because users cannot be instructed in person, do not know how the systems work, and may be annoyed by advertising delivered through the voice service, end customers often take a negative attitude toward voice services. The following fields of application are typical of the commercial sector:

  • Services for end customers (business-to-consumer):
    • Information and advice on the phone, e.g. timetables and flight schedules
    • Automatic ordering / reservation on the phone, e.g. ticket hotlines, catalog orders, telephone banking
    • Automatic switchboard / operator
    • Prequalification / authorization of callers, e.g. querying the customer number or PIN
    • Intelligent waiting queues in call centers
    • Fault announcement management
    • Televoting, competitions on the phone

In-house voice services (for employees)

Speech processing is currently hardly used in-house, although there is great potential here: in-house users can be instructed in how to operate the voice service, and they work with it regularly. This leads to efficient use with a high level of user acceptance. Internal processes can be greatly accelerated while data-entry error rates fall, because fewer media discontinuities (switches between media) occur.

  • Receipt of goods
  • Quality check, running check, final product acceptance
  • Inventory
  • Plant inspection
  • Process-oriented event reporting
  • Remote and on-site diagnosis

Device-integrated voice services

As of 2009, device-integrated dialogue systems were only slightly better received. High-quality speech recognition requires high computing power with a correspondingly high energy demand, so satisfactorily working solutions were initially found only in the on-board systems of individual luxury cars, in computer games, or in special application software. Examples of device-integrated speech recognition are:

  • Hands-free devices in motor vehicles
  • Navigation systems in motor vehicles
  • Voice dialing on mobile phones by speaking the contact's name
  • Computer games
    • As of 2009, the first computer games existed that incorporated voice input and output into their user interface and game concept. Since computer games are already a major technology driver in graphics, they could perhaps play a similar role in speech technology in the future.
  • Application software for people with physical disabilities
  • Cooperative machine control
    • Closer cooperation between humans and machines, e.g. for the use of industrial robots in craft businesses, is a current research subject.

Advantages and limits of interactive speech dialogue systems

In contrast to conventional graphical user interfaces, speech can be used to communicate directly and naturally:

  • Benefits of voice interaction
    • The hands and eyes remain free (improves ergonomics and process time).
    • Speech is immediately accessible to people (no extensive training or long familiarization with a user interface is required).
    • The demands on the end device are low (a telephone or headset with a good microphone is sufficient).
    • The general availability of (mobile) telephones allows new degrees of freedom when interacting with software applications.
    • Modern speaker-independent recognition understands utterances by different people without training (multilingual applications are possible; dialects are also tolerated to a certain extent).
    • All information elements are directly accessible (no tedious navigation through hierarchical menus and long lists).
    • Complex sentences can be understood and automatically processed within a specific context (for example, to reserve a company car via a telephone connection: "Hello. I would like a car for the Stuttgart - Darmstadt route on Thursday from 6 a.m. to 10 p.m.").
    • Visual tasks require a lot of attention. Dialogues can practically be conducted "on the side".

This enormous flexibility of speech technology creates new potential for innovation, e.g. for integrated company processes and their coordination.
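
The company-car example above shows why a restricted context makes full sentences tractable: within one task, only a few pieces of information (slots) have to be extracted. The following Python sketch illustrates this under strong assumptions: it works on text that a speech recognizer is assumed to have already produced, and the pattern `RESERVATION`, the function `extract_slots` and the slot names are hypothetical. Real systems typically use speech grammars or statistical models rather than a single regular expression.

```python
import re

# Hypothetical pattern for one narrow context: reserving a company car.
# It only covers phrasings close to the example sentence in the text.
RESERVATION = re.compile(
    r"car for the (?P<origin>\w+)\s*-\s*(?P<destination>\w+) route "
    r"on (?P<day>\w+) "
    r"from (?P<start>\d{1,2} [ap]\.m\.) to (?P<end>\d{1,2} [ap]\.m\.)",
    re.IGNORECASE,
)


def extract_slots(utterance):
    """Return the reservation slots found in the recognized text,
    or None if the utterance does not fit the expected context."""
    match = RESERVATION.search(utterance)
    return match.groupdict() if match else None


slots = extract_slots(
    "Hello. I would like a car for the Stuttgart - Darmstadt route "
    "on Thursday from 6 a.m. to 10 p.m."
)
```

For the example sentence, `slots` holds the origin, destination, day, start and end time; an utterance outside this context yields `None`, which a dialogue system could answer with a clarifying question.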

NLU is the most natural form of computer interaction, but the possibilities of presenting information are limited compared to visual media:

  • Limits of voice interaction
    • No 100 percent recognition
      • Very extensive vocabularies are problematic (more terms with similar pronunciation).
      • Perfect recognition is not to be expected even in the foreseeable future (variability of the human voice).
    • Harsh environmental conditions
      • Recurring environmental noise can now be filtered out well by signal processing and software.
      • Filtering out human voices in the background, however, remains problematic.
    • Navigation and menu structures
      • The user must first become familiar with the navigation options and functions of a voice application. Solution: graduated application modes for beginners and advanced users to allow efficient use.
      • Convincing process times are possible with regular use.
      • Human perception can take in long lists very well visually; acoustically, however, a lot of information listed in one go is difficult to follow.
      • Example: most Internet users first enter simple search terms, examine the results, and then refine the search. It usually takes two to three quick iterations to get the desired set of results. With "spoken results" this approach would be time-consuming and therefore impractical.
    • Unrealistic expectations
      • Users have to know “the rules”. Computers do not “understand”; this is merely speech “recognition”.
      • Today's speech recognition techniques match the spoken words against a list of expected utterances, the size of which is limited to a few thousand entries. When developing a speech dialogue system, assumptions must be made about what might be asked. Based on these, question-and-answer dialogues must be developed that lead the caller to specific information. A dialogue could look like this, for example: “Are you looking for information about a company, a film, traffic information…?” “Company.” “What kind of company?” “Restaurant!” “What kind of restaurant?” “Chinese!” “In which street, district or near which restaurant?” Even if this procedure can work and be helpful for the caller, it falls far short of what is possible with free-text entry in an Internet search engine.
  • New cultural technique
    • Spoken interaction with computers is a new cultural technique. Users and developers will only converge on common, generally known dialogue concepts (building blocks) over time.
    • One should therefore not be put off by poorly designed applications, but instead set up and use economical solutions.
    • “Speech is the bicycle of user-interface design, it is great fun to use […], but it can only carry a light load. Sober advocates know that it will be tough to replace the automobile: graphic user-interfaces.” (Ben Shneiderman, 1998)
  • Natural dialogue systems
    • Natural user interfaces should enable the user to get to the information they want in the simplest possible way (i.e. above all without special training or experience). However, current IVR interfaces usually require the user to be familiar with the operation of such a system. Furthermore, the power of natural language is often not used, as its interpretation is still extremely complex.
    • The naturalness (user-adapted operation) of a dialog system can be described using the following properties:
      • Adaptivity
      • Implicit confirmation
      • Inquiries and ambiguity resolution
      • Correction options
      • Over-answering
      • Interpretation of negations
      • Discourse and references
      • Interpretation of colloquial language
      • Type of formulation / speech generation
      • Social behavior
      • Quality of speech recognition and synthesis
    • Limits due to a lack of development environments
      • In addition to the end user, the developer must also be considered. As long as there are no easy-to-use tools for creating dialogue systems, the results will not be user-friendly either: “When comparing the systems, however, it becomes apparent that many of the properties of natural dialogue systems have not yet been implemented. This is mainly due to the lack of a comprehensive dialog modeling and implementation tool.”
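
The guided question-and-answer dialogue described under "Unrealistic expectations" can be sketched as a small state machine in which each state accepts only a short list of expected utterances. The Python sketch below is a simplified illustration, not a real recognizer: the `DIALOGUE` table, its state names and the function `next_state` are hypothetical, the table is deliberately truncated, and approximate string matching on already-recognized text stands in for the acoustic matching a speech recognizer performs.

```python
import difflib

# Hypothetical, truncated dialogue tree. Each state has a prompt and a
# mapping from expected utterances to the follow-up state.
DIALOGUE = {
    "start": {
        "prompt": "Are you looking for information about a company, a film, or traffic?",
        "expected": {
            "company": "company_type",
            "film": "film_title",
            "traffic": "traffic_region",
        },
    },
    "company_type": {
        "prompt": "What kind of company?",
        "expected": {"restaurant": "restaurant_type", "hotel": "hotel_location"},
    },
    "restaurant_type": {
        "prompt": "What kind of restaurant?",
        "expected": {"chinese": "restaurant_location", "italian": "restaurant_location"},
    },
}


def next_state(state, recognized_text, cutoff=0.6):
    """Match the recognized text against the expected utterances of the
    current state and return the follow-up state.

    If nothing is similar enough, the same state is returned so that the
    system repeats its prompt.
    """
    expected = DIALOGUE[state]["expected"]
    cleaned = recognized_text.strip().lower().rstrip("!.?")
    matches = difflib.get_close_matches(cleaned, list(expected), n=1, cutoff=cutoff)
    if not matches:
        return state
    return expected[matches[0]]
```

Walking through the example dialogue, "Company." leads from `start` to `company_type`, "Restaurant!" to `restaurant_type`, and "Chinese!" to `restaurant_location`. Anything outside the expected lists leaves the state unchanged, which illustrates the limitation the text describes: the system can only follow paths its designers anticipated.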

Criteria for the use of speech dialogue systems

The following criteria favor the use of speech technologies in business applications:

  • The employee ...
    • has little computer experience
    • has writing / reading weaknesses
    • only speaks foreign languages
  • The activity demands ...
    • free hands and a clear view
    • input that is easy to put into words
    • mobility
    • frequently repeated tasks
  • The working environment involves ...
    • difficult visual perception
    • lack of space, no screen / keyboard
    • switching between the activity and a computer workstation that is unergonomic or time-consuming

References

  1. Enhancing customer engagement with interactive voice response.
  2. After: Jürgen Hoffmeister, Christel Müller, Engelbert Westkämper: Language technology in use - language portals. Springer Berlin Heidelberg 2008, ISBN 978-3-540-72435-3, p. 85.
  3. After: Jürgen Henke, Ronny Egeler: Speech recording in industrial EDP systems. Slide 7, lecture, Fraunhofer IPA, Stuttgart 2008, PDF document (archived August 28, 2012 in the Internet Archive).
  4. IVR or speech dialogue systems.
  5. Suendermann, David: Advances in Commercial Deployment of spoken dialogue system. Springer Science + Business Media, Berlin 2011, ISBN 9781441996107, pp. 9-11.
  6. Lam: Validation of interactive voice response system administration of the Short Inflammatory Bowel Disease Questionnaire. In: Inflammatory Bowel Diseases. 2009, pp. 599-607. doi: 10.1002/ibd.20803. PMID 19023897.
  7. Compare: Susanne Feldt, Kai-Werner Fajga, Christoph Pause: Voice Business Yearbook 2009, telepublic Verlag, Hannover 2008, ISBN 978-3-939752-01-1, pp. 30–68.
  8. Christopher Parlitz: PowerMate - Unlimited human-robot cooperation. Fraunhofer IPA, 2005, PDF document (archived September 1, 2011 in the Internet Archive).
  9. Ben Shneiderman: Designing the User Interface: Strategies for Effective Human-Computer Interaction, 3rd edition, Addison-Wesley, 1998.
  10. Markus Berg: Natural Language in Dialog Systems, Computer Science Spectrum 36/4, pp. 371–381, Springer, 2013, doi: 10.1007/s00287-012-0650-3.
  11. Matthias Peissner: Presentation - Success Factors for the Use of Voice Interaction, Slide 9, Stuttgart 2008, PDF document (archived August 28, 2012 in the Internet Archive).