VoiceXML

from Wikipedia, the free encyclopedia

VoiceXML (Voice Extensible Markup Language) is an XML application for describing dialog flows in a spoken dialog system. It was developed specifically for telephone applications. The current version, VoiceXML 2.1, has been a recommendation of the World Wide Web Consortium (W3C) since June 2007 and therefore has the same standing as a web standard as HTML. Applications developed in VoiceXML are thus intended to run on any VoiceXML-compliant speech platform. By analogy with the HTML web browser, VoiceXML interpreters are also called voice browsers.

As graphical user interfaces on the World Wide Web have been extended with spoken input and output, up to fully multimodal user interfaces, further dialog description languages have been developed as supplements or alternatives to VoiceXML:

  • SALT was initiated by Microsoft and is meant to tie voice applications more closely to the content and mechanisms of the World Wide Web.
  • X+V combines XHTML and VoiceXML elements to bring internet and telephony together.
  • The Web Speech API allows websites to be extended with voice input and output under the control of ECMAScript.

Development history

In the first speech applications there was no separation between application and platform. Dialog flows were programmed and compiled in just as “hard-wired” a fashion as, for example, the interfaces to the telephone system. The advantage was that voice applications could usually be built quickly and ran reliably, but the result was a rigidity that would be unacceptable by today's standards. If a dialog had to be changed, for example, the application programmer had to make changes deep in the source code.

In newer speech applications, the application was therefore separated from the platform so that dialogs could be maintained more easily. The scripting languages or tools for describing these applications were (and in some cases still are) proprietary; that is, they differed from vendor to vendor.

VoiceXML 2.0 is a W3C standardization effort aimed at a uniform description of speech applications; the specification was published on March 16, 2004. It also serves as an interface language for communication between the application and the platform. The standard has since been widely adopted and is supported by numerous vendors. Alongside the proprietary solutions and application platforms that remain widespread in the market, there are competing standardization approaches, in particular the SALT standard promoted by a Microsoft-led consortium.

VoiceXML 2.1 was released on June 19, 2007 and extends version 2.0 with a small number of additional features. These are intended to address shortcomings identified while working with VoiceXML 2.0. Version 2.1 is fully backward compatible with version 2.0.

The specification for VoiceXML 3.0 is currently in development. This version is intended to be a complete redesign of the specification so that it can serve as a domain-specific language for developing voice interfaces outside telephony. Backward compatibility with VoiceXML 2.1 is to be provided through a dedicated profile.

Analogies to the World Wide Web

Comparing VoiceXML with HTML reveals a number of parallels. Like HTML, VoiceXML is both a description language and an interface standard:

  • VoiceXML can be used directly to code speech applications, just as HTML can be used directly to code user interfaces.
  • Alternatively, the application can be defined with a proprietary tool that generates VoiceXML code from it (dynamically or statically). This corresponds to using a content management system to maintain a website. In this case, VoiceXML is largely reduced to its role as an interface standard.
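
The second route, generating VoiceXML from a higher-level definition, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how any particular tool works: the `build_menu` function and the menu data are hypothetical, and only the element names (`vxml`, `menu`, `prompt`, `choice`) and the `version`, `xmlns` and `next` attributes are taken from VoiceXML 2.x.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"  # VoiceXML 2.x namespace


def build_menu(prompt_text, choices):
    """Return a VoiceXML menu document as a string.

    `choices` maps the phrase a caller may say to the URI of the next
    document. Both the function and its inputs are hypothetical.
    """
    vxml = ET.Element("vxml", {"version": "2.1", "xmlns": VXML_NS})
    menu = ET.SubElement(vxml, "menu")
    ET.SubElement(menu, "prompt").text = prompt_text
    for phrase, target in choices.items():
        # Each choice names the document to load when the phrase is spoken.
        choice = ET.SubElement(menu, "choice", {"next": target})
        choice.text = phrase
    return ET.tostring(vxml, encoding="unicode")


document = build_menu(
    "Say timetable or tickets.",
    {"timetable": "timetable.vxml", "tickets": "tickets.vxml"},
)
print(document)
```

Because the generator only emits standard elements, the same output could in principle be handed to any VoiceXML-compliant platform, which is the sense in which VoiceXML acts here as an interface standard.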

Given the current state of technology, however, the analogy breaks down at one important point: the VoiceXML browser (as part of the platform) is not yet located in the end customer's telephone but, often for reasons of efficiency, in the same server room as the application server. Communication between the caller and the platform takes place over the public telephone network. For the caller, and often for the operator as well, the standard by which platform and application communicate is therefore of little importance. Standardization becomes genuinely important for the caller (more precisely, the user of the voice application) only once increased computing power allows the browser, and with it the speech recognizer and the speech synthesizer, to run on the phone itself. The situation is in some ways still comparable to the question of whether a user interface for a locally run application should be implemented in HTML, in Visual Basic, or with a (proprietary) GUI builder: what matters most is the quality of the resulting user interface.
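
In this arrangement the voice browser typically retrieves VoiceXML documents from the application server much as a web browser retrieves HTML pages. The following is a minimal sketch of the server side, assuming delivery over HTTP through Python's WSGI interface. The `application` function and the welcome document are hypothetical; `application/voicexml+xml` is the media type registered for VoiceXML.

```python
from wsgiref.util import setup_testing_defaults

# Static document the hypothetical application server hands to the voice browser.
WELCOME_VXML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">'
    "<form><block><prompt>Welcome.</prompt></block></form>"
    "</vxml>"
)


def application(environ, start_response):
    """WSGI app that answers every request with the same VoiceXML document."""
    body = WELCOME_VXML.encode("utf-8")
    headers = [
        ("Content-Type", "application/voicexml+xml"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


# Simulate one request from a voice browser without opening a socket.
environ = {}
setup_testing_defaults(environ)
captured = {}


def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


response_body = b"".join(application(environ, start_response))
print(captured["status"], captured["headers"]["Content-Type"])
```

The caller never sees this exchange; they only hear the prompt that the platform renders, which is why the choice of interface standard matters little from their side of the telephone line.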

Limits

The feature set of the VoiceXML standard is a compromise, so a desired feature may be unsupported or supported only in a later version. In that case, VoiceXML can be extended with proprietary additions. This dilutes the advantages described above somewhat, but it is still more practical than building the entire system on a proprietary scripting language.

As a scripting language for application development, VoiceXML rests on the premise that dialogs between humans and machines can be formalized as explicitly predefined flowcharts. In this model the caller “navigates” through the predefined dialog flow, often with explicit navigation commands such as “back” and “main menu”. The approach reaches its limits where the interaction approaches a free human-machine dialog in which the caller can take the initiative by speaking entire sentences, e.g. “no, to Hamburg, so that I get there around 6 p.m.” (so-called conversational or mixed-initiative dialogs). VoiceXML does have constructs that give the caller some freedom in moving through the dialog flow (e.g. form filling); however, because of the underlying principle, the application development effort rises sharply as the dialog becomes freer. Introducing a dialog manager, which determines the system response dynamically from the dialog history, has proven useful for implementing such dialogs. Such a dialog manager can generate VoiceXML documents dynamically, using VoiceXML as the interface to the speech platform.
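
The dialog-manager idea can be illustrated with a small slot-filling sketch in Python. Everything here is an assumption made for illustration: the `DialogManager` class, the slot names and the prompts are hypothetical, speech recognition is replaced by ready-made dictionaries, and the generated fields omit the grammars a real platform would need.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical slots for a train-booking dialog and the question for each.
SLOT_PROMPTS = {
    "origin": "Where are you travelling from?",
    "destination": "Where do you want to go?",
    "arrival_time": "When do you want to arrive?",
}


class DialogManager:
    """Tracks filled slots and chooses the next system turn."""

    def __init__(self):
        self.slots = {}

    def update(self, interpretation):
        """Merge one recognition result; later values overwrite earlier ones."""
        self.slots.update(interpretation)

    def next_document(self):
        """Return a VoiceXML document for the next turn as a string."""
        vxml = ET.Element("vxml", {"version": "2.1", "xmlns": VXML_NS})
        form = ET.SubElement(vxml, "form")
        missing = [name for name in SLOT_PROMPTS if name not in self.slots]
        if missing:
            # Ask only for the first slot the caller has not supplied yet.
            field = ET.SubElement(form, "field", {"name": missing[0]})
            ET.SubElement(field, "prompt").text = SLOT_PROMPTS[missing[0]]
        else:
            # All slots are known: read the booking back to the caller.
            block = ET.SubElement(form, "block")
            ET.SubElement(block, "prompt").text = (
                "Booking from {origin} to {destination}, "
                "arriving around {arrival_time}.".format(**self.slots)
            )
        return ET.tostring(vxml, encoding="unicode")


dm = DialogManager()
dm.update({"origin": "Berlin", "destination": "Munich"})
# The caller corrects the destination and volunteers the arrival time.
dm.update({"destination": "Hamburg", "arrival_time": "18:00"})
print(dm.next_document())
```

The dialog logic lives entirely in the manager, so the caller can supply or correct several slots in one utterance without the VoiceXML documents having to encode every possible path; each generated document only describes the next single turn.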

Multimodal applications, i.e. the combination of speech with graphical output, are currently supported by VoiceXML only to a limited extent. There are, however, efforts to establish multimodal dialog description languages. X+V (XHTML+Voice) is an attempt to merge VoiceXML with XHTML by means of special synchronization elements. Another approach is SALT, which is designed as an add-on to HTML but relies on a proprietary approach, different from VoiceXML, for the speech functions. So far, however, the main problem with these technical solutions is the lack of a convincing use case for their practical deployment.
