Speech pause detection

Speech pause detection (voice activity detection, VAD) is a speech processing technique that detects the presence or absence of human speech in a signal. Its main uses are in speech coding and speech recognition. It can simplify speech processing and can be used to deactivate certain processes during pauses in speech: in IP telephony applications, for example, it avoids unnecessary coding and transmission of silence packets and thus saves computing power and transmission capacity.

Speech pause detection is a key technology for a variety of speech-based applications. Various algorithms have therefore been developed, each with different characteristics and each representing a different trade-off between latency, sensitivity, precision and computational cost. Some algorithms also provide further analysis data, for example whether the speech is voiced, unvoiced or sustained. Speech pause detection is usually language-independent.

It was first investigated for use in time-assignment speech interpolation (TASI) systems.

Algorithm

The typical design of a VAD algorithm is as follows:

There may first be a noise reduction stage, for example spectral subtraction.
Then some features or quantities are calculated from a section of the input signal.
A classification rule is applied to classify the section as speech or as a pause in speech; often this rule checks whether a value exceeds a threshold.

This sequence may include feedback, in which the speech pause decision is used to adapt the background noise estimate or to adjust the threshold value(s) dynamically. These feedback mechanisms improve detection performance when the background noise changes.
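
The steps and the feedback loop described above can be sketched as a simple energy-based detector. This is only a minimal illustration, not any standardized method: the function names, the 6 dB margin and the smoothing factor `alpha` are assumptions chosen for the example, and the noise reduction stage is omitted.

```python
import math

def frame_energy_db(frame):
    """Feature: mean-square energy of one frame in dB (small floor avoids log(0))."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)

def energy_vad(frames, margin_db=6.0, alpha=0.9):
    """Label each frame True (speech) or False (pause).

    margin_db and alpha are illustrative values, not taken from any standard.
    The first frame is assumed to contain only background noise.
    """
    noise_db = None
    decisions = []
    for frame in frames:
        energy = frame_energy_db(frame)             # feature extraction
        if noise_db is None:
            noise_db = energy                       # initial noise estimate
        is_speech = energy > noise_db + margin_db   # threshold classification
        if not is_speech:
            # Feedback: the noise estimate is updated only in detected pauses,
            # so the effective threshold follows slowly changing background noise.
            noise_db = alpha * noise_db + (1.0 - alpha) * energy
        decisions.append(is_speech)
    return decisions

# Usage: three quiet frames, two loud frames, two quiet frames.
quiet = [0.01, -0.01] * 80
loud = [0.5, -0.5] * 80
decisions = energy_vad([quiet] * 3 + [loud] * 2 + [quiet] * 2)
```

Updating the noise estimate only during detected pauses keeps speech energy from leaking into it; the drawback is that a wrong decision can also prevent the estimate from adapting.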

A representative set of recently published pause detection methods formulates the decision rule frame by frame, using instantaneous measures of the divergence distance between speech and noise. The measures used for detecting pauses in speech include spectral slope, correlation coefficients, log likelihood ratio, cepstral, weighted cepstral and modified distance measures.

Regardless of the pause detection algorithm chosen, a compromise must be made between classifying background noise as speech and classifying speech as background noise (that is, between false positives and false negatives). A speech pause detector operating in a mobile phone must be able to detect speech in the presence of a wide range of very different types of acoustic background noise. Under these difficult conditions it is often desirable to have a conservative detector that, in case of doubt, classifies the signal as speech in order to reduce the risk of losing speech segments. The greatest difficulty in detecting speech in this environment is the very low signal-to-noise ratio encountered. If parts of the utterance are buried in noise, it may be impossible to distinguish between speech and noise using simple level detection.
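
The trade-off can be made concrete by counting both kinds of error for different thresholds on a labelled set of frames. The sketch below is illustrative only: the function name and the frame energies are invented for the example, and it assumes that the labelled data contains both speech and noise frames.

```python
def error_rates(energies_db, is_speech, threshold_db):
    """Return (false_positive_rate, false_negative_rate) for one threshold.

    A frame is classified as speech when its energy exceeds threshold_db.
    Assumes the labels contain at least one speech and one noise frame.
    """
    pairs = list(zip(energies_db, is_speech))
    false_pos = sum(1 for e, s in pairs if not s and e > threshold_db)
    false_neg = sum(1 for e, s in pairs if s and e <= threshold_db)
    n_noise = sum(1 for _, s in pairs if not s)
    n_speech = sum(1 for _, s in pairs if s)
    return false_pos / n_noise, false_neg / n_speech

# Hypothetical frame energies (dB) with hand-made labels.
energies = [-50, -45, -38, -30, -36, -25, -20, -10]
labels = [False, False, False, False, True, True, True, True]

conservative = error_rates(energies, labels, threshold_db=-40)  # favours speech
strict = error_rates(energies, labels, threshold_db=-28)        # favours pauses
```

With these invented values the low threshold loses no speech but passes half of the noise frames, whereas the high threshold rejects all noise and clips a quarter of the speech frames.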

Applications

Speech pause detection is a fundamental part of various voice communication systems such as conference call applications, echo cancellation, speech recognition, speech coding and hands-free telephony.
In multimedia applications, pause detection allows voice and data applications to be used simultaneously.
Similarly, in Universal Mobile Telecommunications Systems (UMTS) it reduces the average bit rate and improves the overall voice quality.

In mobile radio systems (e.g. GSM and CDMA2000) that use discontinuous transmission (DTX), speech pause detection is essential for increasing overall capacity by reducing co-channel interference and the energy consumption of mobile devices.

For a wide range of applications such as digital mobile radio, digital simultaneous voice and data (DSVD) or voice recording, discontinuous transmission of speech coding parameters is desirable. The advantages can include lower average energy consumption in mobile devices, a higher average bit rate for simultaneous services such as data transmission, or more capacity on memory chips. However, the gain depends on the proportion of pauses in the conversation and on the reliability of the speech pause detection used. On the one hand, a low proportion of signal classified as speech is advantageous. On the other hand, clipping, i.e. the loss of portions of active speech, should be minimized in order to maintain quality. This is the crucial problem for a speech pause detection algorithm under strong background noise.
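
How the saving depends on the proportion of pauses can be shown with a short calculation. The sketch assumes a simplified model in which the coder runs at one fixed rate during speech and at a lower fixed rate during pauses; the function name and the example rates are illustrative only.

```python
def average_bit_rate(speech_rate_bps, pause_rate_bps, speech_activity):
    """Average bit rate with discontinuous transmission.

    speech_activity is the fraction of time classified as speech (0.0 to 1.0).
    The two-rate model is a simplifying assumption made for this example.
    """
    if not 0.0 <= speech_activity <= 1.0:
        raise ValueError("speech_activity must be between 0 and 1")
    return (speech_activity * speech_rate_bps
            + (1.0 - speech_activity) * pause_rate_bps)

# Hypothetical example: 8 kbit/s during speech, nothing sent during pauses,
# and speech detected half of the time.
rate = average_bit_rate(8000, 0, 0.5)
```

In this model every frame of noise that is wrongly classified as speech raises `speech_activity` and therefore the average rate, which is why the reliability of the detector directly limits the achievable saving.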

Use in telephone sales

A controversial application of pause detection is its use with predictive dialers operated by telephone sales companies. To maximize agent productivity, these companies configure predictive dialers to call more numbers than there are agents available, knowing that most calls go unanswered or are picked up by answering machines. When a person answers, they usually speak briefly ("Hello", "Good evening" etc.), followed by a period of silence. Answering machine announcements typically contain 3 to 15 seconds of continuous speech. With correctly chosen speech pause detection parameters, a dialer can determine whether a person or an answering machine has answered the call and, if it is a person, transfer the call to an available agent. If an answering machine is detected, the dialer hangs up. It often happens that the system correctly detects that a person has answered but no agent is available.
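
The distinction described above can be sketched as a rule on the length of the first uninterrupted stretch of speech. This is a hypothetical illustration rather than the logic of any real dialer: the function name, the 20 ms frame length and the gap tolerance are assumptions, and only the 3-second limit is taken from the typical announcement length mentioned above.

```python
def classify_answer(decisions, frame_ms=20, machine_speech_ms=3000, max_gap_ms=300):
    """Guess "machine" or "person" from per-frame speech decisions.

    decisions: iterable of booleans (True = speech) from the start of the call.
    A stretch of speech of at least machine_speech_ms, in which gaps of up to
    max_gap_ms are bridged, is taken as an answering machine announcement.
    """
    run_ms = 0   # length of the current stretch of speech
    gap_ms = 0   # silence since the last speech frame in that stretch
    for is_speech in decisions:
        if is_speech:
            run_ms += frame_ms + gap_ms   # a short gap counts as part of the stretch
            gap_ms = 0
            if run_ms >= machine_speech_ms:
                return "machine"
        elif run_ms > 0:
            gap_ms += frame_ms
            if gap_ms > max_gap_ms:       # the silence is too long: stretch has ended
                run_ms = 0
                gap_ms = 0
    return "person"

# A short greeting followed by silence, and a long continuous announcement.
greeting = [True] * 25 + [False] * 100
announcement = [True] * 50 + [False] * 10 + [True] * 100
```

Here `classify_answer(greeting)` gives `"person"` and `classify_answer(announcement)` gives `"machine"`. A person who speaks for longer than the limit would be misclassified, which is one reason the choice of parameters matters.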

Performance evaluation

To evaluate a speech pause detection method, its output on test recordings is compared with that of an "ideal" speech pause detection, which is created by manually marking the presence and absence of speech in the recordings. The performance of a speech pause detection is typically assessed using the following four parameters:

FEC (Front End Clipping): speech clipped at the transition from background noise to speech;
MSC (Mid Speech Clipping): speech clipped because speech content is misclassified as background noise;
OVER: noise interpreted as speech because the speech decision persists after the transition from speech to noise;
NDS (Noise Detected as Speech): noise within a period of silence interpreted as speech.
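
The four parameters above can be counted from a manually labelled reference and the detector output, both given as one decision per frame. The sketch below is one plausible frame-level reading of the definitions, not the procedure of a particular standard; the function and variable names are assumptions.

```python
def clipping_metrics(reference, detected):
    """Count the frames that fall into each of the four error classes.

    reference, detected: equal-length sequences of booleans (True = speech).
    Errors at the very start of a reference segment count as FEC (speech
    segment) or OVER (pause segment following speech); all later errors in
    the segment count as MSC or NDS respectively.
    """
    if len(reference) != len(detected):
        raise ValueError("sequences must have the same length")
    counts = {"FEC": 0, "MSC": 0, "OVER": 0, "NDS": 0}
    seen_speech = False        # has any reference speech occurred so far?
    at_segment_start = True    # no correct decision yet in the current segment
    previous = None
    for ref, det in zip(reference, detected):
        if ref != previous:    # a new reference segment begins
            at_segment_start = True
            previous = ref
        if ref == det:
            at_segment_start = False
        elif ref:              # speech classified as pause
            counts["FEC" if at_segment_start else "MSC"] += 1
        else:                  # noise classified as speech
            carried_over = at_segment_start and seen_speech
            counts["OVER" if carried_over else "NDS"] += 1
        if ref:
            seen_speech = True
    return counts

ref = [False, False, True, True, True, True, True, False, False, False, False]
det = [False, True, False, True, False, True, True, True, False, True, False]
metrics = clipping_metrics(ref, det)
```

In practice the counts are usually reported relative to the total amount of speech or noise in the recording, which this sketch leaves to the caller.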

Although the method described above provides useful objective information about the performance of a pause detection, it is only an approximate measure of the subjective effect. For example, depending on the type of comfort noise generator selected, the effects of clipped speech segments can sometimes be masked by background noise, so that some of the clipping measured in objective tests is actually inaudible. It is therefore important to subject pause detection methods to subjective tests, mainly to ensure that the perceived clipping is acceptable. This type of test requires a certain number of listeners to evaluate recordings processed with the method under test. The listeners rate the following characteristics:

Quality;
Intelligibility;
Audibility of cuts.

The ratings obtained from listening to a number of speech sequences are then averaged for each of the characteristics listed above, giving an overall assessment of the behavior of the speech pause detection under test. So while objective methods are very useful in an early development stage for checking the quality of a speech pause detection, subjective methods are more meaningful. Because they are more expensive (they require the participation of a certain number of people over a few days), they are generally used only when a proposal is in the process of being standardized.

Implementations

An early standardized speech pause detection is the method developed by British Telecom in 1991 for use in the pan-European digital cellular network. It uses inverse filtering trained on pause segments to filter out background noise, so that a simple level threshold can then decide more reliably whether a voice is present.
The G.729 standard calculates the following features for its pause detection: line spectral frequencies, full-band energy, low-band energy (<1 kHz) and zero-crossing rate. It applies a simple classification with a fixed decision boundary in the space these features define, and then smooths and dynamically corrects that estimate.
The GSM standard contains two options developed by ETSI for detecting speech pauses. The first option calculates the signal-to-noise ratio in nine frequency bands and applies a threshold to these values. The second option calculates different parameters: channel power, voice metrics and noise power. It then applies to the voice metrics a threshold that varies with the estimated signal-to-noise ratio.
The Speex audio compression library used a procedure known as Improved Minima Controlled Recursive Averaging, which uses a smoothed representation of the spectral energy distribution and then searches for the minima of a smoothed periodogram. From version 1.2 onward it was replaced by what the author himself calls a "kludge".
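
Three of the features named in the G.729 entry above (zero-crossing rate, full-band energy and low-band energy) are simple enough to sketch directly. The code below is not the computation specified in the standard: it uses a naive discrete Fourier transform, the function names are assumptions, and the line spectral frequencies are omitted. The default frame of 80 samples at 8 kHz corresponds to a 10 ms frame.

```python
import cmath
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def band_energies(frame, sample_rate=8000, split_hz=1000):
    """Return (full_band, low_band) energy from the one-sided DFT spectrum.

    The values are only meaningful relative to each other; no attempt is made
    to reproduce the scaling used in any standard.
    """
    n = len(frame)
    full = low = 0.0
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power = abs(coeff) ** 2 / n
        full += power
        if k * sample_rate / n < split_hz:   # bin lies below the split frequency
            low += power
    return full, low

def tone(freq_hz, n=80, sample_rate=8000):
    """Test signal: a sine wave with a small phase offset."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate + 0.1) for t in range(n)]

low_tone, high_tone = tone(500), tone(2000)
full_l, low_l = band_energies(low_tone)
full_h, low_h = band_energies(high_tone)
```

For the 500 Hz tone almost all of the energy falls into the low band, while for the 2000 Hz tone almost none does, and the zero-crossing rate of the higher tone is correspondingly larger.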

Literature

DMA minimum performance standards for discontinuous transmission operation of mobile stations TIA doc. and database IS-727. June 1998.
M. Y. Appiah, M. Sasikath, R. Makrickaite & M. Gusaite: Robust Voice Activity Detection and Noise Reduction Mechanism Using Higher-Order Statistics. 2005, doi:10.1109/ICPR.2010.28 (auc.dk [PDF] Institute of Electronics Systems, Aalborg University).
Xianglong Liu, Yuan Liang, Yihua Lou, He Li & Baosong Shan: Noise-Robust Voice Activity Detector Based on Hidden Semi-Markov Models. In: 2010 20th International Conference on Pattern Recognition (ICPR). IEEE, 2010, ISBN 978-1-4244-7542-1, pp. 81–84, doi:10.1109/ICPR.2010.28 (edu.cn [PDF]).

Footnotes

J. Ramírez, J. M. Górriz & J. C. Segura: Voice Activity Detection. Fundamentals and Speech Recognition System Robustness. In: M. Grimm & K. Kroschel (Eds.): Robust Speech Recognition and Understanding. 2007, ISBN 978-3-902613-08-0, pp. 1–22 (i-techonline.com [PDF]).
F. Beritelli, S. Casale, G. Ruggeri & S. Serrano: Performance evaluation and comparison of G.729/AMR/fuzzy voice activity detectors. In: IEEE Signal Processing Letters. Vol. 9, no. 3, March 2002, pp. 85–88, doi:10.1109/97.995824.
D. K. Freeman, G. Cozier, C. B. Southcott & I. Boyd: The voice activity detector for the Pan-European digital cellular mobile telephone service. In: 1989 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-89). Vol. 1, May 1989, pp. 369–372, doi:10.1109/ICASSP.1989.266442.
A. Benyassine, E. Shlomot, H.-Y. Su, D. Massaloux, C. Lamblin & J.-P. Petit: ITU-T Recommendation G.729 Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications. In: IEEE Communications Magazine. Vol. 35, no. 9, September 1997, pp. 64–73, doi:10.1109/35.620527.
ETSI: Digital cellular telecommunications system (Phase 2+); Half rate speech; Voice Activity Detector (VAD) for half rate speech traffic channels (GSM 06.42 version 8.0.1). 1999.
I. Cohen: Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. In: IEEE Transactions on Speech and Audio Processing. Vol. 11, no. 5, September 2003, pp. 466–475, doi:10.1109/TSA.2003.811544.
Jean-Marc Valin: preprocess.c. In: Speex source code, version 1.2beta2. Xiph.org, accessed on January 17, 2012: "FIXME: This VAD is a kludge"
