G.729

from Wikipedia, the free encyclopedia

G.729 is an audio codec (more precisely a vocoder, or voice coder; see parametric audio coding) for compressing speech into digital signals. It has been free of license fees since the beginning of 2017. Its technical name is "Conjugate Structure Algebraic Code Excited Linear Prediction" (CS-ACELP). G.729 Annexes A and B are used, for example, in Internet telephony because of their high level of compression and low computing requirements.

Technical specifications

G.729 is a hybrid compression method: a so-called vocoder examines the speech and transmits speech parameters together with difference information, and the decoder then synthesizes the speech from them. The codec splits the audio signal into frames of 10 milliseconds, which it examines for typical speech properties. These are captured as parameters for later synthesis. In addition, the codec transmits difference information derived from the gap between the artificially generated signal and the actual signal. A voice packet typically carries two 10-millisecond frames, which results in a delay of around 25 milliseconds.

This codec can process audio signals whose source is not human speech only to a limited extent. For example, it cannot adequately process the dual-tone multi-frequency (DTMF) tones used in analog telephony. This can be worked around by filtering the DTMF tones out of the signal and transmitting them separately ("out of band") in the information channel in accordance with RFC 2833.
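As an illustration of the out-of-band approach, the following is a minimal sketch of how the 4-byte telephone-event payload described in RFC 2833 could be built for one DTMF digit. The function name `telephone_event_payload`, the default volume and the 8000 Hz timestamp clock are assumptions made for this example; the RTP header that would carry the payload is omitted.

```python
import struct

# RFC 2833 event codes for DTMF: digits 0-9 -> 0..9, "*" -> 10, "#" -> 11, A-D -> 12..15
DTMF_EVENTS = {ch: code for code, ch in enumerate("0123456789*#ABCD")}


def telephone_event_payload(digit, duration, volume=10, end=False):
    """Build the 4-byte telephone-event payload for one DTMF digit.

    duration is given in RTP timestamp units (assumed here: 8000 per second),
    volume is the tone level in dBm0 with the sign dropped (0..63).
    The function name and defaults are hypothetical, chosen for this sketch.
    """
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    # Second byte: E (end) bit, reserved bit set to 0, then the 6-bit volume
    flags = (0x80 if end else 0x00) | volume
    # Network byte order: event (8 bits), flags (8 bits), duration (16 bits)
    return struct.pack("!BBH", DTMF_EVENTS[digit], flags, duration)


# Digit "5" held for 100 ms (800 timestamp units at 8000 Hz)
payload = telephone_event_payload("5", duration=800)
print(payload.hex())
```

The sender would typically repeat such payloads while the key is held and set `end=True` on the final ones, so that the receiver can regenerate the tone locally instead of decoding it from the G.729 stream.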

G.729 also suppresses speech pauses. So that this does not sound like a dropped connection to the listener, the decoder can fill speech pauses with so-called comfort noise. The standard includes possible implementations in both fixed-point and the technically more complex floating-point format, which facilitates use on various DSP platforms. G.729 is comparatively computationally intensive, depending on the variant used; depending on the implementation and the options it contains, it requires around 50 MIPS. The variants G.729A and G.729B have low computational complexity and require, for example, around 10.3 million clock cycles per 80 audio samples in the ITU-T's non-optimized reference implementation on the MicroBlaze processor. The MIPS figures can, however, differ from the stated values depending on the architecture used and the type of optimization, and are only rough guide values.

Variants

The standard divides G.729 into different variants, which are specified as annexes. The annexes are distinguished from one another by letters and other symbols. Each annex describes a possible combination of features; the combinations differ in implementation effort, required computing power and the functional scope of the codec. For correct decoding, the encoder and decoder must be matched to each other.

The following variants are available in the context of G.729:

                           Annex
Functionality              -   A   B   C   D   E   F   G   H   I   C+  J
Low complexity                 X   X
Fixed-point arithmetic     X   X   X       X   X   X   X   X   X       X
Floating-point arithmetic              X                           X
Data rate 8 kbit/s         X   X   X   X   X   X   X   X   X   X   X   X
Data rate 6.4 kbit/s                       X       X       X   X   X
Data rate 11.8 kbit/s                          X       X   X   X   X
DTX                                X               X   X       X   X
Variable bit rate                                                      X

The acronym DTX stands for discontinuous transmission: the transmitter detects speech pauses, during which only background noise would otherwise have to be transmitted, and sends bandwidth-saving pause signals instead; the receiver reproduces these pauses as locally generated comfort noise. In Mean Opinion Score (MOS) tests, G.729 achieved a perceived quality of 3.98 out of 5 points, whereas the variant G.729A achieved only 3.7 out of 5 points.

The most popular variants of the codec are Annex A and B, which use a fixed bit rate of 8 kbit/s for the coded speech signal; some other variants also support fixed bit rates of 6.4 kbit/s and 11.8 kbit/s. The frequency spectrum ranges from 300 to 3400 Hz, and the coding concept accurately transmits only speech.
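The relationship between these bit rates and the 10 ms frame length can be checked with a few lines of arithmetic. The following sketch assumes nothing beyond the figures given in this article; the helper name `bits_per_frame` is made up for the example.

```python
FRAME_MS = 10  # G.729 frame length in milliseconds, as stated in the article


def bits_per_frame(bit_rate_bps, frame_ms=FRAME_MS):
    """Return the number of coded bits in one frame at the given bit rate.

    Hypothetical helper for illustration; the bit rate is given in bit/s so
    that the calculation stays in integer arithmetic.
    """
    return bit_rate_bps * frame_ms // 1000


for rate in (6400, 8000, 11800):
    bits = bits_per_frame(rate)
    print(f"{rate / 1000:g} kbit/s: {bits} bits per frame ({bits / 8:g} bytes)")
```

At 8 kbit/s this gives 80 bits, i.e. the 10 bytes per frame used in the overhead calculation for RTP.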

G.729.1 (G.729J)

The latest extension, G.729J (this variant corresponds to the working name G.729.1), supports wideband speech and audio coding: the transmitted frequency range has been extended to 50 Hz to 7 kHz. The G.729J codec is organized hierarchically, so the bit rate, and with it the voice/audio quality, can be varied simply by truncating the bit stream.

Voice quality in comparison

To compare transmission quality, the mean opinion score (MOS) method can be used, which captures a listener's subjective perception of speech quality. The MOS scale is not an absolute scale; it depends on the question asked and on the listening samples offered in the listening test. The same codec can therefore achieve different values in different tests. What matters is the difference between the codec under test and known reference codecs (e.g. G.711). In typical tests, G.729 achieves a value of approx. 3.9 (on a five-point MOS scale). G.729 thus achieves a higher subjective voice quality than other codecs (e.g. G.728 and G.723), but is inferior to the G.711 reference codec (ISDN). G.711 achieves a slightly higher MOS value of approx. 4.4, but at 64 kbit/s requires a data rate eight times that of G.729, which requires only 8 kbit/s.

Overhead when used with RTP on an IPv4 network

The stated data rate of 8 kbit/s is nominal: it refers exclusively to the audio data itself. When the data stream is sent through a network, the headers of the packets that carry it add overhead. With RTP in an IPv4 network, this overhead is 40 bytes per packet (60 bytes for IPv6). A G.729 frame is 10 ms long and is coded in 10 bytes. Typically, 2 frames are sent per IPv4 packet, so 60 bytes (40 + 10 + 10 bytes) are effectively needed for 20 ms of voice data. That is 3000 bytes per second, i.e. 24 kbit/s (3000 bytes × 8 / 1000 = 24 kbit). Packing more than 2 frames into one packet reduces the relative share of header data and thus the overhead: with 3 frames per packet, only 18.7 kbit/s would be needed. The disadvantage is a longer delay: 25 ms with 2 frames per packet (10 ms per frame + 5 ms processing time) becomes 35 ms with 3 frames. If the delay becomes too great, users may find it annoying.

Frames/      ms/     Bytes/packet  Bytes/packet  Packets/  kbit/s   kbit/s     Overhead  Delay
IPv4 packet  packet  nominal       effective     second    nominal  effective  %         ms
1            10      10            50            100.0     8        40.0       400.00    15
2            20      20            60            50.0      8        24.0       200.00    25
3            30      30            70            33.3      8        18.7       133.33    35
4            40      40            80            25.0      8        16.0       100.00    45
5            50      50            90            20.0      8        14.4       80.00     55
6            60      60            100           16.7      8        13.3       66.67     65
7            70      70            110           14.3      8        12.6       57.14     75
8            80      80            120           12.5      8        12.0       50.00     85
9            90      90            130           11.1      8        11.6       44.44     95
10           100     100           140           10.0      8        11.2       40.00     105
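The figures in this section follow from simple arithmetic and can be reproduced with a short script. The following is a sketch under the assumptions stated in the text (40 header bytes per IPv4 packet, 10 bytes and 10 ms per frame, 5 ms processing time); the function name `packet_stats` and the dictionary keys are chosen for this example only.

```python
HEADER_BYTES_IPV4 = 40  # IPv4 (20) + UDP (8) + RTP (12) header bytes per packet
FRAME_MS = 10           # duration of one G.729 frame
FRAME_BYTES = 10        # size of one coded frame at 8 kbit/s
PROCESSING_MS = 5       # processing time assumed in the text


def packet_stats(frames_per_packet, header_bytes=HEADER_BYTES_IPV4):
    """Compute packet size, bit rates, overhead and delay for one packing choice."""
    payload = frames_per_packet * FRAME_BYTES
    total = payload + header_bytes
    duration_ms = frames_per_packet * FRAME_MS
    return {
        "payload_bytes": payload,
        "total_bytes": total,
        "packets_per_second": 1000 / duration_ms,
        # bytes * 8 / ms is bits per millisecond, which equals kbit/s
        "nominal_kbits": payload * 8 / duration_ms,
        "effective_kbits": total * 8 / duration_ms,
        "overhead_percent": 100 * header_bytes / payload,
        "delay_ms": duration_ms + PROCESSING_MS,
    }


for n in range(1, 11):
    s = packet_stats(n)
    print(f"{n:2d} frames: {s['total_bytes']:3d} bytes/packet, "
          f"{s['effective_kbits']:.1f} kbit/s effective, "
          f"{s['overhead_percent']:.2f} % overhead, {s['delay_ms']} ms delay")
```

Passing `header_bytes=60` gives the corresponding values for IPv6.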

Sources

  • ITU-T G.729 - The standard includes a complete reference implementation in C by the ITU-T for all G.729 variants.

References

  1. https://www.mgraves.org/2017/03/its-official-the-patents-on-g-729-have-expired/
  2. Russell Klein, Rajat Moona: Migrating software to hardware on FPGAs. Indian Institute of Technology Kanpur, 2005 (iitk.ac.in [PDF]).
  3. Recommendation G.729, Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP). ITU-T, 2007 (SERIES G: TRANSMISSION SYSTEMS AND MEDIA, DIGITAL SYSTEMS AND NETWORKS - Digital terminal equipments - Coding of analogue signals by methods other than PCM).