Mel Frequency Cepstral Coefficients

Mel Frequency Cepstral Coefficients (MFCCs) are used in automatic speech recognition. They provide a compact representation of the frequency spectrum. The "mel" in the name refers to the mel scale, a scale of perceived pitch.

MFCCs are also used to analyze music. In particular, they are used to identify pieces of music in order to be able to assign metadata to them.

The linear model of speech production serves as the basis for MFCCs: a periodic excitation signal (vocal cords) is shaped by a linear filter (mouth, tongue, nasal cavities, ...). For speech recognition, the filter (or its impulse response) is what matters most, since the analysis is interested in what was said rather than at which pitch. Calculating MFCCs is an elegant way to separate the excitation signal from the impulse response of the filter.

Mathematically, the speech signal is the convolution of the excitation signal with the impulse response of the filter. When the cepstrum is calculated, the Fourier transform turns this convolution into a multiplication, and the logarithm turns the multiplication into an addition whose terms are easy to separate. In this way the speech signal can be split into excitation and filter.
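The relationship between convolution, multiplication and addition can be checked numerically. The following is a minimal sketch using NumPy; the random arrays are only stand-ins for an excitation signal and an impulse response, and the variable names and the FFT length are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
excitation = rng.standard_normal(64)        # stand-in for the excitation signal
impulse_response = rng.standard_normal(16)  # stand-in for the filter

# Time domain: the "speech" signal is the convolution of the two (length 79).
speech = np.convolve(excitation, impulse_response)

# The FFT length must be at least len(speech) so that the zero-padded
# transforms describe the linear (not circular) convolution.
n_fft = 128
E = np.fft.rfft(excitation, n_fft)
H = np.fft.rfft(impulse_response, n_fft)
S = np.fft.rfft(speech, n_fft)

# Frequency domain: convolution has become multiplication, S = E * H.
# After the logarithm of the magnitudes it becomes an addition.
log_speech = np.log(np.abs(S))
log_sum = np.log(np.abs(E)) + np.log(np.abs(H))

max_diff = np.max(np.abs(log_speech - log_sum))
print(max_diff)  # only floating-point rounding error remains
```

Because the two contributions are added in the logarithmized spectrum, a further linear transform can move them into different coefficients, which is what makes the separation possible.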

MFCCs are calculated through the following steps:

Division of the input signal into blocks (frames), each multiplied by a window function (e.g. a Hamming window to avoid edge effects). Overlapping windows are common.
(Discrete) Fourier transform of each individual window (this turns the convolution of the excitation signal and the impulse response into a multiplication).

Calculation of the magnitude spectrum.

Taking the logarithm of the magnitude spectrum. This turns the multiplication of the excitation signal and the impulse response into an addition.

Reduction of the number of frequency bands (e.g. from 256 to 40) by combining them: the bands are mapped onto the mel scale in discrete steps using triangular filters (effectively a bank of band-pass filters).

Final decorrelation by either a discrete cosine transform (DCT) or a principal component analysis (also called the Karhunen-Loève transform).

Originally, the logarithmized Fourier coefficients (without mel band-pass filtering) were transformed back with an inverse Fourier transform. The excitation frequency then appears as a single peak that is easy to identify or filter out. The result of this method is called a cepstrum. Its main advantage is that a convolution (e.g. filtering) in the time domain corresponds to an addition in the logarithmized frequency domain. The task of the coefficients is to represent the information in the audio signal in decorrelated form, i.e. as compactly as possible. The logarithmized spectrum is therefore subjected to a DCT, which has properties similar to those of the Karhunen-Loève transform and is also easy to implement.
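The steps listed above can be sketched in a few lines of NumPy. This is an illustrative sketch, not a reference implementation: the function names, the frame length, hop size and number of coefficients are assumptions, and the Hz-to-mel conversion uses one common formula, m = 2595 log10(1 + f/700), which the text itself does not specify. The sketch follows the order given above (logarithm first, then mel band reduction); many implementations instead apply the mel filters first and take the logarithm afterwards.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose corner frequencies are equally spaced in mel.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        left, centre, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (freqs - left) / (centre - left)
        falling = (right - freqs) / (right - centre)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_ii(x, n_out):
    # Unnormalised DCT-II along the last axis, keeping the first n_out terms.
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (m + 0.5) / n)
    return x @ basis.T

def mfcc(signal, sample_rate, frame_len=512, hop=256, n_filters=40, n_coeffs=13):
    # Step 1: overlapping frames, each multiplied by a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Steps 2 and 3: DFT of each frame, then the magnitude spectrum.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Step 4: logarithm (the small constant guards against log(0)).
    log_spectrum = np.log(magnitude + 1e-10)
    # Step 5: combine the frequency bins into n_filters mel bands,
    # here as a weighted average of the log magnitudes in each band.
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    bank /= bank.sum(axis=1, keepdims=True)
    log_mel = log_spectrum @ bank.T
    # Step 6: decorrelate the band values with a DCT.
    return dct_ii(log_mel, n_coeffs)

# Usage: one second of a 440 Hz test tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
coeffs = mfcc(np.sin(2.0 * np.pi * 440.0 * t), sample_rate)
print(coeffs.shape)  # one row per frame, one column per coefficient
```

The DCT is written out as a matrix product here only to keep the sketch free of dependencies beyond NumPy; a library routine for the DCT could be used instead.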
