In concatenative speech synthesis, a diphone is a short building block of speech that begins in the middle of one phone and ends in the middle of the following phone. A diphone thus contains the transition between the two sounds that arises from coarticulation. Concatenating building blocks that each comprise only a single phone (allophone synthesis) gives very unsatisfactory results, because the coarticulation between the sounds cannot be captured. Diphone synthesis, by contrast, yields surprisingly good results that are intelligible and sufficiently natural. Quality can be increased further by using longer building blocks instead of diphones (e.g. syllables, common words or sound sequences), but this is often impracticable because the inventory becomes too large.
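As an illustration of the definition above, the following minimal Python sketch lists the diphone units needed for a phone sequence. The function name `to_diphones`, the silence symbol `_` used to pad the start and end of the utterance, and the sample phone sequence are assumptions made for this example; they are not taken from any particular synthesis system.

```python
from typing import List, Tuple

# Assumed symbol for the pause before and after an utterance (illustrative only).
SILENCE = "_"


def to_diphones(phones: List[str]) -> List[Tuple[str, str]]:
    """Return the diphone units that cover a phone sequence.

    Each unit runs from the middle of one phone to the middle of the next,
    so it is identified by a pair of adjacent phones.
    """
    padded = [SILENCE] + list(phones) + [SILENCE]
    # Pair every phone with its successor.
    return list(zip(padded, padded[1:]))


units = to_diphones(["h", "a", "l", "o"])
print(units)
```

With this padding, a sequence of n phones is covered by n + 1 diphones, and every phone boundary falls inside a unit rather than at a concatenation point.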
During synthesis, the prosodic properties of the diphone units (intensity, fundamental frequency, duration) are modified, for example with the PSOLA algorithm, in order to produce a natural speech melody.
Three speech synthesis systems that work on the basis of diphone synthesis are DreSS, SVOX and the free program Mbrola.
Not all combinatorially possible diphones occur in a natural language. For example, no German word contains the sound sequence [p͡fœ̃ː], since [p͡f] is common only in German, whereas [œ̃ː] occurs only in French or in words borrowed from it.
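The gap between possible and occurring diphones can be sketched in Python as follows. The three-entry lexicon is a hypothetical toy example, not real language data, and the sketch looks only at word-internal phone pairs; a real inventory would also need transitions across word boundaries and to and from silence.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon: word -> phone sequence (illustrative only).
LEXICON: Dict[str, List[str]] = {
    "tal": ["t", "a", "l"],
    "lat": ["l", "a", "t"],
    "alt": ["a", "l", "t"],
}


def attested_diphones(lexicon: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Collect every pair of adjacent phones that occurs inside a word."""
    found: Set[Tuple[str, str]] = set()
    for phones in lexicon.values():
        found.update(zip(phones, phones[1:]))
    return found


# All phones used anywhere in the lexicon.
phones = sorted({p for seq in LEXICON.values() for p in seq})

# Every combinatorially possible pair, including a phone followed by itself.
possible = set(product(phones, repeat=2))
attested = attested_diphones(LEXICON)
missing = possible - attested

print(len(possible), len(attested), sorted(missing))
```

For this toy lexicon, three phones give nine possible pairs, of which five occur. Pairs that never occur need not be recorded, which keeps the inventory smaller than the square of the number of phones.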