Speech Morphing


 

Voice morphing is the process of producing intermediate or hybrid voices between the utterances of two speakers. It can also be defined as the process of gradually transforming the voice of one speaker to that of another. The ability to change the speaker’s individual characteristics and to produce high-quality voices can be used in many applications. Examples include multimedia and video entertainment, as well as enrichment of speech databases in text-to-speech systems.

The aim of this morphing algorithm, which was inspired by the paper “Automatic audio morphing” by M. Slaney, M. Covell, and B. Lassiter, is to produce natural-sounding hybrid voices between two speakers uttering the same content.

"In video, morphing is a process of generating a range of images that smoothly move from one image to another. In a good morph, the in-between images all show one object smoothly changing its shape and texture until it turns into another object. We would like the same thing to happen in an audio morph. A sound that is perceived as one object should change smoothly into another sound, maintaining the shared properties of the starting and ending sounds and smoothly changing the other parameters" (Slaney et al., Proc. ICASSP’96, 1996, vol. 2, pp. 1001–1004).

In this study we present a new technique that enables the production of a desired number of intermediate voices between the original voices of two speakers, or of a single voice signal that changes gradually in time from one speaker to the other. In the latter case, the voice characteristics at the beginning of the utterance are those of one speaker, so the voice is perceived as belonging to that speaker. The voice is then modified gradually towards the characteristics of the other speaker, so that by the end of the utterance it is perceived as belonging to the second speaker. The technique is based on two components. The first is the creation of a 3-D prototype waveform interpolation (PWI) surface from the residual error signal obtained by LPC analysis, which is used to produce a new intermediate excitation signal. The second is a representation of the vocal tract by a lossless tube area function, with interpolation of the two speakers’ parameters.
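The second component can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes LPC coefficients in the form A(z) = 1 + a1 z^-1 + ... + ap z^-p, one common sign convention for the reflection coefficients, a tube area normalised to 1 at one end, and plain linear interpolation of the areas (the text above does not say which interpolation rule is used). All function names are hypothetical, and the PWI excitation component is not covered.

```python
def lpc_to_reflection(a):
    """Step-down recursion: [a1..ap] of A(z) = 1 + sum a_j z^-j -> [k1..kp]."""
    a = list(a)
    k = [0.0] * len(a)
    for m in range(len(a), 0, -1):
        km = a[m - 1]                      # highest-order coefficient is k_m
        k[m - 1] = km
        denom = 1.0 - km * km              # nonzero for a stable filter
        a = [(a[j] - km * a[m - 2 - j]) / denom for j in range(m - 1)]
    return k


def reflection_to_lpc(k):
    """Step-up recursion, the inverse of lpc_to_reflection."""
    a = []
    for m, km in enumerate(k, start=1):
        a = [a[j] + km * a[m - 2 - j] for j in range(m - 1)] + [km]
    return a


def reflection_to_area(k, ref_area=1.0):
    """Areas of the lossless tube sections (assumed sign convention)."""
    areas = [ref_area]
    for ki in k:
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))
    return areas


def area_to_reflection(areas):
    """Inverse of reflection_to_area for adjacent section pairs."""
    return [(a0 - a1) / (a0 + a1) for a0, a1 in zip(areas, areas[1:])]


def morph_vocal_tract(lpc_1, lpc_2, alpha):
    """Interpolate two LPC filters in the area domain.

    alpha = 0 returns speaker 1, alpha = 1 returns speaker 2.
    Linear interpolation of areas is an assumption of this sketch.
    """
    areas_1 = reflection_to_area(lpc_to_reflection(lpc_1))
    areas_2 = reflection_to_area(lpc_to_reflection(lpc_2))
    mixed = [(1.0 - alpha) * x + alpha * y for x, y in zip(areas_1, areas_2)]
    return reflection_to_lpc(area_to_reflection(mixed))


if __name__ == "__main__":
    # Illustrative values only, not taken from real speech.
    speaker_1 = reflection_to_lpc([0.5, -0.3, 0.2])
    speaker_2 = reflection_to_lpc([-0.4, 0.6, -0.1])
    print(morph_vocal_tract(speaker_1, speaker_2, 0.5))
```

One reason to interpolate in the area domain rather than directly on the LPC coefficients is that any mixture of positive areas is again positive, so every reflection coefficient of the hybrid stays strictly between -1 and 1 and the resulting synthesis filter remains stable.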

 

Examples

 

Cyclostationary morphing samples.

In this mode, two sentences, uttered by two speakers, are changed gradually, from one speaker to the other, repetitively and smoothly, in small steps, so that the first sentence is perceived as uttered by speaker 1 and the last sentence is perceived as produced by the other speaker (speaker 2). The sentences in-between are morphed so as to be perceived as uttered by an intermediate speaker, between 1 and 2.

   Sentence1:  "Don't ask me (SA12SA22)."

   Sentence2:  "Among_Us."

 

Gradual Morphing samples.

In this mode, we start from the first (source) speaker (morphing factor = 0) and change the morphing factor over the duration of the sentence until it reaches 1 at the end. In these examples, Grad_1 and Grad_2 are the original sentences of a male and a female speaker, respectively (excerpted with permission from the BGU Hebrew speech database). Grad_3 is a morph from the female to the male voice, and Grad_4 from the male to the female voice. Grad_5 and Grad_6 are simple concatenations of the two sentences, included for comparison with the morphing examples.

     Grad_1 - original(m)

     Grad_2 - original(f)

     Grad_3 - morphing(f-m)

     Grad_4 - morphing(m-f)

     Grad_5 - concatenated(m-f) 

     Grad_6 - concatenated(f-m)

----------------------------------

     Grad_13 - original(m1)

     Grad_14 - original(m2)

     Grad_15 - morphing(m1-m2)

--------------------------------------

     CLSA0059 - original(m1)

     CGSA0057 - original(m2)

     CGSA0057_CLSA0059 - morphing(m2-m1)

      

When this mode of voice morphing is applied, informal subjective listening tests on different sets of morphing parameters revealed that, to obtain a “linear” perceptual change between the source and target voices, the coefficient (the relative part of the source parameters in the model) must vary nonlinearly in time. In the following two examples, the morphing coefficient changed linearly with time:

     file_so2ta_dont_linear.wav

     file_ta2so_dont_linear.wav
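To make the difference between a linear and a nonlinear time course concrete, here is a small sketch that generates one morphing factor per analysis frame, from 0 at the first frame to 1 at the last (the weight of the source parameters is then 1 minus this value). The warping curve and its parameter gamma are purely hypothetical: the listening tests described above indicate only that the variation should be nonlinear, not which curve to use.

```python
def morph_schedule(num_frames, gamma=1.0):
    """Return a morphing factor in [0, 1] for each frame.

    gamma = 1.0 gives the linear schedule. Other positive values bend
    the curve while keeping the end points at 0 and 1. The curve shape
    is an illustrative assumption, not the one used for the samples.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    schedule = []
    for n in range(num_frames):
        t = n / (num_frames - 1)           # normalised time, 0..1
        rising = t ** gamma
        falling = (1.0 - t) ** gamma
        schedule.append(rising / (rising + falling))
    return schedule


if __name__ == "__main__":
    print(morph_schedule(5))               # linear in time
    print(morph_schedule(5, gamma=2.0))    # slow start and end, faster middle
```

With such a schedule, the factor for frame n would be passed to whatever routine interpolates the two speakers' parameters for that frame.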

Hybrid morphing samples.

 

In each of these examples, the morphing factor is constant, ranging from 0.00 (the sentence of the first speaker, without morphing) to 1.00 (the sentence of the second speaker, without morphing).

For example, the sample labeled 0.50 is a hybrid midway between the two speakers.

 

    TIMIT Samples.

 

    SO12_60SA0039:  0.00  0.15  0.30  0.40  0.45  0.50  0.55  0.60  0.70  0.85  1.00

    SA12SA22:       0.00  0.15  0.30  0.40  0.45  0.50  0.55  0.60  0.70  0.85  1.00