This part presents audio samples comparing the quality of the systems, corresponding to Section 4.2 in the paper.
Note that all the audio on this page was synthesized with the HifiGAN vocoder implemented in this repo, except the ground-truth recordings.
All the utterances are from ESD speaker ID 0015.
The systems compared are:

- **GT**: ground-truth recording.
- **GT (voc.)**: ground-truth mel-spectrogram + HifiGAN.
- **MixedEmotion**: the baseline from the MixedEmotion paper; we use the official code. It is an autoregressive model that uses relative attributes ranking to pre-compute intensity values for training, and it closely resembles Emovox for intensity-controllable emotion conversion.
- **GradTTS w/ emo label**: a conditional GradTTS model with hard emotion labels as input. It therefore has no intensity controllability, but as a well-established acoustic model it should yield good sample quality.
- **EmoDiff**: the proposed intensity-controllable emotional TTS model. (A sketch contrasting hard emotion labels with intensity-blended labels follows the table below.)
| Emotion | GT | GT (voc.) | MixedEmotion | GradTTS w/ emo label | EmoDiff |
|---|---|---|---|---|---|
| Surprise | (audio) | (audio) | (audio) | (audio) | (audio) |
| Happy | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sad | (audio) | (audio) | (audio) | (audio) | (audio) |
| Angry | (audio) | (audio) | (audio) | (audio) | (audio) |
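To make the contrast between hard-label conditioning (GradTTS w/ emo label) and intensity control (EmoDiff) concrete, here is a minimal sketch. The emotion list, helper names, and the linear Neutral-to-target blend are illustrative assumptions, not the repo's API; EmoDiff itself realizes intensity through guidance during diffusion sampling rather than as a plain input vector. The blend only mirrors the labeling convention of the intensity grid below, where 0.0 is Neutral and 1.0 is fully intense.

```python
import numpy as np

EMOTIONS = ["Neutral", "Surprise", "Happy", "Sad", "Angry"]

def hard_label(emotion: str) -> np.ndarray:
    """One-hot emotion vector, as fed to the hard-label GradTTS baseline."""
    vec = np.zeros(len(EMOTIONS))
    vec[EMOTIONS.index(emotion)] = 1.0
    return vec

def soft_label(emotion: str, intensity: float) -> np.ndarray:
    """Blend Neutral with the target emotion: 0.0 -> Neutral, 1.0 -> fully intense."""
    return (1.0 - intensity) * hard_label("Neutral") + intensity * hard_label(emotion)

print(hard_label("Angry"))       # [0. 0. 0. 0. 1.]
print(soft_label("Angry", 0.4))  # [0.6 0.  0.  0.  0.4]
```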
Here we synthesized speech with emotion intensities from 0.0 to 1.0 at an interval of 0.2.
Each row contains the same utterance, with the controlled emotion listed at the beginning.
(A scripting sketch of this sweep follows the table.)
| Emotion | 0.0 (Neutral) | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 (100% intense) |
|---|---|---|---|---|---|---|
| Surprise | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Surprise | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Angry | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Angry | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Happy | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Happy | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sad | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sad | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
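A minimal sketch of how the sweep behind the grid above could be scripted. `synthesize` is a hypothetical stand-in for the repo's inference entry point, not its actual API; its placeholder body just returns silence.

```python
import numpy as np

def synthesize(text: str, emotion: str, intensity: float) -> np.ndarray:
    """Hypothetical stand-in for EmoDiff inference; returns a waveform."""
    return np.zeros(16000)  # placeholder waveform (silence)

text = "A sample utterance from ESD speaker 0015."
for emotion in ("Surprise", "Angry", "Happy", "Sad"):
    for intensity in np.linspace(0.0, 1.0, 6):  # 0.0, 0.2, ..., 1.0
        wav = synthesize(text, emotion, float(intensity))
        # e.g. write wav to f"{emotion}_{intensity:.1f}.wav"
```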
In this experiment, we synthesized each utterance four times with MixedEmotion and EmoDiff, respectively. Each audio clip below contains the four samples of the same utterance, so listeners can attend to the diversity within each emotion. (A sketch of this setup follows the table.)
| Emotion | MixedEmotion | EmoDiff |
|---|---|---|
| Happy | (audio) | (audio) |
| Surprise | (audio) | (audio) |
| Angry | (audio) | (audio) |
| Surprise | (audio) | (audio) |
| Sad | (audio) | (audio) |
| Angry | (audio) | (audio) |
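The diversity setup above can be sketched the same way: sample the same utterance several times with different random seeds and concatenate the results into one clip. Again `synthesize` is a hypothetical stand-in, not the repo's actual API; in a diffusion-based model like EmoDiff, a different seed yields a different sampled trajectory and thus different prosody.

```python
import numpy as np

def synthesize(text: str, emotion: str, seed: int) -> np.ndarray:
    """Hypothetical stand-in for stochastic EmoDiff inference; returns a waveform."""
    return np.zeros(16000)  # placeholder waveform (silence)

# Four samples of the same utterance, one per random seed, concatenated
# into the single clip used on this page.
clips = [synthesize("Same utterance.", "Happy", seed) for seed in range(4)]
audio = np.concatenate(clips)
```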
@inproceedings{guo2023emodiff,
  title={EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance},
  author={Guo, Yiwei and Du, Chenpeng and Chen, Xie and Yu, Kai},
  booktitle={Proc. ICASSP},
  year={2023}
}