Audio Samples

From "EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance", accepted at ICASSP 2023

Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

yiwei.guo@sjtu.edu.cn

Shanghai Jiao Tong University, China



1. Emotional TTS Quality

This part presents the quality of the compared systems, corresponding to Section 4.2 in the paper. Note that all audio on this page, except the ground truth, was synthesized with the HiFi-GAN implementation in this repo. All utterances are from ESD speaker ID 0015.

GT: Ground-truth recording.
GT (voc.): Ground-truth mel-spectrogram resynthesized with HiFi-GAN.
MixedEmotion: Proposed in this paper. We use the official code. It is an autoregressive model that uses relative-attribute ranking to pre-compute intensity values for training, and closely resembles Emovox for intensity-controllable emotion conversion.
GradTTS w/ emo label: A conditional GradTTS model with hard emotion labels as input. It therefore has no intensity controllability, but should produce high-quality samples, being a well-established acoustic model.
EmoDiff: The proposed intensity-controllable emotional TTS model.
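To make the idea behind soft-label guidance concrete, here is a minimal, self-contained sketch (not the paper's implementation; the function and the toy logits are hypothetical). The guidance objective weights the classifier log-probability of the target emotion by the intensity alpha and the Neutral class by (1 - alpha); the diffusion sampler would follow the gradient of this quantity with respect to the mel-spectrogram, which is omitted here.

```python
import math

# Hypothetical emotion classes and toy classifier logits, for illustration only.
EMOTIONS = ["Neutral", "Happy", "Sad", "Angry", "Surprise"]
logits = [2.0, 0.5, 0.1, 0.3, 1.0]

# Log-softmax over the emotion classes.
log_z = math.log(sum(math.exp(x) for x in logits))
logp = [x - log_z for x in logits]

def soft_label_objective(logp, target, neutral, alpha):
    """Soft-label guidance objective: intensity alpha weights the target
    emotion's log-probability, (1 - alpha) weights Neutral."""
    return alpha * logp[target] + (1.0 - alpha) * logp[neutral]

# alpha = 1.0 reduces to ordinary hard-label guidance toward "Surprise"
# (index 4); alpha = 0.0 guides toward "Neutral" (index 0); values in
# between interpolate, giving intensity control.
print(soft_label_objective(logp, target=4, neutral=0, alpha=0.8))
```

With alpha = 1.0 the objective equals the target emotion's log-probability, so the hard-label case is recovered as a special case of the soft label.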

[Audio grid: columns = GT, GT (voc.), MixedEmotion, GradTTS w/ emo label, EmoDiff; rows = Surprise, Happy, Sad, Angry]


2. Controllability of Emotion Intensity

Here we synthesized speech with emotion intensities from 0.0 to 1.0 at a step of 0.2. Each row contains the same utterance; the controlled emotion is listed at the beginning of the row.
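The intensity grid used in this section can be generated as below; the synthesis call in the comment is a hypothetical API, shown only to indicate where the intensity value would enter.

```python
# Intensities from 0.0 to 1.0 at a 0.2 step, matching the columns below.
intensities = [round(0.2 * i, 1) for i in range(6)]
print(intensities)  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

# for alpha in intensities:
#     audio = emodiff_synthesize(text, emotion="Surprise", intensity=alpha)  # hypothetical
```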

[Audio grid: columns span intensities from 0.0 (Neutral) to 1.0 (100% intense); rows = two utterances each for Surprise, Angry, Happy, Sad]



3. Diversity of Emotional Samples

In this experiment, we synthesized each utterance four times with MixedEmotion and with EmoDiff. Each audio clip below contains the four samples of the same utterance. Listen for the diversity within each emotion.

[Audio grid: columns = MixedEmotion, EmoDiff; rows = Happy, Surprise, Angry, Surprise, Sad, Angry]


Citation

@inproceedings{guo2023emodiff,
     title={Emodiff: Intensity controllable emotional text-to-speech with soft-label guidance},
     author={Guo, Yiwei and Du, Chenpeng and Chen, Xie and Yu, Kai},
     booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
     year={2023},
     organization={IEEE}
}