Audio Samples from

"Unsupervised Word-Level Prosody Tagging for Controllable Speech Synthesis", ICASSP2022

Yiwei Guo, Chenpeng Du, Kai Yu

yiwei.guo@sjtu.edu.cn

Shanghai Jiao Tong University



1. Naturalness

This part presents the naturalness of the four compared systems, corresponding to Section 3.3 in the paper. Note that all the audio on this page, including the ground truth, is synthesized with a MelGAN vocoder. All the utterances are from LJSpeech, with utterance IDs listed.

GT: Ground truth (resynthesized)
Raw_FSP: Typical FastSpeech2 model with no prosody modeling
PLP_MDN: Phone-level prosody modeling with mixture density network (proposed in this paper)
WLP_predict: Proposed word-level prosody tagging system. Here all the tags are automatically predicted from text.

GT Raw_FSP PLP_MDN WLP_predict
LJ050-0179
LJ050-0185
LJ050-0187
LJ050-0191
LJ050-0206
LJ050-0227


2. Prosody Controllability

This part presents the prosody controllability experiment from Section 3.4 of the paper.
In this demo, all the controlled words belong to the same leaf node (i.e. the phonetic variation among them is reduced). For each utterance, we present the ground-truth speech and five synthetic utterances obtained by controlling the word with each of the five prosody tags. The word being controlled is marked in red. Listeners can verify that, when the specified prosody tag equals the ground-truth prosody tag, the word prosody in the synthetic speech is most similar to that of the recording.


International Business Machines Corporation,
and a panel of psychiatric and psychological experts.
Both Director Hoover and Belmont expressed to the Commission the great
concern of the FBI, which is shared by the Secret Service
GT with tag 0 GT with tag 0
Synthetic Speech with tag 0 Synthetic Speech with tag 0
Synthetic Speech with tag 1 Synthetic Speech with tag 1
Synthetic Speech with tag 2 Synthetic Speech with tag 2
Synthetic Speech with tag 3 Synthetic Speech with tag 3
Synthetic Speech with tag 4 Synthetic Speech with tag 4


Determination to use a means, other than legal or peaceful, to satisfy his
grievance, end quote, within the meaning of the new criteria.
Information is requested also concerning individuals
or groups who have demonstrated an interest in the President.
GT with tag 1 GT with tag 1
Synthetic Speech with tag 0 Synthetic Speech with tag 0
Synthetic Speech with tag 1 Synthetic Speech with tag 1
Synthetic Speech with tag 2 Synthetic Speech with tag 2
Synthetic Speech with tag 3 Synthetic Speech with tag 3
Synthetic Speech with tag 4 Synthetic Speech with tag 4


The Commission notes with approval several recent measures taken and
proposed by the Secret Service to improve its liaison arrangements.
This matter is obviously beyond the jurisdiction of the Commission
GT with tag 2 GT with tag 2
Synthetic Speech with tag 0 Synthetic Speech with tag 0
Synthetic Speech with tag 1 Synthetic Speech with tag 1
Synthetic Speech with tag 2 Synthetic Speech with tag 2
Synthetic Speech with tag 3 Synthetic Speech with tag 3
Synthetic Speech with tag 4 Synthetic Speech with tag 4


To establish liaison with local intelligence gathering agencies and to
provide for the immediate evaluation of information received from them.
Suggest that the Secret Service is trying to accomplish its job with
too few people and without adequate modern equipment.
GT with tag 3 GT with tag 3
Synthetic Speech with tag 0 Synthetic Speech with tag 0
Synthetic Speech with tag 1 Synthetic Speech with tag 1
Synthetic Speech with tag 2 Synthetic Speech with tag 2
Synthetic Speech with tag 3 Synthetic Speech with tag 3
Synthetic Speech with tag 4 Synthetic Speech with tag 4


So that these arrangements can be made permanent without
adversely affecting the operations of the Service's field offices.
And should not be relied upon to the detriment of
the imaginative application of judgment in special cases.
GT with tag 4 GT with tag 4
Synthetic Speech with tag 0 Synthetic Speech with tag 0
Synthetic Speech with tag 1 Synthetic Speech with tag 1
Synthetic Speech with tag 2 Synthetic Speech with tag 2
Synthetic Speech with tag 3 Synthetic Speech with tag 3
Synthetic Speech with tag 4 Synthetic Speech with tag 4


In addition to the experiments mentioned in the paper, we also present a similar demo for word-level prosody control, which might be easier to understand. Here, the controlled words can come from different leaf nodes. We synthesize each of those words using the five prosody tags in its leaf node. For each tag, we concatenate some words from the training-set recordings that have this specific ground-truth tag. These concatenated words are called reference words. Listeners can verify that the controlled word in the right column has prosody similar to that of the reference words on the left.
The mel-spectrograms are also shown so as to depict the prosodic variations.

And the respective responsibilities for any further investigation that may be required.
Reference Words Synthesized Speech Synthesized Mel-Spectrogram
0
1
2
3
4


Detailed formal agreements embodying these arrangements should be worked out between the secret service and
both of these agencies.
Reference Words Synthesized Speech Synthesized Mel-Spectrogram
0
1
2
3
4


Since these agencies are already obliged constantly to evaluate the activities of such groups.
Reference Words Synthesized Speech Synthesized Mel-Spectrogram
0
1
2
3
4


This is especially necessary with regards to the FBI and CIA.
Reference Words Synthesized Speech Synthesized Mel-Spectrogram
0
1
2
3
4


Since the assassination, both the secret service and the FBI have recognized.
Reference Words Synthesized Speech Synthesized Mel-Spectrogram
0
1
2
3
4



Appendix

We present the model structure of our prosody extractor and prosody predictor here.
Both the extractor and the predictor are auto-regressive, i.e. the prosody embedding or tag embedding at the current step is conditioned on the output of the previous step. In the prosody predictor, each word first goes through a pre-GRU separately, with its phone encodings as inputs. This pre-GRU can be regarded as a granularity converter. At each time step, one word is fed into the GRU, and the output is a vector of logits whose dimension equals the total number of prosody tags (50 in our case). The segment corresponding to the word's leaf node (e.g. dimensions 25-30) is taken for softmax. The index of the predicted tag can then either be sampled from this distribution or manually specified. This index is looked up in an embedding table to produce what we call the "tag embedding".
        
Prosody Extractor (image from paper)          Prosody Predictor
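
The tag selection step described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each leaf node owns a contiguous block of five logits (so leaf 5 would cover indices 25 up to, but not including, 30), and the names `leaf_segment`, `choose_tag`, `EMBED_DIM`, and the randomly filled `embedding_table` are hypothetical placeholders for the trained model's components.

```python
import math
import random

NUM_TAGS_TOTAL = 50  # total number of prosody tags, as stated in the text
TAGS_PER_LEAF = 5    # five prosody tags per leaf node, as in the demo
EMBED_DIM = 4        # hypothetical embedding size, for illustration only


def leaf_segment(leaf_id):
    """Return the half-open (start, end) logit range of a leaf node.

    Assumption: leaf nodes own contiguous blocks of TAGS_PER_LEAF logits.
    """
    start = leaf_id * TAGS_PER_LEAF
    return start, start + TAGS_PER_LEAF


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def choose_tag(logits, leaf_id, manual_tag=None, rng=random):
    """Pick a global tag index for one word.

    The softmax is restricted to the word's leaf-node segment. The local
    tag is either manually specified (control) or sampled (prediction).
    """
    if len(logits) != NUM_TAGS_TOTAL:
        raise ValueError("expected one logit per prosody tag")
    start, end = leaf_segment(leaf_id)
    probs = softmax(logits[start:end])
    if manual_tag is not None:
        if not 0 <= manual_tag < TAGS_PER_LEAF:
            raise ValueError("manual_tag must be a local index in 0..4")
        local = manual_tag
    else:
        local = rng.choices(range(TAGS_PER_LEAF), weights=probs, k=1)[0]
    return start + local, probs


# Placeholder embedding table; a trained model would supply learned values.
_init = random.Random(0)
embedding_table = [
    [_init.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(NUM_TAGS_TOTAL)
]

# Example: one word whose leaf node is 5, with dummy logits from the GRU.
logits = [0.0] * NUM_TAGS_TOTAL
logits[27] = 5.0

controlled_index, probs = choose_tag(logits, leaf_id=5, manual_tag=2)
sampled_index, _ = choose_tag(logits, leaf_id=5, rng=random.Random(1))
tag_embedding = embedding_table[controlled_index]
```

Passing `manual_tag` corresponds to the controllability demos in Section 2, where each word is synthesized once per tag, while leaving it as `None` corresponds to the fully predicted setting (WLP_predict in Section 1).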


Citation

@inproceedings{guo2022unsupervised,
     title={Unsupervised word-level prosody tagging for controllable speech synthesis},
     author={Guo, Yiwei and Du, Chenpeng and Yu, Kai},
     booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
     pages={7597--7601},
     year={2022},
     organization={IEEE}
}