Audio Samples from "Unsupervised Word-Level Prosody Tagging for Controllable Speech Synthesis", ICASSP 2022

Yiwei Guo, Chenpeng Du, Kai Yu

yiwei.guo@sjtu.edu.cn

Shanghai Jiao Tong University

1. Naturalness

This part presents the naturalness of the four compared systems, corresponding to Section 3.3 of the paper. Note that all the audio on this page, including the ground truth, is synthesized with a MelGAN vocoder. All utterances are from LJSpeech, with utterance IDs listed.

GT: Ground truth (resynthesized)
Raw_FSP: Typical FastSpeech2 model with no prosody modeling
PLP_MDN: Phone-level prosody modeling with mixture density network (proposed in this paper)
WLP_predict: Proposed word-level prosody tagging system. Here all the tags are automatically predicted from text.

	GT	Raw_FSP	PLP_MDN	WLP_predict
LJ050-0179
LJ050-0185
LJ050-0187
LJ050-0191
LJ050-0206
LJ050-0227

2. Prosody Controllability

This part presents the prosody controllability experiment in Section 3.4 of the paper.
In this demo, all the controlled words belong to the same leaf node (i.e., the phonetic variation among them is reduced). For each utterance, we present the ground-truth speech and five synthetic utterances obtained by controlling the word with each of the five prosody tags. The controlled word is marked in red. Listeners can verify that, when the specified prosody tag equals the ground-truth prosody tag, the word prosody in the synthetic speech is most similar to that of the recording.
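
The control setup described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `controlled_tag_sequences`, the representation of a sentence as a list of integer tags, and the choice to keep every other word's tag fixed are all assumptions made for the example.

```python
from typing import List

NUM_TAGS = 5  # the demo uses five prosody tags (0..4)


def controlled_tag_sequences(
    base_tags: List[int], word_index: int, num_tags: int = NUM_TAGS
) -> List[List[int]]:
    """Return one tag sequence per candidate tag.

    Only the tag of the controlled word changes; the tags of all other
    words are copied from base_tags (an assumption of this sketch).
    """
    if not 0 <= word_index < len(base_tags):
        raise IndexError("word_index is outside the sentence")
    variants = []
    for tag in range(num_tags):
        sequence = list(base_tags)   # copy so the input is not modified
        sequence[word_index] = tag   # override the controlled word only
        variants.append(sequence)
    return variants


# Hypothetical example: a four-word sentence whose third word is controlled.
variants = controlled_tag_sequences([2, 0, 4, 1], word_index=2)
```

Each of the five resulting sequences would then be passed to the synthesizer, and the variant whose tag matches the ground-truth tag is the one expected to sound closest to the recording.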

International Business Machines Corporation, and a panel of psychiatric and psychological experts.			Both Director Hoover and Belmont expressed to the Commission the great concern of the FBI, which is shared by the Secret Service

GT with tag 0			GT with tag 0
Synthetic Speech with tag 0			Synthetic Speech with tag 0
Synthetic Speech with tag 1			Synthetic Speech with tag 1
Synthetic Speech with tag 2			Synthetic Speech with tag 2
Synthetic Speech with tag 3			Synthetic Speech with tag 3
Synthetic Speech with tag 4			Synthetic Speech with tag 4

Determination to use a means, other than legal or peaceful, to satisfy his grievance, end quote, within the meaning of the new criteria.			Information is requested also concerning individuals or groups who have demonstrated an interest in the President.

GT with tag 1			GT with tag 1
Synthetic Speech with tag 0			Synthetic Speech with tag 0
Synthetic Speech with tag 1			Synthetic Speech with tag 1
Synthetic Speech with tag 2			Synthetic Speech with tag 2
Synthetic Speech with tag 3			Synthetic Speech with tag 3
Synthetic Speech with tag 4			Synthetic Speech with tag 4

The Commission notes with approval several recent measures taken and proposed by the Secret Service to improve its liaison arrangements.			This matter is obviously beyond the jurisdiction of the Commission

GT with tag 2			GT with tag 2
Synthetic Speech with tag 0			Synthetic Speech with tag 0
Synthetic Speech with tag 1			Synthetic Speech with tag 1
Synthetic Speech with tag 2			Synthetic Speech with tag 2
Synthetic Speech with tag 3			Synthetic Speech with tag 3
Synthetic Speech with tag 4			Synthetic Speech with tag 4

To establish liaison with local intelligence gathering agencies and to provide for the immediate evaluation of information received from them.			Suggest that the Secret Service is trying to accomplish its job with too few people and without adequate modern equipment.

GT with tag 3			GT with tag 3
Synthetic Speech with tag 0			Synthetic Speech with tag 0
Synthetic Speech with tag 1			Synthetic Speech with tag 1
Synthetic Speech with tag 2			Synthetic Speech with tag 2
Synthetic Speech with tag 3			Synthetic Speech with tag 3
Synthetic Speech with tag 4			Synthetic Speech with tag 4

So that these arrangements can be made permanent without adversely affecting the operations of the Service's field offices.			And should not be relied upon to the detriment of the imaginative application of judgment in special cases.

GT with tag 4			GT with tag 4
Synthetic Speech with tag 0			Synthetic Speech with tag 0
Synthetic Speech with tag 1			Synthetic Speech with tag 1
Synthetic Speech with tag 2			Synthetic Speech with tag 2
Synthetic Speech with tag 3			Synthetic Speech with tag 3
Synthetic Speech with tag 4			Synthetic Speech with tag 4

In addition to the experiments in the paper, we also present a similar demo for word-level prosody control, which may be easier to understand. Here, the controlled words can come from different leaf nodes. We synthesize each of those words with each of the five prosody tags in its leaf node. For each tag, we concatenate several words from the training-set recordings that have this ground-truth tag. These concatenated words are called reference words. Listeners can verify that the controlled word in the right column has prosody similar to that of the reference words on the left.
The mel-spectrograms are also shown to depict the prosodic variations.
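
The construction of reference words can be illustrated with a minimal sketch. It is not the authors' script: the function name `build_reference_audio`, the silent gap between words, the `max_words` limit, and the use of plain Python lists for waveforms are assumptions made to keep the example self-contained. The default sample rate of 22050 Hz is that of LJSpeech.

```python
from typing import List, Sequence


def build_reference_audio(
    segments: Sequence[Sequence[float]],
    tags: Sequence[int],
    target_tag: int,
    sample_rate: int = 22050,   # LJSpeech sample rate
    gap_seconds: float = 0.2,   # assumed pause between words
    max_words: int = 5,         # assumed number of reference words
) -> List[float]:
    """Concatenate word segments whose ground-truth tag is target_tag.

    segments[i] holds the waveform samples of one training-set word and
    tags[i] its ground-truth prosody tag. A short silence separates words.
    """
    chosen = [seg for seg, tag in zip(segments, tags) if tag == target_tag]
    chosen = chosen[:max_words]
    gap = [0.0] * int(round(sample_rate * gap_seconds))
    audio: List[float] = []
    for i, segment in enumerate(chosen):
        if i > 0:
            audio.extend(gap)    # silence between consecutive words
        audio.extend(segment)
    return audio
```

In practice the segments would be cut from the recordings using word boundaries, for example from a forced alignment; that step is outside the scope of this sketch.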

And the respective responsibilities for any further investigation that may be required.
	Reference Words	Synthesized Speech	Synthesized Mel-Spectrogram
0
1
2
3
4

Detailed formal agreements embodying these arrangements should be worked out between the Secret Service and both of these agencies.
	Reference Words	Synthesized Speech	Synthesized Mel-Spectrogram
0
1
2
3
4

Since these agencies are already obliged constantly to evaluate the activities of such groups.
	Reference Words	Synthesized Speech	Synthesized Mel-Spectrogram
0
1
2
3
4