This part presents the naturalness of the four compared systems, corresponding to Section 3.3 of
the paper.
Note that all the audio on this page, including the ground truth, is synthesized with the MelGAN vocoder.
All the utterances are from LJSpeech,
with utterance IDs listed.
GT: Ground truth (resynthesized); Raw_FSP: Typical FastSpeech2 model with no prosody modeling; PLP_MDN: Phone-level prosody modeling with mixture density network (proposed in this paper); WLP_predict: Proposed word-level prosody tagging system. Here all the tags are automatically predicted from text.

Utterance ID | GT | Raw_FSP | PLP_MDN | WLP_predict |
---|---|---|---|---|
LJ050-0179 | | | | |
LJ050-0185 | | | | |
LJ050-0187 | | | | |
LJ050-0191 | | | | |
LJ050-0206 | | | | |
LJ050-0227 | | | | |

This part presents the prosody controllability experiment
from Section 3.4 of the paper. In this demo, all the controlled words belong to the same leaf node (i.e. the phonetic variation among them is reduced). For each utterance, we present the ground-truth speech and five synthetic utterances obtained by controlling the word with each of the five prosody tags. The word being controlled is marked in red. Listeners can verify that the word prosody in the synthetic speech is most similar to the recording when the specified prosody tag equals the ground-truth prosody tag.
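
The control procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `tag_variants`, the list-of-integers tag representation, and the example tag values are all assumptions made for this sketch. It builds one tag sequence per candidate tag by overriding the tag of a single controlled word and leaving the other words unchanged; each sequence would then be passed to the synthesizer.

```python
from typing import List

NUM_TAGS = 5  # five prosody tags (0-4), as in the demo on this page


def tag_variants(tags: List[int], word_index: int,
                 num_tags: int = NUM_TAGS) -> List[List[int]]:
    """Return one copy of `tags` per candidate tag, with only the
    controlled word's tag replaced (hypothetical helper)."""
    if not 0 <= word_index < len(tags):
        raise IndexError("word_index is outside the utterance")
    variants = []
    for tag in range(num_tags):
        variant = list(tags)       # copy so the input is not modified
        variant[word_index] = tag  # override only the controlled word
        variants.append(variant)
    return variants


# Hypothetical per-word tags for a four-word utterance; word 2 is controlled.
predicted = [1, 3, 0, 2]
variants = tag_variants(predicted, word_index=2)
```

In this sketch only the controlled position differs across the five variants, which mirrors the demo: any audible difference between the five synthetic samples should come from the tag of the controlled word.
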

International Business Machines Corporation, and a panel of psychiatric and psychological experts. | Both Director Hoover and Belmont expressed to the Commission the great concern of the FBI, which is shared by the Secret Service |
---|---|
GT with tag 0 | GT with tag 0 |
Synthetic Speech with tag 0 | Synthetic Speech with tag 0 |
Synthetic Speech with tag 1 | Synthetic Speech with tag 1 |
Synthetic Speech with tag 2 | Synthetic Speech with tag 2 |
Synthetic Speech with tag 3 | Synthetic Speech with tag 3 |
Synthetic Speech with tag 4 | Synthetic Speech with tag 4 |

Determination to use a means, other than legal or peaceful, to satisfy his grievance, end quote, within the meaning of the new criteria. | Information is requested also concerning individuals or groups who have demonstrated an interest in the President. |
---|---|
GT with tag 1 | GT with tag 1 |
Synthetic Speech with tag 0 | Synthetic Speech with tag 0 |
Synthetic Speech with tag 1 | Synthetic Speech with tag 1 |
Synthetic Speech with tag 2 | Synthetic Speech with tag 2 |
Synthetic Speech with tag 3 | Synthetic Speech with tag 3 |
Synthetic Speech with tag 4 | Synthetic Speech with tag 4 |

The Commission notes with approval several recent measures taken and proposed by the Secret Service to improve its liaison arrangements. | This matter is obviously beyond the jurisdiction of the Commission |
---|---|
GT with tag 2 | GT with tag 2 |
Synthetic Speech with tag 0 | Synthetic Speech with tag 0 |
Synthetic Speech with tag 1 | Synthetic Speech with tag 1 |
Synthetic Speech with tag 2 | Synthetic Speech with tag 2 |
Synthetic Speech with tag 3 | Synthetic Speech with tag 3 |
Synthetic Speech with tag 4 | Synthetic Speech with tag 4 |

To establish liaison with local intelligence gathering agencies and to provide for the immediate evaluation of information received from them. | Suggest that the Secret Service is trying to accomplish its job with too few people and without adequate modern equipment. |
---|---|
GT with tag 3 | GT with tag 3 |
Synthetic Speech with tag 0 | Synthetic Speech with tag 0 |
Synthetic Speech with tag 1 | Synthetic Speech with tag 1 |
Synthetic Speech with tag 2 | Synthetic Speech with tag 2 |
Synthetic Speech with tag 3 | Synthetic Speech with tag 3 |
Synthetic Speech with tag 4 | Synthetic Speech with tag 4 |

So that these arrangements can be made permanent without adversely affecting the operations of the Service's field offices. | And should not be relied upon to the detriment of the imaginative application of judgment in special cases. |
---|---|
GT with tag 4 | GT with tag 4 |
Synthetic Speech with tag 0 | Synthetic Speech with tag 0 |
Synthetic Speech with tag 1 | Synthetic Speech with tag 1 |
Synthetic Speech with tag 2 | Synthetic Speech with tag 2 |
Synthetic Speech with tag 3 | Synthetic Speech with tag 3 |
Synthetic Speech with tag 4 | Synthetic Speech with tag 4 |

In addition to the experiments mentioned in the paper, we also present a similar demo for
word-level prosody control,
which might be easier to understand.
Here, the controlled words can come from different leaf nodes.
We synthesize each of those words with each of the five prosody tags in its leaf node.
For each tag, we concatenate some words from the training recordings that have this
ground-truth tag.
These concatenated words are called reference words.
Listeners can verify that the controlled word in the right column has prosody similar
to that of the reference words on the left.
The mel-spectrograms are also shown to depict the prosodic variations.
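
The construction of reference words described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `build_references`, the `(tag, waveform)` input format, the cap of three words per tag, and the short silence inserted between words are all choices made for this sketch, not details given on this page. The default sampling rate of 22050 Hz matches LJSpeech.

```python
from collections import defaultdict

import numpy as np


def build_references(segments, sample_rate=22050, gap_seconds=0.2, max_words=3):
    """Group word waveforms by prosody tag and concatenate up to
    `max_words` of them per tag, separated by a short silence.

    `segments` is an iterable of (tag, waveform) pairs (assumed format).
    """
    gap = np.zeros(int(sample_rate * gap_seconds), dtype=np.float32)
    by_tag = defaultdict(list)
    for tag, wave in segments:
        if len(by_tag[tag]) < max_words:
            by_tag[tag].append(np.asarray(wave, dtype=np.float32))

    references = {}
    for tag, waves in by_tag.items():
        pieces = []
        for i, wave in enumerate(waves):
            if i > 0:
                pieces.append(gap)  # silence between consecutive words
            pieces.append(wave)
        references[tag] = np.concatenate(pieces)
    return references


# Synthetic stand-ins for word-level audio segments cut from recordings.
segments = [(0, np.ones(100)), (1, np.ones(50)), (0, np.ones(200))]
refs = build_references(segments, sample_rate=100, gap_seconds=0.5)
```

Here `refs[0]` holds the two tag-0 words joined by a silence, and `refs[1]` holds the single tag-1 word; with real data, each entry would be played next to the synthesized word that uses the same tag.
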

And the respective responsibilities for any further investigation that may be required.

Tag | Reference Words | Synthesized Speech | Synthesized Mel-Spectrogram |
---|---|---|---|
0 | | | |
1 | | | |
2 | | | |
3 | | | |
4 | | | |

Detailed formal agreements embodying these arrangements should be worked out between the secret service and both of these agencies.

Tag | Reference Words | Synthesized Speech | Synthesized Mel-Spectrogram |
---|---|---|---|
0 | | | |
1 | | | |
2 | | | |
3 | | | |
4 | | | |

Since these agencies are already obliged constantly to evaluate the activities of such groups.

Tag | Reference Words | Synthesized Speech | Synthesized Mel-Spectrogram |
---|---|---|---|
0 | | | |
1 | | | |
2 | | | |
3 | | | |
4 | | | |

This is especially necessary with regards to the FBI and CIA.

Tag | Reference Words | Synthesized Speech | Synthesized Mel-Spectrogram |
---|---|---|---|
0 | | | |
1 | | | |
2 | | | |
3 | | | |
4 | | | |

Since the assassination, both the secret service and the FBI have recognized.

Tag | Reference Words | Synthesized Speech | Synthesized Mel-Spectrogram |
---|---|---|---|
0 | | | |
1 | | | |
2 | | | |
3 | | | |
4 | | | |

We present the model structure of our prosody extractor and prosody predictor here.
Both the extractor and the predictor are auto-regressive, i.e. the prosody embedding |
Prosody Extractor (image from paper) | Prosody Predictor |
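
As a rough illustration of what "auto-regressive" means for a sequence of prosody embeddings, the toy recurrence below computes each embedding from the current input feature and the previous embedding. It is not the architecture shown in the figures: the function name, the single `tanh` layer, the dimensions, and the random weights are all assumptions chosen only to make the dependency on earlier steps visible.

```python
import numpy as np


def autoregressive_embeddings(features, w_in, w_prev):
    """Toy auto-regressive loop: step t uses the input at t and the
    embedding produced at step t-1 (hypothetical stand-in model)."""
    prev = np.zeros(w_prev.shape[0])  # initial "previous" embedding
    outputs = []
    for feat in features:
        prev = np.tanh(feat @ w_in + prev @ w_prev)
        outputs.append(prev)
    return np.stack(outputs)


rng = np.random.default_rng(0)
w_in = rng.normal(size=(4, 3)) * 0.5    # input features -> embedding
w_prev = rng.normal(size=(3, 3)) * 0.5  # previous embedding -> embedding
features = rng.normal(size=(6, 4))      # six steps of made-up input features

out = autoregressive_embeddings(features, w_in, w_prev)

# Changing only the first input also changes later embeddings,
# because every step is conditioned on the one before it.
changed = features.copy()
changed[0] += 1.0
out_changed = autoregressive_embeddings(changed, w_in, w_prev)
```
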
@inproceedings{guo2022unsupervised,