Audio Samples

from VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching, Submitted to ICASSP2024

Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu

yiwei.guo@sjtu.edu.cn

Shanghai Jiao Tong University, China

code link



1. Comparison with Diffusion-based TTS

This part presents the quality of the compared systems, corresponding to Table 1 in the paper. Note that all the audio on this page was synthesized with the HifiGAN vocoder implemented in this repo. The sampling rate of the audio files is 16 kHz.

GT (voc.): Ground truth mel-spectrogram + HifiGAN.
GradTTS: Proposed in this paper. We train it with the official code on our data configurations.
VoiceFlow: Proposed rectified flow matching TTS model. It has the same model architecture as GradTTS, and hence a nearly identical inference cost.
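To make the "2 steps / 10 steps / 100 steps" columns below concrete, here is a minimal sketch of fixed-step Euler sampling, assuming the sampler integrates an ODE from t=0 to t=1 and calls the model once per step. The names `euler_sample` and `CountingField` are hypothetical, and the toy field is a stand-in for a trained network, not the VoiceFlow or GradTTS model.

```python
class CountingField:
    """Hypothetical toy vector field that counts how often it is evaluated."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        return 1.0 - x  # toy dynamics: pull x toward 1.0


def euler_sample(field, x0, n_steps):
    """Integrate dx/dt = field(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x = x0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * field(x, t)  # one field evaluation per step
    return x


# Run the same sampler at the three step counts used on this page.
results = {}
for n in (2, 10, 100):
    field = CountingField()
    x_final = euler_sample(field, 0.0, n)
    results[n] = (x_final, field.calls)
    print(f"{n:>3} steps: x(1) = {x_final:.4f}, field evaluations = {field.calls}")
```

In this sketch the number of field evaluations equals the number of steps, so two models with the same architecture cost about the same at a given step count, and fewer steps means proportionally less computation.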

LJSpeech Samples
and recognized as one of the frequenters of the bogus law-stationers. His arrest led to that of others.
GT (voc.)
2 steps 10 steps 100 steps
GradTTS
VoiceFlow
His disappearance gave color and substance to evil reports already in circulation that the will and conveyance above referred to.
GT (voc.)
2 steps 10 steps 100 steps
GradTTS
VoiceFlow
On occasion the Secret Service has been permitted to have an agent riding in the passenger compartment with the President.
GT (voc.)
2 steps 10 steps 100 steps
GradTTS
VoiceFlow
and she must have run down the stairs ahead of Oswald and would probably have seen or heard him.
GT (voc.)
2 steps 10 steps 100 steps
GradTTS
VoiceFlow
accordingly they committed to him the command of their whole army, and put the keys of their city into his hands.
GT (voc.)
2 steps 10 steps 100 steps
GradTTS
VoiceFlow
Chapter seven. Lee Harvey Oswald: Background and Possible Motives, Part one.
GT (voc.)
2 steps 10 steps 100 steps
GradTTS
VoiceFlow

LibriTTS Samples
"All the same, Freddy, I am glad you dropped in: I won't forget it."
GT (voc.)
2 steps 10 steps 100 steps
GradTTS
VoiceFlow
Suddenly he withdrew his arm, turned quickly to the window and stood facing it. Another moment passed.
GT (voc.)
2 steps 10 steps 100 steps
GradTTS
VoiceFlow
"Yes, yes, mamma!" Snegiryov suddenly recollected, "they'll take away the bed, they'll take it away," he added as though alarmed that they really would.
GT (voc.)
2 steps 10 steps 100 steps
GradTTS
VoiceFlow
I said sharply, "Don't be simple, Moa!" I shook off her grip.
GT (voc.)
2 steps 10 steps 100 steps
GradTTS
VoiceFlow
I am absolutely confident about it, and I believe you are too.
GT (voc.)
2 steps 10 steps 100 steps
GradTTS
VoiceFlow
Then, when the decreasing speed of the train gave his legs the advantage, Fuller was ahead, heaving ties from the road.
GT (voc.)
2 steps 10 steps 100 steps
GradTTS
VoiceFlow


2. Comparison with only flow matching (no rectified flow)

Here we explore the impact of rectified flow in VoiceFlow. This corresponds to Section 4.4 in the paper.
Note that all samples are synthesized with 2 sampling steps.
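As a rough intuition for why trajectory shape matters at 2 steps, the toy sketch below compares Euler integration along a straight path and a curved path that end at the same point. It assumes the commonly described goal of rectified flow, which is to make sampling trajectories straighter; both fields and all names here are hypothetical one-dimensional examples, not the models behind the samples.

```python
import math


def euler(field, x0, n_steps):
    """Fixed-step Euler integration of dx/dt = field(x, t) over t in [0, 1]."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x += dt * field(x, i * dt)
    return x


X0, X1 = 0.0, 1.0  # toy start point and target end point


def straight_field(x, t):
    # Constant velocity: the exact path from X0 to X1 is a straight line in t.
    return X1 - X0


def curved_field(x, t):
    # Time-varying velocity: the exact path is x(t) = sin(pi * t / 2),
    # which also ends at X1 = 1 but bends along the way.
    return (math.pi / 2) * math.cos(math.pi * t / 2)


err_straight_2 = abs(euler(straight_field, X0, 2) - X1)
err_curved_2 = abs(euler(curved_field, X0, 2) - X1)
err_curved_100 = abs(euler(curved_field, X0, 100) - X1)

print(f"straight path,   2 steps: error = {err_straight_2:.4f}")
print(f"curved path,     2 steps: error = {err_curved_2:.4f}")
print(f"curved path,   100 steps: error = {err_curved_100:.4f}")
```

With a constant velocity, Euler integration reaches the end point exactly regardless of step count, while the curved path needs many more steps to get close. This is only an analogy for the comparison in the table that follows.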
Text VoiceFlow VoiceFlow w/o ReFlow
and recognized as one of the frequenters of the bogus law-stationers. His arrest led to that of others.
His disappearance gave color and substance to evil reports already in circulation that the will and conveyance above referred to.
On occasion the Secret Service has been permitted to have an agent riding in the passenger compartment with the President.
and she must have run down the stairs ahead of Oswald and would probably have seen or heard him.
accordingly they committed to him the command of their whole army, and put the keys of their city into his hands.
Chapter seven. Lee Harvey Oswald: Background and Possible Motives, Part one.
Text VoiceFlow VoiceFlow w/o ReFlow
"All the same, Freddy, I am glad you dropped in: I won't forget it."
Suddenly he withdrew his arm, turned quickly to the window and stood facing it. Another moment passed.
"Yes, yes, mamma!" Snegiryov suddenly recollected, "they'll take away the bed, they'll take it away," he added as though alarmed that they really would.
I said sharply, "Don't be simple, Moa!" I shook off her grip.
I am absolutely confident about it, and I believe you are too.
Then, when the decreasing speed of the train gave his legs the advantage, Fuller was ahead, heaving ties from the road.




Citation


@INPROCEEDINGS{guo2024voiceflow,
   author={Guo, Yiwei and Du, Chenpeng and Ma, Ziyang and Chen, Xie and Yu, Kai},
   booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
   title={{VoiceFlow}: Efficient Text-To-Speech with Rectified Flow Matching},
   year={2024},
   volume={},
   number={},
   pages={11121-11125},
   keywords={Signal processing algorithms;Signal processing;Acoustics;Mathematical models;Vectors;Trajectory;Speech processing;Text-to-speech;flow matching;rectified flow;efficiency;speed-quality tradeoff},
   doi={10.1109/ICASSP48485.2024.10445948}
}