vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

Yiwei Guo1, Zhihan Li1, Junjie Li1, Chenpeng Du1, Hankun Wang1, Shuai Wang2, Xie Chen1, Kai Yu1

1MoE Key Lab of Artificial Intelligence, AI Institute
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University
Shanghai, China
2Shenzhen Research Institute of Big Data, Shenzhen, China

Submitted to ICASSP 2025

📧 Email: yiwei.guo@sjtu.edu.cn

[Paper] [Code]

Try online Interactive Demo at 🤗Huggingface Space!

We propose a new speech discrete token vocoder, vec2wav 2.0, that advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To compensate for the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Moreover, no supervised data is required for vec2wav 2.0 to be effectively trained. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines by a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effectiveness of the proposed techniques. Furthermore, vec2wav 2.0 achieves competitive cross-lingual VC even when trained only on a monolingual corpus, pushing the frontiers of VC.
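The original Snake activation (as used in BigVGAN) is f(x) = x + (1/α) sin²(αx) with a learned α per channel. A minimal sketch of how an adaptive variant could instead derive α from a timbre embedding, assuming a simple linear projection (the function and parameter names here are illustrative, not the paper's implementation):

```python
import math

def snake(x, alpha):
    # Snake activation: f(x) = x + (1/alpha) * sin^2(alpha * x)
    return x + (1.0 / alpha) * math.sin(alpha * x) ** 2

def adaptive_snake(x, timbre_embedding, weights, bias):
    # Hypothetical adaptive variant: alpha is predicted from a timbre
    # (speaker) embedding instead of being a fixed learned parameter.
    alpha = bias + sum(w * e for w, e in zip(weights, timbre_embedding))
    alpha = math.exp(alpha)  # keep alpha strictly positive
    return snake(x, alpha)

# With a zero projection, alpha = exp(0) = 1 and the function reduces
# to the plain Snake: f(x) = x + sin^2(x).
y = adaptive_snake(0.5, [0.0, 0.0], [0.0, 0.0], 0.0)
```

The key point is that the nonlinearity's frequency parameter becomes a function of the reference speaker, so timbre information enters the waveform reconstruction at every activation rather than only at a conditioning input.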

English Any-to-Any VC

We conduct English any-to-any VC comparisons using the unseen speakers in the LibriTTS test-clean split. We randomly select 10 speakers, choosing 2 utterances from each as source utterances. Another 10 speakers are selected as target speakers, each with one 3-second reference utterance. This yields a test set of 200 any-to-any VC cases.
The test set metadata can be found here. Each line of this file has the form "{source}_to_{target} {target}".
To enable a fair comparison between models, we downsample all audio to 16 kHz (although vec2wav 2.0 generates 24 kHz audio).
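A metadata line in this format can be split back into its source and target IDs as sketched below (the utterance IDs in the example are illustrative, not taken from the actual file):

```python
def parse_vc_pair(line):
    # Each metadata line is "{source}_to_{target} {target}": the first
    # field names the conversion case, the second the reference utterance.
    name, target = line.strip().split()
    source, sep, name_target = name.partition("_to_")
    assert sep == "_to_" and name_target == target
    return source, target

src, tgt = parse_vc_pair("1089_134686_000001_to_121_121726 121_121726")
# src == "1089_134686_000001", tgt == "121_121726"
```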

Audio samples (source utterance, target speaker, converted outputs) compare vec2wav 2.0 against YourTTS, DiffVC, Diff-HierVC, UUVC, and FACodec on the following source utterances:
- If a layman in giving baptism pour the water before saying the words is the child baptized?
- All this I could hardly explain to him, as I should thus be giving to him the strongest evidence against my own philosophy.
- No class of society, not even the most abjectly poor, forgoes all customary conspicuous consumption.
- Every scheme known to politicians must be worked, and none knew the intricate game better than Hopkins.
- The music, hautboys, flutes, and viols, was delightfully descriptive of rural delights.
- When one kidney ceases to act from disease, the other increases in size, and does double work.
- The music, hautboys, flutes, and viols, was delightfully descriptive of rural delights.
- "But it is not with a view to distinction that you should cultivate this talent, if you consult your own happiness."
- Every minister should make much of his calling and impress upon others the fact that he has been delegated by God to preach the Gospel.
- Every scheme known to politicians must be worked, and none knew the intricate game better than Hopkins.

Cross-Lingual Any-to-Any VC

We use the same set of English source utterances as in the English VC tests, but convert to target speakers in other languages. We collect reference utterances in five languages (Spanish, German, Dutch, Italian, French) from MLS. The test set is the full combination of source and target utterances, which again yields 200 pairs.
The test set metadata can be found here. Each line of this file has the form "english_{source}_to_{language}_{target} {language}_{target}", where "source" and "target" refer to LibriTTS and MLS utterance IDs respectively.

Audio samples (source utterance, target speaker, converted outputs) compare vec2wav 2.0 against YourTTS and Diff-HierVC on the following source utterances and target languages:
- Emerald and black and russet and olive, it moved beneath the current, swaying and turning. (target: Dutch)
- My father is a clergyman of limited, though competent income, and I am the eldest of his children. (target: Italian)
- But Algiers, like all the other coast towns, had apparently been absorbed into the bowels of the earth. (target: Spanish)
- When one kidney ceases to act from disease, the other increases in size, and does double work. (target: German)
- This was a pretty little speech enough, and received the eager compliments of the officers of the John Bright. (target: French)
- If a layman in giving baptism pour the water before saying the words is the child baptized? (target: French)
- As for the rest of mankind, stagnating night rests upon them. (target: German)
- All this I could hardly explain to him, as I should thus be giving to him the strongest evidence against my own philosophy. (target: Dutch)
- "He was soft hearted and impetuous," said Beth; "and, being in love, he didn't stop to count the cost." (target: Spanish)
- He was dressed in a tunic of flowers, which set off his graceful and well formed figure to advantage. (target: Italian)

Ablation Study

Apart from vq-wav2vec, we train CTX-vec2wav and vec2wav 2.0 on HuBERT tokens and wav2vec 2.0 tokens.
We additionally train two variants of vec2wav 2.0 on vq-wav2vec inputs: vec2wav 2.0-ab1, which replaces the adaptive Snake activations in BigVGAN with the original Snake activations; and vec2wav 2.0-ab2, which further replaces BigVGAN with HiFi-GAN. Thus the comparison between vec2wav 2.0 and "ab1" isolates the effect of the adaptive Snake activation, while that between CTX-vec2wav and "ab2" shows the difference made by the prompt features and modules.
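Token types such as "HuBERT layer 24, k-means 2048" are obtained by assigning each frame's SSL feature vector to its nearest codebook centroid. A toy sketch of that assignment step (2-D features and 3 centroids instead of the paper's 1024-D features and 2048 clusters; all values illustrative):

```python
def nearest_centroid(frame, codebook):
    # Map one feature frame to the index of its nearest centroid
    # (squared Euclidean distance), yielding a discrete token.
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(frame, codebook[i]))

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, 0.1), (0.9, 0.2)]
tokens = [nearest_centroid(f, codebook) for f in frames]
# tokens == [0, 1]
```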
An ideal VC system would lie at the top right of the following chart.
Source transcript: Emerald and black and russet and olive, it moved beneath the current, swaying and turning.
Audio samples cover CTX-vec2wav and vec2wav 2.0 on three token types: HuBERT-Large (layer 24, k-means 2048), wav2vec 2.0 Large (quantizer output), and vq-wav2vec (quantizer output).

Pitch Contour: Case Study

We present the pitch contours of a conversion sample (LibriTTS 2830_3980_000021_000000 as source, 237_134493_000015_000004 as target).
Audio samples: source utterance, target speaker, and converted sample.
Source transcript: Every minister should make much of his calling and impress upon others the fact that he has been delegated by God to preach the Gospel.

Citing our Work

@misc{guo2024vec2wav20advancingvoice,
	title={vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders}, 
	author={Yiwei Guo and Zhihan Li and Junjie Li and Chenpeng Du and Hankun Wang and Shuai Wang and Xie Chen and Kai Yu},
	year={2024},
	eprint={2409.01995},
	archivePrefix={arXiv},
	primaryClass={eess.AS},
	url={https://arxiv.org/abs/2409.01995}, 
}