Although discrete speech tokens show strong potential for language-model-based speech generation, their high bandwidth and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec that is both low-bandwidth and speaker-decoupled. LSCodec adopts a three-stage training framework with a simple speaker perturbation technique. A continuous bottleneck is established first, followed by vector quantization that produces a discrete speaker-decoupled space. Finally, a discrete token vocoder refines the acoustic details. In reconstruction experiments, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and a smaller vocabulary size than the baselines. The 25Hz variant of LSCodec also achieves the lowest bitrate among codecs so far. Voice conversion evaluations confirm that LSCodec disentangles speaker information satisfactorily, and an ablation study verifies the effectiveness of the proposed training framework.
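As a rough illustration of how a single-codebook bitrate relates to frame rate and vocabulary size, the sketch below uses the standard relation bitrate = frame rate × log2(vocabulary size). The function names are hypothetical and not part of LSCodec. The 50 Hz frame rate assumed for the HuBERT k-means tokens (2048 clusters, as in "km2048") is an assumption based on the usual 20 ms frame shift.

```python
import math


def bitrate_kbps(frame_rate_hz: float, vocab_size: int, num_codebooks: int = 1) -> float:
    """Bitrate of a discrete token stream in kbps (1 kbps = 1000 bit/s)."""
    bits_per_frame = num_codebooks * math.log2(vocab_size)
    return frame_rate_hz * bits_per_frame / 1000.0


def bits_per_frame(kbps: float, frame_rate_hz: float) -> float:
    """Bits spent on each frame, given a bitrate in kbps and a frame rate in Hz."""
    return kbps * 1000.0 / frame_rate_hz


# Assumed 50 Hz frame rate with a 2048-entry k-means codebook:
# 50 * log2(2048) = 550 bit/s, i.e. the 0.55 kbps listed for HuBERT L24 km2048.
hubert_kbps = bitrate_kbps(50, 2048)

# Going the other way with the (rounded) figures listed for LSCodec:
lscodec_50hz_bits = bits_per_frame(0.45, 50)  # about 9 bits per frame
lscodec_25hz_bits = bits_per_frame(0.25, 25)  # about 10 bits per frame
```

Because the listed bitrates are rounded, the per-frame figures are only approximate and should not be read as the exact vocabulary sizes used by LSCodec.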
These are the speech reconstruction samples from LibriTTS testset-B. All audio files are 24kHz.
Ground truth | LSCodec-50Hz (0.45kbps) | LSCodec-25Hz (0.25kbps) | TiCodec-1VQ (0.75kbps) | WavTokenizer (0.48kbps) | wav2vec 2.0 (0.83kbps) | HuBERT L24 km2048 (0.55kbps) |
---|---|---|---|---|---|---|
"But," Cresswell added significantly, "capacity differs enormously between races." | | | | | | |
We stood amazed, thunderstruck, at the presence of such a herd of marine monsters. | | | | | | |
"Don't know; 'most everything she says sounds like the Bible or Shakespeare to me." | | | | | | |
"There cannot be a doubt he received you kindly, for, in fact, you returned without his permission." | | | | | | |
It is a fact of common observance that in this lower middle class there is no pretense of leisure on the part of the head of the household. | | | | | | |
Ground truth | LSCodec-50Hz (0.45kbps) | LSCodec-25Hz (0.25kbps) | TiCodec-1VQ (0.75kbps) | WavTokenizer (0.48kbps) | wav2vec 2.0 (0.83kbps) | HuBERT L24 km2048 (0.55kbps) |
---|---|---|---|---|---|---|
Hence the Edison electrolytic meter is no longer used, despite its excellent qualities. | | | | | | |
Moreover, he was not in the situation or the surroundings which one is wont to associate with ducks. | | | | | | |
She was so unreserved, it seemed, and yet in this directness there was something almost contemptuous. | | | | | | |
But if he put the inference by without a smile it was also without irritation. | | | | | | |
The music, hautboys, flutes, and viols, was delightfully descriptive of rural delights. | | | | | | |
These are the any-to-any voice conversion samples.
The test set is constructed by shuffling the testset-B prompts so that the source and target prompts belong
to different speakers.
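One way such a speaker-disjoint pairing could be built is sketched below. The page only states that the prompts were shuffled, so the retry-until-valid strategy, the function name `pair_with_other_speakers`, and the `(utterance_id, speaker_id)` input format are all assumptions made for illustration.

```python
import random


def pair_with_other_speakers(utterances, seed=0, max_tries=1000):
    """Pair each source utterance with a target prompt from a different speaker.

    utterances: list of (utterance_id, speaker_id) tuples (hypothetical format).
    Returns a list of (source_utterance_id, target_prompt_id) pairs in which
    every utterance is used exactly once as a target prompt.
    """
    rng = random.Random(seed)
    targets = list(utterances)
    for _ in range(max_tries):
        rng.shuffle(targets)
        # Accept the shuffle only if no source meets a prompt of its own speaker.
        if all(src[1] != tgt[1] for src, tgt in zip(utterances, targets)):
            return [(src[0], tgt[0]) for src, tgt in zip(utterances, targets)]
    raise RuntimeError("no speaker-disjoint pairing found within max_tries")


# Toy data: three speakers with two utterances each.
toy_set = [("a1", "A"), ("a2", "A"), ("b1", "B"),
           ("b2", "B"), ("c1", "C"), ("c2", "C")]
toy_pairs = pair_with_other_speakers(toy_set, seed=0)
```

Rejection sampling keeps the sketch short. It assumes no single speaker contributes more than half of the utterances, since otherwise no valid pairing exists and the function raises an error.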
Source Utterance | Target Speaker | LSCodec-50Hz | LSCodec-25Hz | TiCodec-1VQ | FACodec | wav2vec 2.0 | HuBERT L24 km2048 |
---|---|---|---|---|---|---|---|
"Why dost thou, tyrant, boast thyself, Thy wicked deeds to praise?" | | | | | | | |
The air and the earth are curiously mated and intermingled, as if the one were the breath of the other. | | | | | | | |
"There cannot be a doubt he received you kindly, for, in fact, you returned without his permission." | | | | | | | |
Rodolfo was now impatient to get rid of Leocadia, and made up his mind to lay her in the street, insensible as she was. | | | | | | | |
I had, however, promised to take tea in a friend's rooms, so I left the proof upon my desk. | | | | | | | |
Source Utterance | Target Speaker | LSCodec-50Hz | LSCodec-25Hz | TiCodec-1VQ | FACodec | wav2vec 2.0 | HuBERT L24 km2048 |
---|---|---|---|---|---|---|---|
He is said to have baptized as many as ten thousand idolaters in one month. | | | | | | | |
The investors in the enterprise were ready and anxious to meet the extra cost of putting the wires underground. | | | | | | | |
But Miss Branwell persevered; urged economical motives; pressed on his love for his daughters. | | | | | | | |
And God created them according to the patterns or species of them which existed in the divine original. | | | | | | | |
He was a round faced, respectable appearing fellow, but his mood was distinctly unsociable. | | | | | | | |
Source Utterance | Target Speaker |
---|---|
But that wise and placid woman understood the sweet rebel a great deal better than Ruth understood herself. | |
Task | Model | S1 (VAE) | S2 (VQ-VAE) | S3 (Vocoder) |
---|---|---|---|---|
Reconstruction | 50Hz (0.45kbps) | | | |
Reconstruction | 25Hz (0.25kbps) | | | |
Voice Conversion | 50Hz (0.45kbps) | | | |
Voice Conversion | 25Hz (0.25kbps) | | | |
@article{guo2024lscodec,
author={Yiwei Guo and Zhihan Li and Chenpeng Du and Hankun Wang and Xie Chen and Kai Yu},
title={{LSCodec}: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec},
journal={arXiv preprint arXiv:2410.15764},
year={2024},
}