LSCodec: Low-Bandwidth and Speaker-Decoupled Discrete Speech Codec

Yiwei Guo, Zhihan Li, Chenpeng Du, Hankun Wang, Xie Chen, Kai Yu

MoE Key Lab of Artificial Intelligence, AI Institute
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University
Shanghai, China

Submitted to ICASSP 2025

📧 Email: yiwei.guo@sjtu.edu.cn

Although discrete speech tokens show strong potential for language model-based speech generation, their high bandwidth and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec that combines low bandwidth with speaker decoupling. LSCodec adopts a three-stage training framework with a simple speaker perturbation technique. A continuous bottleneck is first established, followed by vector quantization that produces a discrete, speaker-decoupled space. Finally, a discrete token vocoder refines the acoustic details. In reconstruction experiments, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and a smaller vocabulary size than the baselines. The 25Hz variant of LSCodec also achieves the lowest bitrate among codecs to date. Voice conversion evaluations demonstrate that LSCodec achieves satisfactory speaker disentanglement, and an ablation study verifies the effectiveness of the proposed training framework.
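As a rough illustration of the vector quantization step mentioned above, the sketch below maps each continuous frame to the index of its nearest entry in a single codebook. This is a generic nearest-neighbour quantizer, not LSCodec's actual implementation; the function name `quantize`, the 2-D toy codebook, and the frame values are all hypothetical.

```python
def quantize(frame, codebook):
    """Return the index of the codebook entry closest to `frame` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))


# Toy single codebook with 4 entries (hypothetical values, illustration only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Toy sequence of continuous bottleneck frames (also hypothetical).
frames = [(0.1, -0.2), (0.9, 0.2), (0.8, 0.9)]

# Each frame becomes one discrete token: the index of its nearest entry.
tokens = [quantize(f, codebook) for f in frames]
print(tokens)
```

With a single codebook, every frame is represented by exactly one integer, which is what keeps the token stream compact.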

Speech Reconstruction

These are the speech reconstruction samples from LibriTTS testset-B. All audio files are 24kHz.
Notes:

  • WavTokenizer refers to the small-40Hz model (600× downsampling).
  • wav2vec 2.0 refers to the quantizer outputs.
  • HuBERT refers to layer-24 features clustered by k-means with 2048 centroids.
  • EnCodec (1VQ) samples are hidden because of their poor quality.

Semantic tokens are reconstructed using the CTX-vec2wavα vocoder described in Section II.D of the paper.
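The bitrates quoted on this page for single-codebook tokenizers are consistent with the usual relation bitrate = frame rate × log2(vocabulary size). The sketch below applies that relation; the function names are illustrative, and the 50Hz frame rate used for HuBERT is an assumption based on its usual 20 ms hop rather than something stated here. The LSCodec vocabulary sizes are not given on this page, so only the implied bits per token are computed.

```python
import math


def bitrate_kbps(frame_rate_hz, vocab_size, num_codebooks=1):
    """Bitrate in kbps when each frame carries `num_codebooks` tokens."""
    return frame_rate_hz * num_codebooks * math.log2(vocab_size) / 1000.0


def bits_per_token(kbps, frame_rate_hz):
    """Bits carried by each token of a single-codebook tokenizer."""
    return kbps * 1000.0 / frame_rate_hz


# HuBERT with 2048 k-means centroids. The 50Hz frame rate is an assumption
# (HuBERT's usual 20 ms hop); it is not stated on this page.
hubert_kbps = bitrate_kbps(50, 2048)

# Bits per token implied by the bitrates listed for the two LSCodec variants.
lscodec_50hz_bits = bits_per_token(0.45, 50)
lscodec_25hz_bits = bits_per_token(0.25, 25)

print(f"HuBERT km2048: {hubert_kbps:.2f} kbps")
print(f"LSCodec-50Hz: {lscodec_50hz_bits:.1f} bits per token")
print(f"LSCodec-25Hz: {lscodec_25hz_bits:.1f} bits per token")
```

Under these assumptions the HuBERT figure agrees with the 0.55kbps listed for it, and the two LSCodec variants carry roughly 9 and 10 bits per token. Because the listed bitrates may be rounded, these per-token values are approximate.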

Ground truth | LSCodec-50Hz (0.45kbps) | LSCodec-25Hz (0.25kbps) | TiCodec-1VQ (0.75kbps) | WavTokenizer (0.48kbps) | wav2vec 2.0 (0.83kbps) | HuBERT L24 km2048 (0.55kbps)
Transcript: "But," Cresswell added significantly, "capacity differs enormously between races."
Transcript: We stood amazed, thunderstruck, at the presence of such a herd of marine monsters.
Transcript: "Don't know; 'most everything she says sounds like the Bible or Shakespeare to me."
Transcript: "There cannot be a doubt he received you kindly, for, in fact, you returned without his permission."
Transcript: It is a fact of common observance that in this lower middle class there is no pretense of leisure on the part of the head of the household.
Transcript: Hence the Edison electrolytic meter is no longer used, despite its excellent qualities.
Transcript: Moreover, he was not in the situation or the surroundings which one is wont to associate with ducks.
Transcript: She was so unreserved, it seemed, and yet in this directness there was something almost contemptuous.
Transcript: But if he put the inference by without a smile it was also without irritation.
Transcript: The music, hautboys, flutes, and viols, was delightfully descriptive of rural delights.

Any-to-Any Voice Conversion

These are the any-to-any voice conversion samples. The test set is constructed by shuffling the testset-B prompts so that the source and target prompts belong to different speakers.
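One simple way to build such a test set is to reshuffle the prompt list until no source utterance is paired with a prompt from its own speaker. The sketch below is only an illustration of that idea, not the authors' script; `make_vc_pairs`, the utterance IDs, and the speaker labels are hypothetical.

```python
import random


def make_vc_pairs(utterances, seed=0, max_tries=1000):
    """Pair each source utterance with a target prompt from a different speaker.

    `utterances` is a list of (utterance_id, speaker_id) tuples.
    """
    rng = random.Random(seed)
    targets = list(utterances)
    for _ in range(max_tries):
        rng.shuffle(targets)
        # Accept the shuffle only if every pair crosses speakers.
        if all(src[1] != tgt[1] for src, tgt in zip(utterances, targets)):
            return list(zip(utterances, targets))
    raise RuntimeError("no valid shuffle found; check the speaker distribution")


# Hypothetical utterance list: two utterances from each of three speakers.
utts = [
    ("utt1", "spkA"), ("utt2", "spkA"),
    ("utt3", "spkB"), ("utt4", "spkB"),
    ("utt5", "spkC"), ("utt6", "spkC"),
]
pairs = make_vc_pairs(utts)
for (src_id, src_spk), (tgt_id, tgt_spk) in pairs:
    print(f"{src_id} ({src_spk}) -> prompt {tgt_id} ({tgt_spk})")
```

Retrying the shuffle keeps the code short; it works well when no single speaker dominates the set, and the fixed seed makes the pairing reproducible.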

Source Utterance | Target Speaker | LSCodec-50Hz | LSCodec-25Hz | TiCodec-1VQ | FACodec | wav2vec 2.0 | HuBERT L24 km2048
Transcript: "Why dost thou, tyrant, boast thyself, Thy wicked deeds to praise?"
Transcript: The air and the earth are curiously mated and intermingled, as if the one were the breath of the other.
Transcript: "There cannot be a doubt he received you kindly, for, in fact, you returned without his permission."
Transcript: Rodolfo was now impatient to get rid of Leocadia, and made up his mind to lay her in the street, insensible as she was.
Transcript: I had, however, promised to take tea in a friend's rooms, so I left the proof upon my desk.
Transcript: He is said to have baptized as many as ten thousand idolaters in one month.
Transcript: The investors in the enterprise were ready and anxious to meet the extra cost of putting the wires underground.
Transcript: But Miss Branwell persevered; urged economical motives; pressed on his love for his daughters.
Transcript: And God created them according to the patterns or species of them which existed in the divine original.
Transcript: He was a round faced, respectable appearing fellow, but his mood was distinctly unsociable.

Stage-wise Comparisons

Because the training of LSCodec involves three stages, we present a case study here to show the effect of each stage of this sequential training.
Source Utterance | Target Speaker
Transcript: But that wise and placid woman understood the sweet rebel a great deal better than Ruth understood herself.

Task | Model | S1 (VAE) | S2 (VQ-VAE) | S3 (Vocoder)
Reconstruction | 50Hz (0.45kbps)
Reconstruction | 25Hz (0.25kbps)
Voice Conversion | 50Hz (0.45kbps)
Voice Conversion | 25Hz (0.25kbps)

Citing our Work


@article{guo2024lscodec,
	author={Yiwei Guo and Zhihan Li and Chenpeng Du and Hankun Wang and Xie Chen and Kai Yu},
	title={{LSCodec}: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec},
	journal={arXiv preprint arXiv:2410.15764},
	year={2024},
}