LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

Yiwei Guo1,2, Zhihan Li1,2, Chenpeng Du1,2, Hankun Wang1,2, Xie Chen1,2, Kai Yu1,2

1X-LANCE Lab, MoE Key Lab of Artificial Intelligence,
School of Computer Science, Shanghai Jiao Tong University, China
2Jiangsu Key Lab of Language Computing, China

Accepted to Interspeech 2025

[Paper]

Although discrete speech tokens have shown strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec that combines low bitrate with speaker decoupling. LSCodec adopts a multi-stage unsupervised training framework with a speaker perturbation technique. A continuous information bottleneck is first established, followed by vector quantization that produces a discrete speaker-decoupled space. Finally, a discrete token vocoder refines acoustic details from LSCodec. In reconstruction evaluations, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and a smaller vocabulary size than baselines. Voice conversion and speaker probing experiments demonstrate the excellent speaker disentanglement of LSCodec, and an ablation study verifies the effectiveness of the proposed training framework.

Speech Reconstruction

These are speech reconstruction samples from LibriTTS testset-B. All audio is sampled at 24 kHz.
Notes:

  • WavTokenizer refers to the small-40Hz version (600× downsampling).
  • wav2vec 2.0 refers to the quantizer outputs of the Large version.
  • WavLM refers to k-means tokens (2048 centroids) from layer 24 of the Large version.
  • DAC (first VQ only) samples are hidden because of their poor quality.
Semantic tokens are reconstructed using the CTX-vec2wavα vocoder described in Section 2.4 of the paper.

Ground truth LSCodec-50Hz (0.45kbps) LSCodec-25Hz (0.25kbps) TiCodec-1VQ (0.75kbps) WavTokenizer (0.48kbps) wav2vec 2.0 (0.83kbps) WavLM L24 km2048 (0.55kbps)
"But," Cresswell added significantly, "capacity differs enormously between races."
We stood amazed, thunderstruck, at the presence of such a herd of marine monsters.
"Don't know; 'most everything she says sounds like the Bible or Shakespeare to me."
"There cannot be a doubt he received you kindly, for, in fact, you returned without his permission."
It is a fact of common observance that in this lower middle class there is no pretense of leisure on the part of the head of the household.
Ground truth LSCodec-50Hz (0.45kbps) LSCodec-25Hz (0.25kbps) TiCodec-1VQ (0.75kbps) WavTokenizer (0.48kbps) wav2vec 2.0 (0.83kbps) WavLM L24 km2048 (0.55kbps)
Hence the Edison electrolytic meter is no longer used, despite its excellent qualities.
Moreover, he was not in the situation or the surroundings which one is wont to associate with ducks.
She was so unreserved, it seemed, and yet in this directness there was something almost contemptuous.
But if he put the inference by without a smile it was also without irritation.
The music, hautboys, flutes, and viols, was delightfully descriptive of rural delights.
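
The bitrates in the column headers above can be sanity-checked with the usual relation for token streams: bitrate = frame rate × number of codebooks × log2(vocabulary size). The sketch below is illustrative only. The function names are hypothetical, and the 50 Hz frame rate used for the WavLM tokens is an assumption that is not stated on this page. The 24 kHz sampling rate, the 40 Hz WavTokenizer rate, the 2048 centroids, and the LSCodec bitrates are taken from the text above.

```python
import math


def bitrate_bps(frame_rate_hz, vocab_size, num_codebooks=1):
    """Bits per second of a token stream: frames/s x codebooks x bits per token."""
    return frame_rate_hz * num_codebooks * math.log2(vocab_size)


def bits_per_frame(bitrate_in_bps, frame_rate_hz):
    """Bits spent on each frame, given a bitrate and a frame rate."""
    return bitrate_in_bps / frame_rate_hz


def downsample_factor(sample_rate_hz, frame_rate_hz):
    """Number of waveform samples covered by one token frame."""
    return sample_rate_hz // frame_rate_hz


# WavLM k-means tokens with 2048 centroids; 50 Hz is an ASSUMED frame rate.
wavlm_bps = bitrate_bps(frame_rate_hz=50, vocab_size=2048)

# WavTokenizer small-40Hz on 24 kHz audio.
wavtokenizer_factor = downsample_factor(24000, 40)

# Bits per frame implied by the listed LSCodec bitrates (0.45 and 0.25 kbps).
lscodec_50hz_bits = bits_per_frame(450, 50)
lscodec_25hz_bits = bits_per_frame(250, 25)

print(wavlm_bps, wavtokenizer_factor, lscodec_50hz_bits, lscodec_25hz_bits)
```

Under these assumptions the WavLM figure comes out at 550 bps, matching the listed 0.55kbps, and the WavTokenizer factor matches the 600× noted above. The listed LSCodec bitrates correspond to 9 and 10 bits per frame at 50 Hz and 25 Hz respectively.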

Any-to-Any Voice Conversion

These are any-to-any voice conversion samples. The test set is constructed by shuffling the testset-B prompts so that the source and target prompts belong to different speakers.

Source Utterance Target Speaker LSCodec-50Hz LSCodec-25Hz TiCodec-1VQ FACodec wav2vec 2.0 WavLM L24 km2048
"Why dost thou, tyrant, boast thyself, Thy wicked deeds to praise?"
The air and the earth are curiously mated and intermingled, as if the one were the breath of the other.
"There cannot be a doubt he received you kindly, for, in fact, you returned without his permission."
Rodolfo was now impatient to get rid of Leocadia, and made up his mind to lay her in the street, insensible as she was.
I had, however, promised to take tea in a friend's rooms, so I left the proof upon my desk.
Source Utterance Target Speaker LSCodec-50Hz LSCodec-25Hz TiCodec-1VQ FACodec wav2vec 2.0 WavLM L24 km2048
He is said to have baptized as many as ten thousand idolaters in one month.
The investors in the enterprise were ready and anxious to meet the extra cost of putting the wires underground.
But Miss Branwell persevered; urged economical motives; pressed on his love for his daughters.
And God created them according to the patterns or species of them which existed in the divine original.
He was a round faced, respectable appearing fellow, but his mood was distinctly unsociable.
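
The test-set construction described at the top of this section (shuffling prompts so that source and target speakers differ) can be sketched as follows. This is a minimal illustration, not necessarily the procedure used for these samples: the function name, the (utterance ID, speaker ID) tuples, and the shuffle-and-retry strategy are all assumptions made for the example.

```python
import random


def pair_with_other_speakers(utterances, seed=0, max_tries=1000):
    """Pair each source utterance with a target prompt from a different speaker.

    `utterances` is a list of (utterance_id, speaker_id) tuples (hypothetical
    format). The targets are a shuffled copy of the same list; the shuffle is
    repeated until no source/target pair shares a speaker.
    """
    rng = random.Random(seed)
    targets = list(utterances)
    for _ in range(max_tries):
        rng.shuffle(targets)
        # Accept the shuffle only if every pair has two different speakers.
        if all(src[1] != tgt[1] for src, tgt in zip(utterances, targets)):
            return list(zip(utterances, targets))
    raise ValueError("no valid pairing found; one speaker may dominate the set")


# Toy data: six prompts from three speakers (made-up IDs).
prompts = [
    ("utt1", "spkA"), ("utt2", "spkA"),
    ("utt3", "spkB"), ("utt4", "spkB"),
    ("utt5", "spkC"), ("utt6", "spkC"),
]
pairs = pair_with_other_speakers(prompts)
for source, target in pairs:
    print(source[0], "->", target[0])
```

Retrying the shuffle is simple and adequate for small test sets with many speakers. It fails by design when one speaker owns more than half of the prompts, since no valid pairing exists in that case.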

Stage-wise Comparisons

Because LSCodec is trained in three sequential stages, we include a case study here to show the effect of each stage.
Source Utterance Target Speaker
But that wise and placid woman understood the sweet rebel a great deal better than Ruth understood herself.
Task Model S1 (VAE) S2 (VQ-VAE) S3 (Vocoder)
Reconstruction 50Hz (0.45kbps)
Reconstruction 25Hz (0.25kbps)
Voice Conversion 50Hz (0.45kbps)
Voice Conversion 25Hz (0.25kbps)

Citing our Work

@article{guo2024lscodec,
	author={Yiwei Guo and Zhihan Li and Chenpeng Du and Hankun Wang and Xie Chen and Kai Yu},
	title={{LSCodec}: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec},
	journal={arXiv preprint arXiv:2410.15764},
	year={2024},
}