LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

Yiwei Guo¹,², Zhihan Li¹,², Chenpeng Du¹,², Hankun Wang¹,², Xie Chen¹,², Kai Yu¹,²

¹X-LANCE Lab, MoE Key Lab of Artificial Intelligence,
School of Computer Science, Shanghai Jiao Tong University, China
²Jiangsu Key Lab of Language Computing, China

Accepted to Interspeech 2025

[Paper][Code]

Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec that has both a low bitrate and speaker decoupling ability. LSCodec adopts a multi-stage unsupervised training framework with a speaker perturbation technique. A continuous information bottleneck is first established, followed by vector quantization that produces a discrete speaker-decoupled space. A discrete token vocoder finally refines acoustic details from LSCodec. In reconstruction evaluations, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and a smaller vocabulary size than the baselines. Voice conversion and speaker probing experiments demonstrate the excellent speaker disentanglement of LSCodec, and an ablation study verifies the effectiveness of the proposed training framework.

Speech Reconstruction

These are speech reconstruction samples from LibriTTS testset-B. All audio files are sampled at 24 kHz.
Notes:

WavTokenizer refers to the small-40Hz model (600× downsampling).
wav2vec 2.0 refers to the quantizer outputs of the Large model.
WavLM refers to k-means clustering (2048 centroids) on layer 24 of the Large model.
DAC (first VQ only) samples are hidden because of their poor quality.

Semantic tokens are reconstructed using the CTX-vec2wav^α vocoder described in Section 2.4.
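The bitrates quoted in the table headers can be sanity-checked with the usual formula for a single-codebook token stream: frame rate times log2 of the vocabulary size. The sketch below is only an illustration. The helper `bitrate_kbps` is a hypothetical name, the 4096-entry WavTokenizer codebook and the 50 Hz WavLM frame rate are assumptions not stated on this page, and the LSCodec vocabulary sizes are deliberately left out because they are not given here.

```python
import math


def bitrate_kbps(frame_rate_hz: float, vocab_size: int, num_codebooks: int = 1) -> float:
    """Bitrate of a discrete token stream in kbps (hypothetical helper)."""
    bits_per_frame = num_codebooks * math.log2(vocab_size)
    return frame_rate_hz * bits_per_frame / 1000.0


# WavTokenizer small-40Hz: 24 kHz audio with 600x downsampling gives 40 frames/s.
wavtokenizer_rate = 24000 / 600
# Assumption: a single 4096-entry codebook.
wavtokenizer_kbps = bitrate_kbps(wavtokenizer_rate, 4096)

# WavLM L24 km2048: 2048 centroids (from the notes above);
# assumption: 50 frames/s, i.e. a 20 ms hop.
wavlm_kbps = bitrate_kbps(50, 2048)

print(round(wavtokenizer_kbps, 2), round(wavlm_kbps, 2))
```

Under these assumptions the two values agree with the 0.48 kbps and 0.55 kbps figures listed for WavTokenizer and WavLM L24 km2048.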

Ground truth	LSCodec-50Hz (0.45kbps)	LSCodec-25Hz (0.25kbps)	TiCodec-1VQ (0.75kbps)	WavTokenizer (0.48kbps)	wav2vec 2.0 (0.83kbps)	WavLM L24 km2048 (0.55kbps)
"But," Cresswell added significantly, "capacity differs enormously between races."
We stood amazed, thunderstruck, at the presence of such a herd of marine monsters.
"Don't know; 'most everything she says sounds like the Bible or Shakespeare to me."
"There cannot be a doubt he received you kindly, for, in fact, you returned without his permission."
It is a fact of common observance that in this lower middle class there is no pretense of leisure on the part of the head of the household.

Ground truth	LSCodec-50Hz (0.45kbps)	LSCodec-25Hz (0.25kbps)	TiCodec-1VQ (0.75kbps)	WavTokenizer (0.48kbps)	wav2vec 2.0 (0.83kbps)	WavLM L24 km2048 (0.55kbps)
Hence the Edison electrolytic meter is no longer used, despite its excellent qualities.
Moreover, he was not in the situation or the surroundings which one is wont to associate with ducks.
She was so unreserved, it seemed, and yet in this directness there was something almost contemptuous.
But if he put the inference by without a smile it was also without irritation.
The music, hautboys, flutes, and viols, was delightfully descriptive of rural delights.

Any-to-Any Voice Conversion

These are any-to-any voice conversion samples. The test set is constructed by shuffling the testset-B prompts so that the source and target prompts belong to different speakers.
The test set can be found here. Every line is formatted as "source_utt target_prompt_utt".
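This page does not say how the shuffle was carried out. As a rough sketch only, one way to build such a list is to group utterances by speaker, randomize the order, and rotate the list by the size of the largest speaker group, which guarantees that every source is paired with a prompt from a different speaker. The function names and example IDs below are hypothetical, and the code assumes LibriTTS-style utterance IDs whose first underscore-separated field is the speaker ID.

```python
import random


def speaker_of(utt_id: str) -> str:
    # Assumption: IDs look like "<speaker>_<chapter>_<...>", speaker first.
    return utt_id.split("_")[0]


def make_vc_pairs(utt_ids, seed=0):
    """Return "source_utt target_prompt_utt" lines with mismatched speakers."""
    rng = random.Random(seed)
    by_speaker = {}
    for utt in utt_ids:
        by_speaker.setdefault(speaker_of(utt), []).append(utt)

    largest = max(len(group) for group in by_speaker.values())
    if 2 * largest > len(utt_ids):
        raise ValueError("one speaker owns more than half of the utterances")

    # Randomize speaker order and the order within each speaker,
    # keeping each speaker's utterances contiguous.
    speakers = list(by_speaker)
    rng.shuffle(speakers)
    ordered = []
    for spk in speakers:
        group = by_speaker[spk]
        rng.shuffle(group)
        ordered.extend(group)

    # Rotating by the largest group size moves every target out of the
    # source speaker's contiguous block.
    targets = ordered[largest:] + ordered[:largest]
    return [f"{src} {tgt}" for src, tgt in zip(ordered, targets)]


# Hypothetical utterance IDs, for illustration only.
example_ids = ["111_1_0001", "111_1_0002", "222_5_0001",
               "222_5_0002", "333_9_0001", "333_9_0002"]
for line in make_vc_pairs(example_ids):
    print(line)
```

Each utterance appears exactly once as a source and once as a target prompt, matching the one-pair-per-line format described above.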
Notes:

For FACodec, detail tokens are discarded for better VC performance.
Semantic tokens (wav2vec 2.0 and WavLM) are vocoded using the same method as in the previous experiment.
All samples are downsampled to 16 kHz for a fair comparison.

Source Utterance	Target Speaker	LSCodec-50Hz	LSCodec-25Hz	TiCodec-1VQ	FACodec	wav2vec 2.0	WavLM L24 km2048
"Why dost thou, tyrant, boast thyself, Thy wicked deeds to praise?"
The air and the earth are curiously mated and intermingled, as if the one were the breath of the other.
"There cannot be a doubt he received you kindly, for, in fact, you returned without his permission."
Rodolfo was now impatient to get rid of Leocadia, and made up his mind to lay her in the street, insensible as she was.
I had, however, promised to take tea in a friend's rooms, so I left the proof upon my desk.

Source Utterance	Target Speaker	LSCodec-50Hz	LSCodec-25Hz	TiCodec-1VQ	FACodec	wav2vec 2.0	WavLM L24 km2048
He is said to have baptized as many as ten thousand idolaters in one month.
The investors in the enterprise were ready and anxious to meet the extra cost of putting the wires underground.
But Miss Branwell persevered; urged economical motives; pressed on his love for his daughters.
And God created them according to the patterns or species of them which existed in the divine original.
He was a round faced, respectable appearing fellow, but his mood was distinctly unsociable.