Talk at the University of Cambridge: Reducing Speaker and Temporal Redundancy in Discrete Speech Tokenization
Date:

Discrete speech tokens have emerged as a fundamental representation for many downstream speech processing tasks, particularly speech generation. However, most existing tokens encode dense, fixed-rate acoustic information, which introduces substantial redundancy and limits their efficiency. In this talk, I will first briefly review the taxonomy of current discrete speech tokens, then present our work on reducing this redundancy in two directions:
(1) Speaker timbre disentanglement, introducing a low-bitrate, single-codebook, speaker-decoupled speech codec.
(2) Variable-rate temporal compression, exploring methods that dynamically adjust the frame rate of discrete tokens for greater compactness and a better bitrate-performance trade-off.
Together, these efforts highlight pathways toward more efficient and controllable discrete speech representations, paving the way for the next generation of speech technologies.
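
As a rough illustration of the quantities involved (not the methods presented in the talk), the sketch below computes the bitrate of a fixed-rate tokenizer as frame rate × number of codebooks × log2(codebook size), and shows one very simple form of variable-rate compression: merging runs of identical consecutive tokens. The function names, the frame rates, the codebook sizes, and the token sequence are all hypothetical values chosen for the example.

```python
import math
from itertools import groupby


def bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second of a fixed-rate discrete tokenizer.

    Each frame emits one index per codebook, and each index costs
    log2(codebook_size) bits.
    """
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


def merge_repeats(tokens):
    """Collapse runs of identical consecutive tokens into (token, duration) pairs.

    This is only a toy stand-in for variable-rate tokenization; it is an
    assumption made for illustration, not the approach from the talk.
    """
    return [(tok, sum(1 for _ in run)) for tok, run in groupby(tokens)]


# Hypothetical configurations, for illustration only.
multi_codebook = bitrate_bps(frame_rate_hz=50, num_codebooks=8, codebook_size=1024)
single_codebook = bitrate_bps(frame_rate_hz=25, num_codebooks=1, codebook_size=8192)

# A made-up token sequence with temporal redundancy.
tokens = [7, 7, 7, 3, 3, 9, 7, 7]
merged = merge_repeats(tokens)

print(multi_codebook, single_codebook)
print(len(tokens), "frames ->", len(merged), "segments:", merged)
```

Note that a merged sequence also has to carry (or predict) the durations, so the saving in token count does not translate one-to-one into a saving in bits.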
