Publications

Below are some selected publications. You can find a full list of my articles on my Google Scholar profile.

Conference Papers

AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions

Published in AAAI, 2026

This paper proposes masking a subset of attention heads in a large audio language model (LALM) to achieve reliable task specification without instructions, since selectively masking attention heads can trigger the model's task-specific functionalities.

Recommended citation: Yiwei Guo, Bohan Li, Hankun Wang, Zhihan Li, Shuai Wang, Xie Chen, Kai Yu. (2026). "AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions." In Proc. AAAI, 2026.

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

Published in ISCA Interspeech, 2025

This paper proposes LSCodec, a low-bitrate (50Hz/0.45kbps and 25Hz/0.25kbps, single codebook), speaker-decoupled discrete speech codec.

Recommended citation: Yiwei Guo, Zhihan Li, Chenpeng Du, Hankun Wang, Xie Chen, Kai Yu. (2025). "LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec." In Proc. ISCA Interspeech, 2025, pp. 5018-5022, doi: 10.21437/Interspeech.2025-1106.

VoiceFlow: Efficient text-to-speech with rectified flow matching

Published in IEEE ICASSP, 2024

This paper applies the rectified flow matching algorithm to improve the efficiency of TTS systems in the differential-equation family (e.g., diffusion and flow matching).

Recommended citation: Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu. (2024). "VoiceFlow: Efficient text-to-speech with rectified flow matching." In Proc. IEEE ICASSP, 2024, pp. 11121-11125.

UniCATS: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding

Published in AAAI, 2024

This paper proposes a context-aware TTS system with strong zero-shot TTS and speech editing abilities, built on a contextual token vocoder (CTX-vec2wav) and a discrete diffusion-based model (CTX-txt2vec).

Recommended citation: Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, Kai Yu. (2024). "UniCATS: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding." In Proc. AAAI, 2024, vol. 38, no. 16, pp. 17924-17932.

EmoDiff: Intensity controllable emotional text-to-speech with soft-label guidance

Published in IEEE ICASSP, 2023

This paper designs an emotion intensity-controllable TTS model using a new soft-label guidance algorithm in the diffusion paradigm.

Recommended citation: Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu. (2023). "EmoDiff: Intensity controllable emotional text-to-speech with soft-label guidance." In Proc. IEEE ICASSP, 2023.

VQTTS: High-fidelity text-to-speech synthesis with self-supervised VQ acoustic feature

Published in ISCA Interspeech, 2022

This paper is the first to successfully integrate discrete self-supervised learning (SSL) features into TTS, producing a competitive high-fidelity system.

Recommended citation: Chenpeng Du, Yiwei Guo, Xie Chen, Kai Yu. (2022). "VQTTS: High-fidelity text-to-speech synthesis with self-supervised VQ acoustic feature." In Proc. ISCA Interspeech, 2022, pp. 1596-1600.

Unsupervised word-level prosody tagging for controllable speech synthesis

Published in IEEE ICASSP, 2022

This paper aims to enhance word-level prosody controllability in TTS models through decision tree-based clustering.

Recommended citation: Yiwei Guo, Chenpeng Du, Kai Yu. (2022). "Unsupervised word-level prosody tagging for controllable speech synthesis." In Proc. IEEE ICASSP, 2022, pp. 7597-7601.

Journal Articles

Recent Advances in Discrete Speech Tokens: A Review

Published in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

This review provides a comprehensive, in-depth summary and analysis of recent discrete speech tokenization methods.

Recommended citation: Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, Kai Yu. (2025). "Recent Advances in Discrete Speech Tokens: A Review." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

Speaker adaptive text-to-speech with timbre-normalized vector-quantized feature

Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023

This paper proposes TN-VQTTS, which leverages timbre-normalized vector-quantized acoustic features for TTS speaker adaptation with little data.

Recommended citation: Chenpeng Du, Yiwei Guo, Xie Chen, Kai Yu. (2023). "Speaker adaptive text-to-speech with timbre-normalized vector-quantized feature." IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, vol. 31, pp. 3446-3456.