Information disentanglement-based self-supervised pre-trained large speech model
2025, pages 1-14
Received: 2024-10-07
Revised: 2025-02-23
Accepted: 2025-03-03
Published online: 2025-03-04
DOI: 10.11834/jig.240607
Objective
This paper investigates a pre-trained large speech model based on a speech information disentanglement strategy. The model is trained on massive unlabeled speech data to extract linguistic, paralinguistic, and non-linguistic information and to encourage the resulting representations to be mutually independent. It can thus provide complete and controllable speech information to downstream large language models and generative models, supporting the development of verbal interaction systems.
Method
This paper proposes an information disentanglement-based self-supervised speech representation learning model that exploits massive unlabeled data to achieve high-quality disentanglement of speech information. On top of an encoder-style self-supervised pre-training strategy, two lightweight modules are introduced to strengthen the extraction of prosody and speaker information. To prevent the already-extracted information from interfering with the learning of content information, the model removes it from the main branch in a residual manner, and the main branch is trained with a masked speech prediction mechanism so that its deep features remain strong on linguistic tasks. In this way, the model progressively extracts prosody, speaker, and content features from the input speech. By combining multi-layer features with adjustable weights, the model derives task-specific features suited to various downstream tasks. In addition, the proposed progressive decoder improves the adaptability of the pre-trained large model to speech generation tasks.
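As a concrete illustration, the progressive residual extraction described above can be sketched in a few lines of PyTorch. This is a minimal sketch under assumed settings, not the authors' released implementation: the module names, feature dimension, and layer counts below are hypothetical.

```python
import torch
import torch.nn as nn

class ResidualDisentangler(nn.Module):
    """Sketch of progressive residual extraction: lightweight heads predict
    prosody and speaker information, which is subtracted from the main branch
    before content encoding (hypothetical module names and sizes)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # Lightweight heads for prosody and speaker information.
        self.prosody_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.speaker_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Main branch: deep layers that would be trained with masked
        # speech prediction (HuBERT-style) to model content.
        self.content_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=12,
                                       batch_first=True),
            num_layers=6)

    def forward(self, x: torch.Tensor):
        # x: (batch, frames, dim) shallow encoder features.
        prosody = self.prosody_head(x)
        x = x - prosody            # residually remove prosody
        speaker = self.speaker_head(x)
        x = x - speaker            # residually remove speaker information
        content = self.content_encoder(x)  # deep content features
        return prosody, speaker, content
```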
Results
Experimental results show that the two model versions trained on different amounts of audio (Base and Large) hold clear advantages on speech recognition, speaker verification, emotion recognition, and emotional voice conversion tasks. Compared with HuBERT, the Base version improves accuracy on speech recognition, speaker verification, and emotion recognition by 5.65%, 13.02%, and 2.43%, respectively, and the Large version by 2.53%, 5.76%, and 1.78%. On the emotional voice conversion task, the proposed model outperforms the baseline models ConsistencyVC and wav2vec-vc, with improvements across speaker similarity, emotion similarity, word error rate, and perceptual quality metrics, further validating its effectiveness.
Conclusion
By integrating the idea of information disentanglement into a self-supervised pre-trained feature extraction large model, this work effectively improves the model's ability to analyze and reconstruct speech information, offering a new research perspective and a practical tool for large verbal interaction models.
Objective
With the advancement of technology, the demand for large models in speech interaction keeps growing. This paper presents a self-supervised pre-trained large model based on a speech information disentanglement technique, aiming to train a model that extracts linguistic, paralinguistic, and non-linguistic information from speech. The approach leverages vast amounts of unannotated data, enabling the model to learn effectively even in the absence of labels. By making the extracted speech representations mutually independent, downstream models can clearly distinguish between different types of information, which is crucial for enhancing the accuracy and controllability of speech processing. Specifically, the core of the disentanglement technique lies in effectively extracting and separating the different layers of information within speech signals. Linguistic information conveys the specific content, while paralinguistic information includes the speaker's emotions, intonation, and other nuances; non-linguistic information may encompass the speaker's physiological state, environmental sounds, or background noise. By modularizing these types of information, the model not only gains a better understanding of speech content but also allows these elements to be flexibly adjusted or replaced as needed. This provides comprehensive and detailed speech information for downstream language models and generative models, significantly enhancing their support for complex verbal interactions. Furthermore, the technique facilitates multi-task learning. In speech interaction, different application scenarios place different demands on speech information; through disentanglement, the model can adapt to these diverse requirements and achieve higher performance across multiple tasks such as speech recognition, emotion recognition, and speech synthesis. This not only offers flexibility for practical applications but also opens up new avenues for further research.
Method
To address this challenge, we propose an information disentanglement-based self-supervised speech representation learning model that effectively leverages vast amounts of unannotated data, achieving high-quality speech information disentanglement. Specifically, we build upon an encoder-style self-supervised learning (SSL) framework and introduce two lightweight specialized modules. These modules enhance the model's capacity to extract pitch variation and speaker identity from speech signals, which are crucial for achieving expressive and contextually rich speech generation. To ensure that the extracted pitch variation and speaker information do not interfere with the learning of content information, we employ a residual removal approach, systematically disentangling these components from the main processing branch. The main branch is then trained using HuBERT's speech masking prediction mechanism, which optimizes the deep layers of the encoder for superior performance in linguistic tasks. This method allows us to progressively extract and refine representations of pitch variation, speaker identity, and content from the input speech, thereby fostering a more nuanced understanding of the speech signal. Furthermore, we combine the diverse representations obtained from different layers, strategically adjusting their weights to generate task-specific representations tailored for various downstream speech processing applications. This flexibility is essential for effectively addressing the distinct requirements of tasks such as speech recognition, emotion recognition, and voice conversion. Additionally, we introduce a progressive decoder that builds upon these representations, enabling seamless execution of downstream speech generation tasks. This comprehensive approach not only enhances the model's adaptability and performance across multiple tasks but also paves the way for more sophisticated and context-aware speech interaction systems.
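The task-specific weighted combination of layer representations mentioned above can likewise be sketched as a learnable softmax-weighted sum over per-layer hidden states, in the spirit of SUPERB-style layer probing. The class below is an illustrative assumption, not the released code.

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learnable weighted sum over per-layer hidden states, producing a
    task-specific representation (illustrative sketch)."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of (batch, frames, dim) tensors, one per layer.
        w = torch.softmax(self.weights, dim=0)
        stacked = torch.stack(hidden_states, dim=0)  # (layers, batch, frames, dim)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)

# Usage: a speaker-verification head may learn to weight shallow layers,
# while an ASR head concentrates weight on the deep content layers.
layers = [torch.randn(2, 100, 768) for _ in range(12)]
fused = WeightedLayerSum(num_layers=12)(layers)      # (2, 100, 768)
```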
Results
Experimental results demonstrate that our proposed method shows significant advantages across various tasks, including speech recognition, speaker verification, speech enhancement, emotion recognition, and voice conversion. Notably, in the disentanglement-based emotion and voice conversion task, our model achieves substantial improvements in emotional similarity, speaker similarity, word accuracy, and audio quality ratings compared with the second-best model. These gains reflect the model's capability to effectively disentangle and manipulate the various components of speech information, contributing to natural and expressive speech output. This underscores the effectiveness of our approach in improving not only the quality of synthesized speech but also the controllability of emotional and speaker characteristics, ultimately paving the way for more sophisticated applications in human-computer interaction.
Conclusion
This work enhances the analysis and synthesis of speech information by incorporating information disentanglement into the pre-trained feature extraction model, enabling the model to clearly identify and manipulate aspects of speech such as linguistic content, emotional state, and speaker identity. Such clarity of speech features is crucial for advancing large models focused on verbal interaction. The approach also offers insights and practical tools applicable to a range of fields, from improving speech recognition to enriching emotional expressiveness in generated speech. It fosters more nuanced human-computer communication and encourages research into complex speech dynamics, potentially leading to innovations in personalized virtual assistants and adaptive language-learning tools, ultimately enriching user experiences in human-computer interaction.
Adigwe A, Tits N, Haddad K E, Ostadabbas S, Dutoit T. 2018. The Emotional Voices Database: towards controlling the emotion dimension in voice generation systems [EB/OL]. [2024-09-15]. https://arxiv.org/abs/1806.09514
Baevski A, Hsu W N, Xu Q, et al. 2022. Data2Vec: a general framework for self-supervised learning in speech, vision and language // International Conference on Machine Learning. PMLR: 1298-1312
Busso C, Bulut M, Lee C-C, Kazemzadeh A, Mower E, Kim S, Chang J N, Lee S, Narayanan S S. 2008. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42: 335-359 [DOI: 10.1007/s10579-008-9076-6]
Cao H, Cooper D G, Keutmann M K, Gur R C, Nenkova A, Verma R. 2014. CREMA-D: crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5(4): 377-390 [DOI: 10.1109/TAFFC.2014.2336244]
Chen S, Wang C, Chen Z, et al. 2022. WavLM: large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6): 1505-1518 [DOI: 10.1109/JSTSP.2022.3188113]
Défossez A, Copet J, Synnaeve G, et al. 2022. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438
Desplanques B, Thienpondt J, Demuynck K. 2020. ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification [EB/OL]. [2024-09-15]. https://arxiv.org/abs/2005.07143
Dupuis K, Pichora-Fuller M K. 2010. Toronto emotional speech set (TESS) - younger talker happy [DOI: 10.5683/SP2/E8H2MF]
Fujisaki H. 2004. Information, prosody, and modeling - with emphasis on tonal features of speech // Proceedings of Speech Prosody 2004 [DOI: 10.21437/speechprosody.2004-1]
Guo H, Liu C, Ishi C T, Ishiguro H. 2023. Using joint training speaker encoder with consistency loss to achieve cross-lingual voice conversion and expressive voice conversion // Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE: 1-8 [DOI: 10.1109/ASRU57964.2023.10389651]
Han B, Huang W, Chen Z, et al. 2023. Improving DINO-based self-supervised speaker verification with progressive cluster-aware training // 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW). IEEE: 1-5 [DOI: 10.1109/ICASSPW59220.2023.10192957]
Hsu W-N, Bolte B, Tsai Y H, Lakhotia K, Salakhutdinov R, Mohamed A. 2021. HuBERT: self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 3451-3460 [DOI: 10.1109/TASLP.2021.3122291]
James J, Tian L, Watson C. 2018. An open source emotional speech corpus for human robot interaction applications // Proceedings of INTERSPEECH 2018 [DOI: 10.21437/Interspeech.2018-1349]
Le Roux J, Wisdom S, Erdogan H, Hershey J R. 2019. SDR - half-baked or well done? // Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE: 626-630 [DOI: 10.1109/ICASSP.2019.8683855]
Li J, Li D, Savarese S, et al. 2023. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models // International Conference on Machine Learning. PMLR: 19730-19742 [DOI: 10.48550/arXiv.2301.12597]
Lim J, Kim K. 2024. Wav2vec-VC: voice conversion via hidden representations of wav2vec 2.0 // Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE: 10326-10330 [DOI: 10.1109/ICASSP48485.2024]
Liu A T, Li S W, Lee H. 2021. TERA: self-supervised learning of transformer encoder representation for speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 2351-2366 [DOI: 10.1109/TASLP.2021.3095662]
Liu T T, Liu Z, Chai Y J, Wang J, Wang Y Y. 2021. Agent affective computing in human-computer interaction. Journal of Image and Graphics, 26(12): 2767-2777 [DOI: 10.11834/jig.200498]
Livingstone S R, Russo F A. 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13(5): e0196391 [DOI: 10.1371/journal.pone.0196391]
Ma Z, Zheng Z, Ye J, Li J, Gao Z, Zhang S, Chen X. 2023. emotion2vec: self-supervised pre-training for speech emotion representation. arXiv preprint arXiv:2312.15185
Martin O, Kotsia I, Macq B, Pitas I. 2006. The eNTERFACE'05 audio-visual emotion database // Proceedings of the ICDEW. IEEE: 8-8 [DOI: 10.1109/ICDEW.2006.145]
Nagrani A, Chung J S, Zisserman A. 2017. VoxCeleb: a large-scale speaker identification dataset // Proceedings of INTERSPEECH 2017: 2616-2620 [DOI: 10.21437/Interspeech.2017-950]
Martinez-Lucas L, Abdelwahab M, Busso C. 2020. The MSP-Conversation corpus // Proceedings of INTERSPEECH 2020 [DOI: 10.21437/Interspeech.2020-2444]
Noriy K A, Yang X, Zhang J J. 2023. EMNS/Imz/Corpus: an emotive single-speaker dataset for narrative storytelling in games, television and graphic novels [EB/OL]. [2024-09-15]. https://arxiv.org/abs/2305.13137
Panayotov V, Chen G, Povey D, Khudanpur S. 2015. Librispeech: an ASR corpus based on public domain audio books // Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE: 5206-5210 [DOI: 10.1109/ICASSP.2015.7178964]
Poria S, Hazarika D, Majumder N, Naik G, Cambria E, Mihalcea R. 2018. MELD: a multimodal multi-party dataset for emotion recognition in conversations [EB/OL]. [2024-09-15]. https://arxiv.org/abs/1810.02508
Pratap V, Xu Q, Sriram A, Synnaeve G, Collobert R. 2020. MLS: a large-scale multilingual dataset for speech research [EB/OL]. [2024-09-15]. https://arxiv.org/abs/2012.03411
Radford A, Kim J W, Xu T, Brockman G, McLeavey C, Sutskever I. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356
Reddy C K, Gopal V, Cutler R. 2021. DNSMOS: a non-intrusive perceptual objective speech quality metric to evaluate noise suppressors // Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE: 6493-6497 [DOI: 10.1109/ICASSP39728.2021.9414878]
Rubenstein P K, Asawaroengchai C, Nguyen D D, et al. 2023. AudioPaLM: a large language model that can speak and listen [EB/OL]. [2024-09-15]. https://arxiv.org/abs/2306.12925
Schneider S, Baevski A, Collobert R, et al. 2019. wav2vec: unsupervised pre-training for speech recognition [EB/OL]. [2024-09-15]. https://arxiv.org/abs/1904.05862
Tao J H, Fan C H, Lian Z, Lyu Z, Shen Y, Liang S. 2024. Development of multimodal sentiment recognition and understanding. Journal of Image and Graphics, 29(6): 1607-1627 [DOI: 10.11834/jig.240017]
Tsai H S, Chang H J, Huang W C, et al. 2022. SUPERB-SG: enhanced speech processing universal performance benchmark for semantic and generative capabilities [EB/OL]. [2024-09-15]. https://arxiv.org/abs/2203.06849
Wan L, Wang Q, Papir A, Moreno I L. 2018. Generalized end-to-end loss for speaker verification // Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE: 4879-4883 [DOI: 10.1109/ICASSP.2018.8462665]
Wang K, Wu Q, Song L, Yang Z, Wu W, Qian C, He R, Qiao Y, Loy C C. 2020. MEAD: a large-scale audio-visual dataset for emotional talking-face generation // Proceedings of ECCV 2020 [DOI: 10.1007/978-3-030-58589-1_42]
Wang T, Li J, Ma Z, et al. 2024. Progressive residual extraction based pre-training for speech representation learning [EB/OL]. [2024-09-15]. https://arxiv.org/abs/2409.00387
Wang T, Zhou L, Zhang Z, Wu Y, Liu S, Gaur Y, Chen Z, Li J, Wei F. 2024. VioLA: conditional language models for speech recognition, synthesis, and translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32: 3709-3716 [DOI: 10.1109/TASLP.2024.3434425]
Xu Y X, Li B, Tan S Q, Huang J W. 2024. Research progress on speech deepfake and its detection techniques. Journal of Image and Graphics, 29(8): 2236-2268 [DOI: 10.11834/jig.230476]
Yan H, Liu Y L, Jin L W, Bai X. 2023. The development, application, and future of LLM similar to ChatGPT. Journal of Image and Graphics, 28(9): 2749-2762 [DOI: 10.11834/jig.230536]
Yang S, Chi P H, Chuang Y S, et al. 2021. SUPERB: speech processing universal performance benchmark [EB/OL]. [2024-09-15]. https://arxiv.org/abs/2105.01051
Zeghidour N, Luebs A, Omran A, et al. 2021. SoundStream: an end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30: 495-507 [DOI: 10.1109/TASLP.2021.312999]
Zhang B, Lv H, Guo P, Shao Q, Yang C, Xie L, Xu X, Bu H, Chen X, Zeng C, et al. 2022. WenetSpeech: a 10000+ hours multi-domain Mandarin corpus for speech recognition // ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE: 6182-6186 [DOI: 10.1109/ICASSP43922.2022.9746682]
Zhang D, Li S, Zhang X, et al. 2023. SpeechGPT: empowering large language models with intrinsic cross-modal conversational abilities [EB/OL]. [2024-09-15]. https://arxiv.org/abs/2305.11000
Zhou K, Sisman B, Liu R, Li H. 2021. Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset // Proceedings of ICASSP. IEEE: 920-924 [DOI: 10.1109/ICASSP39728.2021.9413391]