面向克隆语音的目标说话人鉴别方法
Target speaker verification method for cloned speech
2025年,页码:1-10
收稿日期:2024-11-12
修回日期:2025-02-20
录用日期:2025-03-03
网络出版日期:2025-03-04
DOI: 10.11834/jig.240686
随着文本到语音(Text To Speech,TTS)、语音转换(Voice Conversion,VC)等克隆语音技术的快速发展,如何在司法实践中准确识别克隆语音,即判断克隆语音是否来源于目标说话人特征,成为了一个极具挑战性的难题。虽然现有说话人识别技术可以通过声纹特征比对确认自然语音的说话人身份,但克隆语音不仅与目标说话人音色相似,还包含源说话人的特点,使得传统说话人识别技术难以去除源说话人音色的干扰,难以直接应用于深度克隆语音。基于此,本文研究了一种面向克隆语音的目标说话人鉴别方法。具体来说,首先基于Res2Block设计组渐进信道融合(Group Progressive Channel Fusion,GPCF)模块,以有效提取自然语音与克隆语音之间的公共有效声纹特征信息;其次,设计基于K独立的动态全局滤波器(Dynamic Global Filter,DGF)模块,以有效抑制源说话人的影响,提高模型表征和泛化能力;然后,设计基于多尺度层注意力的特征融合机制,以有效融合不同层次GPCF模块和DGF模块的深浅层特征;最后,采用注意力统计池化(Attentive Statistics Pooling,ASP)层,进一步增强表示特征张量中的目标说话人信息。实验在所构建的数据集上与3种较新的方法进行了比较:相对于这3种方法,EER分别降低了1.38%、0.92%、0.61%,minDCF分别降低了0.0125、0.0067、0.0445。在FastSpeech2、TriAANVC、FreeVC和KnnVC四种语音克隆数据集上的对比实验结果表明,所提方法在处理面向克隆语音的声纹认定任务时更具优势,可以有效提取克隆语音中的目标说话人特征,为克隆语音的声纹认定提供方法指导。
With the rapid development of artificial intelligence, especially the revolutionary progress of deep learning, voice cloning technologies such as Text To Speech (TTS) and Voice Conversion (VC) have moved beyond the limits of traditional techniques and can quickly generate digital voices that "sound like one's own voice", bringing unprecedented convenience to ordinary users. At the same time, however, criminals may use cloned voices to commit fraud, forgery, the spreading of false information, malicious harassment, and the illegal acquisition of sensitive personal information. Such behavior poses an unprecedented challenge and threat to personal voice rights. Under the clear guidance of the law, the protection of voice rights as an important part of personal identity has therefore become a focus of attention across society, and how to accurately determine the source of cloned speech in judicial practice, i.e., whether the cloned speech originates from the target speaker's characteristics, has become a challenging problem. Although existing speaker recognition technology can confirm the identity of the speaker of natural speech through voiceprint feature comparison, cloned speech is not only similar in timbre to the target speaker but also contains characteristics of the source speaker. The interference of the source speaker's timbre therefore makes traditional speaker recognition technology difficult to apply directly to cloned speech. Against this backdrop, a target speaker verification method for cloned speech is proposed in this paper. Specifically, a Group Progressive Channel Fusion (GPCF) module is first designed based on Res2Block to extract the common effective voiceprint features shared by natural speech and cloned speech. This module is critical for distinguishing subtle differences between natural and cloned speech that are generally indistinguishable to the human ear alone. Subsequently, a K-independent dynamic global filter (DGF) module is adopted to suppress the influence of the source speaker, improving the representation and generalization ability of the model. The DGF plays a key role in filtering out the unwanted features of the source speaker, ensuring that the target speaker's voiceprint remains the focus of the analysis. A feature fusion mechanism based on multi-scale layer attention is then presented to fuse the deep and shallow features from different levels of the GPCF and DGF modules. This mechanism is designed to capture intricate voiceprint details that may be missed by traditional methods, providing a more comprehensive analysis of the speech features. Finally, an attentive statistics pooling (ASP) layer is used to further enhance the target speaker information in the representation feature tensor, ensuring that the most relevant and discriminative features are emphasized and thus improving the accuracy of voiceprint identification. Experiments on the AISHELL-3 natural speech dataset and cloned speech datasets built with four speech synthesis and conversion algorithms (FastSpeech2, TriAANVC, FreeVC and KnnVC) show that, compared with three recent methods, the equal error rate (EER) is reduced by 1.38%, 0.92% and 0.61% respectively, while the minimum detection cost function (minDCF) is reduced by 0.0125, 0.0067 and 0.0445 respectively. These comparative results demonstrate the advantages of the proposed method in the task of cloned speech voiceprint identification: it effectively extracts the target speaker's features from cloned speech and offers methodological guidance for the voiceprint identification of cloned speech.
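The GPCF module builds on the hierarchical multi-scale split of Res2Block (Res2Net). The sketch below illustrates only that underlying, published Res2Block mechanism, not the paper's GPCF itself: channels are split into groups, each group (after the first) is filtered and added to the next, progressively widening the receptive field. The fixed 3-tap smoothing kernel is an illustrative stand-in for the learned convolutions of a real implementation.

```python
import numpy as np

def res2_multiscale(x, scale=4):
    """Res2Block-style hierarchical multi-scale processing.

    x: (C, T) feature map; C must be divisible by `scale`.
    The first channel group passes through unchanged; every later
    group is summed with the previous group's output and filtered,
    so deeper groups see a progressively larger temporal context.
    """
    groups = np.split(x, scale, axis=0)
    out, prev = [groups[0]], None
    k = np.array([0.25, 0.5, 0.25])  # stand-in for a learned conv
    for g in groups[1:]:
        y = g if prev is None else g + prev
        y = np.stack([np.convolve(c, k, mode="same") for c in y])
        out.append(y)
        prev = y
    return np.concatenate(out, axis=0)

x = np.arange(48, dtype=float).reshape(8, 6)
y = res2_multiscale(x, scale=4)
print(y.shape)  # (8, 6): the split-and-fuse keeps the shape
```

Note that in GPCF this grouped fusion is made progressive across channels; the paper's specific grouping schedule is not reproduced here.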
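The DGF module belongs to the family of global filters that operate in the frequency domain (cf. DS-TDNN's global-aware filter). A minimal sketch of that general mechanism, under stated assumptions: K candidate frequency-domain filters are mixed and applied as an element-wise product on the FFT of the feature sequence. Here the mixing weights are fixed inputs; in the paper's K-independent DGF they would be produced dynamically from the input, which is not reproduced here.

```python
import numpy as np

def global_filter(x, filters, weights):
    """Frequency-domain global filtering sketch.

    x: (C, T) features; filters: (K, C, T//2 + 1) candidate filters
    over rFFT bins; weights: (K,) mixing coefficients.
    The mixed filter multiplies every frequency bin, so a single
    cheap operation has a global (full-sequence) receptive field.
    """
    Xf = np.fft.rfft(x, axis=-1)                 # (C, T//2 + 1)
    k = np.tensordot(weights, filters, axes=1)   # mix K filters -> (C, F)
    return np.fft.irfft(Xf * k, n=x.shape[-1], axis=-1)

rng = np.random.default_rng(1)
C, T, K = 3, 8, 2
x = rng.normal(size=(C, T))
# All-ones filters leave the signal untouched (identity in frequency).
y = global_filter(x, np.ones((K, C, T // 2 + 1)), np.array([0.5, 0.5]))
print(np.allclose(y, x))  # True
```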
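The multi-scale layer attention fusion can be sketched as a softmax-weighted sum over the stacked outputs of the different GPCF/DGF levels. In this illustrative version the per-layer scores are plain inputs; in the actual model they would be learned attention parameters.

```python
import numpy as np

def layer_attention_fusion(feats, scores):
    """Fuse L stacked layer outputs into one feature map.

    feats: (L, C, T) shallow-to-deep feature maps; scores: (L,)
    per-layer attention logits. Softmax over layers yields the
    fusion weights, so informative levels dominate the sum.
    """
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # softmax over layers
    return np.tensordot(w, feats, axes=1)     # (C, T)

feats = np.stack([np.zeros((2, 3)), np.ones((2, 3))])
fused = layer_attention_fusion(feats, np.array([0.0, 0.0]))
print(fused[0, 0])  # 0.5: equal logits average the two layers
```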
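The ASP layer follows Okabe et al. (2018): an attention head scores each frame, and the utterance embedding is the attention-weighted mean concatenated with the attention-weighted standard deviation. A minimal numpy sketch with externally supplied (i.e., untrained) parameters; shapes are illustrative assumptions.

```python
import numpy as np

def attentive_stats_pooling(h, W, b, v):
    """Attentive statistics pooling.

    h: (C, T) frame-level features; W: (A, C), b: (A,), v: (A,)
    are the attention parameters (A = attention hidden size).
    Returns a (2*C,) vector: weighted mean || weighted std, so
    frames the attention deems important dominate both statistics.
    """
    e = v @ np.tanh(W @ h + b[:, None])          # (T,) frame scores
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                         # softmax over frames
    mu = h @ alpha                               # weighted mean, (C,)
    var = (h ** 2) @ alpha - mu ** 2
    sigma = np.sqrt(np.clip(var, 1e-12, None))   # weighted std, (C,)
    return np.concatenate([mu, sigma])

rng = np.random.default_rng(0)
C, T, A = 4, 10, 8
out = attentive_stats_pooling(rng.normal(size=(C, T)),
                              rng.normal(size=(A, C)),
                              rng.normal(size=A),
                              rng.normal(size=A))
print(out.shape)  # (8,): twice the channel dimension
```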
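The two reported metrics, EER and minDCF, are standard verification measures computed from trial scores. The self-contained sketch below shows their conventional definitions (the minDCF cost parameters are illustrative defaults, not the ones used in the paper's experiments):

```python
def compute_eer(scores, labels):
    """Equal error rate: the operating point where the false
    acceptance rate (impostors above threshold) equals the false
    rejection rate (targets below threshold), found by sweeping
    the threshold across the sorted scores."""
    pairs = sorted(zip(scores, labels))
    n_tar = sum(labels)
    n_imp = len(labels) - n_tar
    fr, fa = 0, n_imp            # threshold below all scores
    best_gap, eer = 1.0, 0.5
    for _, is_target in pairs:
        if is_target:
            fr += 1
        else:
            fa -= 1
        far, frr = fa / n_imp, fr / n_tar
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

def compute_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost: min over thresholds of
    c_miss * P_miss * p_target + c_fa * P_fa * (1 - p_target)."""
    pairs = sorted(zip(scores, labels))
    n_tar = sum(labels)
    n_imp = len(labels) - n_tar
    fr, fa = 0, n_imp
    best = c_fa * 1.0 * (1 - p_target)   # accept-everything extreme
    for _, is_target in pairs:
        if is_target:
            fr += 1
        else:
            fa -= 1
        dcf = (c_miss * (fr / n_tar) * p_target
               + c_fa * (fa / n_imp) * (1 - p_target))
        best = min(best, dcf)
    return best

# Perfectly separated scores give EER = 0 and minDCF = 0.
print(compute_eer([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], [1, 1, 1, 0, 0, 0]))
```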
Li J X, Zhang L H and Li Y T. 2023. Speaker feature extraction based on timbre consistency for voice cloning [J]. Journal of Signal Processing, 39(4): 719-729
李嘉欣, 张连海, 李宜亭. 2023. 基于音色一致的语音克隆说话人特征提取方法 [J]. 信号处理, 39(4): 719-729 [DOI: 10.16798/j.issn.1003-0530.2023.04.013]
Xu Y X, Li B, Tan S Q and Huang J W. 2024. Research progress on speech deepfake and its detection techniques [J]. Journal of Image and Graphics, 29(8): 2236-2268
许裕雄, 李斌, 谭舜泉, 黄继武. 2024. 语音深度伪造及其检测技术研究进展 [J]. 中国图象图形学报, 29(8): 2236-2268 [DOI: 10.11834/jig.230476]
Qiao Z. 2023. Risks and countermeasures of artificial intelligence generated content technology in content security governance [J]. Telecommunications Science, 39(10): 136-146
乔喆. 2023. 人工智能生成内容技术在内容安全治理领域的风险和对策 [J]. 电信科学, 39(10): 136-146 [DOI: 10.11959/j.issn.1000-0801.2023190]
Yin M K and Li Y B. 2024. Private law regulation of sound infringement in the age of artificial intelligence [J]. Journal of Harbin University, 45(4): 73-77
尹懋锟, 李云滨. 2024. 人工智能时代声音侵权的私法规制 [J]. 哈尔滨学院学报, 45(4): 73-77 [DOI: 10.3969/j.issn.1004-5856.2024.04.016]
Booth I, Barlow M and Watson B. 1993. Enhancements to DTW and VQ decision algorithms for speaker recognition [J]. Speech Communication, 13(3-4): 427-433 [DOI: 10.1016/0167-6393(93)90041-I]
Jiang L Y, Zheng Y F, Chen C, Li G H and Zhang W J. 2023. Review of optimization methods for supervised deep learning [J]. Journal of Image and Graphics, 28(4): 963-983
江铃燚, 郑艺峰, 陈澈, 李国和, 张文杰. 2023. 有监督深度学习的优化方法研究综述 [J]. 中国图象图形学报, 28(4): 963-983 [DOI: 10.11834/jig.211139]
Gersho A and Gray R M. 2012. Vector Quantization and Signal Compression [M]. Germany: Springer Science & Business Media
Huang G, Liu Z, Van Der Maaten L and Weinberger K Q. 2017. Densely connected convolutional networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 4700-4708 [DOI: 10.1109/cvpr.2017.243]
Malykh E, Novoselov S and Kudashev O. 2017. On residual CNN in text-dependent speaker verification task [C]// Speech and Computer: 19th International Conference, SPECOM 2017, Hatfield, UK, September 12-16, 2017, Proceedings 19. Springer International Publishing: 593-601 [DOI: 10.1007/978-3-319-66429-3_59]
Snyder D, Garcia-Romero D, Povey D and Khudanpur S. 2017. Deep neural network embeddings for text-independent speaker verification [C]// Proc. Interspeech 2017: 999-1003 [DOI: 10.21437/interspeech.2017-620]
Lee J and Nam J. 2017. Multi-level and multi-scale feature aggregation using sample-level deep convolutional neural networks for music classification [J]. arXiv preprint arXiv:1706.06810 [DOI: 10.48550/arXiv.1706.06810]
Gao Z, Song Y, McLoughlin I, Li P, Jiang Y and Dai L R. 2019. Improving aggregation and loss function for better embedding learning in end-to-end speaker verification system [C]// Proc. Interspeech 2019: 361-365 [DOI: 10.21437/interspeech.2019-1489]
Desplanques B, Thienpondt J and Demuynck K. 2020. ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification [C]// Proc. Interspeech 2020: 3830-3834 [DOI: 10.21437/Interspeech.2020-2650]
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 7132-7141 [DOI: 10.1109/cvpr.2018.00745]
Li Y, Gan J, Lin X, Qiu Y, Zhan H and Tian H. 2024. DS-TDNN: dual-stream time-delay neural network with global-aware filter for speaker verification [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing [DOI: 10.1109/taslp.2024.3402072]
Yu Y Q and Li W J. 2020. Densely connected time delay neural network for speaker verification [C]// Proc. Interspeech 2020: 921-925 [DOI: 10.21437/interspeech.2020-1275]
Wang H, Zheng S, Chen Y, Cheng L and Chen Q. 2023. CAM++: a fast and efficient network for speaker verification using context-aware masking [C]// Proc. Interspeech 2023: 5301-5305 [DOI: 10.21437/Interspeech.2023-1513]
Ren Y, Hu C, Tan X, Qin T, Zhao S, Zhao Z and Liu T Y. 2020. FastSpeech 2: fast and high-quality end-to-end text to speech [J]. arXiv preprint arXiv:2006.04558 [DOI: 10.48550/arXiv.2006.04558]
Park H J, Yang S W, Kim J S, Shin W and Han S W. 2023. TriAAN-VC: triple adaptive attention normalization for any-to-any voice conversion [C]// ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE: 1-5 [DOI: 10.1109/icassp49357.2023.10096642]
Li J, Tu W and Xiao L. 2023. FreeVC: towards high-quality text-free one-shot voice conversion [C]// ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece: 1-5 [DOI: 10.1109/icassp49357.2023.10095191]
Baas M, van Niekerk B and Kamper H. 2023. Voice conversion with just nearest neighbors [C]// Proc. Interspeech 2023: 2053-2057 [DOI: 10.21437/interspeech.2023-419]
Shi Y, Bu H, Xu X, Zhang S and Li M. 2021. AISHELL-3: a multi-speaker Mandarin TTS corpus [C]// Proc. Interspeech 2021: 2756-2760 [DOI: 10.21437/interspeech.2021-755]
Okabe K, Koshinaka T and Shinoda K. 2018. Attentive statistics pooling for deep speaker embedding [C]// Proc. Interspeech 2018: 2252-2256 [DOI: 10.21437/interspeech.2018-993]
Zhao Z, Li Z, Wang W and Zhang P. 2023. PCF: ECAPA-TDNN with progressive channel fusion for speaker verification [C]// ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE: 1-5 [DOI: 10.1109/icassp49357.2023.10095051]
Hinton G E, Srivastava N, Krizhevsky A, Sutskever I and Salakhutdinov R R. 2012. Improving neural networks by preventing co-adaptation of feature detectors [J]. arXiv preprint arXiv:1207.0580 [DOI: 10.48550/arXiv.1207.0580]
Deng J, Guo J, Xue N and Zafeiriou S. 2019. ArcFace: additive angular margin loss for deep face recognition [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 4690-4699 [DOI: 10.1109/cvpr.2019.00482]
Xiang X, Wang S, Huang H, Qian Y and Yu K. 2019. Margin matters: towards more discriminative deep neural network embeddings for speaker recognition [C]// 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE: 1652-1656 [DOI: 10.1109/apsipaasc47483.2019.9023039]
Kingma D P and Ba J. 2014. Adam: a method for stochastic optimization [J]. arXiv preprint arXiv:1412.6980 [DOI: 10.48550/arXiv.1412.6980]
Choi H Y, Lee S H and Lee S W. 2024. DDDM-VC: decoupled denoising diffusion models with disentangled representation and prior mixup for verified robust voice conversion [C]// Proceedings of the AAAI Conference on Artificial Intelligence, 38(16): 17862-17870 [DOI: 10.1609/aaai.v38i16.29740]