数字人风格化、多模态驱动与交互进展
Advances in digital characters stylization, multimodal animation and interaction
2024, pages 1-24
Online publication date: 2024-10-22
DOI: 10.11834/jig.230639
潘烨,李韶旭,谭帅等.数字人风格化、多模态驱动与交互进展[J].中国图象图形学报,
Pan Ye, Li Shaoxu, Tan Shuai, et al. Advances in digital characters stylization, multimodal animation and interaction[J]. Journal of Image and Graphics,
Stylized digital humans are a rapidly developing topic across computer graphics, visual arts, and game design. In recent years, the design and production of digital characters have advanced markedly, giving them more realistic appearance and behavior while allowing them to adapt to a wide variety of artistic styles and contexts. Focusing on the stylized digital human task, this paper systematically reviews the development status, frontier progress, and hot issues of three core research directions: stylized generation of digital humans, multimodal driving, and user interaction. For stylized generation, methods are classified according to the two 3D representations of digital humans, explicit 3D models and implicit 3D models: explicit 3D stylization is analyzed mainly in terms of optimization-based, generative adversarial network (GAN)-based, and engine-based methods, while implicit 3D stylization is reviewed in terms of general implicit scene stylization and face-specific implicit stylization. For digital human driving, methods are first reviewed by driving source, covering audio-driven, text-driven, and video-driven approaches; by implementation algorithm, they are grouped into intermediate-variable-based and encoder-decoder-based methods, with the intermediate-variable family further divided into keypoint-based, 3D-face-based, and optical-flow-based approaches. For user interaction, voice interaction is currently the mainstream mode: the voice interaction module is reviewed in terms of automatic speech recognition and text-to-speech synthesis, and the dialogue system module in terms of natural language understanding and natural language generation. On this basis, future trends in stylized digital human research are discussed to provide a reference for subsequent work.
Stylized digital characters have become a driving force in computer graphics, visual arts, and game design. Their ability to mimic human appearance and behavior, combined with their flexibility in adapting to a wide range of artistic styles and narrative settings, makes them central to immersive and engaging digital experiences. This survey examines the current state of development, the latest trends, and the open challenges in three foundational research domains: the creation of stylized digital humans, multimodal driving mechanisms, and user interaction modalities. The first domain covers the methodologies used to generate lifelike yet stylistically diverse characters that integrate seamlessly into varied digital environments, from advances in 3D modeling and texturing to the integration of artificial intelligence into character creation. The second domain concerns the techniques used to animate and control digital humans from inputs such as audio, text, and video; these mechanisms both enhance the realism of character behavior and open new ways of involving users in interactive narratives. The third domain addresses how end users engage with and influence digital humans, from immersive virtual and augmented reality experiences to interactive web and mobile platforms, and evaluates how well different modalities support a two-way interaction that enriches the user experience and deepens the connection to the character.
The creation of stylized digital humans has seen remarkable progress in recent years and can be broadly divided into two categories according to the underlying 3D representation: explicit 3D models and implicit 3D models. Explicit 3D stylization encompasses optimization-based approaches, which iteratively refine mesh geometry, textures, and lighting to match a target style; approaches based on generative adversarial networks (GANs), which learn to synthesize novel stylized forms that capture the nuances of different artistic styles; and engine-based approaches, which exploit modern rendering engines to apply artistic effects in real time with fine control over the final output. Implicit 3D stylization instead builds on implicit scene stylization, in particular on neural implicit representations.
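To make the notion of a neural implicit representation concrete, the sketch below shows, in PyTorch, the kind of coordinate-based network that underlies NeRF-style avatar representations: a small MLP queried at 3D points and view directions that returns a volume density and a color, which a volume renderer would integrate along camera rays. It is an illustrative outline only; the layer sizes, positional-encoding settings, and the name ImplicitHeadField are assumptions made for exposition and do not correspond to any particular method surveyed here.

```python
# Minimal NeRF-style implicit field: an MLP mapping a 3D point (plus view
# direction) to a density and an RGB color. All sizes are illustrative
# assumptions, not values taken from any cited method.
import torch
import torch.nn as nn


def positional_encoding(x: torch.Tensor, num_freqs: int = 6) -> torch.Tensor:
    """Lift coordinates to sin/cos features so the MLP can fit high frequencies."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype, device=x.device)
    angles = x[..., None] * freqs                        # (..., dim, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                     # (..., dim * 2 * num_freqs)


class ImplicitHeadField(nn.Module):
    """Hypothetical avatar field: density from position, color from position + view."""

    def __init__(self, num_freqs: int = 6, hidden: int = 128):
        super().__init__()
        enc_dim = 3 * 2 * num_freqs
        self.num_freqs = num_freqs
        self.trunk = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, xyz: torch.Tensor, view_dir: torch.Tensor):
        h = self.trunk(positional_encoding(xyz, self.num_freqs))
        sigma = torch.relu(self.density_head(h))         # non-negative volume density
        rgb = self.color_head(
            torch.cat([h, positional_encoding(view_dir, self.num_freqs)], dim=-1)
        )
        return sigma, rgb


# Query 1024 sample points; ray sampling, volume rendering, and any stylization
# losses applied to the rendered images are omitted here.
field = ImplicitHeadField()
sigma, rgb = field(torch.rand(1024, 3), torch.rand(1024, 3))
```

Stylization methods in this family typically keep such a field as the underlying representation and optimize its parameters, or an attached appearance branch, under style- or text-guided losses computed on rendered views.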
Such implicit approaches offer a more holistic and flexible way to represent and manipulate 3D geometry and appearance, enabling stylization that is not bound by the limitations of traditional meshes. Within this framework, facial stylization occupies a special place: it demands an understanding of facial anatomy, expression dynamics, and cultural nuance, and specialized methods have been developed to capture and manipulate facial features with the realism and emotional expressiveness needed for believable digital humans.
Animating and controlling stylized digital humans relies on diverse driving signals. This survey reviews three primary sources of such signals: audio, text, and video. Audio-driven methods use speech recognition and prosody analysis to synchronize a digital human's movements with spoken language, producing natural lip sync and gestures. Text-driven methods, by contrast, rely on natural language processing (NLP) to interpret textual commands or prompts and convert them into coherent actions, offering a more directive form of control. Video-driven methods use computer vision to track and transfer the motion of real actors, bridging the physical and virtual worlds. The underlying algorithms are reviewed according to whether they rely on intermediate variables or on encoder-decoder architectures, and intermediate-variable methods are further divided into keypoint-based, 3D-face-based, and optical-flow-based approaches. Keypoint-based methods capture and transfer motion so that movements can be replicated precisely across different characters; 3D-face-based methods focus on facial animation, using detailed face models to achieve realistic expressions and emotions; and optical-flow-based methods estimate and synthesize complex motion patterns over the entire body.
The value of stylized digital humans ultimately lies in their ability to engage users in natural, meaningful interaction. Voice interaction is currently the mainstream mode: it relies on automatic speech recognition (ASR) for accurate speech-to-text conversion and on text-to-speech (TTS) synthesis for natural-sounding spoken output. The dialogue system, a core module of virtual human interaction, combines natural language understanding (NLU) to interpret user input with natural language generation (NLG) to produce appropriate responses. When these capabilities are integrated, stylized digital humans can hold fluid, contextually relevant conversations that foster a sense of connection with the user.
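As a schematic summary of how these interaction components fit together, the following Python sketch wires ASR, a dialogue engine (NLU plus NLG), TTS, and an avatar driver into a single conversational turn. Every interface and name here is a hypothetical placeholder introduced for illustration; a real system would plug concrete speech recognition, dialogue, speech synthesis, and animation models in behind these interfaces.

```python
# Schematic one-turn interaction loop for a voice-driven stylized digital human.
# All components are hypothetical placeholders, not APIs of any surveyed system.
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass
class Viseme:
    """A mouth-shape unit with its timing, used to lip-sync the avatar."""
    phoneme: str
    start_ms: int
    end_ms: int


class SpeechRecognizer(Protocol):
    def transcribe(self, audio: bytes) -> str: ...                       # ASR: speech -> text


class DialogueEngine(Protocol):
    def respond(self, user_text: str) -> str: ...                        # NLU + NLG


class SpeechSynthesizer(Protocol):
    def synthesize(self, text: str) -> Tuple[bytes, List[Viseme]]: ...   # TTS + timing


class AvatarDriver(Protocol):
    def animate(self, audio: bytes, visemes: List[Viseme]) -> None: ...  # lip sync, gestures


def interaction_turn(mic_audio: bytes,
                     asr: SpeechRecognizer,
                     dialogue: DialogueEngine,
                     tts: SpeechSynthesizer,
                     avatar: AvatarDriver) -> str:
    """One user turn: recognize speech, generate a reply, voice it, drive the avatar."""
    user_text = asr.transcribe(mic_audio)              # automatic speech recognition
    reply_text = dialogue.respond(user_text)           # dialogue system (NLU + NLG)
    reply_audio, visemes = tts.synthesize(reply_text)  # text-to-speech synthesis
    avatar.animate(reply_audio, visemes)               # audio-driven facial animation
    return reply_text
```

The same loop structure applies whichever concrete models fill each role, which is why the survey treats speech recognition, speech synthesis, and the dialogue system as separable modules.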
Looking ahead, research on stylized digital characters is expected to advance further with progress in deep learning, computer vision, and NLP. Future work may integrate multiple modalities for richer and more nuanced interactions, and explore stylization techniques that bridge reality and fiction to create digital humans that are both fantastical and relatable. Moreover, intelligent agents capable of autonomous creativity and learning could change how stylized digital humans contribute to entertainment, education, healthcare, and other industries. As the technology evolves, stylized digital humans will play an increasingly significant role in how people engage with digital content and with one another. This survey offers researchers and practitioners a comprehensive overview of the current state of the art and points out directions for future work in this dynamic field.
stylization; digital characters; face driven; human-computer interaction; 3D modeling; deep learning; neural network