Advancements in digital character stylization, multimodal animation, and interaction
2025, Vol. 30, No. 2, pp. 334-360
Received: 2023-09-12; Revised: 2024-09-06; Published in print: 2025-02-16
DOI: 10.11834/jig.230639
Stylized digital humans are a rapidly developing area within computer graphics, visual arts, and game design. Techniques for designing and producing digital characters have advanced markedly, giving them more lifelike appearance and behavior while allowing them to adapt to diverse artistic styles and settings. Centered on the stylized-digital-human task, this paper systematically surveys the state of development, frontier advances, and open problems along three core research directions: stylized generation, multimodal driving, and user interaction. For stylized generation, methods are classified by the 3D representation of the digital human: explicit 3D models and implicit 3D models. Explicit 3D stylization is analyzed in terms of optimization-based, generative-adversarial-network-based, and engine-based methods; implicit 3D stylization is reviewed from the perspectives of general implicit scene stylization and face-specific implicit stylization. For driving, methods are reviewed by driving source (audio-driven, text-driven, and video-driven) and by algorithm (intermediate-variable-based and encoder-decoder-based); intermediate-variable methods further divide into keypoint-based, 3D-face-based, and optical-flow-based approaches. For user interaction, speech interaction is currently the mainstream mode: the speech interaction module is reviewed in terms of automatic speech recognition and text-to-speech synthesis, and the dialog system module in terms of natural language understanding and natural language generation. On this basis, future trends in stylized digital human research are discussed to inform subsequent work.
Stylized digital characters have become a driving force in computer graphics, visual arts, and game design. Their ability to mimic human appearance and behavior, combined with their flexibility in adapting to a wide range of artistic styles and narrative frameworks, underscores their growing importance in crafting immersive and engaging digital experiences. This survey examines stylized digital humans in depth, covering their current development status, the latest trends, and the open challenges in three foundational research domains: the creation of stylized digital humans, multimodal driving mechanisms, and user interaction modalities. The first domain, creation of stylized digital humans, examines methodologies for generating lifelike yet stylistically diverse characters that integrate seamlessly into varied digital environments. From advances in 3D modeling and texturing to the use of artificial intelligence for dynamic character development, this section analyzes the tools and technologies pushing the boundaries of what digital characters can achieve. In the realm of multimodal driving mechanisms, the study investigates techniques for animating and controlling digital humans from a range of inputs, such as voice, gesture, and real-time motion capture, and shows how these mechanisms not only enhance the realism of character interactions but also open new avenues for involving users in interactive narratives. Finally, the discussion of user interaction modalities explores the ways in which end users can engage with and influence the behavior of digital humans.
From immersive virtual and augmented reality experiences to interactive web and mobile platforms, this segment evaluates how effectively different modalities create a two-way interaction that enriches the user's experience and deepens their connection to digital characters. At the heart of this exploration lies the creation of stylized digital humans, a field that has seen remarkable progress in recent years. The generation of these characters can be broadly classified into two categories: explicit 3D models and implicit 3D models. Explicit 3D digital human stylization encompasses a range of methodologies, including optimization-based approaches that iteratively refine digital meshes toward specific stylistic attributes, adjusting geometric details, textures, and lighting to achieve the desired aesthetic. Generative adversarial networks, as cornerstones of deep learning, have reshaped this landscape by enabling the automatic generation of novel stylized forms that capture the intricate nuances of various artistic styles. Furthermore, engine-based methods harness advanced rendering engines to apply artistic filters and effects in real time, offering flexibility and fine control over the final visual output. Implicit 3D digital human stylization draws inspiration from implicit scene stylization, particularly via neural implicit representations. These approaches offer a more holistic and flexible way to represent and manipulate 3D geometry and appearance, enabling stylization that transcends traditional mesh-based limitations. Within this framework, facial stylization holds a special place, requiring a deep understanding of facial anatomy, expression dynamics, and cultural nuances.
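Optimization-based stylization of the kind described above typically minimizes a style loss between deep feature statistics of the rendered character and a style reference; a common choice is the Gram-matrix loss popularized by neural style transfer. The sketch below is illustrative only: it operates on arbitrary (C, H, W) feature arrays rather than features from a real pretrained network, and the function names are our own.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-wise Gram matrix of a (C, H, W) feature map,
    normalized by its size so layers of different shape are comparable."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def style_loss(gen_feats, style_feats) -> float:
    """Mean squared difference of Gram matrices, averaged over layers."""
    return float(np.mean([
        np.mean((gram_matrix(g) - gram_matrix(s)) ** 2)
        for g, s in zip(gen_feats, style_feats)
    ]))

# Two "layers" of fake features standing in for network activations.
feats_a = [np.ones((2, 3, 3)), np.zeros((1, 4, 4))]
feats_b = [np.ones((2, 3, 3)) * 2.0, np.zeros((1, 4, 4))]
```

In an actual optimization loop, this loss would be backpropagated through the renderer to update mesh geometry or texture parameters; identical feature statistics yield zero loss, and any stylistic mismatch yields a positive penalty.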
Specialized methods have been developed to capture and manipulate facial features in a nuanced and artistic manner, achieving the realism and emotional expressiveness that believable digital humans require. Animating and controlling stylized digital humans relies on diverse driving signals. This study examines three primary sources of such signals: audio, text, and video. Audio driving leverages speech recognition and prosody analysis to synchronize digital human movements with spoken language, enabling natural and expressive lip-sync and gesturing. In contrast, text driving relies on natural language processing (NLP) techniques to interpret textual commands or prompts and convert them into coherent actions, allowing a more directive form of control. Video driving, perhaps the most advanced in terms of realism, employs computer vision algorithms to track and mimic the movements of real-world actors, bridging the virtual and physical worlds. These drivers are supported by implementation algorithms that rely on intermediate variables or on encoder-decoder structures. Among intermediate-variable methods, keypoint-based approaches play a pivotal role in capturing and transferring motion, allowing precise replication of movements across different characters; 3D-face-based approaches use detailed facial models and advanced animation techniques to achieve highly realistic expressions and emotions; and optical-flow-based techniques estimate dense motion to capture and reproduce complex motion patterns across the entire digital human body. A defining strength of stylized digital humans is their ability to engage users in meaningful and natural interactions.
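The intermediate-variable pipeline described above can be pictured as an encoder-decoder that maps per-frame audio features to keypoint displacements, which a renderer then turns into video. The following is a schematic sketch under our own assumptions (random weights stand in for trained parameters, 13-dim MFCC frames stand in for real audio features), not any specific published model.

```python
import numpy as np

rng = np.random.default_rng(0)

class AudioToKeypoints:
    """Schematic encoder-decoder: audio frames -> 2D facial keypoint offsets.

    Weights are random stand-ins; a trained model would learn them from
    paired audio/landmark data.
    """

    def __init__(self, n_mfcc=13, hidden=32, n_keypoints=20):
        self.w_enc = rng.normal(scale=0.1, size=(n_mfcc, hidden))
        self.w_dec = rng.normal(scale=0.1, size=(hidden, n_keypoints * 2))
        self.n_keypoints = n_keypoints

    def __call__(self, mfcc_frames: np.ndarray) -> np.ndarray:
        # Encode each audio frame into a latent code...
        latent = np.tanh(mfcc_frames @ self.w_enc)
        # ...then decode per-frame keypoint offsets, shaped (T, K, 2).
        offsets = latent @ self.w_dec
        return offsets.reshape(len(mfcc_frames), self.n_keypoints, 2)

model = AudioToKeypoints()
audio = rng.normal(size=(100, 13))   # 100 frames of 13-dim MFCC features
keypoints = model(audio)             # shape (100, 20, 2)
```

The keypoint sequence is the intermediate variable: the same latent motion can drive different characters, which is why keypoint-based methods transfer motion across identities so readily.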
Voice interaction, currently the mainstream mode of communication, relies on automatic speech recognition for accurate speech-to-text conversion and on text-to-speech synthesis for generating natural-sounding synthetic speech. The dialog system module, a cornerstone of virtual human interaction, depends on natural language understanding for interpreting user inputs and natural language generation for crafting appropriate responses. When these capabilities are integrated seamlessly, stylized digital humans can hold fluid, contextually relevant conversations with users, fostering a sense of connection. Research on stylized digital characters will likely continue to grow, fueled by advances in deep learning, computer vision, and NLP. Future work may integrate multiple modalities for richer, more nuanced interactions; explore stylization techniques that bridge reality and fiction, enabling digital humans that are both fantastical and relatable; and develop intelligent agents capable of autonomous creativity and learning, transforming how stylized digital humans contribute to entertainment, education, healthcare, and other industries. As the technology evolves, stylized digital humans will play an increasingly substantial role in shaping how people engage with digital content and with one another. This survey offers researchers and practitioners a comprehensive overview of the current state of the art and aims to guide future work in this dynamic field.
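The voice interaction loop just described chains four modules: ASR transcribes the user's speech, NLU extracts an intent, NLG produces a reply, and TTS renders it as audio. The sketch below wires these stages together with stubbed functions (all names and the canned transcript are our own placeholders); a real system would substitute trained ASR, dialog, and TTS models at each stage.

```python
def recognize_speech(audio_frames) -> str:
    """ASR stand-in: map raw audio to a transcript (stubbed)."""
    return "what is your name"

def understand(text: str) -> str:
    """NLU stand-in: reduce the transcript to an intent label."""
    return "ask_name" if "name" in text else "unknown"

def generate_response(intent: str) -> str:
    """NLG stand-in: map an intent to a reply string."""
    replies = {
        "ask_name": "I am a stylized digital human.",
        "unknown": "Could you rephrase that?",
    }
    return replies[intent]

def synthesize(text: str) -> bytes:
    """TTS stand-in: return a waveform placeholder for the reply."""
    return b"PCM:" + text.encode()

def interact(audio_frames) -> bytes:
    """One turn of the dialog loop: ASR -> NLU -> NLG -> TTS."""
    transcript = recognize_speech(audio_frames)
    reply = generate_response(understand(transcript))
    return synthesize(reply)

waveform = interact(b"\x00\x00")
```

Keeping the stages behind narrow interfaces like this is what lets each module (e.g., the ASR front end or the TTS back end) be upgraded independently, which matters given how quickly each component area evolves.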
Abdal R , Lee H Y , Zhu P H , Chai M L , Siarohin A , Wonka P and Tulyakov S . 2023 . 3DAvatarGAN: bridging domains for personalized editable avatars // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Vancouver, Canada : IEEE: 4552 - 4562 [ DOI: 10.1109/CVPR52729.2023.00442 http://dx.doi.org/10.1109/CVPR52729.2023.00442 ]
Almahairi A , Rajeshwar S , Sordoni A , Bachman P and Courville A . 2018 . Augmented CycleGAN: learning many-to-many mappings from unpaired data [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/1802.10151.pdf https://arxiv.org/pdf/1802.10151.pdf
Amodei D , Anubhai R , Battenberg E , Case C , Casper J , Catanzaro B , Chen J D , Chrzanowski M , Coates A , Diamos G , Elsen E , Engel J , Fan L X , Fougner C , Han T , Hannun A , Jun B , Legresley P , Lin L , Narang S , Ng A , Ozair S , Prenger R , Raiman J , Satheesh S , Seetapun D , Sengupta S , Wang Y , Wang Z Q , Wang C , Xiao B , Yogatama D , Zhan J and Zhu Z Y . 2015 . Deep speech 2: end-to-end speech recognition in english and mandarin [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/1512.02595.pdf https://arxiv.org/pdf/1512.02595.pdf
Aneja S , Thies J , Dai A and Niessner M . 2023 . ClipFace: text-guided editing of textured 3D morphable models // Proceedings of 2023 ACM SIGGRAPH Conference Proceedings . Los Angeles, USA : Association for Computing Machinery: #70 [ DOI: 10.1145/3588432.3591566 http://dx.doi.org/10.1145/3588432.3591566 ]
Averbuch-Elor H , Cohen-Or D , Kopf J and Cohen M F . 2017 . Bringing portraits to life . ACM Transactions on Graphics (TOG) , 36 ( 6 ): # 196 [ DOI: 10.1145/3130800.3130818 http://dx.doi.org/10.1145/3130800.3130818 ]
Baevski A , Zhou H , Mohamed A and Auli M . 2020 . wav2vec 2 . 0 : a framework for self-supervised learning of speech representations [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/2006.11477.pdf https://arxiv.org/pdf/2006.11477.pdf
Bengio Y , Ducharme R and Vincent P . 2000 . A neural probabilistic language model // Proceedings of the 13th International Conference on Neural Information Processing Systems . Denver, USA : MIT Press: 893 - 899
Blanz V and Vetter T . 1999 . A morphable model for the synthesis of 3D faces //Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. [s.l.]: ACM Press/Addison-Wesley Publishing Co .: 187 - 194 [ DOI: 10.1145/311535.311556 http://dx.doi.org/10.1145/311535.311556 ]
Blanz V and Vetter T . 2023 . A morphable model for the synthesis of 3D faces // Seminal Graphics Papers: Pushing the Boundaries , Volume 2 , 157 - 164
Brooks T , Holynski A and Efros A A . 2023 . InstructPix2Pix: learning to follow image editing instructions // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Vancouver, Canada : IEEE: 18392 - 18402 [ DOI: 10.1109/CVPR52729.2023.01764 http://dx.doi.org/10.1109/CVPR52729.2023.01764 ]
Cai Q , Ma M X , Wang C and Li H S . 2023 . Image neural style transfer: a review . Computers and Electrical Engineering , 108 : # 108723 [ DOI: 10.1016/j.compeleceng.2023.108723 http://dx.doi.org/10.1016/j.compeleceng.2023.108723 ]
Cao C , Simon T , Kim J K , Schwartz G , Zollhoefer M , Saito S S , Lombardi S , Wei S E , Belko D , Yu S I , Sheikh Y and Saragih J . 2022 . Authentic volumetric avatars from a phone scan . ACM Transactions on Graphics (TOG) , 41 ( 4 ): # 163 [ DOI: 10.1145/3528223.3530143 http://dx.doi.org/10.1145/3528223.3530143 ]
Cao K D , Liao J and Yuan L . 2018 . CariGANs: unpaired photo-to-caricature translation . ACM Transactions on Graph ICS (TOG) , 37 ( 6 ): # 244 [ DOI: 10.1145/3272127.3275046 http://dx.doi.org/10.1145/3272127.3275046 ]
Cao Y , Tien W C , Faloutsos P and Pighin F . 2005 . Expressive speech-driven facial animation . ACM Transactions on Graphics (TOG) , 24 ( 4 ): 1283 - 1302 [ DOI: 10.1145/1095878.1095881 http://dx.doi.org/10.1145/1095878.1095881 ]
Chan E R , Lin C Z , Chan M A , Nagano K , Pan B X , De Mello S , Gallo O , Guibas L , Tremblay J , Khamis S , Karras T and Wetzstein G . 2022 . Efficient geometry-aware 3D generative adversarial networks // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans, USA : IEEE: 16102 - 16112 [ DOI: 10.1109/CVPR52688.2022.01565 http://dx.doi.org/10.1109/CVPR52688.2022.01565 ]
Chatziagapi A , Athar S , Jain A , Rohith M V , Bhat V and Samaras D . 2023 . LipNeRF: what is the right feature space to lip-sync a NeRF? // Proceedings of the 17th IEEE International Conference on Automatic Face and Gesture Recognition (FG) . Waikoloa Beach, USA : IEEE: 1 - 8 [ DOI: 10.1109/FG57933.2023.10042567 http://dx.doi.org/10.1109/FG57933.2023.10042567 ]
Chen D D , Liao J , Yuan L , Yu N H and Hua G . 2017 . Coherent online video style transfer // Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV) . Venice, Italy : IEEE: 1114 - 1123 [ DOI: 10.1109/Iccv.2017.126 http://dx.doi.org/10.1109/Iccv.2017.126 ]
Chen L L , Li Z H , Maddox R K , Duan Z Y and Xu C L . 2018 . Lip movements generation at a glance // Proceedings of the 15th European Conference on Computer Vision (ECCV) . Munich, Germany : Springer: 538 - 553 [ DOI: 10.1007/978-3-030-01234-2_32 http://dx.doi.org/10.1007/978-3-030-01234-2_32 ]
Chen L L , Maddox R K , Duan Z Y and Xu C L . 2019 . Hierarchical cross-modal talking face generation with dynamic pixel-wise loss // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach, USA : IEEE: 7824 - 7833 [ DOI: 10.1109/CVPR.2019.00802 http://dx.doi.org/10.1109/CVPR.2019.00802 ]
Chen Z , Xu X D , Yan Y C , Pan Y , Zhu W H , Wu W , Dai B and Yang X K . 2023 . HyperStyle 3 D: text-guided 3D portrait stylization via hypernetworks [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/2304.09463.pdf https://arxiv.org/pdf/2304.09463.pdf
Cudeiro D , Bolkart T , Laidlaw C , Ranjan A and Black M J . 2019 . Capture, learning, and synthesis of 3D speaking styles // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach, USA : IEEE: 10093 - 10103 [ DOI: 10.1109/CVPR.2019.01034 http://dx.doi.org/10.1109/CVPR.2019.01034 ]
Das D , Biswas S , Sinha S and Bhowmick B . 2020 . Speech-driven facial animation using cascaded gans for learning of motion and texture // Proceedings of the 16th European Conference on Computer Vision . Glasgow, UK : Springer: 408 - 424 [ DOI: 10.1007/978-3-030-58577-8_25 http://dx.doi.org/10.1007/978-3-030-58577-8_25 ]
Edwards P , Landreth C , Fiume E and Singh K . 2016 . JALI: an animator-centric viseme model for expressive lip synchronization . ACM Transactions on Graphics (TOG) , 35 ( 4 ): # 127 [ DOI: 10.1145/2897824.2925984 http://dx.doi.org/10.1145/2897824.2925984 ]
Eskimez S E , Zhang Y and Duan Z Y . 2022 . Speech driven talking face generation from a single image and an emotion condition . IEEE Transactions on Multimedia , 24 : 3480 - 3490 [ DOI: 10.1109/TMM.2021.3099900 http://dx.doi.org/10.1109/TMM.2021.3099900 ]
Ezzat T and Poggio T . 2002 . MikeTalk: a talking facial display based on morphing visemes // Proceedings Computer Animation ’98 . Philadelphia, USA : IEEE: 96 - 102 [ DOI: 10.1109/CA.1998.681913 http://dx.doi.org/10.1109/CA.1998.681913 ]
Fan Y C , Qian Y , Xie F L and Soong F K . 2014 . TTS synthesis with bidirectional LSTM based recurrent neural networks // Interspeech 2014 . Singapore, Singapore : ISCA: 1964 - 1968 [ DOI: 10.21437/interspeech.2014-443 http://dx.doi.org/10.21437/interspeech.2014-443 ]
Fan Y R , Lin Z J , Saito J , Wang W P and Komura T . 2022 . FaceFormer: speech-driven 3D facial animation with transformers // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . New Orleans, USA : IEEE: 18749 - 18758 [ DOI: 10.1109/CVPR52688.2022.01821 http://dx.doi.org/10.1109/CVPR52688.2022.01821 ]
Fried O , Tewari A , Zollhöfer M , Finkelstein A , Shechtman E , Goldman D B , Genova K , Jin Z Y , Theobalt C and Agrawala M . 2019 . Text-based editing of talking-head video . ACM Transactions on Graphics (TOG) , 38 ( 4 ): # 68 [ DOI: 10.1145/3306346.3323028 http://dx.doi.org/10.1145/3306346.3323028 ]
Gal R , Patashnik O , Maron H , Chechik G and Cohen-Or D . 2021 . StyleGAN-NADA: CLIP-guided domain adaptation of image generators [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/2108.00946.pdf https://arxiv.org/pdf/2108.00946.pdf
Gao W , Lie Y J , Yin Y H and Yang M H . 2020 . Fast video multi-style transfer // Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV) . Snowmass, USA : IEEE: 3211 - 3219 [ DOI: 10.1109/WACV45572.2020.9093420 http://dx.doi.org/10.1109/WACV45572.2020.9093420 ]
Gatys L A , Ecker A S and Bethge M . 2016 . Image style transfer using convolutional neural networks // Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Las Vegas, USA : IEEE: 2414 - 2423 [ DOI: 10.1109/Cvpr.2016.265 http://dx.doi.org/10.1109/Cvpr.2016.265 ]
Geng J H , Shao T J , Zheng Y Y , Weng Y L and Zhou K . 2018 . Warp-guided gans for single-photo facial animation . ACM Transactions on Graphics (TOG) , 37 ( 6 ): # 231 [ DOI: 10.1145/3272127.3275043 http://dx.doi.org/10.1145/3272127.3275043 ]
Genova K , Cole F , Maschinot A , Sarna A , Vlasic D and Freeman W T . 2018 . Unsupervised training for 3D morphable model regression // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Salt Lake City, USA : IEEE: 8377 - 8386 [ DOI: 10.1109/CVPR.2018.00874 http://dx.doi.org/10.1109/CVPR.2018.00874 ]
Gulati A , Qin J , Chiu C C , Parmar N , Zhang Y , Yu J H , Han W , Wang S B , Zhang Z D , Wu Y H and Pang R M . 2020 . Conformer: convolution-augmented transformer for speech recognition [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/2005.08100.pdf https://arxiv.org/pdf/2005.08100.pdf
Guo Y D , Chen K Y , Liang S , Liu Y J , Bao H J and Zhang J Y . 2021 . AD-NeRF: audio driven neural radiance fields for talking head synthesis // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal, Canada : IEEE: 5764 - 5774 [ DOI: 10.1109/ICCV48922.2021.00573 http://dx.doi.org/10.1109/ICCV48922.2021.00573 ]
Guo Y D , Jiang L , Cai L and Zhang J Y . 2019 . 3 D magic mirror: automatic video to 3D caricature translation [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/1906.00544.pdf https://arxiv.org/pdf/1906.00544.pdf
Ha S , Kersner M , Kim B , Seo S and Kim D . 2020 . Marionette: few-shot face reenactment preserving identity of unseen targets // Proceedings of the 34th AAAI Conference on Artificial Intelligence . New York, USA : AAAI: 10893 - 10900 [ DOI: 10.1609/AAAI.V34I07.6721 http://dx.doi.org/10.1609/AAAI.V34I07.6721 ]
Han F Z , Ye S Q , He M M , Chai M L and Liao J . 2023 . Exemplar-based 3D portrait stylization . IEEE Transactions on Visualization and Computer Graphics , 29 ( 2 ): 1371 - 1383 [ DOI: 10.1109/TVCG.2021.3114308 http://dx.doi.org/10.1109/TVCG.2021.3114308 ]
Hao C H , Du Y Y , Wang L and Wang B B . 2024 . Survey of digital face rendering and appearance recovery methods . Journal of Image and Graphics , 29 ( 9 ): 2513 - 2540
郝琮晖 , 杜悠扬 , 王璐 , 王贝贝 . 2024 . 数字人脸渲染与外观恢复方法综述 . 中国图象图形学报 , 29 ( 9 ): 2513 - 2540 [ DOI: 10.11834/jig.230683 http://dx.doi.org/10.11834/jig.230683 ]
Haque A , Tancik M , Efros A A , Holynski A and Kanazawa A . 2023 . Instruct-NeRF2NeRF: editing 3D scenes with instructions [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/2303.12789.pdf https://arxiv.org/pdf/2303.12789.pdf
Ho J , Salimans T , Gritsenko A , Chan W , Norouzi M and Fleet D J . 2022 . Video diffusion models [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/2204.03458.pdf https://arxiv.org/pdf/2204.03458.pdf
Hu L , Qi J W , Zhang B , Pan P and Xu Y H . 2021 . Text-driven 3D avatar animation with emotional and expressive behaviors // Proceedings of the 29th ACM International Conference on Multimedia . Virtual Event, China : ACM: 2816 - 2818 [ DOI: 10.1145/3474085.3478569 http://dx.doi.org/10.1145/3474085.3478569 ]
Jamaludin A , Chung J S and Zisserman A . 2019 . You said that?: synthesising talking faces from audio . International Journal of Computer Vision , 127 ( 11 ): 1767 - 1779 [ DOI: 10.1007/S11263-019-01150-Y http://dx.doi.org/10.1007/S11263-019-01150-Y ]
Ji X Y , Zhou H , Wang K S Y , Wu Q Y , Wu W , Xu F and Cao X . 2022 . EAMM: one-shot emotional talking face via audio-based emotion-aware motion model // Proceedings of 2022 ACM SIGGRAPH Conference Proceedings . Vancouver, Canada : ACM: #61 [ DOI: 10.1145/3528233.3530745 http://dx.doi.org/10.1145/3528233.3530745 ]
Ji X Y , Zhou H , Wang K S Y , Wu W , Loy C C , Cao X and Xu F . 2021 . Audio-driven emotional video portraits // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville, USA : IEEE: 14075 - 14084 [ DOI: 10.1109/CVPR46437.2021.01386 http://dx.doi.org/10.1109/CVPR46437.2021.01386 ]
Jiang K W , Chen S Y , Liu F L , Fu H B and Gao L . 2022 . NeRFFaceEditing: disentangled face editing in neural radiance fields // Proceedings of 2022 SIGGRAPH Asia Conference Papers . Daegu, Republic of Korea : Association for Computing Machinery: #31 [ DOI: 10.1145/3550469.3555377 http://dx.doi.org/10.1145/3550469.3555377 ]
Kalberer G A and Van Gool L . 2001 . Face animation based on observed 3D speech dynamics // Proceedings of the 14th Conference on Computer Animation . Seoul, Korea (South) : IEEE: 20 - 27 [ DOI: 10.1109/CA.2001.982373 http://dx.doi.org/10.1109/CA.2001.982373 ]
Kalchbrenner N , Elsen E , Simonyan K , Noury S , Casagrande N , Lockhart E , Stimberg F , van den Oord A , Dieleman S and Kavukcuoglu K . 2018 . Efficient neural audio synthesis [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/1802.08435.pdf https://arxiv.org/pdf/1802.08435.pdf
Karras T , Aila T , Laine S , Herva A and Lehtinen J . 2017 . Audio-driven facial animation by joint end-to-end learning of pose and emotion . ACM Transactions on Graphics (TOG) , 36 ( 4 ): # 94 [ DOI: 10.1145/3072959.3073658 http://dx.doi.org/10.1145/3072959.3073658 ]
Karras T , Laine S and Aila T . 2019 . A style-based generator architecture for generative adversarial networks // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach, USA : IEEE: 4396 - 4405 [ DOI: 10.1109/CVPR.2019.00453 http://dx.doi.org/10.1109/CVPR.2019.00453 ]
Kim H , Elgharib M , Zollhöfer M , Seidel H P , Beeler T , Richardt C and Theobalt C . 2019 . Neural style-preserving visual dubbing . ACM Transactions on Graphics (TOG) , 38 ( 6 ): # 178 [ DOI: 10.1145/3355089.3356500 http://dx.doi.org/10.1145/3355089.3356500 ]
Kim J , Kim S , Kong J and Yoon S . 2020 . Glow-TTS: a generative flow for text-to-speech via monotonic alignment search // Proceedings of the 34th International Conference on Neural Information Processing Systems . Vancouver, Canada : Curran Associates Inc.: 8067 - 8077
Kim S , Lee S G , Song J , Kim J and Yoon S . 2018 . FloWaveNet: a generative flow for raw audio [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/1811.02155.pdf https://arxiv.org/pdf/1811.02155.pdf
Kulkarni T D , Narasimhan K R , Saeedi A and Tenenbaum J B . 2016 . Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation [EB/OL]. [ 2024-09-07 ]. https://arxiv.org/pdf/1604.06057.pdf https://arxiv.org/pdf/1604.06057.pdf
Kumar R , Sotelo J , Kumar K , De Brébisson A and Bengio Y . 2017 . ObamaNet: photo-realistic lip-sync from text [EB/OL]. [ 2024-09-07 ]. https://arxiv.org/pdf/1801.01442.pdf https://arxiv.org/pdf/1801.01442.pdf
Li J H , Zhang J W , Bai X , Zhou J and Gu L . 2023 . Efficient region-aware neural radiance fields for high-fidelity talking portrait synthesis // Proceedings of 2023 IEEE/CVF International Conference on Computer Vision . Paris, France : IEEE: 7534 - 7544 [ DOI: 10.1109/ICCV51070.2023.00696 http://dx.doi.org/10.1109/ICCV51070.2023.00696 ]
Li L C , Wang S Z , Zhang Z M , Ding Y , Zheng Y X , Yu X and Fan C J . 2021 . Write-a-speaker: text-based emotional and rhythmic talking-head generation // Proceedings of the 35th AAAI Conference on Artificial Intelligence . Virtually : AAAI: 1911 - 1920 [ DOI: 10.1609/AAAI.V35I3.16286 http://dx.doi.org/10.1609/AAAI.V35I3.16286 ]
Li N H , Liu S J , Liu Y Q , Zhao S and Liu M . 2019 . Neural speech synthesis with transformer network // Proceedings of the 33rd AAAI Conference on Artificial Intelligence . Honolulu, USA : AAAI: 6706 - 6713 [ DOI: 10.1609/aaai.v33i01.33016706 http://dx.doi.org/10.1609/aaai.v33i01.33016706 ]
Li S X . 2023 . Instruct-Video2Avatar: video-to-avatar generation with instructions [EB/OL]. [ 2024-09-07 ]. https://arxiv.org/pdf/2306.02903.pdf https://arxiv.org/pdf/2306.02903.pdf
Li S X and Pan Y . 2023 . Rendering and reconstruction based 3D portrait stylization // Proceedings of 2023 IEEE International Conference on Multimedia and Expo (ICME) . Brisbane, Australia : IEEE: 912 - 917 [ DOI: 10.1109/ICME55011.2023.00161 http://dx.doi.org/10.1109/ICME55011.2023.00161 ]
Liao Y H , Qian W H and Cao J D . 2023 . MStarGAN: a face style transfer network with changeable style intensity . Journal of Image and Graphics , 28 ( 12 ): 3784 - 3796
廖远鸿 , 钱文华 , 曹进德 . 2023 . 风格强度可变的人脸风格迁移网络 . 中国图象图形学报 , 28 ( 12 ): 3784 - 3796 [ DOI: 10.11834/jig.221149 http://dx.doi.org/10.11834/jig.221149 ]
Lin J K , Yuan Y and Zou Z X . 2021 . MeInGame: create a game character face from a single portrait // Proceedings of the 35th AAAI Conference on Artificial Intelligence . Virtual Event : AAAI: 311 - 319 [ DOI: 10.1609/AAAI.v35i1.16106 http://dx.doi.org/10.1609/AAAI.v35i1.16106 ]
Liu A A , Su Y T , Wang L J , Li B , Qian Z X , Zhang W M , Zhou L N , Zhang X P , Zhang Y D , Huang J W and Yu N H . 2024 . Review on the progress of the AIGC visual content generation and traceability . Journal of Image and Graphics , 29 ( 6 ): 1535 - 1554
刘安安 , 苏育挺 , 王岚君 , 李斌 , 钱振兴 , 张卫明 , 周琳娜 , 张新鹏 , 张勇东 , 黄继武 , 俞能海 . 2024 . AIGC视觉内容生成与溯源研究进展 . 中国图象图形学报 , 29 ( 6 ): 1535 - 1554 [ DOI: 10.11834/jig.240003 http://dx.doi.org/10.11834/jig.240003 ]
Liu J L , Zhu Z Y , Ren Y , Huang W C , Huai B X , Yuan N and Zhao Z . 2022a . Parallel and high-fidelity text-to-lip generation // Proceedings of the 36th AAAI Conference on Artificial Intelligence . Virtual Event : AAAI: 1738 - 1746 [ DOI: 10.1609/AAAI.V36I2.20066 http://dx.doi.org/10.1609/AAAI.V36I2.20066 ]
Liu N , Li S , Du Y L , Torralba A and Tenenbaum J B . 2022b . Compositional visual generation with composable diffusion models // Proceedings of the 17th European Conference on Computer Vision . Tel Aviv, Israel : Springer: 423 - 439 [ DOI: 10.1007/978-3-031-19790-1_26 http://dx.doi.org/10.1007/978-3-031-19790-1_26 ]
Liu X , Xu Y H , Wu Q Y , Zhou H , Wu W and Zhou B L . 2022c . Semantic-aware implicit neural audio-driven video portrait generation // Proceedings of the 17th European Conference on Computer Vision . Tel Aviv, Israel : Springer: 106 - 125 [ DOI: 10.1007/978-3-031-19836-6_7 http://dx.doi.org/10.1007/978-3-031-19836-6_7 ]
Mehri S , Kumar K , Gulrajani I , Kumar R , Jain S , Sotelo J , Courville A and Bengio Y . 2017 . SampleRNN: an unconditional end-to-end neural audio generation model [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/1612.07837.pdf https://arxiv.org/pdf/1612.07837.pdf
Miao C F , Liang S , Chen M C , Ma J , Wang S J and Xiao J . 2020 . Flow-TTS: a non-autoregressive network for text to speech based on flow // Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . Barcelona, Spain : IEEE: 7209 - 7213 [ DOI: 10.1109/icassp40776.2020.9054484 http://dx.doi.org/10.1109/icassp40776.2020.9054484 ]
Mikolov T , Karafit M , Burget L , Černocký J and Khudanpur S . 2010 . Recurrent neural network based language model // Interspeech 2010 . Makuhari, Japan : ISCA: 1045 - 1048 [ DOI: 10.21437/interspeech.2010-343 http://dx.doi.org/10.21437/interspeech.2010-343 ]
Mildenhall B , Srinivasan P P , Tancik M , Barron J T , Ramamoorthi R and Ng R . 2021 . NeRF: representing scenes as neural radiance fields for view synthesis . Communications of the ACM , 65 ( 1 ): 99 - 106 [ DOI: 10.1145/3503250 http://dx.doi.org/10.1145/3503250 ]
Müller T , Evans A , Schied C and Keller A . 2022 . Instant neural graphics primitives with a multiresolution hash encoding [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/2201.05989.pdf https://arxiv.org/pdf/2201.05989.pdf
Nguyen-Phuoc T , Schwartz G , Ye Y T , Lombardi S and Xiao L . 2023 . AlteredAvatar: stylizing dynamic 3D avatars with fast style adaptation [EB/OL]. [ 2024-09-07 ]. https://arxiv.org/pdf/2305.19245v1.pdf https://arxiv.org/pdf/2305.19245v1.pdf
Olivier N , Kerbiriou G , Arguelaguet F , Avril Q , Danieau F , Guillotel P , Hoyet L and Multon F . 2022 . Study on automatic 3D facial caricaturization: from rules to deep learning . Frontiers in Virtual Reality , 2 : # 785104 [ DOI: 10.3389/FRVIR.2021.785104 http://dx.doi.org/10.3389/FRVIR.2021.785104 ]
Ouyang L , Wu J , Jiang X , Almeida D , Wainwright C L , Mishkin P , Zhang C , Agarwal S , Slama K , Ray A , Schulman J , Hilton J , Kelton F , Miller L , Simens M , Askell A , Welinder P , Christiano P , Leike J and Lowe R . 2022 . Training language models to follow instructions with human feedback [EB/OL]. [ 2024-09-07 ]. https://arxiv.org/pdf/2203.02155.pdf https://arxiv.org/pdf/2203.02155.pdf
Park D S , Chan W , Zhang Y , Chiu C C , Zoph B , Cubuk E D and Le Q V . 2019 . SpecAugment: a simple data augmentation method for automatic speech recognition [EB/OL]. [ 2023-09-12 ]. https://arxiv.org/pdf/1904.08779.pdf https://arxiv.org/pdf/1904.08779.pdf
Park S J , Kim M , Hong J , Choi J and Ro Y M . 2022 . SyncTalkFace: talking face generation with precise lip-syncing via audio-lip memory // Proceedings of the 36th AAAI Conference on Artificial Intelligence . Virtual Event : AAAI: 2062 - 2070 [ DOI: 10.1609/AAAI.V36I2.20102 http://dx.doi.org/10.1609/AAAI.V36I2.20102 ]
Peddinti V , Povey D and Khudanpur S . 2015 . A time delay neural network architecture for efficient modeling of long temporal contexts // Interspeech 2015 . Dresden, Germany : ISCA: 3214 - 3218 [ DOI: 10.21437/interspeech.2015-647 http://dx.doi.org/10.21437/interspeech.2015-647 ]
Povey D , Cheng G F , Wang Y M , Li K , Xu H N , Yarmohammadi M and Khudanpur S . 2018 . Semi-orthogonal low-rank matrix factorization for deep neural networks // Interspeech 2018 . Hyderabad, India : ISCA: 3743 - 3747 [ DOI: 10.21437/interspeech.2018-1417 http://dx.doi.org/10.21437/interspeech.2018-1417 ]
Prajwal K R , Mukhopadhyay R , Philip J , Jha A , Namboodiri V and Jawahar C . 2019 . Towards automatic face-to-face translation // Proceedings of the 27th ACM International Conference on Multimedia . New York, USA : ACM: 1428 - 1436 [ DOI: 10.1145/3343031.3351066 http://dx.doi.org/10.1145/3343031.3351066 ]
Prajwal K R, Mukhopadhyay R, Namboodiri V P and Jawahar C V. 2020. A lip sync expert is all you need for speech to lip generation in the wild // Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 484-492 [DOI: 10.1145/3394171.3413532]
Radford A, Kim J W, Xu T, Brockman G, McLeavey C and Sutskever I. 2022. Robust speech recognition via large-scale weak supervision [EB/OL]. [2024-09-07]. https://arxiv.org/pdf/2212.04356.pdf
Ren Y, Ruan Y J, Tan X, Qin T, Zhao S, Zhao Z and Liu T Y. 2019. FastSpeech: fast, robust and controllable text to speech [EB/OL]. [2023-09-12]. https://arxiv.org/pdf/1905.09263.pdf
Richard A, Zollhöfer M, Wen Y D, de la Torre F and Sheikh Y. 2021. MeshTalk: 3D face animation from speech using cross-modality disentanglement // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 1153-1162 [DOI: 10.1109/ICCV48922.2021.00121]
Rombach R, Blattmann A, Lorenz D, Esser P and Ommer B. 2022. High-resolution image synthesis with latent diffusion models // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 10674-10685 [DOI: 10.1109/CVPR52688.2022.01042]
Ruder M, Dosovitskiy A and Brox T. 2016. Artistic style transfer for videos // Proceedings of the 38th German Conference on Pattern Recognition. Hannover, Germany: Springer: 26-36 [DOI: 10.1007/978-3-319-45886-1_3]
Sadoughi N and Busso C. 2021. Speech-driven expressive talking lips with conditional sequential generative adversarial networks. IEEE Transactions on Affective Computing, 12(4): 1031-1044 [DOI: 10.1109/TAFFC.2019.2916031]
Sak H, Senior A and Beaufays F. 2014. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition [EB/OL]. [2024-09-07]. https://arxiv.org/pdf/1402.1128.pdf
Schneider S, Baevski A, Collobert R and Auli M. 2019. Wav2vec: unsupervised pre-training for speech recognition // Interspeech 2019. Graz, Austria: ISCA: 3465-3469 [DOI: 10.21437/Interspeech.2019-1873]
Shao R Z, Sun J X, Peng C, Zheng Z R, Zhou B Y, Zhang H W and Liu Y B. 2023a. Control4D: efficient 4D portrait editing with text [EB/OL]. [2024-09-07]. https://arxiv.org/pdf/2305.20082.pdf
Shao R Z, Zheng Z R, Tu H Z, Liu B N, Zhang H W and Liu Y B. 2023b. Tensor4D: efficient neural 4D decomposition for high-fidelity dynamic reconstruction and rendering // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 16632-16642 [DOI: 10.1109/CVPR52729.2023.01596]
Shen S, Li W H, Zhu Z, Duan Y Q, Zhou J and Lu J W. 2022. Learning dynamic facial radiance fields for few-shot talking head synthesis // Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 666-682 [DOI: 10.1007/978-3-031-19775-8_39]
Shen T, Zuo J W, Shi F, Zhang J, Jiang L Q, Chen M, Zhang Z C, Zhang W, He X D and Mei T. 2021. ViDA-MAN: visual dialog with digital humans // Proceedings of the 29th ACM International Conference on Multimedia. Virtual Event, China: ACM: 2789-2791 [DOI: 10.1145/3474085.3478560]
Shi T Y, Yuan Y, Fan C J, Zou Z X, Shi Z W and Liu Y. 2019. Face-to-parameter translation for game character auto-creation // Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 161-170 [DOI: 10.1109/ICCV.2019.00025]
Siarohin A, Lathuilière S, Tulyakov S, Ricci E and Sebe N. 2020. First order motion model for image animation [EB/OL]. [2023-09-12]. https://arxiv.org/pdf/2003.00196.pdf
Sohl-Dickstein J, Weiss E A, Maheswaranathan N and Ganguli S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics // Proceedings of the 32nd International Conference on Machine Learning. Lille, France: JMLR.org: 2256-2265
Song H K, Woo S H, Lee J, Yang S, Cho H, Lee Y, Choi D and Kim K W. 2022a. Talking face generation with multilingual TTS // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 21393-21398 [DOI: 10.1109/CVPR52688.2022.02074]
Song L S, Wu W, Qian C, He R and Loy C C. 2022b. Everybody's talkin': let me talk as you want. IEEE Transactions on Information Forensics and Security, 17: 585-598 [DOI: 10.1109/TIFS.2022.3146783]
Song W, Yuan X, Zhang Z C, Zhang C, Wu Y Z, He X D and Zhou B W. 2021. DIAN: duration informed auto-regressive network for voice cloning // Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 8598-8602 [DOI: 10.1109/icassp39728.2021.9414727]
Song Y, Zhu J W, Li D W, Wang X and Qi H R. 2019. Talking face generation by conditional recurrent adversarial network [EB/OL]. [2024-09-07]. https://arxiv.org/pdf/1804.04786.pdf
Sun Y Q, He R, Tan W M and Yan B. 2023. Instruct-NeuralTalker: editing audio-driven talking radiance fields with instructions [EB/OL]. [2024-09-07]. https://arxiv.org/pdf/2306.10813.pdf
Tang J X, Wang K S Y, Zhou H, Chen X K, He D L, Hu T S, Liu J T, Zeng G and Wang J D. 2022. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition [EB/OL]. [2023-09-12]. https://arxiv.org/pdf/2211.12368.pdf
Tao J H, Fan C H, Lian Z, Lyu Z, Shen Y and Liang S. 2024. Development of multimodal sentiment recognition and understanding. Journal of Image and Graphics, 29(6): 1607-1627 [DOI: 10.11834/jig.240017] (in Chinese)
Tao J H, Gong J T, Gao N, Fu S W, Liang S and Yu C. 2023. Human-computer interaction for virtual-real fusion. Journal of Image and Graphics, 28(6): 1513-1542 [DOI: 10.11834/jig.230020] (in Chinese)
Tao J H, Wu Y C, Yu C, Weng D D, Li G J, Han T, Wang Y T and Liu B. 2022. A survey on multi-modal human-computer interaction. Journal of Image and Graphics, 27(6): 1956-1987 [DOI: 10.11834/jig.220151] (in Chinese)
Thies J, Elgharib M, Tewari A, Theobalt C and Nießner M. 2020. Neural voice puppetry: audio-driven facial reenactment // Proceedings of the 16th European Conference on Computer Vision (ECCV 2020). Glasgow, UK: Springer: 716-731 [DOI: 10.1007/978-3-030-58517-4_42]
Tian G Z, Yuan Y and Liu Y. 2019. Audio2Face: generating speech/face animation from single audio with attention-based bidirectional LSTM networks // Proceedings of 2019 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). Shanghai, China: IEEE: 366-371 [DOI: 10.1109/ICMEW.2019.00069]
Valin J M, Isik U, Smaragdis P and Krishnaswamy A. 2022. Neural speech synthesis on a shoestring: improving the efficiency of LPCNet // Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore: IEEE: 8437-8441 [DOI: 10.1109/icassp43922.2022.9746103]
Valin J M and Skoglund J. 2019. LPCNet: improving neural speech synthesis through linear prediction // Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE: 5891-5895 [DOI: 10.1109/icassp.2019.8682804]
van den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A and Kavukcuoglu K. 2016. WaveNet: a generative model for raw audio [EB/OL]. [2024-09-07]. https://arxiv.org/pdf/1609.03499.pdf
van den Oord A, Li Y Z, Babuschkin I, Simonyan K, Vinyals O, Kavukcuoglu K, van den Driessche G, Lockhart E, Cobo L C, Stimberg F, Casagrande N, Grewe D, Noury S, Dieleman S, Elsen E, Kalchbrenner N, Zen H, Graves A, King H, Walters T, Belov D and Hassabis D. 2017. Parallel WaveNet: fast high-fidelity speech synthesis [EB/OL]. [2023-09-12]. https://arxiv.org/pdf/1711.10433.pdf
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need // Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Verma A, Rajput N and Subramaniam L V. 2003. Using viseme based acoustic models for speech driven lip synthesis // Proceedings of 2003 International Conference on Multimedia and Expo. Baltimore, USA: IEEE: 533-536 [DOI: 10.1109/ICME.2003.1221366]
Wang C, Jiang R X, Chai M L, He M M, Chen D D and Liao J. 2024. NeRF-Art: text-driven neural radiance fields stylization. IEEE Transactions on Visualization and Computer Graphics, 30(8): 4983-4996 [DOI: 10.1109/TVCG.2023.3283400]
Wang H, Lin G S, Hoi S C H and Miao C Y. 2022a. 3D cartoon face generation with controllable expressions from a single GAN image [EB/OL]. [2024-09-07]. https://arxiv.org/pdf/2207.14425.pdf
Wang K S Y, Wu Q Y, Song L S, Yang Z Q, Wu W, Qian C, He R, Qiao Y and Loy C C. 2020. MEAD: a large-scale audio-visual dataset for emotional talking-face generation // Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 700-717 [DOI: 10.1007/978-3-030-58589-1_42]
Wang S Z, Li L C, Ding Y, Fan C J and Yu X. 2021. Audio2Head: audio-driven one-shot talking-head generation with natural head motion // Proceedings of the 30th International Joint Conference on Artificial Intelligence. Virtual Event: IJCAI: 1098-1105 [DOI: 10.24963/IJCAI.2021/152]
Wang S Z, Li L C, Ding Y and Yu X. 2022b. One-shot talking face generation from single-speaker audio-visual correlation learning // Proceedings of the 36th AAAI Conference on Artificial Intelligence. Virtually: AAAI: 2531-2539 [DOI: 10.1609/AAAI.V36I3.20154]
Wang S Z, Zeng W H, Wang X, Yang H, Chen L, Zhang C, Wu M, Yuan Y, Zeng Y, Zheng M and Liu J. 2023. SwiftAvatar: efficient auto-creation of parameterized stylized character on arbitrary avatar engines // Proceedings of the 37th AAAI Conference on Artificial Intelligence. Washington, USA: AAAI: 6101-6109 [DOI: 10.1609/AAAI.v37i5.25753]
Wen X, Wang M, Richardt C, Chen Z Y and Hu S M. 2020. Photorealistic audio-driven video portraits. IEEE Transactions on Visualization and Computer Graphics, 26(12): 3457-3466 [DOI: 10.1109/TVCG.2020.3023573]
Wiles O, Koepke A S and Zisserman A. 2018. X2Face: a network for controlling face generation using images, audio, and pose codes // Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 690-706 [DOI: 10.1007/978-3-030-01261-8_41]
Wood E, Baltrušaitis T, Hewitt C, Johnson M, Shen J J, Milosavljević N, Wilde D, Garbin S, Sharp T, Stojiljković I, Cashman T and Valentin J. 2022. 3D face reconstruction with dense landmarks // Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 160-177 [DOI: 10.1007/978-3-031-19778-9_10]
Wu J Z, Ge Y X, Wang X T, Lei W X, Gu Y C, Shi Y F, Hsu W, Shan Y, Qie X H and Shou M Z. 2023. Tune-A-Video: one-shot tuning of image diffusion models for text-to-video generation [EB/OL]. [2023-09-12]. https://arxiv.org/pdf/2212.11565.pdf
Wu Z Z, Watts O and King S. 2016. Merlin: an open source neural network speech synthesis system // The 9th ISCA Speech Synthesis Workshop. Sunnyvale, USA: [s.n.]: 202-207 [DOI: 10.21437/SSW.2016-33]
Yang D C, Liu S X, Huang R J, Weng C and Meng H L. 2023a. InstructTTS: modelling expressive TTS in discrete latent space with natural language style prompt [EB/OL]. [2024-09-07]. https://arxiv.org/pdf/2301.13662.pdf
Yang S, Zhou Y F, Liu Z W and Loy C C. 2023b. Rerender a video: zero-shot text-guided video-to-video translation // Proceedings of 2023 SIGGRAPH Asia Conference Papers. Sydney, Australia: ACM: #95 [DOI: 10.1145/3610548.3618160]
Yao S Y, Zhong R Z, Yan Y C, Zhai G T and Yang X K. 2022. DFA-NeRF: personalized talking head generation via disentangled face attributes neural rendering [EB/OL]. [2023-09-12]. https://arxiv.org/pdf/2201.00791.pdf
Ye Z H, Jiang Z Y, Ren Y, Liu J L, He J Z and Zhao Z. 2023a. GeneFace: generalized and high-fidelity audio-driven 3D talking face synthesis [EB/OL]. [2023-09-12]. https://arxiv.org/pdf/2301.13430.pdf
Ye Z P, Xia M F, Sun Y N, Yi R, Yu M J, Zhang J Y, Lai Y K and Liu Y J. 2023b. 3D-CariGAN: an end-to-end solution to 3D caricature generation from normal face photos. IEEE Transactions on Visualization and Computer Graphics, 29(4): 2203-2210 [DOI: 10.1109/tvcg.2021.3126659]
Yi R, Ye Z P, Zhang J Y, Bao H J and Liu Y J. 2020. Audio-driven talking face video generation with learning-based personalized head pose [EB/OL]. [2024-09-07]. https://arxiv.org/pdf/2002.10137.pdf
Yin F, Zhang Y, Cun X, Cao M D, Fan Y B, Wang X, Bai Q Y, Wu B Y, Wang J and Yang Y J. 2022. StyleHEAT: one-shot high-resolution editable talking face generation via pre-trained StyleGAN // Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 85-101 [DOI: 10.1007/978-3-031-19790-1_6]
Yu L Y, Yu J and Ling Q. 2019. Mining audio, text and visual information for talking face generation // Proceedings of 2019 IEEE International Conference on Data Mining (ICDM). Beijing, China: IEEE: 787-795 [DOI: 10.1109/ICDM.2019.00089]
Zaremba W, Sutskever I and Vinyals O. 2015. Recurrent neural network regularization [EB/OL]. [2024-09-07]. https://arxiv.org/pdf/1409.2329.pdf
Zeng Z, Wang J Z, Cheng N, Xia T and Xiao J. 2020. AlignTTS: efficient feed-forward text-to-speech system without explicit alignment // Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE: 6714-6718 [DOI: 10.1109/icassp40776.2020.9054119]
Zhang B W, Qi C Y, Zhang P, Zhang B, Wu H, Chen D, Chen Q F, Wang Y and Wen F. 2023a. MetaPortrait: identity-preserving talking head generation with fast personalized adaptation // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 22096-22105 [DOI: 10.1109/CVPR52729.2023.02116]
Zhang C X, Zhao Y F, Huang Y F, Zeng M, Ni S F, Budagavi M and Guo X H. 2021. FACIAL: synthesizing dynamic talking face with implicit attribute learning // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 3847-3856 [DOI: 10.1109/ICCV48922.2021.00384]
Zhang H, Yuan T, Chen J K, Li X T, Zheng R J, Huang Y X, Chen X J, Gong E L, Chen Z Y, Hu X G, Yu D H, Ma Y J and Huang L. 2022a. PaddleSpeech: an easy-to-use all-in-one speech toolkit [EB/OL]. [2024-09-07]. https://arxiv.org/pdf/2205.12007.pdf
Zhang L M, Rao A Y and Agrawala M. 2023b. Adding conditional control to text-to-image diffusion models [EB/OL]. [2024-09-07]. https://arxiv.org/pdf/2302.05543.pdf
Zhang L W, Qiu Q W, Lin H Y, Zhang Q X, Shi C, Yang W, Shi Y, Yang S B, Xu L and Yu J Y. 2023c. DreamFace: progressive generation of animatable 3D faces under text guidance [EB/OL]. [2023-09-12]. https://arxiv.org/pdf/2304.03117.pdf
Zhang Q, Lu H, Sak H, Tripathi A, McDermott E, Koo S and Kumar S. 2020. Transformer transducer: a streamable speech recognition model with transformer encoders and RNN-T loss // Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE: 7829-7833 [DOI: 10.1109/icassp40776.2020.9053896]
Zhang W X, Cun X, Wang X, Zhang Y, Shen X, Guo Y, Shan Y and Wang F. 2023d. SadTalker: learning realistic 3D motion coefficients for stylized audio-driven single image talking face animation // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 8652-8661 [DOI: 10.1109/CVPR52729.2023.00836]
Zhang Y H, He W H, Li M L, Tian K, Zhang Z Y, Cheng J, Wang Y Y and Liao J X. 2022c. Meta talk: learning to data-efficiently generate audio-driven lip-synchronized talking face with high definition // Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore: IEEE: 4848-4852 [DOI: 10.1109/ICASSP43922.2022.9747284]
Zhen R, Song W C and Cao J. 2022. Research on the application of virtual human synthesis technology in human-computer interaction // Proceedings of the 22nd IEEE/ACIS International Conference on Computer and Information Science (ICIS). Zhuhai, China: IEEE: 199-204 [DOI: 10.1109/ICIS54925.2022.9882355]
Zhen R, Song W C, He Q, Cao J, Shi L and Luo J. 2023. Human-computer interaction system: a survey of talking-head generation. Electronics, 12(1): #218 [DOI: 10.3390/electronics12010218]
Zhou H, Liu Y, Liu Z W, Luo P and Wang X G. 2019. Talking face generation by adversarially disentangled audio-visual representation // Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Honolulu, USA: AAAI: 9299-9306 [DOI: 10.1609/AAAI.V33I01.33019299]
Zhou H, Sun Y S, Wu W, Loy C C, Wang X G and Liu Z W. 2021. Pose-controllable talking face generation by implicitly modularized audio-visual representation // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 4174-4184 [DOI: 10.1109/CVPR46437.2021.00416]
Zhou Y, Han X T, Shechtman E, Echevarria J, Kalogerakis E and Li D Z Y. 2020. MakeItTalk: speaker-aware talking-head animation. ACM Transactions on Graphics (TOG), 39(6): #221 [DOI: 10.1145/3414685.3417774]
Zielonka W, Bolkart T and Thies J. 2023. Instant volumetric head avatars // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 4574-4584 [DOI: 10.1109/CVPR52729.2023.00444]