多模态数字人建模、合成与驱动综述
Multi-modal digital human modeling, synthesis, and driving: a survey
2024, Vol. 29, No. 9, Pages: 2494-2512
Print publication date: 2024-09-16
DOI: 10.11834/jig.230649
高玄, 刘东宇, 张举勇. 2024. 多模态数字人建模、合成与驱动综述. 中国图象图形学报, 29(09):2494-2512
Gao Xuan, Liu Dongyu, Zhang Juyong. 2024. Multi-modal digital human modeling, synthesis, and driving: a survey. Journal of Image and Graphics, 29(09):2494-2512
多模态数字人是指具备多模态认知与交互能力,且有类人的思维和行为逻辑的真实自然虚拟人。近年来随着计算机视觉与自然语言处理等领域的交叉融合以及蓬勃发展,相关技术取得显著进步。本文讨论在图形学和视觉领域比较重要的多模态人头动画、多模态人体动画以及多模态数字人形象构建3个主题,介绍其方法论和代表工作。在多模态人头动画主题下介绍语音驱动人头和表情驱动人头两个问题的相关工作。在多模态人体动画主题下介绍基于循环神经网络(recurrent neural networks,RNN)的、基于Transformer的和基于降噪扩散模型的人体动画生成。在多模态数字人形象构建主题下介绍视觉语言相似性引导的虚拟形象构建、基于多模态降噪扩散模型引导的虚拟形象构建以及三维多模态虚拟人生成模型。本文将相关方向的代表性工作进行介绍和归类,对已有方法进行总结,并展望未来可能的研究方向。
A multimodal digital human refers to a digital avatar that can perform multimodal cognition and interaction and can think and behave like a human being. Substantial progress has been made in related technologies due to cross-fertilization and vibrant development in various fields, such as computer vision and natural language processing. This article discusses three major themes in the areas of computer graphics and computer vision: multimodal head animation, multimodal body animation, and multimodal portrait creation. The methodologies and representative works in these areas are also introduced. Under the theme of multimodal head animation, this work presents research on speech- and expression-driven head models. Under the theme of multimodal body animation, the paper explores techniques involving recurrent neural network (RNN)-, Transformer-, and denoising diffusion probabilistic model (DDPM)-based body animation. The discussion of multimodal portrait creation covers portrait creation guided by visual-linguistic similarity, portrait creation guided by multimodal denoising diffusion models, and three-dimensional (3D) multimodal generative models for digital portraits. Further, this article provides an overview and classification of representative works in these research directions, summarizes existing methods, and points out potential future research directions. This article delves into key directions in the field of multimodal digital humans and covers multimodal head animation, multimodal body animation, and the construction of multimodal digital human representations. In the realm of multimodal head animation, we extensively explore two major tasks: expression- and speech-driven animation. Expression-driven head animation relies on explicit and implicit parameterized models, which use mesh surfaces and neural radiance fields (NeRF), respectively, for rendering. Explicit models employ 3D morphable and linear models but encounter challenges such as weak expressive capacity, nondifferentiable rendering, and difficult modeling of personalized features. By contrast, implicit models, especially those based on NeRF, demonstrate superior expressive capacity and realism. In the domain of speech-driven head animation, we review 2D and 3D methods, with a particular focus on the advantages of NeRF technology in enhancing realism. 2D speech-driven head video generation utilizes techniques such as generative adversarial networks and image transfer but depends on 3D prior knowledge and structural characteristics. On the other hand, NeRF-based methods, such as audio-driven NeRF for talking head synthesis (AD-NeRF) and semantic-aware implicit neural audio-driven video portrait generation (SSP-NeRF), achieve end-to-end training with differentiable NeRF, which substantially improves rendering realism, although challenges of slow training and inference remain. Multimodal body animation focuses on speech-driven body animation, music-driven dance, and text-driven body animation. We emphasize the importance of learning speech semantics and melody and discuss the applications of RNN, Transformer, and denoising diffusion models in this field. Transformers have gradually replaced RNNs as the mainstream model, gaining notable advantages in sequence learning through attention mechanisms.
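To make the explicit/implicit contrast above concrete, the two underlying formulations can be written as follows. This is a generic sketch using the standard 3DMM and NeRF notation of the cited works (Blanz and Vetter, 1999; Mildenhall et al., 2020), not the formulation of any single surveyed method.

% Explicit representation: a linear 3D morphable model (3DMM); face geometry is a mean
% shape plus weighted identity and expression offsets
S(\alpha, \beta) = \bar{S} + \sum_{i=1}^{m} \alpha_i s_i + \sum_{j=1}^{n} \beta_j e_j

% Implicit representation: NeRF volume rendering; the color of a camera ray r(t) = o + t d
% is accumulated from the density \sigma and radiance c predicted by an MLP
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, c(\mathbf{r}(t), \mathbf{d})\, \mathrm{d}t,
\qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, \mathrm{d}s\right)

Expression or audio features enter as additional conditioning inputs of the implicit function in methods such as AD-NeRF and SSP-NeRF, which is what makes those pipelines differentiable and end-to-end trainable.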
We also highlight body animation generation based on denoising diffusion models, such as free-form language-based motion synthesis and editing (FLAME), the motion diffusion model (MDM), and text-driven human motion generation with a diffusion model (MotionDiffuse), as well as multimodal denoising networks conditioned on music and text. In the realm of the construction of multimodal digital human representations, the article emphasizes virtual avatar construction guided by visual-language similarity and by denoising diffusion models. In addition, the demand for large-scale, diverse datasets in digital human representation construction is addressed to foster powerful and universal generative models. The three key aspects of multimodal digital humans are systematically explored: head animation, body animation, and digital human representation construction. In summary, explicit head models, although simple, editable, and computationally efficient, lack expressive capacity and face challenges in rendering, especially in modeling personalized facial details and nonfacial regions. By contrast, implicit models, especially those using NeRF, demonstrate stronger modeling capabilities and realistic rendering effects. In the realm of speech-driven animation, NeRF-based solutions for head animation overcome the limitations of 2D talking-head and 3D digital head animation and achieve more natural and realistic talking-head videos. Regarding body animation models, Transformers are gradually replacing RNNs, and denoising diffusion models show potential for addressing the mapping challenges in multimodal body animation. Finally, digital human representation construction still faces challenges, with visual-language similarity and denoising diffusion model guidance showing promising results; however, the difficulty lies in the direct construction of 3D multimodal virtual humans due to the lack of sufficient 3D virtual human datasets. This study comprehensively analyzes these issues and provides clear directions and challenges for future research. In conclusion, future work should focus on the development of multimodal digital humans. Key directions include improvement of 3D modeling and real-time rendering accuracy, integration of speech-driven and facial expression synthesis, construction of large and diverse datasets, exploration of multimodal information fusion and cross-modal learning, and attention to ethical and social impacts. Implicit representation methods, such as neural volume rendering, are crucial for improved 3D modeling. Simultaneously, the construction of larger datasets poses a formidable challenge in the development of robust and universal generative models. Exploration of multimodal information fusion and cross-modal learning allows models to learn from diverse data sources and present a range of behaviors and expressions. Attention to ethical and social impacts, including digital identity and privacy, is also crucial. Such research directions should guide the field toward a comprehensive, realistic, and universal future, with profound influence on interactions in virtual spaces.
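The denoising-diffusion route to body animation highlighted above (FLAME, MDM, MotionDiffuse) can be summarized by the following minimal sampling sketch in Python. It is an illustration of conditional DDPM ancestral sampling under the standard formulas, not the released code of any of these systems; `denoiser` and `text_emb` are hypothetical placeholders for a trained motion denoising network and a text-condition embedding.

import torch

def sample_motion(denoiser, text_emb, n_frames=120, pose_dim=263, n_steps=1000, device="cpu"):
    """DDPM-style sampling of a motion sequence [n_frames, pose_dim] conditioned on text.
    `denoiser` (hypothetical) predicts the clean motion x0 from (x_t, t, condition);
    pose_dim=263 corresponds to, e.g., the HumanML3D per-frame feature dimension."""
    betas = torch.linspace(1e-4, 0.02, n_steps, device=device)   # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(n_frames, pose_dim, device=device)           # start from Gaussian noise
    for t in reversed(range(n_steps)):
        t_batch = torch.full((1,), t, device=device)
        x0_hat = denoiser(x, t_batch, text_emb)                  # predicted clean motion
        a_bar_t = alpha_bar[t]
        a_bar_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0, device=device)
        # Standard DDPM posterior q(x_{t-1} | x_t, x0_hat)
        coef_x0 = betas[t] * torch.sqrt(a_bar_prev) / (1.0 - a_bar_t)
        coef_xt = (1.0 - a_bar_prev) * torch.sqrt(alphas[t]) / (1.0 - a_bar_t)
        mean = coef_x0 * x0_hat + coef_xt * x
        var = betas[t] * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(var) * noise
    return x  # denoised motion sequence, e.g., per-frame joint features

In practice, the surveyed methods differ mainly in the network backbone (typically a Transformer over the frame axis), in whether the network predicts the clean motion or the added noise, and in how classifier-free guidance mixes conditional and unconditional predictions.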
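For avatar construction guided by visual-language similarity, the core loop shared by the CLIP-guided methods cited here (e.g., StyleCLIP, AvatarCLIP) is to render the current avatar, embed the rendered image and a text prompt with CLIP, and minimize their cosine distance. The sketch below assumes a hypothetical differentiable renderer `render_avatar` whose output is already resized and normalized to CLIP's expected 224 x 224 input; only the openai/CLIP calls (`clip.tokenize`, `encode_text`, `encode_image`) are real APIs.

import torch
import clip  # the openai/CLIP package

def clip_guidance_step(avatar_params, render_avatar, prompt, clip_model, optimizer, device="cuda"):
    """One optimization step pulling the rendered avatar toward a text prompt in CLIP space.
    `avatar_params` are the learnable parameters held by `optimizer`."""
    text_tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():                                   # the text embedding stays fixed
        text_feat = clip_model.encode_text(text_tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    image = render_avatar(avatar_params)                    # [1, 3, 224, 224], differentiable
    img_feat = clip_model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    loss = 1.0 - (img_feat * text_feat).sum()               # 1 - cosine similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical setup (sketch): clip_model, _ = clip.load("ViT-B/32", device="cuda")
# clip_model = clip_model.float().eval(); clip_model.requires_grad_(False)

Diffusion-guided construction (e.g., the DreamFusion-style score distillation used by DreamAvatar and DreamWaltz) replaces this CLIP loss with gradients obtained from a pretrained text-to-image diffusion model.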
虚拟数字人建模；多模态角色动画；多模态生成与编辑；神经渲染；生成模型；神经隐式表示
virtual human modeling; multimodal character animation; multimodal generation and editing; neural rendering; generative models; neural implicit representation
Ahn H, Ha T, Choi Y, Yoo H and Oh S. 2018. Text2Action: generative adversarial synthesis from language to action//Proceedings of 2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, Australia: IEEE: 5915-5920 [DOI: 10.1109/ICRA.2018.8460608]
Ahuja C and Morency L P. 2019. Language2Pose: natural language grounded pose forecasting//Proceedings of 2019 International Conference on 3D Vision (3DV). Quebec City, Canada: IEEE: 719-728 [DOI: 10.1109/3DV.2019.00084]
Ao T L, Gao Q Z, Lou Y K, Chen B Q and Liu L B. 2022. Rhythmic gesticulator: rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings. ACM Transactions on Graphics, 41(6): #209 [DOI: 10.1145/3550454.3555435]
Ao T L, Zhang Z Y and Liu L B. 2023. GestureDiffuCLIP: gesture diffusion model with CLIP latents. ACM Transactions on Graphics, 42(4) [DOI: 10.1145/3592097]
Athanasiou N, Petrovich M, Black M J and Varol G. 2022. TEACH: temporal action composition for 3D humans//Proceedings of 2022 International Conference on 3D Vision (3DV). Prague, Czech Republic: IEEE: 414-423 [DOI: 10.1109/3DV57658.2022.00053]
Blanz V and Vetter T. 1999. A morphable model for the synthesis of 3D faces//Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. Los Angeles, USA: ACM: 187-194 [DOI: 10.1145/311535.311556]
Brooks T, Holynski A and Efros A A. 2023. InstructPix2Pix: learning to follow image editing instructions//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 18392-18402 [DOI: 10.1109/CVPR52729.2023.01764]
Cao C, Weng Y L, Zhou S, Tong Y Y and Zhou K. 2014. FaceWarehouse: a 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3): 413-425 [DOI: 10.1109/TVCG.2013.249]
Cao S H, Liu X H, Mao X Q and Zou Q. 2022. A review of human face forgery and forgery-detection technologies. Journal of Image and Graphics, 27(4): 1023-1038
曹申豪, 刘晓辉, 毛秀青, 邹勤. 2022. 人脸伪造及检测技术综述. 中国图象图形学报, 27(4): 1023-1038 [DOI: 10.11834/jig.200466]
Cao Y K, Cao Y P, Han K, Shan Y and Wong K Y K. 2023. DreamAvatar: text-and-shape guided 3D human avatar generation via diffusion models [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2304.00916.pdf
Chan E R, Lin C Z, Chan M A, Nagano K, Pan B X, de Mello S, Gallo O, Guibas L, Tremblay J, Khamis S, Karras T and Wetzstein G. 2022. Efficient geometry-aware 3D generative adversarial networks//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 16102-16112 [DOI: 10.1109/CVPR52688.2022.01565]
Chen L L, Maddox R K, Duan Z Y and Xu C L. 2019. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 7824-7833 [DOI: 10.1109/cvpr.2019.00802]
Chu B, Romdhani S and Chen L M. 2014. 3D-aided face recognition robust to expression and pose variations//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 1907-1914 [DOI: 10.1109/CVPR.2014.245]
Chung J S and Zisserman A. 2016. Out of time: automated lip sync in the wild//Proceedings of ACCV 2016 International Workshops on Computer Vision. Taipei, China: Springer: 251-263 [DOI: 10.1007/978-3-319-54427-4_19]
Cudeiro D, Bolkart T, Laidlaw C, Ranjan A and Black M J. 2019. Capture, learning, and synthesis of 3D speaking styles//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 10093-10103 [DOI: 10.1109/CVPR.2019.01034]
Dabral R, Mughal M H, Golyanik V and Theobalt C. 2023. MoFusion: a framework for denoising-diffusion-based motion synthesis//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 9760-9770 [DOI: 10.1109/CVPR52729.2023.00941]
Dhariwal P, Jun H, Payne C, Kim J W, Radford A and Sutskever I. 2020. Jukebox: a generative model for music [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2005.00341.pdf
Edwards P, Landreth C, Fiume E and Singh K. 2016. JALI: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics, 35(4): #127 [DOI: 10.1145/2897824.2925984]
Fan R K, Xu S H and Geng W D. 2012. Example-based automatic music-driven conventional dance motion synthesis. IEEE Transactions on Visualization and Computer Graphics, 18(3): 501-515 [DOI: 10.1109/TVCG.2011.73]
Fan Y R, Lin Z J, Saito J, Wang W P and Komura T. 2021. FaceFormer: speech-driven 3D facial animation with Transformers//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 18749-18758 [DOI: 10.1109/CVPR52688.2022.01821]
Fisher C G. 1968. Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 11(4): 796-804 [DOI: 10.1044/jshr.1104.796]
Fukayama S and Goto M. 2015. Music content driven automated choreography with beat-wise motion connectivity constraints//Proceedings of the 12th Sound and Music Computing Conference. Maynooth, Ireland: [s.n.]: 177-183
Gafni G, Thies J, Zollhöfer M and Nießner M. 2021. Dynamic neural radiance fields for monocular 4D facial avatar reconstruction//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 8645-8654 [DOI: 10.1109/CVPR46437.2021.00854]
Gal R, Patashnik O, Maron H, Chechik G and Cohen-Or D. 2021. StyleGAN-NADA: CLIP-guided domain adaptation of image generators [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2108.00946.pdf
Gao X, Zhong C L, Xiang J, Hong Y, Guo Y D and Zhang J Y. 2022. Reconstructing personalized semantic facial NeRF models from monocular video. ACM Transactions on Graphics, 41(6): #200 [DOI: 10.1145/3550454.3555501]
Ghosh A, Cheema N, Oguz C, Theobalt C and Slusallek P. 2021. Synthesis of compositional animations from textual descriptions//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 1376-1386 [DOI: 10.1109/ICCV48922.2021.00143]
Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2014. Generative adversarial nets//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 2672-2680
Guo C, Zou S H, Zuo X X, Wang S, Ji W, Li X Y and Cheng L. 2022a. Generating diverse and natural 3D human motions from text//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 5142-5151 [DOI: 10.1109/CVPR52688.2022.00509]
Guo C, Zuo X X, Wang S and Cheng L. 2022b. TM2T: stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 580-597 [DOI: 10.1007/978-3-031-19833-5_34]
Guo Y D, Cai L and Zhang J Y. 2021a. 3D face from X: learning face shape from diverse sources. IEEE Transactions on Image Processing, 30: 3815-3827 [DOI: 10.1109/TIP.2021.3065798]
Guo Y D, Chen K Y, Liang S, Liu Y J, Bao H J and Zhang J Y. 2021b. AD-NeRF: audio driven neural radiance fields for talking head synthesis//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 5764-5774 [DOI: 10.1109/ICCV48922.2021.00573]
Hannun A, Case C, Casper J, Catanzaro B, Diamos G, Elsen E, Prenger R, Satheesh S, Sengupta S, Coates A and Ng A Y. 2014. Deep speech: scaling up end-to-end speech recognition [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/1412.5567.pdf
Haque A, Tancik M, Efros A A, Holynski A and Kanazawa A. 2023. Instruct-NeRF2NeRF: editing 3D scenes with instructions//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 19683-19693 [DOI: 10.1109/ICCV51070.2023.01808]
Ho J, Jain A and Abbeel P. 2020. Denoising diffusion probabilistic models//Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: #574
Hong F Z, Zhang M Y, Pan L, Cai Z A, Yang L and Liu Z W. 2022a. AvatarCLIP: zero-shot text-driven generation and animation of 3D avatars. ACM Transactions on Graphics, 41(4): #161 [DOI: 10.1145/3528223.3530094]
Hong Y, Peng B, Xiao H Y, Liu L G and Zhang J Y. 2022b. HeadNeRF: a realtime NeRF-based parametric head model//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 20342-20352 [DOI: 10.1109/CVPR52688.2022.01973]
Huang X and Belongie S. 2017. Arbitrary style transfer in real-time with adaptive instance normalization//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 1510-1519 [DOI: 10.1109/iccv.2017.167]
Huang Y K, Wang J N, Zeng A L, Cao H, Qi X B, Shi Y K, Zha Z J and Zhang L. 2023. DreamWaltz: make a scene with complex 3D animatable avatars//Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, USA: NeurIPS
Isola P, Zhu J Y, Zhou T H and Efros A A. 2017. Image-to-image translation with conditional adversarial networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 5967-5976 [DOI: 10.1109/CVPR.2017.632]
Jamaludin A, Chung J S and Zisserman A. 2019. You said that?: synthesising talking faces from audio. International Journal of Computer Vision, 127(11): 1767-1779 [DOI: 10.1007/s11263-019-01150-y]
Ji X Y, Zhou H, Wang K S Y, Wu W, Loy C C, Cao X and Xu F. 2021. Audio-driven emotional video portraits//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 14075-14084 [DOI: 10.1109/CVPR46437.2021.01386]
Jiang B, Chen X, Liu W, Yu J Y, Yu G and Chen T. 2023. MotionGPT: human motion as a foreign language//Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, USA: NeurIPS
Kang C, Tan Z P, Lei J, Zhang S H, Guo Y C, Zhang W D and Hu S M. 2021. ChoreoMaster: choreography-oriented music-driven dance synthesis. ACM Transactions on Graphics, 40(4): #145 [DOI: 10.1145/3450626.3459932]
Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J and Aila T. 2020. Analyzing and improving the image quality of StyleGAN//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 8107-8116 [DOI: 10.1109/CVPR42600.2020.00813]
Kerbl B, Kopanas G, Leimkuehler T and Drettakis G. 2023. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4): #139 [DOI: 10.1145/3592433]
Kim G and Chun S Y. 2023. DATID-3D: diversity-preserved domain adaptation using text-to-image diffusion for 3D generative model//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 14203-14213 [DOI: 10.1109/CVPR52729.2023.01365]
Kim J, Kim J and Choi S. 2023. FLAME: free-form language-based motion synthesis & editing//Proceedings of the 37th AAAI Conference on Artificial Intelligence. Washington, USA: AAAI: 8255-8263 [DOI: 10.1609/aaai.v37i7.25996]
Kwon G and Ye J C. 2022. CLIPstyler: image style transfer with a single text condition//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 18041-18050 [DOI: 10.1109/CVPR52688.2022.01753]
Li R L, Bladin K, Zhao Y J, Chinara C, Ingraham O, Xiang P D, Ren X L, Prasad P, Kishore B, Xing J and Li H. 2020. Learning formation of physically-based face attributes//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 3407-3416 [DOI: 10.1109/CVPR42600.2020.00347]
Li R L, Yang S, Ross D A and Kanazawa A. 2021b. AI choreographer: music conditioned 3D dance generation with AIST++//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 13381-13392 [DOI: 10.1109/ICCV48922.2021.01315]
Li S Y, Yu W J, Gu T P, Lin C Z, Wang Q, Qian C, Loy C C and Liu Z W. 2022. Bailando: 3D dance generation by actor-critic GPT with choreographic memory//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 11040-11049 [DOI: 10.1109/CVPR52688.2022.01077]
Li T Y, Bolkart T, Black M J, Li H and Romero J. 2017. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, 36(6): #194 [DOI: 10.1145/3130800.3130813]
Liao Y H, Qian W H and Cao J D. 2023. MStarGAN: a face style transfer network with changeable style intensity. Journal of Image and Graphics, 28(12): 3784-3796
廖远鸿, 钱文华, 曹进德. 2023. 风格强度可变的人脸风格迁移网络. 中国图象图形学报, 28(12): 3784-3796 [DOI: 10.11834/jig.221149]
Liu X, Xu Y H, Wu Q Y, Zhou H, Wu W and Zhou B L. 2022. Semantic-aware implicit neural audio-driven video portrait generation//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 106-125 [DOI: 10.1007/978-3-031-19836-6_7]
Medsker L R and Jain L. 2001. Recurrent neural networks. Design and Applications, 5: 64-67
Mildenhall B, Srinivasan P P, Tancik M, Barron J T, Ramamoorthi R and Ng R. 2020. NeRF: representing scenes as neural radiance fields for view synthesis//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 405-421 [DOI: 10.1007/978-3-030-58452-8_24]
Ofli F, Erzin E, Yemez Y and Tekalp A M. 2011. Learn2Dance: learning statistical music-to-dance mappings for choreography synthesis. IEEE Transactions on Multimedia, 14(3): 747-759 [DOI: 10.1109/TMM.2011.2181492]
Patashnik O, Wu Z Z, Shechtman E, Cohen-Or D and Lischinski D. 2021. StyleCLIP: text-driven manipulation of StyleGAN imagery//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 2065-2074 [DOI: 10.1109/ICCV48922.2021.00209]
Petrovich M, Black M J and Varol G. 2021. Action-conditioned 3D human motion synthesis with Transformer VAE//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 10965-10975 [DOI: 10.1109/ICCV48922.2021.01080]
Petrovich M, Black M J and Varol G. 2022. TEMOS: generating diverse human motions from textual descriptions//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 480-497 [DOI: 10.1007/978-3-031-20047-2_28]
Plappert M, Mandery C and Asfour T. 2018. Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks. Robotics and Autonomous Systems, 109: 13-26 [DOI: 10.1016/j.robot.2018.07.006]
Poole B, Jain A, Barron J T and Mildenhall B. 2022. DreamFusion: text-to-3D using 2D diffusion//Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: ICLR
Prajwal K R, Mukhopadhyay R, Namboodiri V P and Jawahar C V. 2020. A lip sync expert is all you need for speech to lip generation in the wild//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 484-492 [DOI: 10.1145/3394171.3413532]
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G and Sutskever I. 2021. Learning transferable visual models from natural language supervision//Proceedings of the 38th International Conference on Machine Learning. [s.l.]: PMLR: 8748-8763
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y Q, Li W and Liu P J. 2020. Exploring the limits of transfer learning with a unified text-to-text Transformer. The Journal of Machine Learning Research, 21(1): #140
Ramesh A, Dhariwal P, Nichol A, Chu C and Chen M. 2022. Hierarchical text-conditional image generation with CLIP latents [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2204.06125.pdf
Ranjan A, Bolkart T, Sanyal S and Black M J. 2018. Generating 3D faces using convolutional mesh autoencoders//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 725-741 [DOI: 10.1007/978-3-030-01219-9_43]
Richard A, Zollhöfer M, Wen Y D, de la Torre F and Sheikh Y. 2021. MeshTalk: 3D face animation from speech using cross-modality disentanglement//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 1153-1162 [DOI: 10.1109/ICCV48922.2021.00121]
Rombach R, Blattmann A, Lorenz D, Esser P and Ommer B. 2021. High-resolution image synthesis with latent diffusion models//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 10674-10685 [DOI: 10.1109/CVPR52688.2022.01042]
Saharia C, Chan W, Saxena S, Li L L, Whang J, Denton E L, Ghasemipour K, Gontijo Lopes R, Karagol Ayan B, Salimans T, Ho J, Fleet D J and Norouzi M. 2022. Photorealistic text-to-image diffusion models with deep language understanding//Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: #2643
Sanh V, Debut L, Chaumond J and Wolf T. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/1910.01108.pdf
Shen S, Li W H, Zhu Z, Duan Y Q, Zhou J and Lu J W. 2022. Learning dynamic facial radiance fields for few-shot talking head synthesis//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 666-682 [DOI: 10.1007/978-3-031-19775-8_39]
Shen S, Zhao W L, Meng Z B, Li W H, Zhu Z, Zhou J and Lu J W. 2023. DiffTalk: crafting diffusion models for generalized audio-driven portraits animation//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 1982-1991 [DOI: 10.1109/CVPR52729.2023.00197]
Sun J X, Wang X, Wang L Z, Li X Y, Zhang Y, Zhang H W and Liu Y B. 2023. Next3D: generative neural texture rasterization for 3D-aware head avatars//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 20991-21002 [DOI: 10.1109/CVPR52729.2023.02011]
Sutskever I, Vinyals O and Le Q V. 2014. Sequence to sequence learning with neural networks//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 3104-3112
Tang J X, Wang K S Y, Zhou H, Chen X K, He D L, Hu T S, Liu J T, Zeng G and Wang J D. 2022. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2211.12368.pdf
Tevet G, Gordon B, Hertz A, Bermano A H and Cohen-Or D. 2022. MotionCLIP: exposing human motion generation to CLIP space//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 358-374 [DOI: 10.1007/978-3-031-20047-2_21]
Tevet G, Raab S, Gordon B, Shafir Y, Cohen-or D and Bermano A H. 2023. Human motion diffusion model//Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: ICLR
Thies J, Elgharib M, Tewari A, Theobalt C and Nießner M. 2020. Neural voice puppetry: audio-driven facial reenactment//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 716-731 [DOI: 10.1007/978-3-030-58517-4_42]
Tran L and Liu X M. 2018. Nonlinear 3D face morphable model//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7346-7355 [DOI: 10.1109/CVPR.2018.00767]
Tseng J, Castellon R and Liu C K. 2023. EDGE: editable dance generation from music//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 448-458 [DOI: 10.1109/CVPR52729.2023.00051]
van den Oord A, Kalchbrenner N, Vinyals O, Espeholt L, Graves A and Kavukcuoglu K. 2016. Conditional image generation with PixelCNN decoders//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc.: 4797-4805
van den Oord A, Vinyals O and Kavukcuoglu K. 2017. Neural discrete representation learning//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6309-6318
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang P, Liu L J, Liu Y, Theobalt C, Komura T and Wang W P. 2021a. NeuS: learning neural implicit surfaces by volume rendering for multi-view reconstruction//Proceedings of the 35th International Conference on Neural Information Processing Systems. Vancouver, Canada: NeurIPS: 27171-27183
Wang S Z, Li L C, Ding Y, Fan C J and Yu X. 2021b. Audio2Head: audio-driven one-shot talking-head generation with natural head motion//Proceedings of the 30th International Joint Conference on Artificial Intelligence. Montreal, Canada: IJCAI: 1098-1105 [DOI: 10.24963/ijcai.2021/152]
Wang T F, Zhang B, Zhang T, Gu S Y, Bao J M, Baltrusaitis T, Shen J J, Chen D, Wen F, Chen Q F and Guo B M. 2023. RODIN: a generative model for sculpting 3D digital avatars using diffusion//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 4563-4573 [DOI: 10.1109/CVPR52729.2023.00443]
Wang Y B, Zhang K, Kong Y H, Yu T T and Zhao S W. 2023. Overview of human-facial-related age synthesis based generative adversarial network methods. Journal of Image and Graphics, 28(10): 3004-3024
王艺博, 张珂, 孔英会, 于婷婷, 赵士玮. 2023. 人脸年龄合成的生成对抗网络方法综述. 中国图象图形学报, 28(10): 3004-3024 [DOI: 10.11834/jig.220842]
Xing J B, Xia M H, Zhang Y C, Cun X D, Wang J and Wong T T. 2023. CodeTalker: speech-driven 3D facial animation with discrete motion prior//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 12780-12790 [DOI: 10.1109/CVPR52729.2023.01229]
Ye Z H, Jiang Z Y, Ren Y, Liu J L, He J Z and Zhao Z. 2023. GeneFace: generalized and high-fidelity audio-driven 3D talking face synthesis//Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: ICLR
Ye Z J, Wu H Z, Jia J, Bu Y H, Chen W, Meng F B and Wang Y F. 2020. ChoreoNet: towards music to dance synthesis with choreographic action unit//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 744-752 [DOI: 10.1145/3394171.3414005]
Yenamandra T, Tewari A, Bernard F, Seidel H P, Elgharib M, Cremers D and Theobalt C. 2021. i3DMM: deep implicit 3D morphable model of human heads//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 12798-12808 [DOI: 10.1109/CVPR46437.2021.01261]
Yi H W, Liang H L, Liu Y F, Cao Q, Wen Y D, Bolkart T, Tao D C and Black M J. 2023. Generating holistic 3D human motion from speech//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 469-480 [DOI: 10.1109/CVPR52729.2023.00053]
Zakharov E, Shysheya A, Burkov E and Lempitsky V. 2019. Few-shot adversarial learning of realistic neural talking head models//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9458-9467 [DOI: 10.1109/ICCV.2019.00955]
Zeng Y F, Lu Y X, Ji X Y, Yao Y, Zhu H and Cao X. 2023. AvatarBooth: high-quality and customizable 3D human avatar generation [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2306.09864.pdf
Zhang J R, Zhang Y S, Cun X D, Huang S L, Zhang Y, Zhao H W, Lu H T and Shen X. 2023a. T2M-GPT: generating human motion from textual descriptions with discrete representations [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2301.06052.pdf
Zhang L W, Qiu Q W, Lin H Y, Zhang Q X, Shi C, Yang W, Shi Y, Yang S B, Xu L and Yu J Y. 2023b. DreamFace: progressive generation of animatable 3D faces under text guidance. ACM Transactions on Graphics, 42(4): #138 [DOI: 10.1145/3592094]
Zhang M Y, Cai Z G, Pan L, Hong F Z, Guo X Y, Yang L and Liu Z W. 2022. MotionDiffuse: text-driven human motion generation with diffusion model [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2208.15001.pdf
Zheng Y F, Abrevaya V F, Bühler M C, Chen X, Black M J and Hilliges O. 2022. I M avatar: implicit morphable head avatars from videos//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 13535-13545 [DOI: 10.1109/CVPR52688.2022.01318]
Zhou H, Sun Y S, Wu W, Loy C C, Wang X G and Liu Z W. 2021. Pose-controllable talking face generation by implicitly modularized audio-visual representation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 4174-4184 [DOI: 10.1109/CVPR46437.2021.00416]
Zhou Y, Han X T, Shechtman E, Echevarria J, Kalogerakis E and Li D Z Y. 2020. MakeItTalk: speaker-aware talking-head animation. ACM Transactions on Graphics, 39(6): #221 [DOI: 10.1145/3414685.3417774]
Zhou Z W, Wang Z, Yao S Y, Yan Y C, Yang C, Zhai G T, Yan J C and Yang X K. 2022. DialogueNeRF: towards realistic avatar face-to-face conversation video generation [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2203.07931v1.pdf