多模态数字人建模、合成与驱动综述
Multi-modal digital human modeling, synthesis, and driving: a survey
2024, Vol. 29, No. 9, Pages: 2494-2512
Received: 2023-09-12
Revised: 2024-03-15
Published in print: 2024-09-16
DOI: 10.11834/jig.230649
A multimodal digital human refers to a digital avatar that can perform multimodal cognition and interaction and that thinks and behaves like a human being. Substantial progress has been made in related technologies owing to the cross-fertilization and vibrant development of fields such as computer vision and natural language processing. This article discusses three major themes in computer graphics and computer vision: multimodal head animation, multimodal body animation, and multimodal portrait creation, and introduces the methodologies and representative works in each. Under the theme of multimodal head animation, we present research on speech-driven and expression-driven head models. Under the theme of multimodal body animation, we explore recurrent neural network (RNN)-, Transformer-, and denoising diffusion probabilistic model (DDPM)-based body animation generation. The discussion of multimodal portrait creation covers portrait creation guided by visual-linguistic similarity, portrait creation guided by multimodal denoising diffusion models, and three-dimensional (3D) multimodal generative models of digital portraits. Finally, this article provides an overview and classification of representative works in these research directions, summarizes existing methods, and points out potential future research directions.
This article delves into key directions in the field of multimodal digital humans, covering multimodal head animation, multimodal body animation, and the construction of multimodal digital human representations. In the realm of multimodal head animation, we extensively explore two major tasks: expression-driven and speech-driven animation. Expression-driven head animation builds on explicit and implicit parameterized models, which use mesh surfaces and neural radiance fields (NeRF), respectively, to improve rendering quality. Explicit models employ 3D morphable and linear models but encounter challenges such as limited expressive capacity, nondifferentiable rendering, and difficulty in modeling personalized features. By contrast, implicit models, especially those based on NeRF, demonstrate superior expressive capacity and realism.

In the domain of speech-driven head animation, we review 2D and 3D methods, with a particular focus on the advantages of NeRF in enhancing realism. 2D speech-driven head video generation relies on techniques such as generative adversarial networks and image transfer but depends on 3D prior knowledge and structural characteristics. By contrast, NeRF-based methods, such as audio driven NeRF for talking head synthesis (AD-NeRF) and semantic-aware implicit neural audio-driven video portrait generation (SSP-NeRF), achieve end-to-end training with differentiable NeRF, which substantially improves rendering realism, although slow training and inference speeds remain a challenge.
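To make the contrast concrete, the following recalls the standard volume rendering formulation of NeRF (Mildenhall et al., 2020), the implicit representation behind AD-NeRF and SSP-NeRF; this is a background sketch rather than any single method's exact formulation. A camera ray $\mathbf{r}(t)=\mathbf{o}+t\mathbf{d}$ is rendered as

$$C(\mathbf{r})=\int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,\mathrm{d}t,\qquad T(t)=\exp\!\left(-\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,\mathrm{d}s\right)$$

where $\sigma$ is the volume density, $\mathbf{c}$ is the view-dependent color predicted by a multilayer perceptron, and $T(t)$ is the accumulated transmittance. Audio-driven variants additionally condition $\sigma$ and $\mathbf{c}$ on a driving feature (e.g., an audio embedding); because the rendering integral is differentiable, the whole pipeline can be trained end to end from video.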
Multimodal body animation covers speech-driven body animation, music-driven dance, and text-driven body animation. We emphasize the importance of learning speech semantics and melody and discuss the applications of RNNs, Transformers, and denoising diffusion models in this field. Transformers have gradually replaced RNNs as the mainstream model, gaining notable advantages in sequence-signal learning through attention mechanisms. We also highlight body animation generation based on denoising diffusion models, such as free-form language-based motion synthesis and editing (FLAME), the motion diffusion model (MDM), and text-driven human motion generation with a diffusion model (MotionDiffuse), as well as multimodal denoising networks conditioned on music and text.

In the realm of constructing multimodal digital human representations, the article emphasizes avatar construction guided by visual-language similarity and by denoising diffusion models. In addition, the demand for large-scale, diverse datasets in digital human representation construction is addressed to foster powerful and universal generative models.

In summary, the three key aspects of multimodal digital humans are systematically explored: head animation, body animation, and digital human representation construction. Explicit head models, although simple, editable, and computationally efficient, lack expressive capacity and face challenges in rendering, especially in modeling facial personalization and nonfacial regions. By contrast, implicit models, especially those using NeRF, demonstrate stronger modeling capabilities and more realistic rendering. In speech-driven animation, NeRF-based solutions overcome the limitations of 2D talking-head and 3D digital head animation and achieve more natural and realistic talking videos. Regarding body animation models, Transformers have gradually replaced RNNs, whereas denoising diffusion models can potentially address the mapping challenges in multimodal body animation. Finally, digital human representation construction remains difficult: guidance by visual-language similarity and by denoising diffusion models shows promising results, but directly constructing 3D multimodal virtual humans is hindered by the lack of sufficiently large 3D virtual human datasets. This study comprehensively analyzes these issues and provides clear directions and challenges for future research.

In conclusion, future work should focus on several developments in multimodal digital humans. Key directions include improving the accuracy of 3D modeling and real-time rendering, integrating speech-driven and facial expression synthesis, constructing large and diverse datasets, exploring multimodal information fusion and cross-modal learning, and addressing ethical and social impacts. Implicit representation methods, such as neural volume rendering, are crucial for improved 3D modeling. At the same time, the construction of larger datasets poses a formidable challenge in the development of robust and universal generative models. Multimodal information fusion and cross-modal learning allow models to learn from diverse data sources and exhibit a range of behaviors and expressions. Attention to ethical and social impacts, including digital identity and privacy, is also crucial. These research directions should guide the field toward a comprehensive, realistic, and universal future, with profound influence on interactions in virtual spaces.
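As a compact technical reference for the diffusion-based generators surveyed above (MDM, MotionDiffuse, MoFusion, and others), the following sketches the denoising diffusion framework of Ho et al. (2020) in its standard ε-prediction form; individual motion models differ in detail (MDM, for instance, predicts the clean sample directly). The forward process corrupts a motion sequence $\mathbf{x}_0$ with Gaussian noise, and a conditional network is trained to remove it:

$$q(\mathbf{x}_t\mid\mathbf{x}_0)=\mathcal{N}\!\left(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I}\right),\qquad \mathcal{L}=\mathbb{E}_{\mathbf{x}_0,t,\boldsymbol{\epsilon}}\left[\big\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t,c)\big\|_2^2\right]$$

where $\bar{\alpha}_t=\prod_{s=1}^{t}(1-\beta_s)$ for a noise schedule $\{\beta_s\}$, and the condition $c$ is a text, music, or speech embedding; sampling runs the learned reverse process from pure noise to a clean motion sequence. The visual-language similarity guidance discussed above is likewise commonly implemented as a CLIP loss of the form $\mathcal{L}_{\mathrm{CLIP}}=1-\cos\big(E_I(I),E_T(y)\big)$, where $E_I$ and $E_T$ are the frozen CLIP image and text encoders, $I$ is a rendered image of the avatar, and $y$ is the text prompt (Radford et al., 2021).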
Ahn H, Ha T, Choi Y, Yoo H and Oh S. 2018. Text2Action: generative adversarial synthesis from language to action // Proceedings of 2018 IEEE International Conference on Robotics and Automation (ICRA). Brisbane, Australia: IEEE: 5915-5920 [DOI: 10.1109/ICRA.2018.8460608]
Ahuja C and Morency L P. 2019. Language2Pose: natural language grounded pose forecasting // Proceedings of 2019 International Conference on 3D Vision (3DV). Quebec City, Canada: IEEE: 719-728 [DOI: 10.1109/3DV.2019.00084]
Ao T L, Gao Q Z, Lou Y K, Chen B Q and Liu L B. 2022. Rhythmic gesticulator: rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings. ACM Transactions on Graphics, 41(6): #209 [DOI: 10.1145/3550454.3555435]
Ao T L, Zhang Z Y and Liu L B. 2023. GestureDiffuCLIP: gesture diffusion model with CLIP latents. ACM Transactions on Graphics, 42(4) [DOI: 10.1145/3592097]
Athanasiou N, Petrovich M, Black M J and Varol G. 2022. TEACH: temporal action composition for 3D humans // Proceedings of 2022 International Conference on 3D Vision (3DV). Prague, Czech Republic: IEEE: 414-423 [DOI: 10.1109/3DV57658.2022.00053]
Blanz V and Vetter T. 1999. A morphable model for the synthesis of 3D faces // Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. Los Angeles, USA: ACM: 187-194 [DOI: 10.1145/311535.311556]
Brooks T, Holynski A and Efros A A. 2023. InstructPix2Pix: learning to follow image editing instructions // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 18392-18402 [DOI: 10.1109/CVPR52729.2023.01764]
Cao C, Weng Y L, Zhou S, Tong Y Y and Zhou K. 2014. FaceWarehouse: a 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3): 413-425 [DOI: 10.1109/TVCG.2013.249]
Cao S H, Liu X H, Mao X Q and Zou Q. 2022. A review of human face forgery and forgery-detection technologies. Journal of Image and Graphics, 27(4): 1023-1038 (曹申豪, 刘晓辉, 毛秀青, 邹勤. 2022. 人脸伪造及检测技术综述. 中国图象图形学报, 27(4): 1023-1038) [DOI: 10.11834/jig.200466]
Cao Y K, Cao Y P, Han K, Shan Y and Wong K Y K. 2023. DreamAvatar: text-and-shape guided 3D human avatar generation via diffusion models [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2304.00916.pdf
Chan E R, Lin C Z, Chan M A, Nagano K, Pan B X, de Mello S, Gallo O, Guibas L, Tremblay J, Khamis S, Karras T and Wetzstein G. 2022. Efficient geometry-aware 3D generative adversarial networks // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 16102-16112 [DOI: 10.1109/CVPR52688.2022.01565]
Chen L L, Maddox R K, Duan Z Y and Xu C L. 2019. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 7824-7833 [DOI: 10.1109/CVPR.2019.00802]
Chu B, Romdhani S and Chen L M. 2014. 3D-aided face recognition robust to expression and pose variations // Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 1907-1914 [DOI: 10.1109/CVPR.2014.245]
Chung J S and Zisserman A. 2016. Out of time: automated lip sync in the wild // Proceedings of ACCV 2016 International Workshops on Computer Vision. Taipei, China: Springer: 251-263 [DOI: 10.1007/978-3-319-54427-4_19]
Cudeiro D, Bolkart T, Laidlaw C, Ranjan A and Black M J. 2019. Capture, learning, and synthesis of 3D speaking styles // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 10093-10103 [DOI: 10.1109/CVPR.2019.01034]
Dabral R, Mughal M H, Golyanik V and Theobalt C. 2023. MoFusion: a framework for denoising-diffusion-based motion synthesis // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 9760-9770 [DOI: 10.1109/CVPR52729.2023.00941]
Dhariwal P, Jun H, Payne C, Kim J W, Radford A and Sutskever I. 2020. Jukebox: a generative model for music [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2005.00341.pdf
Edwards P, Landreth C, Fiume E and Singh K. 2016. JALI: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics, 35(4): #127 [DOI: 10.1145/2897824.2925984]
Fan R K, Xu S H and Geng W D. 2012. Example-based automatic music-driven conventional dance motion synthesis. IEEE Transactions on Visualization and Computer Graphics, 18(3): 501-515 [DOI: 10.1109/TVCG.2011.73]
Fan Y R, Lin Z J, Saito J, Wang W P and Komura T. 2021. FaceFormer: speech-driven 3D facial animation with Transformers // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 18749-18758 [DOI: 10.1109/CVPR52688.2022.01821]
Fisher C G. 1968. Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 11(4): 796-804 [DOI: 10.1044/jshr.1104.796]
Fukayama S and Goto M. 2015. Music content driven automated choreography with beat-wise motion connectivity constraints // Proceedings of the 12th Sound and Music Computing Conference. Maynooth, Ireland: [s.n.]: 177-183
Gafni G, Thies J, Zollhöfer M and Nießner M. 2021. Dynamic neural radiance fields for monocular 4D facial avatar reconstruction // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 8645-8654 [DOI: 10.1109/CVPR46437.2021.00854]
Gal R, Patashnik O, Maron H, Chechik G and Cohen-Or D. 2021. StyleGAN-NADA: CLIP-guided domain adaptation of image generators [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2108.00946.pdf
Gao X, Zhong C L, Xiang J, Hong Y, Guo Y D and Zhang J Y. 2022. Reconstructing personalized semantic facial NeRF models from monocular video. ACM Transactions on Graphics, 41(6): #200 [DOI: 10.1145/3550454.3555501]
Ghosh A, Cheema N, Oguz C, Theobalt C and Slusallek P. 2021. Synthesis of compositional animations from textual descriptions // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 1376-1386 [DOI: 10.1109/ICCV48922.2021.00143]
Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2014. Generative adversarial nets // Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 2672-2680
Guo C, Zou S H, Zuo X X, Wang S, Ji W, Li X Y and Cheng L. 2022a. Generating diverse and natural 3D human motions from text // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 5142-5151 [DOI: 10.1109/CVPR52688.2022.00509]
Guo C, Zuo X X, Wang S and Cheng L. 2022b. TM2T: stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts // Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 580-597 [DOI: 10.1007/978-3-031-19833-5_34]
Guo Y D, Cai L and Zhang J Y. 2021a. 3D face from X: learning face shape from diverse sources. IEEE Transactions on Image Processing, 30: 3815-3827 [DOI: 10.1109/TIP.2021.3065798]
Guo Y D, Chen K Y, Liang S, Liu Y J, Bao H J and Zhang J Y. 2021b. AD-NeRF: audio driven neural radiance fields for talking head synthesis // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 5764-5774 [DOI: 10.1109/ICCV48922.2021.00573]
Hannun A, Case C, Casper J, Catanzaro B, Diamos G, Elsen E, Prenger R, Satheesh S, Sengupta S, Coates A and Ng A Y. 2014. Deep speech: scaling up end-to-end speech recognition [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/1412.5567.pdf
Haque A, Tancik M, Efros A A, Holynski A and Kanazawa A. 2023. Instruct-NeRF2NeRF: editing 3D scenes with instructions // Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 19683-19693 [DOI: 10.1109/ICCV51070.2023.01808]
Ho J, Jain A and Abbeel P. 2020. Denoising diffusion probabilistic models // Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: #574
Hong F Z, Zhang M Y, Pan L, Cai Z A, Yang L and Liu Z W. 2022a. AvatarCLIP: zero-shot text-driven generation and animation of 3D avatars. ACM Transactions on Graphics, 41(4): #161 [DOI: 10.1145/3528223.3530094]
Hong Y, Peng B, Xiao H Y, Liu L G and Zhang J Y. 2022b. HeadNeRF: a real-time NeRF-based parametric head model // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 20342-20352 [DOI: 10.1109/CVPR52688.2022.01973]
Huang X and Belongie S. 2017. Arbitrary style transfer in real-time with adaptive instance normalization // Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 1510-1519 [DOI: 10.1109/ICCV.2017.167]
Huang Y K, Wang J N, Zeng A L, Cao H, Qi X B, Shi Y K, Zha Z J and Zhang L. 2023. DreamWaltz: make a scene with complex 3D animatable avatars // Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, USA: NeurIPS
Isola P, Zhu J Y, Zhou T H and Efros A A. 2017. Image-to-image translation with conditional adversarial networks // Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 5967-5976 [DOI: 10.1109/CVPR.2017.632]
Jamaludin A, Chung J S and Zisserman A. 2019. You said that?: synthesising talking faces from audio. International Journal of Computer Vision, 127(11): 1767-1779 [DOI: 10.1007/s11263-019-01150-y]
Ji X Y, Zhou H, Wang K S Y, Wu W, Loy C C, Cao X and Xu F. 2021. Audio-driven emotional video portraits // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 14075-14084 [DOI: 10.1109/CVPR46437.2021.01386]
Jiang B, Chen X, Liu W, Yu J Y, Yu G and Chen T. 2023. MotionGPT: human motion as a foreign language // Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, USA: NeurIPS
Kang C, Tan Z P, Lei J, Zhang S H, Guo Y C, Zhang W D and Hu S M. 2021. ChoreoMaster: choreography-oriented music-driven dance synthesis. ACM Transactions on Graphics, 40(4): #145 [DOI: 10.1145/3450626.3459932]
Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J and Aila T. 2020. Analyzing and improving the image quality of StyleGAN // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 8107-8116 [DOI: 10.1109/CVPR42600.2020.00813]
Kerbl B, Kopanas G, Leimkuehler T and Drettakis G. 2023. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4): #139 [DOI: 10.1145/3592433]
Kim G and Chun S Y. 2023. DATID-3D: diversity-preserved domain adaptation using text-to-image diffusion for 3D generative model // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 14203-14213 [DOI: 10.1109/CVPR52729.2023.01365]
Kim J, Kim J and Choi S. 2023. FLAME: free-form language-based motion synthesis & editing // Proceedings of the 37th AAAI Conference on Artificial Intelligence. Washington, USA: AAAI: 8255-8263 [DOI: 10.1609/aaai.v37i7.25996]
Kwon G and Ye J C. 2022. CLIPstyler: image style transfer with a single text condition // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 18041-18050 [DOI: 10.1109/CVPR52688.2022.01753]
Li R L, Bladin K, Zhao Y J, Chinara C, Ingraham O, Xiang P D, Ren X L, Prasad P, Kishore B, Xing J and Li H. 2020. Learning formation of physically-based face attributes // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 3407-3416 [DOI: 10.1109/CVPR42600.2020.00347]
Li R L, Yang S, Ross D A and Kanazawa A. 2021b. AI choreographer: music conditioned 3D dance generation with AIST++ // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 13381-13392 [DOI: 10.1109/ICCV48922.2021.01315]
Li S Y, Yu W J, Gu T P, Lin C Z, Wang Q, Qian C, Loy C C and Liu Z W. 2022. Bailando: 3D dance generation by actor-critic GPT with choreographic memory // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 11040-11049 [DOI: 10.1109/CVPR52688.2022.01077]
Li T Y, Bolkart T, Black M J, Li H and Romero J. 2017. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, 36(6): #194 [DOI: 10.1145/3130800.3130813]
Liao Y H, Qian W H and Cao J D. 2023. MStarGAN: a face style transfer network with changeable style intensity. Journal of Image and Graphics, 28(12): 3784-3796 (廖远鸿, 钱文华, 曹进德. 2023. 风格强度可变的人脸风格迁移网络. 中国图象图形学报, 28(12): 3784-3796) [DOI: 10.11834/jig.221149]
Liu X, Xu Y H, Wu Q Y, Zhou H, Wu W and Zhou B L. 2022. Semantic-aware implicit neural audio-driven video portrait generation // Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 106-125 [DOI: 10.1007/978-3-031-19836-6_7]
Medsker L R and Jain L C. 2001. Recurrent Neural Networks: Design and Applications. Boca Raton: CRC Press
Mildenhall B, Srinivasan P P, Tancik M, Barron J T, Ramamoorthi R and Ng R. 2020. NeRF: representing scenes as neural radiance fields for view synthesis // Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 405-421 [DOI: 10.1007/978-3-030-58452-8_24]
Ofli F, Erzin E, Yemez Y and Tekalp A M. 2011. Learn2Dance: learning statistical music-to-dance mappings for choreography synthesis. IEEE Transactions on Multimedia, 14(3): 747-759 [DOI: 10.1109/TMM.2011.2181492]
Patashnik O, Wu Z Z, Shechtman E, Cohen-Or D and Lischinski D. 2021. StyleCLIP: text-driven manipulation of StyleGAN imagery // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 2065-2074 [DOI: 10.1109/ICCV48922.2021.00209]
Petrovich M, Black M J and Varol G. 2021. Action-conditioned 3D human motion synthesis with Transformer VAE // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 10965-10975 [DOI: 10.1109/ICCV48922.2021.01080]
Petrovich M, Black M J and Varol G. 2022. TEMOS: generating diverse human motions from textual descriptions // Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 480-497 [DOI: 10.1007/978-3-031-20047-2_28]
Plappert M, Mandery C and Asfour T. 2018. Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks. Robotics and Autonomous Systems, 109: 13-26 [DOI: 10.1016/j.robot.2018.07.006]
Poole B, Jain A, Barron J T and Mildenhall B. 2022. DreamFusion: text-to-3D using 2D diffusion // Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: ICLR
Prajwal K R, Mukhopadhyay R, Namboodiri V P and Jawahar C V. 2020. A lip sync expert is all you need for speech to lip generation in the wild // Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 484-492 [DOI: 10.1145/3394171.3413532]
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G and Sutskever I. 2021. Learning transferable visual models from natural language supervision // Proceedings of the 38th International Conference on Machine Learning. [s.l.]: PMLR: 8748-8763
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y Q, Li W and Liu P J. 2020. Exploring the limits of transfer learning with a unified text-to-text Transformer. The Journal of Machine Learning Research, 21(1): #140
Ramesh A, Dhariwal P, Nichol A, Chu C and Chen M. 2022. Hierarchical text-conditional image generation with CLIP latents [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2204.06125.pdf
Ranjan A, Bolkart T, Sanyal S and Black M J. 2018. Generating 3D faces using convolutional mesh autoencoders // Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 725-741 [DOI: 10.1007/978-3-030-01219-9_43]
Richard A, Zollhöfer M, Wen Y D, de la Torre F and Sheikh Y. 2021. MeshTalk: 3D face animation from speech using cross-modality disentanglement // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 1153-1162 [DOI: 10.1109/ICCV48922.2021.00121]
Rombach R, Blattmann A, Lorenz D, Esser P and Ommer B. 2021. High-resolution image synthesis with latent diffusion models // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 10674-10685 [DOI: 10.1109/CVPR52688.2022.01042]
Saharia C, Chan W, Saxena S, Li L L, Whang J, Denton E L, Ghasemipour K, Gontijo Lopes R, Karagol Ayan B, Salimans T, Ho J, Fleet D J and Norouzi M. 2022. Photorealistic text-to-image diffusion models with deep language understanding // Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: #2643
Sanh V, Debut L, Chaumond J and Wolf T. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/1910.01108.pdf
Shen S, Li W H, Zhu Z, Duan Y Q, Zhou J and Lu J W. 2022. Learning dynamic facial radiance fields for few-shot talking head synthesis // Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 666-682 [DOI: 10.1007/978-3-031-19775-8_39]
Shen S, Zhao W L, Meng Z B, Li W H, Zhu Z, Zhou J and Lu J W. 2023. DiffTalk: crafting diffusion models for generalized audio-driven portraits animation // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 1982-1991 [DOI: 10.1109/CVPR52729.2023.00197]
Sun J X, Wang X, Wang L Z, Li X Y, Zhang Y, Zhang H W and Liu Y B. 2023. Next3D: generative neural texture rasterization for 3D-aware head avatars // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 20991-21002 [DOI: 10.1109/CVPR52729.2023.02011]
Sutskever I, Vinyals O and Le Q V. 2014. Sequence to sequence learning with neural networks // Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 3104-3112
Tang J X, Wang K S Y, Zhou H, Chen X K, He D L, Hu T S, Liu J T, Zeng G and Wang J D. 2022. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2211.12368.pdf
Tevet G, Gordon B, Hertz A, Bermano A H and Cohen-Or D. 2022. MotionCLIP: exposing human motion generation to CLIP space // Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 358-374 [DOI: 10.1007/978-3-031-20047-2_21]
Tevet G, Raab S, Gordon B, Shafir Y, Cohen-Or D and Bermano A H. 2023. Human motion diffusion model // Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: ICLR
Thies J, Elgharib M, Tewari A, Theobalt C and Nießner M. 2020. Neural voice puppetry: audio-driven facial reenactment // Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 716-731 [DOI: 10.1007/978-3-030-58517-4_42]
Tran L and Liu X M. 2018. Nonlinear 3D face morphable model // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7346-7355 [DOI: 10.1109/CVPR.2018.00767]
Tseng J, Castellon R and Liu C K. 2023. EDGE: editable dance generation from music // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 448-458 [DOI: 10.1109/CVPR52729.2023.00051]
van den Oord A, Kalchbrenner N, Vinyals O, Espeholt L, Graves A and Kavukcuoglu K. 2016. Conditional image generation with PixelCNN decoders // Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc.: 4797-4805
van den Oord A, Vinyals O and Kavukcuoglu K. 2017. Neural discrete representation learning // Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6309-6318
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need // Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang P, Liu L J, Liu Y, Theobalt C, Komura T and Wang W P. 2021a. NeuS: learning neural implicit surfaces by volume rendering for multi-view reconstruction // Proceedings of the 35th International Conference on Neural Information Processing Systems. Vancouver, Canada: NeurIPS: 27171-27183
Wang S Z, Li L C, Ding Y, Fan C J and Yu X. 2021b. Audio2Head: audio-driven one-shot talking-head generation with natural head motion // Proceedings of the 30th International Joint Conference on Artificial Intelligence. Montreal, Canada: IJCAI: 1098-1105 [DOI: 10.24963/ijcai.2021/152]
Wang T F, Zhang B, Zhang T, Gu S Y, Bao J M, Baltrusaitis T, Shen J J, Chen D, Wen F, Chen Q F and Guo B M. 2023. RODIN: a generative model for sculpting 3D digital avatars using diffusion // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 4563-4573 [DOI: 10.1109/CVPR52729.2023.00443]
Wang Y B, Zhang K, Kong Y H, Yu T T and Zhao S W. 2023. Overview of human-facial-related age synthesis based on generative adversarial network methods. Journal of Image and Graphics, 28(10): 3004-3024 (王艺博, 张珂, 孔英会, 于婷婷, 赵士玮. 2023. 人脸年龄合成的生成对抗网络方法综述. 中国图象图形学报, 28(10): 3004-3024) [DOI: 10.11834/jig.220842]
Xing J B, Xia M H, Zhang Y C, Cun X D, Wang J and Wong T T. 2023. CodeTalker: speech-driven 3D facial animation with discrete motion prior // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 12780-12790 [DOI: 10.1109/CVPR52729.2023.01229]
Ye Z H, Jiang Z Y, Ren Y, Liu J L, He J Z and Zhao Z. 2023. GeneFace: generalized and high-fidelity audio-driven 3D talking face synthesis // Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: ICLR
Ye Z J, Wu H Z, Jia J, Bu Y H, Chen W, Meng F B and Wang Y F. 2020. ChoreoNet: towards music to dance synthesis with choreographic action unit // Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 744-752 [DOI: 10.1145/3394171.3414005]
Yenamandra T, Tewari A, Bernard F, Seidel H P, Elgharib M, Cremers D and Theobalt C. 2021. i3DMM: deep implicit 3D morphable model of human heads // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 12798-12808 [DOI: 10.1109/CVPR46437.2021.01261]
Yi H W, Liang H L, Liu Y F, Cao Q, Wen Y D, Bolkart T, Tao D C and Black M J. 2023. Generating holistic 3D human motion from speech // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 469-480 [DOI: 10.1109/CVPR52729.2023.00053]
Zakharov E, Shysheya A, Burkov E and Lempitsky V. 2019. Few-shot adversarial learning of realistic neural talking head models // Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9458-9467 [DOI: 10.1109/ICCV.2019.00955]
Zeng Y F, Lu Y X, Ji X Y, Yao Y, Zhu H and Cao X. 2023. AvatarBooth: high-quality and customizable 3D human avatar generation [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2306.09864.pdf
Zhang J R, Zhang Y S, Cun X D, Huang S L, Zhang Y, Zhao H W, Lu H T and Shen X. 2023a. T2M-GPT: generating human motion from textual descriptions with discrete representations [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2301.06052.pdf
Zhang L W, Qiu Q W, Lin H Y, Zhang Q X, Shi C, Yang W, Shi Y, Yang S B, Xu L and Yu J Y. 2023b. DreamFace: progressive generation of animatable 3D faces under text guidance. ACM Transactions on Graphics, 42(4): #138 [DOI: 10.1145/3592094]
Zhang M Y, Cai Z G, Pan L, Hong F Z, Guo X Y, Yang L and Liu Z W. 2022. MotionDiffuse: text-driven human motion generation with diffusion model [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2208.15001.pdf
Zheng Y F, Abrevaya V F, Bühler M C, Chen X, Black M J and Hilliges O. 2022. I M Avatar: implicit morphable head avatars from videos // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 13535-13545 [DOI: 10.1109/CVPR52688.2022.01318]
Zhou H, Sun Y S, Wu W, Loy C C, Wang X G and Liu Z W. 2021. Pose-controllable talking face generation by implicitly modularized audio-visual representation // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 4174-4184 [DOI: 10.1109/CVPR46437.2021.00416]
Zhou Y, Han X T, Shechtman E, Echevarria J, Kalogerakis E and Li D Z Y. 2020. MakeItTalk: speaker-aware talking-head animation. ACM Transactions on Graphics, 39(6): #221 [DOI: 10.1145/3414685.3417774]
Zhou Z W, Wang Z, Yao S Y, Yan Y C, Yang C, Zhai G T, Yan J C and Yang X K. 2022. DialogueNeRF: towards realistic avatar face-to-face conversation video generation [EB/OL]. [2023-08-20]. https://arxiv.org/pdf/2203.07931v1.pdf