2024年度三维视觉前沿趋势与十大进展
Research Trends and Top Ten Advances of 3D Vision in 2024
2025, pp. 1-21
Received: 2025-02-14; Revised: 2025-03-04; Accepted: 2025-03-24; Published online: 2025-03-26
DOI: 10.11834/jig.250057
As an interdisciplinary field integrating computer vision, graphics, artificial intelligence, and optical imaging, 3D vision serves as a foundational cornerstone for building embodied general intelligence and the metaverse. In 2024, differentiable representation technologies, exemplified by NeRF and Gaussian Splatting, continued to evolve and mature, gradually transcending the boundaries of traditional 3D reconstruction: significant gains in accuracy were achieved across scales, from microscopic cellular structures to macroscopic celestial bodies, and from static scenes to dynamic human bodies. Propelled by generative artificial intelligence and the scaling laws of large models, 3D vision witnessed a paradigm shift from per-scene optimization to generalizable feedforward generation, with important breakthroughs in controllable digital content generation. Embodied intelligence remained a focal point, with researchers increasingly recognizing that the capture and generation of 3D virtual simulation data and 3D human motion data are central to training embodied agents. As world models and spatial intelligence became hot topics in the technology community, modeling the physical world, understanding spatial relationships, and predicting future states emerged as crucial research directions, all of which rely on the support of 3D vision. Furthermore, innovations in computational imaging overcame the physical limitations and performance bottlenecks of traditional 3D reconstruction through non-traditional visual sensors and novel reconstruction algorithms. Together, these breakthroughs are propelling 3D vision into a new stage of intelligent, large-scale learning spanning the full pipeline of perception, modeling, generation, and interaction.
Specifically, the development trends in 3D vision in 2024 manifested primarily in the following aspects:

• Controllable and physics-aware generation of visual content. With the rapid development of AI-generated content (AIGC) technology, visual content generation is evolving from simple 2D image creation toward more controllable, physics-aware approaches. This trend requires multidimensional control parameters, such as 3D viewpoints, lighting conditions, and 3D character motion, combined with physical prior knowledge, to achieve higher-quality content generation. 3D vision plays a pivotal role in this process, providing the essential spatio-temporal and physical constraints for AIGC.

• 4D spatial intelligence: bridging virtual and physical worlds. 4D spatial intelligence (3D space plus the time dimension) is emerging as a core technology connecting virtual worlds (e.g., the metaverse) and physical realities (e.g., embodied robots). It focuses on establishing a digital mapping of dynamic physical environments. Leveraging 3D vision and multimodal large models, AI systems can construct 4D spatial models to understand spatial relationships, predict motion trajectories, and simulate future evolution. Concurrently, intelligent agents can learn interactively within physical or virtual 4D environments to acquire intelligence.

• Data-driven embodied intelligence: 3D virtual simulation and human motion capture. Progress in embodied intelligence relies heavily on high-quality 3D virtual simulation data and the capture and generation of 3D human motion data. These datasets serve as the "fuel" for training embodied robots to achieve sophisticated behavioral control. With high-precision 3D vision, robots can better understand and imitate human actions, and thereby act more intelligently in complex tasks.

• Differentiable 3D representation and integration with large-model technologies. Novel 3D representations such as NeRF and 3D Gaussian Splatting are driving performance improvements in scene reconstruction and generation across scales, from microscopic cellular structures to indoor environments, human and animal modeling, autonomous driving and city modeling, and even astronomical black-hole reconstruction. Their efficiency and flexibility have opened new possibilities for 3D vision applications. Furthermore, by combining large-scale 3D data with transformer-based architectures and generative methods such as diffusion models, fundamental 3D vision tasks are being unified into efficient end-to-end frameworks, achieving scaled-up learning of core 3D vision paradigms.

The breakthroughs of 2024 have injected new momentum into the field, and future trends center on several key directions. Spatiotemporally consistent world models, integrating 4D spacetime and physical laws, will provide dynamic prediction and interaction support for complex scenarios such as autonomous driving and embodied intelligence. The deep integration of generative AI with 3D content generation will overcome data bottlenecks, enabling automated creation of high-fidelity, controllable 3D content. Enhanced cross-modal generalization will strengthen the fusion of vision, language, and motion modalities, improving the adaptability and robustness of robotic policies. Physics-driven dynamic reconstruction, combined with physics engines, will achieve high-precision modeling and interactive editing of dynamic scenes, advancing digital twins and virtual reality. Meanwhile, 3D imaging technologies will accelerate scientific exploration in astrophysics and cell biology, and efficient real-time processing and lightweight solutions will boost the reconstruction and rendering efficiency of large-scale dynamic scenes, promoting deployment on edge devices.
Ethical and privacy protection will emerge as critical concerns, balancing innovation and security through encryption technologies and regulatory frameworks governing 3D data acquisition and generation. 3D vision is evolving toward an era of more intelligent and universal "spatial intelligence," where technological breakthroughs will reshape human-computer interaction, scientific exploration, and industrial ecosystems. To foster academic exchange, this article analyzes cutting-edge trends in 3D vision and highlights the top ten research breakthroughs of the year, offering insights and references for both academia and industry.
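The "per-scene optimization" paradigm that NeRF and 3D Gaussian Splatting exemplify, fitting the parameters of a differentiable representation to observations by gradient descent, can be sketched with a deliberately minimal 1D toy. Here a single Gaussian primitive stands in for a radiance field; all function names and parameters below are illustrative, not taken from any specific method:

```python
import math

# Toy analogue of per-scene differentiable optimization: a 1D "scene"
# is represented by one Gaussian primitive whose parameters (center mu,
# amplitude amp) are fitted to observed samples by gradient descent.

def render(mu, amp, xs):
    """'Render' the primitive at sample locations xs."""
    return [amp * math.exp(-(x - mu) ** 2) for x in xs]

def fit(target_mu=0.7, target_amp=2.0, steps=800, lr=0.1):
    xs = [i / 20 for i in range(-40, 41)]        # fixed sample grid
    target = render(target_mu, target_amp, xs)   # "ground-truth" observations
    mu, amp = 0.0, 1.0                           # initial guess
    for _ in range(steps):
        pred = render(mu, amp, xs)
        # Analytic gradients of the mean L2 loss w.r.t. mu and amp.
        g_mu = g_amp = 0.0
        for x, p, t in zip(xs, pred, target):
            r = p - t
            g_amp += 2 * r * math.exp(-(x - mu) ** 2)
            g_mu += 2 * r * p * 2 * (x - mu)
        mu -= lr * g_mu / len(xs)
        amp -= lr * g_amp / len(xs)
    return mu, amp
```

Each new "scene" requires rerunning this loop from scratch. The generalizable feedforward paradigm discussed above instead amortizes the loop into a single forward pass of a network trained across many scenes, which is what enables scaled-up learning.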