Geometric attribute-guided 3D semantic instance reconstruction
Vol. 29, Issue 1, Pages 218-230 (2024)
Published: 16 January 2024
DOI: 10.11834/jig.230106
Wan Junhui, Liu Xinpu, Chen Lili, Ao Sheng, Zhang Peng, Guo Yulan. 2024. Geometric attribute-guided 3D semantic instance reconstruction. Journal of Image and Graphics, 29(01):0218-0230
Objective
Semantic instance reconstruction is an important problem for robots seeking to understand the real world. Although much progress has been made in recent years, reconstruction performance remains susceptible to occlusion and noise. In particular, existing methods ignore the prior geometric attributes of objects and overlook their key detailed information, resulting in coarse, low-accuracy reconstructed mesh models. To address this problem, a geometric attribute-guided semantic instance reconstruction algorithm is proposed.
Method
First, bounding box parameters are obtained with an object detector, and box sampling is performed on each target instance to obtain its corresponding partial local point cloud from the scene. Then, the feature embedding layer and Transformer layer on the encoder side extract rich and crucial detailed geometric information of the object to produce the corresponding local features, while the prior semantic information of the object helps the algorithm converge to the target shape faster. Finally, a feature transformer is designed to align the global features of the object; these are fused with the aforementioned local features and fed into the shape generation module for object mesh reconstruction.
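As an illustration of the box-sampling step, the following minimal NumPy sketch selects the scene points that fall inside a detected, yaw-rotated 3D bounding box and normalizes them into the box's canonical frame. The function name, argument layout, and unit-cube normalization are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def box_sample(points, center, size, heading):
    """Select scene points inside a yaw-rotated 3D bounding box and
    normalize them into the box's canonical frame.

    points:  (N, 3) scene point cloud
    center:  (3,) box center; size: (3,) box extents (dx, dy, dz)
    heading: rotation about the up (z) axis, in radians
    """
    c, s = np.cos(-heading), np.sin(-heading)
    # Rotate translated points into the box-aligned frame (inverse yaw).
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    local = (points - center) @ R.T
    # Keep points whose box-frame coordinates fall within the extents.
    mask = np.all(np.abs(local) <= size / 2.0, axis=1)
    # Scale to a canonical unit cube so the shape decoder sees a
    # pose- and scale-normalized partial point cloud.
    return local[mask] / np.maximum(size, 1e-6)
```

Compared with sampling a ball around the box center, cropping exactly to the detected box keeps the instance's own points and excludes neighboring clutter, which is the noise-reduction effect described above.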
Results
The proposed algorithm is comprehensively compared with the latest methods on the real-world ScanNet v2 dataset, and the experimental results demonstrate its effectiveness. Compared with the second-ranked RfD-Net, the instance reconstruction metric of the proposed algorithm improves by 8%. In addition, detailed ablation experiments are conducted to verify the effectiveness of each module of the algorithm.
Conclusion
The proposed geometric attribute-guided semantic instance reconstruction algorithm makes better use of the geometric attribute information of objects, yielding finer and more accurate reconstruction results.
Objective
The objective of 3D vision is to capture the geometric and optical features of the real world from multiple perspectives and convert this information into digital form, enabling computers to understand and process it. 3D vision is an important aspect of computer graphics. However, sensors can only provide partial observations of the world due to viewpoint occlusion, sparse sensing, and measurement noise, resulting in a partial and incomplete representation of a scene. Semantic instance reconstruction is proposed to solve this problem. It converts 2D/3D data obtained from multiple sensors into a semantic representation of the scene, including a model of each object instance in the scene. Machine learning and computer vision techniques are applied to achieve high-precision reconstruction results, and point cloud-based methods have demonstrated prominent advantages. However, existing methods disregard the prior geometric and semantic information of objects, and their simple max-pooling operation loses key structural information of objects, resulting in poor instance reconstruction performance.
Method
In this study, a geometric attribute-guided semantic instance reconstruction network (GANet), which consists of a 3D object detector, a spatial Transformer, and a mesh generator, is proposed. We design the spatial Transformer to exploit the geometric and semantic information of instances. After the 3D bounding box information of instances in the scene is obtained, box sampling is used to extract the true local point cloud of each target instance on the basis of the instance scale information, and semantic information is then embedded for foreground point segmentation. Compared with ball sampling, box sampling reduces noise and obtains more effective information. Then, the encoder's feature embedding and Transformer layers extract rich and crucial detailed geometric information of objects from coarse to fine to obtain the corresponding local features. The feature embedding layer also utilizes the prior semantic information of objects to help the algorithm quickly approximate the target shape, and the attention module in the Transformer integrates the correlation information between points. The algorithm also uses the object's global features provided by the detector. Considering the inconsistency between the scene space and the canonical space, a purpose-designed feature space transformer is used to align the object's global features. Finally, the fused features are sent to the mesh generator for mesh reconstruction. The loss function of GANet consists of two parts: detection and shape losses. Detection loss is the weighted sum of the instance confidence, semantic classification, and bounding box estimation losses. Shape loss consists of three parts: the Kullback-Leibler divergence between the predicted and standard normal distributions, the foreground point segmentation loss, and the occupancy point estimation loss. Occupancy point estimation loss is the cross-entropy between the predicted occupancy values of the spatial candidate points and the real occupancy values.
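To make the composition of the shape loss concrete, the following PyTorch-style sketch combines the three terms described above. The function name, the weights w_kl/w_seg/w_occ, and the use of binary cross-entropy for both segmentation and occupancy are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def shape_loss(mu, logvar, seg_logits, seg_labels, occ_logits, occ_gt,
               w_kl=0.1, w_seg=1.0, w_occ=1.0):
    """Shape loss: KL divergence between the predicted latent
    distribution and a standard normal, foreground-point segmentation
    loss, and occupancy estimation loss. Weights are illustrative."""
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian q = N(mu, exp(logvar)),
    # averaged over batch and latent dimensions.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Binary cross-entropy for foreground/background point segmentation
    # (seg_labels are float 0/1 targets).
    seg = F.binary_cross_entropy_with_logits(seg_logits, seg_labels)
    # Cross-entropy between predicted and ground-truth occupancy of the
    # sampled spatial candidate points.
    occ = F.binary_cross_entropy_with_logits(occ_logits, occ_gt)
    return w_kl * kl + w_seg * seg + w_occ * occ
```

The total training objective would then be the detection loss plus this shape loss, with the detection loss itself a weighted sum of the instance confidence, semantic classification, and bounding box estimation terms.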
Result
The proposed method was compared with the latest methods on the ScanNet v2 dataset. The algorithm used the computer-aided design (CAD) models provided by Scan2CAD, covering 8 categories, as ground truth for training. The mean average precision of semantic instance reconstruction increased by 8% compared with the second-ranked method, RfD-Net. The average precision on bathtub, trash bin, sofa, chair, and cabinet is better than that of RfD-Net. The visualization results show that GANet reconstructs object models that are more consistent with the scene. Ablation experiments were also conducted on the same dataset. The complete network outperformed four ablated variants: replacing box sampling with ball sampling, replacing the Transformer with PointNet, removing the semantic embedding of point cloud features, and removing the feature transformation. The experimental results indicate that box sampling obtains more effective local point cloud information, the Transformer-based point cloud encoder enables the network to exploit more critical local structural information of the foreground point cloud during reconstruction, and semantic embedding provides prior information for instance reconstruction. Feature space transformation aligns the global prior information of an object, further improving the reconstruction quality.
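For reference, the mean average precision reported here is the mean, over categories, of the per-category average precision computed from confidence-ranked predictions. A minimal sketch of one AP computation follows; the matching rule (e.g., mesh IoU above a threshold) and the integration scheme are assumptions for illustration:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one category: sort predictions by confidence, accumulate
    precision/recall, and integrate the area under the PR curve.
    is_tp[i] marks whether prediction i matched an unclaimed
    ground-truth instance. Illustrative only."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    # Make precision monotonically non-increasing, then integrate.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    return float(np.trapz(precision, recall))
```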
Conclusion
In this study, we proposed a geometric attribute-guided network (GANet). This network accounts for the complexity of scene objects and makes better use of the geometric and attribute information of objects. The experimental results show that our network outperforms several state-of-the-art approaches. Current 3D-based semantic instance reconstruction algorithms have achieved good results, but acquiring and annotating 3D data remain relatively expensive. Future research can focus on using 2D data to assist semantic instance reconstruction.
Keywords: scene reconstruction; three-dimensional point cloud; semantic instance reconstruction; mesh generation; object detection
Avetisyan A, Dahnert M, Dai A, Savva M, Chang A X and Nießner M. 2019. Scan2CAD: learning CAD model alignment in RGB-D scans//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 2609-2618 [DOI: 10.1109/CVPR.2019.00272]
Chang A X, Funkhouser T, Guibas L, Hanrahan P, Huang Q X, Li Z M, Savarese S, Savva M, Song S R, Su H, Xiao J X, Yi L and Yu F. 2015. ShapeNet: an information-rich 3D model repository [EB/OL]. [2023-03-20]. https://arxiv.org/pdf/1512.03012.pdf
Dai A, Chang A X, Savva M, Halber M, Funkhouser T and Nießner M. 2017a. ScanNet: richly-annotated 3D reconstructions of indoor scenes//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5828-5839 [DOI: 10.1109/CVPR.2017.261]
Dai A, Qi C R and Nießner M. 2017b. Shape completion using 3D-encoder-predictor CNNs and shape synthesis//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5868-5877 [DOI: 10.1109/CVPR.2017.693]
Dai A, Ritchie D, Bokeloh M, Reed S, Sturm J and Nießner M. 2018. ScanComplete: large-scale scene completion and semantic segmentation for 3D scans//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4578-4587 [DOI: 10.1109/CVPR.2018.00481]
Dai A, Siddiqui Y, Thies J, Valentin J and Nießner M. 2021. SPSG: self-supervised photometric scene generation from RGB-D scans//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 1747-1756 [DOI: 10.1109/CVPR46437.2021.00179]
Fu Z H, Wang L G, Xu L, Wang Z Y, Laga H, Guo Y L, Boussaid F and Bennamoun M. 2023. VAPCNet: viewpoint-aware 3D point cloud completion//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 12108-12118
Gkioxari G, Johnson J and Malik J. 2019. Mesh R-CNN//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9785-9795 [DOI: 10.1109/ICCV.2019.00988]
Guo Y L, Wang H Y, Hu Q Y, Liu H, Liu L and Bennamoun M. 2020. Deep learning for 3D point clouds: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12): 4338-4364 [DOI: 10.1109/TPAMI.2020.3005434]
Han X G, Zhang Z X, Du D, Yang M D, Yu J M, Pan P, Yang X, Liu L G, Xiong Z X and Cui S G. 2019. Deep reinforcement learning of volume-guided progressive view inpainting for 3D point scene completion from a single depth image//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 234-243 [DOI: 10.1109/CVPR.2019.00032]
Hou J, Dai A and Nießner M. 2019. 3D-SIS: 3D semantic instance segmentation of RGB-D scans//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4416-4425 [DOI: 10.1109/CVPR.2019.00455]
Hou J, Dai A and Nießner M. 2020. RevealNet: seeing behind objects in RGB-D scans//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 2098-2107 [DOI: 10.1109/CVPR42600.2020.00217]
Liu X P, Ma Y X, Xu K, Wan J W and Guo Y L. 2022. AGFA-Net: adaptive global feature augmentation network for point cloud completion. IEEE Geoscience and Remote Sensing Letters, 19: #7004505 [DOI: 10.1109/LGRS.2022.3198799]
Long X X, Cheng X J, Zhu H, Zhang P J, Liu H M, Li J, Zheng L T, Hu Q Y, Liu H, Cao X, Yang R G, Wu Y H, Zhang G F, Liu Y B, Xu K, Guo Y L and Chen B Q. 2021. Recent progress in 3D vision. Journal of Image and Graphics, 26(6): 1389-1428 [DOI: 10.11834/jig.210043]
Lorensen W E and Cline H E. 1987. Marching cubes: a high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4): 163-169 [DOI: 10.1145/37402.37422]
Mescheder L, Oechsle M, Niemeyer M, Nowozin S and Geiger A. 2019. Occupancy networks: learning 3D reconstruction in function space//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4455-4465 [DOI: 10.1109/CVPR.2019.00459]
Nie Y Y, Han X G, Guo S H, Zheng Y J, Chang J and Zhang J J. 2020. Total3DUnderstanding: joint layout, object pose and mesh reconstruction for indoor scenes from a single image//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 52-61 [DOI: 10.1109/CVPR42600.2020.00013]
Nie Y Y, Hou J, Han X G and Nießner M. 2021. RfD-Net: point scene understanding by semantic instance reconstruction//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 4608-4618 [DOI: 10.1109/CVPR46437.2021.00458]
Qi C R, Su H, Mo K C and Guibas L J. 2017. PointNet: deep learning on point sets for 3D classification and segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 77-85 [DOI: 10.1109/CVPR.2017.16]
Roldão L, De Charette R and Verroust-Blondet A. 2022. 3D semantic scene completion: a survey. International Journal of Computer Vision, 130(8): 1978-2005 [DOI: 10.1007/s11263-021-01504-5]
Tang J X, Chen X K, Wang J B and Zeng G. 2022. Point scene understanding via disentangled instance mesh reconstruction//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 684-701 [DOI: 10.1007/978-3-031-19824-3_40]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Xie Q, Lai Y K, Wu J, Wang Z T, Zhang Y M, Xu K and Wang J. 2020. MLCVNet: multi-level context VoteNet for 3D object detection//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10447-10456 [DOI: 10.1109/CVPR42600.2020.01046]
Yuan W T, Khot T, Held D, Mertz C and Hebert M. 2018. PCN: point completion network//Proceedings of 2018 International Conference on 3D Vision (3DV). Verona, Italy: IEEE: 728-737 [DOI: 10.1109/3DV.2018.00088]
Zhong M and Zeng G. 2020. Semantic point completion network for 3D semantic scene completion//Proceedings of the 24th European Conference on Artificial Intelligence. Santiago de Compostela, Spain: IOS Press: 2824-2831