嵌入Transformer结构的多尺度点云补全
Multi-scale Transformer based point cloud completion network
2022, Vol. 27, No. 2, Pages 538-549
Print publication date: 2022-02-16
Accepted: 2021-11-02
DOI: 10.11834/jig.210510
刘心溥, 马燕新, 许可, 万建伟, 郭裕兰. 嵌入Transformer结构的多尺度点云补全[J]. 中国图象图形学报, 2022,27(2):538-549.
Xinpu Liu, Yanxin Ma, Ke Xu, Jianwei Wan, Yulan Guo. Multi-scale Transformer based point cloud completion network[J]. Journal of Image and Graphics, 2022,27(2):538-549.
Objective
Most current deep learning algorithms for point cloud completion adopt an autoencoder structure. However, the multilayer perceptron (MLP) networks commonly used at the encoder tend to focus only on the overall shape of the point cloud and struggle to extract fine detail features effectively, so missing structures are poorly completed. An accurate local feature extraction algorithm is therefore needed for the point cloud completion task.
Method
To address this problem, this paper proposes a multi-scale point cloud completion algorithm with embedded attention modules. The network adopts an encoder-decoder structure overall: a feature embedding layer and a Transformer layer at the encoder extract and fuse feature information from the incomplete point cloud at three different resolutions, which is fed into a fully connected decoder that outputs the missing point cloud in a stage-by-stage manner. Finally, an attention-based discriminator is added at the decoder, borrowing the idea of generative adversarial networks (GAN) to optimize completion performance.
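The three input resolutions are obtained by farthest point down-sampling of the incomplete cloud. A minimal NumPy sketch of greedy farthest point sampling follows; this is not the authors' implementation, and the three resolutions (2048/1024/512 points) are illustrative assumptions:

```python
import numpy as np

def farthest_point_sample(points, n_samples, seed=0):
    """Greedy farthest point sampling: repeatedly pick the point
    farthest from the already-selected subset."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = [int(rng.integers(n))]
    dist = np.full(n, np.inf)  # distance to nearest selected point
    for _ in range(n_samples - 1):
        d = np.sum((points - points[selected[-1]]) ** 2, axis=1)
        dist = np.minimum(dist, d)
        selected.append(int(np.argmax(dist)))
    return points[selected]

# Three resolutions of the same incomplete cloud (sizes are assumptions).
cloud = np.random.default_rng(1).random((2048, 3))
scales = [farthest_point_sample(cloud, k) for k in (2048, 1024, 512)]
```

Because each coarser scale is a subset chosen to cover the shape as evenly as possible, the three scales describe the same object at decreasing levels of detail.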
Result
Using Chamfer distance (CD) as the evaluation metric, the proposed algorithm is compared with four related methods on two datasets. On the ShapeNet dataset, the category-averaged CD of our algorithm is 3.73% lower than that of the second-best model, PF-Net (point fractal network); on the ModelNet10 dataset, it is 12.75% lower than PF-Net. Visualized completion results of the different algorithms show that our algorithm completes detailed structures more precisely and generalizes better to atypical samples within a category.
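Chamfer distance, the metric used above, averages the nearest-neighbour distance between two point sets in both directions. A brute-force NumPy sketch for illustration (not the paper's code; squared distances are assumed, as is common):

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3):
    mean squared distance from each point to its nearest neighbour in the
    other set, summed over both directions."""
    d = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = a + np.array([0.1, 0.0, 0.0])  # b is a shifted by 0.1 along x
print(chamfer_distance(a, a))  # identical sets give 0.0
print(chamfer_distance(a, b))  # small positive value (0.02 here)
```

Lower CD means the completed cloud lies closer to the ground truth, which is why the percentage reductions above indicate improved completion accuracy.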
Conclusion
The proposed multi-scale Transformer-based point cloud completion algorithm better extracts local feature information from the incomplete point cloud, making the completion results more accurate.
Objective
Three-dimensional vision is a key research area in computer vision. Point cloud representation preserves the original geometric information in 3D space without discretization. Unfortunately, scanned 3D point clouds are often incomplete due to occlusion, constrained sensor resolution, and small viewing angles. Hence, a shape completion process is required for downstream 3D computer vision applications. Most deep learning based point cloud completion algorithms adopt an encoder-decoder structure and rely on multilayer perceptrons (MLP) to extract point cloud features at the encoder. However, MLP networks tend to focus on the overall shape of the point cloud, and it is difficult for them to extract the local structural features of an object effectively. In addition, MLPs do not generalize well to new objects, and it is difficult to complete the shapes of objects with few training samples. Therefore, designing an efficient and accurate local structural feature extraction algorithm for point cloud completion remains a challenging issue.
Method
The multi-scale Transformer based point cloud completion network (MSTCN) is presented. The entire network adopts an encoder-decoder structure composed of a multi-scale feature extractor, a pyramid point generator, and a Transformer-based discriminator. The encoder of MSTCN extracts and aggregates feature information from incomplete point clouds at three different resolutions through the Transformer module, feeds it into a fully connected decoder, and then gradually obtains the missing point clouds as outputs. A feature embedding layer (FEL) and an attention layer are built into the encoder. The former improves the encoder's ability to extract local structural features of the point cloud via sampling and neighborhood grouping; the latter obtains the correlation information among points with an improved self-attention module. In the decoder, the pyramid point generator is mainly composed of fully connected layers and reshape operations. Overall, the network operates in parallel on point clouds at three different resolutions, which are generated by farthest point down-sampling. Correspondingly, point cloud completion is divided into three stages to achieve coarse-to-fine processing in the pyramid point generator. Following the idea of generative adversarial networks (GAN), MSTCN adds a Transformer-based discriminator at the back end of the decoder, so that the discriminator and the generator promote each other in joint training and optimize the completion performance of the network. The loss function of MSTCN consists of two parts: a generating loss and an adversarial loss. The generating loss is the weighted sum of the Chamfer distances (CD) between the generated point clouds and their ground truths at the three scales, and the adversarial loss is the sum of the cross-entropy of the generated point cloud and its ground truth through the Transformer-based discriminator.
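The attention layer's core operation is self-attention over per-point features. A single-head scaled dot-product sketch in NumPy is given below; the paper's improved variant is not specified here, and the point count and feature dimension are illustrative assumptions:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a set of
    per-point features. x: (N, d); wq/wk/wv: (d, d) projections."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])          # (N, N) pairwise affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over points
    return weights @ v                              # (N, d) re-weighted features

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))          # 16 points, 8-dim features (assumed)
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(feats, wq, wk, wv)
```

Each output feature is a convex combination of all point features, which is how the layer lets every point refer to correlated structures elsewhere in the cloud.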
Result
Experiments compare the proposed method with recent methods on the ShapeNet and ModelNet10 datasets. On the ShapeNet dataset, all 16 categories were used for training, and the category-averaged CD value of MSTCN was 3.73% lower than that of the second best model. Specifically, the CD values of the cap, car, chair, earphone, lamp, pistol, and table categories are better than those of the point fractal network (PF-Net). On the ModelNet10 dataset, the category-averaged CD value of MSTCN was 12.75% lower than that of the second best model. Specifically, the CD values of the bathtub, chair, desk, dresser, monitor, night-stand, sofa, table, and toilet categories are better than those of PF-Net. Visualization results on six categories (aircraft, hat, chair, headset, motorcycle, and table) show that MSTCN can accurately complete special structures and generalize to special samples within a category. Ablation studies were also conducted on the ShapeNet dataset. The full MSTCN network performs better than three reduced networks, namely MSTCN without the feature embedding layer, without the attention layer, and without the discriminator, respectively. This illustrates that the feature embedding layer makes the model more capable of extracting local structural information from point clouds, the attention layer lets the model selectively refer to the local structure of the input point cloud during completion, and the discriminator promotes the completion performance of the network. Meanwhile, three groups of completion sub-models for different missing ratios were trained on the ShapeNet dataset to verify the robustness of the MSTCN model to input point clouds with different missing ratios; the chair category was chosen and its completion results were visualized. MSTCN maintains a good completion effect even as the number of input points decreases, and the completion results for the 25% and 50% missing ratios have similar CD values. Even when the missing ratio reaches 75%, the CD value of the chair category remains at a low level of 2.074/2.456, and the entire chair shape can be identified and completed from the incomplete chair legs alone. This verifies that MSTCN is robust to input point clouds with different missing ratios.
Conclusion
A multi-scale Transformer based point cloud completion network (MSTCN) has been presented. MSTCN better extracts local feature information from the incomplete point cloud, which makes the point cloud completion results more accurate. Current point cloud completion algorithms have achieved good results on single objects. Future research can focus on the completion of large-scale scenes, because incomplete point clouds in scenes exhibit a variety of missing patterns, such as view missing, spherical missing, and occlusion missing; completing large scenes is more challenging and practical. On the other hand, point clouds of real scanned scenes have no ground-truth point cloud for reference, so unsupervised completion algorithms have an advantage over supervised ones.
three-dimensional point cloud; point cloud completion; autoencoder; attention mechanism; generative adversarial networks (GAN)
Achlioptas P, Diamanti O, Mitliagkas I and Guibas L. 2018. Learning representations and generative models for 3D point clouds//Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: PMLR: 40-49
Chang A X, Funkhouser T, Guibas L, Hanrahan P, Huang Q X, Li Z M, Savarese S, Savva M, Song S R, Su H, Xiao J X, Li Y and Yu F. 2015. ShapeNet: an information-rich 3D model repository[EB/OL]. [2021-06-20]. https://arxiv.org/pdf/1512.03012.pdf
Charles R Q, Li Y, Su H and Guibas L J. 2017b. PointNet++: deep hierarchical feature learning on point sets in a metric space//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 5105-5114
Charles R Q, Su H, Kaichun M and Guibas L J. 2017a. PointNet: deep learning on point sets for 3D classification and segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 77-85 [DOI: 10.1109/CVPR.2017.16]
Dai A, Qi C R and Nießner M. 2017. Shape completion using 3D-encoder-predictor CNNs and shape synthesis//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2463-2471 [DOI: 10.1109/CVPR.2017.693]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. [2021-06-05]. https://arxiv.org/pdf/2010.11929v1.pdf
Fan H Q, Su H and Guibas L. 2017. A point set generation network for 3D object reconstruction from a single image//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2463-2471 [DOI: 10.1109/CVPR.2017.264]
Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2014. Generative adversarial nets//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: 2672-2680
Guo M H, Cai J X, Liu Z N, Mu T J, Martin R R and Hu S M. 2021a. PCT: point cloud transformer. Computational Visual Media, 7(2): 187-199 [DOI: 10.1007/s41095-021-0229-5]
Guo Y L, Wang H Y, Hu Q Y, Liu H, Liu L and Bennamoun M. 2021b. Deep learning for 3D point clouds: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12): 4338-4364 [DOI: 10.1109/TPAMI.2020.3005434]
Han X G, Li Z, Huang H B, Kalogerakis E and Yu Y Z. 2017. High-resolution shape completion using deep neural networks for global structure and local geometry inference//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 85-93 [DOI: 10.1109/ICCV.2017.19]
Huang Z T, Yu Y K, Xu J W, Ni F and Le X Y. 2020. PF-Net: point fractal network for 3D point cloud completion//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 7659-7667 [DOI: 10.1109/CVPR42600.2020.00768]
Kipf T N and Welling M. 2017. Semi-supervised classification with graph convolutional networks[EB/OL]. [2021-06-20]. https://arxiv.org/pdf/1609.02907.pdf
Li Y Y, Dai A, Guibas L and Nießner M. 2015. Database-assisted object retrieval for real-time 3D reconstruction. Computer Graphics Forum, 34(2): 435-446 [DOI: 10.1111/cgf.12573]
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 936-944 [DOI: 10.1109/CVPR.2017.106]
Long X X, Cheng X J, Zhu H, Zhang P J, Liu H M, Li J, Zheng L T, Hu Q Y, Liu H, Cao X, Yang R G, Wu Y H, Zhang G F, Liu Y B, Xu K, Guo Y L and Chen B Q. 2021. Recent progress in 3D vision. Journal of Image and Graphics, 26(6): 1389-1428 [DOI: 10.11834/jig.210043]
Mitra N J, Pauly M, Wand M and Ceylan D. 2013. Symmetry in 3D geometry: extraction and applications. Computer Graphics Forum, 32(6): 1-23 [DOI: 10.1111/cgf.12010]
Nguyen D T, Hua B S, Tran M K, Pham Q H and Yeung S K. 2016. A field model for repairing 3D shapes//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 5676-5684 [DOI: 10.1109/CVPR.2016.612]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: NIPS: 6000-6010
Wang Y, Sun Y B, Liu Z W, Sarma S E, Bronstein M M and Solomon J M. 2019. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics, 38(5): #146 [DOI: 10.1145/3326362]
Wen X, Li T Y, Han Z Z and Liu Y S. 2020. Point cloud completion by skip-attention network with hierarchical folding//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 1936-1945 [DOI: 10.1109/CVPR42600.2020.00201]
Wu Z R, Song S R, Khosla A, Yu F, Zhang L G, Tang X O and Xiao J X. 2015. 3D ShapeNets: a deep representation for volumetric shapes//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1912-1920 [DOI: 10.1109/CVPR.2015.7298801]
Xie H Z, Yao H X, Zhou S C, Mao J G, Zhang S P and Sun W X. 2020. GRNet: gridding residual network for dense point cloud completion//Proceedings of the 16th European Conference on Computer Vision. Glasgow, Scotland: Springer: 365-381 [DOI: 10.1007/978-3-030-58545-7_21]
Yang Y Q, Chen F, Shen Y R and Tian D. 2018. FoldingNet: point cloud auto-encoder via deep grid deformation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 206-215 [DOI: 10.1109/CVPR.2018.00029]
Yuan W T, Khot T, Held D, Mertz C and Hebert M. 2018. PCN: point completion network//Proceedings of 2018 International Conference on 3D Vision (3DV). Verona, Italy: IEEE: 728-737 [DOI: 10.1109/3DV.2018.00088]
Zhao H S, Jiang L, Jia J Y, Torr P and Koltun V. 2021. Point transformer[EB/OL]. [2021-06-20]. https://arxiv.org/pdf/2012.09164.pdf
Zhao W, Gao S M and Lin H W. 2007. A robust hole-filling algorithm for triangular mesh. The Visual Computer, 23(12): 987-997 [DOI: 10.1007/s00371-007-0167-y]