Multi-view Consistent and Complementary Information Fusion Method for 3D Model Classification
2024, pp. 1-13
Online publication date: 2024-09-10
DOI: 10.11834/jig.240060
Wu Han, Hu Liangchen, Yang Ying, et al. Multi-view Consistent and Complementary Information Fusion Method for 3D Model Classification[J]. Journal of Image and Graphics, 2024: 1-13 [DOI: 10.11834/jig.240060]
Objective
Deep learning-based methods have achieved state-of-the-art performance on 3D model classification. Such methods extract features from different data representations of a 3D model, for example by using deep networks to extract multi-view features and combine them into a single, compact shape descriptor. However, these methods consider only the consistent information shared across views and ignore the differences that exist between individual views. To address this problem, this paper proposes a new feature network that learns both the consistent and the complementary information in the multi-view representation of a 3D model and fuses them effectively, so as to fully exploit the multi-view features and improve 3D model classification accuracy.
Method
The method introduces dilated convolution into the residual blocks of a residual network to enlarge the receptive field of the convolution operation, and the network structure is then adjusted for multi-view feature extraction. Next, a designed view pre-classification network separates the consistent information from the complementary information so that every view is fully exploited. To handle these two kinds of information, a learnable fusion strategy combined with an attention mechanism is introduced to fuse the two types of feature views, yielding a shape-level descriptor for reliable 3D model classification.
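As an illustration of the dilated residual block described above, the following is a minimal PyTorch-style sketch; the module name DilatedResidualBlock, the kernel sizes, and the dilation rate are our own assumptions rather than the authors' exact configuration.

# Minimal sketch: a residual block augmented with a dilated convolution
# to enlarge the receptive field (names and hyperparameters are assumed).
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        # ordinary 3x3 convolution of a standard residual block
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        # dilated 3x3 convolution inserted after the ordinary one;
        # padding=dilation keeps the spatial size unchanged
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut of the residual block

# one rendered view feature map (batch of 1, 64 channels, 56x56)
feat = torch.randn(1, 64, 56, 56)
print(DilatedResidualBlock(64)(feat).shape)  # torch.Size([1, 64, 56, 56])

With padding equal to the dilation rate, the dilated 3x3 convolution keeps the feature-map size unchanged while covering a 5x5 neighborhood, which is how the receptive field is enlarged without extra parameters.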
Result
The effectiveness of the model is verified on two subsets of the ModelNet dataset. On ModelNet40 it achieves the best performance among all compared methods. In a single-classification task set up to compare different feature extraction networks, the proposed method performs best in both classification accuracy and average loss. Compared with the baseline multi-view convolutional neural network (MVCNN), the proposed method improves performance by up to 3.6% under different numbers of views and raises the overall classification accuracy by 5.43%.
Conclusion
This paper presents a deep 3D model classification network based on multi-view information fusion, which deeply fuses the consistent and complementary information of multiple views and yields clear gains on the 3D model classification task. Experimental results show that, compared with existing related methods, the proposed method exhibits a degree of superiority.
Objective
3D model classification holds significant promise across diverse applications, including autonomous driving, game design, and 3D printing. With the rapid advancement of deep learning, numerous deep neural networks have been investigated for 3D model classification. Among these approaches, view-based methods consistently outperform voxel-based and 3D point cloud-based methods. A view-based method captures multiple 2D views of a 3D object from different angles to represent its 3D information; this closely mirrors human visual processing and turns a 3D problem into a manageable 2D task solvable with standard convolutional neural networks (CNNs). In contrast, voxel-based and point cloud-based methods focus primarily on the spatial characteristics of 3D models and require substantially larger data representations. The CNN backbone is typically drawn from established models such as the Visual Geometry Group network (VGG), the Inception network (GoogLeNet), and the residual network (ResNet) to derive view representations of 3D models. Methods such as MVCNN and the group-view convolutional neural network (GVCNN) leverage pre-trained network weights to obtain view descriptors for multiple perspectives. However, these approaches often neglect the complementary information between views, an aspect crucial for shaping the final descriptor. Because the final recognition task operates on shape descriptors, deriving 3D model shape descriptors from view descriptors remains a fundamental challenge for achieving optimal 3D model classification. Recent studies, including MVCNN and the dynamic routing convolutional neural network (DRCNN), employ a view pooling scheme to generate discriminative descriptors from the feature representations of multiple views, marking significant milestones in 3D model classification with notable performance improvements. Nevertheless, existing methods still exploit the view characteristics among multiple perspectives inadequately, which severely limits the efficacy of the resulting shape descriptors. On one hand, the inherent differences among the 2D views projected from a 3D object constitute complementary information that enriches the final shape descriptor. On the other hand, each 2D view can, to some extent, represent its corresponding 3D object, which yields consistent features across views; integrating these consistent features improves the accuracy of the recognition task. Consequently, learning the complementary information between views and integrating it with the consistent information is critical for advancing 3D model classification.
Method
To address this challenge, our paper introduces a network model that fuses the complementary and consistent information gleaned from multiple views, thereby making the fused information more comprehensive. Specifically, the model aims to fuse the association information between views for 3D object classification. First, an enhanced residual network is employed to extract feature representations from multiple views, yielding view descriptors. Then, a pre-classification network, coupled with an attention mechanism and a weight-learning strategy, fuses these view descriptors into shape descriptors. To enhance the residual structure of the ResNet model, we insert multi-scale dilated convolution after the ordinary convolution within the residual module during the single-view forward pass. This enlarges the receptive field of the convolution operation and facilitates the extraction of complementary information. In addition, a pre-classification module is proposed to gauge how well each view can be recognized from its shape characteristics; using this information, views are categorized into complementary views and consistent views. A subset of each type of view is fused into a feature view, so that the two feature views carry the two kinds of information respectively. These feature views are fed into an attention network that reinforces their consistency and complementarity. A learnable weight-fusion module is then applied to the two feature views, weighting and fusing them to generate shape descriptors. Finally, we refine the overall network structure and carefully position the pre-classification layer and the attention layer to ensure optimal outcomes for the proposed method.
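To make the fusion stage concrete, the following is a minimal sketch, in the same PyTorch style as above, of how per-view descriptors might be split by a pre-classification score, re-weighted by attention, and fused with a learnable weight into a shape descriptor. The module name ViewFusion, the median-based split, and the scalar attention form are illustrative assumptions, not the exact design of the proposed network.

# Minimal sketch of pre-classification-guided view splitting and
# attention-weighted fusion into a shape descriptor (names assumed).
import torch
import torch.nn as nn

class ViewFusion(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.pre_cls = nn.Linear(dim, num_classes)    # per-view pre-classification
        self.attention = nn.Linear(dim, 1)            # scalar attention per feature view
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable fusion weight

    def forward(self, view_desc: torch.Tensor) -> torch.Tensor:
        # view_desc: (num_views, dim) descriptors of one 3D model
        # recognition degree of each view = its highest class probability
        score = self.pre_cls(view_desc).softmax(dim=-1).max(dim=-1).values
        mask = score >= score.median()                 # confident views -> consistent group
        consistent = view_desc[mask].mean(dim=0)       # pooled consistent feature view
        complementary = view_desc[~mask].mean(dim=0) if (~mask).any() else consistent
        feats = torch.stack([consistent, complementary])    # (2, dim)
        attn = torch.softmax(self.attention(feats), dim=0)  # (2, 1)
        feats = feats * attn                                # reinforce each feature view
        w = torch.sigmoid(self.alpha)
        return w * feats[0] + (1 - w) * feats[1]       # shape descriptor, shape (dim,)

# 12 view descriptors of dimension 512, 40 candidate classes
shape_desc = ViewFusion(dim=512, num_classes=40)(torch.randn(12, 512))
print(shape_desc.shape)  # torch.Size([512])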
Result
This study conducted a series of experiments to validate the effectiveness of the proposed model on the ModelNet10 and ModelNet40 datasets. First, an ablation experiment compared module insertion positions, feature extraction networks, and numbers of views. The results indicate that tightly coupling the pre-classification module with the attention module and inserting them into the second or third layer of the residual network yields higher final classification accuracy than insertion between other layers. Under an equal number of training iterations, the method introduced in this study achieves higher average single-class classification accuracy than models such as ResNet50, with lower average loss and more stable loss convergence. We further evaluated the model across varying numbers of views: irrespective of the number of views, our method consistently outperforms or matches MVCNN and DRCNN, and as the number of views increases from 3 to 6 to 12, both our model and DRCNN improve continuously in accuracy. Finally, compared with 15 classic multi-view 3D object classification algorithms, our model achieved an average instance accuracy of 97.27% on ModelNet10 using the designed feature extraction network with the same 12 views; it falls slightly behind DRCNN there, possibly because the limited data in ModelNet10 leads to overfitting and reduced classification accuracy. On ModelNet40, our model achieved an average instance accuracy of 96.32%. The comparisons with ResNet50 and VGG-M also highlight certain advantages in classification accuracy, affirming that the proposed backbone extracts inter-view information more effectively than the other methods.
Conclusion
In this study, we present a robust deep 3D object classification method that leverages multi-view information fusion, integrating both the consistent and the complementary information among views through adjustments to the network structure. The experimental findings substantiate the favorable performance of the proposed model.
multi-view learning; 3D model classification; consistency and complementarity; improving residual network; view fusion
Wu Z R, Song S R, Khosla A, Yu F, Zhang L G, Tang X O and Xiao J X. 2015. 3D ShapeNets: a deep representation for volumetric shapes. IEEE Conference on Computer Vision and Pattern Recognition: 1912-1920 [DOI: 10.1109/CVPR.2015.7298801]
Malik J, Shimada S, Elhayek A, Ali S A, Theobalt C, Golyanik V and Stricker D. 2022. HandVoxNet++: 3D hand shape and pose estimation using voxel-based neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12): 8962-8974 [DOI: 10.1109/TPAMI.2021.3122874]
Brock A, Lim T, Ritchie J M and Weston N. 2016. Generative and discriminative voxel modeling with convolutional neural networks [DB/OL]. [2016-08-16]. http://arxiv.org/abs/1608.04236
Qi C R, Yi L, Su H and Guibas L J. 2017. PointNet: deep learning on point sets for 3D classification and segmentation. IEEE Conference on Computer Vision and Pattern Recognition: 77-85 [DOI: 10.1109/CVPR.2017.16]
Gao X Y, Li K P, Zhang C X and Yu B. 2021. 3D model classification based on Bayesian classifier with AdaBoost. Discrete Dynamics in Nature and Society, 2021: 2154762 [DOI: 10.1155/2021/2154762]
Gao Y B, Liu X B, Li J, Fang Z J, Jiang X Y and Kazi M S H. 2023. LFT-Net: local feature transformer network for point clouds analysis. IEEE Transactions on Intelligent Transportation Systems, 24(2): 2158-2168 [DOI: 10.1109/TITS.2022.3140355]
Su H, Maji S, Kalogerakis E and Learned-Miller E G. 2015. Multi-view convolutional neural networks for 3D shape recognition. IEEE International Conference on Computer Vision: 945-953 [DOI: 10.1109/ICCV.2015.114]
Feng Y F, Zhang Z Z, Zhao X B, Ji R R and Gao Y. 2018. GVCNN: group-view convolutional neural networks for 3D shape recognition. IEEE Conference on Computer Vision and Pattern Recognition: 264-272 [DOI: 10.1109/CVPR.2018.00035]
Kanezaki A, Matsushita Y and Nishida Y. 2019. RotationNet for joint object categorization and unsupervised pose estimation from multi-view images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1): 269-283 [DOI: 10.1109/TPAMI.2019.2922640]
Yu T, Meng J J and Yuan J S. 2018. Multi-view harmonized bilinear network for 3D object recognition. IEEE Conference on Computer Vision and Pattern Recognition: 186-194 [DOI: 10.1109/CVPR.2018.00027]
Sun K, Zhang J S, Liu J M, Yu R X and Song Z J. 2020. DRCNN: dynamic routing convolutional neural network for multi-view 3D object recognition. IEEE Transactions on Image Processing, 30: 868-877 [DOI: 10.1109/TIP.2020.3039378]
Xu Y, Zheng C D, Xu R T, Quan Y H and Ling H B. 2021. Multi-view 3D shape recognition via correspondence-aware deep learning. IEEE Transactions on Image Processing, 30: 5299-5312 [DOI: 10.1109/TIP.2021.3082310]
Liu Z H, Zhang Y H, Gao J and Wang S R. 2022. VFMVAC: view-filtering-based multi-view aggregating convolution for 3D shape recognition and retrieval. Pattern Recognition, 129: 108774 [DOI: 10.1016/j.patcog.2022.108774]
Song D, Fu X W, Nie W Z, Li W H and Liu A A. 2023. MV-CLIP: multi-view CLIP for zero-shot 3D shape recognition [DB/OL]. arXiv: 2311.18402 [DOI: 10.48550/arXiv.2311.18402]
Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition. 3rd International Conference on Learning Representations [DB/OL]. [2015-05-10]. http://arxiv.org/abs/1409.1556
Su J-C, Gadelha M, Wang R and Maji S. 2018. A deeper look at 3D shape classifiers. Computer Vision - ECCV 2018 Workshops, 11131: 645-661 [DOI: 10.1007/978-3-030-11015-4_49]
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S E, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions. IEEE Conference on Computer Vision and Pattern Recognition: 1-9 [DOI: 10.1109/CVPR.2015.7298594]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition: 770-778 [DOI: 10.1109/CVPR.2016.90]
Krizhevsky A, Sutskever I and Hinton G E. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6): 84-90 [DOI: 10.1145/3065386]
Han Z Z, Shang M Y, Liu Z B, Vong C-M, Liu Y-S, Zwicker M, Han J W and Chen C P. 2018. SeqViews2SeqLabels: learning 3D global features via aggregating sequential views by RNN with attention. IEEE Transactions on Image Processing, 28(2): 658-672 [DOI: 10.1109/TIP.2018.2868426]
Wei X, Yu R X and Sun J. 2020. View-GCN: view-based graph convolutional network for 3D shape analysis. IEEE/CVF Conference on Computer Vision and Pattern Recognition: 1847-1856 [DOI: 10.1109/CVPR42600.2020.00192]
Wei X, Yu R X and Sun J. 2023. Learning view-based graph convolutional network for multi-view 3D shape analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence: 7525-7541 [DOI: 10.1109/TPAMI.2022.3221785]
Chen S, Yu T and Li P. 2021. MVT: multi-view vision transformer for 3D object recognition [DB/OL]. arXiv: 2110.13083 [DOI: 10.48550/arXiv.2110.13083]
Li J, Liu Z, Li L, Lin J Q, Yao J and Tu J M. 2023. Multi-view convolutional vision transformer for 3D object recognition. Journal of Visual Communication and Image Representation, 95: 103906 [DOI: 10.1016/j.jvcir.2023.103906]
Jiang J W, Bao D, Chen Z Q, Zhao X B and Gao Y. 2019. MLVCNN: multi-loop-view convolutional neural network for 3D shape retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 33: 8513-8520 [DOI: 10.1609/aaai.v33i01.33018513]
Deng J, Dong W, Socher R, Li L J, Li K and Li F-F. 2009. ImageNet: a large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition: 248-255 [DOI: 10.1109/CVPR.2009.5206848]
Nong L P, Peng J, Zhang W H, Lin J M, Qiu H B and Wang J Y. 2023. Adaptive multi-hypergraph convolutional networks for 3D object classification. IEEE Transactions on Multimedia, 25: 4842-4855 [DOI: 10.1109/TMM.2022.3183388]
Gao X Y, Yang B Y and Zhang C X. 2023. Combine EfficientNet and CNN for 3D model classification. Mathematical Biosciences and Engineering, 20(5): 9062-9079 [DOI: 10.3934/mbe.2023398]
Zhou Y, Dang Z L, Zhang H D, Xu X M, Qin J, Li W J, Zeng F Z and Liu X Y. 2023. EFSCNN: encoded feature sphere convolution neural network for fast non-rigid 3D models classification and retrieval. Computer Vision and Image Understanding, 233: 103724 [DOI: 10.1016/j.cviu.2023.103724]
Wang W J, Wang X L, Chen G and Zhou H R. 2022. Multi-view SoftPool attention convolutional networks for 3D model classification. Frontiers in Neurorobotics, 16 [DOI: 10.3389/fnbot.2022.1029968]
Hou J H, Luo C Q, Qin F W, Shao Y L and Chen X X. 2023. FuS-GCN: efficient B-rep based graph convolutional networks for 3D-CAD model classification and retrieval. Advanced Engineering Informatics, 56: 102008 [DOI: 10.1016/j.aei.2023.102008]
Zhou H Y, Liu A A, Zhang C Y, Zhu P, Zhang Q Y and Kankanhalli M. 2023. Multi-modal meta-transfer fusion network for few-shot 3D model classification. International Journal of Computer Vision [DOI: 10.1007/s11263-023-01905-8]
Wu B, Liu Y A and Zhao J. 2024. Classification network for 3D point cloud based on spatial structure convolution and attention mechanism. Journal of Image and Graphics, 29(2): 520-532 [DOI: 10.11834/jig.230137]