面向弱纹理目标立体匹配的Transformer网络
Transformer network for stereo matching of weak texture objects
2024年29卷第8期 页码:2413-2425
纸质出版日期: 2024-08-16
DOI: 10.11834/jig.230575
贾迪, 蔡鹏, 吴思, 王骞, 宋慧伦. 2024. 面向弱纹理目标立体匹配的Transformer网络. 中国图象图形学报, 29(08):2413-2425
Jia Di, Cai Peng, Wu Si, Wang Qian, Song Huilun. 2024. Transformer network for stereo matching of weak texture objects. Journal of Image and Graphics, 29(08):2413-2425
目的
近年来,采用神经网络完成立体匹配任务已成为计算机视觉领域的研究热点,目前现有方法存在弱纹理目标缺乏全局表征的问题,为此本文提出一种基于Transformer架构的密集特征提取网络。
方法
首先,采用空间池化窗口策略使得Transformer层可以在维持线性计算复杂度的同时,捕获广泛的上下文表示,弥补局部弱纹理导致的特征匮乏问题。其次,通过卷积与转置卷积实现重叠式块嵌入,使得所有特征点都尽可能多地捕捉邻近特征,便于细粒度匹配。再者,通过将跳跃查询策略应用于编码器和解码器间的特征融合部分,以此实现高效信息传递。最后,针对立体像对存在的遮挡情况,对固定区域内的匹配概率进行截断求和,输出更为合理的遮挡置信度。
结果
在Scene Flow数据集上进行了消融实验,实验结果表明,本文网络获得了0.33的绝对像素距离,0.92%的异常像素占比和98%的遮挡预测交并比。为了验证模型在实际路况场景下的有效性,在KITTI-2015数据集上进行了补充对比实验,本文方法获得了1.78%的平均异常值百分比,上述指标均优于STTR(stereo Transformer)等主流方法。此外,在KITTI-2015、MPI-Sintel(max planck institute sintel)和Middlebury-2014数据集的测试中,本文模型具备较强的泛化性。
结论
本文提出了一个纯粹的基于Transformer架构的密集特征提取器,使用空间池化窗口策略减小注意力计算的空间规模,并利用跳跃查询策略对编码器和解码器的特征进行了有效融合,可以较好地提高Transformer架构下的特征提取性能。
Objective
In recent years, the use of neural networks for stereo matching has become a major research topic in computer vision. Stereo matching is a classic and computationally intensive vision task that underpins advanced applications such as 3D reconstruction, autonomous driving, and augmented reality. Given a pair of rectified (distortion-corrected) stereo images, the goal is to match corresponding pixels along the epipolar lines and compute their horizontal offset, known as disparity. Many researchers have explored deep learning-based stereo matching methods and achieved promising results, with convolutional neural networks commonly used to build the feature extractors. Although convolution-based feature extractors have brought significant performance gains, such networks remain constrained by their fundamental operation, convolution. By definition, convolution is a linear operator with a limited receptive field, so obtaining a sufficiently broad contextual representation requires stacking many convolutional layers in deep architectures. This limitation is particularly pronounced in stereo matching, where captured image pairs inevitably contain large weakly textured regions, and building comprehensive global feature representations through repeated convolutional stacking demands substantial computational resources. To address this issue, we build a dense feature extraction Transformer (FET) for stereo matching that combines Transformer and convolution blocks.
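For context, the disparity mentioned above relates to scene depth through the standard rectified-stereo geometry. The following minimal sketch is the generic textbook relation, not code from this paper, and the focal length and baseline values are only illustrative of a KITTI-like rig:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Rectified-stereo relation Z = f * B / d.

    disparity_px : horizontal offset x_left - x_right of a matched pixel pair (pixels)
    focal_px     : focal length expressed in pixels
    baseline_m   : distance between the two camera centres (metres)
    """
    return focal_px * baseline_m / disparity_px


# A 50-pixel disparity with f = 720 px and a 0.54 m baseline gives roughly 7.8 m depth.
print(depth_from_disparity(50.0, 720.0, 0.54))
```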
Method
In the context of stereo matching, FET offers three key advantages. First, when processing high-resolution stereo image pairs, the spatial pooling window inside the Transformer block allows the model to maintain linear computational complexity while obtaining a sufficiently broad contextual representation, which addresses the feature scarcity caused by locally weak textures. Second, we use convolution and transposed convolution blocks to implement overlapping patch embeddings for subsampling and upsampling, so that every feature point captures as many neighboring features as possible, which facilitates fine-grained matching. Third, we apply a skip-query strategy to the feature fusion between the encoder and the decoder to transmit information efficiently. Finally, we adopt the attention-based pixel matching strategy of the stereo Transformer (STTR) to realize a purely Transformer-based architecture; this strategy truncates the summation of matching probabilities within a fixed region and outputs more reasonable occlusion confidence values.
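As an illustration of the first point, the sketch below shows one way a spatial pooling window can bound the attention cost: queries stay at full resolution while keys and values are pooled onto a fixed window, so the score matrix grows linearly with the number of pixels. Module names and sizes here are hypothetical, and the paper's actual block may differ in detail:

```python
import torch
import torch.nn as nn

class PooledWindowAttention(nn.Module):
    """Cross-attention from full-resolution queries to spatially pooled keys/values.

    Pooling K/V onto a fixed w x w grid keeps the score matrix at N x (w * w),
    i.e. linear in the number of pixels N rather than quadratic.
    (Hypothetical sketch; the paper's spatial pooling window may differ.)
    """

    def __init__(self, dim, num_heads=4, window=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(window)      # K/V context on a fixed window x window grid
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                             # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)              # (B, H*W, C): one query per pixel
        kv = self.pool(x).flatten(2).transpose(1, 2)  # (B, window*window, C): pooled context
        out, _ = self.attn(q, kv, kv)                 # broad context at O(H*W) cost
        return out.transpose(1, 2).reshape(b, c, h, w)


# Example: a high-resolution feature map keeps its shape after the attention block.
feat = torch.randn(1, 64, 96, 320)
print(PooledWindowAttention(dim=64)(feat).shape)      # torch.Size([1, 64, 96, 320])
```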
Result
In the experiments, we implemented the model with the PyTorch framework and trained it on an NVIDIA RTX 3090 GPU. Mixed precision was used during training to reduce GPU memory consumption and improve training speed. However, training a pure Transformer architecture in mixed precision proved unstable, and the loss diverged after only a few iterations. To address this issue, we modified the order of computation of the attention scores to suppress the related overflows and restructured the attention calculation based on the shift invariance of the softmax operation (the result is unchanged when a constant is subtracted from every score). Ablation experiments were conducted on the Scene Flow dataset. The results show that the proposed network achieves an absolute pixel distance of 0.33, an outlier pixel ratio of 0.92%, and an occlusion prediction intersection over union (IoU) of 98%. Additional comparative experiments on the KITTI-2015 dataset validate the effectiveness of the model in real-world driving scenarios: the proposed method achieves an average outlier percentage of 1.78%, outperforming mainstream methods such as STTR. Moreover, in tests on the KITTI-2015, MPI-Sintel, and Middlebury-2014 datasets, the proposed model demonstrates strong generalization. Finally, because publicly available datasets offer only a limited characterization of weak-texture levels, we used a clustering approach to filter images from the Scene Flow test set. Each pixel was treated as a sample with its RGB values as the feature dimensions, and the number of resulting pixel clusters in each image served as a measure of its texture strength. The images were then categorized into “difficult”, “moderate”, and “easy” cases according to the number of clusters. In the comparative analysis, our approach consistently outperformed existing methods across all three categories, with a particularly notable improvement on the “difficult” cases.
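The overflow fix relies on softmax being unchanged when the same constant is subtracted from every score. A minimal sketch of this reordering (illustrative only, not the authors' code) is:

```python
import torch

def stable_attention_probs(q, k, scale):
    """Attention probabilities ordered so that half-precision logits cannot overflow.

    softmax(s) == softmax(s + c) for any constant c, so subtracting the per-row
    maximum changes nothing mathematically while keeping every exponent <= 0.
    Scaling q before the matmul also limits the magnitude of the products.
    (Illustrative sketch, not the paper's code.)
    """
    scores = (q * scale) @ k.transpose(-2, -1)            # (B, heads, L, L)
    scores = scores - scores.amax(dim=-1, keepdim=True)   # shift invariance of softmax
    return torch.softmax(scores, dim=-1)


# Example shapes; under torch.autocast the same ordering keeps fp16 training stable.
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
print(stable_attention_probs(q, k, scale=64 ** -0.5).shape)  # torch.Size([2, 8, 128, 128])
```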
Conclusion
For the stereo matching task, we propose a feature extractor based purely on the Transformer architecture. First, we transplant the encoder-decoder architecture of the Transformer into the feature extractor, which effectively combines the inductive bias of convolutions with the global modeling capability of the Transformer. The Transformer-based feature extractor can capture a broader range of contextual representations, which partially alleviates the region ambiguity caused by locally weak textures. Furthermore, we introduce a skip-query strategy between the encoder and decoder to achieve efficient information transfer and mitigate the semantic discrepancy between them. We also design a spatial pooling window strategy to reduce the considerable computational burden introduced by overlapping patch embeddings, which keeps the attention computation of the model within linear complexity. Experimental results demonstrate significant improvements in weak-texture region prediction, occluded-region prediction, and domain generalization compared with related methods.
立体匹配; 弱纹理目标; Transformer; 空间池化窗口; 跳跃查询; 截断求和; Scene Flow; KITTI-2015
stereo matching; low-texture target; Transformer; spatial pooling windows; jump queries; truncated summation; Scene Flow; KITTI-2015
Badki A, Troccoli A, Kim K, Kautz J, Sen P and Gallo O. 2020. Bi3D: stereo depth estimation via binary classifications//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 1597-1605 [DOI: 10.1109/CVPR42600.2020.00167]
Bendig K, Schuster R and Stricker D. 2022. Self-superflow: self-supervised scene flow prediction in stereo sequences//Proceedings of 2022 IEEE International Conference on Image Processing (ICIP). Bordeaux, France: IEEE: 481-485 [DOI: 10.1109/ICIP46576.2022.9897832]
Butler D J, Wulff J, Stanley G B and Black M J. 2012. A naturalistic open source movie for optical flow evaluation//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer: 611-625 [DOI: 10.1007/978-3-642-33783-3_44]
Cao H, Wang Y Y, Chen J, Jiang D S, Zhang X P, Tian Q and Wang M N. 2022. Swin-Unet: Unet-like pure Transformer for medical image segmentation//Proceedings of 2022 European Conference on Computer Vision. Tel Aviv, Israel: Springer: 205-218 [DOI: 10.1007/978-3-031-25066-8_9]
Chang J R and Chen Y S. 2018. Pyramid stereo matching network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5410-5418 [DOI: 10.1109/CVPR.2018.00567]
Chen C F R, Fan Q F and Panda R. 2021. CrossViT: cross-attention multi-scale vision Transformer for image classification//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 347-356 [DOI: 10.1109/ICCV48922.2021.00041]
Chen L C, Zhu Y K, Papandreou G, Schroff F and Adam H. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 833-851 [DOI: 10.1007/978-3-030-01234-2_49]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2023-08-18]. https://arxiv.org/pdf/2010.11929.pdf
Emlek A and Peker M. 2023. P3SNet: parallel pyramid pooling stereo network. IEEE Transactions on Intelligent Transportation Systems, 24(10): 10433-10444 [DOI: 10.1109/TITS.2023.3276328]
Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 1440-1448 [DOI: 10.1109/ICCV.2015.169]
Huang Z Y, Norris T B and Wang P Q. 2021. ES-Net: an efficient stereo matching network [EB/OL]. [2023-08-18]. https://arxiv.org/pdf/2103.03922.pdf
Hussain M I, Rafique M A and Jeon M. 2022. RVMDE: radar validated monocular depth estimation for robotics [EB/OL]. [2023-08-18]. https://arxiv.org/pdf/2109.05265.pdf
Laga H, Jospin L V, Boussaid F and Bennamoun M. 2022. A survey on deep learning techniques for stereo-based depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4): 1738-1764 [DOI: 10.1109/TPAMI.2020.3032602]
Li J K, Wang P S, Xiong P F, Cai T, Yan Z W, Yang L, Liu J Y, Fan H Q and Liu S C. 2022. Practical stereo matching via cascaded recurrent network with adaptive correlation//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 16242-16251 [DOI: 10.1109/CVPR52688.2022.01578]
Li Z S, Liu X T, Drenkow N, Ding A, Creighton F X, Taylor R H and Unberath M. 2021. Revisiting stereo depth estimation from a sequence-to-sequence perspective with Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 6177-6186 [DOI: 10.1109/ICCV48922.2021.00614]
Lipson L, Teed Z and Deng J. 2021. RAFT-stereo: multilevel recurrent field transforms for stereo matching//Proceedings of 2021 International Conference on 3D Vision. London, UK: IEEE: 218-227 [DOI: 10.1109/3DV53792.2021.00032]
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin Transformer: hierarchical vision Transformer using shifted windows//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Loshchilov I and Hutter F. 2019. Decoupled weight decay regularization [EB/OL]. [2023-08-18]. https://arxiv.org/pdf/1711.05101.pdf
Mao J Y, Song Y Q and Liu Z. 2021. CT image classification of liver tumors based on multi-scale and deep feature extraction. Journal of Image and Graphics, 26(7): 1704-1715
毛静怡, 宋余庆, 刘哲. 2021. 多尺度深度特征提取的肝脏肿瘤CT图像分类. 中国图象图形学报, 26(7): 1704-1715 [DOI: 10.11834/jig.200041]
Mayer N, Ilg E, Hausser P, Fischer P, Cremers D, Dosovitskiy A and Brox T. 2016. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 4040-4048 [DOI: 10.1109/CVPR.2016.438]
Menze M and Geiger A. 2015. Object scene flow for autonomous vehicles//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE: 3061-3070 [DOI: 10.1109/CVPR.2015.7298925]
Peng C, Zhang X Y, Yu G, Luo G M and Sun J. 2017. Large kernel matters -- improve semantic segmentation by global convolutional network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 1743-1751 [DOI: 10.1109/CVPR.2017.189]
Ranftl R, Bochkovskiy A and Koltun V. 2021. Vision Transformers for dense prediction//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 12159-12168 [DOI: 10.1109/ICCV48922.2021.01196]
Ronneberger O, Fischer P and Brox T. 2015. U-net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Sarlin P E, DeTone D, Malisiewicz T and Rabinovich A. 2020. SuperGlue: learning feature matching with graph neural networks//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 4937-4946 [DOI: 10.1109/CVPR42600.2020.00499]
Scharstein D, Hirschmüller H, Kitajima Y, Krathwohl G, Nešić N, Wang X and Westling P. 2014. High-resolution stereo datasets with subpixel-accurate ground truth//Proceedings of the 36th German Conference on Pattern Recognition. Münster, Germany: Springer: 31-42 [DOI: 10.1007/978-3-319-11752-2_3]
Sun J M, Shen Z H, Wang Y A, Bao H J and Zhou X W. 2021. LoFTR: detector-free local feature matching with Transformers//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 8918-8927 [DOI: 10.1109/CVPR46437.2021.00881]
Tankovich V, Häne C, Zhang Y D, Kowdle A, Fanello S and Bouaziz S. 2021. HITNet: hierarchical iterative tile refinement network for real-time stereo matching//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 14357-14367 [DOI: 10.1109/CVPR46437.2021.01413]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang W H, Xie E Z, Li X, Fan D P, Song K T, Liang D, Lu T, Luo P and Shao L. 2021. Pyramid vision Transformer: a versatile backbone for dense prediction without convolutions//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 548-558 [DOI: 10.1109/ICCV48922.2021.00061]
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/CVPR.2018.00813]
Zhao H S, Shi J P, Qi X J, Wang X G and Jia J Y. 2017. Pyramid scene parsing network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 6230-6239 [DOI: 10.1109/CVPR.2017.660]
Zheng S X, Lu J C, Zhao H S, Zhu X T, Luo Z K, Wang Y B, Fu Y W, Feng J F, Xiang T, Torr P H S and Zhang L. 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 6877-6886 [DOI: 10.1109/CVPR46437.2021.00681]