面向弱纹理目标立体匹配的Transformer网络
Transformer network for stereo matching of weak texture objects
2024年29卷第8期 页码:2413-2425
纸质出版日期: 2024-08-16
DOI: 10.11834/jig.230575
贾迪, 蔡鹏, 吴思, 王骞, 宋慧伦. 2024. 面向弱纹理目标立体匹配的Transformer网络. 中国图象图形学报, 29(08):2413-2425
Jia Di, Cai Peng, Wu Si, Wang Qian, Song Huilun. 2024. Transformer network for stereo matching of weak texture objects. Journal of Image and Graphics, 29(08):2413-2425
目的
近年来,采用神经网络完成立体匹配任务已成为计算机视觉领域的研究热点,目前现有方法存在弱纹理目标缺乏全局表征的问题,为此本文提出一种基于Transformer架构的密集特征提取网络。
方法
首先,采用空间池化窗口策略使得Transformer层可以在维持线性计算复杂度的同时,捕获广泛的上下文表示,弥补局部弱纹理导致的特征匮乏问题。其次,通过卷积与转置卷积实现重叠式块嵌入,使得所有特征点都尽可能多地捕捉邻近特征,便于细粒度匹配。再者,通过将跳跃查询策略应用于编码器和解码器间的特征融合部分,以此实现高效信息传递。最后,针对立体像对存在的遮挡情况,对固定区域内的匹配概率进行截断求和,输出更为合理的遮挡置信度。
结果
在Scene Flow数据集上进行了消融实验,实验结果表明,本文网络获得了0.33的绝对像素距离,0.92%的异常像素占比和98%的遮挡预测交并比。为了验证模型在实际路况场景下的有效性,在KITTI-2015数据集上进行了补充对比实验,本文方法获得了1.78%的平均异常值百分比,上述指标均优于STTR(stereo Transformer)等主流方法。此外,在KITTI-2015、MPI-Sintel(max planck institute sintel)和Middlebury-2014数据集的测试中,本文模型具备较强的泛化性。
结论
本文提出了一个纯粹的基于Transformer架构的密集特征提取器,使用空间池化窗口策略减小注意力计算的空间规模,并利用跳跃查询策略对编码器和解码器的特征进行了有效融合,可以较好地提高Transformer架构下的特征提取性能。
Objective
In recent years, the use of neural networks for stereo matching has become a major research topic in computer vision. Stereo matching is a classic and computationally intensive vision task that underpins advanced applications such as 3D reconstruction, autonomous driving, and augmented reality. Given a pair of rectified (distortion-corrected) stereo images, the goal is to match corresponding pixels along the epipolar lines and compute their horizontal offset, known as disparity. Many researchers have explored deep learning-based stereo matching methods and achieved promising results, with convolutional neural networks commonly used to build the feature extractors. Although convolution-based feature extractors have brought significant performance gains, such networks remain constrained by their fundamental operation, convolution. By definition, convolution is a linear operator with a limited receptive field, so obtaining a sufficiently broad contextual representation requires stacking many convolutional layers in deep architectures. This limitation is particularly pronounced in stereo matching, where captured image pairs inevitably contain large weakly textured regions, and building comprehensive global feature representations through repeated convolutional stacking demands substantial computational resources. To address this issue, we build a dense feature extraction Transformer (FET) for stereo matching that combines Transformer and convolution blocks.
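For context, the disparity mentioned above relates to scene depth through the standard rectified-stereo geometry. The following minimal sketch is the generic textbook relation, not code from this paper, and the focal length and baseline values are only illustrative of a KITTI-like rig:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Rectified-stereo relation Z = f * B / d.

    disparity_px : horizontal offset x_left - x_right of a matched pixel pair (pixels)
    focal_px     : focal length expressed in pixels
    baseline_m   : distance between the two camera centres (metres)
    """
    return focal_px * baseline_m / disparity_px


# A 50-pixel disparity with f = 720 px and a 0.54 m baseline gives roughly 7.8 m depth.
print(depth_from_disparity(50.0, 720.0, 0.54))
```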
Method
In the context of stereo matching, FET offers three key advantages. First, when processing high-resolution stereo image pairs, the spatial pooling window inside the Transformer block allows the model to maintain linear computational complexity while obtaining a sufficiently broad contextual representation, which addresses the feature scarcity caused by locally weak textures. Second, we use convolution and transposed convolution blocks to implement overlapping patch embeddings for subsampling and upsampling, so that every feature point captures as many neighboring features as possible, which facilitates fine-grained matching. Third, we apply a skip-query strategy to the feature fusion between the encoder and the decoder to transmit information efficiently. Finally, we adopt the attention-based pixel matching strategy of the stereo Transformer (STTR) to realize a purely Transformer-based architecture; this strategy truncates the summation of matching probabilities within a fixed region and outputs more reasonable occlusion confidence values.
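As an illustration of the first point, the sketch below shows one way a spatial pooling window can bound the attention cost: queries stay at full resolution while keys and values are pooled onto a fixed window, so the score matrix grows linearly with the number of pixels. Module names and sizes here are hypothetical, and the paper's actual block may differ in detail:

```python
import torch
import torch.nn as nn

class PooledWindowAttention(nn.Module):
    """Cross-attention from full-resolution queries to spatially pooled keys/values.

    Pooling K/V onto a fixed w x w grid keeps the score matrix at N x (w * w),
    i.e. linear in the number of pixels N rather than quadratic.
    (Hypothetical sketch; the paper's spatial pooling window may differ.)
    """

    def __init__(self, dim, num_heads=4, window=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(window)      # K/V context on a fixed window x window grid
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                             # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)              # (B, H*W, C): one query per pixel
        kv = self.pool(x).flatten(2).transpose(1, 2)  # (B, window*window, C): pooled context
        out, _ = self.attn(q, kv, kv)                 # broad context at O(H*W) cost
        return out.transpose(1, 2).reshape(b, c, h, w)


# Example: a high-resolution feature map keeps its shape after the attention block.
feat = torch.randn(1, 64, 96, 320)
print(PooledWindowAttention(dim=64)(feat).shape)      # torch.Size([1, 64, 96, 320])
```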
Result
In the experiments, we implemented the model with the PyTorch framework and trained it on an NVIDIA RTX 3090 GPU. Mixed precision was used during training to reduce GPU memory consumption and improve training speed. However, training a pure Transformer architecture in mixed precision proved unstable, and the loss diverged after only a few iterations. To address this issue, we modified the order of computation of the attention scores to suppress the related overflows and restructured the attention calculation based on the shift invariance of the softmax operation (the result is unchanged when a constant is subtracted from every score). Ablation experiments were conducted on the Scene Flow dataset. The results show that the proposed network achieves an absolute pixel distance of 0.33, an outlier pixel ratio of 0.92%, and an occlusion prediction intersection over union (IoU) of 98%. Additional comparative experiments on the KITTI-2015 dataset validate the effectiveness of the model in real-world driving scenarios: the proposed method achieves an average outlier percentage of 1.78%, outperforming mainstream methods such as STTR. Moreover, in tests on the KITTI-2015, MPI-Sintel, and Middlebury-2014 datasets, the proposed model demonstrates strong generalization. Finally, because publicly available datasets offer only a limited characterization of weak-texture levels, we used a clustering approach to filter images from the Scene Flow test set. Each pixel was treated as a sample with its RGB values as the feature dimensions, and the number of resulting pixel clusters in each image served as a measure of its texture strength. The images were then categorized into “difficult”, “moderate”, and “easy” cases according to the number of clusters. In the comparative analysis, our approach consistently outperformed existing methods across all three categories, with a particularly notable improvement on the “difficult” cases.
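The overflow fix relies on softmax being unchanged when the same constant is subtracted from every score. A minimal sketch of this reordering (illustrative only, not the authors' code) is:

```python
import torch

def stable_attention_probs(q, k, scale):
    """Attention probabilities ordered so that half-precision logits cannot overflow.

    softmax(s) == softmax(s + c) for any constant c, so subtracting the per-row
    maximum changes nothing mathematically while keeping every exponent <= 0.
    Scaling q before the matmul also limits the magnitude of the products.
    (Illustrative sketch, not the paper's code.)
    """
    scores = (q * scale) @ k.transpose(-2, -1)            # (B, heads, L, L)
    scores = scores - scores.amax(dim=-1, keepdim=True)   # shift invariance of softmax
    return torch.softmax(scores, dim=-1)


# Example shapes; under torch.autocast the same ordering keeps fp16 training stable.
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
print(stable_attention_probs(q, k, scale=64 ** -0.5).shape)  # torch.Size([2, 8, 128, 128])
```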
Conclusion
For the stereo matching task, we propose a feature extractor based purely on the Transformer architecture. First, we transplant the encoder-decoder architecture of the Transformer into the feature extractor, which effectively combines the inductive bias of convolutions with the global modeling capability of the Transformer. The Transformer-based feature extractor can capture a broader range of contextual representations, which partially alleviates the region ambiguity caused by locally weak textures. Furthermore, we introduce a skip-query strategy between the encoder and decoder to achieve efficient information transfer and mitigate the semantic discrepancy between them. We also design a spatial pooling window strategy to reduce the considerable computational burden introduced by overlapping patch embeddings, which keeps the attention computation of the model within linear complexity. Experimental results demonstrate significant improvements in weak-texture region prediction, occluded-region prediction, and domain generalization compared with related methods.
立体匹配; 弱纹理目标; Transformer; 空间池化窗口; 跳跃查询; 截断求和; Scene Flow; KITTI-2015
stereo matching; low-texture target; Transformer; spatial pooling windows; jump queries; truncated summation; Scene Flow; KITTI-2015
Badki A, Troccoli A, Kim K, Kautz J, Sen P and Gallo O. 2020. Bi3D: stereo depth estimation via binary classifications//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 1597-1605 [DOI: 10.1109/CVPR42600.2020.00167]
Bendig K, Schuster R and Stricker D. 2022. Self-superflow: self-supervised scene flow prediction in stereo sequences//Proceedings of 2022 IEEE International Conference on Image Processing (ICIP). Bordeaux, France: IEEE: 481-485 [DOI: 10.1109/ICIP46576.2022.9897832]
Butler D J, Wulff J, Stanley G B and Black M J. 2012. A naturalistic open source movie for optical flow evaluation//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer: 611-625 [DOI: 10.1007/978-3-642-33783-3_44]
Cao H, Wang Y Y, Chen J, Jiang D S, Zhang X P, Tian Q and Wang M N. 2022. Swin-Unet: Unet-like pure Transformer for medical image segmentation//Proceedings of 2022 European Conference on Computer Vision. Tel Aviv, Israel: Springer: 205-218 [DOI: 10.1007/978-3-031-25066-8_9]
Chang J R and Chen Y S. 2018. Pyramid stereo matching network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5410-5418 [DOI: 10.1109/CVPR.2018.00567]
Chen C F R, Fan Q F and Panda R. 2021. CrossViT: cross-attention multi-scale vision Transformer for image classification//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 347-356 [DOI: 10.1109/ICCV48922.2021.00041]
Chen L C, Zhu Y K, Papandreou G, Schroff F and Adam H. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 833-851 [DOI: 10.1007/978-3-030-01234-2_49]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2023-08-18]. https://arxiv.org/pdf/2010.11929.pdf
Emlek A and Peker M. 2023. P3SNet: parallel pyramid pooling stereo network. IEEE Transactions on Intelligent Transportation Systems, 24(10): 10433-10444 [DOI: 10.1109/TITS.2023.3276328]
Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 1440-1448 [DOI: 10.1109/ICCV.2015.169]
Huang Z Y, Norris T B and Wang P Q. 2021. ES-Net: an efficient stereo matching network [EB/OL]. [2023-08-18]. https://arxiv.org/pdf/2103.03922.pdf
Hussain M I, Rafique M A and Jeon M. 2022. RVMDE: radar validated monocular depth estimation for robotics [EB/OL]. [2023-08-18]. https://arxiv.org/pdf/2109.05265.pdf
Laga H, Jospin L V, Boussaid F and Bennamoun M. 2022. A survey on deep learning techniques for stereo-based depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4): 1738-1764 [DOI: 10.1109/TPAMI.2020.3032602]
Li J K, Wang P S, Xiong P F, Cai T, Yan Z W, Yang L, Liu J Y, Fan H Q and Liu S C. 2022. Practical stereo matching via cascaded recurrent network with adaptive correlation//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 16242-16251 [DOI: 10.1109/CVPR52688.2022.01578]
Li Z S, Liu X T, Drenkow N, Ding A, Creighton F X, Taylor R H and Unberath M. 2021. Revisiting stereo depth estimation from a sequence-to-sequence perspective with Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 6177-6186 [DOI: 10.1109/ICCV48922.2021.00614]
Lipson L, Teed Z and Deng J. 2021. RAFT-stereo: multilevel recurrent field transforms for stereo matching//Proceedings of 2021 International Conference on 3D Vision. London, UK: IEEE: 218-227 [DOI: 10.1109/3DV53792.2021.00032]
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin Transformer: hierarchical vision Transformer using shifted windows//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Loshchilov I and Hutter F. 2019. Decoupled weight decay regularization [EB/OL]. [2023-08-18]. https://arxiv.org/pdf/1711.05101.pdf
Mao J Y, Song Y Q and Liu Z. 2021. CT image classification of liver tumors based on multi-scale and deep feature extraction. Journal of Image and Graphics, 26(7): 1704-1715
毛静怡, 宋余庆, 刘哲. 2021. 多尺度深度特征提取的肝脏肿瘤CT图像分类. 中国图象图形学报, 26(7): 1704-1715 [DOI: 10.11834/jig.200041]
Mayer N, Ilg E, Hausser P, Fischer P, Cremers D, Dosovitskiy A and Brox T. 2016. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 4040-4048 [DOI: 10.1109/CVPR.2016.438]
Menze M and Geiger A. 2015. Object scene flow for autonomous vehicles//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE: 3061-3070 [DOI: 10.1109/CVPR.2015.7298925]
Peng C, Zhang X Y, Yu G, Luo G M and Sun J. 2017. Large kernel matters -- improve semantic segmentation by global convolutional network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 1743-1751 [DOI: 10.1109/CVPR.2017.189]
Ranftl R, Bochkovskiy A and Koltun V. 2021. Vision Transformers for dense prediction//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 12159-12168 [DOI: 10.1109/ICCV48922.2021.01196]
Ronneberger O, Fischer P and Brox T. 2015. U-net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Sarlin P E, DeTone D, Malisiewicz T and Rabinovich A. 2020. SuperGlue: learning feature matching with graph neural networks//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 4937-4946 [DOI: 10.1109/CVPR42600.2020.00499]
Scharstein D, Hirschmüller H, Kitajima Y, Krathwohl G, Nešić N, Wang X and Westling P. 2014. High-resolution stereo datasets with subpixel-accurate ground truth//Proceedings of the 36th German Conference on Pattern Recognition. Münster, Germany: Springer: 31-42 [DOI: 10.1007/978-3-319-11752-2_3]
Sun J M, Shen Z H, Wang Y A, Bao H J and Zhou X W. 2021. LoFTR: detector-free local feature matching with Transformers//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 8918-8927 [DOI: 10.1109/CVPR46437.2021.00881]
Tankovich V, Häne C, Zhang Y D, Kowdle A, Fanello S and Bouaziz S. 2021. HITNet: hierarchical iterative tile refinement network for real-time stereo matching//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 14357-14367 [DOI: 10.1109/CVPR46437.2021.01413]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang W H, Xie E Z, Li X, Fan D P, Song K T, Liang D, Lu T, Luo P and Shao L. 2021. Pyramid vision Transformer: a versatile backbone for dense prediction without convolutions//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 548-558 [DOI: 10.1109/ICCV48922.2021.00061]
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/CVPR.2018.00813]
Zhao H S, Shi J P, Qi X J, Wang X G and Jia J Y. 2017. Pyramid scene parsing network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 6230-6239 [DOI: 10.1109/CVPR.2017.660]
Zheng S X, Lu J C, Zhao H S, Zhu X T, Luo Z K, Wang Y B, Fu Y W, Feng J F, Xiang T, Torr P H S and Zhang L. 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 6877-6886 [DOI: 10.1109/CVPR46437.2021.00681]