TMamba: A Visual State-Space Model for Efficient Object Tracking
2024, Pages: 1-15
Online publication date: 2024-11-27
DOI: 10.11834/jig.240587
Kang Ben, Chen Xin, Zhao Jie, et al. TMamba: A Visual State-Space Model for Efficient Object Tracking[J]. Journal of Image and Graphics, 2024: 1-15.
Objective
The emergence of Transformers has significantly improved the accuracy and robustness of object tracking models, but their quadratic computational complexity makes these models expensive to run and difficult to deploy in practical scenarios. In addition, Transformer-based models incur high GPU memory consumption, which limits sequence-level training of trackers. To address these problems, this paper proposes an object tracking model based on a visual state-space model.
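As a rough per-layer cost comparison (standard asymptotic estimates, not measurements from this paper), for a token sequence of length $N$, feature dimension $d$, and state size $n$:

$$\text{self-attention: } \mathcal{O}(N^2 d) \text{ time and } \mathcal{O}(N^2) \text{ attention-map memory}; \qquad \text{state-space scan: } \mathcal{O}(N d\, n) \text{ time and memory}.$$

The linear dependence on $N$ is what allows more frames, and hence more tokens, to fit into one batch and makes sequence-level training practical.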
Method
Building on the visual Mamba framework, this paper proposes the TMamba algorithm. Compared with Transformer-based trackers, TMamba achieves superior performance while substantially reducing computation and GPU memory usage, offering a new route toward sequence-level training of tracking models. The core component of TMamba is the feature fusion module, which combines the semantic information of deep features with the fine-grained details of shallow features, providing the prediction head with more precise features and thereby improving prediction accuracy. In addition, a dual image scanning strategy is proposed to bridge the gap between visual state-space models and the tracking domain: it jointly scans the template and search-region images so that the visual state-space model better fits the tracking task.
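For background, the discretized recurrence underlying Mamba-style visual state-space blocks can be sketched as follows (the standard S4/Mamba formulation, not a TMamba-specific design), where $x_t$ is the $t$-th token of the scanned sequence, $h_t$ the hidden state, and $y_t$ the output:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} \approx \Delta B,$$
$$h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t + D\, x_t.$$

The whole sequence is processed by a single linear-time scan; in Mamba, $\Delta$, $B$, and $C$ are additionally made input-dependent (the selective mechanism), and the dual image scanning strategy determines the order in which template and search-region tokens enter this scan.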
Result
Based on the proposed feature fusion module and dual image scanning strategy, a family of state-space-model-based trackers is developed and comprehensively evaluated on seven datasets. The results show that TMamba achieves strong performance on all datasets while reducing both computation and parameter count. For example, TMamba-B reaches a 66% success rate (AUC) on LaSOT, surpassing most Transformer-based models, with only 50.7M parameters and 14.2 GFLOPs.
Conclusion
The proposed TMamba algorithm explores the feasibility of using state-space models for object tracking. Across multiple datasets, TMamba matches the performance of Transformer-based trackers with fewer parameters and less computation. Its low parameter count, low computational cost, and low memory footprint are expected to further facilitate the practical deployment of tracking models and to advance sequence-level training of trackers.
Objective
The emergence of Transformer models has revolutionized the field of object tracking, significantly enhancing the accuracy and robustness of these models. Transformers, with their self-attention mechanisms, have been shown to capture long-range dependencies and complex relationships within data, making them a powerful tool for various computer vision tasks, including object tracking. However, a critical drawback of Transformer-based object tracking models is their computational complexity, which scales quadratically with the length of the input sequence. This characteristic imposes a substantial computational burden, especially in practical scenarios where efficiency is paramount. Real-world applications require models that not only perform well but also operate with minimal computational cost, fewer parameters, and fast response times. Unfortunately, the high computational demands and parameter counts of Transformer-based models render them less suitable for these applications. Moreover, Transformer-based object tracking models typically exhibit high memory consumption, which poses additional challenges in video-level object tracking tasks. High memory usage restricts the number of video frames that can be processed simultaneously, thereby limiting the ability to capture sufficient temporal information needed for effective tracking. This limitation hinders the development of video-level tracking models, as the inability to sample enough frames can lead to suboptimal performance and reduced tracking accuracy. To address these challenges, this paper introduces a novel object tracking model based on visual state space models.
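To make the memory argument concrete, consider an illustrative back-of-the-envelope estimate (example numbers, not figures from this paper): a softmax attention layer materializes an $N \times N$ attention map per head, so with $N = 4096$ tokens, $h = 12$ heads, and 4-byte floats it stores roughly

$$12 \times 4096^2 \times 4 \text{ bytes} \approx 0.8 \text{ GB}$$

per layer per sample during training, and this cost grows quadratically as more frames are appended to the token sequence. A linear-time state-space scan never materializes an $N \times N$ map, which is why it leaves room for longer training clips.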
Method
Building upon the visual Mamba framework, we propose the TMamba algorithm, which leverages the strengths of state space models for object tracking. The TMamba model offers a promising alternative to Transformer-based tracking models by achieving superior performance with significantly reduced computational load and memory usage. This reduction is crucial for enabling the deployment of object tracking models in resource-constrained environments, such as edge devices and real-time systems. The core component of TMamba is the Feature Fusion Module, which is designed to integrate information from different feature hierarchies within the network. Specifically, the Feature Fusion Module combines the rich semantic information from deep features with the detailed, high-resolution information from shallow features. By fusing these features, the module produces a multi-level representation that provides the prediction head with more accurate and comprehensive information, leading to improved prediction accuracy. A key innovation of TMamba is the introduction of a Dual Image Scanning Strategy, which addresses the unique challenges of adapting visual state space models to the tracking domain. In visual state space models, the approach to scanning images is crucial, as it directly impacts the model's ability to process and interpret visual data. Unlike classification and detection tasks, where a single image is input into the network, object tracking requires the simultaneous processing of multiple images, typically a template and a search region. How these images are scanned and fed into the network is a critical factor in the model's performance. Our proposed Dual Image Scanning Strategy involves jointly scanning the template and search region images, allowing the visual state space model to better accommodate the specific requirements of object tracking. This strategy enhances the model's ability to learn spatial and temporal dependencies across frames, leading to more accurate and reliable tracking.
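A minimal PyTorch-style sketch of these two ideas, using hypothetical module names (`DualImageScan`, `FeatureFusion`) and a GRU as a lightweight stand-in for the selective state-space scan; it illustrates the joint token arrangement and the deep/shallow fusion rather than reproducing the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualImageScan(nn.Module):
    """Jointly scan template and search-region tokens as one sequence.
    A GRU is used here only as a lightweight stand-in for a selective
    state-space (Mamba) scan; both process tokens in linear time."""
    def __init__(self, dim):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)   # forward scan
        self.bwd = nn.GRU(dim, dim, batch_first=True)   # backward scan

    def forward(self, z_tokens, x_tokens):
        # z_tokens: (B, Nz, C) template tokens; x_tokens: (B, Nx, C) search tokens
        seq = torch.cat([z_tokens, x_tokens], dim=1)     # joint token sequence
        out_f, _ = self.fwd(seq)                         # scan left-to-right
        out_b, _ = self.bwd(torch.flip(seq, dims=[1]))   # scan right-to-left
        out = out_f + torch.flip(out_b, dims=[1])        # merge both directions
        n_z = z_tokens.shape[1]
        return out[:, :n_z], out[:, n_z:]                # split back into z / x

class FeatureFusion(nn.Module):
    """Fuse a deep (semantic, low-resolution) map with a shallow
    (detailed, high-resolution) map before the prediction head."""
    def __init__(self, deep_c, shallow_c, out_c):
        super().__init__()
        self.proj = nn.Conv2d(deep_c + shallow_c, out_c, kernel_size=1)

    def forward(self, deep, shallow):
        # deep: (B, Cd, H/2, W/2); shallow: (B, Cs, H, W)
        deep_up = F.interpolate(deep, size=shallow.shape[-2:],
                                mode="bilinear", align_corners=False)
        return self.proj(torch.cat([deep_up, shallow], dim=1))

if __name__ == "__main__":
    scan = DualImageScan(dim=64)
    z, x = torch.randn(2, 64, 64), torch.randn(2, 256, 64)
    z_out, x_out = scan(z, x)                            # (2, 64, 64), (2, 256, 64)
    fuse = FeatureFusion(deep_c=64, shallow_c=32, out_c=64)
    fused = fuse(torch.randn(2, 64, 8, 8), torch.randn(2, 32, 16, 16))
    print(z_out.shape, x_out.shape, fused.shape)
```

In the actual model, the scan would presumably be a Mamba/VMamba selective-scan block rather than a GRU, and the fused multi-level map would be the input to the prediction head.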
Result
To evaluate the effectiveness of the proposed TMamba algorithm, we developed a series of object tracking models based on state space models and conducted extensive experiments on seven benchmark datasets. These datasets include LaSOT, TrackingNet, GOT-10k, and others, which are widely used in the object tracking community for performance evaluation. The results demonstrate that TMamba consistently achieves outstanding performance across all datasets, with significant reductions in computational cost and parameter count compared to Transformer-based models. For instance, TMamba-B, one of the configurations of our model, achieves a 66% AUC score on the LaSOT dataset, an 82.3% AUC score on TrackingNet, and a 72% AUC score on GOT-10k. These results not only surpass those of many Transformer-based models but also highlight the efficiency of TMamba in terms of computational resources. TMamba-B contains only 50.7 million parameters and requires just 14.2 GFLOPs for processing, making it one of the most efficient models in its class. This efficiency is achieved without compromising on accuracy, demonstrating the potential of state space models for high-performance object tracking. Further analysis of the experimental results reveals several key insights. First, the Feature Fusion Module plays a crucial role in enhancing the model's performance by effectively combining information from different feature levels. This fusion allows TMamba to leverage the strengths of both deep and shallow features, resulting in a more robust representation that is well-suited for tracking diverse objects under various conditions. Second, the Dual Image Scanning Strategy proves to be highly effective in bridging the gap between visual state space models and the tracking domain. By jointly scanning the template and search region images, this strategy enables TMamba to better capture spatial and temporal relationships, which are essential for accurate tracking.
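For readers unfamiliar with the reported metric, the success rate (AUC) used on LaSOT-style benchmarks is the area under the success plot over per-frame IoU thresholds; a minimal sketch of the standard one-pass-evaluation computation (illustrative helper, not code from the paper, assuming the usual 21-point threshold sweep from 0 to 1):

```python
import numpy as np

def success_auc(ious, thresholds=np.linspace(0.0, 1.0, 21)):
    """AUC of the success plot: for each overlap threshold, the fraction of
    frames whose predicted-box IoU with the ground truth exceeds it; the
    score is the mean of these fractions over the threshold sweep."""
    ious = np.asarray(ious, dtype=float)
    success = [(ious > t).mean() for t in thresholds]
    return float(np.mean(success))

# Example: per-frame IoUs of a short sequence
print(success_auc([0.82, 0.75, 0.64, 0.90, 0.58]))
```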
Conclusion
This paper introduces the TMamba algorithm and investigates the feasibility of employing state-space models in the domain of object tracking. The results demonstrate that TMamba not only matches but in some cases surpasses the performance of Transformer-based object tracking models across multiple datasets. Importantly, TMamba achieves these results with a significantly reduced parameter count and lower computational complexity, making it a more practical choice for real-world applications. The characteristics of TMamba—namely, its low parameter count, minimal computational demands, and reduced memory usage—suggest that it has considerable potential to advance the practical application of object tracking models. By addressing the limitations of existing Transformer-based approaches, TMamba paves the way for the development of more efficient and scalable video-level object tracking solutions.
Keywords: single-object tracking; state-space model; multi-scale feature fusion; sequential training; memory-efficient model
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2020. An image is worth 16x16 words: Transformers for image recognition at scale[EB/OL]. [2020-10-22]. https://arxiv.org/pdf/2010.11929.pdf
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Advances in Neural Information Processing Systems, 30.
Gu A, Goel K and Ré C. 2022. Efficiently Modeling Long Sequences with Structured State Spaces//Proceedings of International Conference on Learning Representations.
Gu A and Dao T. 2023. Mamba: Linear-Time Sequence Modeling with Selective State Spaces[EB/OL]. [2023-12-01]. https://arxiv.org/pdf/2312.00752.pdf
Ba J L, Kiros J R and Hinton G E. 2016. Layer normalization[EB/OL]. [2016-07-21]. https://arxiv.org/pdf/1607.06450.pdf
Chen X, Yan B, Zhu J W, Wang D, Yang X Y and Lu H C. 2021. Transformer tracking//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 8126-8135 [DOI: 10.1109/CVPR46437.2021.00803]
Cui Y, Cheng J, Wang L and Wu G. 2022. MixFormer: End-to-end tracking with iterative mixed attention//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 13608-13618 [DOI: 10.48550/arXiv.2203.11082]
Mayer C, Danelljan M, Paudel D P and Van Gool L. 2021. Learning Target Candidate Association to Keep Track of What Not to Track//Proceedings of IEEE International Conference on Computer Vision. IEEE: 13424-13434 [DOI: 10.1109/ICCV48922.2021.01319]
Dai K N, Zhang Y H, Li J H, Lu H C and Yang X Y. 2020. High-Performance Long-Term Tracking With Meta-Updater//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 6297-6306 [DOI: 10.1109/CVPR42600.2020.00633]
Fan H, Lin L, Yang F, Chu P, Deng G, Yu S J, Bai H X, Xu Y, Liao C Y and Ling H B. 2019. LaSOT: A high-quality benchmark for large-scale single object tracking//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 5374-5383 [DOI: 10.1109/CVPR.2019.00552]
Fan H, Bai H X, Lin L T, Yang F, Chu P, Deng G, Yu S J, Huang M Z, Liu J H and Xu Y. 2021. LaSOT: A High-Quality Large-Scale Single Object Tracking Benchmark. International Journal of Computer Vision, 129: 439-461 [DOI: 10.1007/s11263-020-01387-y]
Bhat G, Danelljan M, Van Gool L and Timofte R. 2019. Learning discriminative model prediction for tracking//Proceedings of IEEE International Conference on Computer Vision. IEEE: 6182-6191 [DOI: 10.1109/ICCV.2019.00628]
Gao S Y, Zhou C L, Ma C, Wang X G and Yuan J S. 2022. AiATrack: Attention in Attention for Transformer Visual Tracking//Proceedings of European Conference on Computer Vision. Springer: 146-164 [DOI: 10.1007/978-3-031-20047-2_9]
Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I and Savarese S. 2019. Generalized intersection over union: A metric and a loss for bounding box regression//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 658-666 [DOI: 10.1109/CVPR.2019.00075]
Huang L, Zhao X and Huang K. 2019. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5): 1562-1577 [DOI: 10.1109/TPAMI.2019.2957464]
He K M, Chen X L, Xie S N, Li Y H, Dollár P and Girshick R. 2022. Masked autoencoders are scalable vision learners//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 16000-16009 [DOI: 10.1109/CVPR52688.2022.01553]
Kiani G H, Fagg A, Huang C, Ramanan D and Lucey S. 2017. Need for Speed: A Benchmark for Higher Frame Rate Object Tracking//Proceedings of IEEE International Conference on Computer Vision. IEEE: 1134-1143 [DOI: 10.1109/ICCV.2017.128]
Li B, Wu W, Wang Q, Zhang F Y, Xing J L and Yan J J. 2019. SiamRPN++: Evolution of siamese visual tracking with very deep networks//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 4282-4291 [DOI: 10.48550/arXiv.1812.11703]
Li B, Yan J J, Wu W, Zhu Z and Hu X L. 2018. High performance visual tracking with siamese region proposal network//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 8971-8980 [DOI: 10.1109/CVPR.2018.00935]
Bertinetto L, Valmadre J, Henriques J F, Vedaldi A and Torr P H S. 2016. Fully-convolutional siamese networks for object tracking//Proceedings of European Conference on Computer Vision. Springer: 850-865 [DOI: 10.1007/978-3-319-48881-3_56]
Liu Y, Tian Y J, Zhao Y Z, Yu H T, Xie L X, Wang Y W, Ye Q X and Liu Y F. 2024. VMamba: Visual State Space Model[EB/OL]. [2024-01-18]. https://arxiv.org/pdf/2401.10166.pdf
Lin L T, Fan H, Zhang Z P, Xu Y and Ling H B. 2022. SwinTrack: A Simple and Strong Baseline for Transformer Tracking. Advances in Neural Information Processing Systems, 35.
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows//Proceedings of IEEE International Conference on Computer Vision. IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Li K C, Li X H, Wang Y, He Y N, Wang Y L, Wang L M and Qiao Y. 2024. VideoMamba: State Space Model for Efficient Video Understanding[EB/OL]. [2024-03-11]. https://arxiv.org/pdf/2403.06977.pdf
Lin T Y, Goyal P, Girshick R, He K M and Dollár P. 2017. Focal Loss for Dense Object Detection//Proceedings of IEEE International Conference on Computer Vision. IEEE: 2999-3007 [DOI: 10.1109/ICCV.2017.324]
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C L. 2014. Microsoft COCO: Common Objects in Context//Proceedings of European Conference on Computer Vision. Springer: 740-755 [DOI: 10.1007/978-3-319-10602-1_48]
Mueller M, Smith N and Ghanem B. 2016. A Benchmark and Simulator for UAV Tracking//Proceedings of European Conference on Computer Vision. Springer: 445-461 [DOI: 10.1007/978-3-319-46448-0_27]
Danelljan M, Bhat G, Khan F S and Felsberg M. 2017. ECO: Efficient Convolution Operators for Tracking//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 6931-6939 [DOI: 10.1109/CVPR.2017.733]
Danelljan M, Bhat G, Khan F S and Felsberg M. 2019. ATOM: Accurate tracking by overlap maximization//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 4660-4669 [DOI: 10.1109/CVPR.2019.00479]
Danelljan M, Van Gool L and Timofte R. 2020. Probabilistic regression for visual tracking//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 7183-7192 [DOI: 10.1109/CVPR42600.2020.00721]
Müller M, Bibi A, Giancola S, Alsubaihi S and Ghanem B. 2018. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild//Proceedings of European Conference on Computer Vision. Springer: 300-317 [DOI: 10.1007/978-3-030-01246-5_19]
Nam H and Han B. 2016. Learning Multi-domain Convolutional Neural Networks for Visual Tracking//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 4293-4302 [DOI: 10.1109/CVPR.2016.465]
Voigtlaender P, Luiten J, Torr P H S and Leibe B. 2020. Siam R-CNN: Visual Tracking by Re-Detection//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 6577-6587 [DOI: 10.1109/CVPR42600.2020.00661]
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z H, Karpathy A, Khosla A and Bernstein M. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3): 211-252 [DOI: 10.1007/s11263-015-0816-y]
Tao R, Gavves E and Smeulders A W M. 2016. Siamese Instance Search for Tracking//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 1420-1429 [DOI: 10.1109/CVPR.2016.158]
Ruan J C and Xiang S C. 2024. VM-UNet: Vision Mamba UNet for Medical Image Segmentation[EB/OL]. [2024-02-04]. https://arxiv.org/pdf/2402.02491.pdf
Ronneberger O, Fischer P and Brox T. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation//Proceedings of Medical Image Computing and Computer-Assisted Intervention. Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Elfwing S, Uchibe E and Doya K. 2018. Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. Neural Networks, 107: 3-11 [DOI: 10.1016/j.neunet.2017.12.012]
Wang Q, Zhang L, Bertinetto L, Hu W M and Torr P H S. 2019. Fast Online Object Tracking and Segmentation: A Unifying Approach//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 1328-1338 [DOI: 10.1109/CVPR.2019.00142]
Wang Z Y, Zheng J Q, Zhang Y C, Cui G and Li L. 2024. Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation[EB/OL]. [2024-02-07]. https://arxiv.org/pdf/2402.05079.pdf
Wang X, Shu X J, Zhang Z P, Jiang B, Wang Y W, Tian Y H and Wu F. 2021. Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 13763-13773 [DOI: 10.1109/CVPR46437.2021.01355]
Xu Y D, Wang Z Y, Li Z X, Yuan Y and Yu G. 2020. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines//Proceedings of the AAAI Conference on Artificial Intelligence. AAAI: 12549-12556 [DOI: 10.1609/aaai.v34i07.6944]
Yan B, Peng H W, Fu J L, Wang D and Lu H C. 2021. Learning spatio-temporal transformer for visual tracking//Proceedings of IEEE International Conference on Computer Vision. IEEE: 10448-10457 [DOI: 10.1109/ICCV48922.2021.01028]
Yan B, Zhang X Y, Wang D, Lu H C and Yang X Y. 2021. Alpha-Refine: Boosting tracking performance by precise bounding box estimation//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 5289-5298 [DOI: 10.1109/CVPR46437.2021.00525]
Ye B T, Chang H, Ma B P, Shan S G and Chen X L. 2022. Joint feature learning and relation modeling for tracking: A one-stream framework//Proceedings of European Conference on Computer Vision. Springer: 341-357 [DOI: 10.1007/978-3-031-20047-2_20]
Zhu L H, Liao B C, Zhang Q, Wang X L, Liu W Y and Wang X G. 2024. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model[EB/OL]. [2024-01-17]. https://arxiv.org/pdf/2401.09417.pdf
Zhang Z P, Peng H W, Fu J L, Li B and Hu W M. 2020. Ocean: Object-aware anchor-free tracking//Proceedings of European Conference on Computer Vision. Springer: 771-787 [DOI: 10.1007/978-3-030-58589-1_46]
Zhu J Z, Wang D and Lu H C. 2019. Learning background-temporal-aware correlation filter for real-time visual tracking. Journal of Image and Graphics, 24(4): 536-549 [DOI: 10.11834/jig.180320]
Zhang Z P, Liu Y H, Wang X, Li B and Hu W M. 2021. Learn to Match: Automatic Matching Network Design for Visual Tracking//Proceedings of IEEE International Conference on Computer Vision. IEEE: 13319-13328 [DOI: 10.1109/ICCV48922.2021.01309]