融合上下文感知注意力的Transformer目标跟踪方法
Context-aware attention fused Transformer tracking
2025年30卷第1期 页码: 212-224
纸质出版日期: 2025-01-16
DOI: 10.11834/jig.240084
徐晗, 董仕豪, 张家伟, 郑钰辉. 融合上下文感知注意力的Transformer目标跟踪方法[J]. 中国图象图形学报, 2025,30(1):212-224.
XU Han, DONG Shihao, ZHANG Jiawei, ZHENG Yuhui. Context-aware attention fused Transformer tracking[J]. Journal of Image and Graphics, 2025, 30(1): 212-224.
目的
近年来,Transformer跟踪器取得突破性的进展,其中自注意力机制发挥了重要作用。当前,自注意力机制中独立关联计算易导致权重不明显现象,限制了跟踪方法性能。为此,提出了一种融合上下文感知注意力的Transformer目标跟踪方法。
方法
首先,引入SwinTransformer(hierarchical vision Transformer using shifted windows)以提取视觉特征,利用跨尺度策略整合深层与浅层的特征信息,提高网络对复杂场景中目标表征能力。其次,构建了基于上下文感知注意力的编解码器,充分融合模板特征和搜索特征。上下文感知注意力使用嵌套注意计算,加入分配权重的目标掩码,可有效抑制由相关性计算不准确导致的噪声。最后,使用角点预测头估计目标边界框,通过相似度分数结果更新模板图像。
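下面给出一段概念性代码草图(PyTorch 风格,非本文官方实现;函数名、张量形状与掩码约定均为笔者为说明而假设),用于示意"在查询与键的相关性上进行嵌套注意计算,并叠加分配权重的目标掩码"这一思路:对原始相关性再做一次一致性加权,并利用模板目标区域生成的掩码抑制背景位置,以减轻相关性计算不准确带来的噪声。

```python
import torch

def context_aware_attention(q, k, v, target_mask=None):
    """概念性示意:嵌套(二阶)注意 + 目标掩码加权。
    q, k, v: (B, N, C);target_mask: (B, N),目标区域为 1、背景为 0(假设的输入约定)。
    """
    d = q.size(-1)
    # 一阶相关性:标准缩放点积注意
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (B, N, N)
    attn = scores.softmax(dim=-1)
    # 嵌套计算:用相关性矩阵自身的相似度对权重再加权,
    # 使相互一致的键获得更明显的权重(对"嵌套注意"的一种简化理解)
    context = attn @ attn.transpose(-2, -1)               # (B, N, N)
    refined = context.softmax(dim=-1) * attn
    # 目标掩码:按键的位置抑制背景处的权重
    if target_mask is not None:
        refined = refined * target_mask.unsqueeze(1)      # (B, 1, N) 广播
    refined = refined / refined.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    return refined @ v                                     # (B, N, C)
```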
结果
在TrackingNet(large-scale object tracking dataset)、LaSOT(large-scale single object tracking)和GOT-10K(generic object tracking benchmark)等多个公开数据集上开展大量测试,本文方法均取得了优异性能。在GOT-10K上平均重叠率达到73.9%,在所有对比方法中排在第1位;在LaSOT上的AUC(area under curve)得分和精准度为0.687、0.749,与性能第2的ToMP(transforming model prediction for tracking)相比分别提高了1.1%和2.7%;在TrackingNet上的AUC得分和精准度为0.831、0.807,较第 2 名分别高出 0.8%和0.3%。
结论
所提方法利用上下文感知注意力聚焦特征序列中的目标信息,提高了向量交互的精确性,可有效应对快速运动、相似物干扰等问题,提升了跟踪性能。
Objective
Visual target tracking, as one of the key tasks in the field of computer vision, aims mainly to predict the size and position of a target in a given video sequence. In recent years, target tracking has been widely used in autonomous driving, unmanned aerial vehicles (UAVs), military activities, and intelligent surveillance. Although numerous excellent methods have emerged in this field, multifaceted challenges remain, including, but not limited to, shape variation, occlusion, motion blur, and interference from proximate objects. Current target tracking methods fall into two main groups: correlation-filter-based and deep-learning-based. The former formulates tracking as a correlation computation in the signal domain of the search image; however, the hand-crafted features it relies on can hardly exploit image representation information fully, which greatly limits tracking performance. Deep learning, by virtue of its powerful visual representation capabilities, has since brought significant progress to target tracking, and Transformer trackers in particular have achieved breakthroughs, in which the self-attention mechanism plays an important role. However, the independent correlation calculation in the self-attention mechanism tends to produce ambiguous attention weights, which hampers the overall performance of tracking methods. For this reason, a Transformer target tracking method incorporating context-aware attention is proposed.
Method
First, the hierarchical vision Transformer using shifted windows (SwinTransformer) is introduced to extract visual features, and a cross-scale strategy is utilized to integrate deep and shallow feature information, improving the network's ability to characterize targets in complex scenes. The cross-scale fusion strategy obtains key information at different scales and captures the diverse texture features of the template and search images, which helps the tracking network better understand the relationship between the target and the background. Second, a context-aware attention-based encoder-decoder is constructed to fully fuse template features and search features. To address the inaccurate correlation computation that occurs in the attention mechanism, nested computation is applied to query-key pairs to focus on the target information in the input sequence, and a target mask for assigning weights is incorporated. This design effectively suppresses the noise caused by inaccurate correlation computation, seeks consistency among the feature vectors, and promotes better interaction of feature information. The encoder takes the features output by the backbone as input and uses global contextual information to reinforce the original features, enabling the model to learn discriminative features for object localization. The decoder takes as input the target query and the sequence of enhanced features from the encoder, using a two-branch cross-attention design. One branch computes cross attention between the target query and the encoder output, attending to features at all positions of the template and search regions. Finally, a corner prediction head is used to estimate the target bounding box, and the template image is updated according to the similarity score. Specifically, the decoded features are fed into a fully convolutional network that outputs two probability maps for the upper-left and lower-right corners of the target bounding box. The predicted box coordinates are then obtained by computing the expectation of the probability distributions of the two corners.
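The corner prediction step described above can be illustrated with a minimal soft-argmax sketch (illustrative code under assumed tensor shapes, not the authors' released implementation): each corner probability map is normalized with a softmax, and the expected x and y coordinates under that distribution give the corner position.

```python
import torch

def soft_argmax_corner(score_map: torch.Tensor) -> torch.Tensor:
    """Expected (x, y) of one corner from a score map of shape (B, H, W)."""
    b, h, w = score_map.shape
    prob = score_map.flatten(1).softmax(dim=1).view(b, h, w)   # normalize to a distribution
    xs = torch.arange(w, dtype=prob.dtype, device=prob.device)
    ys = torch.arange(h, dtype=prob.dtype, device=prob.device)
    exp_x = (prob.sum(dim=1) * xs).sum(dim=1)                  # marginal over rows, then E[x]
    exp_y = (prob.sum(dim=2) * ys).sum(dim=1)                  # marginal over cols, then E[y]
    return torch.stack([exp_x, exp_y], dim=1)                  # (B, 2)

# The box is formed from the two corner expectations, e.g.:
# tl = soft_argmax_corner(tl_map); br = soft_argmax_corner(br_map)
# box = torch.cat([tl, br], dim=1)   # (x1, y1, x2, y2) in feature-map coordinates
```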
Result
Training pairs are randomly selected from the common objects in context (COCO), large-scale object tracking dataset (TrackingNet), large-scale single object tracking (LaSOT), and generic object tracking benchmark (GOT-10K) datasets to train the tracking model in this paper. The model's minimum training data unit is a triplet consisting of two templates and a search image. The model was trained for 500 epochs with 6 × 10⁴ triplets per epoch. The initial learning rates of the backbone network and the remaining parameters are 10⁻⁵ and 10⁻⁴, respectively, and the learning rate is decreased by a factor of 10 after 400 epochs. Extensive testing was conducted on TrackingNet; LaSOT; GOT-10K; the object tracking benchmark (OTB100); a benchmark and simulator for UAV tracking (UAV123); and the publicly available need for speed (NfS) dataset, and the results were compared with those of several current state-of-the-art tracking methods; the proposed method achieved excellent performance on all of them. On GOT-10K, the average overlap (AO) reaches 73.9%, and SR0.5 and SR0.75, the success rates at overlap thresholds of 0.5 and 0.75, reach 84.6% and 69.8%, respectively. On LaSOT, the area under the curve (AUC) is 68.7%, and the normalized precision rate and precision rate are 78.7% and 74.3%, respectively. On TrackingNet, the success rate is 83.1%, the normalized precision rate is 87.7%, and the precision rate is 80.7%. The success rates on the NfS, OTB100, and UAV123 datasets are 68.1%, 69.6%, and 68.3%, respectively. These experimental results demonstrate that the proposed method has good generalization ability. Its effectiveness is further verified by ablation experiments on the GOT-10K, LaSOT, and TrackingNet datasets, which evaluate the effect of different modules on the tracking method. With three feature extraction networks (ResNet-50, SwinTrack-Base, and the cross-scale fusion SwinTransformer), the tracker was compared with and without the context-aware attention module. The comparison shows that adding the context-aware attention module to the SwinTransformer backbone effectively improves tracking performance.
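For reference, the GOT-10K metrics reported above follow the benchmark's standard definitions: the average overlap (AO) is the mean IoU between predicted and ground-truth boxes over all frames, and SR0.5 / SR0.75 are the fractions of frames whose IoU exceeds 0.5 / 0.75. A minimal sketch of these computations (illustrative code, not the official evaluation toolkit) follows.

```python
import numpy as np

def iou(box_a: np.ndarray, box_b: np.ndarray) -> np.ndarray:
    """Per-frame IoU for boxes given as (N, 4) arrays in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box_a[:, 0], box_b[:, 0])
    y1 = np.maximum(box_a[:, 1], box_b[:, 1])
    x2 = np.minimum(box_a[:, 2], box_b[:, 2])
    y2 = np.minimum(box_a[:, 3], box_b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box_a[:, 2] - box_a[:, 0]) * (box_a[:, 3] - box_a[:, 1])
    area_b = (box_b[:, 2] - box_b[:, 0]) * (box_b[:, 3] - box_b[:, 1])
    return inter / np.clip(area_a + area_b - inter, 1e-9, None)

def ao_and_sr(pred: np.ndarray, gt: np.ndarray):
    """Average overlap and success rates at IoU thresholds 0.5 and 0.75."""
    overlaps = iou(pred, gt)
    return overlaps.mean(), (overlaps > 0.5).mean(), (overlaps > 0.75).mean()
```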
Conclusion
The proposed method utilizes context-aware attention to focus on the target information in the feature sequence, which improves the accuracy of vector interaction. It copes effectively with fast motion and interference from similar objects, thereby improving tracking performance. However, the proposed method uses Transformers in both the feature extraction and feature fusion stages, which leads to a large number of parameters and requires more training time, resulting in low computational efficiency. In the future, the two stages could be merged to integrate feature extraction and fusion.
计算机视觉;目标跟踪;上下文感知注意力;Transformer;特征融合
computer vision; object tracking; context-aware attention; Transformer; feature fusion
Bertinetto L, Valmadre J, Henriques J F, Vedaldi A and Torr P H S. 2016. Fully-convolutional siamese networks for object tracking//Proceedings of 2016 European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 850-865 [DOI: 10.1007/978-3-319-48881-3_56]
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A and Zagoruyko S. 2020. End-to-end object detection with Transformers//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 213-229 [DOI: 10.1007/978-3-030-58452-8_13]
Chen X, Yan B, Zhu J W, Wang D, Yang X Y and Lu H C. 2021. Transformer tracking//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 8122-8131 [DOI: 10.1109/CVPR46437.2021.00803]
Cui Y T, Jiang C, Wang L M and Wu G S. 2022. MixFormer: end-to-end tracking with iterative mixed attention//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 13598-13608 [DOI: 10.1109/CVPR52688.2022.01324]
Danelljan M, Bhat G, Khan F S and Felsberg M. 2019. ATOM: accurate tracking by overlap maximization//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4655-4664 [DOI: 10.1109/CVPR.2019.00479]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16×16 words: Transformers for image recognition at scale [EB/OL]. [2024-01-23]. https://arxiv.org/pdf/2010.11929.pdf
Fan H, Lin L T, Yang F, Chu P, Deng G, Yu S J, Bai H X, Xu Y, Liao C Y and Ling H B. 2019. LaSOT: a high-quality benchmark for large-scale single object tracking//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5369-5378 [DOI: 10.1109/CVPR.2019.00552]
Fu Z H, Fu Z H, Liu Q J, Cai W R and Wang Y H. 2022. SparseTT: visual tracking with sparse Transformers//Proceedings of the 31st International Joint Conference on Artificial Intelligence. Vienna, Austria: IJCAI: 905-912 [DOI: 10.24963/ijcai.2022/127]
Galoogahi H K, Fagg A, Huang C, Ramanan D and Lucey S. 2017. Need for speed: a benchmark for higher frame rate object tracking//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 1134-1143 [DOI: 10.1109/ICCV.2017.128]
Gao S Y, Zhou C L and Zhang J. 2023. Generalized relation modeling for Transformer tracking//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 18686-18695 [DOI: 10.1109/CVPR52729.2023.01792]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Hu X T, Zhong B N, Liang Q H, Zhang S P, Li N, Li X X and Ji R R. 2024. Transformer tracking via frequency fusion. IEEE Transactions on Circuits and Systems for Video Technology, 34(2): 1020-1031 [DOI: 10.1109/TCSVT.2023.3289624]
Huang L H, Zhao X and Huang K Q. 2021. GOT-10K: a large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5): 1562-1577 [DOI: 10.1109/TPAMI.2019.2957464]
Krizhevsky A, Sutskever I and Hinton G E. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6): 84-90 [DOI: 10.1145/3065386]
Li B, Wu W, Wang Q, Zhang F Y, Xing J L and Yan J J. 2019. SiamRPN++: evolution of Siamese visual tracking with very deep networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4277-4286 [DOI: 10.1109/CVPR.2019.00441]
Li B, Yan J J, Wu W, Zhu Z and Hu X L. 2018. High performance visual tracking with Siamese region proposal network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8971-8980 [DOI: 10.1109/CVPR.2018.00935]
Li X, Zha Y F, Zhang T Z, Cui Z, Zuo W M, Hou Z Q, Lu H C and Wang H Z. 2019. Survey of visual object tracking algorithms based on deep learning. Journal of Image and Graphics, 24(12): 2057-2080
李玺, 查宇飞, 张天柱, 崔振, 左旺孟, 侯志强, 卢湖川, 王菡子. 2019. 深度学习的目标跟踪算法综述. 中国图象图形学报, 24(12): 2057-2080 [DOI: 10.11834/jig.190372]
Lin L T, Fan H, Zhang Z P, Xu Y and Ling H B. 2022. SwinTrack: a simple and strong baseline for Transformer tracking [EB/OL]. [2024-01-23]. https://arxiv.org/pdf/2112.00995.pdf
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C L. 2014. Microsoft COCO: common objects in context//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 740-755 [DOI: 10.1007/978-3-319-10602-1_48]
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin Transformer: hierarchical vision Transformer using shifted windows//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Loshchilov I and Hutter F. 2019. Decoupled weight decay regularization [EB/OL]. [2024-01-23]. https://arxiv.org/pdf/1711.05101.pdf
Mayer C, Danelljan M, Bhat G, Paul M, Paudel D P, Yu F and Van Gool L. 2022. Transforming model prediction for tracking//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 8721-8730 [DOI: 10.1109/CVPR52688.2022.00853]
Mayer C, Danelljan M, Paudel D P and Van Gool L. 2021. Learning target candidate association to keep track of what not to track//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 13424-13434 [DOI: 10.1109/ICCV48922.2021.01319]
Meng L and Li C X. 2019. Brief review of object tracking algorithms in recent years: correlated filtering and deep learning. Journal of Image and Graphics, 24(7): 1011-1016
孟琭, 李诚新. 2019. 近年目标跟踪算法短评——相关滤波与深度学习. 中国图象图形学报, 24(7): 1011-1016 [DOI: 10.11834/jig.190111]
Mueller M, Smith N and Ghanem B. 2016. A benchmark and simulator for UAV tracking//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 445-461 [DOI: 10.1007/978-3-319-46448-0_27]
Müller M, Bibi A, Giancola S, Alsubaihi S and Ghanem B. 2018. TrackingNet: a large-scale dataset and benchmark for object tracking in the wild//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 310-327 [DOI: 10.1007/978-3-030-01246-5_19]
Nam H and Han B. 2016. Learning multi-domain convolutional neural networks for visual tracking//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4293-4302 [DOI: 10.1109/CVPR.2016.465]
Pan F, Zhao L Y and Wang C L. 2023. Smaller and more accurate swin-Transformer model prediction for tracking//Proceedings of the 35th IEEE International Conference on Tools with Artificial Intelligence. Atlanta, USA: IEEE: 955-961 [DOI: 10.1109/ICTAI59109.2023.00143]
Rezatofighi H, Tsoi N, Gwak J Y, Sadeghian A, Reid I and Savarese S. 2019. Generalized intersection over union: a metric and a loss for bounding box regression//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 658-666 [DOI: 10.1109/CVPR.2019.00075]
Song Z K, Yu J Q, Chen Y P P and Yang W. 2022. Transformer tracking with cyclic shifting window attention//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 8781-8790 [DOI: 10.1109/CVPR52688.2022.00859]
Strudel R, Garcia R, Laptev I and Schmid C. 2021. Segmenter: Transformer for semantic segmentation//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 7242-7252 [DOI: 10.1109/ICCV48922.2021.00717]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang J, Yin P, Wang Y Y and Yang W H. 2024. CMAT: integrating convolution mixer and self-attention for visual tracking. IEEE Transactions on Multimedia, 26: 326-338 [DOI: 10.1109/TMM.2023.3264851]
Wang J Y, Hou Z Q, Yu W S, Liao X F and Chen C H. 2018. Fast TLD visual tracking algorithm with kernel correlation filter. Journal of Image and Graphics, 23(11): 1686-1696
王姣尧, 侯志强, 余旺盛, 廖秀峰, 陈传华. 2018. 采用核相关滤波的快速TLD视觉目标跟踪. 中国图象图形学报, 23(11): 1686-1696 [DOI: 10.11834/jig.170655]
Wang N, Zhou W G, Wang J and Li H Q. 2021. Transformer meets tracker: exploiting temporal context for robust visual tracking//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 1571-1580 [DOI: 10.1109/CVPR46437.2021.00162]
Wu H P, Xiao B, Codella N, Liu M C, Dai X Y, Yuan L and Zhang L. 2021. CvT: introducing convolutions to vision Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 22-31 [DOI: 10.1109/ICCV48922.2021.00009]
Wu Y, Lim J and Yang M H. 2015. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9): 1834-1848 [DOI: 10.1109/TPAMI.2014.2388226]
Xie F, Wang C Y, Wang G T, Yang W K and Zeng W J. 2021. Learning tracking representations via dual-branch fully Transformer networks//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops. Montreal, Canada: IEEE: 2688-2697 [DOI: 10.1109/ICCVW54120.2021.00303]
Xiong W, Xiong C Y, Gao Z R, Chen W Q, Zheng R H and Tian J W. 2023. Image super-resolution with channel-attention-embedded Transformer. Journal of Image and Graphics, 28(12): 3744-3757
熊巍, 熊承义, 高志荣, 陈文旗, 郑瑞华, 田金文. 2023. 通道注意力嵌入的Transformer图像超分辨率重构. 中国图象图形学报, 28(12): 3744-3757 [DOI: 10.11834/jig.221033]
Xu Y D, Wang Z Y, Li Z X, Yuan Y and Yu G. 2020. SiamFC++: towards robust and accurate visual tracking with target estimation guidelines//Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 12549-12556 [DOI: 10.1609/aaai.v34i07.6944]
Yan B, Peng H W, Fu J L, Wang D and Lu H C. 2021. Learning spatio-temporal Transformer for visual tracking//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 10428-10437 [DOI: 10.1109/ICCV48922.2021.01028]
Ye B T, Chang H, Ma B P, Shan S G and Chen X L. 2022. Joint feature learning and relation modeling for tracking: a one-stream framework//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 341-357 [DOI: 10.1007/978-3-031-20047-2_20]
Zou Z J, Liu X X, Zhang Y P, Shu L and Hao J. 2023. Know who you are: learning target-aware Transformer for object tracking//Proceedings of 2023 IEEE International Conference on Multimedia and Expo. Brisbane, Australia: IEEE: 1427-1432 [DOI: 10.1109/ICME55011.2023.00247]