线性分解注意力的边缘端高效Transformer跟踪
Efficient Transformer tracking for the edge end with linearly decomposed attention
2025, Vol. 30, No. 2, Pages 485-502
Print publication date: 2025-02-16
DOI: 10.11834/jig.240192
Qiu Miaobo, Gao Jin, Lin Shubo, Li Liang, Wang Gang, Hu Weiming, Wang Yizheng. 2025. Efficient Transformer tracking for the edge end with linearly decomposed attention. Journal of Image and Graphics, 30(02):0485-0502
Objective
Porting tracking algorithms designed for the server end to the edge end significantly reduces power consumption and has high practical value. Current Transformer-based tracking algorithms hold a clear performance advantage, yet they may incur high latency when deployed at the edge. To address this problem, an edge-oriented linearly decomposed attention (LinDA) structure is proposed, which effectively reduces the computational cost and inference latency of the Transformer.
Method
LinDA approximates multihead attention as the sum of a data-dependent part and a data-independent part. The data-dependent part is expressed with simple elementwise multiplication and summation of vectors, avoiding costly transposes and matrix multiplications; the data-independent part directly uses a statistically derived attention matrix plus a learnable bias vector. This decomposition retains both global attention and the advantages of data dependence. To compensate for the accuracy loss introduced by the linear decomposition, a knowledge distillation scheme is also designed, which adds two distillation losses to the original loss function: 1) hard label knowledge distillation, which replaces the ground-truth bounding box with the bounding box predicted by the teacher model as the supervision target; and 2) relation matching knowledge distillation, which takes the relative magnitudes of the teacher model's predicted scores as the supervision target. Based on the LinDA structure, an edge-oriented object tracking algorithm, LinDATrack, is further implemented and deployed on the domestic edge computing host HS240.
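The decomposition described above can be illustrated with a minimal numpy sketch. The exact per-head formulation is given in the paper; the token shapes, the per-token scalar weighting, and the way the statistical matrix and bias are applied below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def linda_attention(q, k, v, A_stat, bias):
    """Hypothetical sketch of linearly decomposed attention (LinDA).

    q, k, v : (N, d) query/key/value token matrices
    A_stat  : (N, N) attention matrix gathered offline from statistics
    bias    : (d,)   learnable bias vector
    """
    # Data-dependent part: elementwise multiply-and-sum over channels,
    # giving one weight per token in O(N*d) -- no QK^T matmul, no transpose.
    w = (q * k).sum(axis=-1, keepdims=True)   # (N, 1)
    data_dep = w * v                          # (N, d)

    # Data-independent part: a fixed, statistically derived attention matrix
    # applied to the values, plus a learnable bias (global context at no
    # data-dependent cost).
    data_indep = A_stat @ v + bias            # (N, d)

    return data_dep + data_indep

rng = np.random.default_rng(0)
N, d = 8, 16
out = linda_attention(rng.normal(size=(N, d)), rng.normal(size=(N, d)),
                      rng.normal(size=(N, d)),
                      np.full((N, N), 1.0 / N), np.zeros(d))
```

In this sketch the data-dependent term costs O(N·d) rather than the O(N²·d) of softmax attention, which is the source of the latency reduction claimed for edge devices.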
Result
Evaluations are conducted on multiple public datasets. Experimental results show that the algorithm reaches a tracking speed of 61.6 frames per second on this computing host with a power consumption of about 79.5 W, only 6.2% of the server-end consumption, while its success rate (SUC) on LaSOT and LaSOT_ext drops by at most about 1.8% relative to the server-end baseline SwinTrack-T.
Conclusion
LinDATrack achieves a good balance between speed and accuracy and offers clear advantages at the edge end.
Objective
The transfer of tracking algorithms designed for server ends to edge ends has high practical value: it leads to a remarkable decrease in energy consumption, particularly where resources are limited. In recent years, tracking algorithms built on the Transformer architecture have achieved considerable progress because of their superior performance. Nonetheless, adapting these algorithms to edge computing often incurs high latency, which is attributed to the Transformer's attention mechanism and its extensive computational demands. This issue is addressed by introducing the linearly decomposed attention (LinDA) module, designed expressly for edge computing. By drastically lowering the computational demands and inference time of the Transformer, the LinDA module enables more effective and efficient tracking at the edge end.
Method
LinDA approximates the multihead attention mechanism as the sum of two components: a data-dependent component and a data-independent component. For the data-dependent component, rather than relying on resource-intensive transposition and matrix multiplication, LinDA uses direct elementwise multiplication and summation of vectors. This markedly reduces computational complexity, making it well suited to edge environments where resources are scarce. For the data-independent component, LinDA uses a statistically derived attention matrix that encodes global context, refined with a learnable bias vector to improve adaptability. This decomposition allows LinDA to achieve good precision and considerable efficiency on resource-constrained devices. To mitigate the accuracy loss caused by the linear decomposition, a knowledge distillation strategy is introduced that adds two specialized distillation losses to the baseline loss function, each designed to convey critical knowledge from the teacher model to the student model. First, hard label knowledge distillation replaces the ground-truth bounding box with the bounding box predicted by the teacher model as the supervision target, so the student learns directly from the teacher's predictions and improves its own predictive precision.
Second, relation matching knowledge distillation uses the relationships among the teacher model's predictions, such as the relative significance of different candidates, as the supervisory target; embedding this relational knowledge in the student during training makes it more robust. Together, the two losses transfer the teacher's expertise to the student and largely recover the precision lost to the linear decomposition. Based on LinDA and this distillation scheme, an edge-end-oriented object tracking algorithm called LinDATrack is implemented and deployed on the domestic edge computing host HS240.
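The two distillation losses can be sketched as follows. The actual tracker combines them with its baseline classification and regression losses; the L1 box loss for the hard-label term and the Pearson-correlation form of the relation-matching term below are illustrative assumptions (the paper may use, for example, a GIoU box loss or a different relation measure).

```python
import numpy as np

def hard_label_kd_loss(student_box, teacher_box):
    """Hard label KD (sketch): supervise the student's predicted box with the
    teacher's predicted box instead of the ground-truth box.
    An L1 box loss is assumed here for simplicity."""
    return np.abs(student_box - teacher_box).mean()

def relation_matching_kd_loss(student_scores, teacher_scores):
    """Relation matching KD (sketch): supervise only the *relative*
    magnitudes of the prediction scores, here via 1 - Pearson correlation,
    which is invariant to the scale and shift of the score maps."""
    s = student_scores - student_scores.mean()
    t = teacher_scores - teacher_scores.mean()
    return 1.0 - (s * t).sum() / (np.linalg.norm(s) * np.linalg.norm(t) + 1e-12)

# A student whose scores rank candidates like the teacher incurs ~zero loss,
# even when the absolute score values differ.
teacher = np.array([0.9, 0.2, 0.6, 0.1])
student = 2.0 * teacher + 0.5   # same ordering, different scale and shift
loss = relation_matching_kd_loss(student, teacher)
```

The scale/shift invariance is the point of matching relations rather than raw scores: the student is free to calibrate its own score range as long as it ranks candidates the way the teacher does.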
Result
Comprehensive experiments are conducted across various public datasets to test the tracker's performance. On the HS240 computing host, LinDATrack achieves a tracking speed of 61.6 frames per second, enabling efficient real-time tracking. Furthermore, the system operates with a power consumption of approximately 79.5 W, only 6.2% of that of the server-end configuration, underscoring its energy efficiency and making it well suited to deployment in resource-limited settings. Despite this efficiency, tracking accuracy remains consistently high: compared with the server-end baseline algorithm SwinTrack-T, the success rate (SUC) on LaSOT and LaSOT_ext decreases by at most approximately 1.8%. This minor decrease reflects the system's ability to balance performance with efficiency, maintaining precise tracking while sharply reducing resource usage, and renders it a versatile solution for a broad spectrum of tracking applications.
Conclusion
LinDATrack is distinguished by its balance of speed and accuracy, enabling reliable real-time tracking. It also demonstrates considerable strengths when deployed at the edge, making it well suited to environments with limited resources. This combination of speed, accuracy, and edge-oriented advantages establishes LinDATrack as a strong option for edge-end tracking tasks.
Bahdanau D, Cho K and Bengio Y. 2016. Neural machine translation by jointly learning to align and translate [EB/OL]. [2024-04-07]. https://arxiv.org/pdf/1409.0473.pdf
Bertinetto L, Valmadre J, Henriques J F, Vedaldi A and Torr P H S. 2016. Fully-convolutional siamese networks for object tracking // Proceedings of 2016 European Conference on Computer Vision (ECCV) Workshops. Amsterdam, the Netherlands: Springer: 850-865 [DOI: 10.1007/978-3-319-48881-3_56]
Bhat G, Danelljan M, van Gool L and Timofte R. 2019. Learning discriminative model prediction for tracking // Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 6181-6190 [DOI: 10.1109/ICCV.2019.00628]
Cai H, Li J Y, Hu M Y, Gan C and Han S. 2023. EfficientViT: lightweight multi-scale attention for high-resolution dense prediction // Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France: IEEE: 17256-17267 [DOI: 10.1109/ICCV51070.2023.01587]
Chen B Y, Li P X, Bai L, Qiao L, Shen Q H, Li B, Gan W H, Wu W and Ouyang W L. 2022a. Backbone is all your need: a simplified architecture for visual object tracking // Proceedings of the 17th European Conference on Computer Vision (ECCV). Tel Aviv, Israel: Springer: 375-392 [DOI: 10.1007/978-3-031-20047-2_22]
Chen D F, Mei J P, Zhang H L, Wang C, Feng Y and Chen C. 2022b. Knowledge distillation with the reused teacher classifier // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 11923-11932 [DOI: 10.1109/CVPR52688.2022.01163]
Chen X, Yan B, Zhu J W, Wang D, Yang X Y and Lu H C. 2021. Transformer tracking // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 8122-8131 [DOI: 10.1109/CVPR46437.2021.00803]
Chen Z L and Shi F H . 2022 . Double template fusion based siamese network for robust visual object tracking . Journal of Image and Graphics , 27 ( 4 ): 1191 - 1203
Child R, Gray S, Radford A and Sutskever I. 2019. Generating long sequences with sparse transformers [EB/OL]. [2024-04-07]. https://arxiv.org/pdf/1904.10509.pdf
Choromanski K, Likhosherstov V, Dohan D, Song X Y, Gane A, Sarlos T, Hawkins P, Davis J, Mohiuddin A, Kaiser L, Belanger D, Colwell L and Weller A. 2022. Rethinking attention with performers [EB/OL]. [2024-04-07]. https://arxiv.org/pdf/2009.14794.pdf
Cui Y T, Jiang C, Wang L M and Wu G S. 2022. MixFormer: end-to-end tracking with iterative mixed attention // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 13598-13608 [DOI: 10.1109/CVPR52688.2022.01324]
Danelljan M, Bhat G, Khan F S and Felsberg M. 2019. ATOM: accurate tracking by overlap maximization // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 4655-4664 [DOI: 10.1109/CVPR.2019.00479]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16x16 words: transformers for image recognition at scale [EB/OL]. [2024-04-07]. https://arxiv.org/pdf/2010.11929.pdf
Fan H, Bai H X, Lin L T, Yang F, Chu P, Deng G, Yu S J, Harshit, Huang M Z, Liu J H, Xu Y, Liao C Y, Yuan L and Ling H B. 2021. LaSOT: a high-quality large-scale single object tracking benchmark. International Journal of Computer Vision, 129(2): 439-461 [DOI: 10.1007/s11263-020-01387-y]
Fan H, Lin L T, Yang F, Chu P, Deng G, Yu S J, Bai H X, Xu Y, Liao C Y and Ling H B. 2019. LaSOT: a high-quality benchmark for large-scale single object tracking // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 5369-5378 [DOI: 10.1109/CVPR.2019.00552]
Graham B, El-Nouby A, Touvron H, Stock P, Joulin A, Jégou H and Douze M. 2021. LeViT: a vision transformer in ConvNet's clothing for faster inference // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 12239-12249 [DOI: 10.1109/ICCV48922.2021.01204]
Hao Z W , Guo J Y , Jia D , Han K , Tang Y H , Zhang C , Hu H and Wang Y H . 2022 . Learning efficient vision transformers via fine-grained manifold distillation // Proceedings of the 36th International Conference on Neural Information Processing Systems . New Orleans, USA : Curran Associates Inc.: 9164 - 9175
Hendrycks D and Gimpel K. 2023. Gaussian error linear units (GELUs) [EB/OL]. [2024-04-07]. https://arxiv.org/pdf/1606.08415.pdf
Hinton G E and Nair V. 2010. Rectified linear units improve restricted Boltzmann machines // Proceedings of the 27th International Conference on Machine Learning. Haifa, Israel: Omnipress: 807-814
Hinton G E, Vinyals O and Dean J. 2015. Distilling the knowledge in a neural network [EB/OL]. [2024-04-07]. https://arxiv.org/pdf/1503.02531.pdf
Huang L H, Zhao X and Huang K Q. 2021. GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5): 1562-1577 [DOI: 10.1109/TPAMI.2019.2957464]
Huang T , You S , Wang F , Qian C and Xu C . 2022 . Knowledge distillation from a stronger teacher // Proceedings of the 36th International Conference on Neural Information Processing Systems . New Orleans, USA : Curran Associates Inc.: 33716 - 33727
Kasai J, Peng H, Zhang Y Z, Yogatama D, Ilharco G, Pappas N, Mao Y, Chen W Z and Smith N A. 2021. Finetuning pretrained transformers into RNNs [EB/OL]. [2024-04-07]. https://arxiv.org/pdf/2103.13076.pdf
Li B, Yan J J, Wu W, Zhu Z and Hu X L. 2018. High performance visual tracking with siamese region proposal network // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8971-8980 [DOI: 10.1109/CVPR.2018.00935]
Li Y Y , Yuan G , Wen Y , Hu J , Evangelidis G , Tulyakov S , Wang Y Z and Ren J . 2022 . EfficientFormer: vision transformers at MobileNet speed // Proceedings of the 36th International Conference on Neural Information Processing Systems . New Orleans, USA : Curran Associates Inc.: 12934 - 12949
Lin L T. 2022. GitHub repository: SwinTrack [CP/OL]. [2024-04-07]. https://github.com/LitingLin/SwinTrack
Lin L T, Fan H, Xu Y and Ling H B. 2021. SwinTrack: a simple and strong baseline for transformer tracking [EB/OL]. [2024-04-07]. https://arxiv.org/pdf/2112.00995.pdf
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin transformer: hierarchical vision transformer using shifted windows // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Liu Z, Mao H Z, Wu C Y, Feichtenhofer C, Darrell T and Xie S N. 2022. A ConvNet for the 2020s // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 11966-11976 [DOI: 10.1109/CVPR52688.2022.01167]
Lu J C , Yao J H , Zhang J G , Zhu X T , Xu H , Gao W G , Xu C J , Xiang T and Zhang L . 2021 . SOFT: softmax-free transformer with linear complexity // Proceedings of the 35th International Conference on Neural Information Processing Systems . Virtual, Online : Curran Associates Inc.: 21297 - 21309
Müller M, Bibi A, Giancola S, Al-Subaihi S and Ghanem B. 2018. TrackingNet: a large-scale dataset and benchmark for object tracking in the wild // Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 310-327 [DOI: 10.1007/978-3-030-01246-5_19]
Rezatofighi H, Tsoi N, Gwak J Y, Sadeghian A, Reid I and Savarese S. 2019. Generalized intersection over union: a metric and a loss for bounding box regression // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 658-666 [DOI: 10.1109/CVPR.2019.00075]
Shang X R , Wen Y L , Xi X F and Hu F Y . 2021 . Target tracking system based on the Siamese guided anchor region proposal network . Journal of Image and Graphics , 26 ( 2 ): 415 - 424
Touvron H , Cord M , Douze M , Massa F , Sablayrolles A and Jégou H . 2021 . Training data-efficient image transformers and distillation through attention // Proceedings of the 38th International Conference on Machine Learning . Virtual Event : PMLR: 10347 - 10357
Vaswani A , Shazeer N , Parmar N , Uszkoreit J , Jones L , Gomez A N , Kaiser Ł and Polosukhin I . 2017 . Attention is all you need // Proceedings of the 31st International Conference on Neural Information Processing Systems . Long Beach, USA : Curran Associates Inc.: 6000 - 6010
Wang F S , Yin S S , He B and Sun F M . 2023 . A Gaussian mask-based correlation filter tracking algorithm . Journal of Image and Graphics , 28 ( 10 ): 3092 - 3106
Wang N, Zhou W G, Wang J and Li H Q. 2021. Transformer meets tracker: exploiting temporal context for robust visual tracking // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 1571-1580 [DOI: 10.1109/CVPR46437.2021.00162]
Wu K, Zhang J N, Peng H W, Liu M C, Xiao B, Fu J L and Yuan L. 2022. TinyViT: fast pretraining distillation for small vision transformers // Proceedings of the 17th European Conference on Computer Vision (ECCV). Tel Aviv, Israel: Springer: 68-85 [DOI: 10.1007/978-3-031-19803-8_5]
Xue W L , Zhang Z B , Pei S L , Zhang K H and Chen S Y . 2024 . Mixing tokens from target and search regions for visual object tracking . Journal of Computer Research and Development , 61 ( 2 ): 460 - 469
Yan B, Peng H W, Fu J L, Wang D and Lu H C. 2021a. Learning spatio-temporal transformer for visual tracking // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 10428-10437 [DOI: 10.1109/ICCV48922.2021.01028]
Yan B, Peng H W, Wu K, Wang D, Fu J L and Lu H C. 2021b. LightTrack: finding lightweight neural networks for object tracking via one-shot architecture search // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 15175-15184 [DOI: 10.1109/CVPR46437.2021.01493]
Ye B T, Chang H, Ma B P, Shan S G and Chen X L. 2022. Joint feature learning and relation modeling for tracking: a one-stream framework // Proceedings of the 17th European Conference on Computer Vision (ECCV). Tel Aviv, Israel: Springer: 341-357 [DOI: 10.1007/978-3-031-20047-2_20]
Zhang H Y, Wang Y, Dayoub F and Sünderhauf N. 2021. VarifocalNet: an IoU-aware dense object detector // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 8510-8519 [DOI: 10.1109/CVPR46437.2021.00841]
Zhao G X, Lin J Y, Zhang Z Y, Ren X C, Su Q and Sun X. 2019. Explicit sparse transformer: concentrated attention through explicit selection [EB/OL]. [2024-04-07]. https://arxiv.org/pdf/1912.11637.pdf
Zhu Z, Wang Q, Li B, Wu W, Yan J J and Hu W M. 2018. Distractor-aware siamese networks for visual object tracking // Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 103-119 [DOI: 10.1007/978-3-030-01240-3_7]