3D multi-object tracking based on image and point cloud multi-information perception association
Vol. 29, Issue 1, Pages: 163-178 (2024)
Published: 16 January 2024
DOI: 10.11834/jig.221003
Liu Xiang, Li Hui, Cheng Yuanzhi, Kong Xiangzhen, Chen Shuangmin. 2024. 3D multi-object tracking based on image and point cloud multi-information perception association. Journal of Image and Graphics, 29(01):0163-0178
Objective
3D multi-object tracking is a challenging task in autonomous driving and plays a crucial role in improving the safety and reliability of the perception system. RGB cameras and LiDAR sensors are the most commonly used sensors for this task. RGB cameras provide rich semantic feature information but lack depth information, whereas LiDAR point clouds provide accurate position and geometric information but suffer from uneven density (dense at close range, sparse at long range), disorder, and irregular distribution. Multimodal fusion of images and point clouds can improve multi-object tracking performance, but owing to the complexity of real scenes and the heterogeneity of the two modalities, existing fusion methods are insufficient and cannot obtain rich fused features. In addition, existing methods compute inter-object similarity using only the intersection over union (IoU) or the Euclidean distance between the predicted and detected bounding boxes, which easily causes trajectory fragmentation and identity switching. The adequacy of multimodal fusion and the robustness of data association therefore remain urgent problems. To this end, a 3D multi-object tracking method based on image and point cloud multi-information perception association is proposed.
Method
First, a hybrid soft attention module is proposed to enhance image semantic features; it uses a channel separation technique to improve the information interaction between channel and spatial attention. The module comprises two submodules. The soft channel attention submodule compresses the spatial information of the image features into a channel vector via global average pooling, captures inter-channel correlation with two fully connected layers, applies a sigmoid function to obtain the channel attention map, and multiplies the map with the original features to produce channel-enhanced features. The soft spatial attention submodule then exploits the channel attention map further: guided by the map, a channel separation mechanism divides the channel-enhanced features along the channel axis into an important channel group and a minor channel group (channel order is preserved during separation), each group is enhanced separately by spatial attention, and the two results are summed to yield the final enhanced features. Second, a semantic feature-guided multimodal fusion network is proposed, in which point cloud features, image features, and point-wise image features are fused deeply and adaptively; the network suppresses interference between modalities and exploits the complementarity of point clouds and images to improve tracking of distant small objects and occluded objects. Specifically, the network maps the three feature streams to the same number of channels and concatenates them, models the correlation among the features with a series of convolutional layers, derives an adaptive weight for each modality through a sigmoid function, multiplies each weight with its original features to obtain per-modality attention features, and sums the attention features to obtain the final fused features. Finally, a multiple information perception affinity matrix is constructed, which combines IoU, Euclidean distance, appearance information, and direction similarity for data association, increasing the matching rate between trajectories and detections and improving tracking performance. A Kalman filter first predicts the state of each trajectory in the current frame. The IoU, Euclidean distance, and direction similarity between the detected and predicted bounding boxes are then computed and combined into a position affinity matrix, and the position and appearance affinity matrices are combined by a weighted sum to form the final multiple information perception affinity matrix. Based on this matrix, the Hungarian algorithm completes association matching between objects in adjacent frames. Minimal sketches of these steps follow.
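The channel and spatial attention flow described above can be made concrete with a short PyTorch sketch. This is a minimal illustration, not the authors' implementation: the class name, the reduction ratio, the 7×7 spatial convolutions, the mean/max spatial descriptor, and the mean-threshold grouping criterion are assumptions borrowed from common SE/CBAM practice.

```python
# Minimal PyTorch sketch of the hybrid soft attention module described above.
import torch
import torch.nn as nn

class HybridSoftAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Soft channel attention: global average pooling -> two FC layers
        # (realized as 1x1 convolutions) -> sigmoid.
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # One spatial attention branch per channel group.
        self.spatial_major = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.spatial_minor = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    @staticmethod
    def _descriptor(x: torch.Tensor) -> torch.Tensor:
        # Channel-wise mean and max maps, the usual spatial-attention input.
        return torch.cat([x.mean(dim=1, keepdim=True),
                          x.max(dim=1, keepdim=True).values], dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.channel_attn(x)            # (B, C, 1, 1) channel attention map
        x_ce = x * attn                        # channel-enhanced features
        # Channel separation: split channels into important / minor groups by
        # attention score without reordering channels (assumed mask-based split).
        mask = (attn >= attn.mean(dim=1, keepdim=True)).float()
        major, minor = x_ce * mask, x_ce * (1.0 - mask)
        # Enhance each group with its own spatial attention, then sum.
        major = major * self.spatial_major(self._descriptor(major))
        minor = minor * self.spatial_minor(self._descriptor(minor))
        return major + minor
```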
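The adaptive fusion step admits a similarly compact sketch. Assumptions here: the per-point features are handled as (batch, channels, points) tensors with 1×1 convolutions, the shared channel width c_out = 128 is arbitrary, and the two-layer weight network stands in for the unspecified "series of convolutional layers".

```python
# Minimal PyTorch sketch of the semantic feature-guided adaptive fusion network.
import torch
import torch.nn as nn

class AdaptiveMultimodalFusion(nn.Module):
    def __init__(self, c_point: int, c_image: int, c_pw_image: int, c_out: int = 128):
        super().__init__()
        # Map the three feature streams to the same channel dimension.
        self.proj_point = nn.Conv1d(c_point, c_out, kernel_size=1)
        self.proj_image = nn.Conv1d(c_image, c_out, kernel_size=1)
        self.proj_pw = nn.Conv1d(c_pw_image, c_out, kernel_size=1)
        # Convolutions over the concatenated features model cross-modal
        # correlation; a sigmoid yields one adaptive weight per modality.
        self.weight_net = nn.Sequential(
            nn.Conv1d(3 * c_out, c_out, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(c_out, 3, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_point, f_image, f_pw):
        # Inputs: (B, C_modality, N) per-point feature tensors.
        p = self.proj_point(f_point)
        i = self.proj_image(f_image)
        q = self.proj_pw(f_pw)
        w = self.weight_net(torch.cat([p, i, q], dim=1))   # (B, 3, N)
        w_p, w_i, w_q = w.split(1, dim=1)                  # one weight map per modality
        # Suppress interfering modalities and sum into the fused features.
        return p * w_p + i * w_i + q * w_q
```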
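For the prediction step, a minimal constant-velocity Kalman predict is sketched below. The 10-dimensional state layout and the isotropic process noise are assumptions in the style of common 3D MOT baselines such as AB3DMOT, not details taken from the paper.

```python
# Minimal constant-velocity Kalman prediction for one trajectory.
import numpy as np

def kalman_predict(x: np.ndarray, P: np.ndarray, q: float = 1.0):
    """Predict the state mean x and covariance P one frame ahead.

    Assumed state layout: (cx, cy, cz, yaw, l, w, h, vx, vy, vz).
    """
    dim = x.shape[0]                      # 10 = 7 box parameters + 3 velocities
    F = np.eye(dim)                       # transition: position += velocity
    F[0, 7] = F[1, 8] = F[2, 9] = 1.0     # unit time step
    Q = q * np.eye(dim)                   # assumed isotropic process noise
    return F @ x, F @ P @ F.T + Q
```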
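Finally, a sketch of assembling the multi-information affinity matrix and solving the assignment with the Hungarian algorithm (SciPy's linear_sum_assignment). The blending weights, the distance normalization constant, and the minimum-affinity gate are illustrative assumptions; the paper's exact formulations may differ.

```python
# Minimal sketch of the multiple information perception affinity matrix and
# the Hungarian matching step.
import numpy as np
from scipy.optimize import linear_sum_assignment

def direction_similarity(yaw_det: np.ndarray, yaw_trk: np.ndarray) -> np.ndarray:
    # Cosine of the pairwise heading difference, mapped to [0, 1]: (D, T).
    diff = yaw_det[:, None] - yaw_trk[None, :]
    return 0.5 * (1.0 + np.cos(diff))

def multi_info_affinity(iou, dist, app_sim, dir_sim,
                        w_pos=0.5, w_app=0.5, dist_norm=10.0):
    # Position affinity combines IoU, a normalized center distance, and
    # direction similarity; appearance affinity is blended by a weighted sum.
    # All inputs are (num_detections, num_trajectories) matrices.
    pos = (iou + np.clip(1.0 - dist / dist_norm, 0.0, 1.0) + dir_sim) / 3.0
    return w_pos * pos + w_app * app_sim

def associate(affinity: np.ndarray, min_affinity: float = 0.3):
    # The Hungarian algorithm maximizes total affinity between detections
    # (rows) and predicted trajectories (columns); weak pairs are rejected.
    rows, cols = linear_sum_assignment(-affinity)
    return [(r, c) for r, c in zip(rows, cols) if affinity[r, c] >= min_affinity]
```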
Result
The proposed modules are first validated on the KITTI validation set. The ablation experiments show that each module, namely, hybrid soft attention, semantic feature-guided multimodal fusion, and the multiple information perception affinity matrix, improves the tracking performance of the model to a different degree, which confirms their effectiveness. The full method is then evaluated on the KITTI and NuScenes benchmarks and compared with existing advanced 3D multi-object tracking methods. On the KITTI dataset, the higher order tracking accuracy (HOTA) and multi-object tracking accuracy (MOTA) of the proposed method reach 76.94% and 88.12%, respectively, improvements of 1.48% and 3.49% over the best-performing compared model. On the NuScenes dataset, the average MOTA (AMOTA) and MOTA reach 68.3% and 57.9%, respectively, improvements of 0.6% and 1.1% over the best-performing compared model. The overall performance on both datasets surpasses that of the advanced tracking methods.
Conclusion
The proposed method effectively alleviates the missed detection of occluded objects and distant small objects, object identity switching, and trajectory fragmentation, and it can accurately and stably track multiple objects in complex scenes. Compared with existing competing methods, it achieves more advanced tracking performance and better tracking robustness and is better suited to 3D multi-object tracking in autonomous driving environment perception and intelligent transportation scenarios.
Keywords: point cloud; 3D multi-object tracking; attention; multimodal fusion; data association